I'm a little confused by the meaning of the second column in Ensembl's GTF annotation sets. According to the README and the online documentation, the second column is supposed to be the source of annotation (e.g. "havana"). However, when I actually look at the release 75 GTF (ftp directory), it looks like this:
#!genome-build GRCh37.p13
#!genome-version GRCh37
#!genome-date 2009-02
#!genome-build-accession NCBI:GCA_000001405.14
#!genebuild-last-updated 2013-09
1 pseudogene gene 11869 14412 . + . gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene";
1 processed_transcript transcript 11869 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";
Notice how the second column actually seems to contain the transcript_biotype (which is missing from the attributes), and the gene_source is in the attributes? Is this a bug in their GTF generation? Is the documentation for some older version of GTF which is no longer supposed to be used?
I would guess that somewhere along the line the Ensembl people decided to use the second column to store the biotype rather than the source. I don't think it is a bug, or anything to do with an old versus new GTF format.
Wouldn't changing the meaning of the columns be a change in the GTF format? Otherwise, if the columns can mean arbitrary things, how is it a format at all?
I would still regard it as a format, just one with a loose structure: some columns follow strict definitions and some do not. Every column was created to hold specific information when the GFF format was designed. Later some of these columns became less useful, but because the format was so widely adopted, people thought it would not be a good idea to simply remove them. The first (chromosome), third (genic feature), fourth (start), fifth (end) and seventh (strand) columns have strict definitions and should contain the same kind of information regardless of where the GTF file comes from. I guess the second column, the source column, was used by people in the beginning, but now it is not that important. Most current programs that read GTF files use the chromosome, start, end and strand information to extract the positions of a genic feature. The third column and the attributes in the ninth column are used to build the hierarchy that relates exons to transcripts and transcripts to genes. I think most tools, such as snpEff (variant annotation) or RNA-seq count and RPKM generators, depend only on the columns that follow strict definitions. By contrast, the sixth column, which was meant to be a score column, is mostly unused now and usually contains ".". You can store pretty much any numeric information there.
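To make the column roles concrete, here is a minimal Python sketch of how a tool might read these records: it takes the positional columns as they are, treats the second column as an opaque string, and links transcripts to genes through the ninth-column attributes. The function names `parse_gtf_line` and `transcripts_by_gene` are made up for illustration, and the sketch assumes the fields are tab-separated as in the real file (the excerpt in the question shows them with spaces). It is not how any particular tool is implemented.

```python
import re

# The two records quoted in the question, written with tabs between the
# nine columns (an assumption about the real file; the post shows spaces).
GTF_LINES = [
    '1\tpseudogene\tgene\t11869\t14412\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; gene_name "DDX11L1"; '
    'gene_source "ensembl_havana"; gene_biotype "pseudogene";',
    '1\tprocessed_transcript\ttranscript\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
    'gene_name "DDX11L1"; gene_source "ensembl_havana"; '
    'gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; '
    'transcript_source "havana";',
]

# Matches one `key "value";` pair in the ninth column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Split one GTF record into named fields; return None for '#' lines."""
    if line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqname, source, feature, start, end, score, strand, frame, attrs = cols
    return {
        "seqname": seqname,
        "source": source,      # column 2: holds a biotype in this release
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": score,        # column 6: "." here
        "strand": strand,
        "frame": frame,
        "attributes": dict(ATTR_RE.findall(attrs)),
    }


def transcripts_by_gene(lines):
    """Group transcript IDs under their gene_id using column 3 and column 9."""
    genes = {}
    for line in lines:
        rec = parse_gtf_line(line)
        if rec is None:
            continue
        transcripts = genes.setdefault(rec["attributes"]["gene_id"], [])
        if rec["feature"] == "transcript":
            transcripts.append(rec["attributes"]["transcript_id"])
    return genes


if __name__ == "__main__":
    rec = parse_gtf_line(GTF_LINES[1])
    print(rec["source"], rec["attributes"]["transcript_source"])
    print(transcripts_by_gene(GTF_LINES))
```

For the transcript record, column 2 comes back as `processed_transcript` while the actual source, `havana`, is only available from the `transcript_source` attribute. Nothing in the grouping step looks at column 2 at all, which is why a change in its meaning goes unnoticed by tools written this way.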