Second column in Ensembl GTF: source or biotype?
10.0 years ago

I'm a little confused by the meaning of the second column in Ensembl's GTF annotation sets. According to the README and the online documentation, the second column is supposed to be the source of annotation (e.g. "havana"). However, when I actually look at the release 75 GTF (ftp directory), it looks like this:

#!genome-build GRCh37.p13
#!genome-version GRCh37
#!genome-date 2009-02
#!genome-build-accession NCBI:GCA_000001405.14
#!genebuild-last-updated 2013-09
1    pseudogene    gene    11869    14412    .    +    .    gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene";
1    processed_transcript    transcript    11869    14409    .    +    .    gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";

Notice how the second column actually seems to contain the transcript_biotype (which is missing from the attributes), and the gene_source is in the attributes? Is this a bug in their GTF generation? Is the documentation for some older version of GTF which is no longer supposed to be used?
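For anyone who wants to check this on their own copy of the file, here is a minimal sketch in Python that splits the transcript line quoted above into its nine columns and parses the attributes. The helper names (`parse_attributes`, `parse_gtf_line`) and the column labels are my own choices for illustration and not part of any Ensembl tool. It also assumes the real file is tab-separated, whereas the excerpt above is shown with spaces.

```python
import re

# Conventional labels for the nine GTF columns (illustrative naming).
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]


def parse_attributes(text):
    """Turn 'key "value"; key "value";' into a dict."""
    return dict(re.findall(r'(\S+) "([^"]*)";', text))


def parse_gtf_line(line):
    """Split one tab-separated GTF data line into a dict of named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["attributes"] = parse_attributes(record["attributes"])
    return record


# The release 75 transcript line quoted above, rebuilt with explicit tabs.
line = "\t".join([
    "1", "processed_transcript", "transcript", "11869", "14409", ".", "+", ".",
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
    'gene_name "DDX11L1"; gene_source "ensembl_havana"; '
    'gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; '
    'transcript_source "havana";',
])

record = parse_gtf_line(line)
print(record["source"])                                # value of column 2
print(record["attributes"].get("transcript_source"))   # source, from column 9
print("transcript_biotype" in record["attributes"])    # is the biotype in column 9?
```

On this line, column 2 holds `processed_transcript`, while `havana` only shows up as `transcript_source` in the attributes and there is no `transcript_biotype` key at all.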


I would guess that somewhere along the line the Ensembl team decided to use the second column to store the biotype rather than the source. I don't think it is a bug, or anything to do with an old versus new GTF format.


Wouldn't changing the meaning of the columns be a change in the GTF format? Otherwise, if the columns can mean arbitrary things, how is it a format at all?


I would still regard it as a format, just a loosely structured one in which some columns follow strict definitions and others do not. Every column had a specific purpose when the GFF format was created. Some of them later became less useful, but because the format was already so widely adopted, people thought it would not be a good idea to simply remove them.

The first (chromosome), third (feature type), fourth (start), fifth (end) and seventh (strand) columns have strict definitions and should contain the same kind of information regardless of where the GTF file comes from. I guess the second (source) column was used by people in the beginning, but now it is not that important. Most current programs that read GTF files use the chromosome, start, end and strand to locate each feature, and use the third column together with the attributes in the ninth column to build the hierarchy that relates exons to transcripts and transcripts to genes. I think most tools, such as snpEff (variant annotation) or RNA-seq count and RPKM generators, depend only on the columns that follow strict definitions.

The sixth column, on the other hand, was meant to hold a score but is mostly unused now and typically contains ".". You can store pretty much any numeric value there.
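To make that concrete, here is a rough sketch in Python of how a reader can build the gene, transcript and exon hierarchy from the feature type and the attributes alone. It is not how snpEff or any particular tool does it. The function names and the toy records (gene `G1`, transcript `T1`, made-up coordinates) are hypothetical and only there for illustration.

```python
import re
from collections import defaultdict


def build_hierarchy(lines):
    """Map gene_id -> transcript_id -> list of (seqname, start, end, strand) exons."""
    genes = defaultdict(lambda: defaultdict(list))
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines such as '#!genome-build'
        # Columns 2, 6 and 8 are unpacked but deliberately never used.
        seqname, _source, feature, start, end, _score, strand, _frame, attrs = (
            line.rstrip("\n").split("\t"))
        attributes = dict(re.findall(r'(\S+) "([^"]*)";', attrs))
        gene_id = attributes["gene_id"]
        if feature == "gene":
            genes[gene_id]  # touching the defaultdict registers the gene
        elif feature == "transcript":
            genes[gene_id][attributes["transcript_id"]]  # registers the transcript
        elif feature == "exon":
            genes[gene_id][attributes["transcript_id"]].append(
                (seqname, int(start), int(end), strand))
    return {gene: dict(transcripts) for gene, transcripts in genes.items()}


def toy_lines(second_column):
    """Hypothetical toy records; only the second column varies between calls."""
    rows = [
        ("gene", 100, 900, 'gene_id "G1";'),
        ("transcript", 100, 900, 'gene_id "G1"; transcript_id "T1";'),
        ("exon", 100, 300, 'gene_id "G1"; transcript_id "T1";'),
        ("exon", 600, 900, 'gene_id "G1"; transcript_id "T1";'),
    ]
    return ["\t".join(["1", second_column, feature, str(start), str(end),
                       ".", "+", ".", attrs])
            for feature, start, end, attrs in rows]


with_source = build_hierarchy(toy_lines("havana"))
with_biotype = build_hierarchy(toy_lines("processed_transcript"))
print(with_source == with_biotype)
print(with_source)
```

Because the second column is never read, the result is identical whether it holds a source or a biotype, which is why tools of this kind are not affected by what Ensembl puts there.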

10.0 years ago
Denise CS ★ 5.2k

You are right: there have been inconsistencies between the GTF file and the documentation. The second column displayed either the source or the biotype, whereas the documentation has always described the second column as the source. From release 77 onwards the inconsistency is no longer in place: the second column is always the source, which is in accordance with the GENCODE GTF format. Apologies for the confusion.
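If a script has to read files from both before and after release 77, one possible approach is to look for the biotype in the attributes first and only fall back to the second column when it is absent. The Python sketch below is an illustration under assumptions, not an official recipe: the function name is made up, it assumes that files from release 77 onwards carry a `transcript_biotype` attribute, and the "newer" example values are hypothetical rather than copied from a real file.

```python
def transcript_biotype(second_column, attributes):
    """Best-effort biotype lookup for a transcript line (hypothetical helper).

    Assumption: files from release 77 onwards carry a 'transcript_biotype'
    attribute; files without it are treated as older releases in which
    column 2 may hold the biotype instead.
    """
    if "transcript_biotype" in attributes:
        return attributes["transcript_biotype"]
    return second_column


# Release 75 style, using the values from the line quoted in the question.
old_style = transcript_biotype(
    "processed_transcript",
    {"transcript_id": "ENST00000456328", "transcript_source": "havana"},
)

# Hypothetical newer-style line: column 2 is the source, biotype is an attribute.
new_style = transcript_biotype(
    "havana",
    {"transcript_id": "ENST00000456328",
     "transcript_biotype": "processed_transcript"},
)

print(old_style, new_style)
```

The fallback is only a guess for older files, so it is worth checking a few lines of the specific release by eye before relying on it.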
