Question

GTF format feature definitions

4.5 years ago

robert.murphy ▴ 110

I am trying to analyse some GTF annotation files from a Braker2 run, but I do not fully understand the definitions of the features in the feature column. I know what each of them is individually outside of GTF, but when looking at the files I am getting confused.

For example, I see a feature labelled gene that has a length of only 43 bp:

PseudaA_3172    AUGUSTUS    intron  1   7   0.77    -   .   transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";
PseudaA_3172    AUGUSTUS    CDS 8   43  0.42    -   0   transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";
PseudaA_3172    AUGUSTUS    gene    1   43  0.42    -   .   g18847
PseudaA_3172    AUGUSTUS    transcript  1   43  0.42    -   .   g18847.t1
PseudaA_3172    AUGUSTUS    exon    8   43  .   -   .   transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";
PseudaA_3172    AUGUSTUS    start_codon 41  43  .   -   0   transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";

Where can I find the exact definitions of each feature, and why am I seeing gene lengths that are this short?

My understanding of the GTF format was that it is hierarchical, so every CDS/intron/exon etc. will be contained within the span of a parent gene feature, which is what we see here, but this must be wrong?

annotation • 1.8k views

ADD COMMENT • link updated 4.5 years ago by i.sudbery 22k • written 4.5 years ago by robert.murphy ▴ 110

score 2 · Answer 1 · 2021-03-23

4.5 years ago

i.sudbery 22k

Your understanding is correct. This GTF fragment you have posted does indeed represent a 43bp long gene. These genes come from AUGUSTUS, which is a gene prediction program, so I'd guess this is probably a false positive.
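As a quick sanity check of both points, here is a minimal Python sketch that recomputes the feature lengths and tests the nesting for the lines in the question. It assumes tab-separated input with 1-based, inclusive coordinates; the helper names (`parse_gtf_line`, `feature_length`, `all_nested`) are made up for illustration and are not part of any library.

```python
import re

def parse_gtf_line(line):
    """Parse one tab-separated GTF line; coordinates are 1-based and inclusive."""
    cols = line.rstrip("\n").split("\t", 8)
    # AUGUSTUS puts a bare ID in column 9 of gene/transcript lines,
    # and key "value"; pairs on the other feature lines.
    match = re.search(r'gene_id "([^"]+)"', cols[8])
    return {
        "feature": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "strand": cols[6],
        "gene_id": match.group(1) if match else cols[8].strip(),
    }

def feature_length(rec):
    """Length in bp; +1 because both start and end are included."""
    return rec["end"] - rec["start"] + 1

def all_nested(parent, children):
    """True if every child lies inside the parent's span."""
    return all(parent["start"] <= c["start"] and c["end"] <= parent["end"]
               for c in children)

# The six lines from the question, rebuilt with tabs.
ATTRS = 'transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";'
ROWS = [
    ("intron", 1, 7, "0.77", ".", ATTRS),
    ("CDS", 8, 43, "0.42", "0", ATTRS),
    ("gene", 1, 43, "0.42", ".", "g18847"),
    ("transcript", 1, 43, "0.42", ".", "g18847.t1"),
    ("exon", 8, 43, ".", ".", ATTRS),
    ("start_codon", 41, 43, ".", "0", ATTRS),
]
LINES = ["\t".join(["PseudaA_3172", "AUGUSTUS", feat, str(s), str(e), score, "-", frame, attrs])
         for feat, s, e, score, frame, attrs in ROWS]

records = [parse_gtf_line(line) for line in LINES]
gene = next(r for r in records if r["feature"] == "gene")
children = [r for r in records if r["feature"] != "gene"]

print(feature_length(gene))        # 43
print(all_nested(gene, children))  # True
```

So the file is internally consistent: the gene really spans 43 bp and every other feature sits inside it.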

Unfortunately there isn't a formal definition of the GTF format (and as such it isn't a "standard"). This goes doubly for the features column, which people use pretty much however they see fit.

ADD COMMENT • link 4.5 years ago by i.sudbery 22k

Thank you for this; it would make sense that it is a false positive. It is a shame it is not standardised!

Do you have any advice on "thresholds" for gene length cut-offs? This will of course be very species dependent but how would I begin to estimate this?

ADD REPLY • link 4.5 years ago by robert.murphy ▴ 110

I don't know what the standard is now, but in the old days you would get rid of anything where the CDS was less than 30 amino acids (or 90bp). That might have been for prokaryotes as well.

I would also discard anything that didn't have a complete start and stop codon and at least some 5' and 3' UTR.
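A rough sketch of how that rule of thumb could be applied in Python, assuming tab-separated GTF and that `start_codon` / `stop_codon` features are present in the file. The function names and the `MIN_CDS_BP` constant are illustrative only. The UTR check is left out because UTR feature names vary between tools, and whether the stop codon is counted inside CDS also differs, so treat the threshold as approximate.

```python
import re
from collections import defaultdict

MIN_CDS_BP = 90  # 30 amino acids, the rule of thumb above (an assumption, tune per species)

def summarise_transcripts(lines):
    """Return total CDS length and the set of feature types per transcript_id."""
    cds_bp = defaultdict(int)
    features = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t", 8)
        if len(cols) < 9:
            continue
        match = re.search(r'transcript_id "([^"]+)"', cols[8])
        if match is None:
            continue  # e.g. AUGUSTUS gene/transcript lines with a bare ID
        tid = match.group(1)
        features[tid].add(cols[2])
        if cols[2] == "CDS":
            # 1-based inclusive coordinates, hence the +1
            cds_bp[tid] += int(cols[4]) - int(cols[3]) + 1
    return cds_bp, features

def passes_filter(tid, cds_bp, features, min_cds_bp=MIN_CDS_BP):
    """Keep a transcript only if it has both codons and a long enough CDS."""
    has_both_codons = {"start_codon", "stop_codon"} <= features[tid]
    return has_both_codons and cds_bp[tid] >= min_cds_bp

# The transcript from the question, rebuilt with tabs.
ATTRS = 'transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";'
ROWS = [
    ("intron", 1, 7, "0.77", "."),
    ("CDS", 8, 43, "0.42", "0"),
    ("exon", 8, 43, ".", "."),
    ("start_codon", 41, 43, ".", "0"),
]
LINES = ["\t".join(["PseudaA_3172", "AUGUSTUS", feat, str(s), str(e), score, "-", frame, ATTRS])
         for feat, s, e, score, frame in ROWS]

cds_bp, features = summarise_transcripts(LINES)
tid = "file_1_file_1_g18847.t1"
print(cds_bp[tid], passes_filter(tid, cds_bp, features))  # 36 False
```

The example transcript fails on both counts: its CDS is only 36 bp and it has no stop codon.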

ADD REPLY • link 4.5 years ago by i.sudbery 22k

score 0 · Answer 2 · 2021-03-23

There are several standards for the GTF format; see here

But they are more about the syntax. There is no specification for things such as intron length or gene length. AUGUSTUS can predict partial genes, and that is the case here: you only have the first exon (note the intron running right up to the beginning of the sequence, on the minus strand!), so the real gene might be longer.
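One way to flag such partial predictions, sketched in Python under the assumption that the GTF is tab-separated and that a missing `start_codon` or `stop_codon` feature means the model is incomplete at that end. The function name `partial_status` is hypothetical. Only the start of the contig is checked, because detecting a feature that runs to the contig end would need the contig lengths, which are not in the GTF.

```python
import re

def partial_status(lines):
    """Per transcript: which codon features exist and whether any feature starts at base 1."""
    status = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t", 8)
        if len(cols) < 9:
            continue
        match = re.search(r'transcript_id "([^"]+)"', cols[8])
        if match is None:
            continue  # gene/transcript lines with a bare ID are skipped
        entry = status.setdefault(match.group(1), {
            "start_codon": False,
            "stop_codon": False,
            "touches_contig_start": False,
        })
        if cols[2] in ("start_codon", "stop_codon"):
            entry[cols[2]] = True
        if int(cols[3]) == 1:
            entry["touches_contig_start"] = True
    return status

# The transcript from the question, rebuilt with tabs.
ATTRS = 'transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";'
ROWS = [
    ("intron", 1, 7, "0.77", "."),
    ("CDS", 8, 43, "0.42", "0"),
    ("exon", 8, 43, ".", "."),
    ("start_codon", 41, 43, ".", "0"),
]
LINES = ["\t".join(["PseudaA_3172", "AUGUSTUS", feat, str(s), str(e), score, "-", frame, ATTRS])
         for feat, s, e, score, frame in ROWS]

status = partial_status(LINES)
print(status["file_1_file_1_g18847.t1"])
# {'start_codon': True, 'stop_codon': False, 'touches_contig_start': True}
```

For this transcript there is a start codon but no stop codon, and the intron begins at base 1, which fits a gene on the minus strand that continues past the start of the contig.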