I am trying to analyse some GTF annotation files from a Braker2 run, but I do not fully understand the definitions of the features in the feature column. I know what each of them is individually outside of GTF, but I get confused when looking at them within the files.
For example, I see a feature labelled gene that is only 43 bp long (coordinates 1 to 43):
PseudaA_3172 AUGUSTUS intron 1 7 0.77 - . transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";
PseudaA_3172 AUGUSTUS CDS 8 43 0.42 - 0 transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";
PseudaA_3172 AUGUSTUS gene 1 43 0.42 - . g18847
PseudaA_3172 AUGUSTUS transcript 1 43 0.42 - . g18847.t1
PseudaA_3172 AUGUSTUS exon 8 43 . - . transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";
PseudaA_3172 AUGUSTUS start_codon 41 43 . - 0 transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";
Where can I find the exact definitions of each feature, and why am I seeing genes that are supposedly this short?
My understanding of the GTF format was that it is hierarchical, so every CDS/intron/exon etc. is contained within the span of a parent gene feature, which is what we see here, but this must be wrong?
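To make the containment question concrete, here is a minimal Python sketch that parses the records from the excerpt, computes each feature's length as end - start + 1 (GTF coordinates are 1-based and inclusive), and checks that every non-gene feature falls inside a gene span on the same sequence and strand. The function names (`parse_gtf_line`, `feature_length`, `children_outside_gene`) are made up for illustration, and the check is purely coordinate-based. It does not match IDs, because in this output the gene line carries a bare `g18847` while the child features use `file_1_file_1_g18847`.

```python
EXAMPLE_LINES = [
    'PseudaA_3172 AUGUSTUS intron 1 7 0.77 - . transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";',
    'PseudaA_3172 AUGUSTUS CDS 8 43 0.42 - 0 transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";',
    'PseudaA_3172 AUGUSTUS gene 1 43 0.42 - . g18847',
    'PseudaA_3172 AUGUSTUS transcript 1 43 0.42 - . g18847.t1',
    'PseudaA_3172 AUGUSTUS exon 8 43 . - . transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";',
    'PseudaA_3172 AUGUSTUS start_codon 41 43 . - 0 transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";',
]


def parse_gtf_line(line):
    """Split one GTF line into its nine columns.

    Real GTF is tab-separated; the fallback to whitespace splitting is an
    assumption so that pasted excerpts like the one above also parse.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9:
        cols = line.split(None, 8)
    seqname, _source, feature, start, end, _score, strand, _frame, attrs = cols[:9]
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attrs.strip(),
    }


def feature_length(rec):
    """Length in bp; GTF start and end are both included."""
    return rec["end"] - rec["start"] + 1


def children_outside_gene(records):
    """Return non-gene records not contained in any gene span
    on the same sequence and strand (coordinate check only)."""
    genes = [r for r in records if r["feature"] == "gene"]
    outside = []
    for rec in records:
        if rec["feature"] == "gene":
            continue
        contained = any(
            g["seqname"] == rec["seqname"]
            and g["strand"] == rec["strand"]
            and g["start"] <= rec["start"]
            and rec["end"] <= g["end"]
            for g in genes
        )
        if not contained:
            outside.append(rec)
    return outside


records = [parse_gtf_line(line) for line in EXAMPLE_LINES]
for rec in records:
    print(rec["feature"], feature_length(rec))
print("features outside a gene span:", children_outside_gene(records))
```

For the excerpt above, the gene spans 43 bp, the CDS 36 bp, and every child feature sits inside the gene's coordinates, so the hierarchy itself looks consistent; the gene is simply very short.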
Thank you for this; it would make sense that it is a false positive. It is a shame it is not standardised!
Do you have any advice on "thresholds" for gene length cut-offs? This will of course be very species-dependent, but how would I begin to estimate this?
I don't know what the standard is now, but in the old days you would get rid of anything where the CDS was less than 30 amino acids (or 90 bp). That might have been for prokaryotes as well.
I would also discard anything that didn't have a complete start and stop codon and at least some 5' and 3' UTR.
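As a rough illustration of that kind of filter, the following Python sketch sums the CDS length per transcript and keeps only transcripts with at least 90 bp of CDS and both a `start_codon` and a `stop_codon` feature. The names `summarise_transcripts`, `passes_filter` and `MIN_CDS_BP` are hypothetical, the 90 bp cut-off is just the rule of thumb above rather than a recommended value, and the UTR criterion is left out because UTR features may only be present if the prediction was run with UTR output.

```python
import re
from collections import defaultdict

MIN_CDS_BP = 90  # 30 amino acids; a rule of thumb, not a standard

EXAMPLE_LINES = [
    'PseudaA_3172 AUGUSTUS intron 1 7 0.77 - . transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";',
    'PseudaA_3172 AUGUSTUS CDS 8 43 0.42 - 0 transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";',
    'PseudaA_3172 AUGUSTUS gene 1 43 0.42 - . g18847',
    'PseudaA_3172 AUGUSTUS transcript 1 43 0.42 - . g18847.t1',
    'PseudaA_3172 AUGUSTUS exon 8 43 . - . transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";',
    'PseudaA_3172 AUGUSTUS start_codon 41 43 . - 0 transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";',
]


def summarise_transcripts(gtf_lines):
    """Return {transcript_id: {"cds_bp": int, "features": set}}."""
    summary = defaultdict(lambda: {"cds_bp": 0, "features": set()})
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            # Assumption: pasted excerpts may have lost their tabs.
            cols = line.split(None, 8)
        feature, start, end, attrs = cols[2], int(cols[3]), int(cols[4]), cols[8]
        match = re.search(r'transcript_id "([^"]+)"', attrs)
        if match is None:
            continue  # gene/transcript lines here carry only a bare ID
        entry = summary[match.group(1)]
        entry["features"].add(feature)
        if feature == "CDS":
            entry["cds_bp"] += end - start + 1  # 1-based, inclusive
    return dict(summary)


def passes_filter(entry, min_cds_bp=MIN_CDS_BP):
    """Keep transcripts with enough CDS and both codon features present."""
    complete = {"start_codon", "stop_codon"} <= entry["features"]
    return entry["cds_bp"] >= min_cds_bp and complete


summary = summarise_transcripts(EXAMPLE_LINES)
for transcript_id, entry in summary.items():
    print(transcript_id, entry["cds_bp"], passes_filter(entry))
```

The transcript in the excerpt has 36 bp of CDS and no `stop_codon` line, so it fails on both counts. To pick a species-appropriate cut-off, one option is to plot the distribution of `cds_bp` across all predicted transcripts before choosing a threshold.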