Good morning,
I have a gtf file with a genome annotation (source: AUGUSTUS). The only features in it are genes, transcripts, stop_codons, CDSs and start_codons. But I need to add lines for exons, because I want to run hisat2_extract_splice_sites.py and hisat2_extract_exons.py, which produce the files used by hisat2-build.
Could you please tell me what to do?
Thanks in advance, Poecile
Could you please try converting the gtf file with the convert_ensembl command of the gtftk tool? I expect the resulting gtf to have the required exon entries.
Thank you very much for your reply!
I will try to install this tool, but I am confused by the logic of adding exons when no UTRs are annotated. I described it in the comment below. Could you explain it, if it's not too much trouble?
It seems to me that expecting "to divide the gap between CDS into UTRs" is the problem here.
There may be exceptions, but generally speaking, for a eukaryote:
Expect a UTR only at the beginning of the first exon of a transcript.
Expect another UTR only at the end of the last exon of a transcript.
Let's call the region between any two consecutive exons of a transcript an intron.
The first exon of a transcript would have UTR+CDS.
The last exon of a transcript would have CDS+UTR.
The other exons would be CDS alone.
Also, do not expect all of these (UTR, CDS, exon, intron) to be present in a gtf. However, at a minimum, exon or CDS entries should be there.
Now read the explanation Dr. Dainat (@Juke34) has given below on how tools like AGAT arrive at the coordinates of the other gene features.
I would suggest doing the gtf conversion first, so that you have a gtf with the required exon entries, and then inspecting the added exon coordinates in light of these explanations. That should be more helpful.
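To make the logic above a bit more concrete, here is a minimal Python sketch. It is only my own illustration, not how gtftk or AGAT actually work. It assumes that every CDS and stop_codon line carries a transcript_id attribute, and that with no UTRs annotated each exon is simply a CDS block (joined with the stop codon when the two touch or overlap). The function names merge_intervals and exon_lines_from_cds are made up for this example.

```python
import re
from collections import OrderedDict


def merge_intervals(intervals):
    """Merge overlapping or directly adjacent 1-based closed intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Touches or overlaps the previous block: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def exon_lines_from_cds(gtf_lines):
    """Build exon lines from the CDS and stop_codon lines of each transcript.

    Assumption: no UTRs are annotated, so exon == CDS (+ stop codon).
    """
    transcripts = OrderedDict()  # transcript_id -> template fields + intervals
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] not in ("CDS", "stop_codon"):
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue  # cannot assign this line to a transcript
        record = transcripts.setdefault(
            match.group(1), {"template": fields, "intervals": []}
        )
        record["intervals"].append((int(fields[3]), int(fields[4])))

    exon_lines = []
    for record in transcripts.values():
        seqname, source, _, _, _, _, strand, _, attrs = record["template"]
        for start, end in merge_intervals(record["intervals"]):
            # Exons have no score and no frame, hence the two "." columns.
            exon_lines.append(
                "\t".join(
                    [seqname, source, "exon", str(start), str(end),
                     ".", strand, ".", attrs]
                )
            )
    return exon_lines
```

You could feed it the lines of your gtf (for example from open(...) on your own file) and compare the exon lines it returns with the ones the conversion tool adds. If they match, the tool is just treating each CDS block as an exon.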
I can’t thank you enough,
Now all is clear!
Let me ask 2 more questions, please.
The only thing that confuses me in my gtf file is that the coordinates of the beginning of the gene* coincide with the coordinates of the beginning of the start codon, and the coordinates of the end of the gene coincide with the coordinates of the end of the stop codon. For example,
Where am I supposed to cram the UTRs here?
*In my gtf there are "." in place of the transcript coordinates, and I planned to replace them with the coordinates of the gene. Could you tell me whether this is correct? It seems to me that it is not, because the gene includes the regulatory region! But in another, arbitrary gtf that I downloaded, the coordinates of the gene repeat the coordinates of the transcript.
More details would help.
If the gtf is from a publicly accessible source, please share a link to it.
Otherwise, please mention the organism and provide the gtf snippet corresponding to the feature and coordinates mentioned above.