gtf file without exons
2.3 years ago
poecile.pal ▴ 50

Good morning,

I have a gtf file with a genome annotation (source: AUGUSTUS). The only features in it are genes, transcripts, CDSs, start_codons and stop_codons. I need to add exon lines, because I want to run hisat2_extract_splice_sites.py and hisat2_extract_exons.py, whose output files are used by hisat2-build. Could you please tell me how to do this?

Thanks in advance, Poecile

gtf exons annotation augustus hisat2 • 2.7k views

Could you please try converting the gtf file with the convert_ensembl command of the gtftk tool? I expect the resulting gtf to contain the required exon entries.


Thank you very much for your reply!

I will try to install this tool, but I am confused by the logic of adding exons when no UTRs are annotated. I described this in the comment below. Could you explain it, if it is not too much trouble?


It seems to me that the expectation "to divide the gap between CDS into UTRs" is the problem here.

There may be exceptions, but generally speaking, for a eukaryote:

Expect a UTR only at the beginning of the first exon of a transcript.

Expect another UTR only at the end of the last exon of a transcript.

Let's call the region between any two consecutive exons of a transcript an intron.

The first exon of a transcript would contain UTR + CDS.

The last exon of a transcript would contain CDS + UTR.

The other exons would be CDS alone.

Also, do not expect all of these features (UTR, CDS, exon, intron) to be present in a gtf. At a minimum, however, exon or CDS entries should be there.

Now read the explanation Dr. Dainat (@Juke34) has given below on how tools like AGAT arrive at the coordinates of the other gene features.

I would suggest doing a gtf conversion first, to get a gtf with the required exon entries, and then inspecting the added exon coordinates in light of these explanations. That should be more helpful.
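The exon / CDS / UTR / intron relationship described above can be sketched in a few lines of Python. This is only an illustration of the coordinate arithmetic, not how gtftk or AGAT is implemented. The function names are hypothetical, the CDS coordinates 200-230 and 270-300 are the illustrative ones used elsewhere in this thread, and the exon bounds 150 and 340 are made up for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive coordinates, as in gtf


def introns_between(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons of one transcript."""
    ordered = sorted(exons)
    return [(left[1] + 1, right[0] - 1)
            for left, right in zip(ordered, ordered[1:])
            if right[0] - left[1] > 1]


def utrs_from(exons: List[Interval], cds: List[Interval]) -> List[Interval]:
    """Return the exon parts lying outside the coding span.

    Strand is ignored here, so the result does not say which UTR
    is the 5' one and which is the 3' one.
    """
    cds_start = min(start for start, _ in cds)
    cds_end = max(end for _, end in cds)
    utrs = []
    for start, end in sorted(exons):
        if start < cds_start:  # exon sticks out before the first CDS base
            utrs.append((start, min(end, cds_start - 1)))
        if end > cds_end:      # exon sticks out after the last CDS base
            utrs.append((max(start, cds_end + 1), end))
    return utrs


cds = [(200, 230), (270, 300)]

# Hypothetical exons that are longer than the CDS: UTRs can be deduced.
exons_with_utr = [(150, 230), (270, 340)]
print(introns_between(exons_with_utr))  # [(231, 269)]
print(utrs_from(exons_with_utr, cds))   # [(150, 199), (301, 340)]

# Exons identical to the CDS: the gap is an intron and there are no UTRs.
print(introns_between(cds))             # [(231, 269)]
print(utrs_from(cds, cds))              # []
```

In the second case the gap 231-269 is an intron, so there is nothing to split into UTRs: when the exons coincide with the CDS, the deduced UTR list is simply empty.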


I can’t thank you enough.

Now everything is clear!

Let me ask 2 more questions, please.

The only thing that still confuses me in my gtf file is that the start coordinate of the gene* coincides with the start of the start codon, and the end coordinate of the gene coincides with the end of the stop codon. For example,

  • gene 200-300
  • transcript . - .
  • 1st CDS 200-230
  • start_codon 200-202
  • 2nd CDS 270-300
  • stop_codon 298-300

Where is there room for UTRs here?

*In my gtf there are "." instead of the transcript coordinates. I planned to replace them with the coordinates of the gene; could you tell me, please, whether this is correct? It seems to me that it is not, because the gene includes the regulatory region! But in another, arbitrary gtf that I downloaded, the coordinates of the gene repeat the coordinates of the transcript.


More details would help.

If the gtf comes from a publicly accessible source, please share a link to it.

Otherwise, please tell us the organism and provide the snippet of the gtf that corresponds to the feature and coordinates mentioned above.

2.3 years ago
Juke34 8.9k

AGAT will definitely add the exons. Try the gxf2gxf converter.


Thank you so much for your help!

I will definitely try this tool.

Could you please explain how exons (if I understand correctly, UTRs + CDS) can be added when only the CDSs are known and the UTRs are missing from the gtf file? Or does this tool annotate UTRs at the same time? For example, how is the gap between CDSs divided into UTRs belonging to different exons?


This is not a prediction tool. It only deduces features from other features. For example, if your input file describes exons and CDS, AGAT can create the UTRs (e.g. when an exon is longer than its CDS). By the same logic, it can create exons from UTR + CDS. If you do not have any UTRs in your file, it will create exons based on the CDS alone.
One more case: if you only have CDS but the mRNA is described as longer than the CDS, UTRs and exons can be deduced.
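The "exons based on CDS only" case can be sketched as below. This is a minimal illustration and not AGAT's actual implementation. It assumes standard 9-column tab-separated gtf lines, a `transcript_id "..."` attribute on every CDS line, and CDS coordinates that already include the stop codon; `exon_lines_from_cds` is a hypothetical helper name and the sample lines are made up.

```python
import re
from collections import defaultdict


def exon_lines_from_cds(gtf_lines):
    """Build one exon line per CDS line, grouped by transcript_id."""
    by_transcript = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue  # only CDS records are used as templates
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            by_transcript[match.group(1)].append(fields)

    exon_lines = []
    for records in by_transcript.values():
        for fields in sorted(records, key=lambda f: int(f[3])):
            exon = list(fields)
            exon[2] = "exon"  # same coordinates, different feature type
            exon[7] = "."     # the frame column is only meaningful for CDS
            exon_lines.append("\t".join(exon))
    return exon_lines


# Made-up sample lines for illustration.
sample = [
    'chr1\tAUGUSTUS\tgene\t200\t300\t.\t+\t.\tgene_id "g1";',
    'chr1\tAUGUSTUS\tCDS\t270\t300\t.\t+\t1\ttranscript_id "g1.t1"; gene_id "g1";',
    'chr1\tAUGUSTUS\tCDS\t200\t230\t.\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";',
]
for exon_line in exon_lines_from_cds(sample):
    print(exon_line)
```

If a file keeps the stop codon outside the CDS, the last exon would also need to be extended over the stop_codon feature, which this sketch does not do. For real data, a dedicated converter such as AGAT handles these cases.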


Thanks a lot!

But if we see this in a gtf:

  • gene with positions 200-300 and
  • 2 CDSs 200-230, 270-300,

how can we unambiguously divide the gap between 230 and 270 into 2 UTRs belonging to different exons? I took arbitrary numbers, but the relative positions are exactly the same for all genes and CDSs in my file.

I'm sorry for my inexperience.


After @Jeffin Rockey's comment everything became clear: I now understand how AGAT works. Thanks again!

