How to add the start and stop position information to a gff file?
7.8 years ago
I0110 ▴ 160

Hi,

I am using the new tomato ITAG3.0 annotation, but the GFF file does not contain rows for the start and stop codon positions. Is there a way to fix that with R or Python? In other words, is there a way to use the existing range information ("gene", "mRNA", "CDS" and "exon" rows) to generate a GFF file that includes the start and stop codon positions?

Thanks! Larry

genome R Assembly Python • 5.6k views

Hi, did you solve the problem? Would you mind sharing the solution, please? Thanks.

5.2 years ago
Juke34 8.9k

I have a Perl script for that purpose called gff3_sp_add_start_and_stop.pl, available at the GAAS repository:

gff3_sp_add_start_and_stop.pl --gff infile.gff --fasta infile.fasta -o output.gff

It is also available as agat_sp_add_start_and_stop.pl within the GFF toolkit AGAT:

agat_sp_add_start_and_stop.pl --gff infile.gff --fasta infile.fasta -o output.gff

You can specify the codon table to use (1 by default). It handles start or stop codons that are split across several exons.
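For anyone who wants to sanity-check annotated codons in Python, here is a minimal sketch. It is not how the Perl script works internally. It assumes the chromosome sequence is already loaded as a plain string, that coordinates are 1-based and inclusive as in GFF, and that only ATG counts as a start codon (the standard table, number 1, has TAA, TAG and TGA as stops). The names `codon_bases`, `START_CODON` and `STOP_CODONS` are made up for this example.

```python
# Assumption: standard genetic code (table 1), canonical ATG start only.
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def codon_bases(chrom_seq, ranges, strand):
    """Join the bases of 1-based inclusive ranges and read them 5'->3'.

    `ranges` may hold one range (codon inside a single CDS segment) or
    several (codon split across an exon-exon junction).
    """
    bases = "".join(chrom_seq[start - 1:end] for start, end in sorted(ranges))
    if strand == "-":
        # Reverse-complement to read the minus strand 5'->3'
        bases = bases.translate(_COMPLEMENT)[::-1]
    return bases.upper()


# Toy sequences, not real tomato data
plus_seq = "CCATGAAATAGCC"
print(codon_bases(plus_seq, [(3, 5)], "+") == START_CODON)
print(codon_bases(plus_seq, [(9, 11)], "+") in STOP_CODONS)
```

A codon that does not match is not necessarily a bug in the script; it can also point at a partial gene model or a different codon table.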

6.6 years ago
brendanmwee ▴ 10

14 months later... I am currently dealing with this same issue. The approach I have come up with so far is simply to impute the beginning and end codons of the exon. This doesn't work very well, but it allows my pipeline to progress.

The other ideas I have are to use the exon interval to find the nearest AUG to the exon and write start and stop codon entries into the GTF, or to use the entries in my current GTF that match CCDS entries and look up the start and stop codons in the reference GTF.

I will post whatever we come up with in the end

5.3 years ago

It's a bit old, but I use aegean/CanonGFF and genometools to add the features that are typically meant to be inferred, e.g. UTRs, introns, start and stop codons.

It does rely on the GFF adhering to the standards/conventions though, so you'll need gene>mRNA>CDS/exon gene structures, and the CDS should contain the stop codon.

Typically I would do something like this:

gt gff3 -tidy -sort -retainids my.gff3 | canon-gff3 -i - > my_with_stops.gff3
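If you would rather stay in Python, the first step is usually to collect the CDS rows that belong to each mRNA through the `Parent` attribute. The following is a rough sketch of that grouping only. It assumes a tidy, tab-separated GFF3 with nine columns, and `cds_by_parent` and the IDs in the demo lines are hypothetical names for illustration.

```python
def cds_by_parent(gff_lines):
    """Map each Parent ID to (seqid, strand, [(start, end), ...]) for its CDS rows."""
    groups = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        # Column 9 holds key=value pairs separated by semicolons
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        # A CDS row may list several parents separated by commas
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                entry = groups.setdefault(parent, (cols[0], cols[6], []))
                entry[2].append((int(cols[3]), int(cols[4])))
    return groups


# Made-up demo rows, not taken from ITAG3.0
demo = [
    "##gff-version 3",
    "chr1\tdemo\tmRNA\t100\t400\t.\t+\t.\tID=mRNA1;Parent=gene1",
    "chr1\tdemo\tCDS\t120\t200\t.\t+\t0\tID=cds1;Parent=mRNA1",
    "chr1\tdemo\tCDS\t300\t380\t.\t+\t0\tID=cds1;Parent=mRNA1",
]
print(cds_by_parent(demo))
```

For real files a dedicated GFF parser is likely more robust than splitting strings by hand.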
7.8 years ago
Michael 55k

That is implicit: the start codon should be at the first 5' CDS position, and the stop codon at the 3' end of the last CDS. Mind the strand and any phase offset. Genome browsers normally don't require the start and stop codon information. There may also be a reason not to annotate the start and stop codons: current gene models are notoriously error-prone and often based on automatic prediction only, so stating an exact start codon, for example, implies possibly undue confidence. Experimental techniques such as ribosome profiling have often delivered surprising results with respect to unexpected translation initiation sites (I'll find you a citation for that...).

(If CDS are not annotated, you have to subtract the 5'/3' UTRs from the terminal exons.)
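The UTR subtraction mentioned in the parenthesis can be sketched in a few lines of Python. This is only an illustration under some assumptions: exons and UTRs are given as 1-based inclusive `(start, end)` tuples for a single transcript, and `subtract_utrs` is a hypothetical helper name.

```python
def subtract_utrs(exons, utrs):
    """Return the exon portions not covered by any UTR, as sorted ranges."""
    coding = []
    for ex_start, ex_end in sorted(exons):
        pos = ex_start  # first base of this exon not yet accounted for
        for u_start, u_end in sorted(utrs):
            if u_end < pos or u_start > ex_end:
                continue  # this UTR does not overlap the remaining exon part
            if u_start > pos:
                coding.append((pos, u_start - 1))
            pos = max(pos, u_end + 1)
        if pos <= ex_end:
            coding.append((pos, ex_end))
    return coding


# Toy transcript: 5' UTR inside the first exon, 3' UTR inside the last one
exons = [(100, 200), (300, 400), (500, 600)]
utrs = [(100, 149), (551, 600)]
print(subtract_utrs(exons, utrs))
```

The resulting ranges play the role of the CDS rows, so the start and stop codons can then be read off their 5' and 3' ends.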


I know these facts, but is there a way to do that automatically with R or Python? Thanks a lot!


Sure there is, you just need to write a little script e.g. based on GRanges in R ;)


A little hint will be much appreciated. Thanks!


It's a few lines in Perl, but I don't have time now. If you don't have a solution by tomorrow, send me a gentle reminder.


Thanks! I tried to write it in R, but it is more difficult than I thought. The problem I have is with start and stop codons that span an exon-exon junction. On these rare occasions, the start or stop codon needs to be split into two ranges.


Why not use the CDS annotation instead?


Did you mean using the CDS rows to annotate the start and stop codon positions? I think we would still encounter the same issue. For example, if I use the first three nucleotides of the first CDS, the codon could still span an exon-exon junction, so I cannot simply create a range from the first CDS row of the gene. I would have to use information from two CDS rows.
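The split-codon case described here can be handled by walking the CDS segments in transcript order and collecting bases until three are found. Below is a minimal Python sketch of that idea, not a tested pipeline. It assumes 1-based inclusive coordinates, that the CDS rows of one transcript include the stop codon (the GFF3 convention), and that the gene model is complete. `terminal_codons` is a hypothetical name.

```python
def terminal_codons(cds, strand):
    """Return (start_codon, stop_codon), each a sorted list of (start, end) ranges.

    A codon that crosses an exon-exon junction comes back as two ranges
    (or three, in the extreme case of one-base CDS segments).
    """
    # Order the CDS segments 5'->3' along the transcript
    segments = sorted(cds, reverse=(strand == "-"))

    def take(n, from_five_prime):
        out = []
        order = segments if from_five_prime else segments[::-1]
        for start, end in order:
            if n == 0:
                break
            length = min(n, end - start + 1)
            # The 5' side of a segment is its low coordinate on '+',
            # its high coordinate on '-'; the 3' side is the opposite.
            low_side = (strand == "+") == from_five_prime
            out.append((start, start + length - 1) if low_side
                       else (end - length + 1, end))
            n -= length
        return sorted(out)

    return take(3, True), take(3, False)


# Toy example: the first CDS segment holds only two bases of the start codon
start_codon, stop_codon = terminal_codons([(100, 101), (200, 260)], "+")
print(start_codon)
print(stop_codon)
```

Each returned range would become its own `start_codon` or `stop_codon` row with the same Parent as the CDS. Checking the underlying bases against the genome sequence is still advisable, since partial gene models will give ranges that are not real start or stop codons.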
