Question

Predicting coding sequence region from GFF with exons + reference genome

5.5 years ago

grey ▴ 40

I have inherited a collection of genome annotation files (GFF3) for several newly assembled genomes, and I discovered that the coding region coordinates in them are incorrect. I'd like to remove the coding region coordinates and re-predict them from the exons.

The GFF files were created using Comparative-Annotation-Toolkit (CAT / Augustus), using a combination of RNA-seq data and liftover from the reference genome for this species. The exon-intron structure appears to be correct in the new genomes. However, the problem seems to be that the start and stop coordinates for the coding regions (CDS) have been forced onto the new genomes even in cases where they produce amino acid sequences that don't make any sense (i.e., they do not begin with Met, have stop codons in the middle of the sequence, or do not end with a stop codon).
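For anyone wanting to flag the affected transcripts first, the three symptoms above can be checked on a spliced CDS nucleotide sequence. This is only a minimal sketch: the function name `cds_problems` is hypothetical, and it assumes the standard genetic code with ATG as the only start codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def cds_problems(cds):
    """Return a list of reasons why a spliced CDS nucleotide sequence looks wrong."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("does not start with ATG")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    return problems
```

An empty list means the CDS passes all checks; anything else lists what is wrong with it.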

I would be open to other suggestions, but having spent some time working on it, I've decided to try to re-predict the CDS coordinates from the gff file (remove CDS regions and re-predict reading frame in the exons).
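Re-predicting from the exons starts with rebuilding the spliced transcript sequence. A rough sketch of that step, assuming the chromosome sequence is already loaded as a plain string and the exon coordinates are the 1-based inclusive start/end columns of the GFF (the function names are made up for illustration):

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T only; other characters are kept)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def splice_exons(chrom_seq, exons, strand):
    """Join exon sequences into a transcript.

    exons: list of (start, end) tuples, 1-based inclusive as in GFF3.
    strand: "+" or "-".
    """
    ordered = sorted(exons)  # genomic order, regardless of input order
    mrna = "".join(chrom_seq[start - 1:end] for start, end in ordered)
    # Minus-strand transcripts are read from the opposite strand.
    return reverse_complement(mrna) if strand == "-" else mrna
```

The reading frame can then be searched on the returned string, and the resulting ORF positions mapped back through the exon list to genomic coordinates.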

Can someone point me to a method that takes a GFF with exons plus the reference genome as input and calls the coding region coordinates?

Thank you!

genome annotation gff augustus maker • 2.2k views

ADD COMMENT • link updated 5.5 years ago by Juke34 9.2k • written 5.5 years ago by grey ▴ 40

score 0 · Answer 1 · 2019-10-24

5.5 years ago

Juke34 9.2k

I have a Perl script for that purpose in the AGAT toolkit (conda install -c bioconda agat):
agat_sp_fix_longest_ORF.pl

ADD COMMENT • link 5.2 years ago by Juke34 9.2k

Thanks Juke. So, from the name of the script, I assume it just looks at all potential ATG starts and selects the one that gives the longest sequence before a stop codon?
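To make that assumption concrete, here is a small sketch of a naive longest-ORF search on a spliced transcript. It is not taken from AGAT; the function name is hypothetical, it only scans the three forward frames, and it assumes ATG starts and the standard stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def longest_orf(mrna):
    """Return (start, end), 0-based half-open, of the longest ATG..stop ORF, or None."""
    mrna = mrna.upper()
    best = None
    for frame in range(3):
        start = None  # position of the first ATG since the last stop in this frame
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3  # include the stop codon
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None
    return best
```

For each stop codon only the first in-frame ATG is kept, since any later ATG in the same frame would give a shorter ORF.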

ADD REPLY • link 5.5 years ago by grey ▴ 40

It extracts the current CDS to look at its size (it doesn't check for the presence of start or stop codons), then it extracts the exons and does a prediction. It compares the length of the new prediction with the original and classifies each locus into one of 5 different cases (called models):

Model1 = original sequence is part of the new prediction; the predicted one is longer.
Model2 = original and predicted sequences are different; the predicted one is longer; they don't overlap each other.
Model3 = original and predicted proteins are different; the predicted one is longer; they overlap each other.
Model4 = the prediction is shorter.
Model5 = the prediction is the same size but not in the correct frame (+1 or +2 bp gives a frame shift).

Depending on the models you activate (e.g. --model 1,4), if a prediction in a locus falls into one of these cases, it will replace the CDS.
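One possible reading of the five cases, written out as a sketch. This is an interpretation of the descriptions above, not AGAT's actual code: the function name is hypothetical, both CDS are assumed to be given as 0-based half-open intervals on the spliced transcript, and "part of the new prediction" is approximated as "contained in it and in the same frame".

```python
def classify_model(orig, pred):
    """Classify an original vs. predicted CDS into model 1-5, or None if no case fits.

    orig, pred: (start, end) intervals on the spliced transcript, 0-based half-open.
    """
    orig_len = orig[1] - orig[0]
    pred_len = pred[1] - pred[0]
    same_frame = (orig[0] - pred[0]) % 3 == 0
    overlap = orig[0] < pred[1] and pred[0] < orig[1]

    if pred_len > orig_len:
        contained = pred[0] <= orig[0] and orig[1] <= pred[1]
        if contained and same_frame:
            return 1  # original is part of the longer prediction
        return 3 if overlap else 2  # different sequences: overlapping or not
    if pred_len < orig_len:
        return 4  # prediction is shorter
    if orig != pred and not same_frame:
        return 5  # same size, shifted by +1 or +2 bp
    return None  # identical, or a case the descriptions do not cover
```

With something like this, activating models 1 and 4 would mean replacing the CDS only when the function returns 1 or 4.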

P.S.: I just updated the repo because the link was broken (the script was called gff3_sp_fix_longestORF.pl, but I had renamed it to gff3_sp_fix_longest_ORF.pl), so do a git pull.

ADD REPLY • link 5.5 years ago by Juke34 9.2k

do a prediction

What do you mean? Does it find a CDS that makes sense (i.e., one that starts with a start codon and ends with a stop codon)?