I have inherited a collection of genome annotation files (GFF3) for several newly assembled genomes, and I discovered that the coding region coordinates in them are incorrect. I'd like to remove the coding region coordinates and re-predict them from the exons.
The GFF files were created using Comparative-Annotation-Toolkit (CAT / Augustus), using a combination of RNA-seq data and liftover from the reference genome for this species. The exon-intron structure appears to be correct in the new genomes. However, the problem seems to be that the start and stop coordinates for the coding regions (CDS) have been forced onto the new genomes even in cases where they produce amino acid sequences that don't make any sense (i.e. they do not begin with Met, have stop codons in the middle of the sequence, or do not end with a stop codon).
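To be concrete about what I mean by "don't make any sense", here is a minimal sketch of the three checks. It assumes the CDS segments have already been spliced together (and reverse-complemented for minus-strand genes) and that the standard genetic code applies; the function name `cds_problems` is just illustrative, not part of any tool.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def cds_problems(cds):
    """Return a list of problems found in a spliced CDS nucleotide sequence."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3")
    # Split into complete codons only; a trailing partial codon is ignored.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("does not begin with ATG")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    return problems
```

An empty list means the CDS passes all three checks.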
I would be open to other suggestions, but having spent some time working on it, I've decided to try to re-predict the CDS coordinates from the gff file (remove CDS regions and re-predict reading frame in the exons).
Can someone point me to a method in which the input files are a GFF with exons + reference genome to call coding region coordinates?
Thank you!
Thanks Juke, so from the name of the script, I assume it just looks at all potential ATG starts and selects the one with the longest sequence before a stop codon?
It extracts the current CDS to look at its size (it doesn't look at the presence of start or stop codons), then it extracts the exons and makes a prediction. It compares the length of the new prediction with the current one and classifies the result into 5 different cases (called models):
According to the models you activate (e.g. --model 1,4), if a prediction in a locus falls into one of these cases, it will replace the CDS.
P.S. I just updated the repo because the link was broken (the script was called gff3_sp_fix_longestORF.pl but I had changed it to gff3_sp_fix_longest_ORF.pl), so do a git pull.
What do you mean? Does it find a CDS that makes sense (i.e. starts with a start codon and ends with a stop codon)?
Yes, it predicts a CDS with a start and a stop codon.
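For anyone reading along, the general "longest ORF" idea discussed above can be sketched roughly as follows. This is not the script's actual implementation, only an illustration under some assumptions: the input is the spliced exon sequence in transcript orientation, the standard genetic code applies, and only complete ORFs (ATG through stop) count. The function name `longest_orf` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def longest_orf(transcript):
    """Return (start, end) of the longest ATG..stop ORF, or None.

    Coordinates are 0-based, half-open, relative to the spliced transcript,
    and the stop codon is included in the returned interval.
    """
    seq = transcript.upper()
    best = None
    for frame in range(3):
        start = None  # position of the first ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # keep scanning for later ORFs in this frame
    return best
```

The returned transcript coordinates would still need to be mapped back through the exon boundaries to genomic coordinates (and CDS phase values recomputed) before writing new CDS lines to the GFF.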