Hello,
Does anyone know a good way to get premature stop codons from exonerate's protein2genome model?
Unfortunately the vulgar output doesn't record stop codons. I also can't just pull the protein sequence out of the genomic DNA input (in my case the target sequence), because the --ryo option only ever seems to deliver the DNA sequence, and I can't simply translate that because it contains frameshifts and split codons.
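To make the problem concrete, here is a rough sketch of the translation step I have in mind. It is plain Python, not anything exonerate or Biopython provides. It assumes the target coding DNA has already been cut into in-frame pieces at every frameshift, which is exactly the part I don't know how to get reliably; the function name and the input layout are my own invention. Any codon that is incomplete or contains an ambiguous base comes out as 'X', and every '*' before the final residue is reported as a premature stop.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_segments(segments):
    """Translate in-frame DNA segments and locate premature stops.

    `segments` is a hypothetical input: a list of DNA strings, each one
    starting in frame (i.e. the coding sequence split at each frameshift).
    Returns (protein, stops), where `stops` holds the 0-based protein
    positions of every '*' that is not the last residue.
    """
    residues = []
    for seg in segments:
        seg = seg.upper()
        for i in range(0, len(seg), 3):
            codon = seg[i:i + 3]
            # Incomplete or ambiguous codons are masked rather than guessed.
            residues.append(CODON_TABLE.get(codon, "X"))
    protein = "".join(residues)
    stops = [i for i, aa in enumerate(protein[:-1]) if aa == "*"]
    return protein, stops


if __name__ == "__main__":
    # Toy input: the first piece ends in a partial codon, as after a frameshift.
    print(translate_segments(["ATGTAAGG", "GCTTGA"]))
```

With real data the segments would have to come from the alignment coordinates, so this only covers the easy half of the problem.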
Next I tried using Biopython's SearchIO package. I see that it inserts 'X's where the split codons are, which is fine, but once there are too many frameshifts or split codons, it just gives up on the sequence and returns all 'X's, even though those regions were still alignable with exonerate. There's definitely a lot of information here that SearchIO is just discarding.
Here's the bottom of my exonerate alignment, just to show that there is still alignable stuff there:
And here's what SearchIO extracts. There are 11 exons in the exonerate alignment, but the parser gives up on the last 5: