Question

Getting premature stop codons from exonerate output?

0

Entering edit mode

3.1 years ago

virginia.s.markham ▴ 80

Hello,

Does anyone know a good way to get premature stop codons from exonerate's protein2genome model??

Unfortunately the Vulgar output doesn't record stop codons. You also can't just get the protein sequence from the genomic DNA input (in my case the target sequence), because the --ryo option always seems to deliver the DNA sequence. I can't simply translate this because it contains frameshifts and split codons.

Next I tried using Biopython's SearchIO package. I see that it inserts 'X's where the split codons are, which is fine, but eventually if there are too many frameshifts or split codons, it just gives up on the sequence and returns all 'X's, even though those regions were still alignable with exonerate. There's definitely a lot of information here that SearchIO is just discarding.

Here's the bottom of my exonerate alignment, just to show that there is still alignable stuff there:

Here's the bottom of my exonerate alignment, just to show that there is still alignable stuff there

And here's what SearchIO extracts. There are 11 exons in the exonerate alignment, but the parser gives up on the last 5:

And here's what SearchIO extracts

exonerate biopython • 1.1k views

ADD COMMENT • link 3.1 years ago by virginia.s.markham ▴ 80

score 2 · Answer 1 · 2021-11-07

Okay I just wrote a very simple string parsing script to get premature termination codons ('***') from the exonerate alignment. Shame that there's not a simple way to do it with the exonerate options nor with Biopython.

import re

#THIS ONLY WORKS IF YOU HAVE NO EXTRA CUSTOM OUTPUT
#SO NO SUGAR, CIGAR, OR VULGAR, AND NO --ryo OR --showquerygff OR --showtargetgff
#Also expects that the query was a peptide and target was a DNA sequence

filename = 'myalignment.aln'
read = []
with open(filename, 'r') as alnf:
    read = alnf.readlines()

#remove some of the junk
lines = [ line for line in read if line != '\n' ]
lines = [ line for line in lines if line != '-- completed exonerate analysis\n' ]

pattern = re.compile(r'\s+') #matches all whitespace (space, tab, newline, and so on). we'll remove this.
queryAA = ''
for i in range(10, len(lines), 4):
    queryAA += re.sub(pattern, '', lines[i]).split(':')[1]

targetAA = ''
for i in range(12, len(lines), 4):
    targetAA += re.sub(pattern, '', lines[i])

#remove the real termination codon if there is one
if queryAA.endswith('***') and targetAA.endswith('***'):
    queryAA = queryAA[:-3]
    targetAA = targetAA[:-3]

# I'm just recording whether '***' occurs, (True/False) not the number of occurrences.
print('{}\t{}\t{}'.format(filename, '***' in queryAA, '***' in targetAA) )