Question

Extracting Target Sequence From Exonerate Output

4

Entering edit mode

12.5 years ago

Woa ★ 2.9k

Hello All,

I'm trying to parse an Exonerate generated alignment output file. The alignment was generated using the "protein2dna | bestfit" model and contains a single 'HSP' and multiple "HSPFragments" object within the HSP.

The alignment looks as follows: Pastebin Link . I wish to extract the 3 letter amino acid sequence of the target sequence from the alignment. However I don't know if there is any direct way to fetch the concatenated "HSPfragment" sequences and removing the gap characters.

I'm using the following code to extract fasta format sequences for each of the "HSPFragment"( seven for this case), writing them to a file and then concatanate individually, after removing the "X" characters which represents gap in the alignment.The "#" characters denote frameshifts and they don't appear in the fasta format sequences.The output can be found in this Pastebin Link

from Bio import SearchIO
qresult = SearchIO.read('bestfit.aln', 'exonerate-text')
all_frag= len(qresult.fragments)

for i in range(0,all_frag):
    qfrag= qresult.fragments[i].hit
    print qfrag.format("fasta")

Thanks

biopython parsing • 7.0k views

ADD COMMENT • link updated 9.1 years ago by Biostar 20 • written 12.5 years ago by Woa ★ 2.9k

score 2 · Answer 1 · 2013-04-24

2

Entering edit mode

12.5 years ago

bow ▴ 790

Hi Woa,

Thanks for posting the question with the input ~ it's always handy to have them :).

Answering your question, in short: no. There's no way to do this automatically with Biopython's SearchIO. For now, you have to edit / remove the 'X's manually I'm afraid.

The Exonerate parser here first tries to determine the fragments based on whether Exonerate outputs any exon/intron information. Then, if any frameshifts occured, it will split further based on these frameshifts. The '#' character was indeed removed, but the frameshift information is still retained. You can see them on the fragments using the .hit_frame / .query_frame or .hit_frames / .query_frames in the HSP objects.

Hope that answers your question :).

ADD COMMENT • link 12.5 years ago by bow ▴ 790

0

Entering edit mode

Thanks Bow and lxe. Removing "X"s are trivial but is there any way to get just the sequences for each HSPFragments other than printing them in the FASTA format? I've tried the methods 'seq' and '_seq' available for a 'qfrag' object as found by dir(qfrag). But both of them produced some output like this containing truncated sequence:

>>> qfrag.seq

Seq('MSLIRVNRFGSRGGGRKTLKVKKKASVRQEWDNTVNDLTVHRATPEDLVRRHEI...EKR', ProteinAlphabet())

ADD REPLY • link 12.5 years ago by Woa ★ 2.9k

0

Entering edit mode

What you're seeing is the representation of Biopython's Seq object. You can get the actual string of the sequence of that object by using str, e.g. str(qfrag.seq).

ADD REPLY • link 12.5 years ago by bow ▴ 790

score 0 · Answer 2 · 2013-04-24

0

Entering edit mode

12.5 years ago

pd3 ▴ 350

Take a look at the --ryo option of exonerate. One can customize the output to be more parsable. http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.html

ADD COMMENT • link 12.5 years ago by pd3 ▴ 350

1

Entering edit mode

the link above disappeared, you can use the ubuntu man page instead: http://manpages.ubuntu.com/manpages/hardy/man1/exonerate.1.html

ADD REPLY • link 10.2 years ago by Michael 56k