Question

how to parse blast output using biopython

1

Entering edit mode

11.0 years ago

Camilla ▴ 10

Dear all,

I am new to Biopython and have tried to parse my output blast file.

My script is:

from Bio.Blast import NCBIXML
result=open("blast_anoC.xml","r")
records= NCBIXML.parse(result)
item=next(records)
for alignment in item.alignments:
          for hsp in alignment.hsps:
                 if hsp.expect <0.01:
                         print('****Alignment****')
                         print('sequence:', alignment.title) 
                         print('length:', alignment.length)
                         print('score:', alignment.score)
                         print('gaps:', alignment.gaps)
                         print('e value:', hsp.expect)
                         print(hsp.query[0:90] + '...')
                         print(hsp.match[0:90] + '...')
                         print(hsp.sbjct[0:90] + '...')

When I run it, Biopython executes it but doesn't print nothing...Why?

Many thanks,
Camilla

biopython blast • 24k views

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 11.0 years ago by Camilla ▴ 10

0

Entering edit mode

Is there a reason you're using XML output? A tab-delimited output (-outfmt 6), would be much easier to parse with standard python, or Linux commands. You can also run BLAST from inside your python script with BioPython.

ADD REPLY • link 8.2 years ago by st.ph.n ★ 2.7k

1

Entering edit mode

Biopython doesn't totally support the plain text parsing now, not enough stable, so XML is the best recommanded output for Biopython parsing.

ADD REPLY • link 8.2 years ago by Rob ▴ 150

0

Entering edit mode

So, how to extract the query coverage from XML output?

ADD REPLY • link 7.3 years ago by Andrea Spinelli ▴ 10

0

Entering edit mode

Hello, I'm trying to get also the features of the alignment (the one is written in the picture) but I can't figure out how to do it.

My file is a xml that corresponds to several blast sequences, so I also have the question how to distinguish between queries. For each query I would have to get the one with highest score from genomics sequences (no transcript) and the feature corresponding to the name of the transcript (name of the gene that codify for protein, in this example Apoptosis regulator BCL2).

Thank you in advance,

Miguel

Example BCL2

ADD REPLY • link 6.5 years ago by malj ▴ 20

0

Entering edit mode

Please open a new question and reference this post (how to parse blast output using biopython) there. Do not add an answer unless you're answering the top level question.

ADD REPLY • link 6.5 years ago by Ram 45k

Ram · Answer 1 · 2014-04-23

Hi Camilla,

There are two BUGS in your code. score and gaps are not member of alignment object. Correct code should be:

print('score:', hsp.score)
print('gaps:', hsp.gaps)

Use dir() to check members:

>>> dir(item.alignments[0])
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'accession', 'hit_def', 'hit_id', 'hsps', 'length', 'title']
>>> dir(item.alignments[0].hsps[0])
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'align_length', 'bits', 'expect', 'frame', 'gaps', 'identities', 'match', 'num_alignments', 'positives', 'query', 'query_end', 'query_start', 'sbjct', 'sbjct_end', 'sbjct_start', 'score', 'strand']

And here is how your code would run if you fix the bugs (I am using it on my BLAST file):

>>> for alignment in item.alignments:
... for hsp in alignment.hsps:
... if hsp.expect < 0.01:
... print('****Alignment****')
... print('sequence:', alignment.title)
... print('length:', alignment.length)
... print('score:', hsp.score)
... print('gaps:', hsp.gaps)
... print(hsp.query[0:90] + '...')
... print(hsp.match[0:90] + '...')
... print(hsp.sbjct[0:90] + '...')
...
****Alignment****
('sequence:', u'gi|568824607|ref|XM_006466626.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X5, mRNA')
('length:', 878)
('score:', 412.0)
('gaps:', 11)
AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATATCAATGAGCTTAAAA...
||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ||||||||||||| |...
AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATTTCAATGAGCTTAAGA...
****Alignment****
('sequence:', u'gi|568824605|ref|XM_006466625.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X4, mRNA')
('length:', 911)
('score:', 412.0)
('gaps:', 11)
AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATATCAATGAGCTTAAAA...
||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ||||||||||||| |...
AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATTTCAATGAGCTTAAGA...
****Alignment****
('sequence:', u'gi|568824603|ref|XM_006466624.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X3, mRNA')
('length:', 894)
('score:', 412.0)
('gaps:', 11)
AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATATCAATGAGCTTAAAA...
||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ||||||||||||| |...
AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATTTCAATGAGCTTAAGA...
****Alignment****
('sequence:', u'gi|568824601|ref|XM_006466623.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X2, mRNA')
('length:', 922)
('score:', 412.0)
('gaps:', 11)
AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATATCAATGAGCTTAAAA...
||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ||||||||||||| |...
AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATTTCAATGAGCTTAAGA...

Best Wishes,
Umer

score 1 · Answer 2 · 2014-04-24

1

Entering edit mode

11.0 years ago

Peter 6.0k

Your code only looks at the first BLAST query, and presumably that gave no matches (under the e-value threshold). You could replace ``item=next(records)`` with ``for item in records:`` (and indent the rest of the code) to try looking at all the records.

ADD COMMENT • link 11.0 years ago by Peter 6.0k

score 0 · Answer 3 · 2014-04-23

0

Entering edit mode

11.0 years ago

mccannj0908 • 0

Maybe the threshold value in the if statement is too small.

ADD COMMENT • link 11.0 years ago by mccannj0908 • 0

score 0 · Answer 4 · 2014-04-23

0

Entering edit mode

11.0 years ago

Camilla ▴ 10

I also tried with "hsp.expect <100"

...but is the same

:(

ADD COMMENT • link 11.0 years ago by Camilla ▴ 10

0

Entering edit mode

did the blast return any significant hits?

ADD REPLY • link 11.0 years ago by mccannj0908 • 0

score 0 · Answer 5 · 2017-02-17

0

Entering edit mode

8.2 years ago

krish.krishnan67 • 0

I used the same script above to parse my blast output in XML. The error in the script was the line item=next(records). Instead I used item=records.next() and I got the output on screen for the same e-value without any compromise. Maybe you should try this!