Question

Extracting Query Sequence From Blast Xml Output Using Ncbixml Parser

1

Entering edit mode

12.4 years ago

viv_bio ▴ 50

I want to Blast my data base of sequence (Aminoacid) with a data base. And get sequences which have a given score and e value less than 0.5 . So I ran blast with e-value 0.5 and now i want to isolate those sequences which have scores higher than given value . So i parse it using NCBIXML parser from Biopython. So far i have -

from Bio.Blast import NCBIXML 
blast = NCBIXML.parse(open(raw_input("Blast file location :"),"rU")) 
for record in blast :
            if record.alignments:
                if record.alignments[0].hsps[0].score >= 1000:
                    print record.alignments[0].hsps[0].query

This gives me all the sequence but i want it with there respective gene id in seq format so i can use it for further purpose .

biopython blast • 8.5k views

ADD COMMENT • link updated 12.4 years ago by Damian Kao 16k • written 12.4 years ago by viv_bio ▴ 50

0

Entering edit mode

Don't know what you mean by "seq format". Fasta?

ADD REPLY • link 12.4 years ago by Neilfws 49k

score 2 · Answer 1 · 2012-08-05

I'm just looking through the biopython api docs and noticed that the Alignmnet class has a hit_id method. Try adding this line:

print record.alignments[0].hit_id

... and see if those ID's are useful for you. I'm not totally clear on exactly what format you need the gene id in for your downstream use, so you may need to parse it a bit before passing it along.

score 1 · Answer 2 · 2012-08-05

1

Entering edit mode

12.4 years ago

Damian Kao 16k

The XML file does not contain the raw sequences. You are going to have to get the sequence IDs from the XML file and then use the IDs to extract those sequences from the query fasta file. You can use biopython's SeqIO to parse the fasta file. So something like this:

ids = dict([(x, True) for x in open('your ids file','r')])

for record in SeqIO.parse(open('your query fasta file','r'),'fasta'):
   if ids.has_keyrecord.id):
      print ">" + record id + "\n" + str(record.seq)

Assuming you have extracted a list of sequence ids in a file separate by line.

ADD COMMENT • link 12.4 years ago by Damian Kao 16k

2

Entering edit mode

It would be more Pythonic to use 'if id in ids' rather than 'if ids.has_keyrecord.id)', and also there is no point using a dictionary like this where all the values are True. Use a Python set instead, which also has the efficient hash based membership testing offered by a dict but not a list.

ADD REPLY • link 12.4 years ago by Peter 6.0k

0

Entering edit mode

Yeah I use has_key out of habit. 'In' just feels intuitively slower even though it's not. I didn't know sets also use hash based membership. Thanks.