How Do You Extract The Query Title From A Blast Xml Output Using Biopython
2
2
Entering edit mode
13.2 years ago
Farhat ★ 2.9k

I would like to find the alignment title along with the query title and the expect value in a BLAST XML output file with many query sequences. I can get the alignment title and expect value but the query title is eluding me. How do I extract that?

blast_records = NCBIXML.parse(result_handle)
blast_record = blast_records.next()

for blast_record in blast_records:
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            print alignment.title, hsp.expect, blast_record.header.query
blast biopython • 9.4k views
ADD COMMENT
0
Entering edit mode

Can you give the cloud link of your output file?

ADD REPLY
0
Entering edit mode

Thaman: that won't be of much help since any result file should have its query name in the same place.

ADD REPLY
7
Entering edit mode
13.2 years ago
David W 4.9k

Is blast_record.query what you're looking for?

ADD COMMENT
0
Entering edit mode

Yes, that does the job.

ADD REPLY
1
Entering edit mode
13.2 years ago

I'm not a python or a BioPython guy, but if your document is large, it could be a bad idea to upload it in memory. Here is a event-bases (SAX) solution:

import sys
import xml
import xml.sax
from  cStringIO import StringIO
from xml.sax.handler import ContentHandler,DTDHandler,EntityResolver
from xml.sax.xmlreader import InputSource

class BlastHandler(ContentHandler,EntityResolver):
    def __init__(self):
        self.content=None
        self.hitdef=None
        self.evalue=None
        self.query=None
    def startElement(self,name,attrs):
        if(name=="Hit_def" or name=="Hsp_evalue" or name=="BlastOutput_query-def"):
            self.content=""
    def endElement(self,name):
        if(name=="Hsp_evalue"):
            self.evalue=self.content
            print self.hitdef,self.evalue,self.query
            self.evalue=None
        elif(name=="Hit_def"):
            self.hitdef=self.content
        elif(name=="BlastOutput_query-def"):
            self.query=self.content
        self.content=None
    def characters(self,chars):
        if(self.content!=None):
            self.content+=chars
    def notationDecl(self, name, publicId, systemId):
        return None
    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        return None
    def resolveEntity(self, publicId, systemId):
        input = InputSource()
        input.setByteStream(StringIO(""))
        return input

if __name__=='__main__':
    handler=BlastHandler()
    parser=xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.setEntityResolver(handler)
    parser.parse(open(sys.argv[1]))
ADD COMMENT
2
Entering edit mode

The Biopython parser should only deal with the hits for one query at a time - so big multi-query BLAST XML files are not such a problem.

ADD REPLY
0
Entering edit mode

I would suggest cElementTree Api instead of SAX

ADD REPLY
0
Entering edit mode

Out of interest why? Other bits (newer) of Biopython do use ElementTree, but the BLAST XML parser (which is older) went for SAX.

ADD REPLY
0
Entering edit mode

Biopython parser does deal with only 1 query at a time so memory is not such an issue.

ADD REPLY

Login before adding your answer.

Traffic: 1970 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6