Question

How to parse Blast output from new XML2 format with taxid and sciname?

1

9.6 years ago

n8upham ▴ 10

I have been excited about the inclusion of NCBI taxonomy information in the new XML2 format, which was recently released in the ncbi-blast-2.2.31+ executables (documentation here: ftp://ftp.ncbi.nlm.nih.gov/blast/documents/NEWXML/xml2.pdf).

In particular, the addition of the taxid and sciname fields has the potential to dramatically shorten my workflow, which currently involves (i) using normal XML format to output from blast, (ii) parsing using NCBIXML.parse to remove redundant hits and only keep matching subjects > 300 bp (see the excellent post and code from Zach Powers), and then (iii) Efetching each accession number to download the record and grep out the desired taxid & sciname info (see useful post on this here). That whole process works, clunkily, but especially the last step of needing to use Efetch is a major bottleneck in a process that is otherwise all local once the NCBI nt database is downloaded.

So, XML2 format to the rescue, except that it breaks the current NCBIXML parser -- not too surprising given the entirely different set of headers throughout the XML file (e.g., <taxid> is under <HitDescr>, which is nested within <description>, <Hit>, <hits>, <Search>, <search>, <Results>, <results>, <Report>, <report>, <BlastXML> as compared to the old XML format, which had <Hit_accession> under <Hit_id>, <Iteration_hits>, <BlastOutput>)

Just accessing a blast_record with

blast_records = NCBIXML.parse(result_handle)

yields AttributeError: BlastParser instance has no attribute '_blast', indicating that the parser itself does not recognize the file contents.

So my question: Does anyone have a strategy to either (i) fix the current NCBIXML parser to work for the new XML2 format, or (ii) use one of the other BioPython parsers (e.g., SearchIO or NCBIStandalone) to both reduce redundancy and then extract the desired information in Fasta format?

Also, note that the XML2 (outfmt 14) natively outputs one file per query, which can then be assembled together using the outputted Xinclude file as follows:

xmllint --xinclude XincludeFile.xml --output allFiles.xml

I'm not sure what the benefit of that extra step is from NCBI's perspective, but presumably they had a reason...
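For anyone who would rather do the merge step without xmllint, here is a rough Python sketch using only the standard library. It assumes the Xinclude file uses the standard XInclude namespace (http://www.w3.org/2001/XInclude) and relative hrefs; the function name merge_xinclude and the toy file contents are made up for illustration and are not real BLAST output.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude


def merge_xinclude(master_path, out_path):
    """Expand xi:include elements in master_path and write one merged file."""
    tree = ET.parse(master_path)
    base = os.path.dirname(os.path.abspath(master_path))

    def loader(href, parse, encoding=None):
        # Resolve each href relative to the master file, not the working directory.
        return ElementInclude.default_loader(os.path.join(base, href), parse, encoding)

    ElementInclude.include(tree.getroot(), loader=loader)
    tree.write(out_path, encoding="utf-8", xml_declaration=True)


# Toy demonstration with hypothetical, heavily simplified files.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "query_1.xml"), "w") as fh:
        fh.write("<BlastOutput2><taxid>9606</taxid></BlastOutput2>")
    master = os.path.join(tmp, "XincludeFile.xml")
    with open(master, "w") as fh:
        fh.write('<BlastXML2 xmlns:xi="http://www.w3.org/2001/XInclude">'
                 '<xi:include href="query_1.xml"/></BlastXML2>')
    merged = os.path.join(tmp, "allFiles.xml")
    merge_xinclude(master, merged)
    merged_taxids = [node.text for node in ET.parse(merged).iter("taxid")]

print(merged_taxids)
```

Real XML2 files declare an XML namespace on their elements, so lookups on the merged tree would need to take that into account.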

Thanks in advance for any help/ideas, --nate

XML blast biopython • 4.1k views

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 9.6 years ago by n8upham ▴ 10

0

This isn't really a Q&A-style entry; it would have been ideal to send to the Biopython mailing list, e.g. http://lists.open-bio.org/pipermail/biopython/2015-May/015622.html

ADD REPLY • link 9.2 years ago by Peter 6.0k

0

For recent visitors, a BLAST XML2 parser was merged into Biopython around 2019: https://github.com/biopython/biopython/pull/1997

ADD REPLY • link 2.1 years ago by saladi ▴ 30

score 1 · Answer 1 · 2015-11-13

Biopython does not yet have a BLAST XML2 parser. Given the welcome news that the NCBI will be offering this format in single-file mode in the next release of BLAST+, it would be worth adding to Bio.SearchIO (which is in the process of replacing the old parsers under Bio.Blast). See also:

http://blastedbio.blogspot.co.uk/2015/07/blast-xml-2-include-trouble.html

http://www.ncbi.nlm.nih.gov/mailman/pipermail/blast-announce/2015q4/000118.html
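Independently of any Biopython parser, the taxid and sciname fields can be pulled out of an XML2 file with the standard library alone. The following is only a minimal sketch: the Hit, HitDescr and taxid tag names come from the question, while accession, sciname and align-len are assumed element names that should be checked against the XML2 documentation. The helper names, the sample data, and the choice to treat a repeated accession as "redundant" are all illustrative.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{namespace}' from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def first_text(elem, name):
    """Text of the first element under elem whose local tag is name, else None."""
    for node in elem.iter():
        if local_name(node.tag) == name:
            return node.text
    return None


def iter_hits(handle, min_align_len=300):
    """Yield (accession, taxid, sciname) once per accession for long hits."""
    seen = set()
    for _event, hit in ET.iterparse(handle, events=("end",)):
        if local_name(hit.tag) != "Hit":
            continue
        # 'align-len' is an assumed tag name for the HSP alignment length.
        lengths = [int(n.text) for n in hit.iter() if local_name(n.tag) == "align-len"]
        accession = first_text(hit, "accession")
        if lengths and max(lengths) > min_align_len and accession not in seen:
            seen.add(accession)
            yield accession, first_text(hit, "taxid"), first_text(hit, "sciname")
        hit.clear()  # free memory for large files


def _hit(acc, taxid, name, length):
    """Build one simplified, made-up Hit element for the demonstration."""
    return ("<Hit><description><HitDescr><accession>%s</accession>"
            "<taxid>%s</taxid><sciname>%s</sciname></HitDescr></description>"
            "<hsps><Hsp><align-len>%d</align-len></Hsp></hsps></Hit>"
            % (acc, taxid, name, length))


SAMPLE = ('<BlastOutput2 xmlns="http://www.ncbi.nlm.nih.gov">'
          + _hit("AB000001", "9606", "Homo sapiens", 450)
          + _hit("XY000002", "10090", "Mus musculus", 120)
          + _hit("AB000001", "9606", "Homo sapiens", 500)
          + "</BlastOutput2>")

for record in iter_hits(io.BytesIO(SAMPLE.encode("utf-8"))):
    print(*record, sep="\t")
```

Matching on local tag names sidesteps the XML namespace, at the cost of ignoring where in the tree each element sits. That is probably fine for a quick extraction, but less robust than a real parser.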

score 0 · Answer 2 · 2015-11-13

0

9.2 years ago

5heikki 11k

Any particular reason why you're not using the tabular output format, in which case you could specify that you want e.g. staxids in column 13 and sscinames in column 14?
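As a rough illustration of the tabular route, the sketch below assumes BLAST was run with something like -outfmt "6 std staxids sscinames", i.e. the 12 standard columns followed by the two taxonomy columns. It keeps the first row per query/subject pair with an alignment length above 300. The function name, the sample rows and the filtering rule are illustrative, not a fixed recipe.

```python
import csv
import io

# Column order assumed to match: -outfmt "6 std staxids sscinames"
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore",
           "staxids", "sscinames"]


def best_hits(handle, min_length=300):
    """Return the first row per (query, subject) with length > min_length."""
    seen = set()
    kept = []
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        key = (row["qseqid"], row["sseqid"])
        if int(row["length"]) > min_length and key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Made-up rows for demonstration only.
SAMPLE_TSV = "\n".join("\t".join(fields) for fields in [
    ["q1", "AB000001.1", "98.5", "450", "5", "1", "1", "450", "10", "459",
     "0.0", "800", "9606", "Homo sapiens"],
    ["q1", "AB000001.1", "97.0", "320", "9", "0", "1", "320", "900", "1219",
     "1e-150", "540", "9606", "Homo sapiens"],
    ["q1", "XY000002.1", "90.0", "120", "12", "0", "5", "124", "1", "120",
     "1e-40", "150", "10090", "Mus musculus"],
]) + "\n"

for row in best_hits(io.StringIO(SAMPLE_TSV)):
    # staxids may hold several semicolon-separated IDs for one subject.
    print(row["sseqid"], row["staxids"], row["sscinames"], sep="\t")
```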

I haven't used the XML2 format. Does it differ dramatically from normal XML? If not, why not parse it with a bash script? You know, something akin to:

rdom () { local IFS=\> ; read -r -d \< E C ;}; while rdom; do if [[ $E = taxid ]]; then echo "$C"; fi; done < "$1" > out

ADD COMMENT • link 9.2 years ago by 5heikki 11k

0

BLAST XML2 is very different, even at the low level of tag names.

ADD REPLY • link 9.2 years ago by Peter 6.0k

0

Good point on the tabular output - I was thinking that too, but answered only the XML side. See http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions.html for examples.

ADD REPLY • link 9.2 years ago by Peter 6.0k