I have been excited about the inclusion of NCBI taxonomy information in the new XML2 format that was recently released in the ncbi-blast-2.2.31+ executables (documentation here: ftp://ftp.ncbi.nlm.nih.gov/blast/documents/NEWXML/xml2.pdf).
In particular, the addition of the taxid and sciname fields has the potential to dramatically shorten my workflow, which currently involves (i) writing the blast output in the normal XML format, (ii) parsing it with NCBIXML.parse to remove redundant hits and keep only matching subjects > 300 bp (see the excellent post and code from Zach Powers), and then (iii) Efetching each accession number to download the record and grep out the desired taxid & sciname info (see useful post on this here). That whole process works, if clunkily, but the last step, which needs Efetch, is a major bottleneck in a process that is otherwise entirely local once the NCBI nt database is downloaded.
So, XML2 format to the rescue, except that it breaks the current NCBIXML parser -- not too surprising given the entirely different set of headers throughout the XML file (e.g., <taxid> is under <HitDescr>, which is nested within <description>, <Hit>, <hits>, <Search>, <search>, <Results>, <results>, <Report>, <report>, <BlastXML>, as compared to the old XML format, which had <Hit_accession> under <Hit_id>, <Iteration_hits>, <BlastOutput>).
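As a stopgap, the nesting described above can be walked without NCBIXML at all. What follows is only a minimal sketch using the standard library's xml.etree.ElementTree. It assumes each <HitDescr> holds <taxid> and <sciname> as direct children, as described above. The names local_name and iter_hit_taxonomy are made up for illustration, and any XML namespace is stripped rather than matched, since the exact namespace is not something I have verified here.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def iter_hit_taxonomy(source):
    """Yield (taxid, sciname) for every <HitDescr> element found.

    `source` is a filename or a binary file object. Values are returned
    as strings, or None when a field is absent (assumption: both fields
    are direct children of <HitDescr>).
    """
    # iterparse streams the file, so a large merged XML2 file does not
    # have to fit in memory at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "HitDescr":
            continue
        fields = {local_name(child.tag): (child.text or "").strip()
                  for child in elem}
        yield fields.get("taxid"), fields.get("sciname")
        elem.clear()  # drop the children we have already read
```

Looping over iter_hit_taxonomy("allFiles.xml") would then give one (taxid, sciname) pair per hit description, with no Efetch round trip.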
Just accessing a blast_record with
blast_records = NCBIXML.parse(result_handle)
yields AttributeError: BlastParser instance has no attribute '_blast', indicating that the parser itself does not recognize the file contents.
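To avoid hitting that error halfway through a pipeline, one option is to peek at the root element before choosing a parser. This is a sketch, not a Biopython feature: is_legacy_blast_xml is a hypothetical helper, and it relies only on the old format having <BlastOutput> as its outermost element, as noted above.

```python
import xml.etree.ElementTree as ET


def is_legacy_blast_xml(handle):
    """Return True if the root element is <BlastOutput> (old XML format).

    `handle` is a binary file object. Only the start of the file is read,
    but the handle is not rewound, so the caller should seek(0) or reopen
    the file before parsing it for real.
    """
    for _event, elem in ET.iterparse(handle, events=("start",)):
        # The first "start" event is always the root element.
        root = elem.tag.rsplit("}", 1)[-1]  # ignore any namespace
        return root == "BlastOutput"
    return False
```

Files for which this returns True can go to NCBIXML.parse as before; anything else needs a different route.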
So my question: Does anyone have a strategy to either (i) fix the current NCBIXML parser to work for the new XML2 format, or (ii) use one of the other BioPython parsers (e.g., SearchIO or NCBIStandalone) to both reduce redundancy and then extract the desired information in Fasta format?
Also, note that the XML2 (outfmt 14) natively outputs one file per query, which can then be assembled together using the outputted Xinclude file as follows:
xmllint -xinclude XincludeFile.xml -o allFiles.xml
I'm not sure of the benefit of that extra step from NCBI's perspective, but presumably they had a reason...
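If merging is not needed, the per-query files can also be processed one at a time by reading their names out of the Xinclude file. A minimal sketch, assuming the Xinclude file lists each per-query file as an include element with an href attribute (standard XInclude, which is what the xmllint call implies); included_files is a hypothetical helper name, and the element is matched by local name only so the exact XInclude namespace does not matter.

```python
import os
import xml.etree.ElementTree as ET


def included_files(master_path):
    """Return the paths referenced by include elements in an XInclude file.

    Relative href values are resolved against the directory that holds
    the master file, in document order.
    """
    base_dir = os.path.dirname(os.path.abspath(master_path))
    root = ET.parse(master_path).getroot()
    paths = []
    for elem in root.iter():
        # Match on the local name so either XInclude namespace works.
        if elem.tag.rsplit("}", 1)[-1] == "include" and "href" in elem.attrib:
            paths.append(os.path.join(base_dir, elem.attrib["href"]))
    return paths
```

Each returned path can then be parsed separately, which keeps memory use down compared with loading one large merged file.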
Thanks in advance for any help/ideas, --nate
This isn't really a Q&A style entry; it would have been ideal to send to the Biopython mailing lists, e.g. http://lists.open-bio.org/pipermail/biopython/2015-May/015622.html
For recent visitors, a BLAST XML2 parser was merged into Biopython around 2019: https://github.com/biopython/biopython/pull/1997