I have been using GNU Parallel to search an organism's entire proteome against a sequence database. It works great for BLAST tabular output, but I would like to use the traditional XML output, because of how well the SearchIO module of Biopython works. Specifically, I am interested in filtering the results and outputting the HSPs into a FASTA file, one for each query sequence.
However, the output from GNU Parallel is not a valid BLAST XML file; instead, the XML from each search is simply concatenated. If I try to read the file with SearchIO.read(), I get this error:
from Bio import SearchIO
all_results = SearchIO.read("results.xml", "blast-xml")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/apps/lib/python2.7/site-packages/biopython-1.63-py2.7-linux-x86_64.egg/Bio/SearchIO/__init__.py", line 364, in read
second = next(generator)
File "/opt/apps/lib/python2.7/site-packages/biopython-1.63-py2.7-linux-x86_64.egg/Bio/SearchIO/__init__.py", line 316, in parse
for qresult in generator:
File "/opt/apps/lib/python2.7/site-packages/biopython-1.63-py2.7-linux-x86_64.egg/Bio/SearchIO/BlastIO/blast_xml.py", line 193, in __iter__
for qresult in self._parse_qresult():
File "/opt/apps/lib/python2.7/site-packages/biopython-1.63-py2.7-linux-x86_64.egg/Bio/SearchIO/BlastIO/blast_xml.py", line 244, in _parse_qresult
for event, qresult_elem in self.xml_iter:
File "<string>", line 91, in next
cElementTree.ParseError: junk after document element: line 603, column 0
The same thing happens with SearchIO.parse() if I try to access the second element. Line 603 contains the XML declaration:
<?xml version="1.0"?>
which marks the beginning of the next BLAST result, each one produced by a separate GNU Parallel job.
Alternatively, I can read the full file in as text and split it into a list of individual XML documents. However, Biopython's SearchIO does not appear to accept the contents of a file as a string; it needs a file handle. Instead, I have been iterating over the results, writing a temporary XML file for each, and reading that back in with SearchIO. This seems very inefficient, as it involves about 30,000 write and read operations.
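To make that workaround concrete, here is a minimal sketch of what I have been doing (hypothetical file names; it assumes each concatenated result begins with its own XML declaration, as on line 603 above):

from Bio import SearchIO

# Read the concatenated output from GNU Parallel in one go.
with open("results.xml") as handle:
    text = handle.read()

# Each individual BLAST result starts with its own XML declaration,
# so splitting on it recovers one well-formed document per query.
declaration = '<?xml version="1.0"?>'
chunks = [declaration + c for c in text.split(declaration) if c.strip()]

for chunk in chunks:
    # Round-trip through a temporary file purely because SearchIO
    # wants a handle or filename -- this is the inefficient part.
    with open("temp.xml", "w") as tmp:
        tmp.write(chunk)
    qresult = SearchIO.read("temp.xml", "blast-xml")
    # ... filter the HSPs and write them to a per-query FASTA here ...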
Here are my questions:
- Is this the most efficient use of the tools I am using? I could alternatively have my GNU Parallel script output tabular BLAST, but since the desired end result is the target sequences, I would have to retrieve those anyway. Would that approach be more efficient?
- Is there a lower-level function within SearchIO that I could use to parse an XML result from a string, rather than from a file handle?
- In the absence of the above improvements, what would be the best way to process the file without having to create 30,000 temporary files and read them back in?
What does your parallel cmd look like?

I have the Basidiomycota sequences from NR in a custom database, and I am BLASTing an entire proteome against them. I have access to four servers, each of which has 12 CPUs. Some testing I did suggested that limiting the number of jobs GNU Parallel would run at once, while also giving each BLAST search a few threads of its own, would be more efficient than having each BLAST search run on a separate thread. This command completed in about 80 minutes (30,000 query sequences, 90,000 targets):
It sounds like you have 30,000 proteins and you are running one BLAST job per protein. That is a lot of overhead. I would batch things, say 1,000 queries in each FASTA file, giving just 30 chunks. This is what the BLAST+ wrappers for Galaxy do, with extra code to merge the XML output properly, instead of the simple concatenation which (as you have seen) produces an invalid XML file.
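For illustration, here is a minimal sketch of that merge idea (not the actual Galaxy wrapper code): keep the header and footer from the first file, and splice in the <Iteration>...</Iteration> blocks from every file. It assumes each per-chunk XML file is itself well formed:

def merge_blast_xml(xml_filenames, output_filename):
    # Sketch only: relies on the <Iteration> elements appearing
    # verbatim in BLAST's XML (-outfmt 5) output.
    footer = None
    with open(output_filename, "w") as out:
        for i, name in enumerate(xml_filenames):
            with open(name) as handle:
                text = handle.read()
            start = text.find("<Iteration>")
            end = text.rfind("</Iteration>") + len("</Iteration>")
            if i == 0:
                out.write(text[:start])  # header, up to the first iteration
                footer = text[end:]      # closing tags, saved for the end
            out.write(text[start:end])   # this file's iteration blocks
        if footer is not None:
            out.write(footer)

Splitting the query FASTA into 1,000-record chunks beforehand is straightforward with Bio.SeqIO, so the whole round trip can stay in Biopython.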