How To Concatenate/Merge Blast Xml Outputs?
3
4
Entering edit mode
12.2 years ago

I process my blast (BLASTN 2.2.26+) searches through a script that divides the fasta inputs into N pieces, distributes one blast instance for each piece (in N processes), and, once over, concatenates the outputs into one file. It works pretty well with the table output format, but what about the xml? Indeed, I want to test a software that takes as input only xml format (a dummy cat does not work), and I'm struggling with that case.

Have you an idea on how to make one consistant xml output from several (it can be hundreds) "sub-"outputs? (I need a biopython NCBIXML parsability) ;-) ).

blast xml • 10k views
ADD COMMENT
3
Entering edit mode
12.2 years ago
Peter 6.0k

I did the same task in the Galaxy wrappers for BLAST+ (although task splitting like this isn't on by default). If you understand Python, you can see the code here in the merge method of the BlastXml class

ADD COMMENT
0
Entering edit mode

That worked perfectly. Thanks!

ADD REPLY
0
Entering edit mode

Hi Peter, it appears as though the merge method would work well for my situation, but I am unsure how to 'call' the merge function. I have copied the def merge: to a new python file -- can I simply paste this definition before the start of my actual script? And do I call the merge function by writing merge(directorycontainingxmlfilestobemerged, output_filename) ?

ADD REPLY
0
Entering edit mode

Hey, it's good to know I'm not the only one interested as I had no vote for the question ;-) You can find the modified code to make it working as is here: http://code.google.com/p/bioman/source/browse/BlastXMLmerge.py Enjoy! (And thanks again to Galaxy folks)

ADD REPLY
0
Entering edit mode

I guess in some sense I'm one of the Galaxy folk, although not one of their core developers.

ADD REPLY
0
Entering edit mode

Dear @Manu Prestat,

Thanks for the python script.

If I'm not mistaken in order to execute the script, I should do

python BlastXMLmerge.py merged.xml output1.xml output2.xml output3.xml

Please correct me if I was wrong. Thanks.

ADD REPLY
0
Entering edit mode

You're right KJ ;-)

ADD REPLY
0
Entering edit mode

Dear @Manu Prestat,

I tried to run the script with my xml files, but python raised this error:

-bash: /usr/local/bin/python: Argument list too long

I trying to run this script with 7,749 items.

I tried in my mac (OSX 10.8.4) and in redhat linux, both failed.

Any hints?

Thank you very much

ADD REPLY
0
Entering edit mode

This is a comman line length problem. First of all run this in the same directory as the input XML files to you can use local paths (avoid directory names naming the command longer).

If that is not enough, either do it in batches, or modify the script to accept a folder name as input (although then you will have less control over the file order...).

ADD REPLY
1
Entering edit mode
12.2 years ago

if the XML is well formatted, statistics and the Hit-index are meaningless :

#get one header
while read L; do [[ "${L}" == "<Hit>" ]] && break ; echo $L; done < blast1.xml > output.xml

#loop over each file. Take the fragments  between "<Hit>" and "</Hit>".
for B in blast1.xml blast2.xml blast3.xml ; do while read L; do if [[  "${L}" == *"<Hit>"* ]] ; then echo $L  >> output.xml ; while read L; do echo $L  >> output.xml ;  if [[  "${L}" == *"</Hit>"* ]] ; then break; fi; done; fi ; done < $B ; done 

#get one footer
 grep -A 10000 "</Iteration_hits>" blast1.xml  >> output.xml
ADD COMMENT
0
Entering edit mode

I don't know if it worked properly. I think there's a problem in the loop. I tried but after 3 hours of run, I stopped it.

ADD REPLY
0
Entering edit mode
11.3 years ago
xapple ▴ 230

Here is the python code from galaxy cleaned up, tested, and bundled in a bash callable:

#!/usr/bin/env python

"""
A custom made script to merge the XML blast outputs
when queries are run in parallel by input-choping.

Copied and adapted from https://bitbucket.org/peterjc/galaxy-central/src/5cefd5d5536e/tools/ncbi_blast_plus/blast.py

Tested working with BLAST 2.2.28+

The fields:
  <Iteration_iter-num></Iteration_iter-num>
  <Iteration_query-ID></Iteration_query-ID>
should not be used from the merged file.

You can use this script from the shell like this:
$ blast_merge_xml reads*.xml > merged.xml
"""

# Modules #
import sys, os, sh
from contextlib import nested

###############################################################################
def merge_blast_xml(inputs, output):
    """Will merge the results from several BLAST searches
    when XML output is used."""
    for i,handle in enumerate(inputs):
        path = handle.name
        header = handle.readline()
        if not header:
            raise Exception("BLAST XML file '%s' was empty" % path)
        if header.strip() != '':
            raise Exception("BLAST file '%s' is not an XML file" % path)
        line = handle.readline()
        header += line
        if line.strip()[0:59] != '" in line: break
            if len(header) > 10000:
                raise Exception("BLAST file '%s' has a too long a header" % path)
        if "<BlastOutput>" not in header:
            raise Exception("BLAST XML file '%s' header's seems bad" % path)
        if i == 0:
            output.write(header)
            old_header = header
        elif old_header[:300] != header[:300]:
            raise Exception("BLAST XML headers don't match" % path)
        else: output.write("    <Iteration>\n")
        for line in handle:
            if "</BlastOutput_iterations>" in line: break
            output.write(line)
    output.write("</BlastOutput_iterations>\n")
    output.write("</BlastOutput>\n\n")
    output.flush()

###############################################################################
# Get arguments #
xml_paths = sys.argv[1:]
# Check for non-evaluated wild char #
if len(xml_paths)==1 and '*' in xml_paths[0]: xml_paths = sh.glob(xml_paths[0])
# Check for existence #
for path in xml_paths:
    if not os.path.exists(path): raise Exception("No file at '%s'." % path)
# Check for user intelligence #
if len(xml_paths) == 1: raise Exception("No need to join just one file.")
# Do it #
with nested(*map(open,xml_paths)) as inputs: merge_blast_xml(inputs, sys.stdout)
ADD COMMENT

Login before adding your answer.

Traffic: 1568 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6