Trying To Extract Blast Results Into Xml Outfile From Larger Blast Xml File
2
0
Entering edit mode
11.8 years ago
juefish • 0

This is probably a fairly basic question, so I apologize in advance, but I can't seem to figure out how to output xml format using Biopython. Basically, I have a fairly large BLAST results file in xml format and I'm trying to extract a portion of that file using a list of specific queries I am interest in. I can find the queries in the larger file, but I can't seem to output them into xml format. Here is the script I am currently using:

#!/usr/bin/env python

import sys
import os
import sets
import Bio
from sets import Set
from Bio.Blast import NCBIXML

# Usage.
if len(sys.argv) < 2:
        print ""
        print "This program extracts blast results from an xml file given a list of query sequences"
        print "Usage: %s -list file1 -xml file2 -out file3"
        print "-list: list of sequence names"
        print "-xml: fasta file"
        print "-out: outfile name"
        print ""
        sys.exit()

# Parse args.
for i in range(len(sys.argv)):
        if sys.argv[i] == "-list":
                infile1 = sys.argv[i+1]
        elif sys.argv[i] == "-xml":
                infile2 = sys.argv[i+1]
        elif sys.argv[i] == "-out":
                outfile = sys.argv[i+1]

fls = [infile1,infile2,outfile]
results_handle = open(fls[1], "r")
fin1 = open(fls[0],"r")
save_file = open(fls[2], "w")
geneContigs = Set([])
results_list = list()
blast_records = NCBIXML.parse(results_handle)

for line in fin1:
    temp=line.lstrip('>').split()
    geneContigs.add(temp[0])
fin1.close()

for blast_record in blast_records:
        if(blast_record.query in geneContigs):
                save_file.write(blast_record)
save_file.close()

When I do this, I get the following error:

TypeError: argument 1 must be string or read-only character buffer, not Blast

Anyone have any suggestions on how to turn this Blast record into a string in xml output format?

Thanks

blast xml biopython • 7.3k views
ADD COMMENT
0
Entering edit mode

I don't think biopython supports writing blast_record objects to XML. You might have to just write something yourself if you really need it. The XML output description is located here: ftp://ftp.ncbi.nlm.nih.gov/blast/documents/xml/

There is a .mod file that describes the xml. Maybe you can find a python library that'll let you map data to .mod files.

ADD REPLY
0
Entering edit mode

The latest version supports writing to a BLAST XML format, actually. I've written a quick example in my answer :).

ADD REPLY
3
Entering edit mode
11.8 years ago
bow ▴ 790

This is actually quite easy using the Bio.SearchIO module (recently added in Biopython 1.61). Aside from a new Blast parser, the module also includes a Blast XML writer as well.

Here's a sample code, which assumes the ID list in the gene contigs file (gene_contigs.txt) are contained in one line each. The output file is 'filtered.xml'.

from Bio import SearchIO

# create a generator for the raw BLAST results
raw_qresults = (qresult for qresult in SearchIO.parse(infile2, 'blast-xml'))
# parse gene contigs, removing trailing whitespace
gene_contigs = set([gid.strip() for gid in open('gene_contigs.txt')])
# filter the BLAST results
filtered_records = (qresult for qresult in raw_qresults if qresult.id in gene_contigs)
# and write to an XML file
SearchIO.write(filtered_records, 'filtered.xml', 'blast-xml')

The feature is only available in the latest release, so you might need to install it first. If you have it already, you may want to try the tutorial so that you get a better hang of the module.

On a side note, if you are doing some command line parsing, you may want to look into python's argparse module as well.

In any case, I hope this helps :).

ADD COMMENT
1
Entering edit mode

Thinking about this, another option might be to use Bio.SearchIO's indexing with the get_raw command to pull out the raw XML. The advantage of this (like @juefish's solution) is you ensure 100% preservation of the data, but you'd have to handle the XML header and footer by hand. Overall much more complex than the 5 line version using Bio.SearchIO.parse and write.

ADD REPLY
0
Entering edit mode

Hi bow,

For this question, if I don't have the ID list of the gene contigs file. What I want is only to save the blast out XML with only top 1 hit (<Hit_num>1</Hit_num>) from big XML file, what should I do?

Thank you!

ADD REPLY
0
Entering edit mode
11.8 years ago
juefish • 0

Thanks for the thoughts, Damian. I decided to just avoid biopython as it got complicated and did this instead. Seems to work okay.

#!/usr/bin/env python

import sys
import os
import sets
import Bio

from sets import Set
from Bio.Blast import NCBIXML

# Usage.
if len(sys.argv) < 2:
        print ""
        print "This program extracts blast results from an xml file given a list of query sequences"
        print "Usage: %s -list file1 -xml file2 > outfile"
        print "-list: list of sequence names"
        print "-xml: blast xml output file"
        print ""
        sys.exit()

# Parse args.
for i in range(len(sys.argv)):
        if sys.argv[i] == "-list":
                infile1 = sys.argv[i+1]
        elif sys.argv[i] == "-xml":
                infile2 = sys.argv[i+1]

fls = [infile1,infile2]
results_handle = open(fls[1], "r")
fin1 = open(fls[0],"r")
geneContigs = Set([])

#establish list of names of queries to extract from xml file
for line in fin1:
    temp=line.lstrip('>').split()
    geneContigs.add(temp[0])
fin1.close()

genes2 = []
marker = False
header = True

#cycle through xml file and printing results that match to query list
for line in results_handle:
        line = line.rstrip('\r\n')
        if(header==True):
                if(line.count("<Iteration>")>0):
                        header=False
                        print "%s" % (line)
                        continue
                print "%s" % (line)
        else:
                    if(line.count("<Iteration>")>0):
                            genes2=[]
                    if(marker==False):
                            genes2.append(line)
                    if(marker==True):
                            if(line.count("</Iteration>")<1):
                                    print "%s" % (line)
                            elif(line.count("</Iteration>")>0):
                                    print "%s" % (line)
                                    marker=False
                    if(line.count("<Iteration_query-def>")>0):
                            split1 = line.split(">")
                            split2 = split1[1].split("<")
                            if(split2[0] in geneContigs):
                                    for i in range(len(genes2)):
                                            print "%s" % (genes2[i])
                                    marker=True
                            genes2=[]
ADD COMMENT

Login before adding your answer.

Traffic: 1757 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6