How To Extract Desire Genes Blast Xml Result From A Big Blast Xml File
1
2
Entering edit mode
11.4 years ago
KJ Lim ▴ 140

Dear community,

I have a XML file contained 50,000 genes Blast result with 10 hits for each gene. I want to mine desire genes Blast result from that XML file and the output file is still in XML format. The output file should have a full Blast result in XML form of desire genes.

I tired with the Python script shared by juefish on this posts, however, I only get partial of one gene Blast XML result parsed instead a list of my desire genes. I quote juefish's Python script here to make the discussion easier.

#!/usr/bin/env python

import sys
import os
import sets
import Bio

from sets import Set
from Bio.Blast import NCBIXML

# Usage.
if len(sys.argv) < 2:
        print ""
        print "This program extracts blast results from an xml file given a list of query sequences"
        print "Usage: %s -list file1 -xml file2 > outfile"
        print "-list: list of sequence names"
        print "-xml: blast xml output file"
        print ""
        sys.exit()

# Parse args.
for i in range(len(sys.argv)):
        if sys.argv[i] == "-list":
                infile1 = sys.argv[i+1]
        elif sys.argv[i] == "-xml":
                infile2 = sys.argv[i+1]

fls = [infile1,infile2]
results_handle = open(fls[1], "r")
fin1 = open(fls[0],"r")
geneContigs = Set([])

#establish list of names of queries to extract from xml file
for line in fin1:
    temp=line.lstrip('>').split()
    geneContigs.add(temp[0])
fin1.close()

genes2 = []
marker = False
header = True

#cycle through xml file and printing results that match to query list
for line in results_handle:
        line = line.rstrip('\r\n')
        if(header==True):
                if(line.count("<Iteration>")>0):
                        header=False
                        print "%s" % (line)
                        continue
                print "%s" % (line)
        else:
                    if(line.count("<Iteration>")>0):
                            genes2=[]
                    if(marker==False):
                            genes2.append(line)
                    if(marker==True):
                            if(line.count("</Iteration>")<1):
                                    print "%s" % (line)
                            elif(line.count("</Iteration>")>0):
                                    print "%s" % (line)
                                    marker=False
                    if(line.count("<Iteration_query-def>")>0):
                            split1 = line.split(">")
                            split2 = split1[1].split("<")
                            if(split2[0] in geneContigs):
                                    for i in range(len(genes2)):
                                            print "%s" % (genes2[i])
                                    marker=True
                            genes2=[]

I'm not good in Python nor BioPython, it would be great if the community could share with me your experience or suggestion.

Thanks for your time.

xml blast • 5.6k views
ADD COMMENT
0
Entering edit mode

Hi Pierre; I am trying to compile using the command you mentioned. But it say javac: file not found: Biostar75204.java

Thanks

ADD REPLY
0
Entering edit mode

did you just download the file Biostar75204.java ?

ADD REPLY
0
Entering edit mode

thanks you Pierre. I used the script and was able to compile. Sorry it took a while to reply. I am trying on a large xml output.

ADD REPLY
0
Entering edit mode
11.4 years ago

"How to extract desire genes blast XML result from a big blast XML file":

create a java parser for blast:

 xjc -dtd -p gov.nih.nlm.ncbi.blast "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd"

the following java file scan a blast xml file from stdin using a pull parser. Each time we find a 'Hit' we check if it is in the list and we include it in the resulting xml:

compile:

 javac Biostar75204.java gov/nih/nlm/ncbi/blast/*.java

execute:

 java Biostar75204 GENE1 GENE2 GENE3 GENE4  < blast.xml > result.xml

you can then extract the information using, for example, xslt: e.g. tools parsing NCBI blast -m 7 xml output format?

ADD COMMENT
0
Entering edit mode

Thanks Pierre. Thanks for your prompt replied. I will give it a try.

ADD REPLY
0
Entering edit mode

Hi there,

Is it possible to use this script with a list of genes in a file? Thanks, Bernardo

ADD REPLY
0
Entering edit mode

yes replace the loop hitdefs.add(def); by reading a file: http://www.mkyong.com/java/how-to-read-file-from-java-bufferedreader-example/

ADD REPLY
0
Entering edit mode

You can also read files using java.nio.file.Files.readAllBytes() which is faster than the BufferedReader approach in that link, especially for large files:

Here is a code snippet:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class ReadFile_Files_ReadAllBytes {
  public static void main(String [] pArgs) throws IOException {
    String fileName = "c:\\temp\\sample-10KB.txt";
    File file = new File(fileName);

    byte [] fileBytes = Files.readAllBytes(file.toPath());
    char singleChar;
    for(byte b : fileBytes) {
      singleChar = (char) b;
      System.out.print(singleChar);
    }
  }
}
ADD REPLY

Login before adding your answer.

Traffic: 2274 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6