Tools Parsing NCBI BLAST -m 7 XML Output Format?
13.7 years ago
Lhl ▴ 760

Hi all,

Is there a script or tool that can parse NCBI BLAST XML output (produced with the -m 7 option)?

I want a tab-delimited file containing the following information:

 1. Name of the query sequence             Seq1
 2. Length of the query sequence           30
 3. Name of target sequence                gnl|BL_ORD_ID|0
 4. Length of target sequence              5528445
 5. Alignment bit score                    59.96
 6. E-value                                8.38112e-11
 7. Start of alignment within query        1
 8. End of alignment within query          30
 9. Start of alignment within target       5436010
10. End of alignment within target         5436039
11. Query frame                            1
12. Target frame                           1
13. Number of identical bases within       29
    the alignment
14. Alignment length                       30
15. Aligned portion (sequence) of query    CGGACAGCGCCGCCACCAACAAAGCCACCA
16. Aligned portion (sequence) of target   CGGACAGCGCCGCCACCAACAAAGCCATCA
17. Midline indicating positions of        ||||||||||||||||||||||||||| ||
    matches within the alignment

Thanks.

Elzed

blast xml parsing • 20k views
13.7 years ago
Neilfws 49k

All of the major Bio* projects contain libraries to parse BLAST XML output:

  • BioPerl - use the SearchIO module with option -format=>'blastxml'
  • Biopython - their tutorial recommends using XML output for parsing
  • BioRuby - Bio::Blast.reports will read an XML file

Once you figure out how to extract the required fields, writing to CSV is quite easy in any of these languages.

Also, don't forget that running blastall with the -m 8 or -m 9 options will generate tab-delimited output (but if I recall correctly, not including the aligned sequences, which you need).
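If you would rather avoid installing a Bio* library, the extraction can be sketched with Python's standard library alone. This is a minimal sketch, not a tested production parser: it assumes the usual NCBI BLAST XML element names (Iteration, Hit, Hsp and their children), and the function names blast_xml_to_rows and write_tsv are made up for illustration. It reads the file one Iteration at a time, which should help with very large outputs, and falls back to the top-level query name when an Iteration does not carry its own.

```python
import xml.etree.ElementTree as ET

# Per-HSP fields, in the column order requested in the question.
HSP_FIELDS = [
    "Hsp_bit-score", "Hsp_evalue",
    "Hsp_query-from", "Hsp_query-to",
    "Hsp_hit-from", "Hsp_hit-to",
    "Hsp_query-frame", "Hsp_hit-frame",
    "Hsp_identity", "Hsp_align-len",
    "Hsp_qseq", "Hsp_hseq", "Hsp_midline",
]


def blast_xml_to_rows(source):
    """Yield one list of 17 strings per HSP in a BLAST XML file.

    `source` is a file name or a binary file object.
    """
    top_def = top_len = ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "BlastOutput_query-def":
            top_def = elem.text or ""
        elif elem.tag == "BlastOutput_query-len":
            top_len = elem.text or ""
        elif elem.tag == "Iteration":
            # Prefer the per-query values; fall back to the top-level ones.
            query_def = elem.findtext("Iteration_query-def") or top_def
            query_len = elem.findtext("Iteration_query-len") or top_len
            for hit in elem.iter("Hit"):
                # Hit_def is the description line; use Hit_id instead if
                # you want identifiers such as gnl|BL_ORD_ID|0.
                hit_def = hit.findtext("Hit_def", "")
                hit_len = hit.findtext("Hit_len", "")
                for hsp in hit.iter("Hsp"):
                    values = [hsp.findtext(tag, "") for tag in HSP_FIELDS]
                    yield [query_def, query_len, hit_def, hit_len] + values
            elem.clear()  # drop the children of this finished Iteration


def write_tsv(source, out):
    """Write every HSP row to the text stream `out`, tab-delimited."""
    for row in blast_xml_to_rows(source):
        out.write("\t".join(row) + "\n")
```

Calling `write_tsv("results.xml", sys.stdout)` after `import sys` would print the table; the file name is only a placeholder.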


Thanks Neilfws. I already have the XML files, which are required by other software for annotation, and they contain millions of sequences, so I do not want to wait for weeks redoing BLAST with -m 8/9.

13.7 years ago

You can use XSLT to transform your XML to a tabular format:


<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns="http://www.w3.org/1999/xhtml"
 >


<xsl:output method="text"/>

<xsl:template match="/">
<xsl:apply-templates select="BlastOutput"/>
</xsl:template>



<xsl:template match="BlastOutput">
<xsl:variable name="queryDef" select="BlastOutput_query-def"/>
<xsl:variable name="queryLen" select="BlastOutput_query-len"/>
<xsl:for-each select="BlastOutput_iterations/Iteration/Iteration_hits/Hit">
<xsl:variable name="hitDef" select="Hit_def"/>
<xsl:variable name="hitLen" select="Hit_len"/>
<xsl:for-each select="Hit_hsps/Hsp">
<xsl:value-of select="$queryDef"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="$queryLen"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="$hitDef"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="$hitLen"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Hsp_bit-score"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Hsp_evalue"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Hsp_query-from"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Hsp_query-to"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Hsp_hit-from"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Hsp_hit-to"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Hsp_query-frame"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Hsp_hit-frame"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Hsp_identity"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Hsp_positive"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Hsp_gaps"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Hsp_align-len"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Hsp_qseq"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Hsp_hseq"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Hsp_midline"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:for-each>
</xsl:template>



</xsl:stylesheet>

Example:

xsltproc --novalid blast2csv.xsl jeter.blast.xml

Result:

No definition line    99            159.983    9.34813e-45    1    99    1    105    1    1    99    99    6    105    ATGCCCGCCCTGCGCCCCGCTCTGCT---GTGGGCGCTGCTGGCGCTCTGGCTGTGCTG---CGCGGCCCCCGCGCATGCATTGCAGTGTCGAGATGGCTATGAA    ATGCCCGCCCTGCGCCCCGCTCTGCTAAAGTGGGCGCTGCTGGCGCTCTGGCTGTGCTGAAACGCGGCCCCCGCGCATGCATTGCAGTGTCGAGATGGCTATGAA    ||||||||||||||||||||||||||   ||||||||||||||||||||||||||||||   |||||||||||||||||||||||||||||||||||||||||||
No definition line    99            66.2076    1.5844e-16    1    36    106    141    1    1    36    36    0    36    ATGCCCGCCCTGCGCCCCGCTCTGCTGTGGGCGCTG    ATGCCCGCCCTGCGCCCCGCTCTGCTGTGGGCGCTG    ||||||||||||||||||||||||||||||||||||

Can this stylesheet be modified to handle multiple queries? That is, can this stylesheet convert batch BLAST XML output into tabular format? I have tried and failed. -Ian McDowell


Thanks Pierre. This seems the easiest way among those suggested here. However, I ran into this problem when I tried to run xsltproc:

./xslt: line 1: syntax error near unexpected token `newline'
./xslt: line 1: `<?xml version="1.0" encoding="UTF-8"?>'

I hope you can help me out of this. Thanks a lot.


Check that there are no characters before <?xml version.... in either XML file (BLAST output and XSLT). You can also download the stylesheet from here.
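One way to run that check without opening the file in an editor is a few lines of Python. This is only a sketch: the helper names leading_junk and check_file are made up for illustration, and it assumes the file is supposed to start with an XML declaration, optionally preceded by a UTF-8 byte-order mark.

```python
XML_DECLARATION = b"<?xml"
UTF8_BOM = b"\xef\xbb\xbf"


def leading_junk(data):
    """Return the bytes that precede the XML declaration in `data`.

    An empty result means the declaration is at the very start (a UTF-8
    byte-order mark is tolerated). Returns None if no declaration is found.
    """
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    position = data.find(XML_DECLARATION)
    if position < 0:
        return None
    return data[:position]


def check_file(path, probe_size=4096):
    """Inspect only the first `probe_size` bytes of the file at `path`."""
    with open(path, "rb") as handle:
        return leading_junk(handle.read(probe_size))
```

If `check_file("blast2csv.xsl")` returns anything other than an empty bytes object, the stray bytes it returns are what needs deleting.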


What is batch XML output? Several concatenated XML files? If so, no, it won't work.
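If the file really is several XML documents glued together, one workaround is to split it back into separate documents before parsing or transforming each one. Here is a rough Python sketch, assuming every document starts with its own `<?xml ...?>` declaration and the whole file fits in memory; the function names split_concatenated_xml and count_hits are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: each concatenated document begins with an XML declaration.
_DECLARATION = re.compile(r"<\?xml\s")


def split_concatenated_xml(text):
    """Split `text` into one string per XML document."""
    starts = [match.start() for match in _DECLARATION.finditer(text)]
    if not starts:
        # No declaration at all: treat the input as a single document.
        return [text] if text.strip() else []
    ends = starts[1:] + [len(text)]
    return [text[start:end].strip() for start, end in zip(starts, ends)]


def count_hits(document):
    """Parse one document and return how many Hit elements it contains."""
    return sum(1 for _ in ET.fromstring(document).iter("Hit"))
```

Each piece returned by `split_concatenated_xml` can then be written to its own file and passed to xsltproc, or parsed directly as `count_hits` does.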


First of all, thanks a lot for this script. Second, I cannot run it from the Terminal on my Mac. I get the following message: failed to load external entity "whatever.xslx", cannot parse whatever.xslx

How can I fix it? Am I making a mistake?

Thank you very much


xsltproc cannot find your file "whatever.xslx". Check the names and check the path.

13.7 years ago
John ▴ 50

For this purpose you can open the BLAST XML output file in a spreadsheet as an external data source. You may also find NOBLAST (New Options for BLAST) useful. NOBLAST is an open-source program that provides a new user-friendly tabular output format for various NCBI BLAST programs (Blastn, Blastp, Blastx, Tblastn, Tblastx, Mega BLAST and Psi BLAST) without any use of a parser, and provides E-value correction when a segmented BLAST database is used. Please read the complete publication here and download it from here.


Hi! I am very new in this world and I do not have too much experience working on bioinformatics. I have downloaded NOBLAST but I have a question about its installation: Do I have to install BLAST on my computer prior to using it? How can I do that? Thanks a lot.

13.7 years ago
Dejian ★ 1.3k

BioPerl gives some specific advice for dealing with this problem.


Yes, you are right. Many thanks!

3.3 years ago

I have developed a small R package that can do that. It is available on GitHub here.
Using the function NCBI_BLAST_XML2DT() you can load your NCBI BLAST XML result file as an R data.table.
If you have multiple related XML files you can do the same thing using the function aggregate_NCBI_BLAST_XMLs2DT().
Follow the documentation in the README to install it, and create an issue in the repository if you have a problem.
Good luck!
