Bash Script To Filter Blast Xml Output
1
0
Entering edit mode
12.9 years ago
Vang ▴ 10

I have read the instruction here:
XSLT stylesheet to filter blast xml output
And here:
How to filter a blast xml output ?
and I came up with this script:

#!/bin/sh
# 
# filterblastxml.sh : Script to filter blast XML output by evalue
# Usage: filterblastxml.sh <blastxmlfile> <gt|lt> <evalue> <output_file_name>
# For example: "filterblastxml.sh test.blastn.xml lt 1e-5 filtered_test.blastn.xml"
# Above command filter the test.blastn.xml file to exclude all hits which has evalue
# less than 1e-5 and output the content to filtered_test.blastn.xml file.  
tmpdir=/tmp
xslt_file=$tmpdir/$(date +%s)_blastfilter.xslt.xsl
cat << XSLTOUT > ${xslt_file}
    
<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" href="http://www.w3.org/1999/XSL/Transform" rel="nofollow">http://www.w3.org/1999/XSL/Transform" "="" rel="nofollow">http://www.w3.org/1999/XSL/Transform'
        version='1.0'
        >
<xsl:output method="xml" encoding="UTF-8" indent="yes" doctype-public="-//NCBI//NCBI BlastOutput/EN" doctype-system="NCBI_BlastOutput.dtd"/>
<xsl:template match="*|text()">
 <xsl:copy>
    <xsl:apply-templates select="*|text()"/>
 </xsl:copy>
</xsl:template>
<xsl:template match="Iteration_hits/Hit/Hit_hsps">
<xsl:apply-templates select="Hsp[number(Hsp_evalue)&amp;${2};${3}]"/>
</xsl:template>

</xsl:stylesheet>
XSLTOUT

xsltproc --novalid ${xslt_file} $1 >$4

It seems to work fine, but when all hits have been removed from a query, the output becomes incompatible with blast xml standard. The xml tags "Hit_hsps" and "Hsp" got deleted. How do I avoid this?

Sample blast xml data:

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastn</BlastOutput_program>
  <BlastOutput_version>BLASTN 2.2.25+</BlastOutput_version>
  <BlastOutput_reference>Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), &quot;A greedy algorithm for aligning DNA sequences&quot;, J Comput Biol 2000; 7(1-2):203-14.</BlastOutput_reference>
  <BlastOutput_db>nt</BlastOutput_db>
  <BlastOutput_query-ID>Query_1</BlastOutput_query-ID>
  <BlastOutput_query-def>G3QAAIR07ISVPP length=527 xy=3492_1275 region=7 run=R_2011_06_03_10_33_44_</BlastOutput_query-def>
  <BlastOutput_query-len>528</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_expect>10</Parameters_expect>
      <Parameters_sc-match>1</Parameters_sc-match>
      <Parameters_sc-mismatch>-2</Parameters_sc-mismatch>
      <Parameters_gap-open>0</Parameters_gap-open>
      <Parameters_gap-extend>0</Parameters_gap-extend>
      <Parameters_filter>L;m;</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>Query_1</Iteration_query-ID>
      <Iteration_query-def>G3QAAIR07ISVPP length=527 xy=3492_1275 region=7 run=R_2011_06_03_10_33_44_</Iteration_query-def>
      <Iteration_query-len>528</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|110430570|gb|DQ675076.1|</Hit_id>
          <Hit_def>Uncultured bacterium clone QHO-B31 16S ribosomal RNA gene, partial sequence</Hit_def>
          <Hit_accession>DQ675076</Hit_accession>
          <Hit_len>1490</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>881.972178398883</Hsp_bit-score>
              <Hsp_score>477</Hsp_score>
              <Hsp_evalue>3</Hsp_evalue>
              <Hsp_query-from>15</Hsp_query-from>
              <Hsp_query-to>525</Hsp_query-to>
              <Hsp_hit-from>510</Hsp_hit-from>
              <Hsp_hit-to>1</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>-1</Hsp_hit-frame>
              <Hsp_identity>504</Hsp_identity>
              <Hsp_positive>504</Hsp_positive>
              <Hsp_gaps>9</Hsp_gaps>
              <Hsp_align-len>515</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>2</Hit_num>
          <Hit_id>gi|34305683|gb|AY363244.1|</Hit_id>
          <Hit_def>Hydrocarboniphaga effusa strain AP102 16S ribosomal RNA gene, partial sequence</Hit_def>
          <Hit_accession>AY363244</Hit_accession>
          <Hit_len>1520</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>878.278879094208</Hsp_bit-score>
              <Hsp_score>475</Hsp_score>
              <Hsp_evalue>1</Hsp_evalue>
              <Hsp_query-from>15</Hsp_query-from>
              <Hsp_query-to>525</Hsp_query-to>
              <Hsp_hit-from>511</Hsp_hit-from>
              <Hsp_hit-to>2</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>-1</Hsp_hit-frame>
              <Hsp_identity>503</Hsp_identity>
              <Hsp_positive>503</Hsp_positive>
              <Hsp_gaps>9</Hsp_gaps>
              <Hsp_align-len>515</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>3</Hit_num>
          <Hit_id>gi|110430571|gb|DQ675077.1|</Hit_id>
          <Hit_def>Uncultured bacterium clone QHO-B35 16S ribosomal RNA gene, partial sequence</Hit_def>
          <Hit_accession>DQ675077</Hit_accession>
          <Hit_len>1490</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>876.43222944187</Hsp_bit-score>
              <Hsp_score>474</Hsp_score>
              <Hsp_evalue>2</Hsp_evalue>
              <Hsp_query-from>15</Hsp_query-from>
              <Hsp_query-to>525</Hsp_query-to>
              <Hsp_hit-from>510</Hsp_hit-from>
              <Hsp_hit-to>1</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>-1</Hsp_hit-frame>
              <Hsp_identity>503</Hsp_identity>
              <Hsp_positive>503</Hsp_positive>
              <Hsp_gaps>9</Hsp_gaps>
              <Hsp_align-len>515</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>4</Hit_num>
          <Hit_id>gi|110430574|gb|DQ675080.1|</Hit_id>
          <Hit_def>Uncultured bacterium clone QHO-B36 16S ribosomal RNA gene, partial sequence</Hit_def>
          <Hit_accession>DQ675080</Hit_accession>
          <Hit_len>1497</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>872.738930137194</Hsp_bit-score>
              <Hsp_score>472</Hsp_score>
              <Hsp_evalue>0</Hsp_evalue>
              <Hsp_query-from>15</Hsp_query-from>
              <Hsp_query-to>525</Hsp_query-to>
              <Hsp_hit-from>511</Hsp_hit-from>
              <Hsp_hit-to>1</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>-1</Hsp_hit-frame>
              <Hsp_identity>503</Hsp_identity>
              <Hsp_positive>503</Hsp_positive>
              <Hsp_gaps>10</Hsp_gaps>
              <Hsp_align-len>516</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>5</Hit_num>
          <Hit_id>gi|110430531|gb|DQ675037.1|</Hit_id>
          <Hit_def>Uncultured bacterium clone QHO-B33 16S ribosomal RNA gene, partial sequence</Hit_def>
          <Hit_accession>DQ675037</Hit_accession>
          <Hit_len>1490</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>872.738930137194</Hsp_bit-score>
              <Hsp_score>472</Hsp_score>
              <Hsp_evalue>5</Hsp_evalue>
              <Hsp_query-from>15</Hsp_query-from>
              <Hsp_query-to>525</Hsp_query-to>
              <Hsp_hit-from>511</Hsp_hit-from>
              <Hsp_hit-to>1</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>-1</Hsp_hit-frame>
              <Hsp_identity>503</Hsp_identity>
              <Hsp_positive>503</Hsp_positive>
              <Hsp_gaps>10</Hsp_gaps>
              <Hsp_align-len>516</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>14234969</Statistics_db-num>
          <Statistics_db-len>36678711190</Statistics_db-len>
          <Statistics_hsp-len>32</Statistics_hsp-len>
          <Statistics_eff-space>17966703322272</Statistics_eff-space>
          <Statistics_kappa>0.46</Statistics_kappa>
          <Statistics_lambda>1.28</Statistics_lambda>
          <Statistics_entropy>0.85</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>2</Iteration_iter-num>
      <Iteration_query-ID>Query_2</Iteration_query-ID>
      <Iteration_query-def>G3QAAIR07ITTWS length=486 xy=3503_0538 region=7 run=R_2011_06_03_10_33_44_</Iteration_query-def>
      <Iteration_query-len>527</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|110430570|gb|DQ675076.1|</Hit_id>
          <Hit_def>Uncultured bacterium clone QHO-B31 16S ribosomal RNA gene, partial sequence</Hit_def>
          <Hit_accession>DQ675076</Hit_accession>
          <Hit_len>1490</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>931.831719012006</Hsp_bit-score>
              <Hsp_score>504</Hsp_score>
              <Hsp_evalue>1</Hsp_evalue>
              <Hsp_query-from>15</Hsp_query-from>
              <Hsp_query-to>524</Hsp_query-to>
              <Hsp_hit-from>510</Hsp_hit-from>
              <Hsp_hit-to>1</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>-1</Hsp_hit-frame>
              <Hsp_identity>508</Hsp_identity>
              <Hsp_positive>508</Hsp_positive>
              <Hsp_gaps>0</Hsp_gaps>
              <Hsp_align-len>510</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>2</Hit_num>
          <Hit_id>gi|34305683|gb|AY363244.1|</Hit_id>
          <Hit_def>Hydrocarboniphaga effusa strain AP102 16S ribosomal RNA gene, partial sequence</Hit_def>
          <Hit_accession>AY363244</Hit_accession>
          <Hit_len>1520</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>928.13841970733</Hsp_bit-score>
              <Hsp_score>502</Hsp_score>
              <Hsp_evalue>0</Hsp_evalue>
              <Hsp_query-from>15</Hsp_query-from>
              <Hsp_query-to>524</Hsp_query-to>
              <Hsp_hit-from>511</Hsp_hit-from>
              <Hsp_hit-to>2</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>-1</Hsp_hit-frame>
              <Hsp_identity>507</Hsp_identity>
              <Hsp_positive>507</Hsp_positive>
              <Hsp_gaps>0</Hsp_gaps>
              <Hsp_align-len>510</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>3</Hit_num>
          <Hit_id>gi|110430571|gb|DQ675077.1|</Hit_id>
          <Hit_def>Uncultured bacterium clone QHO-B35 16S ribosomal RNA gene, partial sequence</Hit_def>
          <Hit_accession>DQ675077</Hit_accession>
          <Hit_len>1490</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>926.291770054992</Hsp_bit-score>
              <Hsp_score>501</Hsp_score>
              <Hsp_evalue>0</Hsp_evalue>
              <Hsp_query-from>15</Hsp_query-from>
              <Hsp_query-to>524</Hsp_query-to>
              <Hsp_hit-from>510</Hsp_hit-from>
              <Hsp_hit-to>1</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>-1</Hsp_hit-frame>
              <Hsp_identity>507</Hsp_identity>
              <Hsp_positive>507</Hsp_positive>
              <Hsp_gaps>0</Hsp_gaps>
              <Hsp_align-len>510</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>4</Hit_num>
          <Hit_id>gi|110430574|gb|DQ675080.1|</Hit_id>
          <Hit_def>Uncultured bacterium clone QHO-B36 16S ribosomal RNA gene, partial sequence</Hit_def>
          <Hit_accession>DQ675080</Hit_accession>
          <Hit_len>1497</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>920.751821097979</Hsp_bit-score>
              <Hsp_score>498</Hsp_score>
              <Hsp_evalue>4</Hsp_evalue>
              <Hsp_query-from>15</Hsp_query-from>
              <Hsp_query-to>524</Hsp_query-to>
              <Hsp_hit-from>511</Hsp_hit-from>
              <Hsp_hit-to>1</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>-1</Hsp_hit-frame>
              <Hsp_identity>507</Hsp_identity>
              <Hsp_positive>507</Hsp_positive>
              <Hsp_gaps>1</Hsp_gaps>
              <Hsp_align-len>511</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>14234969</Statistics_db-num>
          <Statistics_db-len>36678711190</Statistics_db-len>
          <Statistics_hsp-len>32</Statistics_hsp-len>
          <Statistics_eff-space>17930480130090</Statistics_eff-space>
          <Statistics_kappa>0.46</Statistics_kappa>
          <Statistics_lambda>1.28</Statistics_lambda>
          <Statistics_entropy>0.85</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>3</Iteration_iter-num>
      <Iteration_query-ID>Query_3</Iteration_query-ID>
      <Iteration_query-def>G3QAAIR07ITRSJ length=526 xy=3502_1889 region=7 run=R_2011_06_03_10_33_44_</Iteration_query-def>
      <Iteration_query-len>527</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|110430570|gb|DQ675076.1|</Hit_id>
          <Hit_def>Uncultured bacterium clone QHO-B31 16S ribosomal RNA gene, partial sequence</Hit_def>
          <Hit_accession>DQ675076</Hit_accession>
          <Hit_len>1490</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>931.831719012006</Hsp_bit-score>
              <Hsp_score>504</Hsp_score>
              <Hsp_evalue>0</Hsp_evalue>
              <Hsp_query-from>15</Hsp_query-from>
              <Hsp_query-to>524</Hsp_query-to>
              <Hsp_hit-from>510</Hsp_hit-from>
              <Hsp_hit-to>1</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>-1</Hsp_hit-frame>
              <Hsp_identity>508</Hsp_identity>
              <Hsp_positive>508</Hsp_positive>
              <Hsp_gaps>0</Hsp_gaps>
              <Hsp_align-len>510</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>2</Hit_num>
          <Hit_id>gi|34305683|gb|AY363244.1|</Hit_id>
          <Hit_def>Hydrocarboniphaga effusa strain AP102 16S ribosomal RNA gene, partial sequence</Hit_def>
          <Hit_accession>AY363244</Hit_accession>
          <Hit_len>1520</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>928.13841970733</Hsp_bit-score>
              <Hsp_score>502</Hsp_score>
              <Hsp_evalue>0</Hsp_evalue>
              <Hsp_query-from>15</Hsp_query-from>
              <Hsp_query-to>524</Hsp_query-to>
              <Hsp_hit-from>511</Hsp_hit-from>
              <Hsp_hit-to>2</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>-1</Hsp_hit-frame>
              <Hsp_identity>507</Hsp_identity>
              <Hsp_positive>507</Hsp_positive>
              <Hsp_gaps>0</Hsp_gaps>
              <Hsp_align-len>510</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>14234969</Statistics_db-num>
          <Statistics_db-len>36678711190</Statistics_db-len>
          <Statistics_hsp-len>32</Statistics_hsp-len>
          <Statistics_eff-space>17930480130090</Statistics_eff-space>
          <Statistics_kappa>0.46</Statistics_kappa>
          <Statistics_lambda>1.28</Statistics_lambda>
          <Statistics_entropy>0.85</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>4</Iteration_iter-num>
      <Iteration_query-ID>Query_4</Iteration_query-ID>
      <Iteration_query-def>G3QAAIR07IS4TC length=527 xy=3495_0782 region=7 run=R_2011_06_03_10_33_44_</Iteration_query-def>
      <Iteration_query-len>526</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|110430570|gb|DQ675076.1|</Hit_id>
          <Hit_def>Uncultured bacterium clone QHO-B31 16S ribosomal RNA gene, partial sequence</Hit_def>
          <Hit_accession>DQ675076</Hit_accession>
          <Hit_len>1490</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>924.445120402654</Hsp_bit-score>
              <Hsp_score>500</Hsp_score>
              <Hsp_evalue>0</Hsp_evalue>
              <Hsp_query-from>15</Hsp_query-from>
              <Hsp_query-to>523</Hsp_query-to>
              <Hsp_hit-from>510</Hsp_hit-from>
              <Hsp_hit-to>1</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>-1</Hsp_hit-frame>
              <Hsp_identity>507</Hsp_identity>
              <Hsp_positive>507</Hsp_positive>
              <Hsp_gaps>1</Hsp_gaps>
              <Hsp_align-len>510</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>2</Hit_num>
          <Hit_id>gi|34305683|gb|AY363244.1|</Hit_id>
          <Hit_def>Hydrocarboniphaga effusa strain AP102 16S ribosomal RNA gene, partial sequence</Hit_def>
          <Hit_accession>AY363244</Hit_accession>
          <Hit_len>1520</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>920.751821097979</Hsp_bit-score>
              <Hsp_score>498</Hsp_score>
              <Hsp_evalue>0</Hsp_evalue>
              <Hsp_query-from>15</Hsp_query-from>
              <Hsp_query-to>523</Hsp_query-to>
              <Hsp_hit-from>511</Hsp_hit-from>
              <Hsp_hit-to>2</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>-1</Hsp_hit-frame>
              <Hsp_identity>506</Hsp_identity>
              <Hsp_positive>506</Hsp_positive>
              <Hsp_gaps>1</Hsp_gaps>
              <Hsp_align-len>510</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>3</Hit_num>
          <Hit_id>gi|110430571|gb|DQ675077.1|</Hit_id>
          <Hit_def>Uncultured bacterium clone QHO-B35 16S ribosomal RNA gene, partial sequence</Hit_def>
          <Hit_accession>DQ675077</Hit_accession>
          <Hit_len>1490</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>918.905171445641</Hsp_bit-score>
              <Hsp_score>497</Hsp_score>
              <Hsp_evalue>0</Hsp_evalue>
              <Hsp_query-from>15</Hsp_query-from>
              <Hsp_query-to>523</Hsp_query-to>
              <Hsp_hit-from>510</Hsp_hit-from>
              <Hsp_hit-to>1</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>-1</Hsp_hit-frame>
              <Hsp_identity>506</Hsp_identity>
              <Hsp_positive>506</Hsp_positive>
              <Hsp_gaps>1</Hsp_gaps>
              <Hsp_align-len>510</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>4</Hit_num>
          <Hit_id>gi|110430574|gb|DQ675080.1|</Hit_id>
          <Hit_def>Uncultured bacterium clone QHO-B36 16S ribosomal RNA gene, partial sequence</Hit_def>
          <Hit_accession>DQ675080</Hit_accession>
          <Hit_len>1497</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>915.211872140965</Hsp_bit-score>
              <Hsp_score>495</Hsp_score>
              <Hsp_evalue>1e-10</Hsp_evalue>
              <Hsp_query-from>15</Hsp_query-from>
              <Hsp_query-to>523</Hsp_query-to>
              <Hsp_hit-from>511</Hsp_hit-from>
              <Hsp_hit-to>1</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>-1</Hsp_hit-frame>
              <Hsp_identity>506</Hsp_identity>
              <Hsp_positive>506</Hsp_positive>
              <Hsp_gaps>2</Hsp_gaps>
              <Hsp_align-len>511</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>14234969</Statistics_db-num>
          <Statistics_db-len>36678711190</Statistics_db-len>
          <Statistics_hsp-len>32</Statistics_hsp-len>
          <Statistics_eff-space>17894256937908</Statistics_eff-space>
          <Statistics_kappa>0.46</Statistics_kappa>
          <Statistics_lambda>1.28</Statistics_lambda>
          <Statistics_entropy>0.85</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
blast filter xml bash • 5.7k views
ADD COMMENT
1
Entering edit mode

Use existing solutions for this. Probably not bash.

ADD REPLY
1
Entering edit mode

You don't need a bash script here. You can pass some arguments to the stylesheet using xsltproc and xsl:param . See the manual page.

ADD REPLY
0
Entering edit mode

Not clear what you want to do. If you want to avoid removing all hits then...don't remove all hits?

ADD REPLY
0
Entering edit mode

@maasha, could you be more specific of which solutions? I've been searching long before I was persuaded to put up this question myself. BioPerl does not support parse BLAST result and output XML file. I couldn't make XML::Simple and XML::Twig work. What I am trying to do is not pure bash, it actually just automation of instruction at http://biostar.stackexchange.com/questions/2869 @neilfws, sorry I did not put the description more visible. I want to write a "Script to filter blast XML output by evalue", which removes hits that are below or above an input evalue and output filtered XML file.

ADD REPLY
0
Entering edit mode

@Pierre, That is what I am trying to do. It is just one step further after xsltproc command: To make it more flexible with runtime parameters instead of diving into the XSL file and edit it every time users want to run similar jobs.

ADD REPLY
4
Entering edit mode
12.9 years ago
Chris Maloney ▴ 360
  • There should be no space before the xml declaration ("[?]?xml version='1.0' encoding="UTF-8"?>") on the first line. You dont really need this, but if you're going to include it, then there should be no spaces before it.

  • This is just cosmetic, but why do you have two extensions on your XSLT file, as in ".xslt.xsl"? How about just ".xslt"?

  • This template is your identity transform:

    <xsl:template match="*|text()">
      <xsl:copy>
        <xsl:apply-templates select="*|text()"/>
      </xsl:copy>
    </xsl:template>
    

    But instead of *|text(), you should use @*|node(). The latter will match all attributes, elements, and text nodes. The former will leave out attributes. In this case it doesn't matter, because the XML file has no attributes, but in general it's a good thing to memorize.

  • Your XSLT changes the structure of the XML file, and that might be where you're having a problem. You can see it in this template:

    <xsl:template match="Iteration_hits/Hit/Hit_hsps">
      <xsl:apply-templates select="Hsp[number(Hsp_evalue)&${2};${3}]"/>
    </xsl:template>
    

    You match on <Hit_hsps> elements, but then proceed to apply-templates to the <Hsp> child element. The apply-templates causes the copy, so you see that the <Hit_hsps> does not get copied.

    To get the output in exactly the same structure as the input, change it to the following two templates. I'm assuming that if the <Hit_hsps> doesn't have any children meeting the criteria, then you want to discard it.

    <!--
      If a Hit_hsps has at least one Hsp child that meets the criterion,
      then copy it (the Hit_hsps).
    -->
    <xsl:template match="Iteration_hits/Hit/Hit_hsps">
      <xsl:if test='Hsp[number(Hsp_evalue) &${2}; ${3}]'>
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:if>
    </xsl:template>
    <!--
      Now only copy those Hsps that meet the criterion.
    -->
    <xsl:template match="Iteration_hits/Hit/Hit_hsps/Hsp">
      <xsl:if test='self::*[number(Hsp_evalue) &${2}; ${3}]'>
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:if>
    </xsl:template>
    
  • Finally, I notice that your sample XML file is not valid according to the DTD. I downloaded the DTD from ftp://ftp.ncbi.nlm.nih.gov/blast/documents/xml/, and ran

    xmllint --valid --noout SampleBlastOutput.xml
    

    And got several errors. I then "fixed" the DTD by adding some question marks, to make <Hsp_qseq> and <Hsp_hseq> optional. Then the input would validate. When I run the script with the changes above, using

    ./filterblastxml.sh SampleBlastOutput.xml lt 1e-5 filtered_test.out.xml
    

    I get valid output as well.

ADD COMMENT
0
Entering edit mode

Thank you very much, Chris Maloney! You are my teacher :-) While other suggestions are not very accessible for my cloudy head, you cleared everything and let me see the light :-)

I have never looked close to XML file as this time. And with a lot of things around, I got lost.

I manually copied and edited the sample BLAST output that I gave here, so that's why you found a incorrectly formed xml document. I have modified the bash script and ran on my original BLAST outputs. It gave expected results, and xml are validated with xmllint command. Also, thanks for that command.

ADD REPLY

Login before adding your answer.

Traffic: 1617 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6