Extracting Fasta Alignments From Parsed Blastxml File
13.1 years ago
Eric • 0

Hello,

I have cobbled together a small script that parses a BLAST XML file. The parsing itself seems fine, judging from what it prints to the screen. The problem is that the hsp.fas alignment file is incomplete: it contains only one of the alignments in the BLAST output.

I would like hsp.fas to contain all the alignments that I see in the BLAST output, with the query sequence included in each individual alignment (for example, if I run blastall with -m 2 I get a complete file).

Any suggestions? -Thanks!

module load perl

use Bio::SearchIO;
use Bio::AlignIO;

# Use -m 7 to generate the XML file from blastall.
# Give the name of the BLAST XML file to parse on the '-file =>' line.
my $in = Bio::SearchIO->new(-format => 'blastxml',
                            -file   => 'BLASToutxml');
while ( my $result = $in->next_result ) {
  ## $result is a Bio::Search::Result::ResultI compliant object
  while ( my $hit = $result->next_hit ) {
    ## $hit is a Bio::Search::Hit::HitI compliant object
    while ( my $hsp = $hit->next_hsp ) {
      ## $hsp is a Bio::Search::HSP::HSPI compliant object
      # ENTER desired sequence length
      if ( $hsp->length('total') > 50 ) {
        # ENTER desired percent identity
        if ( $hsp->percent_identity >= 75 ) {
          print "Query=",   $result->query_name,
            " Hit=",        $hit->name,
            " Length=",     $hsp->length('total'),
            " Percent_id=", $hsp->percent_identity, "\n";

          # Print alignment to file.
          # $aln will be a Bio::SimpleAlign object.
          my $aln = $hsp->get_aln;

          # Changed msf to fasta and hsp.msf to hsp.fas; output is now a FASTA file.
          my $alnIO = Bio::AlignIO->new(-format => "fasta", -file => ">hsp.fas");
          $alnIO->write_aln($aln);
        }
      }
    }
  }
}
blast fasta multiple • 4.7k views
13.1 years ago

I would use an XSLT stylesheet. For example, with the following BLAST XML result:

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastn</BlastOutput_program>
  <BlastOutput_version>BLASTN 2.2.26+</BlastOutput_version>
  <BlastOutput_reference>Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), &quot;A greedy algorithm for aligning DNA sequences&quot;, J Comput Biol 2000; 7(1-2):203-14.</BlastOutput_reference>
  <BlastOutput_db>nr</BlastOutput_db>
  <BlastOutput_query-ID>26343</BlastOutput_query-ID>
  <BlastOutput_query-def>No definition line</BlastOutput_query-def>
  <BlastOutput_query-len>671</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_expect>10</Parameters_expect>
      <Parameters_sc-match>1</Parameters_sc-match>
      <Parameters_sc-mismatch>-2</Parameters_sc-mismatch>
      <Parameters_gap-open>0</Parameters_gap-open>
      <Parameters_gap-extend>0</Parameters_gap-extend>
      <Parameters_filter>L;m;</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>26343</Iteration_query-ID>
  <Iteration_query-def>No definition line</Iteration_query-def>
  <Iteration_query-len>671</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|118082669|ref|XM_416233.2|</Hit_id>
  <Hit_def>PREDICTED: Gallus gallus similar to ubiquitous tetratricopeptide containing protein RoXaN; Rotavirus X associated non-structural protein (LOC417996), mRNA</Hit_def>
  <Hit_accession>XM_416233</Hit_accession>
  <Hit_len>2868</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>556.962</Hsp_bit-score>
      <Hsp_score>301</Hsp_score>
      <Hsp_evalue>3.58957e-158</Hsp_evalue>
      <Hsp_query-from>92</Hsp_query-from>
      <Hsp_query-to>395</Hsp_query-to>
      <Hsp_hit-from>2378</Hsp_hit-from>
      <Hsp_hit-to>2681</Hsp_hit-to>
      <Hsp_query-frame>1</Hsp_query-frame>
      <Hsp_hit-frame>1</Hsp_hit-frame>
      <Hsp_identity>303</Hsp_identity>
      <Hsp_positive>303</Hsp_positive>
      <Hsp_gaps>0</Hsp_gaps>
      <Hsp_align-len>304</Hsp_align-len>
      <Hsp_qseq>TACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCTGGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGGAGCTATCGCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGGTACCA</Hsp_qseq>
      <Hsp_hseq>TACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCTGGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGGAGCTATCGCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGGTTCCA</Hsp_hseq>
      <Hsp_midline>|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||</Hsp_midline>
    </Hsp>
  </Hit_hsps>
</Hit>
<Hit>
  <Hit_num>2</Hit_num>
  <Hit_id>gi|27881483|ref|NM_017590.4|</Hit_id>
  <Hit_def>Homo sapiens zinc finger CCCH-type containing 7B (ZC3H7B), mRNA</Hit_def>
  <Hit_accession>NM_017590</Hit_accession>
  <Hit_len>5868</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>366.757</Hsp_bit-score>
      <Hsp_score>198</Hsp_score>
      <Hsp_evalue>6.49273e-101</Hsp_evalue>
      <Hsp_query-from>100</Hsp_query-from>
      <Hsp_query-to>390</Hsp_query-to>
      <Hsp_hit-from>2608</Hsp_hit-from>
      <Hsp_hit-to>2898</Hsp_hit-to>
      <Hsp_query-frame>1</Hsp_query-frame>
      <Hsp_hit-frame>1</Hsp_hit-frame>
      <Hsp_identity>264</Hsp_identity>
      <Hsp_positive>264</Hsp_positive>
      <Hsp_gaps>8</Hsp_gaps>
      <Hsp_align-len>295</Hsp_align-len>
      <Hsp_qseq>ATGCAGCAGACCTATGACATGTGGCT-AAAGAAACACAATCCTGGGAAGCCTGGAG-AGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGGAGC-TATC-GCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGG</Hsp_qseq>
      <Hsp_hseq>ATGCAGCAGACCTATGACATGTGGCTGAAA-AAACACAACCCAGGAAAGCCTGGAGAAGGGACCCCCA-TCAGTTCTCGGGAAGGGGAGAAGCAGATCCAGATGCCCACGGACTACGCGGACATCATGATGGGCTACCACTGCTGGCTCTGCGGCAAGAACAGCAACAGCAAGAAGCAGTGGCAGCAGCACATCCAGTCCGAGAAGCACAAGGAGAAGGTCTTCACGTCCGACAGTGACGCCAGCGGCTGG-GCCT-TCCGCTTCCCCATGGGCGAGTTCCGGCTCTGCGACAGG</Hsp_hseq>
      <Hsp_midline>|||||||||||||||||||||||||| ||| |||||||| || || |||||||||| ||||| | ||| ||| ||| || ||||||||||| ||||||||||||||||| ||||| || ||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||||||||| || ||||||||| ||||| ||||| || | || ||||||| ||||||||||||| |||||| || |||</Hsp_midline>
    </Hsp>
  </Hit_hsps>
</Hit>
<Hit>
  <Hit_num>3</Hit_num>
  <Hit_id>gi|194733718|ref|NM_001130695.1|</Hit_id>
  <Hit_def>Rattus norvegicus zinc finger CCCH-type containing 7B (Zc3h7b), mRNA</Hit_def>
  <Hit_accession>NM_001130695</Hit_accession>
  <Hit_len>5466</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>355.677</Hsp_bit-score>
      <Hsp_score>192</Hsp_score>
      <Hsp_evalue>1.40543e-97</Hsp_evalue>
      <Hsp_query-from>97</Hsp_query-from>
      <Hsp_query-to>390</Hsp_query-to>
      <Hsp_hit-from>2433</Hsp_hit-from>
      <Hsp_hit-to>2726</Hsp_hit-to>
      <Hsp_query-frame>1</Hsp_query-frame>
      <Hsp_hit-frame>1</Hsp_hit-frame>
      <Hsp_identity>266</Hsp_identity>
      <Hsp_positive>266</Hsp_positive>
      <Hsp_gaps>12</Hsp_gaps>
      <Hsp_align-len>300</Hsp_align-len>
      <Hsp_qseq>GATATGCAGCAGACCTATGACATGTGGCT-AAAGAAACACAATCCTGGGAAGCCTGGAG-AGGGAACACCACTCA-CTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTG-CTGGAGC-TATCGCTTCCCTATGGGCGAGTTCC-AGCTCTGTGAAAGG</Hsp_qseq>
      <Hsp_hseq>GATATGCAACAGACCTATGACATGTGGCTGAAA-AAACACAACCCAGGGAAGCCAGGAGAAGGGACCCCCA-TCAGC-TCCCGGGAAGGAGAGAAGCAGATCCAGATGCCCACGGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGCAAGAACAGCAACAGCAAGAAGCAGTGGCAGCAGCACATCCAGTCTGAGAAGCACAAGGAGAAGGTCTTCACTTCCGACAGCGACGCCAG-TGGCTGG-GCCTACCGATTCCCCATGGGCGAGTTCCGA-CTCTGTGACAGG</Hsp_hseq>
      <Hsp_midline>|||||||| |||||||||||||||||||| ||| |||||||| || |||||||| |||| ||||| | ||| ||| | || || ||||| ||||| ||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||||||||| || ||||| ||| |||| || |||| || || || ||||| ||||||||||||| | |||||||| |||</Hsp_midline>
    </Hsp>
  </Hit_hsps>
</Hit>
</Iteration_hits>
  <Iteration_stat>
    <Statistics>
      <Statistics_db-num>18780</Statistics_db-num>
      <Statistics_db-len>25940078</Statistics_db-len>
      <Statistics_hsp-len>0</Statistics_hsp-len>
      <Statistics_eff-space>0</Statistics_eff-space>
      <Statistics_kappa>0.46</Statistics_kappa>
      <Statistics_lambda>1.28</Statistics_lambda>
      <Statistics_entropy>0.85</Statistics_entropy>
    </Statistics>
  </Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>

and the following XSLT stylesheet: https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/ncbi/blast2fasta.xsl


processing:

xsltproc --novalid  blast2fasta.xsl blast.xml


result:

>PREDICTED: Gallus gallus similar to ubiquitous tetratricopeptide containing protein RoXaN; Rotavirus X associated non-structural protein (LOC417996), mRNA|len:303|ident:303
TACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCTGGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGGAGCTATCGCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGGTTCCA
>Homo sapiens zinc finger CCCH-type containing 7B (ZC3H7B), mRNA|len:290|ident:264
ATGCAGCAGACCTATGACATGTGGCTGAAAAAACACAACCCAGGAAAGCCTGGAGAAGGGACCCCCATCAGTTCTCGGGAAGGGGAGAAGCAGATCCAGATGCCCACGGACTACGCGGACATCATGATGGGCTACCACTGCTGGCTCTGCGGCAAGAACAGCAACAGCAAGAAGCAGTGGCAGCAGCACATCCAGTCCGAGAAGCACAAGGAGAAGGTCTTCACGTCCGACAGTGACGCCAGCGGCTGGGCCTTCCGCTTCCCCATGGGCGAGTTCCGGCTCTGCGACAGG
>Rattus norvegicus zinc finger CCCH-type containing 7B (Zc3h7b), mRNA|len:293|ident:266
GATATGCAACAGACCTATGACATGTGGCTGAAAAAACACAACCCAGGGAAGCCAGGAGAAGGGACCCCCATCAGCTCCCGGGAAGGAGAGAAGCAGATCCAGATGCCCACGGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGCAAGAACAGCAACAGCAAGAAGCAGTGGCAGCAGCACATCCAGTCTGAGAAGCACAAGGAGAAGGTCTTCACTTCCGACAGCGACGCCAGTGGCTGGGCCTACCGATTCCCCATGGGCGAGTTCCGACTCTGTGACAGG
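
Note that the XSLT output above contains only the hit sequences. If you would rather keep the Perl script from the question, the likely reason hsp.fas ends up with a single alignment is that Bio::AlignIO->new(-file => ">hsp.fas") is called inside the innermost loop: a leading ">" opens the file for writing and truncates it, so each HSP overwrites the previous one. The effect can be reproduced with core Perl alone. This is only a sketch: the records, file names, and the helper subs write_reopening, write_once, and count_records are made up for illustration and are not part of BioPerl.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Made-up FASTA records standing in for the alignments of three HSPs.
my @records = (
    ">query_vs_hit1\nACGT\n",
    ">query_vs_hit2\nAC-T\n",
    ">query_vs_hit3\nA-GT\n",
);

# Pattern used in the question: the file is reopened with '>' for every
# record, which truncates it each time.
sub write_reopening {
    my ($path, @recs) = @_;
    for my $rec (@recs) {
        open my $fh, '>', $path or die "Cannot open $path: $!";
        print {$fh} $rec;
        close $fh or die "Cannot close $path: $!";
    }
    return;
}

# Suggested pattern: open once before the loop, write inside it.
sub write_once {
    my ($path, @recs) = @_;
    open my $fh, '>', $path or die "Cannot open $path: $!";
    print {$fh} $_ for @recs;
    close $fh or die "Cannot close $path: $!";
    return;
}

# Count FASTA header lines (lines starting with '>') in a file.
sub count_records {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $n = grep { /^>/ } <$fh>;
    close $fh;
    return $n;
}

my $dir      = tempdir(CLEANUP => 1);
my $reopened = "$dir/reopened.fas";
my $once     = "$dir/once.fas";

write_reopening($reopened, @records);
write_once($once, @records);

printf "reopened each time: %d record(s)\n", count_records($reopened);
printf "opened once:        %d record(s)\n", count_records($once);
```

Run as is, this should report one record for the file that is reopened each time and three for the file that is opened once. In the BioPerl script, the equivalent change would be to construct $alnIO once, before the while loops, and keep only $alnIO->write_aln($aln) inside them; each alignment written this way should then carry both the query and the hit sequence of its HSP.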