Extracting Fasta Alignments From Parsed Blastxml File
Hello,
I have cobbled together a small script that parses a BLAST XML file. Judging from what it prints to the screen, the parsing itself works fine. The problem is that the hsp.fas alignment file is incomplete: it contains only one of the alignments in the BLAST output.
I would like to get all of the alignments that I see in the BLAST output, with the query sequence included in each individual alignment (for example, if I specify -m 2 I get a complete file from blastall).
Any suggestions? -Thanks!
module load perl
# Give the name of the BLAST XML file to parse on the line that says '-file =>'
use Bio::SearchIO;
use Bio::AlignIO;
# Use -m 7 to generate the XML file from blastall
my $in = Bio::SearchIO->new(-format => 'blastxml',
                            -file   => 'BLASToutxml');
while( my $result = $in->next_result ) {
    ## $result is a Bio::Search::Result::ResultI compliant object
    while( my $hit = $result->next_hit ) {
        ## $hit is a Bio::Search::Hit::HitI compliant object
        while( my $hsp = $hit->next_hsp ) {
            ## $hsp is a Bio::Search::HSP::HSPI compliant object
            # ENTER desired sequence length
            if( $hsp->length('total') > 50 ) {
                # ENTER desired percent identity
                if ( $hsp->percent_identity >= 75 ) {
                    print "Query=", $result->query_name,
                          " Hit=", $hit->name,
                          " Length=", $hsp->length('total'),
                          " Percent_id=", $hsp->percent_identity, "\n";
                    # Print alignment to file
                    # $aln will be a Bio::SimpleAlign object
                    my $aln = $hsp->get_aln;
                    # Changed msf to fasta and hsp.msf to hsp.fas, so the output is now a FASTA file
                    my $alnIO = Bio::AlignIO->new(-format => "fasta", -file => ">hsp.fas");
                    $alnIO->write_aln($aln);
                }
            }
        }
    }
}
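One likely reason only a single alignment survives: Bio::AlignIO->new is called inside the innermost loop with ">hsp.fas", and ">" truncates the file on every call, so each HSP overwrites the previous one. Creating the writer once before the loops (or opening with ">>hsp.fas") should keep all of them. The same open-once pattern is sketched below in Python with only the standard library, reading the BLAST XML directly. This is a rough sketch, not the script's own method: the function names and header layout are made up for illustration, and it assumes that Hsp_align-len and Hsp_identity correspond to what BioPerl reports for length('total') and percent_identity.

```python
import xml.etree.ElementTree as ET


def hsp_records(source, min_len=50, min_pct_id=75.0):
    """Yield (header, sequence) pairs, query first and then hit, for each HSP
    that passes the length and percent-identity filters.

    `source` is a path or a binary file object holding BLAST XML (-m 7).
    Header layout is hypothetical; adapt it to your needs.
    """
    root = ET.parse(source).getroot()
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def", default="query")
        for hit in iteration.iter("Hit"):
            hit_id = hit.findtext("Hit_id", default="hit")
            for hsp in hit.iter("Hsp"):
                aln_len = int(hsp.findtext("Hsp_align-len"))
                # Assumption: identity is measured over the gapped alignment length.
                pct_id = 100.0 * int(hsp.findtext("Hsp_identity")) / aln_len
                if aln_len <= min_len or pct_id < min_pct_id:
                    continue
                num = hsp.findtext("Hsp_num")
                # Sequences are kept gapped so the pair stays aligned.
                yield f"{query}|{hit_id}|hsp{num}", hsp.findtext("Hsp_qseq")
                yield f"{hit_id}|hsp{num}", hsp.findtext("Hsp_hseq")


def write_fasta(records, path):
    """Write all records to one FASTA file; the file is opened exactly once."""
    count = 0
    with open(path, "w") as out:
        for header, seq in records:
            out.write(f">{header}\n{seq}\n")
            count += 1
    return count
```

Usage would look like write_fasta(hsp_records("BLASToutxml"), "hsp.fas"), which writes two records (query and hit) per HSP that passes the filters.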
I would use an XSLT stylesheet. For example, with the following BLAST XML result:
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastn</BlastOutput_program>
<BlastOutput_version>BLASTN 2.2.26+</BlastOutput_version>
<BlastOutput_reference>Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), "A greedy algorithm for aligning DNA sequences", J Comput Biol 2000; 7(1-2):203-14.</BlastOutput_reference>
<BlastOutput_db>nr</BlastOutput_db>
<BlastOutput_query-ID>26343</BlastOutput_query-ID>
<BlastOutput_query-def>No definition line</BlastOutput_query-def>
<BlastOutput_query-len>671</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_expect>10</Parameters_expect>
<Parameters_sc-match>1</Parameters_sc-match>
<Parameters_sc-mismatch>-2</Parameters_sc-mismatch>
<Parameters_gap-open>0</Parameters_gap-open>
<Parameters_gap-extend>0</Parameters_gap-extend>
<Parameters_filter>L;m;</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>26343</Iteration_query-ID>
<Iteration_query-def>No definition line</Iteration_query-def>
<Iteration_query-len>671</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gi|118082669|ref|XM_416233.2|</Hit_id>
<Hit_def>PREDICTED: Gallus gallus similar to ubiquitous tetratricopeptide containing protein RoXaN; Rotavirus X associated non-structural protein (LOC417996), mRNA</Hit_def>
<Hit_accession>XM_416233</Hit_accession>
<Hit_len>2868</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>556.962</Hsp_bit-score>
<Hsp_score>301</Hsp_score>
<Hsp_evalue>3.58957e-158</Hsp_evalue>
<Hsp_query-from>92</Hsp_query-from>
<Hsp_query-to>395</Hsp_query-to>
<Hsp_hit-from>2378</Hsp_hit-from>
<Hsp_hit-to>2681</Hsp_hit-to>
<Hsp_query-frame>1</Hsp_query-frame>
<Hsp_hit-frame>1</Hsp_hit-frame>
<Hsp_identity>303</Hsp_identity>
<Hsp_positive>303</Hsp_positive>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>304</Hsp_align-len>
<Hsp_qseq>TACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCTGGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGGAGCTATCGCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGGTACCA</Hsp_qseq>
<Hsp_hseq>TACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCTGGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGGAGCTATCGCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGGTTCCA</Hsp_hseq>
<Hsp_midline>|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
<Hit>
<Hit_num>2</Hit_num>
<Hit_id>gi|27881483|ref|NM_017590.4|</Hit_id>
<Hit_def>Homo sapiens zinc finger CCCH-type containing 7B (ZC3H7B), mRNA</Hit_def>
<Hit_accession>NM_017590</Hit_accession>
<Hit_len>5868</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>366.757</Hsp_bit-score>
<Hsp_score>198</Hsp_score>
<Hsp_evalue>6.49273e-101</Hsp_evalue>
<Hsp_query-from>100</Hsp_query-from>
<Hsp_query-to>390</Hsp_query-to>
<Hsp_hit-from>2608</Hsp_hit-from>
<Hsp_hit-to>2898</Hsp_hit-to>
<Hsp_query-frame>1</Hsp_query-frame>
<Hsp_hit-frame>1</Hsp_hit-frame>
<Hsp_identity>264</Hsp_identity>
<Hsp_positive>264</Hsp_positive>
<Hsp_gaps>8</Hsp_gaps>
<Hsp_align-len>295</Hsp_align-len>
<Hsp_qseq>ATGCAGCAGACCTATGACATGTGGCT-AAAGAAACACAATCCTGGGAAGCCTGGAG-AGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGGAGC-TATC-GCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGG</Hsp_qseq>
<Hsp_hseq>ATGCAGCAGACCTATGACATGTGGCTGAAA-AAACACAACCCAGGAAAGCCTGGAGAAGGGACCCCCA-TCAGTTCTCGGGAAGGGGAGAAGCAGATCCAGATGCCCACGGACTACGCGGACATCATGATGGGCTACCACTGCTGGCTCTGCGGCAAGAACAGCAACAGCAAGAAGCAGTGGCAGCAGCACATCCAGTCCGAGAAGCACAAGGAGAAGGTCTTCACGTCCGACAGTGACGCCAGCGGCTGG-GCCT-TCCGCTTCCCCATGGGCGAGTTCCGGCTCTGCGACAGG</Hsp_hseq>
<Hsp_midline>|||||||||||||||||||||||||| ||| |||||||| || || |||||||||| ||||| | ||| ||| ||| || ||||||||||| ||||||||||||||||| ||||| || ||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||||||||| || ||||||||| ||||| ||||| || | || ||||||| ||||||||||||| |||||| || |||</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
<Hit>
<Hit_num>3</Hit_num>
<Hit_id>gi|194733718|ref|NM_001130695.1|</Hit_id>
<Hit_def>Rattus norvegicus zinc finger CCCH-type containing 7B (Zc3h7b), mRNA</Hit_def>
<Hit_accession>NM_001130695</Hit_accession>
<Hit_len>5466</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>355.677</Hsp_bit-score>
<Hsp_score>192</Hsp_score>
<Hsp_evalue>1.40543e-97</Hsp_evalue>
<Hsp_query-from>97</Hsp_query-from>
<Hsp_query-to>390</Hsp_query-to>
<Hsp_hit-from>2433</Hsp_hit-from>
<Hsp_hit-to>2726</Hsp_hit-to>
<Hsp_query-frame>1</Hsp_query-frame>
<Hsp_hit-frame>1</Hsp_hit-frame>
<Hsp_identity>266</Hsp_identity>
<Hsp_positive>266</Hsp_positive>
<Hsp_gaps>12</Hsp_gaps>
<Hsp_align-len>300</Hsp_align-len>
<Hsp_qseq>GATATGCAGCAGACCTATGACATGTGGCT-AAAGAAACACAATCCTGGGAAGCCTGGAG-AGGGAACACCACTCA-CTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTG-CTGGAGC-TATCGCTTCCCTATGGGCGAGTTCC-AGCTCTGTGAAAGG</Hsp_qseq>
<Hsp_hseq>GATATGCAACAGACCTATGACATGTGGCTGAAA-AAACACAACCCAGGGAAGCCAGGAGAAGGGACCCCCA-TCAGC-TCCCGGGAAGGAGAGAAGCAGATCCAGATGCCCACGGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGCAAGAACAGCAACAGCAAGAAGCAGTGGCAGCAGCACATCCAGTCTGAGAAGCACAAGGAGAAGGTCTTCACTTCCGACAGCGACGCCAG-TGGCTGG-GCCTACCGATTCCCCATGGGCGAGTTCCGA-CTCTGTGACAGG</Hsp_hseq>
<Hsp_midline>|||||||| |||||||||||||||||||| ||| |||||||| || |||||||| |||| ||||| | ||| ||| | || || ||||| ||||| ||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||||||||| || ||||| ||| |||| || |||| || || || ||||| ||||||||||||| | |||||||| |||</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>18780</Statistics_db-num>
<Statistics_db-len>25940078</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>0.46</Statistics_kappa>
<Statistics_lambda>1.28</Statistics_lambda>
<Statistics_entropy>0.85</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>
and the following XSLT stylesheet: https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/ncbi/blast2fasta.xsl
processing:
xsltproc --novalid blast2fasta.xsl blast.xml
result:
>PREDICTED: Gallus gallus similar to ubiquitous tetratricopeptide containing protein RoXaN; Rotavirus X associated non-structural protein (LOC417996), mRNA|len:303|ident:303
TACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCTGGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGGAGCTATCGCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGGTTCCA
>Homo sapiens zinc finger CCCH-type containing 7B (ZC3H7B), mRNA|len:290|ident:264
ATGCAGCAGACCTATGACATGTGGCTGAAAAAACACAACCCAGGAAAGCCTGGAGAAGGGACCCCCATCAGTTCTCGGGAAGGGGAGAAGCAGATCCAGATGCCCACGGACTACGCGGACATCATGATGGGCTACCACTGCTGGCTCTGCGGCAAGAACAGCAACAGCAAGAAGCAGTGGCAGCAGCACATCCAGTCCGAGAAGCACAAGGAGAAGGTCTTCACGTCCGACAGTGACGCCAGCGGCTGGGCCTTCCGCTTCCCCATGGGCGAGTTCCGGCTCTGCGACAGG
>Rattus norvegicus zinc finger CCCH-type containing 7B (Zc3h7b), mRNA|len:293|ident:266
GATATGCAACAGACCTATGACATGTGGCTGAAAAAACACAACCCAGGGAAGCCAGGAGAAGGGACCCCCATCAGCTCCCGGGAAGGAGAGAAGCAGATCCAGATGCCCACGGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGCAAGAACAGCAACAGCAAGAAGCAGTGGCAGCAGCACATCCAGTCTGAGAAGCACAAGGAGAAGGTCTTCACTTCCGACAGCGACGCCAGTGGCTGGGCCTACCGATTCCCCATGGGCGAGTTCCGACTCTGTGACAGG
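If xsltproc is not available, a similar result can be approximated with Python's standard library. The sketch below does not reproduce the stylesheet itself; the header layout is inferred only from the sample output above, on the assumption that "len" is the difference between Hsp_hit-to and Hsp_hit-from, "ident" is Hsp_identity, and the sequence is Hsp_hseq with gap characters removed. The function name hits_to_fasta is made up for illustration.

```python
import xml.etree.ElementTree as ET


def hits_to_fasta(source):
    """Return FASTA text with one degapped hit sequence per HSP.

    `source` is a path or a binary file object holding BLAST XML.
    """
    lines = []
    for hit in ET.parse(source).getroot().iter("Hit"):
        definition = hit.findtext("Hit_def", default="")
        for hsp in hit.iter("Hsp"):
            start = int(hsp.findtext("Hsp_hit-from"))
            end = int(hsp.findtext("Hsp_hit-to"))
            # Assumption: "len" in the sample output is hit-to minus hit-from;
            # abs() only guards against reversed coordinates.
            span = abs(end - start)
            ident = hsp.findtext("Hsp_identity")
            lines.append(f">{definition}|len:{span}|ident:{ident}")
            # Drop alignment gaps to recover the plain hit sequence.
            lines.append(hsp.findtext("Hsp_hseq").replace("-", ""))
    return "".join(line + "\n" for line in lines)
```

Calling print(hits_to_fasta("blast.xml"), end="") would print one record per HSP. Note that this output, like the stylesheet's, contains only the hit sequences and not the query.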