Hello, I've noticed some strange behaviour when parsing BLAST .xml output files (-oufmt 5) using BioPerl's Bio::SearchIO library.
I have a simple parser script that looks something like:
#!/usr/bin/perl -w
use strict;
use Bio::SearchIO;
my $in = Bio::SearchIO -> new (-format => 'blastxml', -file => "consensusSeqs.BLASTp.xml");
open (OUT, ">consensusSeqs.parse.OUT");
my $query_count = 1;
while( my $result = $in->next_result ) {
print "Query count: ".$query_count."\n";
print "Query name: ".$result->query_description."\n";
print "Number of hits: ".$result->num_hits."\n";
my $hit_count = 1;
if (!defined $result->next_hit) {
print OUT $result->query_description."\tNo hit\n";
}
while ( my $hit = $result->next_hit ) {
print "\tHit count: ".$hit_count."\n";
while ( my $hsp = $hit->next_hsp ) {
my @a = split /\|/, $hit->name;
my $hit_accession = $a[3];
# print "Hit name: ".$hit->name."\n";
print "\tHit accession: ".$hit_accession."\n";
## get some stats of hit
my $percent_id = sprintf("%.2f", $hsp->percent_identity);
my $percent_q_coverage = sprintf("%.2f", ((($hsp->length('query')/($result->query_length)))*100));
my @b = split /[\[\]]/, $hit->description;
my $organism = $b[1];
(my $short_desc = $hit->description) =~ s/\[.*//;
print "\t\tE-value: ".$hsp->evalue."\n";
print "\t\t% identity: ".$percent_id."\n";
print "\t\t% coverage: ".$percent_q_coverage."\n";
print "\t\tDescription: ".$hit->description."\n";
if (!defined $short_desc) {
$short_desc = "No hits found";
}
print "\t\tShort description: ".$short_desc."\n";
## if there is nothing under [organism]
if (!defined $organism) {
$organism = "not defined";
}
print "\t\tOrganism: ".$organism."\n";
print "\t~~~\n";
## print the hit to OUT
print OUT $result->query_description."\t".$percent_id."\t".$percent_q_coverage."\t".$organism."\t".$short_desc."\n";
}
$hit_count++;
}
print "\n";
print OUT "~~~\n";
$query_count++;
}
print "Finished\n\n";
With something like the above, I hope to capture some information for all query sequences, even if there is no hit. But I noticed that some queries were not in my output file. Looking more closely at the XML file itself, I think the issue only occurs when there is exactly one hit for a given query. In these cases, it seems to skip the while ( my $hit = $result->next_hit )
and while ( my $hsp = $hit->next_hsp )
loops and iterates directly to the next $result.
Has anyone else noticed this behaviour? Is it a bug, or is it something in the way I've set up my script? Any suggestions for a workaround would be much appreciated!
Thanks!