Strange Behaviour Of Bioperl'S Bio::Searchio When Parsing Xml Blast Output
1
0
Entering edit mode
11.0 years ago
rwn ▴ 610

Hello, I've noticed some strange behaviour when parsing BLAST .xml output files (-oufmt 5) using BioPerl's Bio::SearchIO library.

I have a simple parser script that looks something like:

#!/usr/bin/perl -w

use strict;
use Bio::SearchIO;

my $in = Bio::SearchIO -> new (-format => 'blastxml', -file => "consensusSeqs.BLASTp.xml");
open (OUT, ">consensusSeqs.parse.OUT");

my $query_count = 1;

while( my $result = $in->next_result ) {
    print "Query count: ".$query_count."\n";
    print "Query name: ".$result->query_description."\n";
    print "Number of hits: ".$result->num_hits."\n";
    my $hit_count = 1;

    if (!defined $result->next_hit) {
        print OUT $result->query_description."\tNo hit\n";
    }

    while ( my $hit = $result->next_hit ) {
        print "\tHit count: ".$hit_count."\n";
        while ( my $hsp = $hit->next_hsp ) {
            my @a = split /\|/, $hit->name;
            my $hit_accession = $a[3];
            # print "Hit name: ".$hit->name."\n";
            print "\tHit accession: ".$hit_accession."\n";

             ## get some stats of hit
            my $percent_id = sprintf("%.2f", $hsp->percent_identity);
            my $percent_q_coverage = sprintf("%.2f", ((($hsp->length('query')/($result->query_length)))*100)); 
            my @b = split /[\[\]]/, $hit->description;
            my $organism = $b[1];

            (my $short_desc = $hit->description) =~ s/\[.*//;

            print "\t\tE-value: ".$hsp->evalue."\n";
            print "\t\t% identity: ".$percent_id."\n";
            print "\t\t% coverage: ".$percent_q_coverage."\n";
            print "\t\tDescription: ".$hit->description."\n";
            if (!defined $short_desc) {
                $short_desc = "No hits found";
            }
            print "\t\tShort description: ".$short_desc."\n";
            ## if there is nothing under [organism] 
            if (!defined $organism) {
                $organism = "not defined";
            }
            print "\t\tOrganism: ".$organism."\n";
            print "\t~~~\n";

            ## print the hit to OUT
            print OUT $result->query_description."\t".$percent_id."\t".$percent_q_coverage."\t".$organism."\t".$short_desc."\n"; 
        }
        $hit_count++;
    }
    print "\n";
    print OUT "~~~\n";
    $query_count++;
}

print "Finished\n\n";

With something like the above, I hope to capture some information for all query sequences, even if there is no hit. But I noticed that some queries were not in my output file. Looking more closely at the XML file itself, I think the issue only occurs when there is exactly one hit for a given query. In these cases, it seems to skip the while ( my $hit = $result->next_hit ) and while ( my $hsp = $hit->next_hsp ) loops and iterates directly to the next $result.

Has anyone else noticed this behaviour? Is it a bug, or is it something in the way I've set up my script? Any suggestions for a workaround would be much appreciated!

Thanks!

blast perl bioperl xml • 3.5k views
ADD COMMENT
3
Entering edit mode
11.0 years ago
rwn ▴ 610

Turns out it was just me being an idiot. The piece of code:

if (!defined $result->next_hit) {
    print OUT $result->query_description."\tNo hit\n";
}

iterates through the list of possible $hit objects, such that by the time the while (my $hit = $result->next_hit) loop is reached, it is already on $hit number 2. To get around it, can use the result->rewind function to start from the beginning again, like this:

if (!defined $result->next_hit) {
  print OUT $result->query_description."\tNo hit\n";
}
$result->rewind;
# etc etc

You live and learn!

ADD COMMENT

Login before adding your answer.

Traffic: 1773 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6