Dear friends:
I have a blast report, and want to extract following information for each result(query): the Query_name, hit_number, name and description of the hit(HSP) with the highest identity.
my bioperl script is:
#!/usr/bin/perl -w
use strict;
use warnings;
use Bio::SearchIO;
if (@ARGV != 2){die "Usage: $0 <BLAST-report-file> <output-file>\n"};
my ($infile,$outfile) = @ARGV;
print "Parsing the BLAST result ...";
my $blast = new Bio::SearchIO(
-format => 'blast',
-file => $infile);
open (OUT,">$outfile") or die "Cannot open $outfile: $!";
print OUT "query\tNumber of hits\tGreatest identity %\tAccession (identity %)\tDescription (identity %)\n";
while (my $result = $blast->next_result){
print OUT $result->query_name . "\t";
print OUT $result->num_hits. "\t";
# don't need the '&': if (my $hit = sort_hit($result))
if (my $hit = &sort_hit($result)){
if (my $hsp = $hit->hsp){
print OUT $hsp->percent_identity. "\t";
print OUT $hit->accession. "\t";
print OUT $hit->description. "\n";
}
}
}
close OUT;
print " DONE!!!\n";
# the problem starts here:
# no matter what, this function will always return the first hit
sub sort_hit{
my $result = shift;
my @hits;
# The following while loop is very clumsy, ResultI has a hits method:
# better: my @hits = $result->hits();
while (my $hit = $result->next_hit) {
push @hits, $hit;
}
# here you sort the @hits, assign the result and do not use the result ever again.
# sorting the array has no effect on the original hits, that is also why I recommended to
# use $result->sort_hits instead
# minor: you don't need to multiply percent_identity by 100
my @hit_sort = sort { $b-> hsp ->percent_identity * 100 <=> $a-> hsp ->percent_identity * 100 } @hits;
# the following code is possibly meant to return the first element of the array, but
# is totally without use. You do not need to iterate again.
# Correct would be:
return shift @hit_sort; # you do know shift, unshift, push and pop for array operations?
# rest of code doesn't do anything sensible:
$result->rewind;
my $flag=0;
my $temp_hit;
# At the end of this while loop, $temp_hit will always be the first hit!
while (my $hit=$result->next_hit){
if($flag==0){ # this is false already in the second iteration
$temp_hit=$hit;
$flag++;
}
return $temp_hit
}
}
error information:
-------------EXCEPTION: Bio::Root::Exception -------------
MSG: Can't get HSPs: data not collected.
STACK: Error::throw
STACK: Bio::Root::Root::throw /home/xiongtl/Bioperl/dag/lib/perl5/Bio/Root/Root.pm:368
STACK: Bio::Search::Hit::GenericHit::hsp /home/xiongtl/Bioperl/dag/lib/perl5/Bio/Search/Hit/GenericHit.pm:688
STACK: main::sort_hit xtl_new.pl:37
STACK: xtl_new.pl:20
-----------------------------------------------------------
Parsing the BLAST result ...
please give me some clues, all your words are welcome!
NOTE: My bioperl version is 1.6.1, and I use blastx 2.2.27+. My blast report(-outfmt 0) have many queries and each query have many hits(against nr database), so each hit maybe also have several HSPs.
Some more information: I tried the $hit->next_hsp before, but running the script still informed me the same error. However, today I have just reruned the script, both $hit->next_hsp and $hit->hsp are ok! They both outputed the first hit for each result. I don't know why. So I tried different sort rules as follows:
# my @hit_sort = sort { $b-> hsp ->hsp_length <=> $a-> hsp ->hsp_length } @hits;
# my @hit_sort = sort { $b-> hsp ->percent_identity * 100 <=> $a-> hsp ->percent_identity * 100 } @hits;
# my @hit_sort = sort { $b-> significance <=> $a-> significance } @hits;
# my @hit_sort = sort { $a-> accession cmp $b-> accession } @hits;
but different sort functions still generate the same output (still the first hit)! It seems that the sub-function does not work at all! oh, it really troubles me...
Here is my data structure:
Query= scaffold28:2776872-2777385(-)
Length=513
Score E
Sequences producing significant alignments: (Bits) Value
gb|AEA49877.1| RNA-dependent RNA polymerase [Drosophila ananassa... 90.5 3e-30
...
...
...
gb|AAA48442.1| L protein [Vesicular stomatitis virus] 115 2e-26
>gb|AEA49877.1| RNA-dependent RNA polymerase [Drosophila ananassae sigmavirus]
Length=615
Score = 90.5 bits (223), Expect(2) = 3e-30, Method: Composition-based stats.
Identities = 40/57 (70%), Positives = 43/57 (75%), Gaps = 0/57 (0%)
Frame = +2
Query 173 ISIANHIDYEKWNAHQRGTGNDPCFKVMGQFLGLPNSILRTNEFFSKITFYYNQRPD 343
I+IANHIDYEKWN HQRG N P FKVMGQFLG PN I RT+EFF K YY+ R D
Sbjct 554 ITIANHIDYEKWNNHQRGDANGPVFKVMGQFLGYPNLIARTHEFFEKSLIYYSDRAD 610
Score = 67.0 bits (162), Expect(2) = 3e-30, Method: Compositional matrix adjust.
Identities = 39/54 (72%), Positives = 46/54 (85%), Gaps = 0/54 (0%)
Frame = +1
Query 1 FSLMSWRLREYFVITEYLIKLHFVPLFKGITMAddlttlikklldtSSGQALTD 162
FSLMSW LR+YFV TEYLIK HFVPLF+G+TMADDLTT+ +K+L +SSGQ D
Sbjct 497 FSLMSWELRDYFVFTEYLIKTHFVPLFEGLTMADDLTTVTQKMLSSSSGQGNDD 550
>....
>....
>....
>gb|AAA48442.1| L protein [Vesicular stomatitis virus]
Length=2109
Score = 115 bits (289), Expect = 2e-26, Method: Compositional matrix adjust.
Identities = 81/172 (47%), Positives = 104/172 (60%), Gaps = 3/172 (2%)
Frame = +1
Query 1 FSLMSWRLREYFVITEYLIKLHFVPLFKGITMAddlttlikklldtSSGQALTDyrsyfy 180
FSLMSWRLREYFVITEYLIK ++VPLFKG+TMADDLT++IKK++D+SSGQ L DY S
Sbjct 541 FSLMSWRLREYFVITEYLIKTYYVPLFKGLTMADDLTSVIKKMMDSSSGQGLDDYSSVCL 600
Query 181 s*syrL*KMECSPTRNRE*SMFQGHGSIFRSAQLNTQN**IF**NYFLL*STTRPD-MIM 357
+ K + +F+ G L + F L+ RPD M +
Sbjct 601 ANHIDYEKWNNHQRKESNGPIFRVMGQFLGYPSLIERTHEFF--EKSLIYYNGRPDLMTI 658
Query 358 NNGVIENLTDQRVCWNGQLGGQEGIGQRGCCVVNALAIERECQDQNIKVSLL 513
NG + N T RVCWNGQ GG EG+ Q+G +VN L I+RE + +N V +L
Sbjct 659 RNGTLCNSTKHRVCWNGQKGGLEGLRQKGWSIVNLLVIQREAKIRNTAVKVL 710
Lambda K H
0.318 0.134 0.401
Gapped
Lambda K H
0.267 0.0410 0.140
Effective search space used: 304649109255
cross posted on seqanswers: http://seqanswers.com/forums/showthread.php?t=30888
Also, with similar problems it is absolutely necessary to post the following information:
The requirement to give sufficient version information and reproducible examples applies to all software related questions. I think, such question should be closed as "not enough detail/ no longer relevant (because they might refer to an outdated version of the software)" unless the information is provided in the future.
I feel very sorry for my mistakes! you've helped me in a good habit of posting a clear question, I really appreciate you and your tips!
The extra important information has been added, as shown above.
I have experienced some problems with older versions of BioPerl and blast+, do you have the possibility to test with 1.6.901? Otherwise I need to see your blast output. A single result (from
Query=
tag to nextQuery=
)So, according to your edit, the error message just "magically" vanished?
Indeed, I have compared the upload version above and my current local version (I just repeated to rerun it, no error informed), there are no difference in both! Maybe a temporary bug?
Now, the sort function still does not work (different rules have the same output, although values in report are obviously different [even I have tried to inverse $a and $b]). I am trying to work it out via modifying bit by bit. Do you have any ideas?
My data structure is showed above. Thank you very much!
I added some comments, please have a look at your sourcecode.