I'll mention first that there are already tools for the phylogenetic classification of sequences, for example MEGAN (though I haven't used it personally). If something like that won't work, it's quite easy to roll your own solution using BioPerl. Here's an example scenario: parse your BLAST report using BioPerl's Bio::SearchIO, then use the species information in your hits to look up the taxonomic information at NCBI.
#!/usr/bin/env perl
use strict;
use warnings;
#use Bio::SearchIO # for parsing blast, which we aren't doing
use Bio::DB::Taxonomy; # for accessing NCBI's entrez Taxonomy database
## plug in some awesome code to parse your blast report here
my $db = Bio::DB::Taxonomy->new(-source => 'entrez');
my $taxonid = $db->get_taxonid('Homo sapiens'); ## you could get this from your blast report, or an ID of some form...
if (defined $taxonid) { # is your species in the database?
    my $taxon = $db->get_taxon(-taxonid => $taxonid);
    print "Taxon ID is ", $taxon->id, "\n";
    print "Scientific name is ", $taxon->scientific_name, "\n";
    print "Rank is ", $taxon->rank, "\n";
    print "Division is ", $taxon->division, "\n";
    # Step up the tree one parent at a time; the number of steps
    # needed to reach a given rank depends on the lineage.
    my $kingdom = $taxon;
    for (1..25) {
        $kingdom = $db->get_Taxonomy_Node(-taxonid => $kingdom->parent_id);
    }
    print "Kingdom is ", $kingdom->scientific_name, "\n";
}
Call this biostars62911.pl and execute it with perl biostars62911.pl. This will output:
Taxon ID is 9606
Scientific name is Homo sapiens
Rank is species
Division is Primates
Kingdom is Metazoa
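The script above leaves the BLAST parsing step as a placeholder (the commented-out Bio::SearchIO line). A minimal sketch of that step might look like the following. It assumes the hit descriptions end with the organism in square brackets, as in "hemoglobin subunit beta [Homo sapiens]", which will not hold for every database; the helper name species_from_description and the per-species tally are only illustrative.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Pull a species name out of a hit description.
# Assumption: the organism sits in trailing square brackets,
# e.g. "hemoglobin subunit beta [Homo sapiens]".
sub species_from_description {
    my ($desc) = @_;
    return unless defined $desc;
    return $1 if $desc =~ /\[([^\[\]]+)\]\s*$/;
    return;
}

# BioPerl is only loaded when a report file is given on the command line.
if (@ARGV) {
    require Bio::SearchIO;
    my $in = Bio::SearchIO->new(-format => 'blast', -file => $ARGV[0]);

    my %hits_per_species;
    while (my $result = $in->next_result) {      # one result per query
        while (my $hit = $result->next_hit) {    # one hit per subject sequence
            my $species = species_from_description($hit->description);
            next unless defined $species;
            $hits_per_species{$species}++;
        }
    }
    printf "%s\t%d\n", $_, $hits_per_species{$_} for sort keys %hits_per_species;
}
```

Each species name collected this way could then be passed to get_taxonid in place of the hard-coded 'Homo sapiens'.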
Note that this is just an example, and I probably wouldn't traverse the tree this way to classify groups in practice. It isn't efficient when you have many searches (this one lookup takes ~3 seconds), and the number of steps back to the same point differs from lineage to lineage. It's possible to select just a single rank of interest (e.g., kingdom), of course, but this illustrates how you can get to any part of the tree you want. I added a comment in the code where I check whether the species is in the database because you'll find that many are not, so don't assume every species is represented.

I think Pierre's solution is really cool (I couldn't have come up with that), but I have to say that a Bio* approach is probably more reliable (and readable) than trying to construct URLs for each query, since there are a lot of tests going on behind the scenes in the Perl code above. You can do the same thing in Biopython, or probably any Bio* package, and I'd like to see those examples because I'm not familiar with those methods.
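On selecting a single rank instead of counting steps: one way to do it is to keep following parent_id until a node reports the rank you want. This is only a sketch. The helper name ancestor_with_rank is made up, it uses the same get_taxon and parent_id calls as the script above, and it returns nothing for lineages that have no node with that rank (many bacteria, for instance, have no 'kingdom' node).

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Walk up from $taxon until a node with the wanted rank turns up.
# $db is anything that answers get_taxon(-taxonid => ...), such as a
# Bio::DB::Taxonomy handle. Returns undef if no ancestor has that rank.
sub ancestor_with_rank {
    my ($db, $taxon, $rank) = @_;
    my %seen;
    while (defined $taxon) {
        return $taxon if ($taxon->rank // '') eq $rank;
        my $parent_id = $taxon->parent_id;
        # Stop at the root: no parent, or a parent we have already visited.
        last if !defined $parent_id || $seen{$parent_id}++;
        $taxon = $db->get_taxon(-taxonid => $parent_id);
    }
    return;
}

# BioPerl is only loaded when a species name is given on the command line.
if (@ARGV) {
    require Bio::DB::Taxonomy;
    my $db      = Bio::DB::Taxonomy->new(-source => 'entrez');
    my $taxonid = $db->get_taxonid($ARGV[0]);
    die "No taxon ID found for '$ARGV[0]'\n" unless defined $taxonid;

    my $start   = $db->get_taxon(-taxonid => $taxonid);
    my $kingdom = ancestor_with_rank($db, $start, 'kingdom');
    print "Kingdom is ",
        (defined $kingdom ? $kingdom->scientific_name : 'not assigned'), "\n";
}
```

Run it as, say, perl rank_walk.pl 'Homo sapiens' (the file name is arbitrary). It still makes one remote request per step, so it is no faster than the fixed loop; it just stops at the right node regardless of lineage depth.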
EDIT: I've found that Bio::DB::EUtilities (or Bio::DB::SoapEUtilities) is faster than my example above, but I still don't think these methods (including Pierre's solution, which is really fast) are ideal for your problem. The reason is that NCBI asks you to limit queries to 3 per second and to run large jobs only during certain hours or on weekends. When you say that you have "very large sets of blast results" I'm guessing you mean millions of queries, and that could take anywhere from a week to months to run (at 3 requests per second, a million requests alone is close to four days of nonstop querying). A better solution would be to download the taxonomy flat files, change the source in the code above from 'entrez' to 'flatfile' (the flatfile source also needs the paths to the nodes and names dump files), and do the search locally. That way you can split up your BLAST reports and run many jobs in parallel. You could probably modify Pierre's code and do the same thing with a bash script.
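For the local approach, here is a rough sketch of what the switch to 'flatfile' might look like. It assumes you have unpacked NCBI's taxdump archive (taxdump.tar.gz from the NCBI taxonomy FTP area) so that nodes.dmp and names.dmp sit in one directory. The helper name open_local_taxonomy, the directory arguments, and the early file checks are my own additions, not part of BioPerl.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Spec;
use File::Path qw(make_path);

# Open a local taxonomy database built from the NCBI dump files.
# $dump_dir must hold nodes.dmp and names.dmp; $index_dir is where
# BioPerl keeps the index files it builds (created here if missing).
sub open_local_taxonomy {
    my ($dump_dir, $index_dir) = @_;
    my %file = map { $_ => File::Spec->catfile($dump_dir, "$_.dmp") } qw(nodes names);
    for my $path (sort values %file) {
        die "Missing taxonomy dump file: $path\n" unless -e $path;
    }
    make_path($index_dir) unless -d $index_dir;

    require Bio::DB::Taxonomy;   # loaded only once the files are known to exist
    return Bio::DB::Taxonomy->new(
        -source    => 'flatfile',
        -directory => $index_dir,
        -nodesfile => $file{nodes},
        -namesfile => $file{names},
    );
}

# Usage: perl local_taxonomy.pl <dump_dir> <index_dir> <species name>
if (@ARGV == 3) {
    my ($dump_dir, $index_dir, $species) = @ARGV;
    my $db      = open_local_taxonomy($dump_dir, $index_dir);
    my $taxonid = $db->get_taxonid($species);
    die "No taxon ID found for '$species'\n" unless defined $taxonid;

    my $taxon = $db->get_taxon(-taxonid => $taxonid);
    print "Taxon ID is ", $taxon->id, "\n";
    print "Rank is ", $taxon->rank, "\n";
}
```

The first run has to index the dump files, so expect it to be slow; later runs reuse the index. Whether several parallel jobs can safely share one index directory is something I'd test before relying on it.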