You can use Biopython to access any of NCBI's Entrez databases, in your case the taxonomy database - http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc116
Or you can do the same sort of thing with BioPerl's Bio::DB::Taxonomy package - http://doc.bioperl.org/bioperl-live/Bio/DB/Taxonomy.html and Bio::Tree::Tree package - http://doc.bioperl.org/bioperl-live/Bio/Tree/Tree.html
Perl example:
use strict;
use warnings;
use Bio::DB::Taxonomy;
use Bio::Tree::Tree;

my @taxonids = ("515482", "515474");
my @lineages = ();
my $db = Bio::DB::Taxonomy->new(-source => 'entrez');

foreach my $taxonid (@taxonids) {
    my $taxon = $db->get_taxon(-taxonid => $taxonid);
    my $tree = Bio::Tree::Tree->new(-node => $taxon);
    my @taxa = $tree->get_nodes;
    my @tids = ();
    foreach my $t (@taxa) {
        unshift(@tids, $t->id());
    }
    push(@lineages, $taxonid . "\t|\t" . $taxon->scientific_name() . "\t|\t" . "@tids");
}

foreach my $lineage (@lineages) {
    print "$lineage\n";
}
Outputs:
515482 | Nitzschia dubiiformis | 515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474 | Cocconeis stauroneiformis | 515474 216715 216714 186023 33850 33849 2836 33634 2759 131567
Python example:
# import the Entrez module
from Bio import Entrez

# tax IDs to look up
taxids = [515482, 515474]

# NCBI asks for a contact email with every request
Entrez.email = "youremail@gmail.com"

# fetch each record and print its lineage
for taxid in taxids:
    handle = Entrez.efetch(db="taxonomy", id=str(taxid), retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    for taxon in records:
        tid = taxon["TaxId"]
        name = taxon["ScientificName"]
        # LineageEx runs from the root down, so reverse it
        tids = [t["TaxId"] for t in reversed(taxon["LineageEx"])]
        tids.insert(0, tid)
        print("%s\t|\t%s\t|\t%s" % (tid, name, " ".join(tids)))
Outputs:
515482 | Nitzschia dubiiformis | 515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474 | Cocconeis stauroneiformis | 515474 216715 216714 186023 33850 33849 2836 33634 2759 131567
Both of these essentially just call http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi, which you could call and parse yourself: iterate over the IDs in your list, request the URL for each one, and output whatever you need.
Calling the following, for example:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=515482&mode=text&report=xml
gives us XML output, from which you can parse the lineage tax IDs quite simply. Pierre gives a great example in his post, using an XSLT stylesheet.
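If you would rather stay in Python than use XSLT, here is a minimal sketch of calling efetch and parsing the result with only the standard library. The function names fetch_taxonomy_xml and parse_lineages are my own, and SAMPLE_XML is a hand-trimmed stand-in that only mimics the TaxaSet/Taxon/LineageEx layout of the response, not real output. It also assumes the HTTPS form of the efetch URL and the retmode=xml parameter.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: HTTPS form of the efetch endpoint mentioned above.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fetch_taxonomy_xml(taxid):
    """Fetch the raw taxonomy XML (bytes) for one tax ID. Needs network access."""
    query = urllib.parse.urlencode(
        {"db": "taxonomy", "id": str(taxid), "retmode": "xml"}
    )
    with urllib.request.urlopen(EFETCH_URL + "?" + query) as response:
        return response.read()


def parse_lineages(xml_data):
    """Return (taxid, name, lineage) tuples; lineage runs from the taxon up to the root."""
    root = ET.fromstring(xml_data)
    results = []
    for taxon in root.findall("Taxon"):
        taxid = taxon.findtext("TaxId")
        name = taxon.findtext("ScientificName")
        # LineageEx lists ancestors from the root down, so reverse it.
        ancestors = [t.findtext("TaxId") for t in taxon.findall("LineageEx/Taxon")]
        results.append((taxid, name, [taxid] + ancestors[::-1]))
    return results


# Hand-trimmed illustration of the response layout (hypothetical, abbreviated).
SAMPLE_XML = """<TaxaSet><Taxon>
  <TaxId>515482</TaxId>
  <ScientificName>Nitzschia dubiiformis</ScientificName>
  <LineageEx>
    <Taxon><TaxId>131567</TaxId></Taxon>
    <Taxon><TaxId>2759</TaxId></Taxon>
    <Taxon><TaxId>2857</TaxId></Taxon>
  </LineageEx>
</Taxon></TaxaSet>"""

for taxid, name, lineage in parse_lineages(SAMPLE_XML):
    print("%s\t|\t%s\t|\t%s" % (taxid, name, " ".join(lineage)))
```

For live data you would call parse_lineages(fetch_taxonomy_xml(515482)) instead of using the sample string, ideally with a short pause between requests so you stay within NCBI's usage limits.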
It's easy enough to amend these scripts to take the tax ids as input arguments from the command line, to fit into your pipeline nicely. In fact, I've added the updated forms of these scripts to my bitbucket with the names tax_identifier.pl/py - https://bitbucket.org/gawbul/bioinformatics-scripts/src
You can easily get these scripts to pull a list of tax ids from the command line as arguments and process them in the same way!
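As a rough sketch of what the command-line handling could look like in Python (this is not the code from the repository, and the function name parse_taxids is my own), argparse can collect one or more tax IDs and keep them as strings, which is also the form that is safe to join when building a request:

```python
import argparse


def parse_taxids(argv):
    """Parse tax IDs from an argument list, keeping them as strings."""
    parser = argparse.ArgumentParser(
        description="Print the lineage tax IDs for one or more NCBI tax IDs."
    )
    parser.add_argument("taxids", nargs="+", metavar="TAXID",
                        help="NCBI taxonomy ID, e.g. 515482")
    args = parser.parse_args(argv)
    for taxid in args.taxids:
        # Reject anything that is not a plain number before contacting NCBI.
        if not taxid.isdigit():
            parser.error("tax IDs must be numeric, got %r" % taxid)
    return args.taxids


# Example with an explicit list; a real script would pass sys.argv[1:].
print(parse_taxids(["515482", "515474"]))
```

In a script you would call parse_taxids(sys.argv[1:]) and loop over the result exactly as the examples above loop over their hard-coded lists.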
In the Python script, line 12 needs the IDs surrounded by quotes: taxids = ["515482", "515474"]
Otherwise I was getting this error:
Which Python script? In my https://github.com/gawbul/bioinformatics-scripts repository? Could you submit an issue request on GitHub please?