I am building a pipeline for taxonomically identifying a BLAST result at each taxonomic level, using a bespoke reference dataset. I want it to be easily reproducible, which it is, apart from an ugly step where I have to paste a list of taxon IDs into this site: http://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi
then retrieve the output and carry on.
Does anyone know if I can get the code that is used? Clearly it's calling a CGI from somewhere, but I can't find it on the NCBI FTP site. If not, is there an equivalent? Technically, I could pull apart names.dmp and nodes.dmp myself, but there is already an NCBI tool, so I'm loath to do that.
EDIT: For the record, I have ~30,000 taxIDs to retrieve, and gawbul's script, while elegant, won't complete. I'm looking at Frederic's now for its multi-core ability.
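(In case the names.dmp/nodes.dmp route turns out to be the practical one at this scale, here is a minimal Python sketch of a purely local lookup. It assumes taxdump.tar.gz from the NCBI taxonomy FTP area has been unpacked into the working directory; the function names and output format are my own, not an NCBI tool.)

# Minimal local lineage lookup from the NCBI taxdump files (nodes.dmp, names.dmp).

def load_taxdump(nodes_path="nodes.dmp", names_path="names.dmp"):
    """Read parent, rank and scientific-name tables from the unpacked taxdump."""
    parents, ranks, names = {}, {}, {}
    with open(nodes_path) as fh:
        for line in fh:
            fields = [f.strip() for f in line.split("|")]
            parents[fields[0]] = fields[1]
            ranks[fields[0]] = fields[2]
    with open(names_path) as fh:
        for line in fh:
            fields = [f.strip() for f in line.split("|")]
            if fields[3] == "scientific name":
                names[fields[0]] = fields[1]
    return parents, ranks, names

def lineage(taxid, parents, names):
    """Walk from taxid up to the root, returning scientific names root-first."""
    path = []
    while taxid in parents and taxid != "1":
        path.append(names.get(taxid, taxid))
        taxid = parents[taxid]
    return list(reversed(path))

if __name__ == "__main__":
    parents, ranks, names = load_taxdump()
    for tid in ["515482", "515474"]:
        print("%s\t|\t%s" % (tid, "; ".join(lineage(tid, parents, names))))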
use strict;
use warnings;
use Bio::DB::Taxonomy;
use Bio::Tree::Tree;

my @taxonids = ("515482", "515474");
my @lineages = ();

# Query the NCBI Taxonomy database via Entrez
my $db = Bio::DB::Taxonomy->new(-source => 'entrez');

foreach my $taxonid (@taxonids) {
    my $taxon = $db->get_taxon(-taxonid => $taxonid);

    # Build a tree holding the taxon's full lineage and collect the taxon IDs
    # (unshift reverses the order returned by get_nodes)
    my $tree = Bio::Tree::Tree->new(-node => $taxon);
    my @taxa = $tree->get_nodes;
    my @tids = ();
    foreach my $t (@taxa) {
        unshift(@tids, $t->id());
    }
    push(@lineages, $taxonid . "\t|\t" . $taxon->scientific_name() . "\t|\t" . "@tids");
}

foreach my $lineage (@lineages) {
    print "$lineage\n";
}
It's easy enough to amend these scripts to take the tax ids as input arguments from the command line, to fit into your pipeline nicely. In fact, I've added the updated forms of these scripts to my bitbucket with the names tax_identifier.pl/py - https://bitbucket.org/gawbul/bioinformatics-scripts/src
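(As a rough illustration of that amendment in the Python version, the hard-coded ID list could be replaced with something along these lines; this is a sketch only, not the actual tax_identifier.py, and the usage message is my own.)

import sys

# Read tax IDs from the command line instead of hard-coding them,
# e.g.  python tax_identifier.py 515482 515474
taxids = sys.argv[1:]
if not taxids:
    sys.exit("usage: tax_identifier.py TAXID [TAXID ...]")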
In the Python script, line 12 needs the IDs surrounded by quotes: taxids = ["515482", "515474"]
Otherwise I was getting this error:
  File "taxa.py", line 12, in <module>
    handle = Entrez.efetch(db="taxonomy", id=taxid, mode="text", rettype="xml")
  File "/usr/lib/python2.7/site-packages/biopython-1.64-py2.7-linux-x86_64.egg/Bio/Entrez/__init__.py", line 145, in efetch
    if ids.count(",") >= 200:
AttributeError: 'int' object has no attribute 'count'
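(For context, since the Python script itself isn't shown above, a minimal sketch of an Entrez-based taxonomy fetch with the IDs quoted as strings might look like this. It is not the actual taxa.py; the email placeholder and the printed fields are my own choices, and it assumes Biopython is installed.)

from Bio import Entrez

Entrez.email = "you@example.com"  # NCBI asks for a contact address; placeholder

# Tax IDs must be strings; passing ints triggers the AttributeError above
taxids = ["515482", "515474"]

for taxid in taxids:
    handle = Entrez.efetch(db="taxonomy", id=taxid, mode="text", rettype="xml")
    records = Entrez.read(handle)
    handle.close()
    for record in records:
        print("%s\t|\t%s\t|\t%s" % (record["TaxId"], record["ScientificName"], record["Lineage"]))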
A few weeks ago, in response to a similar question, I posted two simple shell scripts: one that downloads and prepares the NCBI taxonomic data, and one that takes a list of GIs and returns their complete taxonomy. As it runs locally, I find it easier to parallelize than an efetch-based script.
while read line
do
    xsltproc --novalid stylesheet.xsl \
        "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=${line}&mode=text&report=xml"
done < mylistof_ids.txt
Which Python script? In my https://github.com/gawbul/bioinformatics-scripts repository? Could you submit an issue request on GitHub please?