Question

Looking For A Command Line Version Of Ncbi Taxonomy 'Stratification' Tool

8

Entering edit mode

12.9 years ago

Daniel ★ 4.0k

Hi all

I am building a pipeline for taxonomically identifying a blast result at each taxa level, with a bespoke reference dataset. I want it to be easily reproduceable, which it is, other than an ugly step where I have to take a list of taxonIDs and put it into the site: http://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi

retrieve the output and carry on.

Does anyone know if I can get the code that is used? Clearly its calling a cgi from somewhere but I cant find it in the ncbi ftp. If not, is there an equivalent? Technically, I could pull apart the names.dmp and nodes.dmp but there is already a ncbi tool so I'm loathed to do that.

Thanks

N.B. The function is turning:

515482
515474

into

515482  |       Nitzschia dubiiformis   |       515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |       Cocconeis stauroneiformis       |       515474 216715 216714 186023 33850 33849 2836 33634 2759 131567

EDIT: For the record, I have ~30,000 taxIDs that I am retrieving and gawbul's script, while elegant, won't complete. Im looking at Frederic's now for its multi-coring ability.

taxonomy blast pipeline ncbi • 12k views

ADD COMMENT • link updated 18 months ago by Ram 44k • written 12.9 years ago by Daniel ★ 4.0k

2

Entering edit mode

12.9 years ago

Frédéric Mahé ★ 3.2k

A few weeks ago, in response to a similar question, I posted two simple shell scripts: one that downloads an prepares the NCBI's taxonomic data, and one that takes a list of GIs and returns their complete taxonomy. As it runs locally, I find it easier to parallelize than an efetch-based script.

ADD COMMENT • link updated 5.3 years ago by Ram 44k • written 12.9 years ago by Frédéric Mahé ★ 3.2k

0

Entering edit mode

I get 404 error with that link.

ADD REPLY • link 6.4 years ago by gtrwst9 • 0

0

Entering edit mode

12.9 years ago

Pierre Lindenbaum 164k

The following XSLT stylesheet does the job:


<xsl:stylesheet xmlns:xsl="&lt;a href=" http:="" www.w3.org="" 1999="" XSL="" Transform"="" rel="nofollow">http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="TaxaSet/Taxon">
      <xsl:value-of select="TaxId"/>
      <xsl:text>  |  </xsl:text>
      <xsl:apply-templates select="ScientificName"/>
      <xsl:text>  |  </xsl:text>
      <xsl:value-of select="TaxId"/>
      <xsl:for-each select="LineageEx/Taxon">
        <xsl:sort select="position()" data-type="number" order="descending"/>
        <xsl:text> </xsl:text>
        <xsl:value-of select="TaxId"/>
      </xsl:for-each>
      <xsl:text>
</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

usage:

xsltproc --novalid  stylesheet.xsl \
   "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=515482,515474&mode=text&report=xml"

Result:

515482  |  Nitzschia dubiiformis  |  515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |  Cocconeis stauroneiformis  |  515474 216715 216714 186023 33850 33849 2836 33634 2759 13156

and for a large number of IDS:

while read line
  do
     xsltproc --novalid  stylesheet.xsl \
       "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=${line}&mode=text&report=xml"
  done < mylistof_ids.txt

ADD COMMENT • link 12.9 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

6.4 years ago

gtrwst9 • 0

See also the ETE Toolkit:

http://etetoolkit.org/docs/latest/index.html

ADD COMMENT • link 6.4 years ago by gtrwst9 • 0

Ram · Accepted Answer · 2012-01-14

You can use BioPython to access any of NCBI's Entrez databases, specifically the taxonomy database in your case - http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc116

Or you can do the same sort of thing with BioPerl's Bio::DB::Taxonomy package - http://doc.bioperl.org/bioperl-live/Bio/DB/Taxonomy.html and Bio::Tree::Tree package - http://doc.bioperl.org/bioperl-live/Bio/Tree/Tree.html

Perl example:

use strict;
use warnings;
use Bio::DB::Taxonomy;
use Bio::Tree::Tree;

my @taxonids = ("515482", "515474");
my @lineages = ();

my $db = Bio::DB::Taxonomy->new(-source => 'entrez');

foreach my $taxonid (@taxonids) {
    my $taxon = $db->get_taxon(-taxonid => $taxonid);
    my $tree = Bio::Tree::Tree->new(-node => $taxon);
    my @taxa = $tree->get_nodes;
    my @tids = ();
    foreach my $t (@taxa) {
        unshift(@tids, $t->id());
    }
    push(@lineages, $taxonid . "\t|\t" . $taxon->scientific_name() . "\t|\t" . "@tids");
}

foreach my $lineage (@lineages) {
    print "$lineage\n";
}

Outputs:

515482  |       Nitzschia dubiiformis   |       515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |       Cocconeis stauroneiformis       |       515474 216715 216714 186023 33850 33849 2836 33634 2759 131567

Python example:

#import entrez module
from Bio import Entrez

# set variables
taxids = [515482, 515474]

# set email
Entrez.email = "youremail@gmail.com"

# traverse ids
for taxid in taxids:
    handle = Entrez.efetch(db="taxonomy", id=taxid, mode="text", rettype="xml")
    records = Entrez.read(handle)
    for taxon in records:
        taxid = taxon["TaxId"]
        name = taxon["ScientificName"]
        tids = []
        for t in taxon["LineageEx"]:
            tids.insert(0, t["TaxId"])
        tids.insert(0, taxid)
        print "%s\t|\t%s\t|\t%s" % (taxid, name, " ".join(tids))

Outputs:

515482  |       Nitzschia dubiiformis   |       515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |       Cocconeis stauroneiformis       |       515474 216715 216714 186023 33850 33849 2836 33634 2759 131567

These essentially just call http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi, which you could call and parse yourself? Perhaps iterate over the ids in the list, call the url for each and output as you need?

Calling the following for example:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=515482&mode=text&report=xml

Give us the following XML output:

Which you can then parse the lineage tax ids from quite simply. Pierre gives a great example in his post, using an XSLT stylesheet.