Question

How to get correct Taxonomy UID from NCBI Taxonomy using Entrez E-utilities efetch?

8.3 years ago

JV ▴ 470

In short: how can I use Entrez esearch with STRICT search terms, so that it returns only the exact match and not similar, related hits?

I frequently get different types of sequences annotated to different taxonomic levels using "Least Common Ancestor" (LCA) Approaches (implemented by different tools). I want to convert these annotations to the respective NCBI TaxonomyIDs (if the terms are present in the Taxonomy Database).

I gathered that I could use the Entrez E-utilities for this (here via Biopython):

from Bio import Entrez
Entrez.email = "you@example.com"  # placeholder; NCBI asks for a contact address
handle = Entrez.esearch(db="taxonomy", term="<my_annotation>")
print(handle.read())
handle.close()
# --> output: XML that I can parse the TaxID from

This works for most lower taxonomic levels. For example, if I set term="Psychrobacter", I get this result:


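<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">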
<eSearchResult><Count>1</Count><RetMax>1</RetMax><RetStart>0</RetStart><IdList>
<Id>497</Id>
</IdList><TranslationSet/><TranslationStack>   <TermSet>    <Term>Psychrobacter[All Names]</Term>    <Field>All Names</Field>    <Count>1</Count>    <Explode>N</Explode>   </TermSet>   <OP>GROUP</OP>  </TranslationStack><QueryTranslation>Psychrobacter[All Names]</QueryTranslation></eSearchResult>

From this I can easily parse the TaxID "497".
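Pulling the IDs out of such a response could look roughly like this. It is a minimal sketch using only the standard library; `parse_ids` and `SAMPLE` are names made up for illustration, and the sample is a shortened copy of the response shown above.

```python
import xml.etree.ElementTree as ET
from typing import List

# Shortened copy of the esearch response shown above (illustrative only).
SAMPLE = (
    "<eSearchResult><Count>1</Count><RetMax>1</RetMax><RetStart>0</RetStart>"
    "<IdList><Id>497</Id></IdList></eSearchResult>"
)


def parse_ids(xml_text: str) -> List[str]:
    """Return every <Id> under <IdList> of an esearch XML response."""
    root = ET.fromstring(xml_text)
    return [elem.text for elem in root.findall("./IdList/Id")]


if __name__ == "__main__":
    print(parse_ids(SAMPLE))
```

With Biopython, `Entrez.read(handle)` can do the parsing instead; the IDs are then available under the `"IdList"` key of the returned record.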

However, some search terms (e.g. "Planctomycetes") yield several taxonomy IDs, corresponding to similar (but not identical) names. For term="Planctomycetes" I get this result:


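<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">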
<eSearchResult><Count>3</Count><RetMax>3</RetMax><RetStart>0</RetStart><IdList>
<Id>203683</Id>
<Id>203682</Id>
<Id>112</Id>
</IdList><TranslationSet/><TranslationStack>   <TermSet>    <Term>Planctomycetes[All Names]</Term>    <Field>All Names</Field>    <Count>3</Count>    <Explode>N</Explode>   </TermSet>   <OP>GROUP</OP>  </TranslationStack><QueryTranslation>Planctomycetes[All Names]</QueryTranslation><ErrorList><PhraseNotFound>Bacteria</PhraseNotFound></ErrorList></eSearchResult>

These are the taxonomy IDs for Planctomycetia (class), Planctomycetes (phylum) and Planctomycetales (order), respectively. How can I get the ID that corresponds exactly to my search term from such an output? Limiting the reported results to a single UID does not help: in this case that would have returned the UID for Planctomycetia, although my search term was "Planctomycetes".

Specific example:

For example, I sometimes have 16S rRNA sequences extracted from metagenome datasets (using rnammer) and classified by the "Least Common Ancestor" (LCA) approach, which gives me annotations like this:

sequence    annotation(silva)
seq1    Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Psychrobacter;
seq2    Bacteria;Actinobacteria;Acidimicrobiia;Acidimicrobiales;Sva0996 marine group;
seq3    Bacteria;Planctomycetes;Planctomycetacia;Planctomycetales;Planctomycetaceae;Blastopirellula;
seq4    Bacteria;Planctomycetes;

Note that with the LCA approach, some sequences cannot be unambiguously classified at any rank more specific than phylum (this may be unusual for 16S rRNA genes, but often occurs when classifying protein-coding genes). For each of these annotations, I want the lowest taxonomic level that is represented in the NCBI taxonomy database, converted into the respective NCBI taxonomy ID.

In the above example, I would want the following output:

sequence    TaxonomyID
seq1    497
seq2    84993
seq3    126
seq4    203682

How do I get that?

efetch ncbi biopython Entrez E-utilities • 6.1k views

updated 8.3 years ago by piet ★ 1.9k • written 8.3 years ago by JV ▴ 470

score 1 · Answer 1 · 2016-07-22

The LCA yields a list of taxon names for each query, separated by semicolons. Now you want to know which names in the list are valid taxon names in the NCBI taxonomy database.

Start with the last name in the list and send an Entrez query. Restrict the search to the field "Scientific Name" in order to get only a single hit or no hit:

term = '"Sva0996 marine group"[Scientific Name]'
term = '"Acidimicrobiales"[Scientific Name]'

If you get no hit, continue with the previous name in the list until you find a name that exactly matches a taxon in the taxonomy database. If a name has more than one word, enclose it in double quotes. You can test your queries in a web browser before using them in a program.

https://www.ncbi.nlm.nih.gov/taxonomy/?term=%22Sva0996+marine+group%22%5BScientific+Name%5D

https://www.ncbi.nlm.nih.gov/taxonomy/?term=%22Acidimicrobiales%22%5BScientific+Name%5D
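The walk from the last name back towards the root could be sketched as follows. This is only an illustration under some assumptions: `build_term`, `split_lineage`, `lowest_taxid` and `entrez_search` are hypothetical helper names, the e-mail address is a placeholder, and the search function is passed in as a parameter so that the lineage logic can be tried without network access.

```python
from typing import Callable, List, Optional, Tuple


def build_term(name: str) -> str:
    """Quote a taxon name and restrict the query to the Scientific Name field."""
    return '"{}"[Scientific Name]'.format(name.replace('"', ""))


def split_lineage(annotation: str) -> List[str]:
    """Split a semicolon-separated lineage, dropping empty entries."""
    return [part.strip() for part in annotation.split(";") if part.strip()]


def lowest_taxid(
    annotation: str, search: Callable[[str], List[str]]
) -> Optional[Tuple[str, List[str]]]:
    """Try names from the most specific to the least specific one.

    Returns (name, ids) for the first name with at least one hit, or None
    if no name in the lineage matches.
    """
    for name in reversed(split_lineage(annotation)):
        ids = search(build_term(name))
        if ids:
            return name, ids
    return None


def entrez_search(term: str) -> List[str]:
    """Query the taxonomy database through Biopython (needs network access)."""
    # Imported here so the helpers above also work without Biopython installed.
    from Bio import Entrez

    Entrez.email = "you@example.com"  # placeholder; use your own address
    handle = Entrez.esearch(db="taxonomy", term=term)
    record = Entrez.read(handle)
    handle.close()
    return list(record["IdList"])
```

A call such as `lowest_taxid("Bacteria;Planctomycetes;", entrez_search)` would then query NCBI once per name until something matches. If the returned list ever holds more than one ID, the name is ambiguous and needs a manual check. For many sequences, it is worth caching results per name and respecting NCBI's request rate limits.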