In short: how can I use Entrez esearch with STRICT search terms, so that it returns only the exact match and not similar, related hits?
I frequently get different types of sequences annotated to different taxonomic levels using "Least Common Ancestor" (LCA) approaches (implemented by different tools). I want to convert these annotations to the respective NCBI Taxonomy IDs (if the terms are present in the Taxonomy database).
I gathered that I could use the Entrez E-utilities for this (here via Biopython):
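from Bio import Entrez
Entrez.email = "your.name@example.org"  # placeholder address; NCBI asks clients to identify themselves
handle = Entrez.esearch(db="taxonomy", term="<my_annotation>")
print(handle.read())
handle.close()
# --> output is XML from which I can parse the TaxID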
This works for most lower taxonomic levels.
For example, if I set term="Psychrobacter",
I get this result:
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult><Count>1</Count><RetMax>1</RetMax><RetStart>0</RetStart><IdList>
<Id>497</Id>
</IdList><TranslationSet/><TranslationStack> <TermSet> <Term>Psychrobacter[All Names]</Term> <Field>All Names</Field> <Count>1</Count> <Explode>N</Explode> </TermSet> <OP>GROUP</OP> </TranslationStack><QueryTranslation>Psychrobacter[All Names]</QueryTranslation></eSearchResult>
from which I can easily parse the TaxID "497".
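For reference, the parsing step can be sketched with the standard library alone. This is a minimal sketch, not the only way to do it: the helper name `extract_taxids` is made up for illustration, and the sample string is a trimmed copy of the result above. With Biopython, `Entrez.read(handle)["IdList"]` should give the same list without manual XML handling.

```python
import xml.etree.ElementTree as ET


def extract_taxids(esearch_xml):
    """Return the text of every <Id> inside <IdList>, in document order."""
    root = ET.fromstring(esearch_xml)  # root element is <eSearchResult>
    return [id_node.text for id_node in root.findall("./IdList/Id")]


# Trimmed copy of the Psychrobacter result shown above.
example = (
    "<eSearchResult><Count>1</Count><RetMax>1</RetMax><RetStart>0</RetStart>"
    "<IdList><Id>497</Id></IdList></eSearchResult>"
)
print(extract_taxids(example))  # ['497']
```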
HOWEVER, some search terms (e.g. "Planctomycetes") yield several different Taxonomy IDs, corresponding to similar (but not identical) terms. So for term="Planctomycetes"
I get this result:
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult><Count>3</Count><RetMax>3</RetMax><RetStart>0</RetStart><IdList>
<Id>203683</Id>
<Id>203682</Id>
<Id>112</Id>
</IdList><TranslationSet/><TranslationStack> <TermSet> <Term>Planctomycetes[All Names]</Term> <Field>All Names</Field> <Count>3</Count> <Explode>N</Explode> </TermSet> <OP>GROUP</OP> </TranslationStack><QueryTranslation>Planctomycetes[All Names]</QueryTranslation><ErrorList><PhraseNotFound>Bacteria</PhraseNotFound></ErrorList></eSearchResult>
These are the Taxonomy IDs for Planctomycetia (class), Planctomycetes (phylum) and Planctomycetales (order), respectively. How can I get the correct Taxonomy ID corresponding to my exact search term from such an output? Limiting the reported results to a single UID does not help: in this case it would have returned the UID for Planctomycetia, although my search term was "Planctomycetes".
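One thing that might be worth trying is to restrict the term to a single search field instead of the default "All Names" seen in the QueryTranslation above. The sketch below assumes the taxonomy database accepts a `[Scientific Name]` field tag; the helper names `strict_term` and `strict_taxid_search` are hypothetical, and whether this really returns exactly one ID for "Planctomycetes" has not been verified here.

```python
def strict_term(name, field="Scientific Name"):
    """Build a field-restricted query such as '"Planctomycetes"[Scientific Name]'."""
    cleaned = name.strip().replace('"', "")  # drop stray quotes so the phrase stays intact
    return f'"{cleaned}"[{field}]'


def strict_taxid_search(name, email):
    """Run esearch with the restricted term (needs Biopython and network access)."""
    from Bio import Entrez  # imported lazily so strict_term() works without Biopython

    Entrez.email = email  # NCBI asks clients to identify themselves
    handle = Entrez.esearch(db="taxonomy", term=strict_term(name))
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return list(record["IdList"])


print(strict_term("Planctomycetes"))  # "Planctomycetes"[Scientific Name]
```

A field-restricted search would presumably miss annotations that NCBI only knows as synonyms, so a fallback to the unrestricted search may still be needed for names that return nothing.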
Specific example:
For example, I sometimes have 16S rRNA sequences extracted from metagenome datasets (using RNAmmer) and classified by the "Least Common Ancestor" (LCA) approach, which gives me annotations like this:
sequence annotation (SILVA)
seq1 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Psychrobacter;
seq2 Bacteria;Actinobacteria;Acidimicrobiia;Acidimicrobiales;Sva0996 marine group;
seq3 Bacteria;Planctomycetes;Planctomycetacia;Planctomycetales;Planctomycetaceae;Blastopirellula;
seq4 Bacteria;Planctomycetes;
Note that with the LCA approach, some sequences may not be unambiguously classified beyond phylum level (this may be unusual for 16S rRNA genes, but often occurs when classifying protein-coding genes). For each of these annotations, I want the lowest taxonomic level that is represented in the NCBI Taxonomy database, converted into the corresponding NCBI Taxonomy ID.
In the above example, I would want the following output:
sequence TaxonomyID
seq1 497
seq2 84993
seq3 126
seq4 203682
How do I get that?
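To make the desired behaviour concrete, the "lowest level present in NCBI" rule can be sketched as a walk from the last name in the lineage back towards the root, stopping at the first name that resolves. This is only an illustration under assumptions: `lowest_resolvable` is a made-up helper, `lookup` stands for any function that maps a name to a TaxID or `None` (for example an esearch wrapper), and the small `known` dictionary is a stand-in built from the two IDs quoted above.

```python
def lowest_resolvable(lineage, lookup):
    """Return (name, taxid) for the deepest lineage entry that lookup() resolves."""
    # Split "A;B;C;" into ["A", "B", "C"], ignoring the empty trailing field.
    names = [part.strip() for part in lineage.split(";") if part.strip()]
    for name in reversed(names):  # deepest rank first
        taxid = lookup(name)
        if taxid is not None:
            return name, taxid
    return None  # nothing in the lineage could be resolved


# Stand-in for a real NCBI lookup, using only IDs quoted in this post.
known = {"Psychrobacter": "497", "Planctomycetes": "203682"}

seq1 = "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Psychrobacter;"
seq4 = "Bacteria;Planctomycetes;"
print(lowest_resolvable(seq1, known.get))  # ('Psychrobacter', '497')
print(lowest_resolvable(seq4, known.get))  # ('Planctomycetes', '203682')
```

Passing the lookup in as a function keeps the lineage walk independent of how names are resolved, so the same loop works with a cached table or with live E-utilities calls.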