Obtain full taxonomic hierarchy from multiple species names
9.1 years ago
fibar ▴ 90

I have assembled metagenomic contigs with multiple annotations per contig, and the species name assigned to each annotation varies within a contig. That looks suspicious, I know, but the species generally belong to the same family. Assuming this reflects the known difficulty of assigning sequences at the species level:

Is it possible to get the full taxonomy (genus, family, order, class, phylum) for a list of species names? That would allow me to cluster annotations at a higher taxonomic rank.

Example: Bacteroides thetaiotaomicron belongs to (Phylum)Bacteroidetes;(Class)Bacteroidia;(Order)Bacteroidales;(Family)Bacteroidaceae

Any additional comment or question is welcome!

taxonomy species contigs metagenomics mg-rast • 7.7k views
9.1 years ago
jhc ★ 3.0k

Using ete-ncbiquery you could do something like:

$ ete ncbiquery --search 9606 'Canis familiaris' --info
# Taxid Sci.Name Rank Named Lineage Taxid Lineage 

9606 Homo sapiens species root,cellular organisms,Eukaryota,Opisthokonta,Metazoa,Eumetazoa,Bilateria,Deuterostomia,Chordata,Craniata,Vertebrata,Gnathostomata,Teleostomi,Euteleostomi,Sarcopterygii,Dipnotetrapodomorpha,Tetrapoda,Amniota,Mammalia,Theria,Eutheria,Boreoeutheria,Euarchontoglires,Primates,Haplorrhini,Simiiformes,Catarrhini,Hominoidea,Hominidae,Homininae,Homo,Homo sapiens 1,131567,2759,33154,33208,6072,33213,33511,7711,89593,7742,7776,117570,117571,8287,1338369,32523,32524,40674,32525,9347,1437010,314146,9443,376913,314293,9526,314295,9604,207598,9605,9606 

9615 Canis lupus familiaris subspecies root,cellular organisms,Eukaryota,Opisthokonta,Metazoa,Eumetazoa,Bilateria,Deuterostomia,Chordata,Craniata,Vertebrata,Gnathostomata,Teleostomi,Euteleostomi,Sarcopterygii,Dipnotetrapodomorpha,Tetrapoda,Amniota,Mammalia,Theria,Eutheria,Boreoeutheria,Laurasiatheria,Carnivora,Caniformia,Canidae,Canis,Canis lupus,Canis lupus familiaris 1,131567,2759,33154,33208,6072,33213,33511,7711,89593,7742,7776,117570,117571,8287,1338369,32523,32524,40674,32525,9347,1437010,314145,33554,379584,9608,9611,9612,9615

Search terms can also be piped in from a file:

cut -f1 species.txt | ete ncbiquery  --info
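
If the species names still sit in an annotation table, something like the following could produce a de-duplicated species.txt first. This is only a sketch: the file name annotations.tsv and its layout (contig in column 1, species name in column 2, tab-separated) are assumptions for illustration, not something stated in the question.

```sh
# Hypothetical annotation table: contig <TAB> species name (assumed layout)
printf 'contig1\tBacteroides thetaiotaomicron\n' > annotations.tsv
printf 'contig1\tBacteroides fragilis\n' >> annotations.tsv
printf 'contig2\tBacteroides thetaiotaomicron\n' >> annotations.tsv

# Keep only the species column and collapse duplicates, one name per line
cut -f2 annotations.tsv | sort -u > species.txt

cat species.txt
```

The resulting species.txt has one name per line, so it can be fed to the ncbiquery pipe shown above without querying the same species twice.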
9.1 years ago
piet ★ 1.9k

The NCBI Taxonomy database contains this information; it calls it the lineage of a taxon. You can query the database with E-utilities and retrieve the results as XML, which in this case is quite human-readable.

wget 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=818&retmode=xml' -O - | less

Note that every node in the taxonomy database has a rank, such as 'species', 'genus' or 'order'.


If you want to focus on the main taxonomic ranks, i.e. superkingdom, kingdom, phylum, class, order, family, genus and species, you can do the following:

efetch -db taxonomy \
       -id 9606 \
       -format xml \
       | xtract -pattern Taxon \
                -tab '\n' -sep '\t' \
                -element TaxId,ScientificName \
                -division LineageEx \
                -group Taxon \
                -if Rank -equals superkingdom \
                -or Rank -equals kingdom \
                -or Rank -equals phylum \
                -or Rank -equals class \
                -or Rank -equals order \
                -or Rank -equals family \
                -or Rank -equals genus \
                -tab '\n' -sep '\t' \
                -element Rank,ScientificName

Which yields:

9606    Homo sapiens
superkingdom    Eukaryota
kingdom Metazoa
phylum  Chordata
class   Mammalia
order   Primates
family  Hominidae
genus   Homo

PS: The full documentation for xtract is here: https://dataguide.nlm.nih.gov/edirect/xtract.html
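
To get the one-line format used in the question, the rank/name pairs can be folded together with awk. This is a rough sketch: the sample data is the output above pasted in rather than a live efetch call, and it assumes the two columns are tab-separated. The variable names are made up for the example.

```sh
# Sample data: the xtract output shown above, tab-separated (no network call)
lineage_tsv=$(printf '%s\t%s\n' \
    9606 'Homo sapiens' \
    superkingdom Eukaryota \
    kingdom Metazoa \
    phylum Chordata \
    class Mammalia \
    order Primates \
    family Hominidae \
    genus Homo)

# The first line is taxid + species name; every later line is rank + name.
# Fold the later lines into "(rank)Name" items joined by ";".
one_line=$(printf '%s\n' "$lineage_tsv" | awk -F'\t' '
    NR == 1 { species = $2; next }
    { out = out sep "(" $1 ")" $2; sep = ";" }
    END { print species "\t" out }')

printf '%s\n' "$one_line"
```

In a real run the sample would be replaced by the output of the efetch | xtract pipeline, one taxid at a time.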

9.1 years ago

One-liner using xpath:

echo "Bacteroides thetaiotaomicron" | tr " " "+" |
    while read -r QUERY; do
        # Resolve the name to the first matching taxid; curl fetches, xmllint reads stdin (-)
        (curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=taxonomy&term=${QUERY}" |
            xmllint --xpath '/eSearchResult/IdList/Id[1]/text()' - && echo) |
        while read -r ID; do
            [ -n "$ID" ] || continue
            # Join genus, class, order and family of that taxon with "|"
            curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=${ID}&retmode=xml" |
            xmllint --xpath \
                'concat(/TaxaSet/Taxon/LineageEx/Taxon[Rank/text() = "genus"]/ScientificName/text(),"|",
                        /TaxaSet/Taxon/LineageEx/Taxon[Rank/text() = "class"]/ScientificName/text(),"|",
                        /TaxaSet/Taxon/LineageEx/Taxon[Rank/text() = "order"]/ScientificName/text(),"|",
                        /TaxaSet/Taxon/LineageEx/Taxon[Rank/text() = "family"]/ScientificName/text())' - \
            && echo
        done
    done

output:

Bacteroides|Bacteroidia|Bacteroidales|Bacteroidaceae
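
Once every species has a pipe-delimited lineage like this, clustering annotations at family rank comes down to a lookup and a count. Below is a minimal sketch under stated assumptions: lineage.tsv and contig_annotations.tsv are hypothetical files, the contig names and rows are made-up sample data, and the field order (genus|class|order|family) follows the output above.

```sh
# Hypothetical lookup table: species <TAB> genus|class|order|family
printf '%s\t%s\n' \
    'Bacteroides thetaiotaomicron' 'Bacteroides|Bacteroidia|Bacteroidales|Bacteroidaceae' \
    'Bacteroides fragilis' 'Bacteroides|Bacteroidia|Bacteroidales|Bacteroidaceae' \
    'Escherichia coli' 'Escherichia|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae' \
    > lineage.tsv

# Hypothetical annotations: contig <TAB> species (made-up sample rows)
printf '%s\t%s\n' \
    contig1 'Bacteroides thetaiotaomicron' \
    contig1 'Bacteroides fragilis' \
    contig2 'Escherichia coli' \
    > contig_annotations.tsv

# Pass 1 (lineage.tsv): remember the family (4th "|" field) of each species.
# Pass 2 (annotations): count annotations per contig and family.
family_counts=$(awk -F'\t' '
    NR == FNR { split($2, rank, "|"); family[$1] = rank[4]; next }
    { fam = ($2 in family) ? family[$2] : "unassigned"; count[$1 "\t" fam]++ }
    END { for (key in count) print key "\t" count[key] }
' lineage.tsv contig_annotations.tsv | sort)

printf '%s\n' "$family_counts"
```

Species missing from the lookup are counted under "unassigned" rather than dropped, so gaps in the taxonomy queries stay visible.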
9.1 years ago
5heikki 11k

Entrez Direct:

esearch -db taxonomy -query "Homo sapiens[Scientific Name]" | efetch -format xml | xtract -element Lineage

cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Boreoeutheria; Euarchontoglires; Primates; Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae; Homininae; Homo
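
For a whole list of species names, the same command could be wrapped in a loop. The sketch below only defines two helper functions (taxonomy_query and lineages_for_file are hypothetical names, not part of Entrez Direct) and makes no network call on its own. It assumes esearch, efetch and xtract are on the PATH, and that each name matches a single taxon.

```sh
# Build the esearch query for one species name (same field tag as above)
taxonomy_query() {
    printf '%s[Scientific Name]' "$1"
}

# Print "species <TAB> lineage" for every name in a file, one name per line.
# Needs Entrez Direct on PATH and network access.
lineages_for_file() {
    while IFS= read -r species; do
        [ -n "$species" ] || continue
        # esearch gets /dev/null as stdin so it cannot swallow the rest of the name list
        lineage=$(esearch -db taxonomy -query "$(taxonomy_query "$species")" < /dev/null |
                  efetch -format xml | xtract -element Lineage)
        printf '%s\t%s\n' "$species" "$lineage"
    done < "$1"
}

# Usage (not run here): lineages_for_file species.txt > lineages.tsv
```

The Lineage string carries names only, without ranks, so picking out the family still needs a rank-aware query such as the LineageEx ones shown earlier in the thread.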


Where can I download esearch, efetch and xtract?


Documentation is here and ftp dir here.
