Parsing NCBI Taxonomic Tree?
13.1 years ago
Prohan ▴ 350

Hi, I'd like to assign taxonomies to some of my BLAST hits against NR, for which I have the GIs.

I've figured that the way to do this is by traversing the files in: ftp://ftp.ncbi.nih.gov/pub/taxonomy - specifically: gi_taxid_prot.dmp and taxdmp

Does anyone have any hints on how to do this? I basically don't understand how to parse the actual tree. I'm planning on doing this in Python.

Thanks

ncbi taxonomy python tree • 27k views

Is there any way to use this script with a list of organism names instead of GIs or taxids?

13.1 years ago

A couple months ago I wrote a short shell script that does the job:

#!/bin/bash

NAMES="names.dmp"
NODES="nodes.dmp"
GI_TO_TAXID="gi_taxid_nucl.dmp"
TAXONOMY=""
GI="${1}"

# Obtain the name corresponding to a taxid or the taxid of the parent taxa
get_name_or_taxid()
{
    grep --max-count=1 "^${1}"$'\t' "${2}" | cut --fields="${3}"
}

# Get the taxid corresponding to the GI number
TAXID=$(get_name_or_taxid "${GI}" "${GI_TO_TAXID}" "2")

# Loop until you reach the root of the taxonomy (i.e. taxid = 1)
while [[ "${TAXID}" -gt 1 ]] ; do
    # Obtain the scientific name corresponding to a taxid
    NAME=$(get_name_or_taxid "${TAXID}" "${NAMES}" "3")
    # Obtain the parent taxa taxid
    PARENT=$(get_name_or_taxid "${TAXID}" "${NODES}" "3")
    # Build the taxonomy path
    TAXONOMY="${NAME};${TAXONOMY}"
    TAXID="${PARENT}"
done

echo -e "${GI}\t${TAXONOMY}"

exit 0

For instance, if you have a table of blast results:

cut -d "|" -f 2 myblast.table | sed -e '/^$/d' | grep -v "^#" | while read GI ; do bash get_ncbi_taxonomy.sh "$GI" ; done

It is not very fast, but it can be easily parallelized:

xargs --arg-file=GI.list --max-procs=8 -I '{}' bash get_ncbi_taxonomy.sh '{}'

With 8 cores, you can process 500-1000 GIs per minute. If you have tens or hundreds of thousands of GIs, it would be more efficient to index everything (in a Python dictionary, for example).
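
As a rough illustration of the dictionary idea, here is a minimal Python sketch. It is not part of the shell scripts above: the function names (load_parents, load_scientific_names, lineage) are made up for this example, and it assumes the standard taxdump layout in which fields are separated by tab, pipe, tab. Both files are read once, so each lookup afterwards only walks a dictionary.

```python
def load_parents(nodes_path):
    """Map each taxid to its parent taxid, read from nodes.dmp."""
    parents = {}
    with open(nodes_path) as handle:
        for line in handle:
            fields = line.split("\t|\t")
            parents[int(fields[0])] = int(fields[1])
    return parents


def load_scientific_names(names_path):
    """Map each taxid to its scientific name, read from names.dmp."""
    names = {}
    with open(names_path) as handle:
        for line in handle:
            # Drop the trailing "\t|\n" before splitting into fields.
            fields = line.rstrip("\t|\n").split("\t|\t")
            if fields[3] == "scientific name":
                names[int(fields[0])] = fields[1]
    return names


def lineage(taxid, parents, names):
    """Return the names from just below the root down to taxid."""
    path = []
    # Stop at the root (taxid 1) or at a taxid missing from nodes.dmp.
    while taxid in parents and taxid != 1:
        path.append(names.get(taxid, str(taxid)))
        taxid = parents[taxid]
    return list(reversed(path))
```

Typical use would be to call load_parents("nodes.dmp") and load_scientific_names("names.dmp") once, then ";".join(lineage(taxid, parents, names)) for every taxid of interest. Like the shell loop, the sketch leaves the root itself out of the path.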

There is also a companion script that downloads and prepares NCBI's files:

#!/bin/bash

## Download NCBI's taxonomic data and GI (GenInfo Identifier) taxonomic
## assignments.

## Variables
NCBI="ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/"
TAXDUMP="taxdump.tar.gz"
TAXID="gi_taxid_nucl.dmp.gz"
NAMES="names.dmp"
NODES="nodes.dmp"
DMP=$(echo {citations,division,gencode,merged,delnodes}.dmp)
USELESS_FILES="${TAXDUMP} ${DMP} gc.prt readme.txt"

## Download taxdump
rm -rf ${USELESS_FILES} "${NODES}" "${NAMES}"
wget "${NCBI}${TAXDUMP}" && \
    tar zxvf "${TAXDUMP}" && \
    rm -rf ${USELESS_FILES}

## Limit search space to scientific names
grep "scientific name" "${NAMES}" > "${NAMES/.dmp/_reduced.dmp}" && \
    rm -f "${NAMES}" && \
    mv "${NAMES/.dmp/_reduced.dmp}" "${NAMES}"

## Download gi_taxid_nucl
rm -f "${TAXID/.gz/}"*
wget "${NCBI}${TAXID}" && \
    gunzip "${TAXID}"

exit 0

Impressive use of Bash and xargs there! But re-grepping the nodes file is not scalable, as you state.


If you mean that multiplying concurrent accesses to the same file does not scale, you're right. For a very large number of GI requests, it would be better to turn the nodes and names files into an indexed database (SQLite or a pickled Python object). But for my level of use, these shell scripts are more than enough.
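
For readers who want to try the SQLite route, the following is a minimal sketch using Python's standard sqlite3 module. The table layout (a single taxa table) and the function names build_index and lineage_from_db are assumptions made for this example, not something the shell scripts provide. It assumes the usual tab, pipe, tab field separator of the taxdump files.

```python
import sqlite3


def build_index(nodes_path, names_path, db_path):
    """Load nodes.dmp and names.dmp into one indexed SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS taxa "
        "(taxid INTEGER PRIMARY KEY, parent INTEGER, name TEXT)"
    )
    with open(nodes_path) as handle:
        rows = (
            (int(f[0]), int(f[1]))
            for f in (line.split("\t|\t") for line in handle)
        )
        conn.executemany(
            "INSERT OR REPLACE INTO taxa (taxid, parent) VALUES (?, ?)", rows
        )
    with open(names_path) as handle:
        for line in handle:
            f = line.rstrip("\t|\n").split("\t|\t")
            # Only scientific names are stored; synonyms are skipped.
            if f[3] == "scientific name":
                conn.execute(
                    "UPDATE taxa SET name = ? WHERE taxid = ?",
                    (f[1], int(f[0])),
                )
    conn.commit()
    return conn


def lineage_from_db(conn, taxid):
    """Walk from taxid up to the root (taxid 1), one indexed query per step."""
    path = []
    while taxid != 1:
        row = conn.execute(
            "SELECT parent, name FROM taxa WHERE taxid = ?", (taxid,)
        ).fetchone()
        if row is None:  # unknown taxid: stop instead of looping forever
            break
        parent, name = row
        path.append(name or str(taxid))
        taxid = parent
    return path[::-1]
```

The database only has to be built once; later runs can reopen the same file with sqlite3.connect and query it directly, which avoids re-reading the dump files for every GI.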


This is really impressive bash scripting. It seems to work great for me. Now just need to understand how it works! Thanks a ton.


very useful information!


[SOLVED] Great work, thanks a lot. I have been testing it and I've found a disturbing behavior. Because get_name_or_taxid() returns the first line that matches its first argument, it may sometimes pull synonyms or misspellings from names.dmp.


I never had that problem. I reviewed and updated the above code, but I don't think it would solve your problem. Could you please give an example of a problematic GI?


Yes. I was trying GI 115495057, using gi_taxid_prot.dmp instead of gi_taxid_nucl.dmp. The output of your script is:

115495057 biota; Eucarya; Fungi/Metazoa group; Animalia; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Boreoeutheria; Laurasiatheria; Cetartiodactyla; Artiodactyla; Pecora; Bovidae; Bovinae; Bos; Bos Tauurus;

While the lineage for cattle (taxonomy ID 9913) at NCBI's taxonomy browser is:

cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Boreoeutheria; Laurasiatheria; Cetartiodactyla; Ruminantia; Pecora; Bovidae; Bovinae; Bos; Bos taurus

Some of the taxonomic categories are labelled as synonyms or misspellings in names.dmp, and the results I get seem to be the first occurrence in the list, regardless of its status.


I just tried with the GI 115495057, and the output is correct:

115495057 cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Boreoeutheria; Laurasiatheria; Cetartiodactyla; Ruminantia; Pecora; Bovidae; Bovinae; Bos; Bos taurus

Did you apply the companion script that reduces names.dmp to only scientific names? That operation gets rid of all synonyms and misspellings.

# limit search space to scientific names
NAMES="names.dmp"
grep "scientific name" ${NAMES} > ${NAMES/.dmp/_reduced.dmp}
mv ${NAMES/.dmp/_reduced.dmp} ${NAMES}
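
Since the original question mentions Python, here is a hedged Python equivalent of that reduction step. The function name reduce_names is invented for this sketch. Unlike the grep above, which matches the text anywhere on the line, it compares the fourth field (the name class) exactly, and it writes to a separate output file instead of overwriting names.dmp.

```python
def reduce_names(src_path, dst_path):
    """Copy only the 'scientific name' rows of names.dmp to dst_path.

    Returns the number of rows kept.
    """
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            # Fields are separated by tab, pipe, tab; the line ends in "\t|".
            fields = line.rstrip("\t|\n").split("\t|\t")
            if len(fields) >= 4 and fields[3] == "scientific name":
                dst.write(line)
                kept += 1
    return kept
```

After this step each taxid should have a single row left, so taking the first match for a taxid no longer risks returning a synonym or misspelling.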

My bad... as I had downloaded the taxdump files already, I stopped reading your post after "There is also a companion script that downloads NCBI's files" and I didn't notice the step to search only for scientific names. It is working fine for me now. Again, thanks for the script.


Hi, I executed the above "get_ncbi_taxonomy.sh" and got an error. Am I missing something?

myblast.table contained the following data:

gi|472256744|
gi|461490773|
gi|71143482|
gi|461490773|

I got the following error:

raghul@raghul-Studio-1749:~/db/tax-dump$ cut -d "|" -f 2 myblast.table | sed -e '/^$/d' | grep -v "^#" | while read GI ; do bash get_ncbi_taxonomy.sh "$GI" ; done
tax: line 19: [: : integer expression expected 472256744
get_ncbi_taxonomy.sh: line 19: [: : integer expression expected 461490773
get_ncbi_taxonomy.sh: line 19: [: : integer expression expected 71143482
get_ncbi_taxonomy.sh: line 19: [: : integer expression expected 461490773

Thank you,
Raghul


Hi Raghul, this is a bug caused by the way tab characters were matched. I corrected the script (using $'\t') to avoid that.


Hello, the script doesn't work for me. It appears that the GI is correctly returned, but not the TAXONOMY. All ncbi files were downloaded according to the companion script.

kschoonv@molfyl2:~> cut -d "|" -f 2 2blastx | sed -e '/^$/d' | grep -v "^#" | while read GI ; do bash get_ncbi_taxonomy.sh "$GI" ; done
802670096
kschoonv@molfyl2:~>

What's going on here?


Hello,

I've just tried the scripts and they work correctly (besides a change in NCBI's FTP URL: I updated the companion script). It seems that you are using protein GIs as queries. You need to replace gi_taxid_nucl.dmp with gi_taxid_prot.dmp.

13.1 years ago
jhc ★ 3.0k

If you just want to link GIs to taxon names, parse the "gi_taxid_prot.dmp" to extract the taxids of your hits, and translate them to scientific names using the "names.dmp" file included in "taxdump.tar.gz".
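
A minimal Python sketch of that two-step lookup is given below. The function names (taxids_for_gis, names_for_taxids) are hypothetical, and the sketch assumes that gi_taxid_prot.dmp holds two whitespace-separated columns (GI, taxid) and that names.dmp uses the usual tab, pipe, tab separator. The GI file is large, so it is streamed once and only the requested GIs are kept in memory.

```python
def taxids_for_gis(gi_taxid_path, wanted_gis):
    """Return {gi: taxid} for the requested GIs found in the dump file."""
    wanted = set(int(gi) for gi in wanted_gis)
    found = {}
    with open(gi_taxid_path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            gi = int(parts[0])
            if gi in wanted:
                found[gi] = int(parts[1])
                if len(found) == len(wanted):
                    break  # every requested GI has been seen
    return found


def names_for_taxids(names_path, taxids):
    """Return {taxid: scientific name} for the requested taxids."""
    wanted = set(taxids)
    result = {}
    with open(names_path) as handle:
        for line in handle:
            fields = line.rstrip("\t|\n").split("\t|\t")
            taxid = int(fields[0])
            if taxid in wanted and fields[3] == "scientific name":
                result[taxid] = fields[1]
    return result
```

For example, taxids_for_gis("gi_taxid_prot.dmp", gis) followed by names_for_taxids("names.dmp", set(result.values())) gives the names for a list of BLAST hits without building the full tree.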

If you are also interested in getting the taxonomy tree of the selected species, you will need to parse the parent-child relationships in "nodes.dmp". For this, you could use the ETE Python toolkit to load the whole NCBI taxonomy tree, and then prune it to the selected taxa. Actually, there is an example showing how to do exactly that.

P.S. I would recommend using the latest ETE version (ete2a1). Some functions are still beta, but the pruning and traversing methods are much faster when dealing with such huge (>500k nodes) trees.

UPDATE!: ete2a1 is no longer maintained, use the main branch "ete2". I have also uploaded to github the basic script that I usually use to query the NCBI taxonomy tree (https://github.com/jhcepas/ncbi_taxonomy).


This is a very nice tool! It also generates a tabular file containing the information of the hierarchy of the taxonomy of each species that might be used in additional analyses.

9.4 years ago
jhc ★ 3.0k

The ETE toolkit (v2.3+) makes it easy to query the NCBI taxonomy database. You can dump annotated trees by querying with taxids or species names, or get extended taxa information. There is an API and a command line tool available.

13.1 years ago

One way would be to parse the nodes.dmp file and keep track of the tree in Python. If you only have a fixed set of taxon ids, you could also paste them into iTOL and use the resulting tree with a Newick parser. Lastly, you could try my fork of the Google Code taxonomy repository. This needs more set-up (SQLAlchemy and a parsed NCBI taxonomy), but then is faster for repeated queries.
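
To make the first option concrete, here is a small Python sketch of keeping the tree in memory as a parent-to-children map built from nodes.dmp. The names load_children and descendants are made up for this example, and it assumes the standard tab, pipe, tab field separator. It is only one possible representation; a full tree library would offer much more.

```python
from collections import defaultdict


def load_children(nodes_path):
    """Map each taxid to the list of its direct child taxids."""
    children = defaultdict(list)
    with open(nodes_path) as handle:
        for line in handle:
            fields = line.split("\t|\t")
            taxid, parent = int(fields[0]), int(fields[1])
            if taxid != parent:  # the root lists itself as its own parent
                children[parent].append(taxid)
    return children


def descendants(taxid, children):
    """Return every taxid below the given node (iterative, no recursion)."""
    found = []
    stack = [taxid]
    while stack:
        node = stack.pop()
        # .get avoids creating empty entries for leaves in the defaultdict.
        for child in children.get(node, []):
            found.append(child)
            stack.append(child)
    return found
```

An explicit stack is used instead of recursion because some lineages are deep enough that a recursive walk could come close to Python's default recursion limit.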

7.8 years ago
-_- ★ 1.1k

The whole NCBI taxonomy database is not that big. I have written some code to convert NCBI taxdump into lineages identified by tax ids, https://github.com/zyxue/ncbitax2lin. You may find it useful.
