A couple of months ago I wrote a short shell script that does the job:
#!/bin/bash

NAMES="names.dmp"
NODES="nodes.dmp"
GI_TO_TAXID="gi_taxid_nucl.dmp"
TAXONOMY=""
GI="${1}"

# Obtain the name corresponding to a taxid, or the taxid of the parent taxon
get_name_or_taxid()
{
    grep --max-count=1 "^${1}"$'\t' "${2}" | cut --fields="${3}"
}

# Get the taxid corresponding to the GI number
TAXID=$(get_name_or_taxid "${GI}" "${GI_TO_TAXID}" "2")

# Loop until the root of the taxonomy is reached (i.e. taxid = 1)
while [[ "${TAXID}" -gt 1 ]] ; do
    # Obtain the scientific name corresponding to the taxid
    NAME=$(get_name_or_taxid "${TAXID}" "${NAMES}" "3")
    # Obtain the taxid of the parent taxon
    PARENT=$(get_name_or_taxid "${TAXID}" "${NODES}" "3")
    # Build the taxonomy path
    TAXONOMY="${NAME};${TAXONOMY}"
    TAXID="${PARENT}"
done

echo -e "${GI}\t${TAXONOMY}"

exit 0
For instance, if you have a table of blast results whose hit identifiers follow NCBI's gi|<GI>|... format, you can extract the GI numbers and look up a taxonomy for each:
cut -d "|" -f 2 myblast.table | sed -e '/^$/d' | grep -v "^#" | while read GI ; do bash get_ncbi_taxonomy.sh "$GI" ; done
It is not very fast, but it can easily be parallelized:
xargs --arg-file=GI.list --max-procs=8 -I '{}' bash get_ncbi_taxonomy.sh '{}'
With 8 cores, you can process 500-1,000 GIs per minute. If you have tens or hundreds of thousands of GIs, it would be more efficient to index everything (a python dictionary, for instance).
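As a minimal sketch of that indexing idea (staying in shell rather than python; the file names are illustrative, and the tiny inline mapping stands in for the real gi_taxid_nucl.dmp), a single awk pass can resolve a whole list of GIs at once, instead of launching one grep per GI:

```shell
# Sketch: resolve many GIs in one pass over the mapping file, keeping only
# the (small) GI list in memory as an awk associative array.

# Tiny stand-ins for the real GI list and gi_taxid_nucl.dmp:
printf '21\n42\n' > GI.list
printf '7\t9606\n21\t562\n42\t10090\n' > gi_taxid_nucl.dmp

# While reading the first file (FNR == NR), remember the wanted GIs;
# while reading the second, print the lines whose GI was requested.
awk 'FNR == NR { wanted[$1] = 1 ; next }
     ($1 in wanted) { print $1 "\t" $2 }' GI.list gi_taxid_nucl.dmp > GI_to_taxid.table

cat GI_to_taxid.table
```

Each matching GI/taxid pair lands in GI_to_taxid.table, and the taxids can then be fed to the tree-walking part of the script above.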
There is also a companion script that downloads and prepares NCBI's files:
#!/bin/bash

## Download NCBI's taxonomic data and the GI (GenBank ID) to taxid
## assignations.

## Variables
NCBI="ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/"
TAXDUMP="taxdump.tar.gz"
TAXID="gi_taxid_nucl.dmp.gz"
NAMES="names.dmp"
NODES="nodes.dmp"
DMP=$(echo {citations,division,gencode,merged,delnodes}.dmp)
USELESS_FILES="${TAXDUMP} ${DMP} gc.prt readme.txt"

## Download the taxdump archive
rm -rf ${USELESS_FILES} "${NODES}" "${NAMES}"
wget "${NCBI}${TAXDUMP}" && \
    tar zxvf "${TAXDUMP}" && \
    rm -rf ${USELESS_FILES}

## Limit the search space to scientific names
grep "scientific name" "${NAMES}" > "${NAMES/.dmp/_reduced.dmp}" && \
    rm -f "${NAMES}" && \
    mv "${NAMES/.dmp/_reduced.dmp}" "${NAMES}"

## Download gi_taxid_nucl
rm -f "${TAXID/.gz/}"*   # the glob must stay outside the quotes to expand
wget "${NCBI}${TAXID}" && \
    gunzip "${TAXID}"

exit 0