Question

NCBI Species name with Taxonomy ID

5.8 years ago

sureshhewabi • 0

I want to retrieve all species names along with their taxonomy IDs from the NCBI Taxonomy database, e.g. Homo sapiens (9606).

I am aware that there are two files available on the FTP site:

nodes.dmp (which associates each taxid with its parent taxid)
names.dmp (which associates names with taxids).

But if I am correct, names.dmp contains not only species names but also names at higher levels (family, class, phylum, etc.). If I use these two files, I first need to find all taxonomy IDs at the "species" rank in nodes.dmp, and then get the names for those filtered IDs from names.dmp.

Is there a more straightforward way to retrieve only species names along with their IDs, other than the method I described?

genome • 3.8k views

updated 5.8 years ago by shenwei356 8.7k • written 5.8 years ago by sureshhewabi • 0

Check out the module NCBITaxa within the ETE3 toolkit. It will allow you to do this fairly easily.

5.8 years ago by Joe 21k

score 2 · Accepted Answer · 2019-01-22

5.8 years ago

shenwei356 8.7k

Retrieving all taxids with the rank "species":

$ awk '$5 == "species" {print $1}' nodes.dmp  > species_taxids.txt

$ head -n 3 species_taxids.txt
7
9
11
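
A side note on the rank filter: with awk's default whitespace splitting, a rank such as "species group" or "species subgroup" also has "species" as its fifth field, so those taxids would pass the test above too. Below is a minimal sketch of a stricter filter, assuming the usual tab-pipe-tab delimiter of nodes.dmp. The sample rows (with placeholder parent IDs) and the function name strict_species_taxids are made up for illustration.

```shell
# Made-up stand-in for nodes.dmp; real rows carry more columns after the rank.
printf '7\t|\t6\t|\tspecies\t|\tAC\t|\n' > nodes_sample.dmp
printf '100\t|\t6\t|\tspecies group\t|\tXX\t|\n' >> nodes_sample.dmp
printf '6\t|\t1\t|\tgenus\t|\tAZ\t|\n' >> nodes_sample.dmp

# Split on the full "<tab>|<tab>" delimiter so field 3 holds the whole rank string.
strict_species_taxids() {
    awk -F '\t[|]\t' '$3 == "species" {print $1}' "$1"
}

strict_species_taxids nodes_sample.dmp   # prints: 7
```

On the real file this would be strict_species_taxids nodes.dmp > species_taxids.txt, and the line count may then be lower than with the whitespace-split version.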

Taxids and their scientific names:

$ awk -F $'\t' 'BEGIN {OFS="\t"} $7 == "scientific name" {print $1,$3}' names.dmp > scientific_names.txt

$ head -n 3  scientific_names.txt
1       root
2       Bacteria
6       Azorhizobium

Searching in scientific_names.txt with species_taxids.txt:

$ grep -w -f species_taxids.txt scientific_names.txt  > result.txt

$ head -n 3 result.txt
7       Azorhizobium caulinodans
9       Buchnera aphidicola
11      Cellulomonas gilvus

$ grep -w 9606 result.txt 
9606    Homo sapiens
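
If grep -f is slow on your system, or you want to match on the taxid column only, the same lookup can be sketched as an awk join. This is only a sketch under stated assumptions: the sample files, the made-up row "79 Toy sp. 7" and the function name join_by_taxid are illustrative and not part of the taxonomy dump.

```shell
# Toy inputs standing in for species_taxids.txt and scientific_names.txt.
printf '7\n9\n' > taxids_sample.txt
printf '2\tBacteria\n' > names_sample.txt
printf '7\tAzorhizobium caulinodans\n' >> names_sample.txt
printf '9\tBuchnera aphidicola\n' >> names_sample.txt
printf '79\tToy sp. 7\n' >> names_sample.txt   # made-up row whose name ends in the word "7"

# While reading the first file, remember each taxid.
# While reading the second file, print rows whose first column was remembered.
join_by_taxid() {
    awk -F '\t' 'NR == FNR {keep[$1]; next} $1 in keep' "$1" "$2"
}

join_by_taxid taxids_sample.txt names_sample.txt   # prints the rows for taxids 7 and 9 only
```

On the real files this would be join_by_taxid species_taxids.txt scientific_names.txt > result.txt. Because it compares the first column exactly, the row count can differ from the grep -w result, which also keeps lines where a taxid merely appears as a word inside a name.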

5.8 years ago by shenwei356 8.7k

This looks like the solution, but the following line seems computationally expensive. Do you have any idea how long it should take? It has been running for nearly 2 hours on a Mac with 16 GB of RAM...

grep -w -f species_taxids.txt scientific_names.txt  > result.txt

5.8 years ago by sureshhewabi • 0

Less than 1 second for me:

$ memusg -t  grep -w -f species_taxids.txt scientific_names.txt  > result.txt

elapsed time: 0.893s
peak rss: 177.77 MB

$ wc -l *
  2902963 names.dmp
  2043416 nodes.dmp
  1753136 result.txt
  2043416 scientific_names.txt
  1675766 species_taxids.txt
 10418697 total

$ ls -lh *
-rw-r--r-- 1 shenwei shenwei 167M Jan  8 18:23 names.dmp
-rw-r--r-- 1 shenwei shenwei 133M Jan  8 18:23 nodes.dmp
-rw-r--r-- 1 shenwei shenwei  58M Jan 23 21:37 result.txt
-rw-r--r-- 1 shenwei shenwei  69M Jan 22 22:42 scientific_names.txt
-rw-r--r-- 1 shenwei shenwei  13M Jan 22 22:36 species_taxids.txt

5.8 years ago by shenwei356 8.7k

It took more than 5 hours and never finished that step; I tried 3 times. Are you using ftp://ftp.ncbi.nih.gov/pub/taxonomy/ or ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/ ? I was using ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/

5.8 years ago by sureshhewabi • 0

Old one from ftp://ftp.ncbi.nih.gov/pub/taxonomy/.

5.8 years ago by shenwei356 8.7k

I tried with the old one as well. My file sizes are more or less the same as yours. I accept this as the solution, although my machine cannot cope with the grep command. Can I do the same thing with awk?

5.8 years ago by sureshhewabi • 0

Just search for how to install GNU grep:

How to install and use GNU Grep in OSX
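
As a rough sketch of what that can look like on macOS, assuming Homebrew is the package manager: brew install grep installs GNU grep under the name ggrep, alongside the system grep. The helper name pick_grep below is hypothetical; it only chooses which binary to call.

```shell
# Prefer "ggrep" (the name Homebrew gives GNU grep on macOS) when it is on PATH;
# otherwise fall back to whatever "grep" is installed.
pick_grep() {
    if command -v ggrep >/dev/null 2>&1; then
        echo ggrep
    else
        echo grep
    fi
}

GREP="$(pick_grep)"
"$GREP" --version | head -n 1   # the first line says whether this is GNU grep

# Then rerun the lookup with that binary, for example:
# "$GREP" -w -f species_taxids.txt scientific_names.txt > result.txt
```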

5.8 years ago by shenwei356 8.7k

You are a genius! I managed to get the results file, and it took less than a second.

5.8 years ago by sureshhewabi • 0