NCBI Species name with Taxonomy ID
5.8 years ago

I want to retrieve all species names along with their taxonomy IDs from the NCBI Taxonomy database, e.g. Homo sapiens (9606).

I am aware that there are two files available on the FTP site:

  • nodes.dmp (which associates each taxid with its parent taxid)

  • names.dmp (which associates names with taxids).

If I am correct, however, names.dmp contains not only species names but also the names of higher-level taxa (family, class, phylum, etc.). Using these two files, I would first need to find all taxonomy IDs with the rank "species" in nodes.dmp, and then get the names for those filtered IDs from names.dmp.

Is there a more straightforward way to retrieve only the species names along with their IDs, other than the method I described?


Check out the module NCBITaxa within the ETE3 toolkit. It will allow you to do this fairly easily.

5.8 years ago
  1. Retrieving all taxids with rank of "species":

    $ awk -F '\t' '$5 == "species" {print $1}' nodes.dmp  > species_taxids.txt
    
    $ head -n 3 species_taxids.txt
    7
    9
    11
    
  2. Taxids and their scientific names:

    $ awk -F $'\t'  'BEGIN {OFS="\t"} $7 == "scientific name" {print $1,$3} ' names.dmp > scientific_names.txt
    
    $ head -n 3  scientific_names.txt
    1       root
    2       Bacteria
    6       Azorhizobium
    
  3. Searching in scientific_names.txt with species_taxids.txt:

    $ grep -w -f species_taxids.txt scientific_names.txt  > result.txt
    
    $ head -n 3 result.txt
    7       Azorhizobium caulinodans
    9       Buchnera aphidicola
    11      Cellulomonas gilvus
    
    $ grep -w 9606 result.txt 
    9606    Homo sapiens
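
One thing worth checking after step 3: `grep -w` matches a taxid anywhere on the line, so a number that appears as a separate word inside a name can pull in rows whose own taxid is not in the list. The sketch below reproduces this on made-up toy rows (the ID 999999 and the name "toy group 7" are placeholders, not real entries) and lists any rows of result.txt whose first column is not in species_taxids.txt. Point the paths at the real files to run the same check.

```shell
# Toy data (hypothetical rows); replace the paths with the real files.
tmp=$(mktemp -d)
printf '%s\n' 7 9606 > "$tmp/species_taxids.txt"
printf '%s\t%s\n' \
    6      'Azorhizobium' \
    7      'Azorhizobium caulinodans' \
    9606   'Homo sapiens' \
    999999 'toy group 7' > "$tmp/scientific_names.txt"   # 999999 is a placeholder ID

# Same search as in step 3.
grep -w -f "$tmp/species_taxids.txt" "$tmp/scientific_names.txt" > "$tmp/result.txt"

# Keep only rows of result.txt whose first column is NOT in the ID list.
awk -F '\t' 'NR == FNR { keep[$1] = 1; next } !($1 in keep)' \
    "$tmp/species_taxids.txt" "$tmp/result.txt" > "$tmp/extra.txt"

cat "$tmp/extra.txt"
```

If extra.txt is empty, every row in result.txt belongs to a listed taxid. Here the toy row is reported because the "7" in its name matched.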
    

This seems to be the solution, but the following line looks computationally expensive. Do you have any idea how long it should take? It has been running for nearly 2 hours on my Mac with 16 GB of RAM...

grep -w -f species_taxids.txt scientific_names.txt  > result.txt

Less than 1 second for me:

$ memusg -t  grep -w -f species_taxids.txt scientific_names.txt  > result.txt

elapsed time: 0.893s
peak rss: 177.77 MB

$ wc -l *
  2902963 names.dmp
  2043416 nodes.dmp
  1753136 result.txt
  2043416 scientific_names.txt
  1675766 species_taxids.txt
 10418697 total

$ ls -lh *
-rw-r--r-- 1 shenwei shenwei 167M 1月   8 18:23 names.dmp
-rw-r--r-- 1 shenwei shenwei 133M 1月   8 18:23 nodes.dmp
-rw-r--r-- 1 shenwei shenwei  58M 1月  23 21:37 result.txt
-rw-r--r-- 1 shenwei shenwei  69M 1月  22 22:42 scientific_names.txt
-rw-r--r-- 1 shenwei shenwei  13M 1月  22 22:36 species_taxids.txt

It ran for more than 5 hours and never finished that step; I tried 3 times. Are you using ftp://ftp.ncbi.nih.gov/pub/taxonomy/ or ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/ ? I was using ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/


I tried with the old one as well. My file sizes are more or less the same as yours. I accept this as the solution, although my machine cannot cope with the grep command. Can I do the same thing with awk instead?
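
For reference, one way to do the whole lookup in awk is the two-file idiom: read nodes.dmp first to remember which taxids have the rank "species", then print the scientific names from names.dmp for those taxids only. This is a minimal sketch, not the method used in the answer above. It runs on made-up toy rows (the parent IDs and the 999999 row are placeholders) and assumes the tab-pipe-tab layout of the dump files, with the rank in field 5 of nodes.dmp and the name class in field 7 of names.dmp when splitting on tabs.

```shell
tmp=$(mktemp -d)

# Toy nodes.dmp rows: tax_id | parent tax_id | rank |   (parent IDs are made up)
printf '%s\t|\t%s\t|\t%s\t|\n' \
    6      1 'genus' \
    7      6 'species' \
    9606   1 'species' \
    999999 1 'species group' > "$tmp/nodes.dmp"

# Toy names.dmp rows: tax_id | name | unique name | name class |
printf '%s\t|\t%s\t|\t%s\t|\t%s\t|\n' \
    6      'Azorhizobium'             '' 'scientific name' \
    7      'Azorhizobium caulinodans' '' 'scientific name' \
    9606   'Homo sapiens'             '' 'scientific name' \
    9606   'human'                    '' 'genbank common name' \
    999999 'toy group 7'              '' 'scientific name' > "$tmp/names.dmp"

awk -F '\t' -v OFS='\t' '
    # First file (nodes.dmp): remember taxids whose rank is exactly "species".
    NR == FNR { if ($5 == "species") keep[$1] = 1; next }
    # Second file (names.dmp): print taxid and scientific name for kept taxids.
    $7 == "scientific name" && ($1 in keep) { print $1, $3 }
' "$tmp/nodes.dmp" "$tmp/names.dmp" > "$tmp/species_names.txt"

cat "$tmp/species_names.txt"
```

Because the taxid is compared as a whole field rather than searched for anywhere on the line, this should not depend on which grep is installed. Replace the toy paths with the real nodes.dmp and names.dmp to use it.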


Just google and install GNU grep ~~

How to install and use GNU Grep in OSX
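
A quick way to see which grep is being picked up is to print its version banner; GNU grep identifies itself as "grep (GNU grep)". The sketch below also adds `-F`, which treats each taxid as a fixed string rather than a regular expression. Since the patterns are plain digits this should give the same matches, and it may help with a large pattern file, though that is an assumption rather than something measured in this thread. If GNU grep was installed through Homebrew it is usually available as `ggrep`, so substitute that name if needed. The rows are toy data and 96060 is a placeholder ID.

```shell
# Which grep is on the PATH?
grep --version 2>&1 | head -n 1

# Toy data (hypothetical rows); replace the paths with the real files.
tmp=$(mktemp -d)
printf '%s\n' 7 9606 > "$tmp/species_taxids.txt"
printf '%s\t%s\n' \
    6     'Azorhizobium' \
    7     'Azorhizobium caulinodans' \
    9606  'Homo sapiens' \
    96060 'toy taxon' > "$tmp/scientific_names.txt"   # 96060 is a placeholder ID

# -F: patterns are fixed strings; -w still requires whole-word matches,
# so 9606 does not match the 96060 row.
grep -w -F -f "$tmp/species_taxids.txt" "$tmp/scientific_names.txt" > "$tmp/result_fixed.txt"

cat "$tmp/result_fixed.txt"
```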


You are a genius! I managed to get the results file, and it took less than a second.
