Download whole dataset from NCBI Taxonomy
9.1 years ago
stackf03 ▴ 40

Hello. Where can I download the NCBI taxonomy data file from the NCBI database?

The file I am looking for should contain the following:

  1. Taxonomy ID
  2. Common Name
  3. Scientific Name

If anyone can provide me the link, I'd be grateful. Thanks & Regards.

NCBI Taxonomy • 13k views

Hi there, and thank you stackf03 for this thread. I need an automated way to include the TaxaDB in a small thesis project. I hope you're still active members and can help me with the following questions:

  1. NCBI keeps uploading the whole TaxaDB to their FTP address in the form shown in this thread. Do you know of any other source for this data, or better yet, a different format? I need an automated way to import (and update) the taxa section of our DB. The dmp files are hard to handle (NCBI uses MySQL, but these dump files are not directly from MySQL).

  2. If there is no other source for the data itself, is there any piece of software that uses TaxaDB as part of its functioning? I will give this one, Taxadb, a try. I would appreciate hearing about any other tool.

  3. The 'common name' is stored in the names file: each name that a tax_id has gets its own row, and each row indicates the name class. I mention this in case someone else finds this thread and wonders whether the common name is there or not.

Thanks.
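
On handling the dmp files: a minimal sketch of turning one into ordinary tab-separated text that a database bulk loader could read. It assumes the layout described in the taxdump readme, where fields are separated by tab, pipe, tab and each row ends with tab, pipe. The function name dmp_to_tsv and the output file name are made up for illustration.

```shell
# Convert an NCBI .dmp file on stdin to plain tab-separated text on stdout.
# Assumption: fields are separated by "<tab>|<tab>" and rows end with "<tab>|".
dmp_to_tsv() {
    awk '{
        sub(/\t\|$/, "")        # drop the row terminator at the end of the line
        gsub(/\t\|\t/, "\t")    # collapse each field separator to a single tab
        print
    }'
}

# Usage (file names are placeholders):
# dmp_to_tsv < names.dmp > names.tsv
```

Since it reads stdin and writes stdout, the same function could be rerun on each fresh download as part of an update job.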

9.1 years ago
Phil S. ▴ 700

This is your site! And the file you want to download is this one.

HTH


Just to be clear, the linked file is an archive containing several files, not a single file.

@stackf03: You would want to take a look at the readme that goes with that dump.


Does this contain the taxID, scientific name and common name?


It contains this:

nodes.dmp
---------

This file represents taxonomy nodes. The description for each node includes 
the following fields:

    tax_id                              -- node id in GenBank taxonomy database
    parent tax_id                       -- parent node id in GenBank taxonomy database
    rank                                -- rank of this node (superkingdom, kingdom, ...)
    embl code                           -- locus-name prefix; not unique
    division id                         -- see division.dmp file
    inherited div flag (1 or 0)         -- 1 if node inherits division from parent
    genetic code id                     -- see gencode.dmp file
    inherited GC flag (1 or 0)          -- 1 if node inherits genetic code from parent
    mitochondrial genetic code id       -- see gencode.dmp file
    inherited MGC flag (1 or 0)         -- 1 if node inherits mitochondrial gencode from parent
    GenBank hidden flag (1 or 0)        -- 1 if name is suppressed in GenBank entry lineage
    hidden subtree root flag (1 or 0)   -- 1 if this subtree has no sequence data yet
    comments                            -- free-text comments and citations

names.dmp
---------
Taxonomy names file has these fields:

    tax_id                  -- the id of node associated with this name
    name_txt                -- name itself
    unique name             -- the unique variant of this name if name not unique
    name class              -- (synonym, common name, ...)

From these fields you can put your parts together.
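
One way to put those parts together, as a rough sketch rather than a definitive recipe: read names.dmp once and print one row per tax_id with its scientific name and a common name. It assumes the raw file uses tab, pipe, tab between fields and tab, pipe at the end of each row. Preferring the "genbank common name" class over "common name" is a choice made for this sketch, and the function name taxid_sci_common is hypothetical.

```shell
# Print "tax_id<TAB>scientific name<TAB>common name" from names.dmp-style
# input on stdin. The common-name column is left empty when none is recorded.
# Assumption: "genbank common name" wins over "common name" if both exist.
taxid_sci_common() {
    awk 'BEGIN { FS = "\t[|]\t" }
        { cls = $4; sub(/\t\|$/, "", cls) }     # strip the row terminator
        cls == "scientific name" {
            if (!($1 in sci)) order[++n] = $1   # remember first-seen order
            sci[$1] = $2
        }
        cls == "genbank common name" { common[$1] = $2 }
        cls == "common name" { if (!($1 in common)) common[$1] = $2 }
        END {
            for (i = 1; i <= n; i++) {
                id = order[i]
                print id "\t" sci[id] "\t" common[id]
            }
        }'
}

# Usage (file names are placeholders):
# taxid_sci_common < names.dmp > taxid_names.tsv
```

The whole name table is held in memory until the end of the file, which keeps the sketch short but may matter on small machines.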


Thanks for this.

So basically, I need the taxonomy names file, which is names.dmp!

May I know which tool you use to open this file, please? :)


That should be a text file. It would likely be large so you may not want to open it in a standard editor. It would be best to use awk to pull out the fields you need.
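
As a sketch of inspecting the file without opening it in an editor: head shows the first few rows, and a short awk tally shows which name classes the file contains and how many rows each has. This assumes the tab, pipe, tab field layout of the raw dump; the function name name_class_counts is made up for illustration.

```shell
# Count the rows per name class in names.dmp-style input on stdin,
# printing "count<TAB>name class" with the most frequent class first.
name_class_counts() {
    awk 'BEGIN { FS = "\t[|]\t" }
        { cls = $4; sub(/\t\|$/, "", cls); count[cls]++ }  # strip row terminator, tally
        END { for (c in count) print count[c] "\t" c }' | sort -rn
}

# Usage (file name is a placeholder):
# head -n 5 names.dmp              # peek at the first rows
# name_class_counts < names.dmp    # see which name classes are present
```

The tally is a quick way to check whether a class such as "genbank common name" is present before writing a filter for it.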


I have managed to open it with the Sublime Text editor. It contains this:

1 | all |  | synonym |
1 | root |  | scientific name |
2 | Bacteria | Bacteria <prokaryote> | scientific name |
2 | Monera | Monera <Bacteria> | in-part |
2 | Procaryotae | Procaryotae <Bacteria> | in-part |
2 | Prokaryota | Prokaryota <Bacteria> | in-part |
2 | Prokaryotae | Prokaryotae <Bacteria> | in-part |
2 | bacteria | bacteria <blast2> | blast name |
2 | eubacteria |  | genbank common name |
2 | not Bacteria Haeckel 1894 |  | synonym |
2 | prokaryote | prokaryote <Bacteria> | in-part |
2 | prokaryotes | prokaryotes <Bacteria> | in-part |
6 | Azorhizobium |  | scientific name |
6 | Azorhizobium Dreyfus et al. 1988 emend. Lang et al. 2013 |  | authority |
6 | Azotirhizobium |  | misspelling |
7 | ATCC 43989 |  | type material |
7 | Azorhizobium caulinodans |  | scientific name |
7 | Azorhizobium caulinodans Dreyfus et al. 1988 |  | synonym |
7 | Azotirhizobium caulinodans |  | equivalent name |
7 | CCUG 26647 |  | type material |
7 | DSM 5975 |  | type material |
7 | IFO 14845 |  | type material |
7 | JCM 20966 |  | type material |
7 | LMG 6465 |  | type material |
7 | NBRC 14845 |  | type material |
7 | ORS 571 |  | type material |
9 | Acyrthosiphon pisum symbiont P |  | includes |
9 | Buchnera aphidicola |  | scientific name |
9 | Buchnera aphidicola Munson et al. 1991 |  | synonym |
10 | "Cellvibrio" Winogradsky 1929 |  | synonym |
10 | Cellvibrio |  | scientific name |
10 | Cellvibrio (ex Winogradsky 1929) Blackall et al. 1986 emend. Humphry et al. 2003 |  | synonym |

Does this make sense?


Looks like you won't get the "common name" from this file. Look at the other files included in the archive. TaxID and scientific names are the first two fields here. Unless you don't need the common name.

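The following should give you the records (taxID, name) labelled "scientific name" in names.dmp. The field separator assumes the raw file delimits fields with tab, pipe, tab; splitting on "|" alone would leave stray tabs around each value.

$ awk -F '\t[|]\t' '$4 ~ /^scientific name/ {print $1"\t"$2}' names.dmp > sci_names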