How could I generate a gi_taxid_nucl.dmp file similar to the one previously hosted by NCBI?
3.0 years ago

Background. I'm trying to use a tool called centrifuge to identify potential genus and species in a given set of FASTQ files. It works with their provided indices, but these indices are out of date and I need to include some more recent sequences from NCBI for my study.

Fortunately, centrifuge allows me to create updated indices using data that was previously available from NCBI's FTP server. However, I've discovered that this information is no longer available from NCBI in the format that centrifuge needs. Specifically, the file gi_taxid_nucl.dmp (or its gzipped equivalent) is supposed to be accessible at https://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz, but it is not.

Investigations. I've done some digging and discovered that this is not a new problem. Other metagenomics tools, like kraken, have had similar issues raised, and centrifuge itself has some open issues about this. Unfortunately, the answer seems to be either "use the old version" (which is no longer hosted by NCBI, not even in an obsolete directory as suggested by one issue) or "update the code to use the new taxonomy format".

Questions.

  1. I was wondering if/how I could, with minimal updates to centrifuge, provide it with updated data from NCBI in the format it expects.
  2. Is the actual issue that the underlying format for gi_taxid_nucl.dmp is bad and causing issues? Is that why it is not updated or hosted anymore by NCBI? If so, then question 1 is not really an option and I will need to actually update centrifuge. If this is the case, could someone explain what purpose the old gi_taxid_nucl.dmp served and how I might reproduce that from the taxonomy data hosted by NCBI?
ncbi metagenomics centrifuge gi_taxid_nucl.dmp taxonomy • 2.8k views
3.0 years ago

NCBI has retired the use of GI numbers, hence tools that use that information have become obsolete.

Use kraken2 instead; it does everything that Centrifuge does, and mostly without the unexpected crazy results :-)

I did some testing a while back, and there seemed to be a major flaw with centrifuge: it does not properly consolidate reads that have equal classifications. It reports such reads under both sources, which, depending on circumstances, can make things look really out of whack.

Long story short: use kraken2.

Ah, so GI numbers have been phased out altogether. That makes sense and helps my googling. I found this announcement and this NCBI insights article which provided more context.

That's good information on centrifuge. I was mostly using it as a starting point and potentially going to compare against more recent tools, but if it has become obsolete then I guess there's not much need for comparison. For posterity's sake, after reading your post I found this guide to choosing metagenomics tools that gives slightly more information about the tools (from the same institution as the authors of centrifuge) and it does recommend kraken2 over centrifuge: http://ccb.jhu.edu/software/choosing-a-metagenomics-classifier/

Just to note that NCBI hasn't retired GI numbers yet, and it will be a long while before we do. We recently expanded GI numbers to 64 bits. See the NCBI Insights post about this for some details.

The accession2taxid files on the Taxonomy FTP area (https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/) do contain the GI numbers (in addition to the accession and accession.version) for nucleotide and protein sequences.
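
For anyone who wants the old layout back: as far as I know, gi_taxid_nucl.dmp was a plain two-column, tab-separated file (GI, then taxid), so it can be approximated from an accession2taxid table. The following is a minimal sketch, not an official recipe. The function name `acc2taxid_to_gi_taxid` is made up for illustration, the sample rows only mimic the four-column layout (accession, accession.version, taxid, gi), and the filter assumes that rows without a usable GI carry a non-numeric value or 0 in the fourth column.

```shell
# Hypothetical helper: read accession2taxid rows on stdin and write
# gi<TAB>taxid rows on stdout.
acc2taxid_to_gi_taxid() {
    # Keep rows whose 4th column is a positive integer. The header row
    # ("gi" in column 4) fails the numeric test and is dropped.
    awk -F'\t' -v OFS='\t' '$4 ~ /^[0-9]+$/ && $4 + 0 > 0 { print $4, $3 }'
}

# Illustrative input only; the last row stands in for a record without a GI.
printf 'accession\taccession.version\ttaxid\tgi\nA00001\tA00001.1\t10641\t58418\nA00002\tA00002.1\t9913\t2\nX00000\tX00000.1\t9606\t0\n' |
    acc2taxid_to_gi_taxid
# prints two rows: 58418<TAB>10641 and 2<TAB>9913
```

On real data the same function could be fed from something like `zcat nucl_gb.accession2taxid.gz | acc2taxid_to_gi_taxid > gi_taxid_nucl.dmp`. Records that were never assigned a GI will simply be missing from the result.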

7 months ago

JUST GIVE THE ANSWER! Here is the solution:

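# Download the accession-to-taxid table (-c resumes a partial download;
# -s and -j control how many connections/downloads run in parallel).
# pdb.accession2taxid.gz is the file used here; the same directory also
# holds the nucl_gb and nucl_wgs tables.
aria2c -c -s 4 -j 4 \
    https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/pdb.accession2taxid.gz

# Columns are: accession, accession.version, taxid, gi.
# Keep accession.version and taxid, skipping the header row.
zcat pdb.accession2taxid.gz |
    awk -v OFS='\t' 'NR > 1 {print $2, $3}' \
    > gi_taxid_nucl.map

# Build the index from the FASTA file and the taxonomy dump.
centrifuge-build \
    -p 16 \
    --conversion-table ./gi_taxid_nucl.map \
    --taxonomy-tree ../taxonomy/nodes.dmp \
    --name-table ../taxonomy/names.dmp \
    ../../blastn.fasta \
    ncbi_nt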

Note that the build will take days to finish. In my experience, even 64 threads won't speed it up much.
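
Since the build runs for days, it may be worth checking beforehand that every sequence in the FASTA file has an entry in the conversion table. Below is a rough sketch of such a check. The function name `missing_ids` is hypothetical, the demo files are made up, and it assumes the table is keyed on the first whitespace-delimited token of each FASTA header, which I have not verified against centrifuge-build itself.

```shell
# Hypothetical helper: list FASTA sequence IDs that have no row in the
# conversion table.
#   $1 = conversion table (id<TAB>taxid)
#   $2 = FASTA file
missing_ids() {
    awk -F'\t' '
        # First file: remember every ID in column 1 of the table.
        FILENAME == ARGV[1] { known[$1] = 1; next }
        # Second file: for each header, take the text after ">" up to
        # the first space or tab and report it if it is not in the table.
        /^>/ {
            id = substr($0, 2)
            sub(/[ \t].*/, "", id)
            if (!(id in known)) print id
        }
    ' "$1" "$2"
}

# Small made-up demo in a temporary directory.
tmp=$(mktemp -d)
printf 'A00001.1\t10641\nA00002.1\t9913\n' > "$tmp/table.map"
printf '>A00001.1 some description\nACGT\n>X99999.1 another one\nGGCC\n' > "$tmp/seqs.fasta"
missing_ids "$tmp/table.map" "$tmp/seqs.fasta"   # prints X99999.1
rm -r "$tmp"
```

An empty result would suggest the table covers the whole FASTA file. Any IDs that are printed would need rows added (or sequences removed) before starting the build.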
