Background. I'm trying to use a tool called centrifuge to identify potential genus and species in a given set of FASTQ files. It works with their provided indices, but these indices are out of date and I need to include some more recent sequences from NCBI for my study.
Fortunately, centrifuge allows me to create updated indices using data that was previously available from NCBI's FTP server. However, I've discovered that this data is no longer available from NCBI in the format that centrifuge needs. Specifically, the file gi_taxid_nucl.dmp (or its gzipped equivalent) is supposed to be accessible at https://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz, but it is not.
Investigations.
I've done some digging and discovered that this is not a new problem. Other metagenomics tools, such as kraken, have had similar issues raised, and centrifuge itself has some issues surrounding this. Unfortunately, the answer seems to be either "use the old version" (which is actually no longer hosted by NCBI, not even in an obsolete directory as suggested by one issue) or "update the code to use the new taxonomy format".
Questions.
- I was wondering if/how I could, with minimal updates to centrifuge, provide it with updated data from NCBI in the format it expects.
- Is the actual issue that the underlying format for gi_taxid_nucl.dmp is bad and causing issues? Is that why it is not updated or hosted anymore by NCBI? If so, then question 1 is not really an option and I will need to actually update centrifuge. If this is the case, could someone explain what purpose the old gi_taxid_nucl.dmp served and how I might reproduce that from the taxonomy data hosted by NCBI?
Ah, so GI numbers have been phased out altogether. That makes sense and helps my googling. I found this announcement and this NCBI insights article which provided more context.
That's good information on centrifuge. I was mostly using it as a starting point and was potentially going to compare it against more recent tools, but if it has become obsolete then I guess there's not much need for comparison. For posterity's sake, after reading your post I found this guide to choosing metagenomics tools (from the same institution as the authors of centrifuge), which gives slightly more information about the tools and does recommend kraken2 over centrifuge: http://ccb.jhu.edu/software/choosing-a-metagenomics-classifier/
Just to note that NCBI hasn't retired GI numbers yet, and it will be a long while before we do. We recently expanded GI numbers to 64 bits. See the NCBI Insights post about this for some details.
The accession2taxid files on the Taxonomy FTP area (https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/) do contain the GI numbers (in addition to the accession and accession.version) for nucleotide and protein sequences.
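Since the accession2taxid files carry the GI alongside the taxid, a GI-to-taxid table in the spirit of the old gi_taxid_nucl.dmp could in principle be rebuilt from them. The following is only a rough sketch under a few assumptions: that each file starts with a tab-separated header naming the `gi` and `taxid` columns, that the old file was simply two tab-separated columns (GI, then taxid), and that records without a usable GI should be skipped. The function names `gi_taxid_pairs` and `convert` and the file names in the usage comment are illustrative, and I have not verified that centrifuge accepts the result.

```python
import gzip


def gi_taxid_pairs(lines):
    """Yield (gi, taxid) string pairs from accession2taxid-style lines.

    Assumes the first line is a tab-separated header containing
    'gi' and 'taxid' column names (an assumption about the file layout).
    """
    lines = iter(lines)
    header = next(lines, "").rstrip("\n").split("\t")
    gi_col = header.index("gi")
    taxid_col = header.index("taxid")
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(gi_col, taxid_col):
            continue  # skip truncated or malformed rows
        gi, taxid = fields[gi_col], fields[taxid_col]
        # Skip rows with no usable numeric GI (e.g. "na" or "0").
        if gi.isdigit() and gi != "0" and taxid.isdigit():
            yield gi, taxid


def convert(in_path, out_path):
    """Stream a gzipped accession2taxid file into a two-column GI/taxid file."""
    with gzip.open(in_path, "rt") as src, open(out_path, "w") as dst:
        for gi, taxid in gi_taxid_pairs(src):
            dst.write(f"{gi}\t{taxid}\n")


# Hypothetical usage, after downloading one of the nucleotide files:
# convert("nucl_gb.accession2taxid.gz", "gi_taxid_nucl.dmp")
```

The conversion streams line by line rather than loading the file, since these mapping files are large. If several nucleotide files are needed, the outputs could be concatenated, though whether duplicate GIs matter to centrifuge is something I have not checked.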