...then what I get is a single .tar archive, itself containing several hundred thousand .tar archives - each of those containing the text file with the FASTA nucleotide sequence. It would require 2-3 days for my modest but capable Mac Core Duo to untar all these archives and I expect a further day or two for it to cat them into a single flat text file.
So, how can I download a single flat text file (or a manageable number of text files, e.g. 10 files) of the entire NCBI prokaryote Genbank and Refseq databases as nucleotide FASTA?
To my knowledge, there isn't an equivalent direct programmatic way to do this. It is possible to download all the FASTA files as text files directly, though (see for instance ncbi-genome-download). I would be surprised if cat-ing the files takes that long, so you should be able to concatenate them all.
If you can't even concatenate a file of that size, you're going to struggle to do any meaningful downstream analysis with something that unwieldy, so you may need to reconsider your approach.
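For what it's worth, here is a minimal sketch of the concatenation step in Python, assuming the downloads have already been unpacked into a directory tree of .fna and/or .fna.gz files. The function name and the file-extension convention are my own assumptions, not anything NCBI prescribes. It streams each file in chunks, so memory use stays small no matter how large the final file gets.

```python
import gzip
from pathlib import Path


def concatenate_fasta(input_dir, output_path, chunk_size=1024 * 1024):
    """Stream every .fna / .fna.gz file under input_dir into one flat file.

    Returns the number of files appended. Files are read in chunks, so the
    whole collection never has to fit in memory.
    """
    input_dir = Path(input_dir)
    output_path = Path(output_path)
    count = 0
    with open(output_path, "wb") as out:
        for path in sorted(input_dir.rglob("*")):
            if not path.is_file() or path.resolve() == output_path.resolve():
                continue
            if path.name.endswith(".fna.gz"):
                opener = gzip.open   # decompress on the fly
            elif path.name.endswith(".fna"):
                opener = open
            else:
                continue             # skip anything that is not FASTA
            last = b"\n"
            with opener(path, "rb") as src:
                while True:
                    chunk = src.read(chunk_size)
                    if not chunk:
                        break
                    out.write(chunk)
                    last = chunk[-1:]
            # Make sure the next file's header starts on its own line.
            if last != b"\n":
                out.write(b"\n")
            count += 1
    return count
```

Usage would be something like concatenate_fasta("genomes", "all_prokaryotes.fasta"). The run time is dominated by disk I/O and decompression, not by Python.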
Thanks Joe. By ncbi-genome-download, do you mean a third party shell script / an Entrez query / a web portal?
Yes, I would certainly give cat-ing a go - if I absolutely had to.
Once I've got the single text file, even if it is hundreds of GB in size, I've found it's perfectly possible to run practical analysis on it with packages like HMMER.
ncbi-genome-download from Kai Blin is a utility program that will download genomes for you. You can also use NCBI's newest program called Datasets. More here.
This doesn't address the question. These utilities download data to separate compressed archives. I am specifically looking for a single flat file download for multiple genomes - as is currently available in the NCBI virus portal - but for prokaryotes.
There is no pre-created file for prokaryotic genomes. You will need to make it yourself by downloading the genomes.
You could try to create a FASTA file from ref_prok_rep_genomes, which is a pre-formatted BLAST database NCBI makes available on their BLAST db FTP site. You can use the blastdbcmd tool with the data files. This FASTA file would contain representative genomes, as the name says. Perhaps that may work for whatever you are trying to do.
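A rough sketch of that blastdbcmd step, wrapped in Python. It assumes BLAST+ is installed and that the ref_prok_rep_genomes volumes have already been downloaded and unpacked where BLAST can find them (for example in the current directory or under BLASTDB). The helper names are hypothetical. The flags used (-db, -entry all, -outfmt %f, -out) are standard blastdbcmd options.

```python
import shutil
import subprocess


def build_blastdbcmd_args(db_name, fasta_out):
    """Return the argument list asking blastdbcmd to dump every entry as FASTA."""
    return [
        "blastdbcmd",
        "-db", db_name,     # database prefix, e.g. ref_prok_rep_genomes
        "-entry", "all",    # export every sequence in the database
        "-outfmt", "%f",    # FASTA output
        "-out", fasta_out,  # single flat output file
    ]


def dump_blast_db(db_name, fasta_out):
    """Run blastdbcmd if it is installed; raise a clear error otherwise."""
    if shutil.which("blastdbcmd") is None:
        raise RuntimeError("blastdbcmd not found on PATH (it ships with BLAST+)")
    subprocess.run(build_blastdbcmd_args(db_name, fasta_out), check=True)
```

Calling dump_blast_db("ref_prok_rep_genomes", "rep_genomes.fasta") would then write one flat file, but only for the representative set, not for all of RefSeq or GenBank.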
There is no such file. You will have to do it the same way everyone else does and download the genomes separately. You can parse the FTP addresses from the assembly summary files. All RefSeq bacteria come to 600-700 GB, and all GenBank bacteria to more than 1 TB. Generally you would want files that fit in your computer's RAM. That being said, you can zcat the archives, no?
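To illustrate the "parse the FTP addresses" part, here is a sketch that reads the text of an assembly summary file and derives the genomic FASTA URL for each assembly. It assumes the usual layout of those files: tab-separated, a commented header line naming the columns (including assembly_level and ftp_path), and one *_genomic.fna.gz file per assembly directory. The function name is made up for this example.

```python
def genomic_fasta_urls(summary_text, assembly_levels=None):
    """Derive *_genomic.fna.gz URLs from the text of an assembly summary file.

    assembly_levels: optional set such as {"Complete Genome"}; when given,
    only rows whose assembly_level is in the set are kept.
    """
    header = None
    urls = []
    for line in summary_text.splitlines():
        if line.startswith("#"):
            # The column names live on a commented line that contains ftp_path.
            fields = line.lstrip("#").strip().split("\t")
            if "ftp_path" in fields:
                header = fields
            continue
        if header is None or not line.strip():
            continue
        row = dict(zip(header, line.split("\t")))
        ftp_path = row.get("ftp_path", "na").rstrip("/")
        if ftp_path in ("", "na"):
            continue  # no directory is published for this assembly
        if assembly_levels and row.get("assembly_level") not in assembly_levels:
            continue
        basename = ftp_path.rsplit("/", 1)[-1]
        urls.append(f"{ftp_path}/{basename}_genomic.fna.gz")
    return urls
```

The resulting list can be handed to any downloader. Since gzip allows several compressed members in one stream, the downloaded .fna.gz files can also be concatenated as they are and decompressed in one pass with zcat.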
Unzip and convert from annotated Genbank (.gbk) format to .fna format using any of a range of tools.
Concatenate.
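As an illustration of the conversion step, below is a deliberately minimal GenBank-to-FASTA sketch in plain Python. It is not the tool I used. It assumes well-formed GenBank flat-file records (LOCUS, optional VERSION and DEFINITION lines, an ORIGIN block, and a closing //), and it keeps only the first line of a multi-line DEFINITION. An established parser such as Biopython's SeqIO would be more robust for real data.

```python
import re


def genbank_to_fasta(genbank_text, line_width=70):
    """Convert GenBank flat-file text to FASTA text (minimal, assumption-laden)."""
    records = []
    name, definition, seq_parts, in_origin = None, "", [], False
    for line in genbank_text.splitlines():
        if line.startswith("//"):
            # End of record: store it and reset the state.
            if name is not None:
                records.append((name, definition, "".join(seq_parts)))
            name, definition, seq_parts, in_origin = None, "", [], False
        elif in_origin:
            # Sequence lines look like "   1 acgtacgtac gt": keep letters only.
            seq_parts.append(re.sub(r"[^A-Za-z]", "", line))
        elif line.startswith("LOCUS"):
            name = line.split()[1]
        elif line.startswith("VERSION"):
            parts = line.split()
            if len(parts) > 1:
                name = parts[1]  # prefer accession.version as the FASTA id
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
    out = []
    for rec_name, rec_def, seq in records:
        out.append(f">{rec_name} {rec_def}".rstrip())
        seq = seq.upper()
        out.extend(seq[i:i + line_width] for i in range(0, len(seq), line_width))
    return "\n".join(out) + "\n" if out else ""
```

For hundreds of thousands of files, this would be applied one file at a time, with the output appended to the growing flat file, so that nothing large is held in memory.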
It is surprising to me that this is not as straightforward for prokaryotes as it could be. As I mentioned in my original question, on the NCBI Virus web interface, whole-database nucleotide FASTAs (RefSeq and Genbank) can be downloaded as a single nucleotide text file with a single click. I have queried this with NCBI and will update this answer if they can add anything to this.
Virus web interface, whole-database nucleotide FASTAs (RefSeq and Genbank) can be downloaded as a single nucleotide text file with a single click.
There are 9507 RefSeq viral entries as complete genomes. In terms of nucleotides, that is at least 2-3 orders of magnitude less than a corresponding compendium for bacteria. Frankly, I would be surprised if the file you are referring to contains all RefSeq viral genomes, but I am definitely not surprised that a single file does not exist for all bacterial RefSeq genomes.
To the best of my knowledge, what you want to do is not possible. I don't know the exact reason, but I would guess it is because there isn't enough demand among users to download a single flat file with all RefSeq genomes. Most people like to customize their downloads, and most people have no problem (g)unzipping and concatenating thousands of files.
The recipe you show above for flat .fna files is most likely incorrect, possibly because you are not pointing at the correct directory. It is easy to show using genome_updater that there are 17439 RefSeq bacterial genomes that fulfill the "Complete Genome" criterion (as of June 8th). Likewise, there are 357 RefSeq archaeal genomes fulfilling the same criterion (as of right now). If you do the same exercise but extend it to RefSeq genomes that are not complete (considered to be in the "Contig" state with a reasonably small number of contigs), there are an additional 97100 genomes among bacteria (as of June 8th), and another 459 among archaea (as of right now). These numbers are considerably different from what you have.
Thank you, but as I mentioned in my question I am looking for the entire prokaryote RefSeq and Genbank databases.