Question

How do I download the NCBI prokaryote GenBank and RefSeq databases as a single flat text file?

0

5.2 years ago

Michael • 0

It's easy to download all viral Genbank and RefSeq genomes from NCBI as a single flat text file of nucleotide FASTAs.

I simply go here and click Download: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide

However, how do I do this for all prokaryote Genbank and RefSeq genomes?

If I go to the following URL and click "Download Assemblies": https://www.ncbi.nlm.nih.gov/assembly/?term=prokaryota%5Borgn%5D

...then what I get is a single .tar archive, itself containing several hundred thousand .tar archives - each of those containing the text file with the FASTA nucleotide sequence. It would require 2-3 days for my modest but capable Mac Core Duo to untar all these archives and I expect a further day or two for it to cat them into a single flat text file.

So, how can I download a single flat text file (or a manageable number of text files, e.g. 10 files) of the entire NCBI prokaryote Genbank and Refseq databases as nucleotide FASTA?

ncbi prokaryote • 6.5k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 5.2 years ago by Michael • 0

2

To my knowledge, there isn't an equivalent direct programmatic way to do this. It is possible to download all the FASTA files as text files directly, though (see, for instance, ncbi-genome-download). I would be surprised if cat-ing the files took that long, so you should be able to concatenate them all.

If you can't even concatenate a file of that size, you're going to struggle to do any meaningful downstream analysis with something that unwieldy, so you may need to reconsider your approach.

ADD REPLY • link 5.2 years ago by Joe 22k

0

Thanks Joe. By ncbi-genome-download, do you mean a third party shell script / an Entrez query / a web portal?

Yes, I would certainly give cat-ing a go, if I absolutely had to.

Once I've got the single text file, even if it is hundreds of GB in size, I've found it's perfectly possible to run practical analysis on it with packages like HMMER.

ADD REPLY • link 5.2 years ago by Michael • 0

1

ncbi-genome-download from Kai Blin is a utility program that will download genomes for you. You can also use NCBI's newest program called Datasets. More here.

ADD REPLY • link 5.2 years ago by GenoMax 153k

0

This doesn't address the question. These utilities download data to separate compressed archives. I am specifically looking for a single flat file download for multiple genomes - as is currently available in the NCBI virus portal - but for prokaryotes.

ADD REPLY • link 5.2 years ago by Michael • 0

0

There is no pre-created file for prokaryotic genomes. You will need to make it yourself by downloading the genomes.

You could try to create a fasta file from ref_prok_rep_genomes, which is a pre-formatted blast database NCBI makes available on their blast db FTP site. You can use blastdbcmd tool with the data files. This fasta would contain representative genomes as the name says. Perhaps that may work for whatever you are trying to do.

ADD REPLY • link 5.2 years ago by GenoMax 153k

0

Thank you, but as I mentioned in my question I am looking for the entire prokaryote RefSeq and Genbank databases.

ADD REPLY • link 5.2 years ago by Michael • 0

0

There is no such file. You will have to do it the same way everyone else does it and download the genomes separately. You can parse the FTP addresses from the assembly summary files. All RefSeq bacteria come to 600-700 GB and all GenBank bacteria to over 1 TB. Generally you would want files that fit in the RAM of your computer. That being said, you can zcat the archives, no?
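
For anyone who wants to try the assembly-summary route, here is a minimal Python sketch of the parsing step. It assumes the usual tab-separated layout of NCBI's assembly_summary.txt (comment lines starting with "#", one of which names the columns, including ftp_path) and assumes that each genome's nucleotide FASTA sits at <ftp_path>/<last path component>_genomic.fna.gz. The function name genomic_fasta_urls is made up for illustration; check a few URLs by hand before launching a bulk download.

```python
from typing import Iterable, Iterator


def genomic_fasta_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield one *_genomic.fna.gz URL per assembly listed in the summary.

    Assumes a tab-separated file whose column names appear on a comment
    line (starting with '#') that includes a field called 'ftp_path'.
    """
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Comment line: use it as the header only if it names ftp_path.
            fields = line.lstrip("#").strip().split("\t")
            if "ftp_path" in fields:
                header = fields
            continue
        if not line or header is None:
            continue
        row = dict(zip(header, line.split("\t")))
        ftp_path = row.get("ftp_path", "").rstrip("/")
        if ftp_path in ("", "na"):
            # Some assemblies have no directory on the FTP site.
            continue
        basename = ftp_path.rsplit("/", 1)[-1]
        yield f"{ftp_path}/{basename}_genomic.fna.gz"
```

To use it, open a downloaded assembly_summary.txt as a text file, pass the file handle to genomic_fasta_urls, and write the resulting URLs to a list for whatever download tool you prefer.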

ADD REPLY • link 5.2 years ago by 5heikki 11k

score 0 · Answer 1 · 2020-06-19

For anyone else looking to do this, the best solution (involving a manageable number of files) I have found so far is:

Creating a Bacterial RefSeq nucleotide flat file:

Download all .fna (nucleotide FASTA) files (~2000 files) from:
https://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/
These are flat files each containing multiple genomes.
Unzip and concatenate.
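
The "unzip and concatenate" step can be done in one streaming pass, so the uncompressed pieces never have to sit on disk separately. A minimal Python sketch, assuming the downloaded files are gzip-compressed; the function name concatenate_gzipped is just for illustration:

```python
import gzip
import shutil
from pathlib import Path
from typing import Iterable


def concatenate_gzipped(inputs: Iterable[Path], output: Path) -> int:
    """Decompress each .gz input in turn and append it to one output file.

    Returns the number of input files processed. Data is copied in
    1 MiB chunks, so memory use stays small regardless of file size.
    """
    count = 0
    with open(output, "wb") as out:
        for path in inputs:
            with gzip.open(path, "rb") as src:
                shutil.copyfileobj(src, out, length=1024 * 1024)
            count += 1
    return count
```

For example, with the downloads in a hypothetical directory refseq_bacteria, something like concatenate_gzipped(sorted(Path("refseq_bacteria").glob("*.fna.gz")), Path("refseq_bacteria_all.fna")) would produce the single flat file. Adjust the glob pattern to match the actual file names you downloaded.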

Creating a Bacterial Genbank nucleotide flat file:

Download all gbbct*.seq.gz files (bacterial sequence entries, ~450 files) from:
ftp://ftp.ncbi.nlm.nih.gov/genbank/
Unzip and convert from annotated Genbank (.gbk) format to .fna format using any of a range of tools.
Concatenate.
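
If you would rather not install a converter, the GenBank-to-FASTA step can be roughed out with the Python standard library alone. This is only a sketch, not a full GenBank parser: it assumes each record starts with a LOCUS line, carries its accession on the VERSION line, lists its sequence under ORIGIN, and ends with "//". Records without an ORIGIN section (for example, ones that only have a CONTIG line) are skipped. The function names are invented for illustration; an established tool is the safer choice for anything you intend to publish.

```python
import re
from typing import Iterable, Iterator, Tuple


def genbank_to_fasta_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) for each record that has an ORIGIN section."""
    version = definition = ""
    seq_parts = []
    in_definition = in_origin = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("LOCUS"):
            # New record: reset all per-record state.
            version = definition = ""
            seq_parts = []
            in_definition = in_origin = False
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
            in_definition = True
        elif line.startswith("VERSION"):
            parts = line.split()
            version = parts[1] if len(parts) > 1 else ""
            in_definition = False
        elif line.startswith("ORIGIN"):
            in_origin = True
            in_definition = False
        elif line.startswith("//"):
            sequence = "".join(seq_parts)
            if sequence:
                yield f"{version} {definition}".strip(), sequence
            seq_parts = []
            in_origin = False
        elif in_origin:
            # Sequence lines look like '   61 acgtacgtac ...'; drop digits/spaces.
            seq_parts.append(re.sub(r"[\d\s]", "", line))
        elif in_definition and line.startswith(" "):
            # Indented line right after DEFINITION continues the description.
            definition += " " + line.strip()
        else:
            in_definition = False


def format_fasta(header: str, sequence: str, width: int = 70) -> str:
    """Return one FASTA record with the sequence wrapped at `width` columns."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{header}\n" + "\n".join(chunks) + "\n"
```

In use, you would open each decompressed .seq file as text, loop over genbank_to_fasta_records(handle), and write format_fasta(header, sequence) for every record to a single output file.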

It is surprising to me that this is not as straightforward for prokaryotes as it could be. As I mentioned in my original question, on the NCBI Virus web interface, whole-database nucleotide FASTAs (RefSeq and Genbank) can be downloaded as a single nucleotide text file with a single click. I have queried this with NCBI and will update this answer if they can add anything to this.

score 0 · Answer 2 · 2020-06-19

To the best of my knowledge, what you want to do is not possible. I don't know the exact reason, but I would guess it is because there isn't enough demand among users to download a single flat file with all RefSeq genomes. Most people like to customize their downloads, and most people have no problem (g)unzipping and concatenating thousands of files.

The recipe you show above for flat .fna files is most likely incorrect, possibly because you are not pointing at the correct directory. It is easy to show using genome_updater that there are 17,439 RefSeq bacterial genomes that fulfill the "Complete Genome" criterion (as of June 8th). Likewise, there are 357 RefSeq archaeal genomes fulfilling the same criterion (as of right now). If you do the same exercise but extend it to RefSeq genomes that are not complete (considered to be in the "Contig" state with a reasonably small number of contigs), there are an additional 97,100 genomes among bacteria (as of June 8th), and another 459 among archaea (as of right now). These numbers are considerably different from what you have.
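
Counts like these can be cross-checked without any special tool by tallying the assembly_level column of an NCBI assembly summary file. The sketch below is not how genome_updater works internally; it is just a stand-alone check, assuming the same tab-separated layout with a "#"-prefixed header line that includes an assembly_level column. The function name count_assembly_levels is hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_assembly_levels(lines: Iterable[str]) -> Counter:
    """Count assemblies per assembly_level (e.g. 'Complete Genome', 'Contig')."""
    header = None
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Use the comment line that names the columns as the header.
            fields = line.lstrip("#").strip().split("\t")
            if "assembly_level" in fields:
                header = fields
            continue
        if not line or header is None:
            continue
        row = dict(zip(header, line.split("\t")))
        counts[row.get("assembly_level", "unknown")] += 1
    return counts
```

Passing an open assembly_summary.txt handle to count_assembly_levels returns a Counter, so counts["Complete Genome"] and counts["Contig"] can be compared directly with the figures above (bearing in mind that the totals change with every RefSeq update).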