I have a list of hundreds or thousands of RefSeq accessions, e.g. NC_000964.1.
If I point my browser to https://www.ncbi.nlm.nih.gov/nuccore/NC_000964.1, I see the GBFF file displayed and I can download it through various clicks. What I can't figure out is how to automate this: how do I construct a URL or FTP address that I can just use wget or rsync on? Any advice?
Edit: the specific file I'm after is the one called "GenBank (full)" in the drop-down.
NCBI Datasets is designed specifically for something like this! You can query the bacterial genomes by taxids if you would like and download all of the data in one package. There is a command line tool that you can use on Windows, Mac or Linux machines to download sequence and annotation data starting from either taxids or accessions.
At this time, NCBI Datasets includes only the latest assemblies (see here). If you have a list of NCBI assembly accessions that are out of the scope of NCBI Datasets, you can download directly from the FTP paths as shown below. The NCBI Genomes FTP has a bunch of assembly_summary.txt files (this one, for example) that have the full FTP paths for assemblies that can be used with a tool like lftp to download GFF3 and GBFF files.
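As a rough sketch of the FTP route, the function below looks up a list of assembly accessions in a local copy of assembly_summary_refseq.txt and prints the URL of each GBFF file. The function name is made up for illustration, and two things are assumptions to check against the README on the FTP site: that the header names the column `ftp_path`, and that files inside an assembly directory are named `<directory name>_genomic.gbff.gz`.

```shell
# Print "assembly_accession<TAB>GBFF URL" for each wanted assembly.
# $1: file with one assembly accession per line (e.g. GCF_000005845.2)
# $2: local copy of assembly_summary_refseq.txt
summary_to_gbff_urls() {
  awk -F'\t' '
    NR == FNR { want[$1] = 1; next }         # first file: wanted accessions
    /^#/ {                                   # header lines of the summary
      line = $0; sub(/^#+ */, "", line)
      n = split(line, name, "\t")
      for (i = 1; i <= n; i++) if (name[i] == "ftp_path") col = i
      next
    }
    col && ($1 in want) && $col ~ /^(ftp|https):/ {
      dir = $col; sub(/\/$/, "", dir)
      base = dir; sub(/.*\//, "", base)      # last path component
      print $1 "\t" dir "/" base "_genomic.gbff.gz"
    }
  ' "$1" "$2"
}
```

Something like summary_to_gbff_urls wanted.txt assembly_summary_refseq.txt | cut -f2 | wget -i - should then fetch the files; swapping the suffix for _genomic.gff.gz would give the GFF3 files instead.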
Technically, you can use Entrez Direct to download the GenBank flatfiles for a given set of nucleotide accessions like NC_000913, but it does not scale well if you have many thousands of accessions. Note that there is a limit on the number of requests you can make through the E-utilities API; it can be increased by getting an API key as described here.
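If you do go the Entrez Direct route for a longer list, one way to stay under the request limit is to fetch several accessions per call. This is only a sketch: the function names, the batch size of 50 and the one-second pause are arbitrary choices, and it assumes efetch accepts a comma-separated list in -id.

```shell
# Read accessions (one per line) on stdin, print comma-separated batches.
# $1: batch size (default 50, an arbitrary choice)
batch_ids() {
  xargs -n "${1:-50}" | tr ' ' ','
}

# Fetch GenBank records in batches (needs network access and Entrez Direct).
# $1: file of accessions, $2: output file
fetch_genbank_batches() {
  batch_ids 50 < "$1" | while read -r ids; do
    # </dev/null stops efetch from swallowing the rest of the loop input
    efetch -db nuccore -id "$ids" -format gb < /dev/null >> "$2"
    sleep 1
  done
}
```

Usage would be along the lines of fetch_genbank_batches accessions.txt all_records.gb.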
I was ultimately successful in using a combination of @vkkodali's suggestion to use the NCBI Datasets tool and @GenoMax's suggestion to use Entrez Direct tools to automate going from an NC_* style accession to a GCF_* accession, and then downloading FASTA and GFF. A big thanks to both from me.
For an accession such as NC_000913.3, I did esearch -db nuccore -query NC_000913 | elink -target assembly | esummary | xtract -pattern DocumentSummary -element RefSeq (note the dropping of whatever is after the dot), which yielded e.g. GCF_000005845.2. Then datasets download genome accession GCF_000005845.2 would download a file ncbi_dataset.zip containing what I needed. This didn't always work, but out of about 3400 accessions it worked the large majority of the time, which is all I needed.
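For anyone repeating this over a long list, the mapping step can be wrapped in a loop roughly like the following. It is a sketch rather than the exact script I ran: the function names are made up, it keeps only the first assembly returned per accession, and accessions that fail to map are written to stderr so they can be checked by hand.

```shell
# Drop the version suffix: NC_000913.3 -> NC_000913
strip_version() {
  printf '%s\n' "${1%%.*}"
}

# Map one nucleotide accession to a RefSeq assembly accession
# (needs network access and Entrez Direct; same pipeline as above).
nc_to_gcf() {
  esearch -db nuccore -query "$(strip_version "$1")" < /dev/null |
    elink -target assembly |
    esummary |
    xtract -pattern DocumentSummary -element RefSeq
}

# $1: file with one NC_* accession per line.
# Prints "NC<TAB>GCF" on stdout and unmapped accessions on stderr.
map_accessions() {
  while read -r acc; do
    [ -n "$acc" ] || continue
    gcf=$(nc_to_gcf "$acc" | head -n 1)
    if [ -n "$gcf" ]; then
      printf '%s\t%s\n' "$acc" "$gcf"
    else
      printf '%s\n' "$acc" >&2
    fi
  done < "$1"
}
```

Running map_accessions nc_list.txt > nc_to_gcf.tsv 2> unmapped.txt gives a table whose second column can be fed one at a time to datasets download genome accession.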
There might have been more direct ways. For a given NC_* accession, it seems like pointing a browser to e.g. https://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3/ works, so I considered trying to automate clicking through "Send to", maybe with something like AutoHotKey, but ultimately decided what I had here was good enough.
You can get the GenBank format file by simply using
$ efetch -db nuccore -id "NC_000913" -format gb | head -5
LOCUS NC_000913 4641652 bp DNA circular CON 11-OCT-2018
DEFINITION Escherichia coli str. K-12 substr. MG1655, complete genome.
ACCESSION NC_000913
VERSION NC_000913.3
DBLINK BioProject: PRJNA57779
What you can't get is a GFF file.
You could get the feature table and make your own GFF file.
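A minimal sketch of that conversion is below, assuming the tab-delimited feature table layout that efetch -format ft returns (a ">Feature" header line, then "start, stop, feature key" lines followed by indented qualifier lines). The function name is made up, "feature_table" is an arbitrary source label, and the output is only GFF-like: attributes are not escaped, no ID/Parent links are built, and multi-interval features come out as one row per interval.

```shell
# Convert an NCBI feature table on stdin into GFF-like rows on stdout.
ft_to_gff() {
  awk -F'\t' -v OFS='\t' '
    function emit(   i) {                    # print the buffered feature
      for (i = 1; i <= n; i++)
        print seqid, "feature_table", type, lo[i], hi[i], ".", strand[i], ".", (attrs == "" ? "." : attrs)
      n = 0; attrs = ""
    }
    function add_interval(a, b) {
      gsub(/[<>]/, "", a); gsub(/[<>]/, "", b)   # drop partial markers
      n++
      if (a + 0 <= b + 0) { lo[n] = a; hi[n] = b; strand[n] = "+" }
      else                { lo[n] = b; hi[n] = a; strand[n] = "-" }   # start > stop: minus strand
    }
    /^>Feature/ {
      emit(); split($0, h, " ")
      np = split(h[2], p, "|"); seqid = (np > 1 ? p[2] : h[2])
      next
    }
    $1 != "" && $3 != "" { emit(); type = $3; add_interval($1, $2); next }   # new feature
    $1 != "" && $2 != "" { add_interval($1, $2); next }                      # extra interval
    $4 != "" { attrs = attrs (attrs == "" ? "" : ";") $4 "=" $5; next }      # qualifier
    END { emit() }
  '
}
```

It would be used as efetch -db nuccore -id "NC_000913" -format ft | ft_to_gff > NC_000913.gff, with the result checked against a few known genes before relying on it.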
Yes, you would have to get the GCF accessions for the nucleotide accessions using elink. Entrez Direct works fine for this because you are not downloading entire records but only a summary. In fact, you don't even have to extract the GCF accessions list if you don't care about them. There is an FtpPath_RefSeq attribute in the output of esummary that you can extract and use that to download files from FTP.
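A sketch of the FtpPath_RefSeq route: the first function only does string handling, the second wraps the Entrez Direct pipeline. Both function names are made up, and the `<directory name>_genomic.gbff.gz` file naming is my understanding of the genomes FTP layout, so check it against one directory listing first.

```shell
# Turn an assembly FTP directory into the URL of one file inside it.
# $1: directory (e.g. the FtpPath_RefSeq value)
# $2: file suffix (default genomic.gbff.gz; genomic.gff.gz for GFF3)
ftp_file_url() {
  dir=${1%/}
  base=${dir##*/}
  printf '%s/%s_%s\n' "$dir" "$base" "${2:-genomic.gbff.gz}"
}

# Look up FtpPath_RefSeq for one NC_* accession
# (needs network access and Entrez Direct).
nc_to_ftp_dir() {
  esearch -db nuccore -query "$1" < /dev/null |
    elink -target assembly |
    esummary |
    xtract -pattern DocumentSummary -element FtpPath_RefSeq
}
```

For example, wget "$(ftp_file_url "$(nc_to_ftp_dir NC_000913)" genomic.gff.gz)" would download the GFF3 file if the lookup returns exactly one path.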
Thanks that looks useful. Yes they are all accessions
Looking through the example syntax for the tool, it unfortunately doesn't look like one can pass it accession numbers. I guess I could just download all bacterial genomes and filter afterwards...
Thanks that looks promising. Unfortunately I can't get it to work, for example:
./datasets download genome accession NC_000913.3
Error: invalid or unsupported assembly accession: NC_000913.3
But that refseq certainly exists: https://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3/
EDIT: OK I guess there is a difference between a RefSeq accession and an assembly accession? The following works
./datasets download genome accession GCF_000005845.2
I don't understand the difference between a RefSeq accession and an "assembly" accession, nor how to go from one to the other.
You can use Entrez Direct to go from one to the other:
or download this file (RefSeq) and parse it. RefSeq accessions are manually curated.
Awesome. Thank you!
Hi, an update. Your piped chain of commands works for me, thank you, so I can proceed with that. For completeness, I cannot find refseq accessions in the file you linked to (ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt). For example, searching that file for NC_000913.3 or just NC_000913 returns no hits.
Sorry I forgot that these report files don't use the RefSeq IDs in NC* format. You could grep for the organism names in this file.
Thanks. I want to avoid doing that because the organism names get confusing fast with different strains of the same bacterial species (and the strains can be pretty different). I will stick to the Entrez Direct method. In case someone reads this later, NCBI's instructions on installing these utilities omit the simplest method: at least on Ubuntu 20 you can just do sudo apt install ncbi-entrez-direct.
Note, the version of Entrez Direct in Ubuntu repositories is typically behind the latest version. I recommend following the directions on the NCBI website.
RefSeq and assembly accessions are database accessions, or unique identifiers, in their own worlds. The refseq or nuccore accession is an accession for a particular sequence id, while the assembly accession can group multiple contigs, scaffolds, chromosomes, etc. In your case, with a near-perfect assembly of a bacterial genome, the difference might feel semantic; however, think about eukaryotes, for example.
Thanks that is helpful. When you say "The refseq or nuccore accession is an accession for a particular sequence id", can a "particular sequence id" only refer to single physically connected sequence? I.e., the refseq accession can be a genome in the bacterial case only because there is a single "chromosome" that contains the entire genome?
You can find information about RefSeq records in this FAQ. Accession prefixes for RefSeq records are in this table. An organism may have multiple NC* IDs for chromosomes (e.g. the human genome ones are NC_000001 through NC_000024).
Key differences between GenBank (GCA) and RefSeq (GCF) assemblies are summarized on this page.
OK so I thought I was going from RefSeq to GenBank by converting from e.g. NC_000913.1 to GCF_000005845.2 but apparently GCF is RefSeq as well so everything is in RefSeq land. I guess by going from NC_ to GCF_ I am going from "Complete genomic molecule, usually reference assembly" to "RefSeq genome assembly".
If you wanted to go to Genbank then you should do:
Thanks, I think GCF is fine for me. I just want to be able to download either .gb or .gff files given a NC_ accession, from the command line. So my current strategy is:
Start with an accession such as NC_000913.3.
Drop the version suffix to get NC_000913.
Run esearch -db nuccore -query NC_000913 | elink -target assembly | esummary | xtract -pattern DocumentSummary -element RefSeq, which yields GCF_000005845.2.
Run datasets download genome accession GCF_000005845.2, which produces a zip file containing FASTA and GFF among other things.