Download many NCBI genomes with a list of GCA identifiers
20 months ago
0fbcb0a9 • 0

Hi, I have a list of NCBI GCA/GCF identifiers for many (hundreds of) vertebrate whole genomes that I would like to download. I initially tried Entrez.efetch, but have realized GCA numbers cannot be used with it, and this also may not be an efficient method for so many large genomes. I'm now looking at using ftplib or the NCBI datasets tool, but I'm new to bioinformatics and I am having trouble understanding the best approach with these. Has anyone done this, and if so, do you have example code you are willing to share? I'm hoping to do this with python/command line. Any help would be greatly appreciated, thanks!

genomes biopython ncbi api python • 2.3k views

There are how-to guides for the NCBI datasets command-line tool: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/, including one specifically for large genome downloads: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/genomes/large-download/

20 months ago
5heikki 11k

More information and the download for the datasets command-line tool: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/

datasets download

Download genome, gene and virus data packages, including sequence, annotation, and metadata, as a zip file.

Refer to NCBI's [download and install](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/) documentation for information about getting started with the command-line tools.

Usage
  datasets download [command]

Sample Commands
  datasets download genome accession GCF_000001405.40 --chromosomes X,Y --exclude-gff3 --exclude-rna
  datasets download genome taxon "bos taurus"
  datasets download gene gene-id 672
  datasets download gene symbol brca1 --taxon mouse
  datasets download gene accession NP_000483.3
  datasets download virus genome taxon sars-cov-2 --host dog
  datasets download virus protein S --host dog --filename SARS2-spike-dog.zip

Available Commands
  gene        Download a gene data package
  genome      Download a genome data package
  virus       Download a virus data package

Flags
      --filename string   Specify a custom file name for the downloaded data package (default "ncbi_dataset.zip")
      --no-progressbar    Hide progress bar


Global Flags
      --api-key string   Specify an NCBI API key
      --debug            Emit debugging info
      --help             Print detailed help about a datasets command
      --version          Print version of datasets

Use datasets download <command> --help for detailed help about a command.
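
Since the question asks for Python, here is a minimal sketch of how a list of accessions could be fed to the tool in batches. It assumes the `datasets` binary is on your PATH and that your version accepts `--inputfile` (a text file with one accession per line) for `datasets download genome accession`; `--filename` is the flag shown in the help above. The `--dehydrated` option is how I recall the large-download guide handling big genomes, so check `datasets download genome accession --help` for your version first. The function names, batch size and file names are made up for illustration.

```python
import re
import subprocess
from pathlib import Path

# Assembly accessions look like GCA_000001405.29 or GCF_000001405.40.
ACCESSION_RE = re.compile(r"^GC[AF]_\d{9}(\.\d+)?$")


def read_accessions(path):
    """Return unique, valid accessions from a text file (one per line)."""
    accessions = []
    for line in Path(path).read_text().splitlines():
        acc = line.strip()
        if not acc or acc.startswith("#"):
            continue  # skip blank lines and comments
        if not ACCESSION_RE.match(acc):
            raise ValueError(f"not an assembly accession: {acc!r}")
        if acc not in accessions:
            accessions.append(acc)
    return accessions


def build_download_command(inputfile, zip_name, dehydrated=False):
    """Build the datasets command line for one batch (assumed flags, see above)."""
    cmd = ["datasets", "download", "genome", "accession",
           "--inputfile", str(inputfile), "--filename", str(zip_name)]
    if dehydrated:
        cmd.append("--dehydrated")
    return cmd


def download_in_batches(accessions, workdir, batch_size=50, dry_run=True):
    """Write one accession file per batch and (optionally) run the downloads."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    commands = []
    for start in range(0, len(accessions), batch_size):
        n = start // batch_size
        batch_file = workdir / f"batch_{n:03d}.txt"
        batch_file.write_text("\n".join(accessions[start:start + batch_size]) + "\n")
        cmd = build_download_command(batch_file, workdir / f"batch_{n:03d}.zip")
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises if datasets exits non-zero
    return commands
```

Calling `download_in_batches(read_accessions("accessions.txt"), "genomes", dry_run=True)` only writes the batch files and returns the commands, so you can inspect them before switching to `dry_run=False`. Batching is just a convenience so that one failed transfer does not force you to restart hundreds of genomes.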
20 months ago
size_t ▴ 120

Try this tool: ncbi-genome-download


To the best of my knowledge it doesn't work with GCA/GCF identifiers; it only deals with taxonomic names or IDs.


Actually it's possible; see the -A parameter:

usage: ncbi-genome-download [-h] [-s {refseq,genbank}] [-F FILE_FORMATS] [-l ASSEMBLY_LEVELS] [-g GENERA] [--genus GENERA] [--fuzzy-genus] [-S STRAINS] [-T SPECIES_TAXIDS] [-t TAXIDS] [-A ASSEMBLY_ACCESSIONS]
                        [-R REFSEQ_CATEGORIES] [--refseq-category REFSEQ_CATEGORIES] [-o OUTPUT] [--flat-output] [-H] [-P] [-u URI] [-p N] [-r N] [-m METADATA_TABLE] [-n] [-N] [-v] [-d] [-V]
                        [-M TYPE_MATERIALS]
                        groups

eg: ncbi-genome-download --formats "fasta,gff" -A GCF.txt --parallel 6 plant
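
For the original question (a mixed GCA/GCF list of vertebrates) the call probably needs two adjustments, so here is a rough Python sketch that builds one command per section. It assumes, from my recollection of the tool's README rather than from this thread, that the default section is refseq (so GCA accessions need `-s genbank`) and that the vertebrate groups are named `vertebrate_mammalian` and `vertebrate_other` and can be given as a comma-separated list. Please confirm with `ncbi-genome-download --help`. The helper names are hypothetical.

```python
from collections import defaultdict

# GCA_* assemblies live in GenBank, GCF_* assemblies in RefSeq.
SECTION_BY_PREFIX = {"GCA": "genbank", "GCF": "refseq"}


def split_by_section(accessions):
    """Group accessions by the section they should be requested from."""
    by_section = defaultdict(list)
    for acc in accessions:
        prefix = acc.split("_", 1)[0]
        if prefix not in SECTION_BY_PREFIX:
            raise ValueError(f"unexpected accession prefix: {acc!r}")
        by_section[SECTION_BY_PREFIX[prefix]].append(acc)
    return dict(by_section)


def build_ngd_command(section, accession_file,
                      groups="vertebrate_mammalian,vertebrate_other",
                      formats="fasta", parallel=4):
    """Build an ncbi-genome-download call; group names are an assumption."""
    return ["ncbi-genome-download", "-s", section,
            "--formats", formats,
            "-A", str(accession_file),
            "--parallel", str(parallel),
            groups]
```

You would write each group returned by `split_by_section` to its own text file (one accession per line) and pass the resulting command to `subprocess.run(cmd, check=True)`.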


Thanks, always something new to learn.
