Question

Download many NCBI genomes with list of GCA identifiers

0

2.2 years ago

0fbcb0a9 • 0

Hi, I have a list of NCBI GCA/GCF identifiers for many (hundreds of) vertebrate whole genomes that I would like to download. I initially tried Entrez.efetch, but realized that GCA numbers cannot be used with it, and it may not be an efficient method for so many large genomes anyway. I'm now looking at using ftplib or the NCBI datasets tool, but I'm new to bioinformatics and am having trouble working out the best approach with either. Has anyone done this, and if so, do you have example code you are willing to share? I'm hoping to do this with Python or the command line. Any help would be greatly appreciated, thanks!

genomes biopython ncbi api python • 3.4k views

ADD COMMENT • link updated 2.2 years ago by Mensur Dlakic ★ 29k • written 2.2 years ago by 0fbcb0a9 • 0

2

How-to guides for the NCBI datasets command-line tool are available at https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/ including one specifically for large genome downloads: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/genomes/large-download/

ADD REPLY • link 2.2 years ago by GenoMax 151k

0

Please upvote the original post:

Getting a curl: (22) The requested URL returned error: 500 ERROR

ADD REPLY • link 2.2 years ago by Mensur Dlakic ★ 29k

GenoMax · Answer 1 · 2023-04-18

More information and download instructions for the datasets command-line tool: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/

datasets download

Download genome, gene and virus data packages, including sequence, annotation, and metadata, as a zip file.

Refer to NCBI's [download and install](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/) documentation for information about getting started with the command-line tools.

Usage
  datasets download [command]

Sample Commands
  datasets download genome accession GCF_000001405.40 --chromosomes X,Y --exclude-gff3 --exclude-rna
  datasets download genome taxon "bos taurus"
  datasets download gene gene-id 672
  datasets download gene symbol brca1 --taxon mouse
  datasets download gene accession NP_000483.3
  datasets download virus genome taxon sars-cov-2 --host dog
  datasets download virus protein S --host dog --filename SARS2-spike-dog.zip

Available Commands
  gene        Download a gene data package
  genome      Download a genome data package
  virus       Download a virus data package

Flags
      --filename string   Specify a custom file name for the downloaded data package (default "ncbi_dataset.zip")
      --no-progressbar    Hide progress bar


Global Flags
      --api-key string   Specify an NCBI API key
      --debug            Emit debugging info
      --help             Print detailed help about a datasets command
      --version          Print version of datasets

Use datasets download <command> --help for detailed help about a command.
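Since the question asks for Python, here is a minimal sketch of how the accession list could be fed to `datasets` from a script. It only reads and sanity-checks the list and builds the command lines; nothing is downloaded until you pass a command to `subprocess.run`. Note the assumptions: the `--inputfile` and `--dehydrated` flags and the `datasets rehydrate --directory` subcommand do not appear in the help text above, they are how I understand the large-download how-to, so please confirm with `datasets download genome accession --help` for your installed version. The function names and the `genomes.zip` file name are made up for illustration.

```python
import re
from pathlib import Path

# Assembly accessions look like GCA_000001405.29 or GCF_000001405.40.
# The version suffix is treated as optional here (an assumption).
ACCESSION_RE = re.compile(r"^GC[AF]_\d{9}(\.\d+)?$")


def read_accessions(path):
    """Read a one-accession-per-line file.

    Returns (valid, rejected): unique well-formed accessions in file order,
    and any non-empty, non-comment lines that did not look like accessions.
    """
    seen, valid, rejected = set(), [], []
    for line in Path(path).read_text().splitlines():
        acc = line.strip()
        if not acc or acc.startswith("#"):
            continue  # skip blank lines and comments
        if not ACCESSION_RE.match(acc):
            rejected.append(acc)
        elif acc not in seen:
            seen.add(acc)
            valid.append(acc)
    return valid, rejected


def build_download_command(inputfile, zip_name="genomes.zip"):
    """Build the argument list for one dehydrated download of all accessions.

    --inputfile and --dehydrated are assumed flags; verify with --help.
    """
    return [
        "datasets", "download", "genome", "accession",
        "--inputfile", str(inputfile),
        "--dehydrated",
        "--filename", zip_name,
    ]


def build_rehydrate_command(directory):
    """Build the command that fetches the sequence files after unzipping."""
    return ["datasets", "rehydrate", "--directory", str(directory)]
```

To actually run it, something like `subprocess.run(build_download_command("accessions.txt"), check=True)`, then unzip `genomes.zip` into a directory and run the rehydrate command on that directory. For hundreds of vertebrate genomes the dehydrated route should be gentler than one request per accession, but I have not benchmarked it.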

score 0 · Answer 2 · 2023-04-18

0

2.2 years ago

size_t ▴ 120

Try this tool: ncbi-genome-download

ADD COMMENT • link 2.2 years ago by size_t ▴ 120

0

To the best of my knowledge, it doesn't work with GCA/GCF identifiers; it deals only with taxonomic names or IDs.

ADD REPLY • link 2.2 years ago by Mensur Dlakic ★ 29k

1

Actually, it is possible; see the -A parameter:

usage: ncbi-genome-download [-h] [-s {refseq,genbank}] [-F FILE_FORMATS] [-l ASSEMBLY_LEVELS] [-g GENERA] [--genus GENERA] [--fuzzy-genus] [-S STRAINS] [-T SPECIES_TAXIDS] [-t TAXIDS] [-A ASSEMBLY_ACCESSIONS]
                        [-R REFSEQ_CATEGORIES] [--refseq-category REFSEQ_CATEGORIES] [-o OUTPUT] [--flat-output] [-H] [-P] [-u URI] [-p N] [-r N] [-m METADATA_TABLE] [-n] [-N] [-v] [-d] [-V]
                        [-M TYPE_MATERIALS]
                        groups

e.g.: ncbi-genome-download --formats "fasta,gff" -A GCF.txt --parallel 6 plant
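One thing to watch with a mixed GCA/GCF list: the usage line above shows `-s {refseq,genbank}`, and as far as I know the default section is refseq, so GCA accessions would need a separate run with `-s genbank`. Below is a rough Python sketch that splits a list by prefix and builds one command per section, using only the short flags from the usage line. The function names, the output directory, and the default group string `vertebrate_mammalian,vertebrate_other` are my assumptions for the vertebrate case in the question; check `ncbi-genome-download --help` for the group names your version accepts. It passes accessions as a comma-separated string to `-A`; for hundreds of accessions a file path, as in the example above, is probably tidier.

```python
from collections import defaultdict

# GCA accessions live in GenBank, GCF accessions in RefSeq.
SECTION_BY_PREFIX = {"GCA": "genbank", "GCF": "refseq"}


def split_by_section(accessions):
    """Group accessions by the ncbi-genome-download section they belong to."""
    by_section = defaultdict(list)
    for raw in accessions:
        acc = raw.strip()
        prefix = acc.split("_", 1)[0].upper()
        section = SECTION_BY_PREFIX.get(prefix)
        if section is None:
            raise ValueError(f"not a GCA/GCF accession: {acc!r}")
        by_section[section].append(acc)
    return dict(by_section)


def build_ngd_commands(
    accessions,
    groups="vertebrate_mammalian,vertebrate_other",  # assumed group names
    formats="fasta",
    parallel=4,
    output="genomes",  # hypothetical output directory
):
    """Return one ncbi-genome-download argument list per section."""
    commands = []
    for section, accs in sorted(split_by_section(accessions).items()):
        commands.append([
            "ncbi-genome-download",
            "-s", section,
            "-F", formats,
            "-A", ",".join(accs),
            "-p", str(parallel),
            "-o", output,
            groups,
        ])
    return commands
```

Each returned list can be handed to `subprocess.run(cmd, check=True)`. Adding `-n` (listed in the usage above) should give a dry run first, which seems worth doing before pulling hundreds of genomes.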