Question

more elegant way to bulk download genomes from the NCBI

5.3 years ago

Carambakaracho ★ 3.3k

Hi eutils/edirect specialists,

I wonder whether NCBI's edirect utils or the eutils API offer a more elegant, ideally one-step solution to download genomes/assemblies based on a query, say all reference/representative genomes for Lactobacillus.

My current working solution is a pipe, esearch | esummary | xtract, that builds tab-delimited output containing the FTP path, the accession, the assembly name and the species name, which I then pass through a perl command to build curl commands.

esearch \
    -db assembly \
    -query "Lactobacillus[orgn] AND complete+genome[assembly+level] + latest[filter]" \
    | esummary \
    | xtract -pattern DocumentSummary \
        -element FtpPath_RefSeq \
        -element AssemblyAccession \
        -element AssemblyName \
        -element SpeciesName \
    | perl -nwe 'chomp; @a = split(/\t/,$_); $a[3] =~ s/ /_/g; $g = $a[1] . "_" . $a[2] . "_genomic.fna.gz"; print "curl -L -o $a[3]_$g $a[0]/$g\n";' \
    >lactobacillus_ftp.curl_commands.sh

My goal is to avoid the obscure perl code - it is just difficult to hand over to beginners. Am I missing something, maybe via elink?

Thanks in advance

edirect Assembly NCBI • 4.0k views

ADD COMMENT • link 5.3 years ago by Carambakaracho ★ 3.3k

Downloading genome data has been covered in past Biostars threads (listed here for future reference):
how to download all the complete genomes for mycobacteria from NCBI?
How to download COMPLETE bacterial genomes from NCBI based on list of names?
download refseq of thousand of assembly file from NCBI
Retrieve genome in fasta format from ncbi

ADD REPLY • link 5.3 years ago by GenoMax 147k

score 2 · Accepted Answer · 2019-08-29

5.3 years ago

Joe 21k

In short, I don't think so, but for genomes, ncbi-genome-download can do those kinds of queries.
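For reference, a rough sketch of how the Lactobacillus query from the question might look with ncbi-genome-download. The wrapper name `preview_genus_downloads` is made up for illustration, and the flag spellings below are the ones I know from recent releases; they have changed between versions, so check `ncbi-genome-download --help` before relying on them.

```shell
# Hypothetical wrapper: list what would be fetched for one bacterial genus.
# --dry-run only prints the matching assemblies; drop it to actually download.
preview_genus_downloads() {
  ncbi-genome-download \
    --dry-run \
    --section refseq \
    --genera "$1" \
    --assembly-levels complete \
    --formats fasta \
    bacteria
}

# Usage (needs network access and the tool installed):
#   preview_genus_downloads Lactobacillus
```

Starting with the dry run makes it easy to check the number of matching assemblies before committing to a large download.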

ADD COMMENT • link 5.3 years ago by Joe 21k

bummer - I forgot about Kai Blin's tool... I remember reading through his repository ncbi-genome-download @ github

ADD REPLY • link 5.3 years ago by Carambakaracho ★ 3.3k

Yeah, it's a good tool; it's my go-to recommendation for this kind of thing :)

ADD REPLY • link 5.3 years ago by Joe 21k

Hi jrj.healey, I modified the title so your comment would qualify as an answer - if you care enough to move it :-)

ADD REPLY • link 5.3 years ago by Carambakaracho ★ 3.3k

score 2 · Accepted Answer · 2019-08-30

Do you need this to be a command line tool? You can run this query on the NCBI Assembly portal and use the 'Download Assemblies' button to download the data, as shown in the screenshot below: [screenshot]

Note: Your query should actually be Lactobacillus[Organism] AND complete_genome[filter] AND latest[filter]; see the warning with the yellow triangle above the results list notifying you of the issue.
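If you do want to stay on the command line, the corrected query can be dropped into the original esearch pipeline. Below is a minimal sketch with the perl step swapped for a small awk helper. `rows_to_curl` is a made-up name; it assumes the four tab-separated columns produced by the xtract call in the question (FTP path, accession, assembly name, species name), and it takes the file name from the last component of the FTP path instead of gluing accession and assembly name together.

```shell
# Hypothetical helper: turn xtract rows into curl commands.
# Assumed input columns: FtpPath_RefSeq, AssemblyAccession, AssemblyName, SpeciesName
rows_to_curl() {
  awk -F '\t' '
    $1 != "" {                                # skip rows without a RefSeq FTP path
      n = split($1, parts, "/")               # last path component = directory name
      file = parts[n] "_genomic.fna.gz"       # genomic FASTA inside that directory
      species = $4
      gsub(/ /, "_", species)                 # spaces -> underscores for the local name
      print "curl -L -o " species "_" file " " $1 "/" file
    }'
}

# Usage (needs edirect and network access):
#   esearch -db assembly \
#       -query "Lactobacillus[Organism] AND complete_genome[filter] AND latest[filter]" \
#     | esummary \
#     | xtract -pattern DocumentSummary \
#         -element FtpPath_RefSeq AssemblyAccession AssemblyName SpeciesName \
#     | rows_to_curl > lactobacillus_ftp.curl_commands.sh
```

The output is the same kind of script the perl one-liner wrote, so the rest of the workflow stays unchanged.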

Another file that may be of interest to you is the assembly_summary.txt file located in the NCBI Genomes FTP path: http://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS. The files located here contain information about all of the assemblies but if you are interested in only, say, bacteria and RefSeq data you can use the smaller file located further down in the FTP tree: http://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt. Once you have picked the correct assembly_summary.txt file for your needs, you can further narrow down the data you want using standard unix commands and use wget for downloads as shown below:

## first download the assembly_summary.txt file
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt

## parse assembly_summary file to make a list of urls
cat assembly_summary.txt \
  | awk 'BEGIN{FS="\t";OFS="\t"} \
    ($8~/Lactobacillus/ && $11=="latest" && $12=="Complete Genome") \
    {print $20}' \
  | sed -r 's/(GC[AF]_[0-9.]*_.*$)/\1\/\1_genomic.fna.gz/g' \
  > url_list.txt

## download data
wget -i url_list.txt
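The parsing step above can also be done in awk alone, which avoids the sed regex. This is only a sketch: `summary_to_urls` is a made-up name, the column numbers are the same ones used above (8 = organism name, 11 = version status, 12 = assembly level, 20 = FTP path), and unlike the original it matches the genus at the start of the organism name and skips rows whose FTP path is "na".

```shell
# Hypothetical helper: read assembly_summary.txt on stdin, print one
# *_genomic.fna.gz URL per latest, complete genome of the given genus.
summary_to_urls() {
  awk -F '\t' -v genus="$1" '
    /^#/ { next }                             # skip the header/comment lines
    index($8, genus " ") == 1 && $11 == "latest" && $12 == "Complete Genome" && $20 != "na" {
      n = split($20, parts, "/")              # last path component = directory name
      print $20 "/" parts[n] "_genomic.fna.gz"
    }'
}

# Usage:
#   summary_to_urls Lactobacillus < assembly_summary.txt > url_list.txt
#   wget -i url_list.txt
```

Passing the genus as an argument means the same helper can be reused for other organisms without editing the awk program.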