more elegant way to bulk download genomes from the NCBI
5.3 years ago
Carambakaracho ★ 3.3k

Hi eutils/edirect specialists,

I wonder whether NCBI's edirect utils or the eutils API offer a more elegant, ideally one-step, solution to download genomes/assemblies based on a query, say all reference/representative genomes for Lactobacillus.

My current working solution is a pipe via esearch | esummary | xtract that builds tab-delimited output containing the FTP path, the accession, the assembly name, and the species name, which I pass through a perl command to build curl commands.

esearch \
    -db assembly \
    -query "Lactobacillus[orgn] AND complete+genome[assembly+level] + latest[filter]" \
    | esummary \
    | xtract -pattern DocumentSummary \
        -element FtpPath_RefSeq \
        -element AssemblyAccession \
        -element AssemblyName \
        -element SpeciesName \
    | perl -nwe 'chomp; @a = split(/\t/,$_); $a[3] =~ s/ /_/g; $g = $a[1] . "_" . $a[2] . "_genomic.fna.gz"; print "curl -L -o $a[3]_$g $a[0]/$g\n";' \
    >lactobacillus_ftp.curl_commands.sh

My goal is to avoid the obscure perl code - it is just difficult to hand over to beginners. Am I missing something, maybe via elink?
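For what it's worth, the perl step could probably be swapped for a short awk function, which tends to be easier to read for beginners. This is only a sketch: the function name `build_curl_commands` is made up here, it assumes the four tab-separated columns produced by the xtract call above (FTP path, accession, assembly name, species name), and the demo row uses a made-up species label.

```shell
# Turn rows of "ftp_path<TAB>accession<TAB>assembly_name<TAB>species_name"
# into one curl command per genome (same output shape as the perl one-liner).
build_curl_commands() {
  awk -F '\t' 'NF == 4 {                      # assumption: skip rows that lack a column
    species = $4
    gsub(/ /, "_", species)                   # spaces -> underscores in the species name
    file = $2 "_" $3 "_genomic.fna.gz"        # accession_assemblyname_genomic.fna.gz
    printf "curl -L -o %s_%s %s/%s\n", species, file, $1, file
  }'
}

# Offline demo on one hand-written row (species label is illustrative only)
printf 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/203/855/GCF_000203855.3_ASM20385v3\tGCF_000203855.3\tASM20385v3\tLactobacillus example\n' \
  | build_curl_commands
```

In the pipeline above, the function would take the place of the perl line, i.e. `... | xtract ... | build_curl_commands > lactobacillus_ftp.curl_commands.sh`.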

Thanks in advance

edirect Assembly NCBI • 4.0k views
5.3 years ago
Joe 21k

In short, I don't think so, but for genomes, ncbi-genome-download can do those kinds of queries.
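A rough sketch of what such a call might look like is below. The wrapper name `download_lactobacillus` is made up for illustration. The option names are the plural forms used by recent ncbi-genome-download releases, as far as I know; older releases used singular forms such as `--genus`, so check `ncbi-genome-download --help` for your installed version.

```shell
# Sketch: fetch complete Lactobacillus genomes from RefSeq with ncbi-genome-download.
# Assumption: plural option names as in recent releases; verify with --help.
download_lactobacillus() {
  ncbi-genome-download \
    --section refseq \
    --genera "Lactobacillus" \
    --assembly-levels complete \
    --formats fasta \
    "$@" \
    bacteria
}

# Preview what would be downloaded without fetching anything:
#   download_lactobacillus --dry-run
# Then run the actual download:
#   download_lactobacillus
```

Recent versions also seem to offer a `--refseq-categories` option for restricting the selection to reference or representative genomes, which would be closer to the original question, but I would confirm that against the tool's help output first.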


Bummer - I forgot about Kai Blin's tool... I remember reading through his repository, ncbi-genome-download, on GitHub.


Yeah, it's a good tool; it's my go-to recommendation for this kind of thing :)


Hi jrj.healey, I modified the title so your comment would qualify as an answer - if you care enough to move it :-)

5.2 years ago
vkkodali_ncbi ★ 3.8k

Do you need this to be a command-line tool? You can run this query on the NCBI Assembly portal and use the 'Download Assemblies' button to download the data, as shown in the screenshot below:

[Screenshot: NCBI Assembly portal search results with the 'Download Assemblies' button]

Note: Your query should actually be Lactobacillus[Organism] AND complete_genome[filter] AND latest[filter]; see the warning with the yellow triangle above the results list notifying you of the issue.

Another file that may be of interest to you is the assembly_summary.txt file located in the NCBI Genomes FTP path: http://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS. The files located here contain information about all of the assemblies but if you are interested in only, say, bacteria and RefSeq data you can use the smaller file located further down in the FTP tree: http://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt. Once you have picked the correct assembly_summary.txt file for your needs, you can further narrow down the data you want using standard unix commands and use wget for downloads as shown below:

## first download the assembly_summary.txt file
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt

## parse assembly_summary file to make a list of urls
cat assembly_summary.txt \
  | awk 'BEGIN{FS="\t";OFS="\t"} \
    ($8~/Lactobacillus/ && $11=="latest" && $12=="Complete Genome") \
    {print $20}' \
  | sed -r 's/(GC[AF]_[0-9.]*_.*$)/\1\/\1_genomic.fna.gz/g' \
  > url_list.txt

## download data
wget -i url_list.txt
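
If `sed -r` is not available (it is a GNU option), the filter and the URL building can probably be folded into a single awk call. This is a sketch, not part of the original answer: the function name `summary_to_urls` is made up, it assumes the same column layout used above (8 = organism name, 11 = version status, 12 = assembly level, 20 = FTP path), and it assumes that rows without an FTP path carry the placeholder `na`.

```shell
# Read an assembly_summary.txt on stdin and print one genomic FASTA URL per match.
# Optional first argument: genus (or any regex) to match in the organism name.
summary_to_urls() {
  awk -F '\t' -v genus="${1:-Lactobacillus}" '
    $8 ~ genus && $11 == "latest" && $12 == "Complete Genome" && $20 != "na" {
      n = split($20, parts, "/")     # last path component is the assembly directory name
      print $20 "/" parts[n] "_genomic.fna.gz"
    }'
}

# Usage (assembly_summary.txt downloaded as shown above):
#   summary_to_urls < assembly_summary.txt > url_list.txt
#   wget -i url_list.txt
```

Header lines in the summary file start with `#` and never have "latest" in column 11, so they drop out without special handling.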

Thanks for looking into the issue, vkkodali. Your screenshot showed me some new NCBI functionality I hadn't noticed yet! And you're totally right, this seems a really elegant solution. (You don't happen to know whether the archive contains a single file or separate files?) BTW, without quotes in the web form, both our queries yield the same result.

Your second solution is equally valid, others on my team prefer parsing the assembly_summary.txt over using the eutils, too.


you don't happen to know whether the archive contains a single file or separate files

The archive contains separate files, one for each genome. Specifically, it has the following files:

ncbi-genomes-2019-09-01/GCF_000203855.3_ASM20385v3_genomic.fna.gz
ncbi-genomes-2019-09-01/GCF_000014525.1_ASM1452v1_genomic.fna.gz
ncbi-genomes-2019-09-01/GCF_000011985.1_ASM1198v1_genomic.fna.gz
ncbi-genomes-2019-09-01/GCF_000008925.1_ASM892v1_genomic.fna.gz
....
ncbi-genomes-2019-09-01/md5checksums.txt
ncbi-genomes-2019-09-01/README.txt
report.txt