ncbi-genome-download: bacterial and viral genomes?
2.5 years ago
sunnykevin97 ▴ 990

Hi,

I found this tool, https://github.com/kblin/ncbi-genome-download, on GitHub for downloading bacterial genomes.

ncbi-genome-download bacteria

It downloads directories such as refseq/bacteria/GCF_940077525.1

How do I get FASTA files from these GCF_* directories?

Then I tried this command, but it did not download the FASTA files:

ncbi-genome-download --formats fasta bacteria --parallel 16

WARNING: Skipping entry, as it has no ftp directory listed: 'GCF_023646435.1'

In addition, I tried to download the sequences directly from the NCBI FTP site.

https://ftp.ncbi.nlm.nih.gov/refseq/release/README

wget ftp://ftp.ncbi.nih.gov/refseq/release/bacteria/ 
This generates an index.html file.

The HTML file contains all the FTP links for the bacterial genomes.

How do I download the sequences using the index.html file? Is there an easy way to download the bacterial and viral genomes?

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<title>Index of /refseq/release/bacteria on ftp.ncbi.nih.gov:21</title>
</head>
<body>
<h1>Index of /refseq/release/bacteria on ftp.ncbi.nih.gov:21</h1>
<hr>
<pre>
  2022 May 05 22:46  File        <a href="ftp://ftp.ncbi.nih.gov:21/refseq/release/bacteria/bacteria.1.1.genomic.fna.gz">bacteria.1.1.genomic.fna.gz</a>  (112297068 bytes)
  2022 May 05 22:46  File        <a href="ftp://ftp.ncbi.nih.gov:21/refseq/release/bacteria/bacteria.1.genomic.gbff.gz">bacteria.1.genomic.gbff.gz</a>  (90133141 bytes)
  2022 May 05 22:46  File        <a href="ftp://ftp.ncbi.nih.gov:21/refseq/release/bacteria/bacteria.10.1.genomic.fna.gz">bacteria.10.1.genomic.fna.gz</a>  (2647137 bytes)
  2022 May 05 22:46  File        <a href="ftp://ftp.ncbi.nih.gov:21/refseq/release/bacteria/bacteria.10.genomic.gbff.gz">bacteria.10.genomic.gbff.gz</a>  (2242108 bytes)
  2022 May 05 22:45  File        <a href="ftp://ftp.ncbi.nih.gov:21/refseq/release/bacteria/bacteria.100.1.genomic.fna.gz">bacteria.100.1.genomic.fna.gz</a>  (94215320 bytes)
  2022 May 05 22:46  File        <a href="ftp://ftp.ncbi.nih.gov:21/refseq/release/bacteria/bacteria.100.genomic.gbff.gz">bacteria.100.genomic.gbff.gz</a>  (82718913 bytes)
  2022 May 05 22:46  File        <a href="ftp://ftp.ncbi.nih.gov:21/refseq/release/bacteria/bacteria.1000.1.genomic.fna.gz">bacteria.1000.1.genomic.fna.gz</a>  (120745492 bytes)
  2022 May 05 22:46  File        <a href="ftp://ftp.ncbi.nih.gov:21/refseq/release/bacteria/bacteria.1000.genomic.gbff.gz">bacteria.1000.genomic.gbff.gz</a>  (104026228 bytes)
  2022 May 05 22:46  File        <a href="ftp://ftp.ncbi.nih.gov:21/refseq/release/bacteria/bacteria.1001.1.genomic.fna.gz">bacteria.1001.1.genomic.fna.gz</a>  (120374412 bytes)
  2022 May 05 22:46  File        <a href="ftp://ftp.ncbi.nih.gov:21/refseq/release/bacteria/bacteria.1001.genomic.gbff.gz">bacteria.1001.genomic.gbff.gz</a>  (102935563 bytes)
  2022 May 05 22:46  File        <a href="ftp://ftp.ncbi.nih.gov:21/refseq/release/bacteria/bacteria.1002.1.genomic.fna.gz">bacteria.1002.1.genomic.fna.gz</a>  (116037436 bytes)
  2022 May 05 22:46  File        <a href="ftp://ftp.ncbi.nih.gov:21/refseq/release/bacteria/bacteria.1002.genomic.gbff.gz">bacteria.1002.genomic.gbff.gz</a>  (100084701 bytes)
  2022 May 05 22:45  File        <a href="ftp://ftp.ncbi.nih.gov:21/refseq/release/bacteria/bacteria.1003.1.genomic.fna.gz">bacteria.1003.
protein gene genome • 2.0k views
2.5 years ago
Mensur Dlakic ★ 28k

There is always a way directly from the FTP site:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/940/077/525/GCF_940077525.1_ASM94007752v1/

You likely need the file that ends in genomic.fna.gz.
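
The directory layout behind that link follows from the accession itself: the nine digits are split into three groups of three. A minimal sketch, assuming you already know both the accession and the assembly name (here GCF_940077525.1 and ASM94007752v1, taken from the link above). The function name assembly_fna_url is made up for illustration and is not part of any NCBI tool:

```sh
# Build the URL of the *_genomic.fna.gz file for a single assembly.
# Layout: genomes/all/<GCF|GCA>/<3 digits>/<3 digits>/<3 digits>/<accession>_<name>/
# assembly_fna_url is a hypothetical helper name.
assembly_fna_url() {
    acc=$1                      # e.g. GCF_940077525.1
    name=$2                     # e.g. ASM94007752v1
    prefix=${acc%%_*}           # "GCF" or "GCA"
    digits=${acc#*_}            # "940077525.1"
    digits=${digits%%.*}        # "940077525"
    p1=$(printf '%s' "$digits" | cut -c1-3)
    p2=$(printf '%s' "$digits" | cut -c4-6)
    p3=$(printf '%s' "$digits" | cut -c7-9)
    dir="${acc}_${name}"
    printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/%s/%s_genomic.fna.gz\n' \
        "$prefix" "$p1" "$p2" "$p3" "$dir" "$dir"
}

assembly_fna_url GCF_940077525.1 ASM94007752v1
```

If you do not know the assembly name, the assembly_summary.txt files on the same FTP site list the full directory for each accession in their ftp_path column, so there is no need to guess it.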

You may want to try this program:

https://github.com/pirovc/genome_updater

Beware that 16 parallel downloads will likely cause NCBI to throttle your IP address, as you may be considered to be abusing their resources. Sometimes 4-8 parallel downloads will get the job done faster than 16.
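
As for the index.html listing in the question, the links can be pulled out with grep and fetched a few at a time. This is only a rough sketch: the function names list_fna_links and fetch_links are made up, the default of 4 parallel jobs is an assumption based on the advice above, and the files in that release directory look like numbered chunks rather than one file per genome.

```sh
# Print every link in an FTP index page that ends in genomic.fna.gz.
# list_fna_links is a hypothetical helper name.
list_fna_links() {
    grep -o 'href="[^"]*genomic\.fna\.gz"' "$1" |
        sed -e 's/^href="//' -e 's/"$//'
}

# Download the URLs read from stdin, at most $1 at a time (default 4).
# -c lets wget resume partial files if the run is interrupted.
fetch_links() {
    xargs -n 1 -P "${1:-4}" wget -q -c
}

# Usage (not run here):
#   list_fna_links index.html | fetch_links 4
```

Keeping the job count as an argument makes it easy to drop to 2 or 1 if the transfers start stalling.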


The FTP link you pointed to is for a single bacterial genome, is that right?

Can't we get access to an FTP file in .gz with all bacterial and viral genomes?

I'm unable to download such data directly to my server. After a few minutes, the program stops without showing an error message.

./genome_updater.sh -d "refseq" -g "archaea,bacteria,fungi,viral" -f "genomic.fna.gz" -o "arc_bac_fun_vir_refseq_cg" -t 8 -m 

-------------------------------------------
Mode: UPDATE 
Args: -d 'refseq' -f 'genomic.fna.gz' -g 'archaea,bacteria,fungi,viral' -o 'arc_bac_fun_vir_refseq_cg' -t '8' -m
Outp: /data/Mcology/sun/softwares/busc/busco/bin/genomeupdater/arc_bac_fun_vir_refseq_cg/
-------------------------------------
Checking for missing files in the current version [2022-06-08_20-55-26]
 - 262275 missing files
Downloading 262275 files with 8 threads

Of course it is possible - that is what the tool is meant for. I just did an incremental update of my archaeal database.

genome_updater.sh -d "refseq" -g "archaea" -c "all" -l "Complete Genome" -f "genomic.fna.gz" -o genomes/archaea/sketch/reference -t 10 -u -m -a

----------------------------------------
      genome_updater version: 0.2.2
----------------------------------------
Mode: UPDATE - DOWNLOAD
Working directory: /home/xxx/
----------------------------------------
Checking for missing files in the current version [2021-10-11_13-52-39]
 - None

Checking for extra files [2021-10-11_13-52-39]
 - None

Downloading assembly summary [2022-06-08_14-12-20]
 - 812/1245 entries removed [RefSeq category: all, Assembly level: Complete Genome, Version status: latest]
 - 433 entries available

Linking versions [2021-10-11_13-52-39 --> 2022-06-08_14-12-20]
 - Done.

Updating [2021-10-11_13-52-39 --> 2022-06-08_14-12-20]
 - 4 updated, 1 deleted, 42 new entries
 - UPDATE: Deleting 4 files
 - UPDATE: Downloading 4 files with 10 threads
 - 4/4 files successfully downloaded
 - DELETE: Deleting 1 files
 - NEW: Downloading 42 files with 10 threads
 - 42/42 files successfully downloaded
 - Assembly accession report written [/home/xxx/xxx/xxx/updated_assembly_accession.txt]

Setting new version [2022-06-08_14-12-20]
 - Done.

Downloading current Taxonomy database [/home/xxx/xxx/xxx/taxdump.tar.gz]
 - Done

# 433/433 files successfully obtained
# Log file: /home/xxxr/xxx.log
# Finished! Current version: /home/xxx/xxx/xxx

genome_updater.sh works well @ Mensur, good post.


"Can't we get access to an FTP file in .gz with all bacterial and viral genomes?"

No, you can't.

This is where the datasets tool from NCBI can come in handy. Here is an example showing all viral genomes: https://www.ncbi.nlm.nih.gov/datasets/genomes/?taxon=10239. This example is for viewing only. You will need the command-line datasets tool for the actual download, since the web tool is limited to 1000 genomes.
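
A command-line call for the same taxon might look roughly like this. Treat it as a sketch: the flags (--include genome, --filename) are assumed from recent datasets releases and have changed between versions, so check `datasets download genome taxon --help` first. The names run and fetch_taxon_genomes are made-up helpers, and DRY_RUN is a made-up switch that prints the command instead of running it.

```sh
# Run a command, or only print it when DRY_RUN is set.
# run and DRY_RUN are hypothetical helpers for checking the command first.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Download genome FASTA for all assemblies under an NCBI taxon ID.
# $1 = taxon ID (10239 is the viral taxon from the link above)
# $2 = name of the zip archive to write
# Flags are an assumption; confirm them against your datasets version.
fetch_taxon_genomes() {
    run datasets download genome taxon "$1" --include genome --filename "$2"
}

# Usage (not run here):
#   DRY_RUN=1 fetch_taxon_genomes 10239 viral_genomes.zip   # print only
#   fetch_taxon_genomes 10239 viral_genomes.zip             # real download
```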
