Question

How to download COMPLETE bacterial genomes from NCBI based on a list of names?

1

7.8 years ago

taojincs ▴ 50

I have a long list of names of completely sequenced bacterial organisms (more than 100,000, so searching and downloading them one by one is impossible). The format is one name per line. I need to download the GCA (it must be GCA, not GCF) fasta files of the corresponding genomes from https://www.ncbi.nlm.nih.gov/genome/browse/ (with Levels set to Complete).

I have to do this from the command line. How can I do it efficiently? Thank you.

search ncbi • 4.7k views

ADD COMMENT • link updated 7.5 years ago by arsilan324 ▴ 90 • written 7.8 years ago by taojincs ▴ 50

0

This code didn't generate anything for me. Also it didn't give me any error. Did you manage to solve the issue?

ADD REPLY • link 7.5 years ago by arsilan324 ▴ 90

0

My answer: C: How to retrieve single protein fasta file for multiple species?

ADD REPLY • link 7.5 years ago by GenoMax 153k

score 5 · Accepted Answer · 2017-10-27

5

7.8 years ago

5heikki 11k

cat species.txt
Porphyromonas levii
Porphyromonas somerae

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt

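# For each name, take the FTP directory (column 20) of every "Complete Genome" assembly (column 12)
# whose organism name (column 8) starts with that name, then build one wget command per directory and run it.
IFS=$'\n'; for next in $(cat species.txt); do awk -v SPECIES=^"$next" 'BEGIN{FS="\t"}{if($8 ~ SPECIES && $12=="Complete Genome"){print $20}}' assembly_summary.txt \
    | awk 'BEGIN{OFS=FS="/"}{print "wget "$0,$NF"_genomic.fna.gz"}'; done \
    | sh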
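
For a list with more than 100,000 names, a dry run may help before anything is downloaded. The following is only a sketch, not part of the original answer: it assumes the same column layout as the command above (column 8 organism name, column 12 assembly level, column 20 FTP directory), and the function name list_complete_urls is hypothetical. It reads the name list once, scans the summary file once instead of once per name, and prints URLs instead of running wget. It also compares names literally with index() rather than as a regular expression, so characters such as "." or "(" in a name are not treated as pattern syntax.

```shell
#!/bin/sh
# Sketch: list download URLs for "Complete Genome" assemblies whose organism
# name starts with one of the names in a list. Nothing is downloaded here.
# list_complete_urls is a hypothetical helper name, not an NCBI tool.
list_complete_urls() {
    # $1 = file with one name per line, $2 = assembly_summary.txt
    awk -F '\t' '
        # First file: remember every non-empty name.
        NR == FNR { if ($0 != "") names[$0] = 1; next }
        # Second file: skip header lines that start with "#".
        /^#/ { next }
        $12 == "Complete Genome" {
            for (n in names) {
                # index() == 1 means the organism name starts with n (literal match).
                if (index($8, n) == 1) {
                    k = split($20, parts, "/")
                    print $20 "/" parts[k] "_genomic.fna.gz"
                    break
                }
            }
        }
    ' "$1" "$2"
}
```

Possible usage, assuming both files are in the current directory: list_complete_urls species.txt assembly_summary.txt > urls.txt, then inspect urls.txt and, if it looks right, run wget -i urls.txt. An empty urls.txt means that no name matched a "Complete Genome" assembly.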

NOTE: Only 8,413 bacterial genomes have the "Complete Genome" assembly level, which is not even 10% of the number of names on your list. For example, nothing will be downloaded for the two species in the example shown above. Do you really need to limit yourself to such a small subset? The counts per assembly level are:

  1577 Chromosome
  8413 Complete Genome
  52594 Contig
  54565 Scaffold
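
A tally like the one above can be produced from assembly_summary.txt itself. This is a minimal sketch, assuming the assembly level is in column 12 as in the answer's awk command; the function name count_levels is hypothetical, and the numbers will differ from those listed here because the summary file changes over time.

```shell
#!/bin/sh
# Sketch: count assemblies per assembly level (column 12) in an
# assembly_summary.txt-style file. count_levels is a hypothetical name.
count_levels() {
    # Skip "#" header lines, keep column 12, then count identical values.
    grep -v '^#' "$1" | cut -f 12 | sort | uniq -c
}
```

Possible usage: count_levels assembly_summary.txt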

ADD COMMENT • link 7.8 years ago by 5heikki 11k

0

This didn't download the fasta file in my directory. Nothing happened. Could you please double check it?

ADD REPLY • link 7.8 years ago by taojincs ▴ 50

0

You need your list of species in the same directory where you run it. In my example the list is called species.txt. Modify accordingly.

ADD REPLY • link 7.8 years ago by 5heikki 11k

0

Hi, if I want to download proteomes for organisms that have been completely sequenced, is it enough to change "wget "$0,$NF"_genomic.fna.gz" to "wget "$0,$NF"_protein.faa.gz"? What I downloaded seems much larger than it should be: I got 11.9 GB of sequence data for 50 given organisms, while someone else who worked on the same list downloaded 68 MB. Thanks.
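
One thing that may be worth checking here (an assumption, not a confirmed cause of the size difference): the pattern ^name in the accepted answer matches every organism name that begins with the listed name, so a single list entry can select many assemblies, for example many strains of one species. The sketch below counts how many "Complete Genome" rows each name would select. The function name count_matches_per_name is hypothetical, and the column numbers are the ones used in the answer.

```shell
#!/bin/sh
# Sketch: for each name in a list, count the "Complete Genome" rows whose
# organism name (column 8) starts with that name.
# count_matches_per_name is a hypothetical helper name.
count_matches_per_name() {
    # $1 = file with one name per line, $2 = assembly_summary.txt
    awk -F '\t' '
        # First file: start every non-empty name at zero matches.
        NR == FNR { if ($0 != "") names[$0] = 0; next }
        # Second file: skip header lines that start with "#".
        /^#/ { next }
        $12 == "Complete Genome" {
            for (n in names) if (index($8, n) == 1) names[n]++
        }
        # Print "count<TAB>name" for every name, including names with 0 matches.
        END { for (n in names) print names[n] "\t" n }
    ' "$1" "$2" | sort -k1,1nr
}
```

Possible usage: count_matches_per_name species.txt assembly_summary.txt | head. Names with large counts at the top of the output would each trigger that many downloads.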

ADD REPLY • link 7.6 years ago by taojincs ▴ 50