I have a long list of complete bacterial organism names (more than 100,000, so searching for and downloading them one by one is not feasible). The format is one name per line. I need to download the GCA (it must be GCA, not GCF) FASTA files of the corresponding genomes from https://www.ncbi.nlm.nih.gov/genome/browse/ (with Levels set to Complete).
I have to do this from the command line. How can I do it efficiently? Thank you.
cat species.txt
Porphyromonas levii
Porphyromonas somerae
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
IFS=$'\n'; for next in $(cat species.txt); do awk -v SPECIES=^"$next" 'BEGIN{FS="\t"}{if($8 ~ SPECIES && $12=="Complete Genome"){print $20}}' assembly_summary.txt \
| awk 'BEGIN{OFS=FS="/"}{print "wget "$0,$NF"_genomic.fna.gz"}'; done \
| sh
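With more than 100,000 names, the loop above rescans assembly_summary.txt once per name, which can be slow. Below is a minimal sketch of a single-pass alternative, not a drop-in replacement: the function name list_genome_urls is made up for illustration, and it matches names on whole-word prefixes of column 8 rather than by regex, so the results may differ slightly from the loop above.

```shell
# list_genome_urls SPECIES_FILE SUMMARY_FILE   (hypothetical helper name)
# Prints one URL per "Complete Genome" assembly whose organism name
# (column 8) begins with a name listed in SPECIES_FILE.
list_genome_urls() {
    awk -F '\t' '
        # First file: remember every non-empty species name.
        NR == FNR { if ($0 != "") wanted[$0] = 1; next }
        # Second file: skip the "#" header lines.
        /^#/ { next }
        $12 == "Complete Genome" && $20 != "na" {
            # Build word-by-word prefixes of the organism name and
            # stop at the first prefix found in the species list.
            n = split($8, w, " ")
            key = w[1]
            for (i = 1; i <= n; i++) {
                if (i > 1) key = key " " w[i]
                if (key in wanted) {
                    # The file name repeats the last path component.
                    m = split($20, p, "/")
                    print $20 "/" p[m] "_genomic.fna.gz"
                    break
                }
            }
        }
    ' "$1" "$2"
}
```

Usage would be something like: list_genome_urls species.txt assembly_summary.txt > urls.txt, then wget -i urls.txt. Writing the URLs to a file first lets you inspect the list (and its size) before downloading anything.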
NOTE: Only 8,413 bacterial genomes have the "Complete Genome" assembly level (fewer than 10% of the names in your list). For example, nothing will be downloaded for the two species shown above. Do you really need to limit yourself to such a small subset?
Hi, if I want to download proteomes for organisms that have been completely sequenced, should I only change "wget "$0,$NF"_genomic.fna.gz" to "wget "$0,$NF"_protein.faa.gz"? What I downloaded seems much larger than it should be: I got 11.9 GB of sequence data for 50 given organisms, while someone else who worked on the same list downloaded 68 MB. Thanks.
This code didn't generate anything for me. Also it didn't give me any error. Did you manage to solve the issue?
My answer: C: How to retrieve single protein fasta file for multiple species?