I know that this question is already 2 years old, but I hope that my answer might be useful to others anyway.
I implemented a standardized way to automate the genome retrieval process in R (see biomartr package).
To retrieve all bacterial reference genomes and corresponding CDS, proteome, and gff files from several database sources one can simply type:
# download all bacterial reference genomes from NCBI RefSeq
biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "genome")
# download all bacterial reference coding sequences from NCBI RefSeq
biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "CDS")
# download all bacterial reference proteomes from NCBI RefSeq
biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "proteome")
# download all bacterial reference gff files from NCBI RefSeq
biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "gff")
or
# download all bacterial reference genomes from NCBI GenBank
biomartr::meta.retrieval(kingdom = "bacteria", db = "genbank", type = "genome")
# download all bacterial reference coding sequences from NCBI GenBank
biomartr::meta.retrieval(kingdom = "bacteria", db = "genbank", type = "CDS")
# download all bacterial reference proteomes from NCBI GenBank
biomartr::meta.retrieval(kingdom = "bacteria", db = "genbank", type = "proteome")
# download all bacterial reference gff files from NCBI GenBank
biomartr::meta.retrieval(kingdom = "bacteria", db = "genbank", type = "gff")
For more details about downloading specific genomes from specific kingdoms or subkingdoms of life please consult the Meta-Genome Retrieval vignette.
Please note that to promote computational reproducibility in genomics and metagenomics studies, biomartr stores log files for each downloaded genome, proteome, or CDS file.
An example log file looks as follows:
File Name: Escherichia_coli_genomic_refseq.fna.gz
Organism Name: Escherichia_coli
Database: NCBI refseq
URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
Download_Date: Wed Feb 15 15:17:50 2017
refseq_category: reference genome
assembly_accession: GCF_000005845.2
bioproject: PRJNA57779
biosample: SAMN02604091
taxid: 511145
infraspecific_name: strain=K-12 substr. MG1655
version_status: latest
release_type: Major
genome_rep: Full
seq_rel_date: 2013-09-26
submitter: Univ. Wisconsin
I hope this helps.
Thank you. I downloaded the proteins, but after unzipping them I realised they are not FASTA files. I was expecting FASTA format; how can I use them to create a BLAST-searchable database?
The genomes are in FASTA format. Please see the BLAST manual to learn how to create a database.
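As a rough sketch of that last step in R: the helpers below decompress a downloaded .gz file, check that it looks like FASTA, and assemble the arguments for makeblastdb. This assumes NCBI BLAST+ is installed and makeblastdb is on your PATH. The helper names (decompress_fasta, looks_like_fasta, makeblastdb_args) and the file name in the usage comment are made up for illustration and are not part of biomartr.

```r
# Decompress a gzipped FASTA file using base R only.
# Reads the whole file into memory, which is fine for single proteomes.
decompress_fasta <- function(gz_path, out_path = sub("\\.gz$", "", gz_path)) {
  con <- gzfile(gz_path, open = "rt")
  on.exit(close(con))
  writeLines(readLines(con), out_path)
  out_path
}

# A FASTA file should start with a header line beginning with ">".
looks_like_fasta <- function(path) {
  first <- readLines(path, n = 1)
  length(first) == 1 && startsWith(first, ">")
}

# Assemble makeblastdb arguments: "prot" for proteomes, "nucl" for genomes or CDS.
makeblastdb_args <- function(fasta, dbtype = c("prot", "nucl"),
                             out = tools::file_path_sans_ext(fasta)) {
  dbtype <- match.arg(dbtype)
  c("-in", fasta, "-dbtype", dbtype, "-out", out)
}

# Usage (hypothetical file name; requires BLAST+ to be installed):
# fasta <- decompress_fasta("Escherichia_coli_protein_refseq.faa.gz")
# stopifnot(looks_like_fasta(fasta))
# system2("makeblastdb", makeblastdb_args(fasta, "prot"))
```

If looks_like_fasta() returns FALSE on a decompressed file, it is worth opening the file in a text editor to see what was actually downloaded before passing it to makeblastdb.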