Question

Where can I download NCBI and Swiss-Prot/UniProt FTP protein, gene, and genome sequences for bacterial genomes?

0

10.4 years ago

samuelksm • 0

I am trying to create a local database of bacterial protein, gene, and genome sequences. These will be kept as separate databases, but I cannot find the bacterial FTP files for the protein, gene, and genome sequences.

Does anyone know the actual download links?

blast • 5.0k views

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by samuelksm • 0

1

8.5 years ago

Hajk-Georg Drost ▴ 180

I know that this question is already 2 years old, but I hope that my answer might be useful to others anyway.

I implemented a standardized way to automate the genome retrieval process in R (see biomartr package).

To retrieve all bacterial reference genomes and corresponding CDS, proteome, and gff files from several database sources one can simply type:

# download all bacterial reference genomes from NCBI RefSeq
biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "genome")

# download all bacterial reference coding sequences from NCBI RefSeq
biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "CDS")

# download all bacterial reference proteomes from NCBI RefSeq
biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "proteome")

# download all bacterial reference gff files from NCBI RefSeq
biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "gff")

or

# download all bacterial reference genomes from NCBI Genbank
biomartr::meta.retrieval(kingdom = "bacteria", db = "genbank", type = "genome")

# download all bacterial reference coding sequences from NCBI Genbank
biomartr::meta.retrieval(kingdom = "bacteria", db = "genbank", type = "CDS")

# download all bacterial reference proteomes from NCBI Genbank
biomartr::meta.retrieval(kingdom = "bacteria", db = "genbank", type = "proteome")

# download all bacterial reference gff files from NCBI Genbank
biomartr::meta.retrieval(kingdom = "bacteria", db = "genbank", type = "gff")

For more details about downloading specific genomes from specific kingdoms or subkingdoms of life please consult the Meta-Genome Retrieval vignette.

Please note that to promote computational reproducibility in genomics and metagenomics studies, biomartr stores log files for each downloaded genome, proteome, or CDS file.

An example log file looks as follows:

File Name: Escherichia_coli_genomic_refseq.fna.gz

Organism Name: Escherichia_coli

Database: NCBI refseq

URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz

Download_Date: Wed Feb 15 15:17:50 2017

refseq_category: reference genome

assembly_accession: GCF_000005845.2

bioproject: PRJNA57779

biosample: SAMN02604091

taxid: 511145

infraspecific_name: strain=K-12 substr. MG1655

version_status: latest

release_type: Major

genome_rep: Full

seq_rel_date: 2013-09-26

submitter: Univ. Wisconsin
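The URL recorded in the log above follows a regular pattern: the nine digits of the assembly accession are split into three directory levels, followed by a directory named after the accession and the assembly name. The sketch below rebuilds such a URL in plain shell. The function names are hypothetical, and the file suffixes other than `genomic.fna.gz` (for example `protein.faa.gz` or `genomic.gff.gz`) are assumptions about the usual NCBI naming, so check them against the actual directory listing.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of any NCBI tool.

# Build the genomes/all directory URL for one assembly.
# $1 = assembly accession (e.g. GCF_000005845.2)
# $2 = assembly name      (e.g. ASM584v2)
ncbi_assembly_dir() {
    acc=$1
    name=$2
    prefix=${acc%%_*}        # GCF (RefSeq) or GCA (GenBank)
    digits=${acc#*_}         # 000005845.2
    digits=${digits%%.*}     # 000005845 (version suffix removed)
    p1=$(printf '%s' "$digits" | cut -c1-3)
    p2=$(printf '%s' "$digits" | cut -c4-6)
    p3=$(printf '%s' "$digits" | cut -c7-9)
    printf 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/%s_%s\n' \
        "$prefix" "$p1" "$p2" "$p3" "$acc" "$name"
}

# Build the URL of one file inside that directory.
# $3 = file suffix, e.g. genomic.fna.gz (assumed: protein.faa.gz, genomic.gff.gz)
ncbi_assembly_file() {
    dir=$(ncbi_assembly_dir "$1" "$2")
    printf '%s/%s_%s_%s\n' "$dir" "$1" "$2" "$3"
}
```

Calling `ncbi_assembly_file GCF_000005845.2 ASM584v2 genomic.fna.gz` reproduces the URL shown in the log, which can then be passed to a download tool of your choice.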

I hope this helps.

ADD COMMENT • link 8.5 years ago by Hajk-Georg Drost ▴ 180

Ram · Accepted Answer · 2015-04-10

3

10.4 years ago

Kamil ★ 2.3k

Check out the NCBI ftp site here: ftp://ftp.ncbi.nlm.nih.gov/

You can browse around for your specific files of interest.

Beware that there are a lot of bacterial genomes in "genomes/Bacteria" so the page will take a long time to load. You can see a summary of the genomes here:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/summary.txt

A 2.7 GB archive of FASTA files for all genomes: ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.fna.tar.gz

Extract the FASTA files from the archive:

tar xf all.fna.tar.gz
cd Wolbachia_wRi_uid59371/

head -n3 NC_012416.fna
>gi|225629872|ref|NC_012416.1| Wolbachia sp. wRi, complete genome
TGATCAATTTTAATGTTTTTATACCCTTTACAACCCATCAAAAAATCACCATAATTTTTAGTATGTATTA
AGTAGTATTAGCTTTTCATTTTGCAGTAAGCTATTGATTATCTTATATTTTTCTAATTATTGCTTTTTTC
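Since the archive extracts into one directory per genome, a local database usually needs the individual files merged first. A minimal sketch, assuming every extracted file ends with a newline and that the output file is written outside the directory being searched (both function names are hypothetical):

```shell
#!/bin/sh
# Sketch only: helper names are illustrative.

# Concatenate all files with a given extension into one multi-FASTA file.
# $1 = top-level directory, $2 = extension (e.g. fna), $3 = output file
# The output file must live outside $1, otherwise cat would read its own output.
combine_fasta() {
    find "$1" -type f -name "*.$2" -exec cat {} + > "$3"
}

# Count FASTA records in a file (each record starts with a ">" header line).
count_fasta_records() {
    grep -c '^>' "$1"
}
```

For example, `combine_fasta extracted fna all_genomes.fna` followed by `count_fasta_records all_genomes.fna` gives a quick sanity check on how many sequences ended up in the merged file, where `extracted` stands for whichever directory the archive was unpacked into.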

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by Kamil ★ 2.3k

0

Thank you. I downloaded the proteins, but on unzipping them I realised they are not FASTA. How can I use them to create a BLAST-searchable database? I thought they would be in FASTA format.

ADD REPLY • link 10.4 years ago by samuelksm • 0

0

The genomes are in FASTA format. Please see the BLAST manual to learn how to create a database.
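As a rough sketch of that step: the helper below first checks that a file looks like FASTA and only then calls `makeblastdb` from BLAST+ with its `-in`, `-dbtype`, and `-out` options. The function names and file names are hypothetical. If a file does not pass the check, one possible cause is that the download is a `.tar.gz` archive that was only decompressed and not yet unpacked with `tar`.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative; makeblastdb must be installed separately.

# Succeed if the first non-empty line of the file starts with ">".
looks_like_fasta() {
    first=$(awk 'NF { print; exit }' "$1")
    case $first in
        '>'*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Build a BLAST database from a FASTA file.
# $1 = FASTA file, $2 = nucl or prot, $3 = database name
build_blast_db() {
    if ! looks_like_fasta "$1"; then
        echo "$1 does not look like FASTA" >&2
        return 1
    fi
    makeblastdb -in "$1" -dbtype "$2" -out "$3"
}
```

With BLAST+ installed, `build_blast_db all_proteins.faa prot bacteria_prot` would create a protein database, and `build_blast_db all_genomes.fna nucl bacteria_nucl` a nucleotide one.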

ADD REPLY • link 10.4 years ago by Kamil ★ 2.3k