Download all complete genomes from GenBank in .gbk full format
8.0 years ago
mmart12 ▴ 30

Hi all, I would like to know if there's a way to download the complete GenBank file (description + genome sequence) for all the strains of one bacterial species in GenBank at once.

Thank you!

genome sequence • 4.4k views
8.0 years ago
5heikki 11k

At once, no. Programmatically, yes. See the ftpfaq and pay special attention to "assembly summary" files.


To elaborate a bit on 5heikki's suggestion: the following commands, for example, should download all complete genome assemblies for E. coli (taxid: 562):

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
awk -v taxid=562 -v status="Complete Genome" -F $'\t' '$7==taxid && $12==status {print $20 "/" $1 "_" $16 "_genomic.gbff.gz"}' assembly_summary.txt | xargs wget
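A variation on the commands above, as a minimal sketch: the function name `list_gbff_urls` is hypothetical, and the column positions ($7 species taxid, $12 assembly level, $20 FTP directory) are taken from the awk command above. It assumes that the files inside each assembly directory are named after the directory itself, so it builds the file name from the last component of the FTP path instead of from the accession and assembly name columns. That should avoid mismatches when an assembly name contains spaces or other characters that are written differently in the path.

```shell
# Print download URLs for all complete genomes of one species.
# Assumed columns: $7 = species taxid, $12 = assembly level, $20 = FTP directory.
list_gbff_urls() {
  summary_file=$1
  taxid=$2
  awk -F '\t' -v taxid="$taxid" -v status="Complete Genome" '
    $7 == taxid && $12 == status && $20 != "na" {
      n = split($20, parts, "/")   # last path component is the file prefix
      print $20 "/" parts[n] "_genomic.gbff.gz"
    }' "$summary_file"
}
```

Usage would look like `list_gbff_urls assembly_summary.txt 562 | xargs -n 1 wget`, after downloading assembly_summary.txt as shown above.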

There is a pretty decent R interface for downloading these data in the ape package. Check out this tutorial

8.0 years ago
Tonor ▴ 480

There is a way to do it "manually", although I wouldn't recommend it if the species has a lot of complete genomes.

For example, take Escherichia coli O157:H7, which has NCBI taxonomy ID 83334:

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=83334&lvl=3&lin=f&keep=1&srchmode=1&unlock

Click on "Nucleotide" under "Subtree links" in the Entrez Records table to view all nucleotide sequences assigned to this taxon and all of its children. That is what "Subtree links" means; "Direct links" would give you only the sequences assigned to this taxon itself, not its children.

This will take you to the NCBI Nucleotide database, with all the Escherichia coli O157:H7 nucleotide sequences displayed:

https://www.ncbi.nlm.nih.gov/nuccore/?term=txid83334%5BOrganism%3Aexp%5D

You want only complete genomes, so add "complete genome" as a title filter to the search term:

https://www.ncbi.nlm.nih.gov/nuccore/?term=txid83334%5BOrganism%3Aexp%5D+AND+(complete+genome+%5BTitle%5D)

A quick way to do all of the above is to find the taxon ID for your species, go to NCBI Nucleotide, and type "txid83334[Organism:exp] AND (complete genome [Title])" into the search box, replacing 83334 with whatever taxon ID you need.
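The search term and the corresponding URL can also be generated for any taxon ID. The following is only a sketch: the function names `build_term` and `nuccore_url` are made up for illustration, and the encoding step handles just the characters that occur in this particular term rather than doing general URL encoding.

```shell
# Build the Entrez search term for one taxon ID (hypothetical helper).
build_term() {
  printf 'txid%s[Organism:exp] AND (complete genome [Title])\n' "$1"
}

# Encode only the characters present in that term ([, ], : and space)
# and prepend the NCBI Nucleotide search URL.
nuccore_url() {
  encoded=$(build_term "$1" |
    sed -e 's/\[/%5B/g' -e 's/]/%5D/g' -e 's/:/%3A/g' -e 's/ /+/g')
  printf 'https://www.ncbi.nlm.nih.gov/nuccore/?term=%s\n' "$encoded"
}
```

Calling `nuccore_url 83334` should print the same URL as the one given above for Escherichia coli O157:H7. If NCBI's EDirect tools are installed, the output of `build_term` could presumably be passed to `esearch -db nuccore -query` and piped into `efetch -format gbwithparts`, but that route is not tested here.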

This will return (as of Nov 2016) 41 complete genomes for taxon 83334. On the top right, click "Summary" and select "GenBank (full)". Then, on the top left, click "Send", choose "Complete Record" and "File", and select the format you want, e.g. "GenBank (full)" or "XML".

This will give you all complete genomes of a taxon/species - not necessarily one per strain.


Note that all the RefSeq genomes are derived from GenBank genomes, i.e. if you fetch those 41 genomes, you have basically downloaded each genome twice. Also, the submitter decides during submission whether an assembly is complete or not. In practice, assemblies with complete, chromosome, scaffold and contig status don't necessarily differ that much from each other. E.g. O157 assembly sizes are quite similar (contig counts are another thing though):

[Image not preserved: comparison of O157 assembly sizes]


It won't be every genome twice, only those that have a RefSeq copy, and you can easily filter out the RefSeq records based on their accession numbers, which have an underscore ( _ ) between the prefix and the digits: https://support.ncbi.nlm.nih.gov/link/portal/28045/28049/Article/502/What-are-Reference-Sequence-RefSeq-accession-numbers-and-what-information-is-embedded-in-their-format
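As a minimal sketch of that filter, assuming you have a list of accession numbers, one per line, and that RefSeq accessions always start with two capital letters followed by an underscore (the function name `drop_refseq` is hypothetical):

```shell
# Read accessions on stdin and keep only those without a RefSeq-style
# prefix (two capital letters followed by an underscore, e.g. NC_ or NZ_).
drop_refseq() {
  grep -v '^[A-Z][A-Z]_'
}
```

For example, `printf 'NC_002695.2\nBA000007.3\n' | drop_refseq` keeps only the second line. The accessions here are only there to illustrate the two formats.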

8.0 years ago
natasha.sernova ★ 4.0k

See my answer to this post: where can I get environmental bacteria genome in fasta format (as many as possible)?

There is an old archived copy of the RefSeq bacterial genomes at NCBI, from which you can download all the gbk files at once as a gz file.

After unpacking it, you can select the gbk files for the strains you need.

ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/
