Hi all, I would like to know if there's a way to download the complete genbank file (description + genome sequence) for all the strains of one bacterial species in Genbank at once.
Thank you!
There is a way to do it "manually", although I wouldn't recommend it if the species has a lot of complete genomes.
Take, for example, Escherichia coli O157:H7, which has the NCBI taxonomy ID 83334.
On the taxon's page in the NCBI Taxonomy browser, find the "Entrez records" table and click the "Subtree links" count in the "Nucleotide" row. This shows all nucleotide sequences assigned to this taxon and to all of its children (that's what "Subtree links" means; "Direct links" would give you only the sequences assigned to this taxon itself, not to its children).
This will take you to the NCBI Nucleotide database, with all the Escherichia coli O157:H7 nucleotide sequences displayed:
https://www.ncbi.nlm.nih.gov/nuccore/?term=txid83334%5BOrganism%3Aexp%5D
You want only complete genomes, so add "Complete Genome" in the entry title to the search criteria.
A quick way to do all of the above is to find the taxon ID for your species, go to NCBI Nucleotide, and type "txid83334[Organism:exp] AND (complete genome [Title])" into the search box, replacing 83334 with whatever taxon ID you need.
This returns (as of Nov 2016) 41 complete genomes for taxon 83334. To preview the records, click "Summary" at the top left and select "GenBank (full)". To download them, click "Send to" at the top right, choose "Complete Record", set the destination to "File", and pick the format you want, e.g. "GenBank (full)" or "XML".
This will give you all complete genomes of a taxon/species - not necessarily one per strain.
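If you'd rather script the search than click through the website, something along these lines should work with NCBI's E-utilities. This is only a rough sketch: it just builds the esearch/efetch URLs for the query above, and the function names are made up for illustration. It assumes you run esearch with usehistory=y, read WebEnv and query_key from the XML it returns, and then page through efetch with rettype=gbwithparts (the full GenBank record including sequence).

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_term(taxid: int) -> str:
    """Same search term as typed into the Nucleotide search box above."""
    return f"txid{taxid}[Organism:exp] AND (complete genome[Title])"


def esearch_url(taxid: int) -> str:
    """URL that runs the search and stores the hits on the history server."""
    params = {
        "db": "nuccore",
        "term": build_term(taxid),
        "usehistory": "y",  # keep results server-side for efetch
        "retmax": 0,        # we only need WebEnv/query_key, not the ID list
    }
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(webenv: str, query_key: str, retstart: int = 0, retmax: int = 20) -> str:
    """URL that downloads one batch of full GenBank records from a stored search."""
    params = {
        "db": "nuccore",
        "query_key": query_key,
        "WebEnv": webenv,
        "rettype": "gbwithparts",  # GenBank flat file with the sequence included
        "retmode": "text",
        "retstart": retstart,
        "retmax": retmax,
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


print(esearch_url(83334))
```

Fetching in small batches (retstart/retmax) and pausing between requests is kinder to the NCBI servers than asking for everything in one go.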
Note that all the RefSeq genomes are derived from GenBank genomes, i.e. if you fetch those 41 genomes, you have basically downloaded each genome twice. Also, the submitter decides during submission whether an assembly is complete or not. IRL, assemblies with complete, chromosome, scaffold and contig status don't necessarily differ that much from each other. E.g. O157 assembly sizes are quite similar (contig counts are another thing though):
It won't be every genome twice, only the genomes that also have a RefSeq copy, and you can easily filter out the RefSeq records based on their accession numbers: they have an underscore ( _ ) between the prefix and the digits: https://support.ncbi.nlm.nih.gov/link/portal/28045/28049/Article/502/What-are-Reference-Sequence-RefSeq-accession-numbers-and-what-information-is-embedded-in-their-format
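A minimal sketch of that underscore filter in Python, in case it helps. The helper names and the accessions in the example list are just for illustration; the only assumption is that RefSeq accessions start with a two-letter prefix followed by an underscore (NC_, NZ_, ...), while GenBank accessions have no underscore.

```python
import re

# RefSeq accessions begin with two capital letters and an underscore.
_REFSEQ_PREFIX = re.compile(r"^[A-Z]{2}_")


def is_refseq(accession: str) -> bool:
    """True if the accession looks like a RefSeq accession."""
    return bool(_REFSEQ_PREFIX.match(accession))


def drop_refseq(accessions):
    """Keep only the non-RefSeq (GenBank) accessions, preserving order."""
    return [acc for acc in accessions if not is_refseq(acc)]


# Illustrative accessions only.
accessions = ["NC_002695.2", "BA000007.3", "NZ_CP008957.1", "CP008957.1"]
print(drop_refseq(accessions))
```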
See my answer to this post: where can I get environmental bacteria genome in fasta format (as many as possible)?
There is an archived copy of the old NCBI RefSeq bacterial genomes, where you can download all the gbk files at once as a gz file.
After opening it you can select gbk-files for strains you need.
ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/
To elaborate a bit on 5heikki's suggestion: the following commands, for example, should download all complete genome assemblies for E. coli (taxid: 562)
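For anyone who prefers Python, here is a rough sketch of one way to get the same kind of list. It is not the commands referred to above, just an illustration. It assumes you have already downloaded NCBI's bacterial assembly_summary.txt (tab-separated, with a "#"-prefixed header line naming columns such as species_taxid, assembly_level, version_status and ftp_path), and that each assembly directory contains a file named <directory name>_genomic.gbff.gz. The function name is made up.

```python
import csv


def complete_genome_urls(summary_text: str, species_taxid: int) -> list:
    """Return GenBank flat-file URLs for the latest complete genomes of a species."""
    lines = summary_text.splitlines()
    # The column names sit on a commented header line.
    header = next(l for l in lines if l.startswith("#") and "assembly_accession" in l)
    fields = header.lstrip("#").strip().split("\t")
    data = (l for l in lines if not l.startswith("#"))
    rows = csv.DictReader(data, fieldnames=fields, delimiter="\t", quoting=csv.QUOTE_NONE)

    urls = []
    for row in rows:
        # species_taxid groups all strains of a species under one ID.
        if row.get("species_taxid") != str(species_taxid):
            continue
        if row.get("assembly_level") != "Complete Genome":
            continue
        if row.get("version_status") != "latest":
            continue
        ftp_path = row.get("ftp_path") or ""
        if ftp_path in ("", "na"):
            continue
        # The file name repeats the last component of the directory path.
        name = ftp_path.rstrip("/").rsplit("/", 1)[-1]
        urls.append(f"{ftp_path.rstrip('/')}/{name}_genomic.gbff.gz")
    return urls
```

You would call it as complete_genome_urls(open("assembly_summary.txt").read(), 562) and hand the resulting URLs to wget, curl or whatever downloader you like.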
There is a pretty decent R interface for downloading these data in the ape package. Check out this tutorial