I had to write to NCBI about this.
Here is my recipe, adapted from Case 1 in this document: ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf
-- at Bash/Mac OSX prompt in the desired directory:
curl 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt' | \
awk '{FS="\t"} \!/^#/ {print $20} ' | \
sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)(GCA_.+)|\1\2/\2_genomic.fna.gz|' > genomic_file
-- final command, in the same directory, where you want to install the files:
wget -i genomic_file
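The URL-building part of the recipe can be tried offline before committing to the full download. The following is a minimal sketch, not the factsheet's own code: the function names and the sample line are made up for illustration, and -E is used in place of -r because both GNU and BSD sed accept it.

```shell
# Hypothetical helper: read an assembly_summary.txt on stdin and print one
# *_genomic.fna.gz URL per assembly, following the recipe above.
build_genome_urls() {
    awk -F '\t' '!/^#/ {print $20}' |
        sed -E 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)(GCA_.+)|\1\2/\2_genomic.fna.gz|'
}

# Made-up sample row: 19 placeholder columns, then an ftp path in column 20.
sample_line() {
    i=1
    while [ "$i" -le 19 ]; do
        printf 'col%s\t' "$i"
        i=$((i + 1))
    done
    printf '%s\n' "$1"
}

{
    printf '# comment line, skipped by awk\n'
    sample_line 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000000000.1_ASM0v1'
} | build_genome_urls
# prints ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000000000.1_ASM0v1/GCA_000000000.1_ASM0v1_genomic.fna.gz
```

Piping the real assembly_summary.txt through build_genome_urls and looking at the first few lines with head is a cheap way to confirm the URLs look right before handing the list to wget -i.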
GenBank is where all current complete and incomplete sequences have been stored and updated since Dec 12, 2015. If you want a different taxonomic branch, look at the NCBI ftp site (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank) and replace "bacteria" in the ftp address above with the folder you'd like. To get only RefSeq genomes (complete and canonically curated sequences), replace "genbank" with "refseq" in the curl command and "GCA" with "GCF" in the sed command. The factsheet describes the same switch in the opposite direction: "NOTE: if you need the assembly submitted to GenBank, you will need to change the curl command’s “refseq” to “genbank”. Since these assembly’s accession initial are different, you will need change sed command’s “GCF” to “GCA”."
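The two substitutions described above (section and folder in the URL, GCA versus GCF in the sed pattern) can be kept in one place. This is only a sketch; both function names are hypothetical, and the URL layout is simply the one used in the curl command above.

```shell
# Hypothetical helper: print the assembly_summary.txt URL for a section
# ("genbank" or "refseq") and a folder such as "bacteria" or "fungi".
summary_url() {
    printf 'ftp://ftp.ncbi.nlm.nih.gov/genomes/%s/%s/assembly_summary.txt\n' "$1" "$2"
}

# Hypothetical helper: accession prefix to use in the sed pattern.
accession_prefix() {
    case $1 in
        genbank) echo GCA ;;
        refseq)  echo GCF ;;
        *)       echo "unknown section: $1" >&2; return 1 ;;
    esac
}

summary_url refseq bacteria
# prints ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
accession_prefix refseq
# prints GCF
```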
If copying and pasting the lines above, or from the factsheet, gives you an error, I would paste them into a text editor with syntax highlighting (Emacs, etc.) and re-type any weirdly colored characters, plus all quotes and hyphens on principle (PDFs and web pages often turn them into typographic "curly" quotes and long dashes, which the shell does not treat as quotes or option dashes).
Some shells (csh/tcsh in particular) require an escape character ("\") in front of the ! used within the awk command, even inside single quotes. If you still have errors, as a second round of treatment, try removing the escape \ in front of the !.
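Re-typing by hand works, but the same cleanup can be sketched as a small filter. This is an illustration only: the function name is made up, and it covers just the typographic double quotes, single quotes and Unicode hyphen (U+2010) that showed up in this recipe, not every character a PDF might substitute.

```shell
# Hypothetical helper: replace typographic quotes and the Unicode hyphen
# with their plain ASCII equivalents. Reads stdin, writes stdout.
ascii_quotes() {
    sed -e 's/“/"/g' -e 's/”/"/g' \
        -e "s/‘/'/g" -e "s/’/'/g" \
        -e 's/‐/-/g'
}

printf '%s\n' 'awk {FS="\t”} | sed ‐r' | ascii_quotes
# prints awk {FS="\t"} | sed -r
```

It is still worth reading the cleaned command once before running it, since the filter cannot tell you whether a space or underscore was also altered.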
Finally, the downloaded folder is not yet a database for standalone BLAST. (I used blastn, part of the BLAST+ suite of command-line tools provided by NCBI.) You will have to use makeblastdb (included if you downloaded the suite to get blastn) to reformat and index the files for use by blastn. Furthermore (I had to write to NCBI about this too), a complete taxonomic download contains too many files for makeblastdb to handle. However, if you decompress them (gunzip *.fna.gz) and cat them into one file, it's fine!
cat *.fna > all_bacteria_fna_files.fna
makeblastdb -in all_bacteria_fna_files.fna -parse_seqids -dbtype nucl -title bacteria -out bacteria
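Since wget fetches gzip-compressed files, the decompress and cat steps can also be done in one pass. A minimal sketch, assuming the files end in _genomic.fna.gz as in the recipe; the function name is hypothetical. Letting find batch the file names also sidesteps any limit on how many names a single *.fna glob can pass to cat.

```shell
# Hypothetical helper: decompress every *_genomic.fna.gz found under a
# directory and stream the sequences into one FASTA file.
# Usage: concat_genomes <download_dir> <output_fasta>
concat_genomes() {
    src_dir=$1
    out_file=$2
    find "$src_dir" -type f -name '*_genomic.fna.gz' -exec gzip -dc {} + > "$out_file"
}

# Example (not run here):
#   concat_genomes . all_bacteria_fna_files.fna
#   makeblastdb -in all_bacteria_fna_files.fna -parse_seqids -dbtype nucl -title bacteria -out bacteria
```

The compressed originals are left in place, so this needs enough free disk space for both copies.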
Then make sure the BLASTDB environment variable includes the folder containing the new database.
export BLASTDB="$BLASTDB:$HOME/genomes/bacteria/genbank_2_3_2016"
Then you can run blastn on your new bacterial database. (Or, as you can see, this should work with any complete taxonomic download.) Good luck!!
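Setting the variable and running a search might look like the sketch below. It assumes, as the colon in the export line above suggests, that BLASTDB accepts a colon-separated list of folders. The function name, the query file my_query.fasta and the output file hits.tsv are made up for illustration.

```shell
# Hypothetical helper: add a folder to BLASTDB without leaving a stray
# leading colon when the variable was previously unset or empty.
add_blastdb_dir() {
    dir=$1
    if [ -n "${BLASTDB:-}" ]; then
        BLASTDB="$BLASTDB:$dir"
    else
        BLASTDB="$dir"
    fi
    export BLASTDB
}

# Example (not run here; needs BLAST+ installed and the database built):
#   add_blastdb_dir "$HOME/genomes/bacteria/genbank_2_3_2016"
#   blastn -db bacteria -query my_query.fasta -outfmt 6 -out hits.tsv
```

The -db value is the name given to makeblastdb with -out, and -outfmt 6 asks for tab-separated hits.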
Kim
Here are some of my extremely messy notes on the process. Feel free to ignore.
- bacteria
- Use awk/sed/curl recipe from NCBI (ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf) to get files by parsing the local genome/...assembly_summary.txt file for directories for species of interest
- get subdirectory "bacteria" from genbank (content of this directory, per the NCBI ftp genomes/genbank README, ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/README.txt: "2) genbank: content includes primary submissions of assembled genome sequence and associated annotation data, if any, as exchanged among members of the International Nucleotide Sequence Database Collaboration, of which NCBI's GenBank database is a member. The GenBank directory area includes genome sequence data for a larger number of organisms than the RefSeq directory area; however, some assemblies are unannotated. The sub-directory structure includes: a. archaea; b. bacteria; c. fungi; d. invertebrate; e. other - this directory includes synthetic genomes; f. plant; g. protozoa; h. vertebrate_mammalian; i. vertebrate_other")
- http://www.linuxtopia.org/online_books/linux_tool_guides/the_sed_faq/sedfaq5_010.html
- 5.11. My script aborts with an error message, "event not found".
This error is generated by the csh or tcsh shells, not by sed. The exclamation mark (!) is special to csh/tcsh, and if you use it in command-line or shell scripts--even within single quotes--it must be preceded by a backslash. Thus, under the csh/tcsh shell:
sed '/regex/!d' # will fail
sed '/regex/\!d' # will succeed
The exclamation mark should not be prefixed with a backslash when the script is called from a file, as "-f script.file".
- put into emacs and re-typed anything that it colored as being a… strange character (some underscores were replacing spaces), as well as all single and double quotes
- final command: wget -i genomic_file
- FINISHED --2016-02-10 01:56:15--
- Downloaded: 58953 files, 62G in 18h 8m 34s (1002 KB/s)
After scanning the site, it appears to contain information about only a few bacteria and a handful of metagenome data sets. Am I missing something?
Eric, try for example this query to get strain names and scaffold IDs:
mysql -h pub.microbesonline.org -u guest -pguest genomics -B -e ' source scaf.sql' > scaf.out
where "scaf.sql" contains:
SELECT Taxonomy.name, Scaffold.scaffoldId FROM ScaffoldSeq INNER JOIN Scaffold ON Scaffold.scaffoldId=ScaffoldSeq.scaffoldId INNER JOIN Taxonomy ON Taxonomy.taxonomyId=Scaffold.taxonomyId;
To get the scaffold sequence as well, add ScaffoldSeq.sequence to the SELECT list on the first line. Also try exploring this page: http://meta.microbesonline.org/programmers.html#Taxonomy
All I get in scaf.out is the mysql help, so it looks like there is a mistake somewhere. At this point, I am not sure that this resource will help me.
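The mysql client tends to print its help text when it is handed more arguments than it expects, which can happen if the quotes around the -e argument were altered by copy and paste. That is only a guess at the cause here. One way to take the quoting out of the picture is to feed the query file on standard input, as in this sketch (the helper name is hypothetical; host, user, password and database are the ones from the reply above):

```shell
# Hypothetical helper: write the query from the reply above to a file.
write_scaf_sql() {
    cat > "$1" <<'SQL'
SELECT Taxonomy.name, Scaffold.scaffoldId
FROM ScaffoldSeq
INNER JOIN Scaffold ON Scaffold.scaffoldId = ScaffoldSeq.scaffoldId
INNER JOIN Taxonomy ON Taxonomy.taxonomyId = Scaffold.taxonomyId;
SQL
}

write_scaf_sql scaf.sql

# Not run here because it needs network access to the public server:
#   mysql -h pub.microbesonline.org -u guest -pguest -B genomics < scaf.sql > scaf.out
```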