Hi,
I want to download all available genomes of multiple bacterial and archaeal genera.
Downloading the GenBank files for a single species is relatively easy (if you already know the exact folder name on the FTP server),
e.g.:
wget -A .gbk ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C5_uid58741/
to get all GenBank files associated with Methanococcus maripaludis C5.
However, what is driving me crazy is trying to go through the genomes subfolders recursively using wildcards. For example, if I want to get ALL GenBank files of ALL Methanococcus species (or if I did not want to find out the exact folder names by hand), something like:
wget -A .gbk ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus*
This always gives me error messages, but I KNOW it's possible in principle. I found these instructions on GitHub for exactly the task I want, but they do not seem to work (perhaps the wget syntax has changed?):
wget -cNrv -t 45 -A *.gbk,*.fna "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Genus_species*" -P .
Can anybody please tell me what I'm doing wrong?
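For readers unfamiliar with the short flags in the GitHub command, here is a minimal sketch that spells them out with wget's long options. The function name print_wget_cmd is made up for this example, and it only prints the command rather than running it. The accept list and the URL are single-quoted in the printed command so that the local shell does not try to expand the asterisks itself (zsh, for instance, aborts with "no matches found" on an unmatched, unquoted pattern).

```shell
# Short flag -> long option used below:
#   -c  --continue            resume partially downloaded files
#   -N  --timestamping        skip files that are not newer than the local copy
#   -r  --recursive           descend into subdirectories
#   -v  --verbose             verbose output (wget's default)
#   -t  --tries=45            retry each file up to 45 times
#   -A  --accept=LIST         keep only files matching these suffixes/patterns
#   -P  --directory-prefix=.  save everything below the current directory
# print_wget_cmd is a hypothetical helper: it prints the command, nothing more.
print_wget_cmd() {
    pattern="$1"   # e.g. 'Methanococcus*' (keep it quoted when calling)
    printf '%s\n' "wget --continue --timestamping --recursive --verbose --tries=45 --accept='*.gbk,*.fna' --directory-prefix=. 'ftp://ftp.ncbi.nih.gov/genomes/Bacteria/${pattern}'"
}

# Usage: print_wget_cmd 'Methanococcus*'
```

Printing first makes it easy to check the quoting before anything is downloaded; the printed line can then be pasted into the terminal.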
Could you maybe try this tiny mod:
Sadly, that did not work either. I get a "404: Not Found" error, as with all my attempts to use wildcards in the directory names.
Does this command work for you (e.g. using "Methanococcus" as the genus name)? Could it be that it is simply a problem with my network/proxy settings?
This exact command works for me, downloading multiple Methanococcus directories' FNA and GBK files:
I Ctrl-C'd a few seconds into it. An ls -R reveals:
The prefix part is kinda ignored though, I think. I see the command creating a directory named "ftp.ncbi.nih.gov" and then subdirectories as on the server.
Using that exact command I get this error:
So it seems there is an issue with getting the downloads through the proxy server.
But that seems strange to me, because the download works perfectly if I use no wildcards (as in my first example).
I'll try downloading from my home computer later and then transferring the files to my workstation via USB stick.
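One possible explanation for the 404, offered as a hedged note rather than a confirmed diagnosis: wget expands wildcards itself by reading the FTP directory listing, and when an FTP URL is sent through an HTTP proxy the request travels as plain HTTP, so the pattern would reach the proxy literally. A "404: Not Found" is an HTTP status, which fits that picture. The sketch below only inspects the ftp_proxy environment variable that wget reads. The function name ftp_proxy_status is made up for this example, and a proxy configured in a wgetrc file would not be detected.

```shell
# Report whether the environment asks wget to send FTP URLs through a proxy.
# Only the ftp_proxy environment variable is checked here; a proxy set in
# /etc/wgetrc or ~/.wgetrc is NOT detected by this sketch.
ftp_proxy_status() {
    if [ -n "${ftp_proxy:-}" ]; then
        echo "proxy: ${ftp_proxy}"
    else
        echo "direct"
    fi
}

# Usage: ftp_proxy_status
```

If it reports a proxy, retrying the wildcard command with wget's --no-proxy option is a quick test, though that will only succeed where the network permits direct FTP connections.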
EDIT: I also get the warning:
Try using --accept-regex instead of -A. (Just a random suggestion.) Also, what is your OS and your $SHELL?
I'm working on servers running Red Hat Linux. The default shell (and the one I'm using) there is Bash.
Switching to --accept-regex did not help, by the way.
EDIT: The same problem persists on a workstation with Ubuntu as OS and zsh as default shell (connected via the same proxy as the servers).
But still hoping that it'll work when I try from home later...
Well, I guess the sysadmin might have some kind of restriction on the number of files you can download with a wildcard. To test that, could you maybe try with the pattern Methanococcus_aeolicus_Nankai* instead of Methanococcus*?
Thanks, I'll try that when I get back to the office. But for now I can say that, from my home computer, the downloads with wget work perfectly. So it's definitely the proxy settings at my institute.
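Since wildcard-free downloads reportedly do get through the proxy, one workaround from inside the institute might be to loop over explicit folder names and issue one plain request per folder. This is only a sketch under that assumption: fetch_folders and the DRY_RUN switch are hypothetical names, the folder list has to be compiled by hand or from a directory listing, and the wget call copies the single-folder form from the first post.

```shell
# Base URL taken from the first post of the thread.
BASE_URL="ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria"

# Read one folder name per line from stdin and issue one wildcard-free wget
# call per folder. With DRY_RUN=1 (the default in this sketch) the commands
# are only printed; set DRY_RUN=0 to actually run them.
fetch_folders() {
    while IFS= read -r folder; do
        [ -n "$folder" ] || continue   # skip blank lines
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "wget -A .gbk ${BASE_URL}/${folder}/"
        else
            wget -A .gbk "${BASE_URL}/${folder}/"
        fi
    done
}

# Usage (dry run):
#   printf '%s\n' Methanococcus_maripaludis_C5_uid58741 | fetch_folders
```

The dry-run default is a deliberate choice so that the generated commands can be inspected before any traffic goes through the proxy.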