Downloading multiple species from ftp.ncbi.nih.gov using wget and wildcards
10.1 years ago
JV ▴ 470

Hi,

I want to download all available genomes of multiple bacterial and archaeal genera.

Downloading the genbanks for a single species is relatively easy (if you already know the exact folder name on the FTP server).

e.g.:

wget -A .gbk ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C5_uid58741/

to get all genbank files associated with Methanococcus maripaludis C5.

However, what is driving me crazy is trying to go through the genomes subfolders recursively using wildcards. For example, if I want to get ALL genbanks of ALL Methanococcus species (or if I do not want to find out the exact folder names by hand), something like:

wget -A .gbk ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus*

This always gives me error messages, but I KNOW it's possible in principle. I found these instructions on GitHub for exactly the task I want, but they do not seem to work (perhaps the wget syntax has changed?):

wget -cNrv -t 45 -A *.gbk,*.fna "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Genus_species*" -P .

Can anybody please tell me what I'm doing wrong?

ftp wget sequence NCBI genbank • 22k views

Could you maybe try this tiny mod:

wget -cNrv -t 45 -A "*.gbk,*.fna" "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Genus_species*" -P .

Sadly, that did not work either. I get a "404: Not Found" error, as with all my attempts to use wildcards in the directory names.

Does this command work for you (e.g. using "Methanococcus" as the genus name)? Could it simply be a problem with my network/proxy settings?


This exact command works for me, downloading multiple Methanococcus directories' FNA and GBK files:

wget -cNrv -t 45 -A "*.gbk,*.fna" "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus*" -P .

I Ctrl-C'd a few seconds into it. An ls -R reveals:

$ ls -R
ftp.ncbi.nih.gov

./ftp.ncbi.nih.gov:
genomes

./ftp.ncbi.nih.gov/genomes:
Bacteria

./ftp.ncbi.nih.gov/genomes/Bacteria:
Methanococcus_aeolicus_Nankai_3_uid58823 Methanococcus_maripaludis_C5_uid58741    Methanococcus_maripaludis_C6_uid58947    Methanococcus_maripaludis_C7_uid58847

./ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus_aeolicus_Nankai_3_uid58823:
NC_009635.fna NC_009635.gbk

./ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C5_uid58741:
NC_009135.fna NC_009135.gbk NC_009136.fna NC_009136.gbk

./ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C6_uid58947:
NC_009975.fna NC_009975.gbk

./ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C7_uid58847:
NC_009637.fna

The prefix part is kinda ignored though, I think. I see the command creating a directory named "ftp.ncbi.nih.gov" and then subdirectories as on the server.


Using that exact command I get this error:

Proxy request sent, awaiting response... 404 Not Found
2014-10-24 17:07:02 ERROR 404: Not Found.

So it seems there is an issue with getting the downloads through the proxy server.

But that seems strange to me, because the download works perfectly if I use no wildcards (as in my first example).

I'll try downloading from my home computer later and then transferring the files to my workstation via USB stick.

EDIT:

I also get this warning:

Warning: wildcards not supported in HTTP
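
A possible reading of this warning, offered as a sketch rather than a diagnosis: wget expands wildcards only when it talks FTP to the server directly. If an FTP proxy is configured, wget sends the request to the proxy over HTTP (hence "Proxy request sent" in the log), and the wildcard is passed along literally, which would explain the 404. The snippet below only checks the `ftp_proxy` environment variable; a proxy can also be set in a wgetrc file, which this check does not cover. The function name `ftp_proxy_status` is made up for illustration.

```shell
# Report whether the ftp_proxy environment variable is set.
# (Hypothetical helper; it does not inspect ~/.wgetrc or /etc/wgetrc.)
ftp_proxy_status() {
    if [ -n "${ftp_proxy:-}" ]; then
        echo "ftp_proxy is set: $ftp_proxy"
    else
        echo "ftp_proxy is not set"
    fi
}

ftp_proxy_status

# If direct FTP is permitted on the network, the proxy can be skipped
# for a single call with wget's --no-proxy option, e.g.:
#   wget --no-proxy -cNrv -t 45 -A "*.gbk,*.fna" \
#       "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus*" -P .
```

Behind a firewall that blocks outgoing FTP, the `--no-proxy` call will most likely just time out, in which case a per-directory approach without wildcards is the remaining option.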

Try using --accept-regex instead of -A. (Just a random suggestion)

Also, what is your OS and your $SHELL?


I am working on servers running Red Hat Linux. The default shell there (and the one I'm using) is Bash.

Switching to --accept-regex did not help, by the way.

EDIT: The same problem persists on a workstation with Ubuntu as OS and zsh as default shell (connected via the same proxy as the servers).

But I'm still hoping that it'll work when I try from home later...


Well, I guess the sysadmin might have some kind of restriction on the number of files you can download with a wildcard. To test that, could you maybe try with the pattern Methanococcus_aeolicus_Nankai* instead of Methanococcus*?


Thanks, I'll try that when I get back to the office. But for now I can say that from my home computer, the downloads with wget work perfectly. So it's definitely the proxy settings at my institute.

10.1 years ago
Carlos Borroto ★ 2.1k

How about using rsync instead?

$ rsync --dry-run -avP --include "*.gbk" --include "*.fna" --include "Methanococcus*/" --exclude "*" ftp.ncbi.nih.gov::genomes/Bacteria/ /tmp

Warning Notice!

You are accessing a U.S. Government information system which includes this
computer, network, and all attached devices. This system is for
Government-authorized use only. Unauthorized use of this system may result in
disciplinary action and civil and criminal penalties. System users have no
expectation of privacy regarding any communications or data processed by this
system. At any time, the government may monitor, record, or seize any
communication or data transiting or stored on this information system.

-------------------------------------------------------------------------------

Welcome to the NCBI rsync server.


receiving file list ...
27 files to consider
./
Methanococcus_aeolicus_Nankai_3_uid58823/
Methanococcus_aeolicus_Nankai_3_uid58823/NC_009635.fna
Methanococcus_aeolicus_Nankai_3_uid58823/NC_009635.gbk
Methanococcus_maripaludis_C5_uid58741/
Methanococcus_maripaludis_C5_uid58741/NC_009135.fna
Methanococcus_maripaludis_C5_uid58741/NC_009135.gbk
Methanococcus_maripaludis_C5_uid58741/NC_009136.fna
Methanococcus_maripaludis_C5_uid58741/NC_009136.gbk
Methanococcus_maripaludis_C6_uid58947/
Methanococcus_maripaludis_C6_uid58947/NC_009975.fna
Methanococcus_maripaludis_C6_uid58947/NC_009975.gbk
Methanococcus_maripaludis_C7_uid58847/
Methanococcus_maripaludis_C7_uid58847/NC_009637.fna
Methanococcus_maripaludis_C7_uid58847/NC_009637.gbk
Methanococcus_maripaludis_S2_uid58035/
Methanococcus_maripaludis_S2_uid58035/NC_005791.fna
Methanococcus_maripaludis_S2_uid58035/NC_005791.gbk
Methanococcus_maripaludis_X1_uid70729/
Methanococcus_maripaludis_X1_uid70729/NC_015847.fna
Methanococcus_maripaludis_X1_uid70729/NC_015847.gbk
Methanococcus_vannielii_SB_uid58767/
Methanococcus_vannielii_SB_uid58767/NC_009634.fna
Methanococcus_vannielii_SB_uid58767/NC_009634.gbk
Methanococcus_voltae_A3_uid49529/
Methanococcus_voltae_A3_uid49529/NC_014222.fna
Methanococcus_voltae_A3_uid49529/NC_014222.gbk

sent 248 bytes  received 1604 bytes  3704.00 bytes/sec
total size is 60678518  speedup is 32763.78

Thanks, but it seems the same proxy problems apply here. Doesn't work from my office computer. Will have to contact the sysadmin about this.


Is the NCBI rsync service still alive? I can't make it work..


It is working from here. Maybe a firewall or proxy issue on your side?


Can you post some exact command that works for you?


I used the exact command I posted in my answer. Can you post the error you are getting?


Ah, it actually works from home so I guess it's a firewall issue. Damn it. This would be so handy ;_;

10.1 years ago
JV ▴ 470

OK, I found a workaround and want to post it here, in case anybody has the same problem sometime...

First, what we determined to be a likely reason for this problem (from a discussion with some of my colleagues): the problem MAY be that NCBI itself restricts the number of parallel requests to its FTP server. The wildcards in my request get expanded into a large number of parallel requests. Possibly the internet connection at my institute is so fast that the parallel requests pile up quickly, while my home connection is slow, so the requests are effectively sent one after another instead of all at once. This explanation is not researched and does not come from an expert, but since the following workaround worked for me, it is good enough for me:


Step 1:

Get the most current directory structure (WITHOUT contents) of the NCBI server.

This can be done EITHER by running

wget -r --no-remove-listing --spider ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria

(WARNING: this will take a LONG time) OR (more comfortably and faster) by logging in to the NCBI FTP server with FileZilla, going to the "/genomes/Bacteria" subfolder, selecting all subfolders, right-clicking and choosing "Copy URL(s) to clipboard". Then paste the URLs into a text file.
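
For the wget route, the URL list can be derived from the `.listing` file that `--no-remove-listing` keeps for each FTP directory. The sketch below assumes the listing is in the usual Unix "ls -l" style and that the spider run left a `.listing` file for /genomes/Bacteria somewhere under the download directory; both depend on the server and the wget version. The function name `listing_to_urls` is invented for this example.

```shell
# Turn an FTP ".listing" file (Unix "ls -l" style, as kept by
# wget --no-remove-listing) into one URL per subdirectory.
#   $1 = path to the .listing file
#   $2 = base URL of that directory, ending in "/"
# Assumes directory names contain no spaces (true for the NCBI names
# shown in this thread); FTP listings often use CRLF line endings,
# so carriage returns are stripped first.
listing_to_urls() {
    tr -d '\r' < "$1" |
        awk -v base="$2" '/^d/ && $NF != "." && $NF != ".." { print base $NF "/" }'
}

# Usage (the path to .listing is an assumption about where wget put it):
#   listing_to_urls ftp.ncbi.nlm.nih.gov/genomes/Bacteria/.listing \
#       ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/ > urllist.txt
```

Only lines starting with "d" (directories) are kept, so plain files in the same folder do not end up in urllist.txt.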


Step 2:

Make a list of the genera you want to download the gbks for (one per line).


Step 3:

Iterate through your list of genera, "grep" each genus from your list of FTP subfolders, and pass the matching URLs as arguments to "wget".

Example:

while read -r genus; do grep "$genus" urllist.txt | while read -r url; do wget -cNrv -t 45 -A "*.gbk" "$url" -P .; done; done < genus.list

You can regularly repeat these steps in the same working directory to update your genbanks if new genomes from your genera of interest have been added to NCBI.
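
The loop in step 3 can also be wrapped in a small script with a dry-run switch, so the selected URLs can be checked before anything is downloaded. This is only a sketch of the same idea: the function names `select_urls` and `fetch_genbanks` and the `DRY_RUN` variable are made up here, and the file names genus.list and urllist.txt are the ones from the steps above.

```shell
# Print the URLs from the URL list ($2) that contain any genus name
# listed in the genus file ($1).
# -F treats each genus as a literal string; -f reads the patterns from a file.
# Note: the genus file must not contain blank lines, because an empty
# pattern matches every URL.
select_urls() {
    grep -F -f "$1" "$2"
}

# Download the GenBank files for every selected URL, one wget call at a time.
# With DRY_RUN=1 the wget commands are printed instead of executed.
fetch_genbanks() {
    select_urls "$1" "$2" | while read -r url; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo wget -cNrv -t 45 -A '*.gbk' "$url" -P .
        else
            wget -cNrv -t 45 -A '*.gbk' "$url" -P .
        fi
    done
}

# Usage:
#   DRY_RUN=1 fetch_genbanks genus.list urllist.txt   # preview only
#   fetch_genbanks genus.list urllist.txt             # actual download
```

Because each URL gets its own wget call without wildcards, this should behave the same as the one-liner in step 3.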


Looks great! A quick suggestion: Instead of the command:

wget -r --no-remove-listing --spider ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria

that recursively traverses the entire directory hierarchy, downloading all the folders in the process (-r and --spider still lead to files being downloaded), why not run this:

wget -l 2 --no-remove-listing --spider ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria

The above command does not download actual files and spiders to only 2 levels, so it is extremely fast.


Thanks, that's a good tip!

I also started using the "--ignore-directories" argument to exclude all the "wgs", "pubmed" etc. folders in the root directory of the FTP server in step 3.

This is because, even though I call wget with a very specific URL in each iteration, it still goes through ALL of the folders of the NCBI FTP server, downloads an "index.html" and immediately deletes it again. This also takes a lot of time.

Do you perhaps also have a tip on how to stop wget from doing this? (Or maybe it doesn't matter so much anymore if it only goes down two subfolders into each subdirectory tree after adding your suggestion.)


I think the index.html serving is a server-side configuration; I'm not sure we can do anything about it. Also, there is no --ignore-directories option. Do you mean --exclude-directories, perhaps?


Yes, --exclude-directories was what I meant (I got that a bit mixed up here).


By the way: for me, replacing -r with -l 2 did not work for the download step (step 3). I thought I could make things easier for myself there... If I do that, I get only a bunch of HTML documents linking to the pages on the FTP server that contain the .gbk files (not the files themselves, and not the directory structure either).


The -l 2 was for the first step which you mentioned took a lot of time. It is a spider step that is useful to fetch URLs. The actual data download step, which is step 3 in your workflow, will need the -r option. I apologize - I should've made my earlier statement clearer.


No no, your statement was clear (you meant it for the first step). I was just hoping I could apply it to the third step as well and was confused when it did not work.


Also, yes. The -r is required for the actual download. The -l option was just to limit the spider-ing level.

9.9 years ago
freedy96 • 0

Hey try LongPathTool to solve errors in wget.


Ummm, I hope you read the contents of the thread. OP has no problem with long file names. LongPathTool is built for Windows and that makes sense, because no UNIX based OS suffers from "file path too long" errors. I don't see how LongPathTool can help with wget on Linux-based systems.
