Hi everyone,
I am writing this post to help anyone who is struggling to download data in batch from Ensembl. Life was easy when Ensembl had ftp links: we could put a wildcard pattern such as *.gz at the end of the URL and download multiple FASTA files at once. However, since Chrome removed support for ftp, Ensembl is migrating all its URLs from ftp to http. This is both good and bad news. Good, because you can now view the directories in the browser, which previously gave errors with FTP links. Bad, because downloading multiple files with wget or curl over HTTP links is not straightforward. I spent an entire day figuring out whether I could use wget, curl or rsync to download multiple files from Ensembl. I finally found a solution, and I encourage others to extend this thread with more insights. Here is an example for a protist (Theileria equi strain WA):
wget -r --no-parent --no-check-certificate -nd -nc -np -e robots=off -A.gz http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-52/protists/fasta/protists_alveolata1_collection/theileria_equi_strain_wa_gca_000342415/cdna/
Explanation of the options (obtained from Source):
- -r signifies that wget should recursively download data in any subdirectories it finds.
- -nd copies all matching files to the current directory instead of recreating the remote directory tree. If two files have identical names, it appends a numeric extension.
- -nc does not download a file if it already exists.
- -np prevents files from parent directories from being downloaded. It is the short form of --no-parent, so the command above specifies it twice; one of the two is enough.
- -e robots=off tells wget to ignore the robots.txt file. If this option is left out, the robots.txt file tells wget that it does not like web crawlers and this will prevent wget from working.
- -A.gz restricts downloading to the specified file types (with .gz suffix in this case)
- --no-check-certificate disregards the SSL certificate check. This is useful if the SSL certificate is set up incorrectly, but make sure you only do this on servers you trust.
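If the short flags are hard to remember, the same call can be sketched with wget's long option names, wrapped in a small shell function. This is only a sketch under assumptions: fetch_gz is a hypothetical helper name (not part of wget or Ensembl), and it assumes you pass the directory URL with its trailing slash.

```shell
#!/bin/sh
# Sketch: the download command above, spelled with wget's long options.
# fetch_gz is a hypothetical helper name chosen for this example.
fetch_gz() {
    # $1: URL of the directory to download from (keep the trailing slash)
    wget --recursive \
         --no-parent \
         --no-directories \
         --no-clobber \
         --execute robots=off \
         --accept '.gz' \
         --no-check-certificate \
         "$1"
}
```

Calling fetch_gz with the cdna URL from the command above should behave like the short-flag version. Defining the function does not download anything by itself.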
Thanks for this post, big help. In my case, I wanted to select specific .gz files from an Ensembl directory. Thought I'd post as a comment to highlight the additional useful option --accept-regex.
With -A ".gz", all files ending with .gz in the .../dna/ directory are downloaded (many of which I don't need).
With --accept-regex, you can download only the files you want, which in my case was the primary assembly, without masking, for all chromosomes.
I included both options in my command because without -A, an index file I don't want is still downloaded.
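To illustrate how -A and --accept-regex can be combined, here is a rough sketch. It is not the exact command from the comment above. The pattern in ACCEPT_RE, the helper names preview_matches and fetch_primary_assembly, and the sample file names are all assumptions made for this example, so check the pattern against the real directory listing before relying on it. wget applies --accept-regex to the complete URL, which is why the pattern starts with .* here.

```shell
#!/bin/sh
# Assumed pattern: unmasked ("dna", not "dna_rm" or "dna_sm") primary-assembly
# files that are split per sequence. Adjust it to the actual file names.
ACCEPT_RE='.*\.dna\.primary_assembly\..*\.fa\.gz'

# Preview which names the pattern keeps; reads one file name per line on stdin.
preview_matches() {
    grep -E "$ACCEPT_RE"
}

# Hypothetical helper: $1 is the URL of the species' dna/ directory.
# -A '.gz' is kept alongside --accept-regex, as described in the comment above.
fetch_primary_assembly() {
    wget -r -np -nd -nc -e robots=off \
         -A '.gz' --accept-regex "$ACCEPT_RE" "$1"
}

# Offline check with made-up file names (no download happens here).
printf '%s\n' \
    'Species.Asm1.dna.primary_assembly.1.fa.gz' \
    'Species.Asm1.dna_rm.primary_assembly.1.fa.gz' \
    'Species.Asm1.dna_sm.primary_assembly.1.fa.gz' \
    'Species.Asm1.dna.toplevel.fa.gz' | preview_matches
```

Of the four made-up names, only the first one passes the filter. Previewing the pattern with grep like this before running wget is a cheap way to avoid downloading masked or toplevel files by mistake.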