I'd like to set up a local mirror of certain large databases, such as the BLAST nt database, InterPro, etc.
The biomirror project looks like a good candidate, but they seem to advocate using GridFTP, and have even deprecated rsync. I would have thought a simpler solution would be something hacked together with cron and rsync, or am I missing something?
So, my question is:
What solutions have you used for mirroring large biological databases, and what mistakes should I avoid making?
I just use a simple shell script in the system's cron.daily folder that calls the "mirror" command of "lftp". Here is one which mirrors just the virus, plasmid and bacteria genomes into my local folder called "/bio/db/ncbigenomes/". You will have to adjust the source and destination paths, and the HOST variable to point to your nearest biomirror.
#!/bin/sh
#
# Install as a daily cron job: sudo vi /etc/cron.daily/biomirror
#
HOST=biomirror.aarnet.edu.au
# Mirror each collection; --delete removes local files that no longer exist upstream.
for G in Viruses Plasmids Bacteria Bacteria_DRAFT; do
    lftp -c "open ftp://$HOST/ ; mirror --delete \
        /biomirror/ncbigenomes/$G /bio/db/ncbigenomes/$G"
done
We keep a local BLAST nt database in our lab that holds both proprietary sequences and a mirror of the public database.
I wrote a script that updates the DB on a monthly basis; it is launched by cron. The script (written in PHP. Yes, I know ...) uses the E-utils functions from NCBI to query NCBI, parse the XML responses and retrieve the sequences of interest. It was quite fun to write, and I now have something that works and does exactly what I want.
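For anyone who wants to try the E-utils route without PHP, the same idea can be roughed out in plain sh with curl. This is only a sketch and not the script described above: the function names (extract_ids, fetch_fasta), the example query and the output path are made up for illustration, and it assumes curl, sed and paste are installed. It calls ESearch to get matching IDs, pulls them out of the XML, then asks EFetch for the FASTA records.

```shell
#!/bin/sh
# Sketch only: fetch FASTA records matching an Entrez query via NCBI E-utilities.
EUTILS=https://eutils.ncbi.nlm.nih.gov/entrez/eutils

# Print the numeric IDs found in ESearch XML read from stdin, one per line.
extract_ids() {
    sed -n 's:.*<Id>\([0-9][0-9]*\)</Id>.*:\1:p'
}

# fetch_fasta TERM [RETMAX]: search the nucleotide database, then print FASTA.
# (Hypothetical helper; RETMAX defaults to 20 records.)
fetch_fasta() {
    term=$1
    retmax=${2:-20}
    # -G sends the -d/--data-urlencode pairs as a URL-encoded query string.
    ids=$(curl -sfG "$EUTILS/esearch.fcgi" \
              -d db=nucleotide -d retmax="$retmax" \
              --data-urlencode "term=$term" | extract_ids | paste -sd, -)
    # Stop if the search returned nothing or the request failed.
    [ -n "$ids" ] || return 1
    curl -sfG "$EUTILS/efetch.fcgi" \
         -d db=nucleotide -d rettype=fasta -d retmode=text -d id="$ids"
}

# Hypothetical usage; the query and output path are placeholders:
# fetch_fasta 'txid10239[Organism] AND refseq[filter]' 50 > /bio/db/custom/viruses.fasta
```

For more than a few hundred records you would probably want to page through the results and stay within NCBI's request rate limits, which this sketch does not attempt.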