I am building a large database (~3 terabytes) for my analysis tool.
The issue is distributing this database so that people can use it.
One solution, learned from working with sequencing cores, is to ship a hard drive by FedEx: a potentially expensive and annoying proposition.
Another solution is to host the files for download, but this seems potentially expensive, or impractical for the audience. (On this point of practicality -- most universities have gigabit-level bandwidth, so perhaps 3 terabytes is no longer a ludicrous download size, simply a big one.) At 1 gigabit per second, I estimate the download would take about 400 minutes.
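For what it's worth, here is the arithmetic behind that estimate, assuming the link actually sustains its full nominal rate with no protocol overhead:

```python
# Back-of-the-envelope transfer time for a 3 TB download at several link speeds,
# assuming the link sustains its nominal rate with no protocol overhead.
DB_SIZE_TB = 3
DB_SIZE_GBITS = DB_SIZE_TB * 1000 * 8  # 3 TB ~= 24,000 gigabits

for link_gbps in (0.1, 1.0, 10.0):
    seconds = DB_SIZE_GBITS / link_gbps
    print(f"{link_gbps:4.1f} Gbit/s: {seconds / 60:7.0f} min ({seconds / 3600:5.1f} h)")
```

At 1 Gbit/s that works out to 24,000 seconds, i.e. the 400 minutes (about 6.7 hours) quoted above; a real-world transfer will be slower.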
Since this is a niche problem, I am hoping someone has had a similar issue and can offer advice.
Thank you,
Jeremy Cox
Is there any way the database can be split up? Without knowing what your tool is for I can only guess, but a given researcher may only be interested in a specific species (e.g. human) or a large group of species (e.g. bacteria). That would reduce the size of the file that needs to be downloaded, and it may make your tool more accessible and approachable. I know I wouldn't want to download a 3 TB database just to use an exceedingly small portion of it.
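To make that concrete, here's a minimal sketch of a one-pass split into per-group archives with checksums; the group names, record schema (a `species` field), and file layout are all hypothetical, since I don't know how your database is actually structured:

```python
import gzip
import hashlib
import json

# Hypothetical grouping of species into separately downloadable subsets.
GROUPS = {
    "human": {"Homo sapiens"},
    "bacteria": {"Escherichia coli", "Bacillus subtilis"},
}

def sha256_of(path, chunk_size=1 << 20):
    """Stream a checksum so users can verify a multi-gigabyte download."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def split_database(records, out_prefix="db"):
    """records: an iterable of dicts, each assumed to carry a 'species' field."""
    writers = {g: gzip.open(f"{out_prefix}.{g}.jsonl.gz", "wt") for g in GROUPS}
    for rec in records:  # a single pass over the full database
        for group, members in GROUPS.items():
            if rec["species"] in members:
                writers[group].write(json.dumps(rec) + "\n")
    for group, fh in writers.items():
        fh.close()
        # Publish a checksum next to each chunk so partial downloads are detectable.
        path = f"{out_prefix}.{group}.jsonl.gz"
        with open(path + ".sha256", "w") as cf:
            cf.write(f"{sha256_of(path)}  {path}\n")
```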
As far as the download being too large for slower connections to handle, this is what weekends are for: set up the download on Friday evening and let it run until Monday. Put your database on an FTP server, and if someone wants to download it they'll just have to live with whatever connection speed they get. Worst case, they can mail you a drive with return postage.
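If you go the FTP route, a resume-capable client matters for multi-day transfers. Here's a minimal sketch using Python's ftplib; the host, path, and filename are placeholders:

```python
import os
from ftplib import FTP

def fetch_resumable(host, remote_path, local_path):
    """Restart an interrupted FTP download from where it left off."""
    # If a partial file exists locally, ask the server to resume at its size.
    offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    with FTP(host) as ftp, open(local_path, "ab") as out:
        ftp.login()  # anonymous login; adjust if the server needs credentials
        ftp.retrbinary(f"RETR {remote_path}", out.write, rest=offset or None)

# Placeholder host and paths -- substitute your own.
fetch_resumable("ftp.example.org", "/pub/db/db.human.jsonl.gz", "db.human.jsonl.gz")
```

Standard tools such as `wget -c` or `rsync --partial` give you the same resume behavior out of the box.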
Good question. I could offer smaller versions of the database, but I fear some people will want "the whole enchilada", as it were. Previewing the database through the web is also an excellent suggestion, but for high-volume computation people will want to download the database and use their own machines, so ultimately the problem doesn't go away. Still, if we can cut down the number of people who need to ship a disk, that would be a big success. Good thinking.