Hi all,
Does anyone have recommendations/experience on using Amazon S3 or Glacier for NGS data storage? We're currently exploring migrating from local (lab-owned) servers to AWS, but I'm just not that comfortable with the logistics and risk/reward trade-off.
A bit of background info: We operate a sequencing core (low-ish throughput, say 250GB a month) and are currently managing everything on a few locally-hosted machines. To start, I'd probably want to migrate all of the sequencing data (e.g. BCLs and FASTQs, currently about 10TB) to S3, and also set up some mechanism for automatically transferring new sequencing runs to S3. We have enough local storage that we could do this every month or so, perhaps. We also have enough local and university compute infrastructure that I don't envision needing EC2 any time soon.
So then, how should I proceed? I'm thinking of using S3 with lifecycle rules to automatically move the data to Glacier storage, as it will only be retrieved in rare cases. This seems to have the advantage of the S3 API with an rsync-like syntax. I'm not familiar enough with Glacier's API to know exactly how to transfer Illumina run folders directly.
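As a rough illustration of the lifecycle idea above, here is a minimal sketch in Python that builds a rule moving everything under a prefix to Glacier after a set number of days. The `runs/` prefix, the 30-day delay, the rule ID, and the bucket name in the comment are all assumptions made up for the example, not recommendations. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts, but the sketch only prints it so that it runs without AWS credentials.

```python
import json


def glacier_lifecycle(prefix, days):
    """Build a lifecycle configuration that transitions objects under
    `prefix` to the GLACIER storage class `days` days after creation."""
    return {
        "Rules": [
            {
                "ID": "archive-to-glacier",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


# Assumed values: a "runs/" prefix and a 30-day delay before archiving.
config = glacier_lifecycle("runs/", 30)
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, the rule could be applied
# to a (hypothetical) bucket like this:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-sequencing-archive", LifecycleConfiguration=config)
```

With a rule like this in place, uploads go through the normal S3 API and the move to Glacier happens on the AWS side, so the Glacier API itself would not need to be used directly.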
On a more detailed note, how would you guys recommend structuring the data on S3? Would you create a separate bucket for each sequencing instrument? Or for each run?
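One alternative to separate buckets is a single bucket with key prefixes per instrument and per run. The sketch below shows one possible layout, purely as an illustration. It assumes run folders are named `<date>_<instrument>_<run number>_<flowcell>`; the `runs` root prefix, the `s3_key` helper, and the example folder name are hypothetical.

```python
from pathlib import PurePosixPath


def s3_key(run_folder, relative_path, root="runs"):
    """Return an object key of the form root/instrument/run_folder/relative_path."""
    # Assumed naming convention: <date>_<instrument>_<run number>_<flowcell>
    parts = run_folder.split("_")
    instrument = parts[1] if len(parts) >= 4 else "unknown"
    return str(PurePosixPath(root, instrument, run_folder, relative_path))


# Hypothetical run folder name, used only to show the resulting key.
print(s3_key("160115_M01234_0042_000000000-ABCDE", "RunInfo.xml"))
```

Because lifecycle rules can be filtered by prefix, a layout like this would let a single rule cover every run while keeping instruments and runs easy to list separately.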
Sorry for the length of the post - any advice would be greatly appreciated.
Since this is archival storage, you may also want to look at Nearline storage at Google. Google Cloud is HIPAA compliant and they will sign a business associate agreement (check for local policy restrictions before you plan to use any cloud provider). It is also possible to use NetBackup to do backups directly to Google storage.
Note that AWS has lots of resources for making your storage HIPAA compliant, and they also have a special zone for US government institutions, or those operating under similar constraints around security/privacy.
You should look into Backblaze: https://www.backblaze.com/b2/cloud-storage-providers.html
I like Backblaze (I use them for personal backup), but I wonder if it's better to future-proof and stick with Amazon in case we migrate to EC2 in a few years' time. Also, has anyone actually done a migration? I'm really curious about the inevitable oops and "oh I didn't know that"s that will come up.