Hi all,
I’m trying to compile info on how big labs (e.g. 40+ people) or small organisations that produce large amounts of NGS data manage data storage. As a small organisation, we produce 10-15 TB of HTS data every year from large genomics and metagenomics projects. We currently depend on an external HPC cluster provider to store our TBs of data alongside the analyses we run there, with a few more TB on a local server. We have both "active" and "cold" (no longer actively used) datasets from our research projects, and we aim to retain these datasets for at least 7 years.
Since many of us are dealing with large volumes of NGS data from projects that are TBs in size, I was wondering:
- how/where do you store your "active" and "cold" HTS data? Do you have an in-house server with TBs of storage, online cloud storage (e.g. Amazon), or other options?
- is it cost-effective to use cloud-based storage (e.g. Amazon, ASIA) for "cold" data and to build a small local server for working with "active" data? Any idea of the cost involved in building a small cluster/server with TBs of storage? (A rough cost sketch follows below.)
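Here is that sketch. Every number in it (the per-GB archive and standard prices, the hardware price, the server lifetime) is a placeholder assumption rather than a real quote, so swap in current provider pricing and your own hardware costs:

```python
# Back-of-envelope comparison of keeping "cold" data in a cloud archive tier
# versus amortising a small local storage server. All figures are illustrative
# placeholders, not current list prices or real quotes.

GB_PER_TB = 1024  # object storage is typically billed per GB-month

# Assumed monthly prices per GB (check your provider's current pricing)
PRICE_ARCHIVE_PER_GB_MONTH = 0.001   # a deep-archive style tier
PRICE_STANDARD_PER_GB_MONTH = 0.023  # a standard object-storage tier

# Assumed one-off cost and lifetime for a small local storage server
LOCAL_SERVER_COST = 10_000
LOCAL_SERVER_LIFETIME_YEARS = 5

def cloud_cost(tb, years, price_per_gb_month):
    """Total cost of keeping `tb` terabytes in cloud storage for `years` years."""
    return tb * GB_PER_TB * price_per_gb_month * 12 * years

def local_cost(years):
    """Amortised hardware cost only; excludes power, admin time and off-site backup."""
    return LOCAL_SERVER_COST * years / LOCAL_SERVER_LIFETIME_YEARS

if __name__ == "__main__":
    tb, years = 15, 7  # roughly one year's output kept for the 7-year retention period
    print(f"Archive tier:  {tb} TB x {years} y = ${cloud_cost(tb, years, PRICE_ARCHIVE_PER_GB_MONTH):,.0f}")
    print(f"Standard tier: {tb} TB x {years} y = ${cloud_cost(tb, years, PRICE_STANDARD_PER_GB_MONTH):,.0f}")
    print(f"Local server (amortised over {years} y) = ${local_cost(years):,.0f}")
```

The point is less the exact figures than the shape of the comparison: archive tiers are billed per GB-month for the whole retention period (plus retrieval/egress fees when you pull data back), whereas local hardware is a one-off cost plus power, admin time and a backup copy somewhere else.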
I have talked to a few of my colleagues and it sounds like everybody has some sort of setup but is looking for better options. Since many of us are struggling to store HTS data with a proper backup facility, I was wondering if there are any cost-effective solutions. Many organisations are investing quite a lot in eResearch (i.e. data science), and many of us already know that genomics data storage is a big issue for researchers across organisations and needs more attention.
We could share our ideas here and see whether we can follow each other's approaches.
I am also interested in this subject, but from a different angle: how are you keeping track of the raw data + metadata? A custom DB with the path to the cloud location? iRODS? An ELN? We don't actually generate the data ourselves, but rather get it from CROs or public repositories. Cheers.
The answer to this likely depends on what expertise you have access to locally and what policies (security/institutional) you need to adhere to. Where are you storing your data? Locally or in the cloud?
Good points. We are using cloud storage (AWS), and we would have some support for setting this up, either internal expertise or contractors (I work for a private company). The basic policy is that the data is for internal use only. At the moment we only have a couple of sequencing experiments, a couple dozen FASTQs, so it's a good time to start thinking about this. We already have an ELN (Signals) and a LIMS (Benchling) in place which we could leverage, but I am not sure that is the way to go. The main goal is to have a way to search for a particular dataset and know where it is located, but in a way that can be integrated with other data sources in the future: basically anything other than an Excel file. Thank you for your input.
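To illustrate what I mean by "anything other than an Excel file": even a minimal catalogue like the sketch below would do. It assumes a simple SQLite table with made-up field names and example values; the same fields could just as well live as custom fields in the ELN/LIMS.

```python
# Minimal sketch of a searchable dataset catalogue (hypothetical schema and values).
import sqlite3

conn = sqlite3.connect("datasets.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS dataset (
        dataset_id   TEXT PRIMARY KEY,   -- internal accession, e.g. DS-0001
        source       TEXT,               -- CRO name, SRA/GEO accession, ...
        storage_uri  TEXT,               -- e.g. s3://my-bucket/project-x/run42/
        description  TEXT                -- free-text metadata used for searching
    )
""")

def register(dataset_id, source, storage_uri, description):
    """Add or update a catalogue record."""
    conn.execute(
        "INSERT OR REPLACE INTO dataset VALUES (?, ?, ?, ?)",
        (dataset_id, source, storage_uri, description),
    )
    conn.commit()

def search(term):
    """Return (dataset_id, storage_uri) for records whose source or description matches."""
    cur = conn.execute(
        "SELECT dataset_id, storage_uri FROM dataset "
        "WHERE description LIKE ? OR source LIKE ?",
        (f"%{term}%", f"%{term}%"),
    )
    return cur.fetchall()

# Example usage (all values are made up):
register("DS-0001", "GEO:GSE00000", "s3://my-bucket/rnaseq/GSE00000/",
         "RNA-seq, human liver, 12 samples, fastq.gz")
print(search("liver"))
```

The dataset ID would then be the single key referenced from the ELN/LIMS, analysis code and any future data sources.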
If you already have metadata about the samples in your ELN/LIMS, would it not be best to link the cloud storage location there? One big issue is likely access control. I assume you have it implemented on your end in the ELN/LIMS based on account privileges, but you would essentially have to replicate that on the cloud side as well, unless you have a pretty flat structure internally and don't have to worry about access controls.
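For example, if each project sits under its own prefix in the bucket, mirroring the ELN/LIMS permissions on the cloud side could be as simple as one read-only IAM policy per project group. A rough sketch of the AWS flavour of this; the bucket, prefix and group names are all made up:

```python
# Sketch: per-project S3 prefix + an IAM policy scoped to that prefix,
# attached to the IAM group that mirrors the internal project team.
# Bucket, prefix and group names below are hypothetical placeholders.
import json
import boto3

BUCKET = "my-ngs-bucket"
PROJECT_PREFIX = "project-x/"   # one prefix per project

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {   # allow listing only within the project's prefix
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
            "Condition": {"StringLike": {"s3:prefix": [f"{PROJECT_PREFIX}*"]}},
        },
        {   # allow reading objects only under that prefix
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/{PROJECT_PREFIX}*",
        },
    ],
}

iam = boto3.client("iam")
resp = iam.create_policy(
    PolicyName="project-x-read-only",
    PolicyDocument=json.dumps(policy_document),
)
# Attach to whichever IAM group corresponds to the project team internally.
iam.attach_group_policy(
    GroupName="project-x-analysts",
    PolicyArn=resp["Policy"]["Arn"],
)
```

It still means maintaining group membership in two places (ELN/LIMS and IAM), but the mapping stays simple: one project, one prefix, one policy.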
That was pretty much my low-tech solution, but I was just wondering if there was a better way of going about it. (Meta)data retrieved from SRA/GEO can also go into the LIMS. Thank you.