Question

Forum: NGS data storage solutions for small organisations or big labs

4

5.5 years ago

bioinfo ▴ 840

Hi all,

I’m trying to compile information on how big labs (e.g. 40+ people) or small organisations that produce large amounts of NGS data manage data storage. As a small organisation, we produce 10-15 TB of HTS data every year from large genomics and metagenomics projects. We often depend on an external HPC cluster service provider to store our data alongside the analyses we run there, with a few more TB on a local server. We have both 'active' and ‘cold’ (no longer actively used) datasets from our research projects, and we try to maintain these datasets for at least 7 years.

Since many of us are dealing with projects that each generate TBs of NGS data, I was wondering:

How/where do you store your "active" and "cold" HTS data? Do you have an in-house server with TBs of storage, online cloud storage (e.g. Amazon), or other options?
Is it cost-effective to use cloud-based servers (e.g. Amazon, ASIA) for "cold" data storage and build a small local server for working with "active" data? Any ideas on the cost involved in building a small cluster/server with TBs of storage?

I have talked to a few of my colleagues and it sounds like everybody is doing it somehow but looking for better options. Since many of us are struggling with HTS data storage and backup, I was wondering if there are any cost-effective solutions. Many organisations are investing quite a lot in eResearch (i.e. data science), and many of us already know that genomics data storage is a big issue for researchers across organisations and needs more attention.

We might share our ideas and see if we can follow others’ approaches.
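To put rough numbers on the retention requirement described above, here is a minimal sketch. It assumes a constant yearly volume, no compression, and that datasets are deleted once they pass the retention window; the function name and that deletion policy are illustrative assumptions, not something stated in the question.

```python
def retained_tb(tb_per_year: float, retention_years: int, years_elapsed: int) -> float:
    """TB held after `years_elapsed` years when each year's data is kept
    for `retention_years` years and then deleted (assumed policy)."""
    years_held = min(years_elapsed, retention_years)
    return tb_per_year * years_held


# Range quoted in the question: 10-15 TB/year, kept for at least 7 years.
low = retained_tb(10, retention_years=7, years_elapsed=7)
high = retained_tb(15, retention_years=7, years_elapsed=7)
print(f"Steady-state footprint: {low:.0f}-{high:.0f} TB per copy")
```

Any backup or mirror multiplies this figure by the number of copies kept.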

big-data HTS storage NGS metagenomics • 5.3k views

ADD COMMENT • link updated 19 months ago by Ram 44k • written 5.5 years ago by bioinfo ▴ 840

0

I am also interested in this subject, but on a different angle: how are you keeping track of the raw data + metadata? Custom DB with path to cloud location? iRODS? ELN? We don't actually generate the data, but rather get it from CROs or public repositories. Cheers.

ADD REPLY • link 3.8 years ago by A. Domingues ★ 2.7k

1

The answer likely depends on what expertise you have access to locally and what policies (security/institutional) you need to adhere to. Where are you storing your data? Locally or in the cloud?

ADD REPLY • link 3.8 years ago by GenoMax 148k

0

Good points. We are using cloud storage (AWS), and we would have some support for setting this up, either internal expertise or contractors (I work for a private company). The basic policy is that the data is for internal usage only. At the moment we only have a couple of sequencing experiments, a couple dozen fastqs, so it's a good time to start thinking about this. We already have in place an ELN (Signals) and a LIMS (Benchling) which we could leverage, but I am not sure this is the way to go. The main goal is to have a way to search for a particular dataset and know where it is located, but in a way that can be integrated with other data sources in the future - basically anything other than an Excel file. Thank you for your input.

ADD REPLY • link 3.8 years ago by A. Domingues ★ 2.7k

1

If you already have metadata about the samples in your ELN/LIMS, would it not be best to link the cloud storage location there? One big issue is likely to be access control. I assume you have it implemented on your end in the ELN/LIMS based on account privileges, but you would essentially have to replicate that on the cloud end as well, unless you have a pretty flat structure internally and don't have to worry about access controls.

ADD REPLY • link 3.8 years ago by GenoMax 148k

0

That was pretty much my low-tech solution, but I was just wondering if there was a better way of going about it. (Meta)data retrieved from SRA/GEO can also go into the LIMS. Thank you.

ADD REPLY • link 3.8 years ago by A. Domingues ★ 2.7k
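For anyone who wants something searchable before wiring this into an ELN/LIMS, a dataset-to-location catalogue can be sketched with SQLite from the Python standard library. Everything here is a hypothetical illustration: the table layout, the column names, the `EXP001` identifier and the `s3://example-bucket/...` URI are assumptions, and access control is not handled at all.

```python
import sqlite3


def make_catalogue(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a catalogue with one row per dataset."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS datasets (
               dataset_id TEXT PRIMARY KEY,  -- ID shared with the ELN/LIMS record
               source     TEXT NOT NULL,     -- e.g. CRO, SRA, GEO
               assay      TEXT,              -- e.g. RNA-seq
               uri        TEXT NOT NULL      -- where the raw files live
           )"""
    )
    return conn


def register(conn, dataset_id, source, assay, uri):
    """Insert a dataset, or update its row if the ID already exists."""
    conn.execute(
        "INSERT OR REPLACE INTO datasets VALUES (?, ?, ?, ?)",
        (dataset_id, source, assay, uri),
    )
    conn.commit()


def locate(conn, dataset_id):
    """Return the storage URI for a dataset, or None if it is unknown."""
    row = conn.execute(
        "SELECT uri FROM datasets WHERE dataset_id = ?", (dataset_id,)
    ).fetchone()
    return row[0] if row else None


conn = make_catalogue()
register(conn, "EXP001", "CRO", "RNA-seq", "s3://example-bucket/raw/EXP001/")
print(locate(conn, "EXP001"))
```

Keeping the same `dataset_id` in the ELN/LIMS entry is what ties the two systems together; the catalogue itself only answers "where is it?".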

score 2 · Answer 1 · 2019-06-11

For us, active data is always on high-performance local cluster storage. We are a bigger organization/sequencing center and have access to plenty of storage (not infinite but adequate for ~6 months, hundreds of TB). We also use a large Quantum tape library solution that is presented as a storage partition. Data copied there automatically goes onto tapes. We keep them for 3 years.

You can consider cold storage on Google or AWS. While cold cloud storage is cheap to keep data in, you will incur a cost to retrieve the data, which can be expensive. You can also consider converting data to uBAM or CRAM (if a reference is available) to save on space in general.
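As a sketch of the CRAM route for aligned data: the helper below only builds the `samtools view` command line (`-C` for CRAM output, `-T` for the reference FASTA, `-o` for the output file) and prints it, so nothing is run until you choose to. The file names and the helper name are placeholders, and it assumes `samtools` is installed and the BAM was aligned against that same reference.

```python
import shlex


def cram_command(bam: str, reference_fasta: str, cram: str) -> list[str]:
    """Build a samtools command that rewrites an aligned BAM as CRAM."""
    return ["samtools", "view", "-C", "-T", reference_fasta, "-o", cram, bam]


cmd = cram_command("sample.bam", "ref.fa", "sample.cram")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Reference-based CRAM needs the same reference to decode, so the FASTA has to be archived alongside the CRAM files.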

If data is going to be published you would eventually want to submit it to SRA/ENA/DDBJ so you can store a copy there. There is a facility to embargo it until publication (or at least 1 yr I think) so you are covered.

score 2 · Answer 2 · 2019-06-17

I think this is an important, and unsolved, problem.

For "burning hot" data (up to 2 months lifespan), we use the centrally managed HPC scratch space, which has 600TB of space (Lustre, connected by InfiniBand directly to compute nodes), but is shared across the entire institution (about 1000 users).

For "hot" data, we use a high-performance, cluster-out storage cluster (NetApp) run and managed by the institution, on which we buy space at $300/TB/year for mirrored and backed-up storage. We currently have 20TB here, and I expand it whenever I have spare cash lying around.

For cold data we use cloud (where legally allowed). My institution has an agreement with the cloud provider where we have no limit on the amount of space we can use, but we have a daily up/down bandwidth limit of 1 TB per research group.
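A back-of-envelope sketch of what the figures quoted in this answer imply, using only the $300/TB/year price, the 20TB allocation and the 1 TB per day bandwidth cap. The function names are illustrative, and the transfer estimate assumes the daily cap is the only bottleneck.

```python
import math


def annual_cost_usd(tb: float, usd_per_tb_year: float = 300.0) -> float:
    """Yearly bill for `tb` of storage at a flat per-TB price."""
    return tb * usd_per_tb_year


def transfer_days(tb: float, daily_cap_tb: float = 1.0) -> int:
    """Whole days needed to move `tb` under a daily up/down cap."""
    return math.ceil(tb / daily_cap_tb)


print(annual_cost_usd(20))  # 20 TB of hot storage -> 6000.0 USD per year
print(transfer_days(20))    # moving those 20 TB to or from cold storage -> 20 days
```

The second number is the one that tends to matter in practice: with a per-group cap, pulling a large cold dataset back takes days, so it needs to be planned ahead of the analysis.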

As noted by @genomax, in the longest of terms we rely on SRA/ENA to store our raw data. Our biggest problem is that raw data is generally only a small fraction of the data associated with a project. A project with 100GB of raw data can easily produce over 1TB of analysis products. Some of these can be safely discarded, but it's hard to know which, and others have to be retained for record-keeping purposes. It's this intermediate "grey" data that really poses a problem.

score 1 · Answer 3 · 2019-06-17

We have similar levels of data generation, about 20-30TB a year at present. We can't go for cloud data storage due to legal reasons.

24 TB SSD - hot data. On compute cluster
||
100-150 TB online (warm data). Local SLURM cluster. Mix of scratch partitions, and snapshotted Netapp.
||
10 TB tape (cold data, backup from warm to tape every 6-12 months).

We also back up to two local 60 TB MooseFS storage systems, run on a) 60 TB of internal RAIDs spread across 6 workstations and b) 60 TB of very large (6-8 TB) external hard disks spread across 3 workstations. Slow, but it seems stable and it manages two redundant copies of each chunk. It is open source software as well, so definitely a very cheap option, but there is no deduplication.

I am starting to look at ZFS because of its snapshotting and deduplication properties.
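Before committing to a deduplicating filesystem, it can help to check how much duplicated data there is in the first place. The sketch below is not how ZFS does it (ZFS deduplicates at the block level); it is a coarser file-level check that hashes whole files with SHA-256 and reports how many bytes are exact copies, so it will miss files that only partly overlap. The function names are made up for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in 1 MiB chunks to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicate_bytes(root) -> int:
    """Bytes reclaimable if identical files under `root` were stored once."""
    sizes_by_digest = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file():
            sizes_by_digest[file_digest(p)].append(p.stat().st_size)
    # Keep one copy per digest; every further copy counts as reclaimable.
    return sum(sum(sizes[1:]) for sizes in sizes_by_digest.values())
```

Calling `duplicate_bytes("/path/to/project")` on a representative project directory gives a rough idea of whether deduplication is worth its overhead; hashing many TB this way is slow, so a sample is usually enough.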

score 0 · Answer 4 · 2019-06-11

5.5 years ago

harold.smith.tarheel ★ 5.0k

My decidedly heterodox position (probably due to my foundational training in molecular biology) is to store the 'cold' data as library DNA in a -80˚ freezer. DNA is a technologically stable and incredibly information-dense platform - a small freezer could easily accommodate petabytes-to-exabytes equivalent of data at a fraction of the price of digital media. Plus, storage costs for most cold datasets are wasted, in the sense that they'll never be reanalyzed, which makes resequencing of the few reusable ones cost-effective.

But I've found that most users of our sequencing facility are strongly opposed to this suggestion - I would be interested to hear feedback from the Biostars community.

ADD COMMENT • link 5.5 years ago by harold.smith.tarheel ★ 5.0k

0

I agree in principle with your solution, but it may only be viable for an individual lab. Sequencing facilities deal with tens of thousands of samples, and storage of libraries at -80C for years quickly becomes unwieldy.

ADD REPLY • link 5.5 years ago by GenoMax 148k

0

I think that DNA should be stable long term at room temperature (RT) under proper storage conditions.

ADD REPLY • link 5.5 years ago by Jean-Karim Heriche 27k

0

You'd be surprised how easy it is. 40K+ clones (two whole-genome RNAi libraries for C. elegans) fit in a couple freezer racks, and we retrieve samples from those regularly (much more frequently than cold datasets).

ADD REPLY • link 5.5 years ago by harold.smith.tarheel ★ 5.0k

0

I guess it depends on your use case - I've never had a data set I haven't gone back to at least once a year.

ADD REPLY • link 5.5 years ago by i.sudbery 20k