Storing large amounts of data will become a problem for bioinformatics sooner or later. I faced this problem recently, and a lot of questions I had never thought about before surfaced. The most obvious are: How do you choose a filesystem? How do you partition a large (TB-range) HD? When is a cheap solution (e.g. a bunch of low-end HDs) inappropriate?
These are pressing issues here in the Brazilian medical community. Everyone wants to buy an NGS machine, mass spec, or microarray, but no one foresees the forthcoming data flood.
In practical terms, how do you store your data? A good reason for a given decision would be great too.
Edit:
I asked this question not so long ago, and things got hot here. They just finished building a whole facility to deal with cancer. A lot of people acquired NGS machines, and the TB scale seems to be a thing of the past. Now we are discussing what to keep and how to manage the process of data triage/filtering. So I really do need new tips from the community. Is anyone facing a similar problem (too much data)?
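One concrete way to start the triage discussion is to inventory what is actually on disk. Below is a minimal sketch; the file extensions and the size threshold are my assumptions (regenerable intermediates like uncompressed SAM/FASTQ are usually the first candidates), so adapt them to your own pipeline:

```python
import os

# Extensions that are typically regenerable from raw data or cheap to
# recompute -- illustrative only, adjust to your pipeline's intermediates.
REGENERABLE = {".sam", ".fastq", ".fq"}

def triage(root, min_size=100 * 1024**2):
    """Walk `root` and return large files that are candidates for
    compression or deletion (over `min_size` bytes, default 100 MB)."""
    candidates = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.splitext(name)[1] in REGENERABLE \
                    and os.path.getsize(path) >= min_size:
                candidates.append(path)
    return candidates
```

A report like this is useful even before any policy exists: it usually shows that a large fraction of the footprint is intermediates nobody will miss.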
Another edit:
Well, things are pretty fast-paced these days. 4TB HDDs are the standard, SSDs are common, and servers with onboard InfiniBand abound. There are also projects with huge throughput (e.g. Genomics England and its presumed 300GB per tumour sample). Annotation has gained many more layers. Outsourcing sequencing is rather common. This question seems a bit obsolete at the moment.
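To put a figure like 300GB per sample in perspective, the back-of-envelope arithmetic is simple. The sample count and replication factor below are illustrative assumptions, not project numbers:

```python
def storage_tb(gb_per_sample, samples, replication=2):
    """Rough capacity estimate in TB: per-sample size times sample count,
    times a replication factor for backups/redundancy (assumed 2x)."""
    return gb_per_sample * samples * replication / 1024  # GB -> TB

# e.g. 300 GB per tumour sample, 1000 samples/year, two copies:
# storage_tb(300, 1000) is roughly 586 TB per year
```

Even a modest sequencing program lands in the hundreds of TB per year once redundancy is counted, which is why the planning question matters more than the vendor question.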
When I read through the slides, it struck me how complex storage solutions are (maybe he is exaggerating a bit because they want to sell their own competence?). Anyway, I believe the most crucial part of storage is not the vendor or technology but the competence of the people planning and running it, with full-time sysadmins. The bioinformatician's role is to understand and specify the requirements. Disaster is guaranteed if a single poor bioinformatician is hired both to do the research and to build up the infrastructure.
Wow!!! This BioTeam is very nice. Thank ya, mndoci! I really appreciate case studies. BTW, I'm the poor bioinformatician. Not alone, as we have good IT infrastructure/people, but NGS/arrays/related will hit the diagnostic barrier hard this year. You can imagine what a very large, rich reference hospital will do. Anyway, storage solutions are complex at our scale and for our needs, and none of us has the required experience. Our cardio division uses a completely proprietary solution with a proprietary database and still suffers from problems regularly; the vendor didn't understand their specific needs. So any tip is handy!
Michael, storage solutions are extremely complex and very finicky. There is a reason some of the big storage vendors can charge as much as they do: they are essentially selling performance and reliability as a brand. At scale, though, that starts breaking down, and you are better served by commodity hardware with the software layer handling failure. And yes, you can't live in a world where a non-expert handles your infrastructure needs.