Help selecting storage hardware for bioinformatics core facilities
3
3
Entering edit mode
8.5 years ago
jiacov ▴ 50

I was recently asked to help someone set up a storage system for a young core facility, and I really lack the expertise to aid her. My setup involves 3 Synology NASes in RAID5. I have one (~40TB) that is mounted by various Linux machines and is used for daily work. Almost all processes read and write to the NAS while running jobs, with exceptions of course when programs recommend using local storage. The second NAS gets booted from time to time for a backup and otherwise remains off. The third NAS is in a different physical location but is shared by many teams in the Institute; it gets a monthly image of the most important source/script directories but cannot back up all the fastq/bam/bigWig/etc. files.

But I have so many issues with these Synology boxes that I cannot recommend them. They are not meant to be used in a multi-user Linux environment and do not honor the traditional file-permission settings that would make them actually useful for multiple users.
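
To make that concrete, here is a minimal sketch of the kind of check I have in mind for whether a share honours POSIX group permissions; the mount point and group name are hypothetical placeholders:

```python
# Minimal sketch of what I mean by honouring POSIX group permissions:
# walk a shared mount and flag anything not owned by, or not
# readable/writable by, the shared group. The mount point and group
# name are hypothetical placeholders.
import grp
import os
import stat

MOUNT = "/mnt/nas/shared"      # assumed NFS/CIFS mount point
EXPECTED_GROUP = "bioinfo"     # assumed shared Unix group

gid = grp.getgrnam(EXPECTED_GROUP).gr_gid

for root, dirs, files in os.walk(MOUNT):
    for name in dirs + files:
        path = os.path.join(root, name)
        st = os.lstat(path)
        group_rw = st.st_mode & stat.S_IRGRP and st.st_mode & stat.S_IWGRP
        if st.st_gid != gid or not group_rw:
            print(f"not group-shareable: {path} "
                  f"(gid={st.st_gid}, mode={stat.filemode(st.st_mode)})")
```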

So my first thought was to start a discussion here where people might share their solutions, both good and bad, to help her as well as others faced with the same problem. Google points me here, but the existing threads are old and were never very popular. I hope this one does not face the same fate.

In any case, ChIP-seq, RNA-seq, Hi-C, 16S, all sorts of medium/large-sized data are being generated. 40TB is a good starting point, but it needs to be scalable: a few years ago I was content with 12TB systems. The solution does not have to be all-in-one. The needs are both to permanently store raw data files and pipeline scripts and to handle reads/writes for processing, temporary files, etc. Support for multiple users and integration into a Linux environment are essential.
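
For the capacity-planning side, here is the sort of back-of-the-envelope calculation I use; the drive counts, sizes and the 10% filesystem overhead below are illustrative assumptions, not vendor figures:

```python
# Rough usable-capacity planner for comparing array layouts.
# Drive counts, sizes and the 10% filesystem overhead are
# illustrative assumptions, not vendor figures.

def usable_tb(n_drives, drive_tb, parity_drives, overhead=0.10):
    """Usable capacity (TB) after parity and filesystem overhead."""
    if n_drives <= parity_drives:
        raise ValueError("need more drives than parity devices")
    data_tb = (n_drives - parity_drives) * drive_tb
    return data_tb * (1 - overhead)

for label, n, size, parity in [
    ("12 x 4TB RAID5 (1 parity)", 12, 4, 1),
    ("12 x 4TB RAID6 (2 parity)", 12, 4, 2),
    ("24 x 6TB RAID6 (2 parity)", 24, 6, 2),
]:
    print(f"{label}: ~{usable_tb(n, size, parity):.0f} TB usable")
```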

If you need any more details, let me know, but I would hope that this question is science-agnostic.

RAID NAS storage hardware • 4.5k views
ADD COMMENT
1
Entering edit mode

I am not sure this is the right place for you to get answers. Most people here are bioinformaticians, and in most places this kind of issue is handled by dedicated sysadmins. You may have better luck asking on serverfault. Also, I can't think of a core facility that doesn't have access to some sort of data center. My feeling is that lab-based storage is not a good medium/long-term solution. As the system grows, will the lab have enough electricity, air-conditioning capacity, etc.? Also, how do you ensure security, e.g. can anybody walk into the lab and take a disk with sensitive data (e.g. medical information)?

ADD REPLY
0
Entering edit mode

Similar topics have been discussed here before. There are quite a few of us who also administer systems or have quite in-depth knowledge of using or setting up HPC clusters, storage systems, etc.

ADD REPLY
0
Entering edit mode

I'm not sure that matters; this is still not a bioinformatics question and I would certainly be tempted to close this as off-topic. Serverfault is definitely the right place for these questions.

ADD REPLY
0
Entering edit mode

Is this going to be in a data center or sitting somewhere in the lab/office ?

ADD REPLY
0
Entering edit mode

It will be sitting somewhere in the lab.

ADD REPLY
0
Entering edit mode

Whatever they go with, avoid using an HP 3PAR. We're using one with a bit under a petabyte of storage and performance is often a headache. At my previous institute we had some sort of hadoop-based solution that seemed to perform pretty well. That'd be nicely scalable to boot.

ADD REPLY
0
Entering edit mode

Thank you both for the links. After following and reading them, I have an idea of the type of hardware. We are in France, so we will have to have a meeting to find a time to schedule a meeting to define the guidelines of a future meeting where we can discuss a room with A/C and backup power to house the thing.

In addition, we are extremely limited in terms of providers. I did not say this at first because I wanted to know the true best answer before seeing what Dell and HP have in that hardware space. But unless one of these companies has already sold to the CNRS, I doubt it would be in either of our interests to conduct the transaction: it would be hard for them to honor their service guarantee, and the government is not the quickest or easiest to deal with in terms of paperwork, and it is slow to pay.

So, next part of the question: does Dell or HP make a competitive, or at least functional, storage solution for this application? I see now how a NAS like the one I use can be built to give awesome I/O, and I want to get one of my own. I even found some French vendors linked from the Supermicro site, so it might not be impossible to buy one; it's just that any pricing seems to elude me as I surf these sites. I am afraid to find out how much an entry-level unit might cost from either of the options in the responses from Dan Gaston or genomax2.

ADD REPLY
0
Entering edit mode

They both have solutions. I would recommend seeing if your IT department already has a contact at those vendors that you can deal with, as the CNRS probably pays a special rate, and if you only deal with Dell and HP that could actually mean a substantial discount. I feel your pain; we ended up going with Dell's FX2 converged system for our HPC component, and four server nodes cost us about $35,000 Canadian. The architecture is worth looking into for its scalability and flexibility, and they have storage-oriented nodes that integrate into the system.

ADD REPLY
0
Entering edit mode

Dell owns EMC so Isilon falls under Dell's umbrella :-)
That said, if you want to stay with Dell then you could look at the Dell EqualLogic FS7600 and FS7610 NAS systems or Dell Compellant NAS. You may also want to split the storage for sequencers with a direct attached storage unit like (MD1200) which you can share out from a front-end server to the sequencers (and this can stay in the lab in a mobile half-rack). MD1000/MD1200 are workhorse units and we used them before switching to isilon. None of these are going to be as cheap as 45drive but you can count on a person to show up the same day with replacement disks/parts (at least we can in the US) if you have the right kind of service contract.
And as @Dan said elsewhere don't go on published pricing on the web sites etc. Your institutional pricing could be substantially cheaper. If you can piggy-back on a large order then the discounts will be bigger.

ADD REPLY
0
Entering edit mode

Here's a link to the Dell FX2 converged architecture I was talking about: PowerEdge FX2

This sort of system can save some substantial costs, particularly if you're thinking about later expansion. It's similar to a blade-style architecture, but different and more flexible. All of the nodes (between 2 and 8, depending on which nodes you want in the enclosure) communicate directly with each other over the mid-plane, at least when you have the I/O Aggregator option installed, so you can get some blazingly fast speeds within an enclosure. The I/O Aggregators are basically a mini-switch in their own right, so you can eventually chain multiple enclosures together into fairly complex network topologies while needing fewer cables and fewer of the expensive network switches you would otherwise have to buy. You could easily set up a single enclosure with two storage nodes and two compute nodes (I believe) sitting on top of each other in a single 2U form factor (it is pretty loud, though).

Definitely something to consider, I'm very happy with mine for the compute portion of our set up.

ADD REPLY
2
Entering edit mode
8.5 years ago
GenoMax 147k

Without an understanding of your budget and tolerance for loss/failure, this question can only be answered in general terms.
This hardware shouldn't really be sitting in a lab (not ideal for a disk array of ~40TB; I see above that someone other than you seems to have said that, but perhaps that person is from your lab).
On the low end you can go with dense JBOD-type devices (an example with no specific significance: http://www.supermicro.com/products/nfo/storage.cfm, and I see that @Dan has recommended 45 Drives); at the high end, an Isilon/NetApp storage device. With a higher-end device you will get higher performance and (almost) unlimited potential for expansion, but that is going to come at a cost. We use a 100 TB Isilon array for collecting data from a multitude of sequencers and it has proven extremely reliable. It sits in the lab in a rack and has tolerated the less-than-ideal conditions very well.

ADD COMMENT
0
Entering edit mode

Good additions. My only quibble would be that, based on personal experience and on observations from people in the field like BioTeam, big enterprise systems like Isilon aren't leagues better than other solutions in terms of performance and robustness, especially given the price differential.

ADD REPLY
0
Entering edit mode

Sure, at the scale the OP is referring to. That was the reason I added a note about budget/tolerance for loss/failure.
But at a petabyte-plus of data on a cluster with 10,000+ cores, there is no better alternative.

ADD REPLY
0
Entering edit mode

The needs of companies like Amazon, Google, etc. really put this to the test. They needed to be able to operate at massive scale without spending that kind of money. Of course, they are at such a completely different scale that they build in massive redundancy and treat entire servers as field-replaceable units. My three storage servers will actually have over 1 PB of raw storage capacity.

ADD REPLY
0
Entering edit mode

Google/Amazon are probably using a custom Unix-based OS to manage storage (like OneFS in the case of Isilon). Pricing is reasonable, and if one figures in the cost of (not requiring) local personnel, infrastructure, electricity and cooling, cloud storage starts making sense (at an organizational level of course, not for a core).
NL Isilon nodes are capable of 150+ MB/s sustained throughput (the new X nodes are supposed to be 5x faster).

ADD REPLY
0
Entering edit mode

In this case I wasn't making a direct comparison to using a cloud vendor, just noting that commodity hardware can be used for highly performant, highly reliable compute and storage applications. The majority of that software has actually been open-sourced by the various companies involved as well.

ADD REPLY
1
Entering edit mode
8.5 years ago
DG 7.3k

So, up front, I would generally recommend NOT having it set up in the lab. If the core facility is located at an institution that has a data centre, trying to set up their storage in an appropriate facility is a good idea. Cooling, access controls, etc. are all good to have. In addition, while labs are loud, adding servers that can be deafening to the mix will potentially create a very loud lab.

Now that that is out of the way, my recommendation is to read a lot. The guys at BioTeam have a lot of great presentations out there discussing hardware trends and IT in Genomics. I found it very helpful when setting up my own solution. If the core has the budget, they do consulting and could be available to help her design and set up a system that fits her needs.

In my case I ultimately went with storage servers from 45 Drives. They partnered with Backblaze in designing their storage pods and have built on that to offer a pretty low-cost storage solution. Dollar per terabyte, they are definitely much more affordable than many other solutions. They are currently used by a number of core facilities and genomics centres in North America.

ADD COMMENT
1
Entering edit mode

I think you mean NOT "recommend having it set up in the lab."

ADD REPLY
0
Entering edit mode

Yes, that is what I meant. Thanks, editing :)

ADD REPLY
0
Entering edit mode
8.5 years ago
fanli.gcb ▴ 730

Just adding my $0.02, as I work in a newer core facility in roughly the same boat: we are currently running a pair of Linux boxes with ~20TB of RAID6 storage. It took us about 2 years to fill that up (we're pretty low-throughput, to be honest), and I'm now looking at adding a JBOD like @genomax2 suggested for another ~60TB or so. @Dan, that 45 Drives solution looks pretty sweet. Maybe I'll try to convince the boss we need THAT.... :)
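
For anyone doing similar planning, a quick projection from those numbers (assuming a constant ingest rate, which is probably optimistic given how sequencing throughput grows):

```python
# Back-of-the-envelope projection of when the added capacity fills up,
# using the figures above (~20 TB consumed in ~2 years, ~60 TB added).
# Assumes a constant ingest rate, which is almost certainly optimistic.

filled_tb = 20.0        # capacity consumed so far
years_elapsed = 2.0     # time it took to consume it
added_tb = 60.0         # planned expansion

rate_tb_per_year = filled_tb / years_elapsed
print(f"Ingest rate: ~{rate_tb_per_year:.0f} TB/year")
print(f"New capacity lasts: ~{added_tb / rate_tb_per_year:.0f} years at that rate")
```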

I really do have to agree with other answers and stress that this kind of a setup should be in a data center, not in a lab.

ADD COMMENT
0
Entering edit mode

They are pretty great. We currently have the servers only one-third populated with drives (4TB WD datacentre drives), which gives us 180 TB of raw capacity. Even with redundancy on each server (RAID-Z3 under ZFS, and BTRFS on an experimental server), we have 136 TB of usable capacity for probably under $50K Canadian.
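
A quick sanity check on those figures; the 45-bay chassis size is an assumption (45 Drives sells several sizes), while the usable capacity and cost ceiling are the numbers quoted above:

```python
# Sanity check on the figures above. The 45-bay chassis size is an
# assumption (45 Drives sells several sizes); the usable capacity and
# cost ceiling are the numbers quoted in the comment.

servers = 3
bays_per_server = 45          # assumed chassis size
fill_fraction = 1 / 3         # one-third populated
drive_tb = 4                  # 4 TB WD datacentre drives

raw_tb = servers * bays_per_server * fill_fraction * drive_tb
usable_tb = 136               # quoted, after RAID-Z3 / BTRFS redundancy
budget_cad = 50_000           # quoted upper bound

print(f"Raw capacity: {raw_tb:.0f} TB")                      # ~180 TB
print(f"Redundancy overhead: {1 - usable_tb / raw_tb:.0%}")  # ~24%
print(f"Cost per usable TB: under ${budget_cad / usable_tb:,.0f} CAD")
```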

ADD REPLY
0
Entering edit mode

What kind of cases are you using for this? Can, say, 3 different multithreaded applications running on different workstations, all using the Linux boxes for storage, still get good I/O speeds?

ADD REPLY
1
Entering edit mode

If all your infrastructure is local, you could set the storage up on a private network with 10G ports all around and the right switch. It will be pricier, but you would get the best performance. Ultimately you will be limited by the network speed out of the back of the storage appliance; keeping traffic out of routers and ensuring end-to-end 10G links will give you the best performance.
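
As a rough budget for what such a link can deliver (the protocol-overhead factor and the client count are illustrative assumptions):

```python
# Rough bandwidth budget for a private 10G storage network.
# The protocol-overhead factor and client count are illustrative
# assumptions.

link_gbit = 10
overhead = 0.10                 # assumed TCP/NFS overhead
clients = 3                     # e.g. three workstations hitting the storage

line_rate_mb_s = link_gbit * 1000 / 8           # ~1250 MB/s
effective_mb_s = line_rate_mb_s * (1 - overhead)
per_client_mb_s = effective_mb_s / clients

print(f"Effective link throughput: ~{effective_mb_s:.0f} MB/s")
print(f"Fair share per client:     ~{per_client_mb_s:.0f} MB/s")
```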

ADD REPLY
0
Entering edit mode

Yup, I highly recommend 10G. These days 10G is fairly cheap, until you have to go with a Cisco switch (like we did). If you have more freedom, there are performant 10G switches out there that cost about a third of even institutional Cisco pricing, so for relatively simple set-ups Cisco is massive overkill.

ADD REPLY
0
Entering edit mode

It's usually your networking that limits this to an extent, coupled with the actual I/O load on the disks, the filesystem being used, etc.

ADD REPLY
