I was just recently asked to help someone set up a storage system for a young core facility and I really lack the expertise to aid her. My setup involves three Synology NASes, each in RAID5. I have one (~40TB) that is mounted by various Linux machines and is used for daily work. Almost all processes read and write to the NAS while running jobs, with exceptions, of course, when a program recommends using local storage. The second NAS gets booted from time to time for a backup and otherwise remains off. The third NAS is in a different physical location and is shared by many teams in the Institute; it gets a monthly image of the most important source/script directories but cannot back up all of the fastq/bam/bigWig/etc. files.
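For context, here is roughly how that daily-work NAS is consumed and how the monthly image is taken. This is only a minimal sketch; the hostnames, paths, and backup user are made up for illustration:

```sh
# /etc/fstab entry on each Linux workstation (hypothetical host and paths)
nas01:/volume1/work   /mnt/work   nfs   rw,hard,_netdev   0 0

# Monthly image of the important source/script directories to the off-site NAS
rsync -a --delete /mnt/work/scripts/ backupuser@nas03:/volume1/monthly/scripts/
```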
But I have so many issues with these Synology boxes that I cannot recommend them. They are not really meant for a multi-user Linux environment and do not respect the traditional file permission settings that would make them genuinely useful for multiple users.
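For comparison, this is the kind of ordinary POSIX group setup I would expect to just work on a proper multi-user Linux fileserver (the group name and path here are hypothetical); on the Synology side the equivalent has to be clicked together in DSM and does not always carry over cleanly to the NFS exports:

```sh
# Shared project tree owned by a common group, with the setgid bit so new
# files inherit the group, plus a default ACL granting the group write access
groupadd bioinfo
chgrp -R bioinfo /mnt/work/projects
chmod -R 2775 /mnt/work/projects
setfacl -R -d -m g:bioinfo:rwX /mnt/work/projects
```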
So, my first thought was to try to generate a discussion here, where people might share their solutions, both good and bad, to try to help her as well as others faced with the same problem. Google points me here, but the threads are old and were never very popular. I hope this thread does not face that same fate.
In any case, ChIP-seq, RNA-seq, Hi-C, 16S: all sorts of medium-to-large datasets are being generated. 40TB is a good starting point, but it needs to be scalable; just a few years ago I was content with 12TB systems. The solution does not have to be all-in-one. The needs are both to permanently store raw data files and pipeline scripts and to handle reads/writes for processing, temporary files, etc. Support for multiple users and integration into a Linux environment are essential.
If you need any more details, let me know, but I would hope that this question is science-agnostic.
I am not sure this is the right place for you to get answers. Most people here are bioinformaticians, and in most places this kind of issue is handled by dedicated sysadmins. You may have better luck asking on serverfault. Also, I can't think of a core facility that doesn't have access to some sort of data center. My feeling is that lab-based storage is not a good medium/long-term solution. As the system grows, will the lab have enough electrical and air-conditioning capacity? Also, how do you ensure security, e.g. could anybody walk into the lab and take a disk with sensitive data (e.g. medical information)?
Similar topics have been discussed here before. There are quite a few of us who also administer systems or have quite in-depth knowledge of using or setting up HPC clusters, storage systems, etc.
I'm not sure that matters; this is still not a bioinformatics question, and I would certainly be tempted to close it as off-topic. Serverfault is definitely the right place for these questions.
Is this going to be in a data center or sitting somewhere in the lab/office?
It will be sitting somewhere in the lab.
Whatever they go with, avoid using an HP 3PAR. We're using one with a bit under a petabyte of storage and performance is often a headache. At my previous institute we had some sort of hadoop-based solution that seemed to perform pretty well. That'd be nicely scalable to boot.
Thank you both for the links. After following them and reading up, I have an idea of the type of hardware. We are in France, so we will have to have a meeting to find a time to schedule a meeting to define the guidelines of a future meeting where we can discuss a room with A/C and backup power to house the thing.
In addition, we are extremely limited in terms of providers. I did not say this at first because I wanted to know the true best answer before seeing what Dell and HP have in that hardware space. But unless one of these companies has already sold to the CNRS, I doubt it would be in either of our interests to conduct the transaction. It would be hard for them to honor their service guarantee, and the government is not the quickest or easiest to deal with, between the paperwork and the slow payments.
So, next part of the question: do Dell or HP make a competitive, or at least functional, storage solution for this application? I see now how a NAS like the one I use can be built to give excellent I/O, and I want one of my own. I even found some French vendors linked from the Supermicro site, so it might not be impossible to buy one; any pricing, though, seems to elude me as I surf these sites. I am afraid to find out how much an entry-level unit of the kind suggested by Dan Gaston or genomax2 might cost.
They both have solutions. I would recommend seeing if your IT department already has a contact at the vendors that you can deal with, as CNRS probably has a special rate, and if you only deal with Dell and HP it could actually mean a substantial discount. I feel your pain; we ended up going with Dell's FX2 converged system for our HPC component, and four server nodes cost us about $35,000 Canadian. The architecture is worth looking into for its scalability and flexibility, and they have storage-oriented nodes that integrate into the system.
Dell owns EMC so Isilon falls under Dell's umbrella :-)
That said, if you want to stay with Dell then you could look at the Dell EqualLogic FS7600 and FS7610 NAS systems or the Dell Compellent NAS line. You may also want to split off the storage for the sequencers with a direct-attached storage unit like the MD1200, which you can share out from a front-end server to the sequencers (and this can stay in the lab in a mobile half-rack). The MD1000/MD1200 are workhorse units and we used them before switching to Isilon. None of these are going to be as cheap as 45 Drives, but you can count on a person showing up the same day with replacement disks/parts (at least we can in the US) if you have the right kind of service contract.
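To illustrate the MD1200-behind-a-front-end-server idea: the direct-attached array just shows up as local block devices on the head node, and you re-export it over NFS to the sequencers and workstations. A minimal sketch, with made-up device names, paths, and subnet:

```sh
# On the front-end server: put a filesystem on the DAS volume and mount it
mkfs.xfs /dev/sdb1
mkdir -p /export/seqdata
mount /dev/sdb1 /export/seqdata

# Export it over NFS to the lab subnet (adjust the subnet and options)
echo '/export/seqdata 192.168.10.0/24(rw,sync,no_subtree_check)' >> /etc/exports
exportfs -ra
```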
And as @Dan said elsewhere, don't go by the published pricing on the web sites, etc. Your institutional pricing could be substantially cheaper, and if you can piggy-back on a large order the discounts will be bigger.
Here's a Link to the Dell FX/2 Converged Architecture I was talking about: PowerEdge FX/2
This sort of system can save substantial costs, particularly if you're thinking about later expansion. It's sort of like a blade-style architecture, but different and more flexible. All of the nodes (between 2 and 8, depending on which nodes you put in the enclosure) communicate directly with each other over the mid-plane, at least when you have the I/O Aggregator option installed, so you can get some blazingly fast speeds within an enclosure. The I/O Aggregators are basically a mini-switch in their own right, so you can eventually chain multiple enclosures together into fairly complex network topologies while needing fewer cables and fewer of the expensive network switches you would otherwise have to buy. You could easily set up a single enclosure with two storage nodes and two compute nodes (I believe) sitting on top of each other in a single 2U form factor (it is pretty loud, though).
Definitely something to consider, I'm very happy with mine for the compute portion of our set up.