Question

Hardware Suitable For Generic Nextgen Sequencing Processing?

5


14.2 years ago

Geoffjentry ▴ 320

Hello. My lab is in a difficult situation: we need to acquire new hardware for the next year or two, and for budgetary reasons it has to be done in a short time frame. The difficult part is that we are currently exploring multiple nextgen sequencing technologies, as we may need to replace the system we're currently using. The data for our current platform is embarrassingly parallel in its processing and would work well in a blade/cluster setup, but I have no idea what the optimal hardware for processing other nextgen data (e.g. Illumina, Pacific Biosciences) would be. Because of the potential change of platforms, I'm leery of simply basing new hardware on the needs of the current platform, in case some of those assumptions no longer apply.

For people that do nextgen sequencing, what sort of hardware solutions are you using? Clusters? Large servers? Would love suggestions on specific brand/models as well.

I should also mention that for reasons out of my control (powers that be, and all that), CUDA and the like aren't an option for us.

next-gen sequencing hardware • 7.5k views

ADD COMMENT • link updated 3.2 years ago by Ram 44k • written 14.2 years ago by Geoffjentry ▴ 320

4


Some recommendations on this question: Any Hardware Recommendations For A Molecular Biology Lab That's Getting Into Bioinformatics?

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 14.2 years ago by Simon Cockell 7.4k

0


@Simon: I did see that thread, thanks though!

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 14.2 years ago by Geoffjentry ▴ 320

1


Can you provide a little more info? I ask because the answers to this question will largely depend on what you're doing with that sequence data. De novo genome assembly programs typically require huge amounts of RAM (really, the more the better). Modern algorithms for mapping reads, though, need CPU but not a lot of RAM. The final question is: how busy will these CPUs be? If they'll be idle 75% of the time or more, you might look into EC2 or other cloud-computing options.

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 14.2 years ago by Chris Miller 22k

1


@chris: I wish I could. On the "what", it's really a mixed bag ranging from digital gene expression to assembly. Right now it's mapping reads, though, hence the clusterability. We're currently doing the processing on an HP Linux machine w/ 16 CPU cores and 96GB of RAM, and the bulk of the processes take 4-8GB of RAM and as much CPU as they can get. The largest problem we have w/ the current platform's software is that it makes use of SQLite DBs and will quickly saturate the machine's I/O capacity if we have many processes running.

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 14.2 years ago by Geoffjentry ▴ 320

0


A quick note: you will most likely also need substantial system administration expertise; this is particularly true when investing in cluster-computing solutions.

ADD REPLY • link 14.2 years ago by Istvan Albert 101k

0


@chris part 2: I've looked at EC2. It's not really an option for the same reason CUDA isn't - the people signing the checks don't like that idea.

@Istvan: Sysadmin isn't a big deal. We already do most of the management of our servers ourselves, and have "real" sysadmins behind that if anything happens.

ADD REPLY • link 14.2 years ago by Geoffjentry ▴ 320

Ram · Answer 1 · 2010-09-26

Okay, well then I'll go ahead and throw some info out there in the hopes that it's useful to you.

What I can tell you is that the cluster we share time on has 8-core machines with 16GB of RAM each, and they're sufficient for most of our needs. We don't do much assembly, but we do a ton of other genomic processing, ranging from mapping short reads all the way up to SNP calling and pathway inference. I also still do a fair amount of array processing.

With most cluster management tools (PBS, LSF, whatever), it should be possible to let a user reserve more than one CPU per node, effectively giving them up to 16GB for a process if they reserve the whole node. Yeah, that means some lost cycles, but I don't seem to use it that often - 2GB is still sufficient for most things I run. It'd also be good to set up a handful of machines with a whole lot of RAM - maybe 64GB? That gives users who are doing things like assembly or loading huge networks into RAM some options.
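To make the trade-off concrete, here's a rough Python sketch of the arithmetic. It assumes each reserved core comes with an equal share of the node's RAM, which is a simplification (real schedulers differ in how they account for memory). The function names and defaults are made up for illustration; the 8-core/16GB numbers are just our nodes.

```python
import math

def cores_to_reserve(job_mem_gb, node_cores=8, node_ram_gb=16.0):
    """Cores to reserve on one node so the job's RAM share covers job_mem_gb.

    Assumption: every reserved core brings node_ram_gb / node_cores of memory.
    """
    if job_mem_gb <= 0:
        raise ValueError("job_mem_gb must be positive")
    if job_mem_gb > node_ram_gb:
        raise ValueError("job does not fit on a single node of this size")
    mem_per_core = node_ram_gb / node_cores
    return max(1, math.ceil(job_mem_gb / mem_per_core))

def idle_cores(job_mem_gb, cores_used=1, node_cores=8, node_ram_gb=16.0):
    """Reserved cores that sit idle when the job only computes on cores_used."""
    reserved = cores_to_reserve(job_mem_gb, node_cores, node_ram_gb)
    return max(0, reserved - cores_used)

if __name__ == "__main__":
    for mem in (2, 5, 16):
        print(mem, "GB ->", cores_to_reserve(mem), "cores reserved,",
              idle_cores(mem), "idle for a single-threaded job")
```

On a high-RAM node the same job ties up fewer cores, e.g. `cores_to_reserve(6, node_cores=8, node_ram_gb=64)`, which is the argument for having a few of those around.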

I more often run into limits on I/O. Giving each machine a reasonably sized scratch disk and encouraging your users to make smart use of it is a good idea. Network filesystems can get bogged down really quickly when a few dozen nodes are all reading and writing data. If you're going to be doing lots of really I/O-intensive stuff (and dealing with short reads, you probably will be), it's probably worth looking into faster hard drives. Certainly 7200RPM, if not 10k. Last time I looked, 15k drives were available, but not worth it in terms of price/performance. That may have changed.
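By "smart use" of scratch I mean roughly this pattern: copy the input to local disk once, do all the intermediate reading and writing there, and copy only the final results back to shared storage. A minimal Python sketch follows; `run_on_scratch`, `process` and `scratch_root` are names I've made up for illustration, and where scratch actually lives depends on how your nodes are set up.

```python
import shutil
import tempfile
from pathlib import Path

def run_on_scratch(input_path, output_dir, process, scratch_root=None):
    """Stage input to node-local scratch, run process there, copy results back.

    process(local_input, local_out_dir) is any callable that writes its
    result files into local_out_dir. scratch_root=None falls back to the
    system default temp directory.
    """
    input_path = Path(input_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    with tempfile.TemporaryDirectory(dir=scratch_root) as tmp:
        tmp = Path(tmp)
        local_in = tmp / input_path.name
        shutil.copy2(input_path, local_in)  # one sequential read from shared storage
        local_out = tmp / "out"
        local_out.mkdir()

        process(local_in, local_out)  # intermediate I/O stays on the local disk

        copied = []
        for item in sorted(local_out.iterdir()):
            if item.is_file():
                dest = output_dir / item.name
                shutil.copy2(item, dest)  # only final results hit the network
                copied.append(dest)
    # the scratch directory is removed automatically when the block exits
    return copied
```

The point of the design is that the network filesystem sees one read and one write per job instead of a stream of small random accesses from every node.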

I won't get into super-detail on the specs - you'll have to price that out and see where the sweet spot is. I also won't tell you how many nodes to get, because again, that depends on your funding. I will say that if you're talking about a small cluster for a small lab, it may make sense to just get 3 or 4 machines with 32 cores and a bunch of RAM, and not worry about trying to set up a shared filesystem, queue, etc - it really can be a headache to maintain. If you'll be supporting a larger userbase, though, then you may find a better price point at fewer CPUs per node, and potentially have fewer problems with disk I/O (because you'll have fewer CPUs per HD).

People who know more about cluster maintenance and hardware than I do, feel free to chime in with additions or corrections.