Question

What are the IT Requirements for High Throughput Sequence Analysis

1

12.4 years ago

Davy ▴ 410

Hi All,

So my department is considering spending some money on upgrading our computing facilities, which are pretty under-powered for any kind of serious sequencing analysis. I've been asked to come up with a rough idea of what we're going to need for the next 5 to 10 years.

The obvious points are

lots of CPUs
lots of RAM
lots of Storage

but I was hoping someone might know some resource I could read or take a look at, that might enable me to come up with something that isn't a complete guess.

I was thinking somewhere in the region of 8 or 12 cores per node, at 100 nodes total, 96 - 128 GB RAM per node, and (probably ridiculous, but) 5000 TB storage. We have a lot of samples that will be sequenced (probably exome rather than whole genome, I would imagine), plus various other sequencing activities like RNA-seq and ChIP-seq. What I'm woefully ignorant of is the architecture of these systems. Should we be building a distributed system (all the tech will likely be housed in one place)? What kind of tech do we need to run the right software, so that I'll be able to make full use of it for mapping, variant calling, etc.? What are the power, cooling and space requirements?

Since it will be mostly me setting up the pipelines and pushing the data through them, I want to come up with some concrete numbers that will ensure that we can get analyses done quickly, and that we will have a system that we can scale as our needs increase (future-proofing).
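
One way to get from a guess to a first concrete number is a back-of-envelope storage calculation. The sketch below is only that: the function name `storage_needed_tb` and every input in the example call (samples per year, GB per sample, number of copies, headroom) are made-up placeholders for illustration, not measurements, and should be replaced with figures from the actual sequencing plans.

```python
def storage_needed_tb(samples_per_year, gb_per_sample, years, copies=2, headroom=0.25):
    """Rough raw-capacity estimate in terabytes (1 TB = 1000 GB).

    copies   -- how many copies of each file are kept (e.g. primary + backup)
    headroom -- fraction of extra free space to keep on top of the data
    """
    if min(samples_per_year, gb_per_sample, years, copies) < 0 or headroom < 0:
        raise ValueError("all inputs must be non-negative")
    total_gb = samples_per_year * gb_per_sample * years * copies
    return total_gb * (1 + headroom) / 1000


# Hypothetical inputs, purely to show how the function is called.
estimate = storage_needed_tb(samples_per_year=1000, gb_per_sample=20, years=5)
print(f"Estimated capacity: {estimate:.0f} TB")
```

Running the same function for a low, expected and high scenario gives a range to take to vendors rather than a single figure.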

Hope someone knows of something,

Cheers,
Davy.

next-gen • 6.1k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 12.4 years ago by Davy ▴ 410

1

Here is a somewhat older but still relevant post: Big Ass Servers & Storage

ADD REPLY • link 12.4 years ago by Istvan Albert 102k

0

Would you consider a cloud solution that wouldn't require extra hardware and is more easily scalable?

ADD REPLY • link 12.4 years ago by shadowsage554 • 0

Istvan Albert · Answer 1 · 2013-02-10

5

12.4 years ago

User 59 13k

How many samples is 'a lot'? I'm just a bit worried that you state you're ignorant of the 'architecture of the systems' (do you mean the analysis or the infrastructure?). If you're ignorant of the systems, you're not the best person to be pricing one up or speccing one out. You need to speak to someone (preferably a large range of vendors) with a concrete set of requirements. Work from there. And from long experience I can tell you that I/O is likely to be more critical to your choices than how many cores you have available. There has been some discussion on this site before: NGS data centers: storage, backup, hardware etc.

There seem to be some presentations from a workshop last year on HPC infrastructure for NGS:

http://bioinfo.cipf.es/courses/hpc4ngs

And another from 2011:

http://www.bsc.es/marenostrum-support-services/hpc-events-trainings/res-scientific-seminars/next-gen-sequencing-2011

There's a good primer here:

http://www.slideshare.net/cursoNGS/pablo-escobar-ngs-computational-infrastructure

And vendors with interest in the space:

http://www.emc.com/industry/healthcare.htm

ADD COMMENT • link updated 12.4 years ago by Istvan Albert 102k • written 12.4 years ago by User 59 13k

4

"And from long experience I can tell you that I/O is likely to be more critical to your choices than how many cores you have available." - just wanted to stress this part of your answer, no point having badass HPC if it takes hours (or days!) to pull up files from archive for analysis.
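
To put a rough number on that, here is a small sketch of how long it takes just to move a dataset over a network link. It is an idealised estimate, not a benchmark: `transfer_hours` is a hypothetical helper, the `efficiency` factor is an assumed fudge for protocol and disk overhead, and it assumes concurrent jobs share the link equally.

```python
def transfer_hours(dataset_tb, link_gbit_per_s, efficiency=0.7, concurrent_jobs=1):
    """Hours for one job to read dataset_tb terabytes over a shared link.

    link_gbit_per_s -- nominal link speed in gigabits per second
    efficiency      -- assumed fraction of nominal speed actually achieved (0-1]
    concurrent_jobs -- number of jobs sharing the link equally
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    if link_gbit_per_s <= 0 or concurrent_jobs < 1:
        raise ValueError("link speed must be positive and concurrent_jobs >= 1")
    # Convert gigabits/s to bytes/s, then apply overhead and sharing.
    usable_bytes_per_s = link_gbit_per_s * 1e9 / 8 * efficiency / concurrent_jobs
    return dataset_tb * 1e12 / usable_bytes_per_s / 3600


# Hypothetical comparison: the same 1 TB dataset on a 1 Gbit/s and a 10 Gbit/s link.
for speed in (1, 10):
    print(f"{speed} Gbit/s: {transfer_hours(1, speed):.1f} h")
```

Raising `concurrent_jobs` shows how quickly a single storage server link becomes the limiting factor once many nodes read at the same time.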

ADD REPLY • link 12.4 years ago by zx8754 12k

0

Thanks for the info. As you say, I know full well I am not the best person to be doing this. I'm plenty familiar with the analysis of a small number of samples (20 - 100) in targeted regions (usually about 10 Mb in total), but the architecture is difficult to get my head around, which is why I've been chosen. An impetus to learn, I suppose.

ADD REPLY • link 12.4 years ago by Davy ▴ 410

score 0 · Answer 2 · 2013-07-08

Just based on my own experience:

1) A high performance Sun Grid Engine cluster. Most NGS analysis (mapping with bwa, SNP calling with GATK, etc.) can easily be run in parallel by just splitting the data. You need a high performance head node and a number of cluster nodes. I wouldn't worry too much about the size or speed of the cluster nodes; just make sure the cluster is upgradable and extendable when newer, faster machines become cheaper or you get more money (because other groups also want to chip in). Something to start with, for example, would be 10 nodes with 8 cores and 32 GB of memory each, or a multiple of this.

2) High performance shared data storage that the cluster can read from and write to over network shares. This is an important part where spending money makes sense. Most NGS compute clusters are I/O limited: input and output to a central server, and reading and writing from the compute nodes, is a bottleneck. NGS is not just a compute problem but also a data problem. Look at the high-end solutions that the big NGS centers have and see if you can buy the same.

3) A hardware-agnostic, massively scalable object storage system for archiving (long-term storage) of raw NGS and derived data. Once the computing is done you want to move the data to a less expensive storage system. I have good experience with commercial systems of this kind. These are software based and run on an operating system derived from Linux. You can buy or use any hardware as long as it can run Linux. You put all the nodes in a LAN and they boot from a USB key with the software. They form a storage cluster. Every file is stored in duplicate or triplicate on different nodes. Read and write requests are broadcast to the cluster and the node that has the most resources free executes the request. If a node breaks down you can throw it out of the window, and the missing data is automatically replicated from the remaining copies onto other nodes until the specified redundancy level is restored. If you want to upgrade or extend your storage cluster you buy new machines, put them in the network, and throw away the old ones, and you don't need to do any administration.

4) For analysis that can't be run in parallel, or when you don't want to invest the work to make it parallel, you need a big ass server. Something like 48 CPUs and 1 TB of memory, plus a smaller variant of this, so you can have the big one run de novo assembly for weeks and still have a smaller big ass server for other work.

5) A fast network between all your machines.
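
To illustrate the "just splitting the data" idea from point 1, here is a minimal sketch that cuts FASTQ input into chunks of whole reads, so each chunk could be handed to a separate cluster job. It assumes the common layout of exactly four lines per read (no wrapped sequence lines). `fastq_chunks` is a hypothetical helper written for this example, not part of bwa, GATK or Grid Engine, and submitting the chunks as jobs is left out.

```python
from itertools import islice


def fastq_chunks(lines, reads_per_chunk):
    """Yield lists of FASTQ lines, each holding up to reads_per_chunk reads.

    Assumes every read occupies exactly 4 lines (header, sequence, '+', qualities).
    """
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    it = iter(lines)
    while True:
        chunk = list(islice(it, 4 * reads_per_chunk))
        if not chunk:
            return
        if len(chunk) % 4:
            raise ValueError("input ends in the middle of a FASTQ record")
        yield chunk


# Tiny in-memory example with made-up reads; a real run would iterate over an open file.
reads = []
for i in range(5):
    reads += [f"@read{i}", "ACGT", "+", "IIII"]

for n, chunk in enumerate(fastq_chunks(reads, reads_per_chunk=2)):
    print(f"chunk {n}: {len(chunk) // 4} reads")
```

Because the generator only ever holds one chunk in memory, the same approach works on files far larger than the RAM of the node doing the splitting, although the splitting step itself still has to read the whole file once from shared storage.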