Ngs Data Centers: Storage, Backup, Hardware Etc..
6
13
Entering edit mode
13.5 years ago

Hi all,

I was wondering what kind of infrastructure(s) (hardware, backup...) are currently used by the large sequencing centers (Sanger , etc...).

At this time, the only interesting informations I found are Guy Coates's slides ( http://www.slideshare.net/gcoates ).

Any hint would be appreciated.

(Community wiki)

next-gen sequencing data • 7.4k views
ADD COMMENT
8
Entering edit mode
13.5 years ago

Yes indeed, the kilokluster is not used anymore. the new clusters at UCSC are called swarm, encodek and memk. Swarm is the only interesting cluster here, with something around 1000 CPUs on ~ 250 nodes, built in part by clusterguys

The key part of all clusters here is a 250 TB IBM GPFS filesystem. GPFS costs money but it sits in the kernel and has been designed from the start for distributed operation. I don't know how it works and where the caching servers are located, I can only say that it is extremely fast. Way faster in distributed mode than ZFS, and even the metadata information arrives on time. The network links are normal 1 GB ethernet from what I know. Big clusters cannot use NFS at all, to my knowledge.

The other important part are local harddisks, 250 GB each, they should be bigger, but that's what we have. Genomes and all big datafiles are stored on local harddisks and are loaded from there, that avoids the clogging the network.

We do not use LSF or Sun/Oracle grid engine but parasol, a light-weight job scheduler written by Jim Kent. It can handle millions of jobs and submits a job within milliseconds, and can recover from some failures automatically. Some parts of the group have plugged more complex layers on top of that, notably JobTree by Ben Paten but that's only if you have complex workflows. JobTree can restart the right jobs after any errors.

The use of local harddisks and a distributed filesystem is in the end very similar to what Amazon, Yahoo and Google are doing. They have distributed file systems (like Google FS, HDFS) + use local storage (HDFS) + a fast scheduler, just we do not have to rewrite our software to use the local storage in a certain way (like MapReduce or Hadoop), we just input and output files. But this is also because, in genomics, our datafiles are very very small compared to what big internet companies have and jobs are often computation-intense and not as IO-intense.

I know a bit about the Sanger infrastructure but I better let that to someone from the EBI to explain more in detail.

ADD COMMENT
1
Entering edit mode

You probably know Chris already from the blogosphere, he has lots on info on his blog about clusters for genomics http://blog.bioteam.net/

At the moment, if you plan to build your own cluster, I would go for one of those massive 48-CPU machines and run parasol on it. That's in my opinion the simplest way to get a small cluster running, but I have never tried it (has anyone here?)

ADD REPLY
0
Entering edit mode

many thanks Maximilianh ! :-)

ADD REPLY
0
Entering edit mode

Excellent update on the UCSC systems, thanks for this Max. Info on the Sanger system would probably be appreciated by all as well.

ADD REPLY
0
Entering edit mode

maximilianh - although I agree huge SMPs are nice, why'd you want to run a batch system on a single machine? IME, batch queuing is better avoided if possible.

ADD REPLY
0
Entering edit mode

if your software is not multithreaded, then batch queue is the only alternative, right?

ADD REPLY
0
Entering edit mode

By the way, GPFS is using multiple fileservers, Here we use around 30 fileservers for 250 compute nodes

ADD REPLY
0
Entering edit mode

NFS should not really be a problem even for large clusters, now that there is pNFS (parallel NFS). Also, some vendors have proprietary NFS like solutions that give you direct access to the storage blades/nodes (after talking to a directory blade), which should make it possible to scale up quite well (we run one such at UPPMAX). Might depend on what is a large cluster, but at least a ~340 node cluster (2xquad core in each node) against a 600TB such parallel storage system, is not really any problem at all, from our experience.

ADD REPLY
4
Entering edit mode
13.5 years ago

OK, from what I know from voeglein.de (or better, its owner), they have a lot more than 1000 CPUs, but also normal 1 GB links. Their pipelines also read and write to the local filesystem. But then everything over there is wrapped in PERL scripts that use their own system (the "beehive" or also the "ensembl-pipeline", two competeing job systems, like JobTree at UCSC).

The unusual thing at Ensembl is that the Perl scripts actually write EVERYTHING to mysql. The Ensembl/EBI pipelines seem to rely completely on Mysql. Due to the growth of the cluster, one single Mysql server cannot handle this load of course and the transfer would clog the network links, so they have all nodes in a one rack write to their own dedicated mysql server. Even for next-gen read mapping they use this system. They have custom mysql configs for the MyISAM system that seem to make writes fast enough and they must have lots of Mysql servers running. Somehow, but I don't know how, they must regroup all Mysql tables at the end of a run from the racks to a central MySQL server, but I am less sure about that, perhaps their job systems can keep the data distributed to the racks.

The EBI is using LSF as a scheduler mostly, but their Perl-jobsystems seem to handle SGE as well. I don't know what they use as a file system.

The distribution of files to local harddisks by the way is a matter of separate research. Phycisists are the only people in science that deal with a lot bigger datasets than us, and they have some implementations of broadcast ftp clients than can transfer >100GB to thousands of nodes in parallel, with minimum overhead like mcp http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.4324 or uftp http://www.tcnj.edu/~bush/uftp.html . I am quite sure that neither UCSC nor the EBI is using that though, we just have some scripts that do round-robin scp operations to the individual nodes whenever a new genome comes out.

Would be nice if someone from the EBI could give some information directly though.

ADD COMMENT
1
Entering edit mode

They use Lustre. Extensively relying on mysql is double-sided. Most people still use LSF only. Read mapping is mainly done without beehive or mysql.

ADD REPLY
0
Entering edit mode

Thanks Heng! As I said, I am not the best person to ask on the EBI stuff and most of this was just what I had heard while being on a workshop there.

ADD REPLY
2
Entering edit mode
13.5 years ago

Information on the UCSC Genome Bioinformatics KiloKluster system can be found here and here

EDIT 1:

Information on the Swiss Institute for Bioinformatics VITAL-IT system is here and here.

EDIT 2:

Information on the Eichler lab computational facilities can be found here.

ADD COMMENT
0
Entering edit mode

very interesting Casey. Thanks !

ADD REPLY
0
Entering edit mode

Seems to be a bit dated, and I presume they no longer use 800MHz P3s. Or if they do, it's unlikely to be among the most powerful bio-clusters. :-)

ADD REPLY
0
Entering edit mode

@Ketil, agreed that the UCSC information is out of date, but I think it reveals some of their architecture. It would be good to get more up to date info on this system for sure, e.g. http://www.moderntech.com.hk/chinese/panasas/docs/SuccessStory_UCSC_FINAL.pdf

ADD REPLY
0
Entering edit mode

That is ancient and architectures have evolved over the past few years, both in terms of rack design and data center design as well as storage architectures.

ADD REPLY
2
Entering edit mode
13.4 years ago
Gentle Yang ▴ 190

The hardware of BGI(Shenzhen,China) can be found here : http://www.genomics.cn/en/platform.php?id=61

I know less about BGI's software condition except SUN GRID Engine and NVIDIA GPU .

ADD COMMENT
1
Entering edit mode
13.0 years ago
Samuel Lampa ★ 1.3k

It seems you were focusing mainly on the hardware aspect, but something that seems really important in the backbone infrastructure at Sanger, is their implementation of the iRODS system, "Rule Oriented Data Management system", for which they recently published a paper on the implementation of 1. They also have some slides about it [2].

I recently summarized iRODS like so:

"Basically, iRODS is a rule-oriented data management system, that sits as a logical layer on top of actual file systems, provides a unified file identifier namespace, and can automate things like data migration between fast cache-like storage and longer time archiving storage, meta data tagging etc. (or, the automation itself can be controlled by manual tagging). Client access is done via the shell through the i-commands, via a web-file manager interface, fuse module, Java or PHP API. All in all, iRODS looks surprisingly mature, and to provide good flexibility while keeping the tech-stack reasonably simple."

iRODS seem to be adopted by more and more big players now. Also broad Broad has an implementation [3], NSF is founding a national (US) initiative which will use it [4] and the newly initialized EUDAT project will use it as a key component [5]. Of course, UPPMAX/UPPNEX is working on an implementation too [6] :)

1 http://www.biomedcentral.com/1471-2105/12/361/abstract

[2] http://event.twgrid.org/isgc2011/slides/life/Gen-Tao%20Chiang.pdf

[3] http://distributedbio.com/blog/archives/64

[4] http://www.renci.org/news/releases/nsf-datanet

[5] "personal communication" at Bio-IT in Hanover Oct 2011 :) EUDAT website: http://www.eudat.eu

[6] http://saml.rilspace.org/irods-for-managing-next-gen-sequencing-data-in-an-hpc-environment

ADD COMMENT

Login before adding your answer.

Traffic: 1976 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6