Does EC2 Have Fast And Cheap Shared File Systems?
13.2 years ago
lh3 33k

I overheard a conversation about moving an exome SNP calling pipeline to Amazon's EC2. If I understood the part I heard correctly, EC2 does not provide a fast and cheap shared file system. If we want to call SNPs region by region, the preferred way is actually to transfer (with scp or something similar?) all the alignments to each local virtual machine (VM) first. Is this true? Can we get a fast shared file system from EC2? Do we need to pay more for this feature? I have essentially zero experience with EC2.

Of course for this exome SNP calling issue, we may restructure the pipeline to make it much more friendly for a cluster without a shared file system. Nonetheless, this leads to quite a lot of work for the developers who were discussing this. Lacking a fast shared file system in general makes development more difficult.
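
One way to picture the region-by-region restructuring: split the target regions into roughly balanced batches, one per VM, so that each VM only needs the alignments overlapping its own batch. This is only a minimal sketch; the region tuples and the function name are hypothetical and do not come from any actual pipeline.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end); hypothetical layout


def partition_regions(regions: List[Region], n_workers: int) -> List[List[Region]]:
    """Greedily assign regions to workers, balancing total bases per worker."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    batches: List[List[Region]] = [[] for _ in range(n_workers)]
    loads = [0] * n_workers
    # Place the largest regions first; each goes to the currently lightest worker.
    for region in sorted(regions, key=lambda r: r[2] - r[1], reverse=True):
        lightest = loads.index(min(loads))
        batches[lightest].append(region)
        loads[lightest] += region[2] - region[1]
    return batches


if __name__ == "__main__":
    targets = [("chr1", 0, 500), ("chr1", 1000, 1300), ("chr2", 0, 400), ("chr2", 900, 1000)]
    for worker, batch in enumerate(partition_regions(targets, 2)):
        print(worker, batch)
```

Each batch could then drive one VM's copy step, so only the alignments for those regions need to be transferred to it.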

cloud • 35k views
13.2 years ago
Pablo Pareja ★ 1.6k

In some cases just EBS could be a good option (e.g. when there's no parallel programming involved or synchronization issues); however depending on your needs, a combination of several AWS services would make a much better fit.

In my case, among the projects where I have used AWS, there's one where I had to develop a pipeline dealing with a decent amount of metagenomics data (analysis + storage + visualization of results).

In that case, I went for a combined solution. For the analysis phase, I took advantage of both the SNS and S3 services to synchronize the different instances which were doing all the computation needed. Then for storage and visualization I'm using an EBS volume combined with S3.

It's important to know that calls to services like S3 from inside Amazon's network (I mean from your EC2 instances, for example) are incredibly fast; I'd dare to say that they're sometimes even faster than local hard-drive transfers. Data transfer pricing is really low compared to, for example, that of launching EC2 instances. You can check it here (I'm just giving the links for the S3 and EC2 services):

EC2 --> http://aws.amazon.com/ec2/pricing/

S3 --> http://aws.amazon.com/s3/pricing/

Summing up, I'd say that the power of AWS resides in the combined use of its different services. At the beginning it may look like an important investment of time + resources, but at least in our case it pays off greatly.

13.2 years ago

In addition to Pablo's very good answer, Amazon resources can also be used similarly to how you'd set up a local cluster. This is the approach used by StarCluster:

http://web.mit.edu/stardev/cluster/

and Galaxy's CloudMan:

http://wiki.g2.bx.psu.edu/Admin/Cloud

I'm more familiar with the latter, which uses an SGE cluster manager. The filesystem is EBS and shared across the cluster using NFS.

Relevant to your exome SNP calling question, I recently wrote up an example of this that uses CloudMan to manage the cluster and BWA, GATK, Picard and snpEff in the analysis pipeline:

http://bcbio.wordpress.com/2011/08/19/distributed-exome-analysis-pipeline-with-cloudbiolinux-and-cloudman/


Yes, that's right. You need NFS, or some other mechanism, to create a shared filesystem on top of it.


The local disk is most often EBS-backed as well, so it really just depends on how you prefer to organize your data across volumes. Your choices are either EBS or instance store; I've always heard EBS recommended for disk-based storage, so I have really only used that. Here's the StackOverflow discussion: http://stackoverflow.com/questions/3630506/benefits-of-ebs-vs-instance-store-and-vice-versa


In Amazon, an EBS volume can be mounted on only one instance. Is that right?


I see. You mean we can set up NFS/SGE inside EC2? Can we use the "local" disk of each instance instead of using EBS?


Hmm... I forgot that we can set up NFS in EC2. How about using the local disk of each instance instead of using EBS?

13.2 years ago

William Spooner suggested using a shared file system (e.g. Lustre or Gluster). If you create a shared file system and spread it across, say, 30 file server nodes, each server node will serve its EBS partition as part of one huge file system. Each worker node can mount this file system. This should be very similar to what you have on the Sanger clusters: the system has built-in redundancy and should be very fast, as the I/O load is spread over 30 servers.
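
To illustrate why spreading the load helps, here is a toy model of round-robin striping: consecutive fixed-size stripes of a file live on consecutive servers, so a long sequential read touches all of them. The stripe size and the function name are illustrative assumptions, not how Lustre or GlusterFS are actually configured.

```python
from collections import Counter


def servers_for_read(offset: int, length: int, stripe_size: int, n_servers: int) -> Counter:
    """Count bytes served by each server for a read of `length` bytes at `offset`,
    assuming stripe k is stored on server k % n_servers."""
    served: Counter = Counter()
    pos = offset
    end = offset + length
    while pos < end:
        stripe = pos // stripe_size
        stripe_end = (stripe + 1) * stripe_size
        # Read up to the end of this stripe or the end of the request.
        chunk = min(end, stripe_end) - pos
        served[stripe % n_servers] += chunk
        pos += chunk
    return served


if __name__ == "__main__":
    # A 300 MiB sequential read over 30 servers with a hypothetical 1 MiB stripe.
    mib = 1024 * 1024
    load = servers_for_read(0, 300 * mib, mib, 30)
    print(len(load), max(load.values()) // mib)
```

In this model every server ends up serving an equal share of the read, which is the effect described above; real systems add metadata traffic and network limits that the sketch ignores.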

The file system that Amazon suggests for this is called GlusterFS; you can get completely set-up AMI images and support from a company. You can get a trial account and try out the AMI, though I have not tried this myself yet. http://aws.amazon.com/solutions/solution-providers/gluster/


Thanks. Very useful.


The problem with NFS is that bandwidth is too low. The problem with S3 is that seek time is too slow. Only a cluster file system resolves both issues.

13.2 years ago

Sure; you can set up NFS, Lustre, or whatever shared file system you like on AWS. But the limitation will always be the network. For modern HPC clusters each node typically has two fast network interfaces, one dedicated to the filesystem and one to other network traffic. You do not (yet) have this luxury on AWS. This reminds me of the good old days of HPC, when writing results over NFS mounts was deeply frowned upon (storms, races, you name it). You got round this by scp'ing (or similar) files around before and after each job or, better still, by writing the results back to a central relational database. This latter approach is still heavily used by the Ensembl pipelines; see eHive for a fairly sophisticated implementation. In conclusion, S3, EBS and RDBs can be used to create very robust, high-throughput pipelines without needing shared filesystems.


Yes, I forgot we can set up NFS. As to Ensembl, its pipeline was constructed back in the days when NFS was too slow. With the high-performance network filesystems available nowadays, I see this database-centric method as outdated. I have heard that the system administrators are also complaining about this approach. It is nearly impossible to ask Ensembl to rewrite their pipeline, but this is not something we should learn from, at all. A fast shared file system is more convenient and efficient nowadays.


13.2 years ago
Gww ★ 2.7k

I am assuming you could use Elastic Block Storage to store the alignments independently from the instances.


Actually EBS alone is insufficient. It is not shared.


EBS alone is insufficient as it cannot be mounted to multiple instances.

13.2 years ago
lh3 33k

Core Amazon EC2 Services

  • EC2="Elastic Compute Cloud". EC2 provides instances (or computing nodes). The typical configuration of interest to me is probably the "large instance". Each large instance has 4 64-bit CPUs, 7.5GB memory and 850GB local disk space. The price is $2 per CPU per day, even if the CPU is idle.

  • EBS="Elastic Block Store". EBS functions like an ordinary hard disk and has fast I/O [reference]. However, EBS can only be mounted to one instance. As such, it is not very useful when we want to access the data with more than 8 CPUs. For EBS, we pay $0.1 per allocated GB per month and $0.01 per 100,000 I/O requests. Each I/O request accesses at most 128KB of data [reference]. For sequential read/write, the I/O cost should be minor. One caveat is that EBS is priced for allocated space, not for used space. It is also said to have a higher failure rate than S3. EBS is sort of like a private scratch space. A frequent use case is to create a snapshot of an EBS volume in S3. This unfortunately adds cost on both EBS and S3 requests.

  • S3="Simple Storage Service". S3 allows us to upload and download data with HTTP requests. S3 supports parallel reading and writing. We can use s3fs or other third-party, mostly commercial, software to mount an S3 bucket in EC2. For S3, we pay $0.14 per used GB per month or $0.093 without backup. We are also charged $0.01 per 1,000 write requests or 10,000 read requests. The charge is also request based. We may retrieve huge blocks of data for sequential read/write to minimize the request cost [reference].
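
As a rough worked example of the prices listed above (these are the figures quoted in this post, not current AWS pricing), the sketch below estimates a bill for one month. The function names and the example workload are hypothetical.

```python
import math

# Prices as quoted in the post; treat them as historical, not current.
CPU_PER_DAY = 2.00                   # $ per CPU per day, idle or not
EBS_PER_GB_MONTH = 0.10              # $ per allocated GB per month
EBS_PER_IO_REQUEST = 0.01 / 100_000  # $ per I/O request
EBS_MAX_REQUEST_BYTES = 128 * 1024   # at most 128KB per I/O request
S3_PER_GB_MONTH = 0.14               # $ per used GB per month
S3_PER_WRITE = 0.01 / 1_000          # $ per write request
S3_PER_READ = 0.01 / 10_000          # $ per read request


def ebs_sequential_io_cost(bytes_moved: int) -> float:
    """I/O charge for streaming `bytes_moved` using maximal 128KB requests."""
    return math.ceil(bytes_moved / EBS_MAX_REQUEST_BYTES) * EBS_PER_IO_REQUEST


def monthly_cost(cpus: int, days: float, ebs_gb: float, ebs_bytes_moved: int,
                 s3_gb: float, s3_reads: int, s3_writes: int) -> dict:
    """Break one month's bill into compute, EBS and S3 parts."""
    return {
        "compute": cpus * days * CPU_PER_DAY,
        "ebs": ebs_gb * EBS_PER_GB_MONTH + ebs_sequential_io_cost(ebs_bytes_moved),
        "s3": s3_gb * S3_PER_GB_MONTH + s3_reads * S3_PER_READ + s3_writes * S3_PER_WRITE,
    }


if __name__ == "__main__":
    # Hypothetical workload: 8 CPUs for 3 days, a 500 GB EBS volume read once
    # sequentially, and 200 GB kept in S3 with 100,000 reads and 10,000 writes.
    bill = monthly_cost(8, 3, 500, 500 * 1024**3, 200, 100_000, 10_000)
    for item, dollars in bill.items():
        print(f"{item}: ${dollars:.2f}")
```

Under these assumptions, streaming the whole 500 GiB volume once takes 4,096,000 requests and costs about $0.41, against $50 for the allocation itself, which fits the remark that sequential I/O cost is minor.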

Newbie's comments on EC2

For NGS data processing, we mostly use sequential I/O. The cost of I/O is usually minor. The storage cost also seems to be less than the CPU cost for typical uses. To reduce the cost, we would prefer to fully parallelize each step. Anyway, it seems to me that optimizing a pipeline for EC2 is more complicated than developing for clusters we own. This is not something a novice developer can manage cost-effectively.

As to my own question, for shared data (especially read-only BAMs), mounting an S3 bucket in EC2 seems to be the solution. Although S3 is several times slower than EBS and the tmp disk on the instance, I/O is not the bottleneck for mapping. This is probably also true for SNP/indel calling. EBS and the snapshot functionality are also useful.
