I overheard a conversation about moving an exome SNP calling pipeline to Amazon's EC2. If I understood the part I heard correctly, EC2 does not provide a fast and cheap shared file system. If we want to call SNPs region by region, the preferred way is actually to first transfer (with scp or something similar?) all the alignments to the local disk of each virtual machine (VM). Is this true? Can we get a fast shared file system from EC2? Do we need to pay more for this feature? I have essentially zero experience with EC2.
Of course, for this exome SNP calling issue, we could restructure the pipeline to make it much friendlier to a cluster without a shared file system. Nonetheless, that means quite a lot of work for the developers who were discussing this. In general, lacking a fast shared file system makes development more difficult.
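As a rough illustration of what restructuring for a cluster without a shared file system could look like, here is a minimal Python sketch that splits target regions across VMs so that each VM only needs the alignments overlapping its own regions. Everything in it is an assumption for illustration: the function name `assign_regions`, the `(chrom, start, end)` tuples, and the worker count are hypothetical, and the actual slicing and copying of alignments to each VM is left out.

```python
from collections import defaultdict


def assign_regions(regions, n_workers):
    """Greedily balance regions across workers by total region length.

    regions: iterable of (chrom, start, end) tuples (hypothetical format).
    Returns a dict mapping worker index -> list of regions for that worker.
    """
    loads = [0] * n_workers          # total bases assigned to each worker
    plan = defaultdict(list)
    # Place the largest regions first, each onto the least loaded worker.
    for chrom, start, end in sorted(regions, key=lambda r: r[2] - r[1], reverse=True):
        worker = min(range(n_workers), key=loads.__getitem__)
        plan[worker].append((chrom, start, end))
        loads[worker] += end - start
    return dict(plan)


# Hypothetical example regions; real ones would come from the exome target list.
regions = [
    ("chr1", 0, 5000),
    ("chr1", 5000, 8000),
    ("chr2", 0, 4000),
    ("chr2", 4000, 6000),
]
plan = assign_regions(regions, 2)
```

Each entry in `plan` is the list of regions one VM would be responsible for, so only the matching slice of the alignments has to be transferred to that VM's disk rather than the full set.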
Yes, that's right. You need to set up NFS, or some other mechanism, yourself to create a shared filesystem on top of EC2.
The local disk is most often EBS-backed as well, so it really just depends on how you prefer to organize your data across volumes. Your choices are either EBS or instance store; I've always heard EBS recommended for disk-based storage, so I have really only used EBS. Here's a Stack Overflow discussion: http://stackoverflow.com/questions/3630506/benefits-of-ebs-vs-instance-store-and-vice-versa
On Amazon, an EBS volume can be mounted on only one instance at a time. Is that right?
I see. You mean we can set up NFS/SGE inside EC2? Can we use the "local" disk of each instance instead of using EBS?
Hmm... I forgot that we can set up NFS in EC2. How about using the local disk of each instance instead of EBS?