I work in an environment where a variety of computing resources are available, from my desktop to in-building servers and a nationally deployed supercomputer infrastructure. Because of the variety of needs we face when doing data analysis, having more options is often better. This is why we have recently started using EC2 from Amazon Web Services.
In this context, it would be nice to have an update on how Biostar members use the cloud for their computing needs. Please state which services you use, in what context, and for what kind of projects/analyses. If you don't use the cloud, maybe you could also write about your experience and what turned you off.
I think new insights into how and when to use cloud computing could benefit a lot of small or medium-sized labs doing bioinformatics. It would surely help us!
NOTE: You do not have to be a big player to post an answer. Please share your experience!
At Ensembl we use the cloud to speed up our services around the world. This improves download speed for our users in the States and in Asia. Our main servers are in the UK, but we have cloud services on the East and West coasts of the USA and in Singapore. These are provided by Amazon EC2.
We produce a genome browser that integrates genome, gene, variation, regulation and comparative genomic data. We release this in a pretty browser format, but also have a free-to-use Perl API. We have lots of shiny tools for accessing all this data (e.g. the Variant Effect Predictor, BioMart and the REST API).
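As a quick illustration of what accessing this data programmatically can look like (this is just a sketch, assuming the public rest.ensembl.org endpoint; BRCA2 is used purely as a sample gene and the snippet is not taken from the answer above):

```python
# Hedged example: look up a gene via the public Ensembl REST API.
# Endpoint and gene symbol are illustrative only.
import requests

resp = requests.get(
    "https://rest.ensembl.org/lookup/symbol/homo_sapiens/BRCA2",
    headers={"Content-Type": "application/json"},
    timeout=30,
)
resp.raise_for_status()
gene = resp.json()

# Print the stable ID and genomic coordinates returned by the lookup.
print(gene["id"], gene["seq_region_name"], gene["start"], gene["end"])
```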
The reason we use the cloud is that our American and Asian users weren't getting the same performance as our European users. By giving them local mirrors, we improved it for them.
Own-trumpet blowing alert:
we're also mentioned in today's Nature.
We do not use cloud computing at our center (The Genome Institute - WashU Medical School) because we estimate that overall it is not cost effective compared to using our own cluster, as long as we keep said cluster busy (>4,000 CPUs, 15-20 PB of storage, etc.).
Where I have found it very useful is in an educational context. We use the cloud (Amazon AWS) for hands-on tutorials associated with various workshops organized by the Canadian Bioinformatics Workshops series. We obtain access to a series of EC2 cloud instances a few weeks before each workshop starts. We spend this time installing software, timing exercises, and making sure everything works as expected. Then, a few days before the course, we freeze development, decide on the type of instances we will need (memory, number of CPUs, etc.), and create an Amazon Machine Image (AMI). When the students arrive, we spin up one instance for each student and assign it to them for the duration of the workshop. Data that is needed by all students during the exercises is stored in an S3 bucket that is mounted on all instances. This creates a very consistent and predictable environment for all students. We have not had serious problems with up to 40 students hammering the same S3 storage. Since each student has their own instance, they do not compete with each other for CPU cycles. We are able to perform alignments and assembly of NGS data (small to modest amounts) quickly enough to accommodate the flow of an educational setting.
There are many advantages to this approach in this setting. The main downside is that for cost reasons, we can only make the student instances available for the duration of the course.
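For readers curious how a setup like this can be scripted, here is a minimal sketch using boto3: launch one instance per student from a pre-built AMI and mount a shared S3 bucket at boot. The AMI ID, instance type, key pair, security group, IAM role and bucket name are all hypothetical placeholders, and s3fs-fuse is shown as one common way to mount S3; this is not the workshop's actual tooling.

```python
# Minimal sketch (placeholders throughout): one EC2 instance per student,
# each mounting a shared, read-only S3 bucket via s3fs at boot.
import boto3

STUDENTS = ["alice", "bob", "carol"]            # one instance per student
AMI_ID = "ami-0123456789abcdef0"                # frozen workshop image (placeholder)
BUCKET = "workshop-shared-data"                 # shared course data (placeholder)

# Cloud-init user data: mount the shared bucket read-only on first boot.
USER_DATA = f"""#!/bin/bash
mkdir -p /data/shared
s3fs {BUCKET} /data/shared -o iam_role=auto -o ro
"""

ec2 = boto3.resource("ec2", region_name="us-east-1")

for student in STUDENTS:
    instances = ec2.create_instances(
        ImageId=AMI_ID,
        InstanceType="m5.large",                # sized to the timed exercises
        MinCount=1,
        MaxCount=1,
        KeyName="workshop-key",                 # placeholder key pair
        SecurityGroupIds=["sg-0123456789abcdef0"],
        IamInstanceProfile={"Name": "workshop-s3-readonly"},  # grants S3 read access
        UserData=USER_DATA,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "student", "Value": student}],
        }],
    )
    print(student, instances[0].id)
```

At the end of the course the tagged instances can be terminated in one sweep, which is what makes the per-student model affordable for the duration of a workshop.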
Thanks Malachi, this is very interesting information, including a lot of details about your setup! It is very pertinent to mention the education grant program, which people may not be aware of (I was not).
It's been a while since I saw this. We still find the cloud very useful for delivering hands-on bioinformatics workshops. This tutorial post might be useful to this thread: Post not found.
A few weeks ago I used EC2 to run Trinity for a large de novo RNA-seq assembly. I plan to use it more often for many other tasks if I have the resources and money.
For things like routine ChIP-seq, RIP-seq, and RNA-seq analysis, I haven't found a reason to make the jump to the cloud yet.
With only a couple of new experiments a month to worry about, pipelines finish in a handful of days per experiment on a single dedicated machine (8 CPUs, 24 GB RAM). Sure, this could be sped up by the cloud or the cluster we have available, but that would require extra sysadmin-type work and data transfer time, and most importantly, would add data storage costs.
Keeping everything local works well and is efficient to maintain -- I'm willing to trade a day or two of compute time for simplicity and low cost. Most of the bioinformatics effort goes into downstream analysis of these data, which doesn't need that much horsepower (at least in terms of hardware).
I think that if cloud storage were cheaper, I would reconsider.
I've been using Amazon EC2 recently while building nowomics. I use it for data integration and hosting databases so probably have a different experience from those running analysis pipelines.
As I have no other servers it's been a great way to get started on a new project. It took a while to find my way around (I found the documentation way too verbose) but once I got set up and had scripted some basic operations the flexibility is fantastic.
Getting started has been cheap, but the default storage on EBS is slow; there are options to pay more for EBS-optimised instances, provisioned IOPS and SSDs, which I've heard are much better. Getting higher-RAM servers for running databases effectively also gets expensive. If you need to run always-on services, reserved instances are essential, and you can now buy/sell incomplete reserved terms (usually 1 or 3 years) in a marketplace. So if you don't know exactly what you need, you can buy reserved instances for just a few months or sell reserved time you no longer need.
I've found network IO can be patchy, particularly on smaller instance types, and I had to build in provisions for dropped connections and timeouts when downloading files. S3 storage has been great and really simple to build into workflows.
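The kind of wrapper I mean is roughly the sketch below; the URL, retry count and backoff are illustrative, not the actual code behind nowomics.

```python
# Hedged sketch: download a file with retries on dropped connections/timeouts,
# using simple exponential backoff. Parameters are illustrative only.
import time
import requests

def download_with_retries(url, dest, attempts=5, timeout=60):
    """Stream a file to disk, retrying transient network failures."""
    for attempt in range(1, attempts + 1):
        try:
            with requests.get(url, stream=True, timeout=timeout) as r:
                r.raise_for_status()
                with open(dest, "wb") as fh:
                    for chunk in r.iter_content(chunk_size=1 << 20):
                        fh.write(chunk)
            return dest
        except (requests.ConnectionError, requests.Timeout) as exc:
            if attempt == attempts:
                raise                              # give up after the last attempt
            wait = 2 ** attempt                    # simple exponential backoff
            print(f"attempt {attempt} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)

# Example (hypothetical URL): fetch a reference file onto local instance storage.
# download_with_retries("https://example.org/data/annotation.gtf.gz",
#                       "/data/annotation.gtf.gz")
```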
If I were at an institution with good infrastructure and some existing hardware I would see no compelling reason to make the jump for data integration and hosting databases. However, if you're starting something new where requirements will change over time I think EC2 is a great option.
Hi Emily. Maybe you could add some info in your answer about what your group does, for example what services you provide.
Apparently, Emily is working for Ensembl. It seems that Ensembl uses EC2 primarily for data sharing.
Hi, sorry, I thought my name gave that away (I figured it was best to be completely unsubtle and express my vested interest in my name). I'll edit my answer to explain a bit more about Ensembl.
Thank you. It's what I gathered from your name too, but often explicit is better than implicit :)