I work in an environment where a variety of computing resources are available, from my desktop to in-building servers and a nationally deployed supercomputer infrastructure. Because of the variety of needs we face when doing data analysis, having more options is often better. This is why we have recently started using EC2 from Amazon Web Services.
In this context, it would be nice to have an update on how Biostar members use the cloud for their computing needs. Please state which services you use, in what context, and for what kind of projects/analyses. If you don't use the cloud, maybe you could also write about your experience and what turned you off.
I think new insights into how and when to use cloud computing could benefit a lot of small or medium-sized labs doing bioinformatics. It would surely help us!
NOTE: You do not have to be a big player to post an answer. Please share your experience!
At Ensembl we use the cloud to speed up our services around the world. This improves download speed for our users in the States and in Asia. Our main servers are in the UK, but we have cloud services on the East and West coasts of the USA and in Singapore. These are provided by Amazon EC2.
We produce a genome browser that integrates genome, gene, variation, regulation and comparative genomic data. We release this in a pretty browser format, but also have a free-to-use Perl API. We have lots of shiny tools for accessing all this data (e.g. the Variant Effect Predictor, BioMart and the REST API).
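As a quick illustration of what accessing this data programmatically can look like (this is just a sketch, assuming the public rest.ensembl.org endpoint; BRCA2 is used purely as a sample gene and the snippet is not taken from the answer above):

```python
# Hedged example: look up a gene via the public Ensembl REST API.
# Endpoint and gene symbol are illustrative only.
import requests

resp = requests.get(
    "https://rest.ensembl.org/lookup/symbol/homo_sapiens/BRCA2",
    headers={"Content-Type": "application/json"},
    timeout=30,
)
resp.raise_for_status()
gene = resp.json()

# Print the stable ID and genomic coordinates returned by the lookup.
print(gene["id"], gene["seq_region_name"], gene["start"], gene["end"])
```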
The reason we use the cloud is that our American and Asian users weren't getting the same performance as our European users. By giving them local mirrors, we improved it for them.
Own-trumpet blowing alert:
we're also mentioned in today's Nature.
We do not use cloud computing at our center (The Genome Institute - WashU Medical School) because we estimate that overall it is not cost effective compared to using our own cluster, as long as we keep said cluster busy (>4,000 CPUs, 15-20 PB of storage, etc.).
Where I have found it very useful is in an educational context. We use the cloud (Amazon AWS) for hands-on tutorials associated with various workshops organized by the Canadian Bioinformatics Workshops series. We obtain access to a series of EC2 cloud instances a few weeks before each workshop starts. We spend this time installing software, timing exercises, and making sure everything works as expected. Then, a few days before the course, we freeze development, decide on the type of instances we will need (memory, number of CPUs, etc.), and create an Amazon Machine Image (AMI). When the students arrive, we spin up one instance for each student and assign it to them for the duration of the workshop. Data that is needed by all students during the exercises is stored in an S3 bucket that is mounted on all instances. This creates a very consistent and predictable environment for all students. We have not had serious problems with up to 40 students hammering the same S3 storage. Since each student has their own instance, they do not compete with each other for CPU cycles. We are able to perform alignments and assembly of NGS data (small to modest amounts) quickly enough to accommodate the flow of an educational setting.
There are many advantages to this approach in this setting. The main downside is that for cost reasons, we can only make the student instances available for the duration of the course.
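For readers curious how a setup like this can be scripted, here is a minimal sketch using boto3: launch one instance per student from a pre-built AMI and mount a shared S3 bucket at boot. The AMI ID, instance type, key pair, security group, IAM role and bucket name are all hypothetical placeholders, and s3fs-fuse is shown as one common way to mount S3; this is not the workshop's actual tooling.

```python
# Minimal sketch (placeholders throughout): one EC2 instance per student,
# each mounting a shared, read-only S3 bucket via s3fs at boot.
import boto3

STUDENTS = ["alice", "bob", "carol"]            # one instance per student
AMI_ID = "ami-0123456789abcdef0"                # frozen workshop image (placeholder)
BUCKET = "workshop-shared-data"                 # shared course data (placeholder)

# Cloud-init user data: mount the shared bucket read-only on first boot.
USER_DATA = f"""#!/bin/bash
mkdir -p /data/shared
s3fs {BUCKET} /data/shared -o iam_role=auto -o ro
"""

ec2 = boto3.resource("ec2", region_name="us-east-1")

for student in STUDENTS:
    instances = ec2.create_instances(
        ImageId=AMI_ID,
        InstanceType="m5.large",                # sized to the timed exercises
        MinCount=1,
        MaxCount=1,
        KeyName="workshop-key",                 # placeholder key pair
        SecurityGroupIds=["sg-0123456789abcdef0"],
        IamInstanceProfile={"Name": "workshop-s3-readonly"},  # grants S3 read access
        UserData=USER_DATA,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "student", "Value": student}],
        }],
    )
    print(student, instances[0].id)
```

At the end of the course the tagged instances can be terminated in one sweep, which is what makes the per-student model affordable for the duration of a workshop.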
Thanks Malachi, this is very interesting information, including a lot of details about your setup! It is very pertinent to mention the education grant program, which people may not be aware of (I was not).
It's been a while since I saw this. We still find the cloud very useful for delivering hands-on bioinformatics workshops. This tutorial post might be useful to this thread: Post not found.
A few weeks ago I used EC2 to run Trinity for a large de novo RNA-seq assembly. I plan to use it more often for many other tasks if I have the resources and money.
For things like routine ChIP-seq, RIP-seq, and RNA-seq analysis, I haven't found a reason to make the jump to the cloud yet.
With only a couple of new experiments a month to worry about, pipelines finish in a handful of days per experiment on a single dedicated machine (8 CPUs, 24 GB RAM). Sure, this could be sped up by the cloud or the cluster we have available, but that would require extra sysadmin-type work and data transfer time, and most importantly, would add data storage costs.
Keeping everything local works well and is efficient to maintain -- I'm willing to trade a day or two of compute time for simplicity and low cost. Most of the bioinformatics effort goes into downstream analysis of these data, which doesn't need that much horsepower (at least in terms of hardware).
I think that if cloud storage were cheaper, I would reconsider.
I've been using Amazon EC2 recently while building nowomics. I use it for data integration and hosting databases so probably have a different experience from those running analysis pipelines.
As I have no other servers it's been a great way to get started on a new project. It took a while to find my way around (I found the documentation way too verbose) but once I got set up and had scripted some basic operations the flexibility is fantastic.
Getting started has been cheap, but the default storage on EBS is slow; there are options to pay more for EBS-optimised instances, provisioned IOPS and SSDs, which I've heard are much better. Getting higher-RAM servers for running databases effectively also gets expensive. If you need to run always-on services, reserved instances are essential, and you can now buy/sell incomplete reserved terms (usually 1 or 3 years) in a marketplace. So if you don't know exactly what you need, you can buy reserved instances for just a few months or sell reserved time you no longer need.
I've found network IO can be patchy, particularly on smaller instance types, and I had to build in provisions for dropped connections and timeouts when downloading files. S3 storage has been great and really simple to build into workflows.
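The kind of wrapper I mean is roughly the sketch below; the URL, retry count and backoff are illustrative, not the actual code behind nowomics.

```python
# Hedged sketch: download a file with retries on dropped connections/timeouts,
# using simple exponential backoff. Parameters are illustrative only.
import time
import requests

def download_with_retries(url, dest, attempts=5, timeout=60):
    """Stream a file to disk, retrying transient network failures."""
    for attempt in range(1, attempts + 1):
        try:
            with requests.get(url, stream=True, timeout=timeout) as r:
                r.raise_for_status()
                with open(dest, "wb") as fh:
                    for chunk in r.iter_content(chunk_size=1 << 20):
                        fh.write(chunk)
            return dest
        except (requests.ConnectionError, requests.Timeout) as exc:
            if attempt == attempts:
                raise                              # give up after the last attempt
            wait = 2 ** attempt                    # simple exponential backoff
            print(f"attempt {attempt} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)

# Example (hypothetical URL): fetch a reference file onto local instance storage.
# download_with_retries("https://example.org/data/annotation.gtf.gz",
#                       "/data/annotation.gtf.gz")
```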
If I were at an institution with good infrastructure and some existing hardware I would see no compelling reason to make the jump for data integration and hosting databases. However, if you're starting something new where requirements will change over time I think EC2 is a great option.
Hi Emily. Maybe you could add some info in your answer about what your group does, for example what services you provide.
Apparently, Emily is working for Ensembl. It seems that Ensembl uses EC2 primarily for data sharing.
Hi, sorry, I thought my name gave that away (I figured it was best to be completely unsubtle and express my vested interest in my name). I'll edit my answer to explain a bit more about Ensembl.
Thank you. It's what I gathered from your name too, but often explicit is better than implicit :)