Hello, relatively inexperienced bioinformatician here, tasked with setting up an RNA-seq pipeline (tailored towards differential expression (DEG) and pathway enrichment analysis for perturbed vs. unperturbed cancer samples). I am very new to cloud computing and have been looking at both Google Cloud and AWS. There are a lot of resources and it is a bit overwhelming, so I was wondering if anybody had some insight into the most efficient approach. My main issue is that, since the alignment step is usually run on the command line, I can't really have the entire pipeline in one script that I could simply run as a Jupyter notebook on AWS SageMaker or Google Datalab. And then there is the question of where to keep the GRCh38 reference sequence and GTF file, and how to upload new sequences for analysis every time. Is there a commonly accepted best practice for something like this? Any advice would help. Thank you.
Thank you for the info! So you would recommend keeping something like the reference genome stored locally and just uploading it for every analysis, rather than storing it in an S3 bucket? Also, I don't see many recommendations about Galaxy, which I have been able to get running on an EC2 instance. It seems to allow a very customizable workflow, so I'm surprised it isn't mentioned more. Is it not used very widely anymore?
About storing the reference genome locally vs. in a bucket, I would say locally.
If you store it in a bucket:
* Reading will happen over the network, which is likely slower than reading from a local disk. This is the biggest issue, I would say.
* You may have to pay access fees every time you read your reference. This usually depends on which regions your instances and your bucket are in. It should be pretty cheap, but still more than reading from a local disk.
* You have to pay for the bucket storage (shouldn't be too much for just one reference).
If you store it locally (to the instance) on a disk:
* Reading will be much faster.
* You have to pay for the disk (which you probably would have attached to your instance anyways, and a reference shouldn't take up too much of it).
* If you have multiple instances, you would need some way to make the reference available to all machines (e.g. they all have it locally, or they all copy it from the bucket at boot time and then process from local storage).
The way I usually do it: I keep my references in a bucket, but copy them to a local disk for the actual processing, for better performance. Also, anecdotally, I've experienced performance issues on Google Cloud when reading the same file from a bucket with multiple threads/processes, to the point where it wasn't really usable.
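For what it's worth, that boot-time copy pattern can be sketched as a small shell function. The bucket and file names below are placeholders, and this assumes the AWS CLI is installed and configured on the instance:

```shell
#!/usr/bin/env bash
# sync_ref: download a reference file from an S3 bucket into a local
# cache directory, skipping the download if it is already present.
# Bucket and file names are placeholders -- substitute your own.
sync_ref() {
    local bucket="$1" ref="$2" dest="$3"
    mkdir -p "$dest"
    if [ ! -f "$dest/$ref" ]; then
        aws s3 cp "s3://$bucket/$ref" "$dest/$ref"
    fi
}

# Example: fetch the genome FASTA once per instance boot, then point
# your aligner/quantifier at the local copy.
# sync_ref my-refs-bucket GRCh38.primary_assembly.genome.fa /data/refs
```

You can drop something like this into the instance's startup/user-data script so every machine hydrates its local copy from the bucket automatically.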
thank you for the recommendation!
Galaxy is definitely still used, and a valid option if you want to go that route. I don't personally use it, but it is convenient when you don't have/don't want to set up an environment yourself. It allows you to plug 'n play a lot of tools, and (I think) has recipes for lots of common workflows. If you don't have anything crazy in mind, it could be a good option.
In your case, since you're just doing RNA-seq, I don't think you even really need a ton of computing power, given the trend towards pseudo-alignment methods (e.g. Salmon, kallisto) for quantification, followed by differential expression analysis with your favorite R package (DESeq2, limma, etc.). (Note that htseq-count is alignment-based counting, so it doesn't give you the same speedup.) This process is much quicker than full alignment-based methods and can usually be done locally on a decent computer.
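As a rough sketch of that kind of per-sample quantification step (the index path, sample names, and flags here are illustrative, not a tested pipeline; check the Salmon docs for your version):

```shell
#!/usr/bin/env bash
# Sketch: quantify one paired-end sample with Salmon against a
# prebuilt transcriptome index, writing results to quants/<sample>.
# Index path and file names are placeholders.
quant_sample() {
    local sample="$1" r1="$2" r2="$3"
    salmon quant -i salmon_index -l A \
        -1 "$r1" -2 "$r2" \
        -o "quants/$sample"
}

# Example per-sample loop; the resulting quant.sf files then go to
# tximport + DESeq2 (or limma) in R for the differential expression step.
# for s in sample1 sample2; do
#     quant_sample "$s" "${s}_R1.fq.gz" "${s}_R2.fq.gz"
# done
```

The nice part is that this whole loop runs fine on a single modest instance (or a laptop), since pseudo-alignment skips the expensive genome alignment step.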
My company wants the pipeline set up on the cloud so that others can access and use it. Luckily, I believe I've found a solution that satisfies everything: Galaxy can be run on AWS, so I can simply create a workflow using the aforementioned tools (definitely with a pseudoaligner) and run it as a Galaxy instance on AWS. Thanks!