Hello, relatively inexperienced bioinformatician here, tasked with setting up an RNA-seq pipeline (tailored towards differential expression (DEG) and pathway enrichment analysis for perturbed vs. unperturbed cancer samples). I am very new to cloud computing and have been looking at both Google Cloud and AWS. There are a lot of resources and it is a bit overwhelming, so I was wondering if anybody had some insight into the most efficient approach. My main issue is that, since the alignment step is usually run on the command line, I can't really have the entire pipeline in one script that I could simply run as a Jupyter notebook on AWS SageMaker or Google Datalab. And then there is the question of where to keep the GRCh38 reference sequence and GTF file, and how to upload new sequences for analysis every time. Is there a commonly accepted best practice for something like this? Any advice would help. Thank you.
Thank you for the info! So you would recommend keeping something like the reference genome stored locally and just uploading it for every analysis, rather than storing it in an S3 bucket? Also, I don't see many recommendations about Galaxy, which I have been able to get running on an EC2 instance. It seems to allow a very customizable workflow, so I'm surprised it isn't mentioned more. Is it not used very widely anymore?
About storing the reference genome locally vs. in a bucket, I would say locally.
If you store it in a bucket:
* Reading will happen over the network, which is likely slower than reading from a local disk. This is the biggest issue, I would say.
* You may have to pay access fees every time you read your reference. This usually depends on which regions your instances and your bucket are in. It should be pretty cheap, but still more than reading from a local disk.
* You have to pay for the bucket storage (shouldn't be too much for just one reference).
If you store it locally (to the instance) on a disk:
* Reading will be much faster.
* You have to pay for the disk (which you probably would have attached to your instance anyways, and a reference shouldn't take up too much of it).
* If you have multiple instances, you would need some way to make the reference available to all machines (e.g. they all have it locally, or they all copy it from the bucket at boot time and then process from local storage).
The way I usually do it: I keep my references in a bucket, but copy them to a local disk for the actual processing, for better performance. Also, anecdotally, I've experienced performance issues on Google Cloud when reading the same file from a bucket with multiple threads/processes, to the point where it wasn't really usable.
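For what it's worth, that boot-time copy pattern can be sketched as a small shell function. The bucket and file names below are placeholders, and this assumes the AWS CLI is installed and configured on the instance:

```shell
#!/usr/bin/env bash
# sync_ref: download a reference file from an S3 bucket into a local
# cache directory, skipping the download if it is already present.
# Bucket and file names are placeholders -- substitute your own.
sync_ref() {
    local bucket="$1" ref="$2" dest="$3"
    mkdir -p "$dest"
    if [ ! -f "$dest/$ref" ]; then
        aws s3 cp "s3://$bucket/$ref" "$dest/$ref"
    fi
}

# Example: fetch the genome FASTA once per instance boot, then point
# your aligner/quantifier at the local copy.
# sync_ref my-refs-bucket GRCh38.primary_assembly.genome.fa /data/refs
```

You can drop something like this into the instance's startup/user-data script so every machine hydrates its local copy from the bucket automatically.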
thank you for the recommendation!
Galaxy is definitely still used, and a valid option if you want to go that route. I don't personally use it, but it is convenient when you don't have/don't want to set up an environment yourself. It allows you to plug 'n play a lot of tools, and (I think) has recipes for lots of common workflows. If you don't have anything crazy in mind, it could be a good option.
In your case, since you're just doing RNA-seq, I don't think you even really need a ton of computing power, given the trend towards pseudo-alignment methods (e.g. Salmon, kallisto) for quantification, followed by differential expression analysis with your favorite R package (DESeq2, limma, etc.). (Note that htseq-count is alignment-based counting, so it doesn't give you the same speedup.) This process is much quicker than full alignment-based methods and can usually be done locally on a decent computer.
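As a rough sketch of that kind of per-sample quantification step (the index path, sample names, and flags here are illustrative, not a tested pipeline; check the Salmon docs for your version):

```shell
#!/usr/bin/env bash
# Sketch: quantify one paired-end sample with Salmon against a
# prebuilt transcriptome index, writing results to quants/<sample>.
# Index path and file names are placeholders.
quant_sample() {
    local sample="$1" r1="$2" r2="$3"
    salmon quant -i salmon_index -l A \
        -1 "$r1" -2 "$r2" \
        -o "quants/$sample"
}

# Example per-sample loop; the resulting quant.sf files then go to
# tximport + DESeq2 (or limma) in R for the differential expression step.
# for s in sample1 sample2; do
#     quant_sample "$s" "${s}_R1.fq.gz" "${s}_R2.fq.gz"
# done
```

The nice part is that this whole loop runs fine on a single modest instance (or a laptop), since pseudo-alignment skips the expensive genome alignment step.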
My company wants the pipeline set up on the cloud so that others can access and use it. Luckily, I believe I've found a solution that satisfies everything: Galaxy can be run on AWS, so I can simply create a workflow using the aforementioned tools (definitely with a pseudoaligner) and run it as a Galaxy instance on AWS. Thanks!