Question

Amazon AWS setup for bioinformatic analysis, particularly scRNAseq

5.3 years ago

akh22 ▴ 120

I am trying to set up Amazon AWS EC2 instances with RStudio and Linux AMIs for scRNAseq analysis, and I am curious to see what sort of setup is sufficient and cost-effective for bioinformatic use.

For RStudio, I tried the RStudio AMI (developed by Louis Aslett) and mainly run Seurat (integration), SingleR (LTLA version), Slingshot, and other pseudotime tools. I also run 10X Cell Ranger, velocyto, Scanpy, and other Python packages on an EC2 Linux AMI. All the data are stored in my S3 bucket; I move the data from S3 to EBS and move it back after the analysis is done.

I tried an m5d.12xlarge instance, which comes with 48 vCPUs, 192 GiB of memory, and 2 x 900 GB of instance storage, and I think I was getting killed by the storage cost.

Anyway, I'd really appreciate it if anybody who uses AWS EC2 could comment on the ideal EC2 setup for bioinformatic analysis.

AWS Rstudio scRNAseq • 3.7k views

updated 5.3 years ago by Kevin Blighe 89k • written 5.3 years ago by akh22 ▴ 120

I would try to figure out the needs for each of your applications, since they are very different. If you are concerned about costs, you should optimize each step.
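One rough way to figure out the memory needs of a single step is to run it once on a small test sample and record its peak memory. The sketch below is only an illustration of that idea: `peak_memory_mib` is a hypothetical helper name, and it relies on Python's standard `resource` module, which is available on Linux and macOS but not on Windows.

```python
import resource
import subprocess
import sys


def peak_memory_mib(command):
    """Run a command and return (exit code, peak resident memory in MiB).

    The peak is the largest value seen among all child processes this
    Python process has waited for so far, so measure one step per run.
    """
    completed = subprocess.run(command, check=False)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return completed.returncode, usage.ru_maxrss / divisor
```

For example, `peak_memory_mib(["Rscript", "integrate.R"])` would report the peak for one R step, where `integrate.R` stands in for whatever script you actually run. Memory use on a small sample will not scale linearly to the full dataset, so treat the number as a lower bound when choosing an instance type.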

Also, if cost is a concern, maybe Amazon is not the best option. The advantage of Amazon is the ability to quickly increase or decrease your computational capacity. You are paying a premium for that.

5.3 years ago by igor 13k

score 0 · Answer 1 · 2020-04-25

Storage and data transfer costs are what usually end up costing a lot on Amazon - indeed.

Single-cell datasets are not getting any smaller, let's put it that way. The mass cytometry folk (including me) are already accustomed to screening millions of single cells, but only across a few dozen markers so far. At a push, mass cytometry panels may be able to get up to 50+ markers, if they have not already. However, the single-cell RNA-seq world is now pushing toward those numbers of cells, but across the entire transcriptome.

I am not sure of the cost effectiveness of your plan, unless you are going to be processing these datasets regularly for clients who will be paying you for all costs. To give you an idea, I have recently been processing one particular dataset of 29 'large' samples, and the integration step takes >500GB RAM. The total size of the dataset is ~84GB (compressed). With these types of numbers, Amazon will crucify you with costing.
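To put rough numbers on this before committing, a back-of-the-envelope estimate that separates compute, storage, and data transfer can help. The sketch below is just arithmetic under stated assumptions: `Rates` and `estimate_monthly_cost` are hypothetical names, and every price passed in is a placeholder that you would need to replace with the current rates for your region and instance type.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Unit prices in your currency. Placeholders only, not real AWS prices."""
    instance_per_hour: float
    storage_per_gb_month: float
    transfer_out_per_gb: float


def estimate_monthly_cost(rates, compute_hours, storage_gb, transfer_out_gb):
    """Return a per-category breakdown of one month's estimated cost."""
    compute = rates.instance_per_hour * compute_hours
    storage = rates.storage_per_gb_month * storage_gb
    transfer = rates.transfer_out_per_gb * transfer_out_gb
    return {
        "compute": compute,
        "storage": storage,
        "transfer": transfer,
        "total": compute + storage + transfer,
    }
```

For instance, `estimate_monthly_cost(Rates(2.0, 0.5, 0.25), compute_hours=10, storage_gb=100, transfer_out_gb=20)` returns a total of 75.0 with those made-up rates. Keeping the categories separate makes it easier to see whether storage that sits idle between analyses, rather than the instance itself, is what drives the bill.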

Kevin