Question

Computational resources for WGS variant calling

3

Entering edit mode

5.1 years ago

alesssia ▴ 580

Dear all,

we have WGS data for about 2000 individuals (30x, ~100G per file). We would like to align them using bwakit, and then perform the variant calling using GATK haplotype caller, something I have never done before at this scale (and with such large files)

We have limited computational resources, and we will be applying for an external OpenStack cluster (something I am not familiar with), for which I need to prepare a list of computational requirements, and I would like to gather some suggestions from someone more expert than me.

In your opinion, how much memory would I need for each sample? And how long will it take?

I have been told that each node in the OpenStack cluster is 48 cores, with 512GB of RAM (therefore it would be 24 x 256, 12 x 128 etc.), with a local disk of 50GB and local storage mounted via NFS.

Thank you very much in advance, any suggestion will be highly appreciated!

WGS bwakit GATK variant calling OpenStack • 1.4k views

ADD COMMENT • link updated 4.9 years ago by Jeremy Leipzig 22k • written 5.1 years ago by alesssia ▴ 580

score 3 · Answer 1 · 2020-01-14

you have enough memory (GATK 3.X might take 64GB for some steps). In theory, you might finish this in 12 days or so.

The local storage for scratch space might be an issue, maybe 50GB is a typo? My phone has more than that.

Perhaps, more importantly, is that you bring an experienced bioinformatics engineer on board to design this pipeline and the proper handling of sequencing and technical metadata. Otherwise, debugging the pipeline and handling subsequent runs will take far more time than a computer could ever waste.