Hi everyone,
Our goal is to sequence about 100-125 human whole-genome samples on the Illumina HiSeq 2500 platform (paired-end sequencing) in order to analyze genetic variation in these genomes and correlate the variants with disease. We are aiming for approximately 30X coverage so that we can detect SNPs, indels, etc. I am new to the NGS field and have no experience handling and analyzing human WGS data. My question is: how much computational power/setup will be sufficient for this kind of work, covering quality control, data processing, mapping/alignment, post-alignment processing, variant calling and downstream analysis? This is the computational setup we currently have:
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
master lx24-amd64 8 0.01 47.2G 6.1G 96.0G 280.0K
node1 lx24-amd64 8 0.00 47.2G 123.6M 0.0 0.0
node10 lx24-amd64 16 0.01 47.2G 6.6G 0.0 0.0
node11 lx24-amd64 8 0.01 47.2G 2.8G 0.0 0.0
node12 lx24-amd64 8 0.00 47.2G 6.5G 0.0 0.0
node13 lx24-amd64 8 0.00 47.2G 6.5G 0.0 0.0
node14 lx24-amd64 8 0.00 47.2G 2.8G 0.0 0.0
node15 lx24-amd64 8 0.00 47.2G 120.0M 0.0 0.0
node2 lx24-amd64 8 0.00 47.2G 123.0M 0.0 0.0
node3 lx24-amd64 8 0.00 47.2G 122.0M 0.0 0.0
node4 lx24-amd64 8 0.01 47.2G 121.6M 0.0 0.0
node5 lx24-amd64 8 0.00 47.2G 121.4M 0.0 0.0
node6 lx24-amd64 8 0.00 47.2G 120.9M 0.0 0.0
node7 lx24-amd64 8 0.00 47.2G 120.7M 0.0 0.0
node8 lx24-amd64 8 0.01 47.2G 120.8M 0.0 0.0
node9 lx24-amd64 8 0.01 47.2G 121.0M 0.0 0.0
There are 16 nodes, each with at least 8 CPU slots [Intel(R) Xeon(R) X5550 @ 2.67GHz, 4 cores per CPU] and 47.2G of RAM, and the cluster is managed with SGE. It would be great if you could share your knowledge and expert comments with us; it would help us a lot in developing our pipeline in a systematic way.
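For a sense of scale, here is my rough back-of-envelope estimate of the raw data volume we would be handling (the ~3.1 Gbp genome size and the bytes-per-base figures are only ballpark assumptions on my part):

# Rough data-volume estimate for ~125 WGS samples at 30X coverage.
awk 'BEGIN {
    samples    = 125
    genome_gb  = 3.1            # human genome size in Gbases (assumption)
    coverage   = 30
    per_sample = genome_gb * coverage                 # ~93 Gbases per sample
    printf "Raw bases per sample : ~%.0f Gbases\n", per_sample
    printf "Raw bases in total   : ~%.1f Tbases\n", samples * per_sample / 1000
    # Very rough storage guesses: ~0.5 byte/base for gzipped FASTQ,
    # ~1 byte/base for a sorted BAM (assumptions, not measurements).
    printf "Gzipped FASTQ, total : ~%.0f TB\n", samples * per_sample * 0.5 / 1000
    printf "Sorted BAMs, total   : ~%.0f TB\n", samples * per_sample * 1.0 / 1000
}'

So, under these assumptions, besides CPU we would also need roughly 15-20 TB of storage for FASTQ plus sorted BAM files, before counting any intermediate files.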
Thank you very much,
Regards
Ravi
I totally agree with this. For a compute server, a weekend is an eternity :)
125 FASTQ datasets at 30x depth (or ~200 million 50bp reads) for the human genome would take less than a day to pile up with 128 cores. In fact, since your core count is so close to your number of input files, I'd forget about parallelising individual jobs and just run 8 jobs per node. It will be faster than dealing with cluster issues/overhead, in both setup and execution :)
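To illustrate that embarrassingly-parallel approach, here is a minimal SGE array-job sketch: one task per sample, each single-threaded, so the scheduler simply packs ~8 tasks onto each 8-slot node. The file names, paths and the choice of bwa/samtools are my assumptions, not something from the original post:

#!/bin/bash
# map_one_sample.sh -- one single-threaded mapping job per sample (hypothetical paths).
# Submit with:  qsub -t 1-125 map_one_sample.sh
#$ -S /bin/bash
#$ -cwd
#$ -N wgs_map
#$ -j y

# samples.txt is an assumed file with one sample ID per line.
SAMPLE=$(sed -n "${SGE_TASK_ID}p" samples.txt)

REF=ref/human_g1k_v37.fasta              # assumed reference location
FQ1=fastq/${SAMPLE}_R1.fastq.gz          # assumed naming convention
FQ2=fastq/${SAMPLE}_R2.fastq.gz

# bwa mem defaults to a single thread; with 1 slot per task, SGE will run
# ~8 of these tasks per 8-core node, i.e. the "8 jobs per node" idea above.
bwa mem -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}" "$REF" "$FQ1" "$FQ2" \
    | samtools sort -o bam/${SAMPLE}.sorted.bam -
samtools index bam/${SAMPLE}.sorted.bam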
For mapping, at least, one job per node with 8 threads will be much more memory-efficient, and thus probably more efficient overall (less cache thrashing, fewer TLB misses, etc.), than 8 single-threaded processes running concurrently.
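Here is a sketch of that threaded variant, assuming the cluster has an SMP-style parallel environment (the PE name "smp" is an assumption; `qconf -spl` lists what your site actually provides). Only the resource request and the bwa/samtools thread counts change relative to the single-threaded script above:

#!/bin/bash
# map_one_sample_threaded.sh -- one 8-thread mapping job per node (hypothetical names).
# Submit with:  qsub -t 1-125 map_one_sample_threaded.sh
#$ -S /bin/bash
#$ -cwd
#$ -N wgs_map_smp
#$ -j y
#$ -pe smp 8        # grab all 8 slots on a node (PE name is site-specific)

SAMPLE=$(sed -n "${SGE_TASK_ID}p" samples.txt)
REF=ref/human_g1k_v37.fasta

# A single bwa process uses all 8 cores but keeps only one copy of the
# reference index in memory -- the memory-efficiency argument made above.
bwa mem -t 8 -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}" \
    "$REF" fastq/${SAMPLE}_R1.fastq.gz fastq/${SAMPLE}_R2.fastq.gz \
    | samtools sort -@ 4 -m 2G -o bam/${SAMPLE}.sorted.bam -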
Good point - memory could be a bottleneck with 'only' 47.2G per node, and the overhead from threading would be minimal compared to hitting disk swap :) I hadn't thought of that.
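If swapping is the worry, one option is to state the memory requirement explicitly in the job script above. Whether `h_vmem` is treated as a per-slot consumable (and therefore multiplied by the slot count) depends on how the site has configured it, so treat this as an assumption and check `qconf -sc` first:

# Lines to add to the 8-slot job script above (site-specific behaviour):
#$ -pe smp 8
#$ -l h_vmem=5G     # ~5G per slot; 8 slots => ~40G cap on a 47.2G node

# Also keep samtools sort's per-thread buffer modest so the sort step
# does not push the node into swap, e.g.  samtools sort -@ 4 -m 2G ...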