Hi everyone,
Our goal is to sequence about 100-125 human whole-genome samples on the Illumina HiSeq 2500 platform (paired-end sequencing) in order to analyze genetic variation in these genomes and correlate the variants with disease. We are aiming for approximately 30X coverage so that we can detect SNPs, indels, etc. I am new to the NGS field and have no experience handling and analyzing human WGS data. My question is: how much computational power/setup will be sufficient for this kind of work, covering quality control, data processing, mapping/alignment, post-alignment processing, variant calling and downstream analysis? This is the computational setup we currently have:
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
master lx24-amd64 8 0.01 47.2G 6.1G 96.0G 280.0K
node1 lx24-amd64 8 0.00 47.2G 123.6M 0.0 0.0
node10 lx24-amd64 16 0.01 47.2G 6.6G 0.0 0.0
node11 lx24-amd64 8 0.01 47.2G 2.8G 0.0 0.0
node12 lx24-amd64 8 0.00 47.2G 6.5G 0.0 0.0
node13 lx24-amd64 8 0.00 47.2G 6.5G 0.0 0.0
node14 lx24-amd64 8 0.00 47.2G 2.8G 0.0 0.0
node15 lx24-amd64 8 0.00 47.2G 120.0M 0.0 0.0
node2 lx24-amd64 8 0.00 47.2G 123.0M 0.0 0.0
node3 lx24-amd64 8 0.00 47.2G 122.0M 0.0 0.0
node4 lx24-amd64 8 0.01 47.2G 121.6M 0.0 0.0
node5 lx24-amd64 8 0.00 47.2G 121.4M 0.0 0.0
node6 lx24-amd64 8 0.00 47.2G 120.9M 0.0 0.0
node7 lx24-amd64 8 0.00 47.2G 120.7M 0.0 0.0
node8 lx24-amd64 8 0.01 47.2G 120.8M 0.0 0.0
node9 lx24-amd64 8 0.01 47.2G 121.0M 0.0 0.0
There are 16 nodes, each with at least 8 CPU slots [Intel(R) Xeon(R) X5550 @ 2.67GHz, 4 cores per CPU] and 47.2G of RAM, and the cluster is managed with SGE. It would be great if you could share your knowledge and expert comments with us; it would help us a lot in developing our pipeline in a systematic way.
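For a sense of scale, here is my rough back-of-envelope estimate of the raw data volume we would be handling (the ~3.1 Gbp genome size and the bytes-per-base figures are only ballpark assumptions on my part):

# Rough data-volume estimate for ~125 WGS samples at 30X coverage.
awk 'BEGIN {
    samples    = 125
    genome_gb  = 3.1            # human genome size in Gbases (assumption)
    coverage   = 30
    per_sample = genome_gb * coverage                 # ~93 Gbases per sample
    printf "Raw bases per sample : ~%.0f Gbases\n", per_sample
    printf "Raw bases in total   : ~%.1f Tbases\n", samples * per_sample / 1000
    # Very rough storage guesses: ~0.5 byte/base for gzipped FASTQ,
    # ~1 byte/base for a sorted BAM (assumptions, not measurements).
    printf "Gzipped FASTQ, total : ~%.0f TB\n", samples * per_sample * 0.5 / 1000
    printf "Sorted BAMs, total   : ~%.0f TB\n", samples * per_sample * 1.0 / 1000
}'

So, under these assumptions, besides CPU we would also need roughly 15-20 TB of storage for FASTQ plus sorted BAM files, before counting any intermediate files.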
Thank you very much,
Regards
Ravi
I totally agree with this. For a compute server, a weekend is an eternity :)
125 FASTQ datasets at 30x depth (or ~200 million 50bp reads) for the human genome would take less than a day to pile up with 128 cores. In fact, since your core count is so close to your number of input files, I'd forget about parallelising individual jobs and just run 8 jobs per node. It will be faster than dealing with cluster issues/overhead, in both setup and execution :)
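To illustrate that embarrassingly-parallel approach, here is a minimal SGE array-job sketch: one task per sample, each single-threaded, so the scheduler simply packs ~8 tasks onto each 8-slot node. The file names, paths and the choice of bwa/samtools are my assumptions, not something from the original post:

#!/bin/bash
# map_one_sample.sh -- one single-threaded mapping job per sample (hypothetical paths).
# Submit with:  qsub -t 1-125 map_one_sample.sh
#$ -S /bin/bash
#$ -cwd
#$ -N wgs_map
#$ -j y

# samples.txt is an assumed file with one sample ID per line.
SAMPLE=$(sed -n "${SGE_TASK_ID}p" samples.txt)

REF=ref/human_g1k_v37.fasta              # assumed reference location
FQ1=fastq/${SAMPLE}_R1.fastq.gz          # assumed naming convention
FQ2=fastq/${SAMPLE}_R2.fastq.gz

# bwa mem defaults to a single thread; with 1 slot per task, SGE will run
# ~8 of these tasks per 8-core node, i.e. the "8 jobs per node" idea above.
bwa mem -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}" "$REF" "$FQ1" "$FQ2" \
    | samtools sort -o bam/${SAMPLE}.sorted.bam -
samtools index bam/${SAMPLE}.sorted.bam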
For mapping, at least, one job per node with 8 threads will be much more memory-efficient, and thus probably more efficient overall (less cache thrashing, fewer TLB misses, etc.), than 8 single-threaded processes running concurrently.
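Here is a sketch of that threaded variant, assuming the cluster has an SMP-style parallel environment (the PE name "smp" is an assumption; `qconf -spl` lists what your site actually provides). Only the resource request and the bwa/samtools thread counts change relative to the single-threaded script above:

#!/bin/bash
# map_one_sample_threaded.sh -- one 8-thread mapping job per node (hypothetical names).
# Submit with:  qsub -t 1-125 map_one_sample_threaded.sh
#$ -S /bin/bash
#$ -cwd
#$ -N wgs_map_smp
#$ -j y
#$ -pe smp 8        # grab all 8 slots on a node (PE name is site-specific)

SAMPLE=$(sed -n "${SGE_TASK_ID}p" samples.txt)
REF=ref/human_g1k_v37.fasta

# A single bwa process uses all 8 cores but keeps only one copy of the
# reference index in memory -- the memory-efficiency argument made above.
bwa mem -t 8 -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}" \
    "$REF" fastq/${SAMPLE}_R1.fastq.gz fastq/${SAMPLE}_R2.fastq.gz \
    | samtools sort -@ 4 -m 2G -o bam/${SAMPLE}.sorted.bam -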
Good point - memory could be a bottleneck with 'only' 47.2G per node, and the overhead from threading would be minimal compared to hitting disk swap :) I hadn't thought of that.
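If swapping is the worry, one option is to state the memory requirement explicitly in the job script above. Whether `h_vmem` is treated as a per-slot consumable (and therefore multiplied by the slot count) depends on how the site has configured it, so treat this as an assumption and check `qconf -sc` first:

# Lines to add to the 8-slot job script above (site-specific behaviour):
#$ -pe smp 8
#$ -l h_vmem=5G     # ~5G per slot; 8 slots => ~40G cap on a 47.2G node

# Also keep samtools sort's per-thread buffer modest so the sort step
# does not push the node into swap, e.g.  samtools sort -@ 4 -m 2G ...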