Hello, I am trying to run the BaseRecalibrator tool from the GATK package and it takes forever (more than 4 days per BAM file). The command I'm using is:
gatk BaseRecalibrator -I NG-01_1_S1_dedup_bwa.bam -R /rumi/shams/genomes/hg38/hg38.fa --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites 1000G_phase1.snps.high_confidence.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.dbsnp138.vcf -O NG-01_1_S1_dedup_bwa_BSQR.table
(I run it through a Conda installation of GATK (link), which shouldn't matter.)
I've googled a lot about this; it looks like there were many discussions on the subject in the GATK forums, but for some reason the GATK forum webpages are no longer available.
As far as I know, BaseRecalibrator is not parallelizable unless I run it with Spark. However, the Spark version of the tool (BaseRecalibratorSpark) is still in beta, so I am cautious about using it.
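In case I do end up trying it, my understanding is that the Spark tool takes the same arguments as the regular one, with Spark-specific options passed after a -- separator. Something like the following (untested sketch; the core count is arbitrary, and the exact Spark options and any extra requirements, such as the reference format, may differ between GATK versions):

gatk BaseRecalibratorSpark \
    -I NG-01_1_S1_dedup_bwa.bam \
    -R /rumi/shams/genomes/hg38/hg38.fa \
    --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    --known-sites 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
    --known-sites Homo_sapiens_assembly38.dbsnp138.vcf \
    -O NG-01_1_S1_dedup_bwa_BSQR.table \
    -- --spark-runner LOCAL --spark-master 'local[8]'   # 8 local cores; adjust as needed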
The BAM files I run it on are rather large (~40 GB each); I run 10 commands in parallel on a server with 88 cores and 400 GB of RAM; the processes have each been running for 4 days and are still not done. By comparison, it looks like BaseRecalibrator can generally run in ~5 hours per exome (see, for example, @Nicolas Rosewick's comments in this post).
Any recommendations on how I can speed it up?
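One workaround I've seen mentioned is to scatter the non-Spark tool by interval with -L and then merge the per-interval reports with GatherBQSRReports. Would something like this be reasonable? (Untested sketch; the interval names and output paths are illustrative, and the number of background jobs should be capped to what the disks can sustain.)

# one BaseRecalibrator job per chromosome
for chr in chr{1..22} chrX chrY; do
    gatk BaseRecalibrator \
        -I NG-01_1_S1_dedup_bwa.bam \
        -R /rumi/shams/genomes/hg38/hg38.fa \
        -L "$chr" \
        --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
        --known-sites 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
        --known-sites Homo_sapiens_assembly38.dbsnp138.vcf \
        -O "recal_${chr}.table" &
done
wait

# merge the per-chromosome tables into a single recalibration report
gatk GatherBQSRReports \
    $(printf -- '-I recal_%s.table ' chr{1..22} chrX chrY) \
    -O NG-01_1_S1_dedup_bwa_BSQR.table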
Is it still running? Use
top
to check. Maybe it got killed and the node is running in idle mode without actually terminating the submitted job. Or maybe there is a massive I/O bottleneck?

Yes, they are certainly running; I am checking every now and then haha
How would you detect an I/O bottleneck? I think this could potentially be the case.
In the end, it is not the size of the BAM file but the number of alignments that matters, but I would assume a file of that size has, say, 600 million alignments. Now, split that across 10 processes and we are talking about 60 million alignments per process; I would expect the recalibration to take a few hours, not days. So I would say something is wrong with the process, some bottleneck.
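You can check both hypotheses from the command line (assuming samtools is installed, and the sysstat package for iostat): samtools flagstat reports the actual alignment count, and a high iowait in top or high device utilization in iostat would point to an I/O bottleneck.

# actual number of alignments in the file
samtools flagstat NG-01_1_S1_dedup_bwa.bam

# watch for a high %wa (iowait) in the CPU line: processes stuck waiting on disk
top

# sustained high %util / long await on the data disk also indicates an I/O bottleneck
iostat -x 5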