Hello, I am trying to run the BaseRecalibrator tool from the GATK package and it takes forever (more than 4 days per BAM file). The command I'm using is:
gatk BaseRecalibrator -I NG-01_1_S1_dedup_bwa.bam -R /rumi/shams/genomes/hg38/hg38.fa --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites 1000G_phase1.snps.high_confidence.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.dbsnp138.vcf -O NG-01_1_S1_dedup_bwa_BSQR.table
(I run it through a Conda installation of GATK (link), which shouldn't matter.)
I've googled a lot about this; it looks like there were many discussions on the subject in the GATK forums, but for some reason the GATK forum webpages are no longer available.
As far as I know, BaseRecalibrator is not parallelizable unless I run it with Spark. However, the Spark version of the tool (BaseRecalibratorSpark) is still in beta, so I am cautious about using it.
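In case I do end up trying it, my understanding is that the Spark tool takes the same arguments as the regular one, with Spark-specific options passed after a -- separator. Something like the following (untested sketch; the core count is arbitrary, and the exact Spark options and any extra requirements, such as the reference format, may differ between GATK versions):

gatk BaseRecalibratorSpark \
    -I NG-01_1_S1_dedup_bwa.bam \
    -R /rumi/shams/genomes/hg38/hg38.fa \
    --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    --known-sites 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
    --known-sites Homo_sapiens_assembly38.dbsnp138.vcf \
    -O NG-01_1_S1_dedup_bwa_BSQR.table \
    -- --spark-runner LOCAL --spark-master 'local[8]'   # 8 local cores; adjust as needed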
The BAM files I run it on are rather large (~40 GB each); I run 10 commands in parallel on a server with 88 cores and 400 GB of RAM; the processes have each been running for 4 days and are still not done. By comparison, it looks like BaseRecalibrator can generally run in ~5 hours per exome (see, for example, @Nicolas Rosewick's comments in this post).
Any recommendations on how I can speed it up?
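One workaround I've seen mentioned is to scatter the non-Spark tool by interval with -L and then merge the per-interval reports with GatherBQSRReports. Would something like this be reasonable? (Untested sketch; the interval names and output paths are illustrative, and the number of background jobs should be capped to what the disks can sustain.)

# one BaseRecalibrator job per chromosome
for chr in chr{1..22} chrX chrY; do
    gatk BaseRecalibrator \
        -I NG-01_1_S1_dedup_bwa.bam \
        -R /rumi/shams/genomes/hg38/hg38.fa \
        -L "$chr" \
        --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
        --known-sites 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
        --known-sites Homo_sapiens_assembly38.dbsnp138.vcf \
        -O "recal_${chr}.table" &
done
wait

# merge the per-chromosome tables into a single recalibration report
gatk GatherBQSRReports \
    $(printf -- '-I recal_%s.table ' chr{1..22} chrX chrY) \
    -O NG-01_1_S1_dedup_bwa_BSQR.table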
Is it still running? Use
top
to check. Maybe it got killed and the node is running in idle mode without actually terminating the submitted job. Or maybe there is a massive I/O bottleneck?

Yes, they are certainly running; I am checking every now and then haha
How would you detect an I/O bottleneck? I think this could potentially be the case.
In the end, it is not the size of the BAM file but the number of alignments that matters, but I would assume a file of that size has, say, 600 million alignments. Now, split that across 10 processes and we are talking about 60 million alignments per process; I would expect the recalibration to take a few hours, not days. So I would say something is wrong with the process, some bottleneck.
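You can check both hypotheses from the command line (assuming samtools is installed, and the sysstat package for iostat): samtools flagstat reports the actual alignment count, and a high iowait in top or high device utilization in iostat would point to an I/O bottleneck.

# actual number of alignments in the file
samtools flagstat NG-01_1_S1_dedup_bwa.bam

# watch for a high %wa (iowait) in the CPU line: processes stuck waiting on disk
top

# sustained high %util / long await on the data disk also indicates an I/O bottleneck
iostat -x 5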