BaseRecalibrator takes forever to run. Any suggestions?
4.9 years ago
khorms ▴ 230

Hello, I am trying to run the BaseRecalibrator tool from the GATK package, and it takes forever (more than 4 days per BAM file). The command I'm using is:

gatk BaseRecalibrator -I NG-01_1_S1_dedup_bwa.bam -R /rumi/shams/genomes/hg38/hg38.fa --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites 1000G_phase1.snps.high_confidence.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.dbsnp138.vcf -O NG-01_1_S1_dedup_bwa_BSQR.table

(I run it through a Conda installation of GATK, which shouldn't matter.)
I've googled this a lot; there seem to have been many discussions of it on the GATK forums, but for some reason those forum pages are no longer available.
As far as I know, BaseRecalibrator is not parallelizable unless I run it with Spark. However, the Spark version of the program (BaseRecalibratorSpark) is still in beta, so I am cautious about using it.
The BAM files I run it on are rather large (~40G each). I run 10 commands in parallel on a server with 88 cores and 400G RAM; each process has been running for 4 days and is still not done. However, it looks like BaseRecalibrator can generally run in ~5 hours per exome (see, for example, @Nicolas Rosewick's comments in this post).
Any recommendations on how I can speed it up?

gatk whole exome • 2.1k views

Is it still running? Use top to check. Maybe it got killed and the node is sitting idle without actually terminating the submitted job. Or maybe there is a massive I/O bottleneck?
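
For the I/O question, here is a minimal sketch of one way to look at a single process, assuming a Linux server where /proc/<pid>/io is readable (same user or root). The helper names (proc_state, read_bytes, mib_per_sec, check_io) and the PID in the usage line are made up for illustration. It samples the kernel's read counter twice and reports the process state alongside the read throughput.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; requires Linux /proc.

# Print the state of a process as reported by ps (R = running,
# S = sleeping, D = uninterruptible sleep, usually waiting on I/O).
proc_state() {
    ps -o stat= -p "$1"
}

# Print the cumulative number of bytes the process has read from storage.
read_bytes() {
    awk '/^read_bytes:/ {print $2}' "/proc/$1/io"
}

# Convert two byte counters taken some seconds apart into whole MiB/s.
mib_per_sec() {
    local before=$1 after=$2 seconds=$3
    echo $(( (after - before) / 1048576 / seconds ))
}

# Sample one process twice and report its state and read throughput.
check_io() {
    local pid=$1 interval=${2:-10} b0 b1
    b0=$(read_bytes "$pid")
    sleep "$interval"
    b1=$(read_bytes "$pid")
    echo "state=$(proc_state "$pid") read=$(mib_per_sec "$b0" "$b1" "$interval") MiB/s"
}

# Usage (12345 is a placeholder PID): check_io 12345 10
```

As a rough reading: a process that sits mostly in state D with low throughput while its CPU usage in top is well below 100% is probably waiting on storage, whereas state R at ~100% CPU points to a compute-bound job.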


Yes, they are certainly running; I check every now and then, haha.
How would you detect an I/O bottleneck? I think that could well be the case.


In the end, it is not the size of the BAM file but the number of alignments that matters; still, I would assume a file of that size has, say, 600 million alignments. Split that into 10, and we are talking about 60 million alignments per process; I would expect the recalibration to take a few hours, not days. So I would say something is wrong with the process: some bottleneck.
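
Rather than guessing the alignment count, it can be read from the BAM index. The sketch below assumes samtools is installed and the BAM is indexed; sum_mapped and per_process are hypothetical helper names. It sums the mapped-record column of `samtools idxstats` and repeats the division above.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# Sum the mapped-record column (3rd field) of `samtools idxstats`
# output read from stdin.
sum_mapped() {
    awk -F'\t' '{ s += $3 } END { printf "%.0f\n", s }'
}

# Integer share of alignments per process: total / number of processes.
per_process() {
    echo $(( $1 / $2 ))
}

# Usage on an indexed BAM (needs samtools on PATH):
#   total=$(samtools idxstats NG-01_1_S1_dedup_bwa.bam | sum_mapped)
#   per_process "$total" 10
# With the figures assumed above:
#   per_process 600000000 10   # prints 60000000
```

If the real count turns out to be far above the estimate, the long runtime may be less surprising; if it is in the same range, a bottleneck becomes the more likely explanation.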

13 months ago

BBTools has a very fast base quality score recalibration tool. You can run it like this:

#first do preprocessing like adapter trimming, then alignment

calctruequality.sh in=mapped.sam ref=ref.fa callvars ploidy=2

bbduk.sh in=mapped.sam out=recal.sam ordered recalibrate

Both phases are multithreaded and operate at hundreds of Mbp/s.
