How to make variant calling run faster?
3
0
Entering edit mode
7.1 years ago
steve ★ 3.5k

We are doing a lot of variant calling using GATK MuTect2. Evidently, multi-threading does not work in this program, so we have to run it single threaded. It takes a very long time, ~24-36 hours to finish. Anyone know how to speed it up?

I had an idea of splitting the .bam into separate files per-chromosome and then running them all separately in parrallel, but I was not sure how difficult it would be and how hard it would be to merge everything back together into a single .vcf output at the end. Any thoughts?

variant calling mutect2 gatk • 7.0k views
ADD COMMENT
0
Entering edit mode

looks like my second part there is already mentioned here

ADD REPLY
0
Entering edit mode

Sambamba has a neat mpileup tool doing exactly what has been suggested [https://manpages.debian.org/testing/sambamba/sambamba-pileup.1.en.html]

ADD REPLY
2
Entering edit mode
7.1 years ago

Your second option is the right one. Breaking up BAMs by chromosome is easy as seen in your linked question. Concatenating the VCFs is also pretty easy with bcftools. Just look for the concat command.

ADD COMMENT
0
Entering edit mode

I ended up doing the reverse, and splitting my .bed file with target regions into one .bed per chromosome, and then running the same original .bam files against each of the split .bed files in parrallel. So far it has reduced 30+ hour variant calling down to ~4 hours as per the longest-running chromosome (most finished <2 hours). Helpful details here

ADD REPLY
1
Entering edit mode
7.1 years ago
pfs ▴ 280

If speed is your main concern and tool development is not an option I would change my pipeline to use Samtools for variant calling. I have benchmark variant calling tools in the past and Samtools is a lot faster than GATK haplotypecaller v3.5. I can not speak for newer versions of GATK.

ADD COMMENT
2
Entering edit mode

It's also a lot less accurate by many other benchmarks. I would not recommend samtools for variant calling, particularly if you care about indels or low frequency variants. It will not detect variants with a VAF <0.20. It definitely is quicker than GATK, but it's not as sensitive or accurate.

See this paper for benchmarks of many different callers (though there are many other papers out there like this as well).

ADD REPLY
0
Entering edit mode

That is absolutely not true, based on my own benchmarking. SAMtools (BCFtools mpileup piped into BCFtools call) can easily detect low frequency variants.

but it's not as sensitive or accurate.

You have not qualified this statement at all, Jared. From my own work in the clinical setting (and work of others), SAMtools has higher sensitivity to Sanger sequencing compared to GATK's tools.

ADD REPLY
1
Entering edit mode

I appreciate you qualifying your statements with hard facts and not anecdotal evidence. Here are some other publications supporting GATK generally having higher sensitivity, precision, and recall than SAMtools.

You're correct in stating that variant calling is a complex task with many variables (sequencing platform, aligner, literally dozens of parameters for most callers, etc), but every comparison I find (and my own experience) has shown that SAMtools isn't great at recalling low frequency SNVs (especially with low read depth) or indels. There is a reason that most of the top entries in the FDA Precision Medicine Variant Calling Challenges used GATK. Variant calling has changed drastically over the last few years, so it's possible newer versions of htslib/SAMtools/BCFtools may be better - I can't say. But most side-by-side comparisons pretty clearly indicate that GATK is just objectively better in most cases. Of course, the differences are usually a few percent at most, so it's not like SAMtools is atrocious or anything. SAMtools may still be the best tool for the job in some cases - it's really up to users to decide. But in general, I'm not going to withhold recommending a great tool just because another works "good enough".

I am glad that SAMtools has worked well for you in the clinical setting, which is an area that deserves a comparison on its own. I would be interested in seeing any reports from clinical laboratories backed by Sanger, your own benchmarking work, or any studies that indicate the superiority of SAMtools over GATK for general variant/small indel calling. I have no loyalty to either, so I'm more than happy to be proven wrong.

ADD REPLY
0
Entering edit mode

In addition, the problem with reading a single instantiated report, like the Scientific Reports manuscript that you've cited, is that there are are literally hundreds of parameters going into each variant calling pipeline. No single publication can account for all parameters and their utility across a diverse range of samples.

GATK frequently misses variants that are confirmed in samples via Sanger, while SAMtools easily detects these. This has been reported in independent clinical laboratories.

ADD REPLY
1
Entering edit mode
7.1 years ago

You can make variant-calling much faster by using BBMap's callvariants.sh tool (which is fully multithreaded) on a raw, unsorted sam file, so you don't have to waste time sorting, indexing and compressing/decompressing bam. It works on bam files too, though, so if you want to use a bam as input I suggest you have use samtools 1.4 or greater in the command line so that the decompression will also be multithreaded. callvariants.sh is around 90x faster than the samtools mpileup pipeline, in my testing

ADD COMMENT

Login before adding your answer.

Traffic: 1687 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6