Parallelization in GATK 4
5.2 years ago
raf.marcondes ▴ 110

Hi all,

I'm trying (and failing) to multi-thread HaplotypeCaller in GATK 4. I read in a few places online that multi-threading in GATK 4 has been made more tricky, maybe even unfeasible, but all the places where I read that seem to be more than 1 yr old. Is there a new solution to that problem?

PS: I've read in a few places about Spark, but I still have no idea what it is or how to use it.

Here's what I have at this point:

   java -Xmx16g -XX:ParallelGCThreads=1 -jar gatk-package-4.1.3.0-local.jar HaplotypeCaller -R myfasta.fasta -I mybam.bam -O mygvcf.g.vcf --emit-ref-confidence GVCF --min-dangling-branch-length 1 --min-pruning 1 --num_cpu_threads_per_data_thread 2

A USER ERROR has occurred: num_cpu_threads_per_data_thread is not a recognized option
GATK haplotypecaller multi-threading • 18k views
5.2 years ago
h.mon 35k

To start with, you should not be running java -jar gatk-package-4.1.3.0-local.jar with GATK4; the recommended and supported way to run GATK4 is through the bundled script:

gatk --java-options "-Xmx16g -XX:ParallelGCThreads=1" [...]

In GATK4, multithreading is implemented using Spark; see "Document how multi-threading support works in GATK4". As you noted, documentation is scattered and scarce, e.g. "(How to) Run Spark-enabled GATK tools on a local multi-core machine".

Based on this Spark GATK4 page, you can try:

 gatk --java-options "-Xmx16g -XX:ParallelGCThreads=1" HaplotypeCallerSpark \
    --spark-master local[2] -R myfasta.fasta -I mybam.bam -O mygvcf.g.vcf \
    --emit-ref-confidence GVCF --min-dangling-branch-length 1 --min-pruning 1

edit: another common method for parallelizing HaplotypeCaller is using the -L option to restrict calling to one chromosome, and process several chromosomes simultaneously - see Intervals and interval lists.
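
The -L approach can be sketched roughly as below. This is only an illustration under a few assumptions of mine: the contig names chr1..chr3 are placeholders, -Xmx4g per job is an arbitrary split of the memory, the RUN variable is a hypothetical dry-run switch, and using MergeVcfs to stitch the per-contig files back together is my suggestion, not something stated in the linked pages. The other HaplotypeCaller options from the question are left out for brevity.

```shell
#!/usr/bin/env bash
# Sketch: scatter HaplotypeCaller by contig with -L, then merge the pieces.
set -euo pipefail

REF=myfasta.fasta
BAM=mybam.bam
CHROMS=(chr1 chr2 chr3)   # placeholder contig names; use the ones in your reference
RUN=${RUN-echo}           # hypothetical switch: prints commands by default,
                          # run the script with RUN= (empty) to execute them

# Print (or run) one HaplotypeCaller job restricted to a single contig.
call_contig() {
    local chrom=$1
    $RUN gatk --java-options "-Xmx4g -XX:ParallelGCThreads=1" HaplotypeCaller \
        -R "$REF" -I "$BAM" -L "$chrom" \
        -O "mygvcf.${chrom}.g.vcf" --emit-ref-confidence GVCF
}

# One background job per contig, then block until all have finished.
# Note: a bare "wait" does not report failed jobs, so check the logs.
for chrom in "${CHROMS[@]}"; do
    call_contig "$chrom" &
done
wait

# Combine the per-contig files, listed in contig order, into one gVCF.
merge_args=()
for chrom in "${CHROMS[@]}"; do
    merge_args+=(-I "mygvcf.${chrom}.g.vcf")
done
$RUN gatk MergeVcfs "${merge_args[@]}" -O mygvcf.g.vcf
```

Run as is, the script only prints the commands it would launch, which makes it easy to check them before committing CPU time. The number of contigs started at once should not exceed the cores and memory you actually have.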

I note that the official documentation for the spark implementation of HaplotypeCaller still says the following:

"This tool DOES NOT match the output of HaplotypeCaller. It is still under development and should not be used for production work. For evaluation only. Use the non-spark HaplotypeCaller if you care about the results."

Does anybody happen to know if the Broad folks are simply being extra cautious, or should this warning be taken at face value?

Honestly, I don't know if they are just overly cautious or if there are still problems with HaplotypeCallerSpark.

raf.marcondes, in view of the above warning, you should stick to "poor man's" parallelism using -L.

Keep in mind that you can ask Broad personnel yourself on their forum: https://gatkforums.broadinstitute.org/gatk/categories/ask-the-team

Thank you SO MUCH for your helpful answer! I'm unsure what you mean by "running GATK4 using the bundled script" though. Do I need to re-install GATK in a different way to do that?

I don't think you need to reinstall. This is the contents of my GATK folder:

ls -lh ~/bin/GATK-4.1.4.0/
total 407M
-rwxr-xr-x 1 hmon hmon  20K Oct  8 15:34 gatk
-rw-r--r-- 1 hmon hmon 851K Oct  8 15:34 gatk-completion.sh
-rw-r--r-- 1 hmon hmon  964 Oct  8 15:34 gatkcondaenv.yml
-rw-r--r-- 1 hmon hmon 3.6K Oct  8 15:34 GATKConfig.EXAMPLE.properties
drwxr-xr-x 2 hmon hmon  68K Oct  8 15:34 gatkdoc
-rw-r--r-- 1 hmon hmon 271M Oct  8 15:34 gatk-package-4.1.4.0-local.jar
-rw-r--r-- 1 hmon hmon 135M Oct  8 15:34 gatk-package-4.1.4.0-spark.jar
-rw-r--r-- 1 hmon hmon 113K Oct  8 15:34 gatkPythonPackageArchive.zip
-rw-r--r-- 1 hmon hmon  38K Oct  8 15:34 README.md
drwxr-xr-x 5 hmon hmon 4.0K Oct  8 15:34 scripts
  

The first entry, named simply gatk, is a Python wrapper script that should be used instead of the jar file:

head -n 17 ~/bin/GATK-4.1.4.0/gatk
#!/usr/bin/env python
#
# Launcher script for GATK tools. Delegates to java -jar, spark-submit, or gcloud as appropriate,
# and sets many important Spark and htsjdk properties before launch.
#
# If running a non-Spark tool, or a Spark tool in local mode, will search for GATK executables
# as follows:
#     -If the GATK_LOCAL_JAR environment variable is set, uses that jar
#     -Otherwise if the GATK_RUN_SCRIPT created by "gradle installDist" exists, uses that
#     -Otherwise uses the newest local jar in the same directory as the script or the BIN_PATH
#      (in that order of precedence)
#
# If running a Spark tool, searches for GATK executables as follows:
#     -If the GATK_SPARK_JAR environment variable is set, uses that jar
#     -Otherwise uses the newest Spark jar in the same directory as the script or the BIN_PATH
#      (in that order of precedence)
#
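
In practice, "using the bundled script" can be as simple as putting that folder on your PATH. Here is a minimal sketch; the use_gatk helper is a hypothetical name of mine, and the folder is the one from my listing above, so adjust it to wherever you unpacked GATK. The launcher's shebang also means a python interpreter has to be on your PATH.

```shell
#!/usr/bin/env bash
# Sketch: make the bundled launcher callable as plain "gatk".

# Put the folder holding the launcher on PATH.
# Returns non-zero if the folder has no executable file named "gatk".
use_gatk() {
    local dir=$1
    if [ -x "$dir/gatk" ]; then
        export PATH="$dir:$PATH"
    else
        echo "no executable gatk launcher in $dir" >&2
        return 1
    fi
}

# Example folder from the listing above; the tool list is a quick sanity check.
use_gatk "$HOME/bin/GATK-4.1.4.0" && gatk --list
```

After that, commands start with gatk instead of java -jar, and the script picks the local or Spark jar from its own folder as described in its header.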
  
4.1 years ago

Hi,

Add "--native-pair-hmm-threads <number of threads>"; I use it as in the command below:

java -jar //PathToGATK//.jar HaplotypeCaller --native-pair-hmm-threads 20 -R myfasta.fasta -I InPut.bam -O OutPut.vcf -L InBed.bed

Note: GATK version 4.1.8.1

Alternative solution (I have not used it):

Use the GATK4 Spark tools, which are still in beta testing and may perform inconsistently, but can dramatically improve runtime. See the link below for more information:

https://gatk.broadinstitute.org/hc/en-us/articles/360036509052-HaplotypeCallerSpark-BETA-

Best wishes,

Hi, I tried your suggestion of adding the --native-pair-hmm-threads 2 argument when running Mutect2. However, GATK reported that only 1 available thread was detected, so it went ahead with a single thread even though I requested 2.

08:54:03.826 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
08:54:03.826 INFO IntelPairHmm - Available threads: 1
08:54:03.826 INFO IntelPairHmm - Requested threads: 2
08:54:03.826 WARN IntelPairHmm - Using 1 available threads, but 2 were requested
08:54:03.826 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation

Do you happen to know what might be the issue?

Sorry for the late reply. If it says the number of available threads is 1, then it is 1! Check your machine's specification and let us know.
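
A quick way to check is sketched below. It assumes a Linux machine with GNU coreutils (for nproc), and the cpu_report name is just something I made up. My guess, not confirmed anywhere in this thread, is that the "Available threads" figure follows what OpenMP is allowed to use, so a job scheduler or container that grants a single core, or an OMP_NUM_THREADS=1 setting, could produce that log even on a multi-core machine.

```shell
#!/usr/bin/env bash
# Sketch: report how many CPUs a job started from this shell may use.
cpu_report() {
    echo "CPUs installed:          $(nproc --all)"
    # nproc without --all honours CPU affinity and the OMP_NUM_THREADS /
    # OMP_THREAD_LIMIT environment variables.
    echo "CPUs usable right now:   $(nproc)"
    echo "OMP_NUM_THREADS:         ${OMP_NUM_THREADS:-unset}"
}

cpu_report
```

Run it in the same environment as the GATK job (for example inside the same batch script), since a login shell may see more cores than the job does.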

5.2 years ago
raf.marcondes ▴ 110

Just to follow up, I figured this out. Here's how to make HaplotypeCallerSpark work, using 2 cores:

gatk --java-options  "-Xmx16g -XX:ParallelGCThreads=1" HaplotypeCallerSpark --spark-master local[2] -R myfasta.fasta -I mybam.bam -O mygvcf.g.vcf --emit-ref-confidence GVCF --min-dangling-branch-length 1 --min-pruning 1
Thanks, though HaplotypeCallerSpark still seems to be in beta at the moment.

4.1 years ago
ashotmarg2004 ▴ 130

Have a look here: https://www.ibm.com/downloads/cas/ZJQD0QAL for more examples with Spark for GATK4.

Did you try this? Does it work?
