HaplotypeCallerSpark How to use
5.3 years ago
nanoide ▴ 120

Hi all. I wanted to try the HaplotypeCaller Spark implementation in GATK4. I'm aware it's in beta and not fully recommended yet, but we want to try it.

I wanted to ask about usage: I looked for documentation, but I'm not clear on some errors I'm getting. I also wanted to ask about the --strict option. It's supposed to give results similar to the non-Spark HaplotypeCaller, but at lower speed. Does anyone know whether, even with that option, it still runs faster than the non-Spark version?

With HaplotypeCallerSpark I'm using Java 1.8 and the command line:

gatk HaplotypeCallerSpark --java-options "-Xmx4g" -R ref.fa -I XX.bam -O XX.vcf.gz -ERC GVCF --native-pair-hmm-threads 8 --spark-master local[8] --conf 'spark.executor.cores=8'
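
A note on the command line above: in some shells an unquoted `local[8]` can be treated as a glob pattern, so quoting it is safer. The sketch below prints a variant of the same command with the core count taken from the scheduler's allocation. `build_gatk_cmd` is a hypothetical helper, and `SLURM_CPUS_PER_TASK` assumes Slurm; another queuing system would expose a different variable. It only prints the command for inspection and is not a confirmed fix for the errors reported in this thread.

```shell
# Assumption: the job runs under Slurm, which sets SLURM_CPUS_PER_TASK
# when --cpus-per-task is requested. Fall back to 8, as in the original command.
CORES="${SLURM_CPUS_PER_TASK:-8}"

# Hypothetical helper: print the command (rather than run it) so it can be
# checked before submission. The flags are the ones from the original command.
build_gatk_cmd() {
  printf '%s ' gatk HaplotypeCallerSpark \
    --java-options '"-Xmx4g"' \
    -R ref.fa -I XX.bam -O XX.vcf.gz -ERC GVCF \
    --native-pair-hmm-threads "$1" \
    --spark-master "'local[$1]'" \
    --conf "'spark.executor.cores=$1'"
  printf '\n'
}

build_gatk_cmd "$CORES"
```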

First of all, in Stage 1 I'm getting too many threads. I think these are the relevant lines in the log:

INFO DAGScheduler: Submitting 60 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[22] at mapToPair at SparkSharder.java:247) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
INFO TaskSchedulerImpl: Adding task set 1.0 with 60 tasks
INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, PROCESS_LOCAL, 9311 bytes)

Then it starts opening threads, up to 60, with lines like:

INFO Executor: Running task 0.0 in stage 1.0 (TID 1)

And some of them fail with:

java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
ERROR Executor: Exception in task 35.0 in stage 1.0 (TID 36)

Stage 1 is then cancelled. I think this may be related to the use of so many threads, because I'm on a queuing system. Does anyone know how to control this?
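
In Spark's local mode, `local[8]` requests 8 worker threads, so the 60 tasks in the task set would normally be queued and run a few at a time rather than as 60 simultaneous threads. One way to check this from the log is to count how many tasks are running at once. The sketch below is only a rough check: `peak_running_tasks` is a hypothetical helper, and it assumes each task ends with either an `Executor: Finished task` line (not shown in the excerpt above, so an assumption about the log format) or an `Executor: Exception in task` line.

```shell
# Hypothetical helper: report the peak number of tasks running at the same
# time, based on Executor start/end lines in a Spark log file.
peak_running_tasks() {
  awk '
    # A task starts: one more running; remember the highest count seen.
    /Executor: Running task/ { running++; if (running > peak) peak = running }
    # A task ends (assumed "Finished task" line) or fails with an exception.
    /Executor: Finished task/ || /Executor: Exception in task/ {
      if (running > 0) running--
    }
    END { print peak + 0 }
  ' "$1"
}

# Usage (the log file name is a placeholder):
# peak_running_tasks gatk_spark.log
```

If the reported peak stays near 8, the task count alone is probably not what the queuing system is objecting to, and other thread sources would be worth looking at.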

Thanks

snp GATK • 2.4k views
5.2 years ago

HaplotypeCallerSpark is in beta and behaves unexpectedly. My experience with it is that:

  • it does not matter how big or small the dataset is: sometimes it fails and sometimes it works
  • changing the value of the --native-pair-hmm-threads parameter does not help. I tried it with 10 (optimum), 4 (default) and 54 (hyperthreading):
    • sometimes it fails with 54 and works with 10
    • sometimes it fails with both 54 and 10 and works with the default 4
    • sometimes it fails with the default 4

I am also looking for a resolution.
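
Since the failures described above are intermittent, a stopgap some people use for flaky jobs is to re-run the command a few times. The sketch below is a generic retry wrapper, not a fix for the underlying problem; `retry` is a hypothetical helper, and the attempt count in the usage comment is arbitrary.

```shell
# Hypothetical helper: run a command up to N times, stopping at the first success.
# Usage: retry N command [args...]
retry() {
  max_attempts=$1
  shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $attempt of $max_attempts failed" >&2
    attempt=$((attempt + 1))
  done
  return 1
}

# Usage with the command from the question (left commented out):
# retry 3 gatk HaplotypeCallerSpark --java-options "-Xmx4g" -R ref.fa -I XX.bam \
#   -O XX.vcf.gz -ERC GVCF --spark-master 'local[8]'
```

Any partial output left by a failed attempt would need to be removed or overwritten before the next one; the sketch does not handle that.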
