Hi all. I wanted to try the HaplotypeCaller Spark implementation in GATK4. I'm aware it's in beta and not yet fully recommended, but we'd like to try it.
I looked for documentation on the usage, but I'm still not clear on some errors I'm getting. I also wanted to ask about the --strict option: it's supposed to give results similar to the non-Spark HaplotypeCaller, at the cost of speed. Does anyone know whether, even with --strict, the Spark version still runs faster than the non-Spark tool?
For HaplotypeCallerSpark I'm using Java 1.8 and the following command (with --java-options before the tool name, as the GATK4 wrapper expects):
gatk --java-options "-Xmx4g" HaplotypeCallerSpark -R ref.fa -I XX.bam -O XX.vcf.gz -ERC GVCF --native-pair-hmm-threads 8 --spark-master local[8] --conf 'spark.executor.cores=8'
First of all, in Stage 1 I'm getting far more tasks than I expected. I think these are the relevant lines in the log:
INFO DAGScheduler: Submitting 60 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[22] at mapToPair at SparkSharder.java:247) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
INFO TaskSchedulerImpl: Adding task set 1.0 with 60 tasks
INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, PROCESS_LOCAL, 9311 bytes)
Then it starts launching tasks, up to 60, with lines like:
INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
And some of them fail with:
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
ERROR Executor: Exception in task 35.0 in stage 1.0 (TID 36)
So then Stage 1 is cancelled. I think this may be related to the use of so many threads, because I'm running under a queueing system. Does anyone know how to control this?
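In case it helps to see what I'm planning to try next, here is a variant of my command with the thread count capped. This is only a sketch based on my reading of the Spark docs, not something I've confirmed fixes the problem: I'm assuming that in local mode the N in local[N] caps how many tasks run concurrently (so 60 tasks in the stage would still run at most N at a time), that spark.executor.cores has no effect in local mode, and that spark.default.parallelism (a standard Spark property) is the knob that influences how many partitions/tasks get created.

```shell
# Sketch only: cap Spark's concurrency to the cores the queueing system
# actually grants us, instead of hard-coding 8 everywhere.
THREADS=4   # set this to your job's allocated core count

gatk --java-options "-Xmx4g" HaplotypeCallerSpark \
    -R ref.fa -I XX.bam -O XX.vcf.gz -ERC GVCF \
    --native-pair-hmm-threads "$THREADS" \
    --spark-master "local[$THREADS]" \
    --conf "spark.default.parallelism=$THREADS"   # assumption: reduces task count
```

If anyone knows whether the 60 tasks actually map to 60 OS threads at once (which is what my queueing system would object to), or whether they're just queued behind the local[N] thread pool, that would clarify whether this is even the right thing to tune.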
Thanks