Hi all,
I've been attempting to run GATK's MarkDuplicatesSpark on a BAM file that is about 160 GB, but I keep getting errors about running out of space on the device. I've allotted Docker 850 GB of disk, which I would expect to be enough. The following command runs for around two days before hitting the error.
Command Line
gatk MarkDuplicatesSpark -I "mydata/sample.bam" -O sample.markeddups.bam --spark-master local[10] --verbosity ERROR --tmp-dir path/josh --conf 'spark.local.dir=./tmp'
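In case it's relevant: I wondered whether the container's writable layer is what's filling up, and whether pointing both --tmp-dir and spark.local.dir at a bind-mounted host directory would avoid that. This is roughly what I was considering trying next (the /host/bigdisk path and /scratch mount point are hypothetical, and I haven't confirmed this is the right fix):

docker run -v /host/bigdisk:/scratch broadinstitute/gatk gatk MarkDuplicatesSpark -I /scratch/mydata/sample.bam -O /scratch/sample.markeddups.bam --spark-master local[10] --verbosity ERROR --tmp-dir /scratch/tmp --conf 'spark.local.dir=/scratch/tmp'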
Is there a way to reduce the size of the intermediate chunks this Spark tool writes to disk? I can't find an obvious option for this in Docker or among the MarkDuplicatesSpark command-line arguments. Each chunk is currently around 50 MB and there are about 12,000 "tasks." I am new to Spark, so I'm not fully comfortable interpreting what that means.
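For what it's worth, the closest option I've found is --bam-partition-size, which the GATK Spark tool docs describe as the maximum number of bytes read into each input partition. I'm not sure it actually governs the chunks I'm seeing spill to disk, but something like the following is what I'd try (the 32 MB value, 33554432 bytes, is just a guess):

gatk MarkDuplicatesSpark -I "mydata/sample.bam" -O sample.markeddups.bam --spark-master local[10] --verbosity ERROR --bam-partition-size 33554432 --tmp-dir path/josh --conf 'spark.local.dir=./tmp'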