Hi all,
I've been attempting to run GATK's MarkDuplicatesSpark on a BAM file that is about 160 GB, but I keep getting errors about running out of space on the device. I've allotted Docker 850 GB of disk, which I would expect to be enough. The following command runs for around two days before hitting the error.
Command Line
gatk MarkDuplicatesSpark -I "mydata/sample.bam" -O sample.markeddups.bam --spark-master local[10] --verbosity ERROR --tmp-dir path/josh --conf 'spark.local.dir=./tmp'
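In case it's relevant: I wondered whether the container's writable layer is what's filling up, and whether pointing both --tmp-dir and spark.local.dir at a bind-mounted host directory would avoid that. This is roughly what I was considering trying next (the /host/bigdisk path and /scratch mount point are hypothetical, and I haven't confirmed this is the right fix):

docker run -v /host/bigdisk:/scratch broadinstitute/gatk gatk MarkDuplicatesSpark -I /scratch/mydata/sample.bam -O /scratch/sample.markeddups.bam --spark-master local[10] --verbosity ERROR --tmp-dir /scratch/tmp --conf 'spark.local.dir=/scratch/tmp'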
Is there a way to reduce the size of the intermediate chunks this Spark tool writes to disk? I can't find an obvious option for this in Docker or among the MarkDuplicatesSpark command-line arguments. Each chunk is currently around 50 MB and there are about 12,000 "tasks." I am new to Spark, so I'm not fully comfortable interpreting what that means.
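For what it's worth, the closest option I've found is --bam-partition-size, which the GATK Spark tool docs describe as the maximum number of bytes read into each input partition. I'm not sure it actually governs the chunks I'm seeing spill to disk, but something like the following is what I'd try (the 32 MB value, 33554432 bytes, is just a guess):

gatk MarkDuplicatesSpark -I "mydata/sample.bam" -O sample.markeddups.bam --spark-master local[10] --verbosity ERROR --bam-partition-size 33554432 --tmp-dir path/josh --conf 'spark.local.dir=./tmp'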