Reducing block size used in Spark versions of GATK tools
5.3 years ago
aalith ▴ 20

Hi all,

I've been attempting to run GATK's MarkDuplicatesSpark on a BAM file that's about 160 GB, but I keep getting errors about running out of space on my device. I've allotted Docker 850 GB of space, which I would expect to be enough. The following command runs for about two days before hitting the error.

Command Line

gatk MarkDuplicatesSpark -I "mydata/sample.bam" -O sample.markeddups.bam --spark-master local[10] --verbosity ERROR --tmp-dir path/josh --conf 'spark.local.dir=./tmp'

Is there a way to reduce the size of the storage blocks this Spark tool creates? I can't find a straightforward way to do so either in Docker or from the MarkDuplicatesSpark command line. Each chunk is currently around 50 MB and there are about 12,000 "tasks." I'm new to this kind of work, so I'm not fully comfortable interpreting what that means. A sketch of the sort of variant I had in mind is below.
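For reference, this is roughly what I was picturing, assuming --bam-partition-size is the right knob for the input split size (value in bytes) and that pointing both --tmp-dir and spark.local.dir at an absolute path on a volume with plenty of room is sensible. The /scratch paths are just placeholders for my setup:

# hypothetical variant, not tested:
# - absolute temp paths on a large volume for both GATK and Spark
# - --bam-partition-size set to 16 MB, if that flag controls the input split size
gatk MarkDuplicatesSpark \
    -I mydata/sample.bam \
    -O sample.markeddups.bam \
    --spark-master local[10] \
    --verbosity ERROR \
    --tmp-dir /scratch/josh/tmp \
    --conf 'spark.local.dir=/scratch/josh/tmp' \
    --bam-partition-size 16777216

Does that look like the right direction, or is there a different setting I should be using?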

gatk spark • 766 views