I'm using GATK's MarkDuplicatesSpark to mark PCR duplicates in my BAM files before running base quality score recalibration and then MuTect. My BAM file is 166 GB. I keep getting errors about space, even though I am running nothing else on Docker concurrently. I have given Docker 14 cores, 850 GB of storage, and 55 GB of memory. Before my most recent attempt, I cleared my cache with "docker container prune".
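For reference, the only broader cleanup I'm aware of is the system-wide variant, something like this (both are standard Docker CLI subcommands; I don't know how much either would actually reclaim in my case):

# show how Docker's allocated disk is being used (images, containers, volumes, build cache)
docker system df

# remove stopped containers, unused networks, dangling images, and the build cache
docker system prune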
The error is as follows (with several normal lines above it):
19/09/01 05:32:21 INFO ShuffleBlockFetcherIterator: Getting 15508 non-empty blocks out of 16278 blocks
19/09/01 05:32:21 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/09/01 05:32:21 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:21 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:22 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:22 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:27 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:27 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:27 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:27 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:29 ERROR Utils: Aborting task
org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
My command looks like this:
gatk MarkDuplicatesSpark -I "mydata/files/merged.bam" -O merged.markeddups.bam --spark-master local[10] --tmp-dir path/josh
I have tried running MarkDuplicatesSpark with the optional flag to create a statistics file (-M merged.txt). I have also tried controlling the number of cores used with the --conf flag instead of the --spark-master flag (--conf 'spark.executor.cores=10').
Any suggestions on why I'm running out of memory? I think my machine has more than enough resources to handle this task. This command also takes 3 days to reach this error.
You are running out of disk space, not memory. Hard disk/SSD, not RAM. Spark is probably creating a whole lot of temporary files; that is not uncommon with these distributed data-processing applications.
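If it were me, I would point --tmp-dir at a bind-mounted host directory with plenty of free space and watch how fast it fills during the run. A rough sketch, not a tested recipe (the /big/scratch host path and the broadinstitute/gatk image are placeholders for whatever you actually run):

# mount a roomy host directory into the container as /scratch
docker run -v /big/scratch:/scratch -it broadinstitute/gatk

# inside the container, send Spark's temporary/shuffle files to that mount
gatk MarkDuplicatesSpark -I mydata/files/merged.bam -O merged.markeddups.bam --spark-master local[10] --tmp-dir /scratch

# check the space remaining on that mount while the job runs
df -h /scratch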
That's what I thought, but does that make sense? Would I really need more than 850 GB allocated to Docker?
Wild suggestion: maybe Docker's storage layer is using an overly large block size, so each file chunk gets allocated more space than it actually needs? The mention of 16278 blocks in the log makes me think this. It would be as if each chunk were allocated 50 MB when 4 MB would do.
Also, check out this possibly related post: https://serverfault.com/questions/357367/xfs-no-space-left-on-device-but-i-have-850gb-available
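One quick check in that spirit, using the --tmp-dir from your command above: compare byte usage and inode usage, since inode exhaustion is also reported as "No space left on device".

# byte usage on the filesystem holding the Spark temp directory
df -h path/josh

# inode usage on the same filesystem (can hit 100% even when bytes look fine)
df -i path/josh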
Thanks! That post is helpful, but I'm new to Docker... how would I implement this in Docker?
I may fall back to the regular MarkDuplicates! I'd just need to sort by queryname first.
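Something like this, I think (output names are placeholders; I haven't tested it yet):

# queryname-sort, then run the classic non-Spark MarkDuplicates
gatk SortSam -I mydata/files/merged.bam -O merged.qsorted.bam --SORT_ORDER queryname
gatk MarkDuplicates -I merged.qsorted.bam -O merged.markeddups.bam -M merged.markdup_metrics.txt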
Sorry, I don't know Docker. Maybe someone familiar with it can help you out.