I'm using GATK's MarkDuplicatesSpark to mark PCR duplicates in my BAM files before running base quality score recalibration and then MuTect. My BAM file is 166 GB. I keep getting errors about space, even though I am running nothing else on Docker concurrently. I have given Docker 14 cores, 850 GB of storage, and 55 GB of memory. Before my most recent attempt, I cleared my cache with "docker container prune".
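For reference, the only broader cleanup I'm aware of is the system-wide variant, something like this (both are standard Docker CLI subcommands; I don't know how much either would actually reclaim in my case):

# show how Docker's allocated disk is being used (images, containers, volumes, build cache)
docker system df

# remove stopped containers, unused networks, dangling images, and the build cache
docker system prune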
The error is as follows (with several normal lines above it):
19/09/01 05:32:21 INFO ShuffleBlockFetcherIterator: Getting 15508 non-empty blocks out of 16278 blocks
19/09/01 05:32:21 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/09/01 05:32:21 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:21 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:22 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:22 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:27 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:27 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:27 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/09/01 05:32:27 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/09/01 05:32:29 ERROR Utils: Aborting task
org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
My command looks like this:
gatk MarkDuplicatesSpark -I "mydata/files/merged.bam" -O merged.markeddups.bam --spark-master local[10] --tmp-dir path/josh
I have tried running MarkDuplicatesSpark with the optional flag to create a statistics file (-M merged.txt). I have also tried controlling the number of cores used with the --conf flag instead of the --spark-master flag (--conf 'spark.executor.cores=10').
Any suggestions on why I'm running out of memory? I think my machine has more than enough resources to handle this task. This command also takes 3 days to reach this error.
You are running out of disk space, not memory. Hard disk/SSD, not RAM. Spark is probably creating a whole lot of temporary files; that is not uncommon with these distributed data-processing applications.
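If it were me, I would point --tmp-dir at a bind-mounted host directory with plenty of free space and watch how fast it fills during the run. A rough sketch, not a tested recipe (the /big/scratch host path and the broadinstitute/gatk image are placeholders for whatever you actually run):

# mount a roomy host directory into the container as /scratch
docker run -v /big/scratch:/scratch -it broadinstitute/gatk

# inside the container, send Spark's temporary/shuffle files to that mount
gatk MarkDuplicatesSpark -I mydata/files/merged.bam -O merged.markeddups.bam --spark-master local[10] --tmp-dir /scratch

# check the space remaining on that mount while the job runs
df -h /scratch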
That's what I thought, but does that make sense? Would I really need more than 850 GB allocated to Docker?
Wild suggestion: maybe Docker's storage layer is using an overly large block size, so each file chunk gets allocated more space than it actually needs? The mention of 16278 blocks in the log makes me think this. It would be as if each chunk were allocated 50 MB when 4 MB would do.
Also, check out this possibly related post: https://serverfault.com/questions/357367/xfs-no-space-left-on-device-but-i-have-850gb-available
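One quick check in that spirit, using the --tmp-dir from your command above: compare byte usage and inode usage, since inode exhaustion is also reported as "No space left on device".

# byte usage on the filesystem holding the Spark temp directory
df -h path/josh

# inode usage on the same filesystem (can hit 100% even when bytes look fine)
df -i path/josh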
Thanks! That post is helpful, but I'm new to Docker... how would I implement this in Docker?
I may fall back to the regular MarkDuplicates! I'd just need to sort by queryname first.
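Something like this, I think (output names are placeholders; I haven't tested it yet):

# queryname-sort, then run the classic non-Spark MarkDuplicates
gatk SortSam -I mydata/files/merged.bam -O merged.qsorted.bam --SORT_ORDER queryname
gatk MarkDuplicates -I merged.qsorted.bam -O merged.markeddups.bam -M merged.markdup_metrics.txt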
Sorry, I don't know Docker. Maybe someone familiar with it can help you out.