7 weeks ago
ChumBucket2024
Hi everyone,
I am trying to run GATK MarkDuplicatesSpark on a computational cluster with a very large SAM file (131 GB), but ~11 hours into the run I get "Disk quota exceeded" errors. I already tried setting my tmp directory to one with 18P of storage, but I hit the same issue.
gatk MarkDuplicates -I ${aligned_reads}/ERR9880493_sorted.bam -O ${aligned_reads}/ERR9880493_marked_duplicates.bam -M ${aligned_reads}/ERR9880493_markdup_metrics.txt --conf spark.local.dir= /MC38/tmp --tmp-dir /MC38/tmp
And these were the last 50 lines of the error message:
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2779)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1242)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1242)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1242)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3048)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2982)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2971)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:984)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2398)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2419)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2451)
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:83)
... 27 more
Caused by: java.io.FileNotFoundException: /corral/mdacc/MCB24068/MC38/CD_Genomics/WES/aligned_reads/ERR9880493_marked_duplicates.bam.parts/_temporary/0/_temporary/attempt_202410020716311268354538371933037_0046_r_006446_0/.part-r-06446.sbi (Disk quota exceeded)
at java.base/java.io.FileOutputStream.open0(Native Method)
at java.base/java.io.FileOutputStream.open(FileOutputStream.java:293)
at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:235)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:449)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:412)
at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:575)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:564)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:595)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:734)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:709)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1233)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1210)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1091)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1078)
at org.disq_bio.disq.impl.formats.bam.HeaderlessBamOutputFormat$BamRecordWriter.<init>(HeaderlessBamOutputFormat.java:90)
at org.disq_bio.disq.impl.formats.bam.HeaderlessBamOutputFormat.getRecordWriter(HeaderlessBamOutputFormat.java:192)
at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:360)
at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:126)
at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:88)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
10:28:42.902 INFO ShutdownHookManager - Shutdown hook called
10:28:42.903 INFO ShutdownHookManager - Deleting directory /corral/mdacc/MCB24068/MC38/tmp/spark-f75f8b1d-dc58-433e-a130-73f2dcbc37b2
I am stuck. I've been trying to generate my ERR9880493_sorted.bam for a few days now and have been unable to complete it.
Any help would be greatly appreciated.
Does that directory exist and does your account have write permissions to it? Looks like your code is trying to use /corral/mdacc/MCB24068/MC38/tmp/*. Please see https://gatk.broadinstitute.org/hc/en-us/articles/18965297287067-How-to-setup-and-use-temporary-folder-for-GATK-local-execution on how to properly set tmp directories.
Hi Max,
Yes, the directory exists. And I'm not sure why the full path got deleted when I was making the post, but the correct path is /corral/mdacc/MCB24068/MC38/tmp/*, that is what I used.
I'm using an HPC.
The directory has plenty of available storage and has drwxrwxrwx full permissions. I'm so lost.
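One thing worth checking: the FileNotFoundException in the trace above names a path under aligned_reads/ERR9880493_marked_duplicates.bam.parts, which is the output location rather than the tmp directory. A minimal sketch for checking both directories follows. check_dir is a hypothetical helper name, and df only reports filesystem-level usage; per-user or per-project quotas (including file-count limits) usually need a site-specific quota tool, so treat this as a first pass only.

```shell
# check_dir (hypothetical helper): confirm a directory exists and is writable,
# then print free space and free inodes for the filesystem it lives on.
check_dir() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  probe="$dir/.write_probe_$$"
  # Try to create an empty file; this fails if permissions or quota block writes.
  if ( : > "$probe" ) 2>/dev/null; then
    rm -f "$probe"
    echo "writable: $dir"
  else
    echo "NOT writable: $dir"
    return 1
  fi
  df -h "$dir" || true   # free space on the filesystem
  df -i "$dir" || true   # free inodes (many small part files can exhaust these)
  return 0
}
```

Usage would be something like check_dir /corral/mdacc/MCB24068/MC38/tmp followed by check_dir on the aligned_reads directory that holds the output BAM.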
You likely did not see the article I linked above. Please check that. You need to specify the temp dir as
Hi Max,
Alright. I did see it, I just assumed that adding the --tmp-dir /corral/mdacc/MCB24068/MC38/tmp/* argument would be doing the same thing.
But I will try it with --java-options "-Djava.io.tmpdir=/path/to/tmp".
Thank you
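For anyone landing here later, the suggestion above can be sketched roughly as follows. This assumes the Spark version of the tool is intended (the post and the stack trace refer to Spark, while the posted command calls MarkDuplicates), and /path/to/tmp and /path/to/aligned_reads are placeholders, not the real paths. The script only prints the command unless RUN=1 is set.

```shell
# Placeholder paths (assumptions); substitute your own directories.
GATK_TMP="/path/to/tmp"
ALIGNED_READS="/path/to/aligned_reads"

# Build the command as positional parameters so quoting is preserved.
# The JVM temp dir, GATK's --tmp-dir and Spark's local dir all point at GATK_TMP.
set -- gatk --java-options "-Djava.io.tmpdir=${GATK_TMP}" \
  MarkDuplicatesSpark \
  -I "${ALIGNED_READS}/ERR9880493_sorted.bam" \
  -O "${ALIGNED_READS}/ERR9880493_marked_duplicates.bam" \
  -M "${ALIGNED_READS}/ERR9880493_markdup_metrics.txt" \
  --tmp-dir "${GATK_TMP}" \
  --conf "spark.local.dir=${GATK_TMP}"

# Dry run by default: show what would be executed.
echo "Would run: $*"

# Set RUN=1 in the environment to actually launch GATK.
if [ "${RUN:-0}" = "1" ]; then
  "$@"
fi
```

Note that these settings only move temporary files. The .bam.parts directory is written next to the output BAM, so the output location needs enough quota as well.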