MarkDuplicatesSpark throwing disk quota errors
7 weeks ago

Hi everyone,

I am trying to run GATK MarkDuplicatesSpark on a computing cluster with a very large SAM file (131 GB), but about 11 hours into the run I get "Disk quota exceeded" errors. I already tried pointing my tmp directory at a location with 18P of storage, but the same error occurred.

gatk MarkDuplicates -I ${aligned_reads}/ERR9880493_sorted.bam -O ${aligned_reads}/ERR9880493_marked_duplicates.bam  -M ${aligned_reads}/ERR9880493_markdup_metrics.txt --conf spark.local.dir= /MC38/tmp --tmp-dir /MC38/tmp

These were the last 50 lines of the error message:

 at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2779)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1242)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1242)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1242)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3048)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2982)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2971)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:984)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2398)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2419)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2451)
    at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:83)
    ... 27 more
Caused by: java.io.FileNotFoundException: /corral/mdacc/MCB24068/MC38/CD_Genomics/WES/aligned_reads/ERR9880493_marked_duplicates.bam.parts/_temporary/0/_temporary/attempt_202410020716311268354538371933037_0046_r_006446_0/.part-r-06446.sbi (Disk quota exceeded)
        at java.base/java.io.FileOutputStream.open0(Native Method)
        at java.base/java.io.FileOutputStream.open(FileOutputStream.java:293)
        at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:235)
        at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:449)
        at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:412)
        at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:575)
        at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:564)
        at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:595)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:734)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:709)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1233)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1210)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1091)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1078)
        at org.disq_bio.disq.impl.formats.bam.HeaderlessBamOutputFormat$BamRecordWriter.<init>(HeaderlessBamOutputFormat.java:90)
        at org.disq_bio.disq.impl.formats.bam.HeaderlessBamOutputFormat.getRecordWriter(HeaderlessBamOutputFormat.java:192)
        at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:360)
        at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:126)
        at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:88)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:141)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)
10:28:42.902 INFO  ShutdownHookManager - Shutdown hook called
10:28:42.903 INFO  ShutdownHookManager - Deleting directory /corral/mdacc/MCB24068/MC38/tmp/spark-f75f8b1d-dc58-433e-a130-73f2dcbc37b2

I am stuck. I've been trying to generate my ERR9880493_sorted.bam for a few days now and have been unable to complete it successfully.

Any help would be greatly appreciated.


/MC38/tmp

Does that directory exist, and does your account have write permission to it? It looks like your job is actually trying to use /corral/mdacc/MCB24068/MC38/tmp/*.

Please see https://gatk.broadinstitute.org/hc/en-us/articles/18965297287067-How-to-setup-and-use-temporary-folder-for-GATK-local-execution on how to properly set tmp directories.
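The existence and write-permission checks asked about here can be sketched in a few lines of shell. This is only a rough sketch: `check_dir` is a hypothetical helper name, not part of GATK, and `df` reports filesystem-wide space and inodes, so a per-user, per-group, or per-project quota would have to be checked with whatever quota tool the cluster provides. Note that the failing path in the stack trace sits under the output directory (`ERR9880493_marked_duplicates.bam.parts/_temporary/...`), so it may be worth checking that location as well as the tmp directory.

```shell
#!/bin/sh
# Sketch: sanity-check a directory before pointing GATK/Spark at it.
# check_dir is a hypothetical helper, not a GATK command.
check_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "MISSING: $dir"
        return 1
    fi
    # Actually create and remove a file: permission bits alone do not prove
    # that a write will succeed when a quota is being enforced.
    if ! probe="$(mktemp "$dir/.write_probe.XXXXXX" 2>/dev/null)"; then
        echo "NOT WRITABLE: $dir"
        return 2
    fi
    rm -f "$probe"
    echo "OK: $dir"
    # Free space and free inodes on the filesystem holding the directory.
    # These are informational only, so failures here are ignored.
    df -h "$dir" || true
    df -i "$dir" || true
    return 0
}

# Example calls (paths taken from the thread):
# check_dir /corral/mdacc/MCB24068/MC38/tmp
# check_dir /corral/mdacc/MCB24068/MC38/CD_Genomics/WES/aligned_reads
```

A job that writes thousands of part files can run into a file-count limit even when plenty of bytes are free, which is why the sketch prints inode usage alongside free space.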


Hi Max,

Yes, the directory exists. I'm not sure why the full path got cut off when I was writing the post, but the correct path, and the one I used, is /corral/mdacc/MCB24068/MC38/tmp/*.

I'm using an HPC.


The directory has plenty of available storage and has drwxrwxrwx full permissions. I'm so lost.


You likely did not see the article I linked above. Please check that. You need to specify the temp dir as

gatk AnyToolName --java-options "-Djava.io.tmpdir=/path/to/tmp"
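Applied to the command from the original post, that might look like the sketch below. The paths are copied from the thread, the tool name `MarkDuplicatesSpark` is taken from the title (the posted command says `MarkDuplicates`), and the `RUN_GATK` switch is a hypothetical convenience so the command can be reviewed before it is launched. Whether this resolves the quota error is not confirmed anywhere in the thread.

```shell
#!/bin/sh
# Sketch only: paths are copied from the thread; adjust them for your cluster.
tmp_dir="/corral/mdacc/MCB24068/MC38/tmp"
aligned_reads="/corral/mdacc/MCB24068/MC38/CD_Genomics/WES/aligned_reads"
sample="ERR9880493"

# Store the command in the positional parameters so each argument stays intact.
# All three temp settings point at the same directory, and there is no space
# after '=' in the spark.local.dir property.
set -- gatk --java-options "-Djava.io.tmpdir=${tmp_dir}" \
    MarkDuplicatesSpark \
    -I "${aligned_reads}/${sample}_sorted.bam" \
    -O "${aligned_reads}/${sample}_marked_duplicates.bam" \
    -M "${aligned_reads}/${sample}_markdup_metrics.txt" \
    --tmp-dir "${tmp_dir}" \
    --conf "spark.local.dir=${tmp_dir}"

# Print the command for review; run it only when RUN_GATK=1 is set.
echo "$*"
if [ "${RUN_GATK:-0}" = "1" ]; then
    "$@"
fi
```

Run it once without `RUN_GATK` to inspect the printed command, then again with `RUN_GATK=1` to launch the job.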

Hi Max,

Alright. I did see it; I just assumed that adding the --tmp-dir /corral/mdacc/MCB24068/MC38/tmp/* argument would do the same thing.

But I will try it with --java-options "-Djava.io.tmpdir=/path/to/tmp".

Thank you
