Question

MarkDuplicates memory issue

1

Entering edit mode

8.9 years ago

cacampbell ▴ 60

Hello,

I am having major problems with getting MarkDuplicates to run to completion.

Relevant Programs:

- Java 1.8.0 64-bit

- picard-tools 2.0.1

I was under the impression that Picard's MarkDuplicates was relatively memory inexpensive (compared to other tools used in variant calling pipeline). However, MarkDuplicates consistently allocates arrays that exceed the VM limit. What is going on?

Sample Script (automatically generated):

echo '#!/usr/bin/env bash
java -Xms12G -Xmx14G -jar /path/to/picard-tools-2.0.1/picard.jar MarkDuplicates INPUT=[filepath].bwamem.sorted.bam OUTPUT=[filepath].bwamem.sorted.dedup.bam             REMOVE_DUPLICATES=true MAX_RECORDS_IN_RAM=350000 ASSUME_SORTED=true             METRICS_FILE=[filepath].dedup_metrics.txt' | sbatch --job-name=[job name] --time=0 --mem=24G --partition=bigmemh

Okay, so here is my reasoning with these parameters:

- Set the initial heap space a little lower than the maximum heap space to target 12G as the optimal memory to be used by the process (constantly adjusts memory consumption)

- Set maximum heap space to 14G, this should be plenty

- Set the max records in RAM to 1/10 the recommended amount by GATK, according to the documentation

- Set the memory allocated by the dispatcher to be much greater than the maximum heap space for the process

Still, this doesn't work, with the same error arising after a long time, usually just after the process has "completed" and before files are written (I guess).

I have used Heap sizes that are generally very low (this helps according to old forum posts): 2GB, 4GB, 6GB, 8GB, 12GB, 16GB, 20GB

... and I have also tried a variety of higher values: 30GB, 50GB, 80GB, 100GB, 200GB, 300GB, 400GB, 480GB (max RAM for a node on our cluster). All have the same result.

Here is a sample stack trace:

Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at htsjdk.samtools.SAMTextHeaderCodec.advanceLine(SAMTextHeaderCodec.java:131)
    at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:86)
    at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:503)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:166)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:125)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:261)
    at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:204)
    at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:291)
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:139)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

So, what parameters should I be using to get MarkDuplicates to run to completion? What am I doing wrong?

Don't know? Then what should I use instead of MarkDuplicates?

Thanks

__________________
EDIT
__________________

Some more information. It seems that my files have the problem of accumulating unmatched pairs very quickly. I guess this is caused by many records with mates that are far away on the same chromosome. I have no idea how to rectify this issue, but I am going to try to filter low quality reads before marking duplicates and see if that works. The problem is not actually memory, but poor memory usage in this particular case. I will try an even smaller MAX_RECORDS_IN_RAM value as well, as disk space is no concern.

alignment Picard MarkDuplicates • 13k views

ADD COMMENT • link 8.9 years ago by cacampbell ▴ 60

0

Entering edit mode

You might test (e.g., with /usr/bin/time -v) how much memory picard is actually using. We recently had an issue where the scheduler was killing things incorrectly, regardless of what sort of memory we requested.

ADD REPLY • link 8.9 years ago by Devon Ryan 104k

0

Entering edit mode

Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit

This is definitely an issue of Picard running out of memory, rather than the scheduler killing it. Picard is very inefficient, but it shouldn't need that much memory. How big is the bam file?

Sometime people set the JAVA_OPTS environment variable, which screws everything up. It overrides your -Xmx argument. Try echo $JAVA_OPTS to see if it is set. If so, clear it and you should be fine.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by Brian Bushnell 20k

0

Entering edit mode

Thanks, will try this and report back.

EDIT: JAVA_OPTS has not been set :/

These BAM files are 80GB at maximum (a single lane each)

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by cacampbell ▴ 60

0

Entering edit mode

hi,

One another optimization you could add is COMPRESSION_LEVEL=0

as one of the parameters. By default, Picard tries to make a compressed BAM (smaller size). Using that parameter makes the BAM larger in size but Picard then runs much faster. I am not sure if this would help in memory issue but give it a try.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by Amitm ★ 2.3k

0

Entering edit mode

Er, if it's running out of memory, how would increasing the file size solve the problem?

ADD REPLY • link 8.9 years ago by harold.smith.tarheel ★ 5.0k

1

Entering edit mode

Compression takes memory. Having said that, if it crashes with 480GB of RAM then the small amount required for compression isn't going to be the issue.

ADD REPLY • link 8.9 years ago by Devon Ryan 104k

0

Entering edit mode

I understand, but as you said, the overhead for compression << your average BAM file. So the rationale for proposing this solution escapes me, but I thought I'd ask.

ADD REPLY • link 8.9 years ago by harold.smith.tarheel ★ 5.0k

0

Entering edit mode

Thanks for taking the time to reply, I will try this.

ADD REPLY • link 8.9 years ago by cacampbell ▴ 60

0

Entering edit mode

If most of your reads are not properly paired, that will use much more memory. Perhaps you should verify that the files are ordered correctly. Some preprocessing steps (like using fastx toolkit for anything) will corrupt your files. What was the proper pairing rate, which would have been output by the mapping program? And what preprocessing steps did you take?

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by Brian Bushnell 20k

0

Entering edit mode

verify files are ordered correctly

Okay, sounds like a good plan.

We used bwa mem for alignment, which I believe should have around an 85% mapping rate. For preprocessing, we used FASTQC to inspect the reads and cutadapt to remove our adapters. The quality 'looked good', according to FASTQC.

So, not sure what the issue is -- but I am currently applying a filtering step to the Sorted BAM files to see if that alleviates the memory problem.

Sort I am using:

samtools view -F 0x04 -f 0x02 -bq 1 {in} > {out}

As I understand it (and, I am just a novice), this should remove all alignments with flag 4 and keep only alignments with flag 2 (redundant, but that's okay for peace of mind), and then filter quality scores of 1 (P_error ~ .8). 0x02, 0x2, 2 is PROPER_PAIR and 0x004, 0x4, 4 is UNMAP, according to samtools docs.

When this is done (actually taking a long time), then I will attempt to mark and remove duplicates again.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by cacampbell ▴ 60

0

Entering edit mode

But specifically -

When you ran Cutadapt, did you do each file individually, or both at once?
When you mapped with bwa-mem, what did it say was the rate of properly-paired reads?

Also, if you still have the fastq input to the mapping program, you can download BBMap and run this:

reformat.sh in1=read1.fq in2=read2.fq vpair

This will verify that, according to the read names, the reads are correctly paired (ordered correctly).

Lastly - what kind of reference are you mapping against? If it's a metagenome with millions of tiny contigs, this kind of thing (a very low proper pair mapping rate) is expected.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by Brian Bushnell 20k

0

Entering edit mode

Each file individually
Unfortunately, I wasn't the one who did these alignments, just the one who is responsible for getting SNPs out of them, so those output files are long gone, on some other cluster, at some other institution.

I do still have the original fastq files, so it is conceivable that I could redo this whole process if necessary. I will check that the pairs are ordered correctly according to BBMap. What should I be expecting? What can I do about the possible outputs?

Okay, so I think you are right on, the reference is from a large conifer species, and the reads that we got (from collaborators in China) for alignment were very quick and cheap. If I don't get MarkDuplicates to work here (and I haven't had issue with it on different projects), then I may just abandon this step and proceed with the filtered, sorted, reads.

Thanks for the help, I appreciate the donation of your time.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by cacampbell ▴ 60

0

Entering edit mode

You're welcome. Also, BBMap has a utility [repair.sh] that can re-pair the reads in the fastq files if they were processed incorrectly in a prior step. You can also, of course, simply skip duplicate removal - if the data was not amplified, there's no reason to do it (which is important to know).

Still, the best approach is to start with the raw reads and process them correctly as pairs at every stage, and then map them as pairs; incorrectly-ordered reads sent to an aligner as pairs will give you inferior results due to the "rescue" operation of mapping programs and heuristics that evaluate potential mapping locations, as well as variant-callers which give different weights to variants seen in properly-paired versus unpaired reads.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by Brian Bushnell 20k