Hello,
I am having major problems with getting MarkDuplicates to run to completion.
Relevant Programs:
- Java 1.8.0 64-bit
- picard-tools 2.0.1
I was under the impression that Picard's MarkDuplicates was relatively memory-inexpensive (compared to other tools in a variant-calling pipeline). However, MarkDuplicates consistently tries to allocate arrays that exceed the VM limit. What is going on?
Sample Script (automatically generated):
echo '#!/usr/bin/env bash
java -Xms12G -Xmx14G -jar /path/to/picard-tools-2.0.1/picard.jar MarkDuplicates \
    INPUT=[filepath].bwamem.sorted.bam \
    OUTPUT=[filepath].bwamem.sorted.dedup.bam \
    REMOVE_DUPLICATES=true \
    MAX_RECORDS_IN_RAM=350000 \
    ASSUME_SORTED=true \
    METRICS_FILE=[filepath].dedup_metrics.txt' \
| sbatch --job-name=[job name] --time=0 --mem=24G --partition=bigmemh
Okay, so here is my reasoning with these parameters:
- Set the initial heap (-Xms) a little below the maximum, so that ~12G is the target working size and the JVM can adjust its consumption as needed
- Set the maximum heap (-Xmx) to 14G, which should be plenty
- Set MAX_RECORDS_IN_RAM to 1/10 of the value recommended in the GATK documentation
- Set the memory requested from the scheduler (--mem=24G) to be much greater than the maximum heap of the process
Still, this doesn't work: the same error arises after a long time, usually just after the process has "completed" and before the output files are written (I guess).
I have tried heap sizes that are generally very low (this helps, according to old forum posts): 2GB, 4GB, 6GB, 8GB, 12GB, 16GB, 20GB
... and I have also tried a variety of higher values: 30GB, 50GB, 80GB, 100GB, 200GB, 300GB, 400GB, 480GB (the max RAM for a node on our cluster). All have the same result.
Here is a sample stack trace:
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at htsjdk.samtools.SAMTextHeaderCodec.advanceLine(SAMTextHeaderCodec.java:131)
    at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:86)
    at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:503)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:166)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:125)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:261)
    at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:204)
    at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:291)
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:139)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
So, what parameters should I be using to get MarkDuplicates to run to completion? What am I doing wrong?
Don't know? Then what should I use instead of MarkDuplicates?
Thanks
__________________
EDIT
__________________
Some more information: it seems my files accumulate unmatched pairs very quickly, which I guess is caused by many records whose mates are far away on the same chromosome. I have no idea how to rectify this, but I am going to try filtering low-quality reads before marking duplicates to see whether that works. The problem is not actually total memory, but poor memory usage in this particular case. I will also try an even smaller MAX_RECORDS_IN_RAM value, since disk space is no concern.
You might test (e.g., with /usr/bin/time -v) how much memory Picard is actually using. We recently had an issue where the scheduler was killing things incorrectly, regardless of how much memory we requested.

This is definitely an issue of Picard running out of memory, rather than the scheduler killing it. Picard is quite inefficient, but it shouldn't need that much memory. How big is the BAM file?
Sometimes people set the JAVA_OPTS environment variable, which screws everything up. It overrides your -Xmx argument. Try echo $JAVA_OPTS to see if it is set. If so, clear it and you should be fine.

Thanks, will try this and report back.
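A quick sketch of the check suggested above (nothing here is specific to Picard):

```shell
# If JAVA_OPTS is set it can override -Xmx, so check for it and clear it.
if [ -n "${JAVA_OPTS:-}" ]; then
    echo "JAVA_OPTS is set to: $JAVA_OPTS"
    unset JAVA_OPTS
else
    echo "JAVA_OPTS is not set"
fi
```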
EDIT:
JAVA_OPTS has not been set :/

These BAM files are 80GB at maximum (a single lane each).
Hi,
Another optimization you could add is COMPRESSION_LEVEL=0 as one of the parameters. By default, Picard writes a compressed BAM (smaller size). With that parameter the BAM is larger, but Picard runs much faster. I am not sure whether this helps with the memory issue, but give it a try.
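A sketch of what that suggestion would look like, reusing the MarkDuplicates invocation from earlier in the thread (all paths and filenames are placeholders):

```shell
# Same MarkDuplicates call as earlier in the thread, with output
# compression disabled via COMPRESSION_LEVEL=0 (placeholder paths).
java -Xms12G -Xmx14G -jar /path/to/picard-tools-2.0.1/picard.jar MarkDuplicates \
    INPUT=sample.bwamem.sorted.bam \
    OUTPUT=sample.bwamem.sorted.dedup.bam \
    METRICS_FILE=sample.dedup_metrics.txt \
    REMOVE_DUPLICATES=true \
    MAX_RECORDS_IN_RAM=350000 \
    ASSUME_SORTED=true \
    COMPRESSION_LEVEL=0
```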
Er, if it's running out of memory, how would increasing the file size solve the problem?
Compression takes memory. Having said that, if it crashes with 480GB of RAM then the small amount required for compression isn't going to be the issue.
I understand, but as you said, the memory overhead of compression is tiny compared to an average BAM file. So the rationale for proposing this solution escapes me, but I thought I'd ask.
Thanks for taking the time to reply, I will try this.
If most of your reads are not properly paired, that will use much more memory. Perhaps you should verify that the files are ordered correctly. Some preprocessing steps (like using fastx toolkit for anything) will corrupt your files. What was the proper pairing rate, which would have been output by the mapping program? And what preprocessing steps did you take?
Okay, sounds like a good plan.
We used bwa mem for alignment, which I believe should have around an 85% mapping rate. For preprocessing, we used FastQC to inspect the reads and cutadapt to remove our adapters. The quality "looked good", according to FastQC.
So, not sure what the issue is -- but I am currently applying a filtering step to the Sorted BAM files to see if that alleviates the memory problem.
Sort I am using:
As I understand it (and, I am just a novice), this should remove all alignments with flag 4 and keep only alignments with flag 2 (redundant, but that's okay for peace of mind), and then filter quality scores of 1 (P_error ~ .8). 0x02, 0x2, 2 is PROPER_PAIR and 0x004, 0x4, 4 is UNMAP, according to samtools docs.
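Since the actual command did not make it into the post, here is a hypothetical reconstruction of the filter described above (the tool choice and exact thresholds are my guesses, assuming samtools):

```shell
# Keep only properly-paired alignments (-f 2), drop unmapped reads (-F 4),
# and drop alignments with MAPQ below 2 (-q 2 removes MAPQ 0 and 1).
samtools view -b -f 2 -F 4 -q 2 \
    sample.bwamem.sorted.bam > sample.bwamem.sorted.filtered.bam
```

Requiring flag 2 already implies the read is mapped, so -F 4 is redundant, as the poster notes, but harmless.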
When this is done (actually taking a long time), then I will attempt to mark and remove duplicates again.
But specifically -
Also, if you still have the fastq input to the mapping program, you can download BBMap and run this:
This will verify that, according to the read names, the reads are correctly paired (ordered correctly).
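The exact command was lost from the post above; a plausible sketch, assuming BBMap's reformat.sh and its verifypaired flag (check reformat.sh --help for the current option names):

```shell
# Read both files as a pair and check that read names match up;
# reformat.sh aborts with an error if the files are out of sync.
reformat.sh in1=reads_1.fastq in2=reads_2.fastq verifypaired=t
```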
Lastly - what kind of reference are you mapping against? If it's a metagenome with millions of tiny contigs, this kind of thing (a very low proper pair mapping rate) is expected.
I do still have the original fastq files, so it is conceivable that I could redo this whole process if necessary. I will check that the pairs are ordered correctly according to BBMap. What should I be expecting? What can I do about the possible outputs?
Okay, so I think you are right: the reference is from a large conifer species, and the reads we got for alignment (from collaborators in China) were done very quickly and cheaply. If I can't get MarkDuplicates to work here (I haven't had issues with it on other projects), then I may just abandon this step and proceed with the filtered, sorted reads.
Thanks for the help, I appreciate the donation of your time.
You're welcome. Also, BBMap has a utility [repair.sh] that can re-pair the reads in the fastq files if they were processed incorrectly in a prior step. You can also, of course, simply skip duplicate removal - if the data was not amplified, there's no reason to do it (which is important to know).
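A sketch of how the repair.sh utility mentioned above might be invoked (flag names from memory; consult repair.sh --help):

```shell
# Re-pair reads whose order was broken by earlier processing;
# orphaned reads with no surviving mate go to the singletons file.
repair.sh in1=reads_1.fastq in2=reads_2.fastq \
    out1=fixed_1.fastq out2=fixed_2.fastq outs=singletons.fastq repair
```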
Still, the best approach is to start with the raw reads, process them correctly as pairs at every stage, and then map them as pairs. Incorrectly-ordered reads sent to an aligner as pairs will give you inferior results, both because of the "rescue" operation of mapping programs and the heuristics that evaluate potential mapping locations, and because variant callers give different weights to variants seen in properly-paired versus unpaired reads.