I'm trying to improve the performance of MarkDuplicates when processing a BAM file. I am running on a 12-core box with 64 GB of RAM, and I have been using the following Picard command on my BAM file:
/usr/bin/java -Xmx10g -XX:-UseGCOverheadLimit -jar $PICARD_HOME/picard-1_42/MarkDuplicates.jar \
    METRICS_FILE=rmdup_metrics.txt COMPRESSION_LEVEL=1 \
    INPUT=merged.bam OUTPUT=dedup_clpc.bam \
    REMOVE_DUPLICATES=True ASSUME_SORTED=True VALIDATION_STRINGENCY=LENIENT
Are there any threading options that might increase performance? I also tried indexing the BAM file with samtools prior to running MarkDuplicates, using this command:
$SAM_TOOLS_HOME/samtools index merged.bam
which produced a 'merged.bam.bai' file, but this had no effect on performance.
Are there any other options for pre-processing the BAM file that might improve the performance of MarkDuplicates?
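For example, would tuning the standard Picard options MAX_RECORDS_IN_RAM and TMP_DIR be worth trying? Something like the following, where the buffer value and the scratch path are just guesses on my part:

# MAX_RECORDS_IN_RAM trades RAM for fewer temp-file spills; TMP_DIR should point at fast local disk
/usr/bin/java -Xmx10g -XX:-UseGCOverheadLimit -jar $PICARD_HOME/picard-1_42/MarkDuplicates.jar \
    METRICS_FILE=rmdup_metrics.txt COMPRESSION_LEVEL=1 \
    INPUT=merged.bam OUTPUT=dedup_clpc.bam \
    REMOVE_DUPLICATES=True ASSUME_SORTED=True VALIDATION_STRINGENCY=LENIENT \
    MAX_RECORDS_IN_RAM=5000000 TMP_DIR=/scratch/tmp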
Out of curiosity, why are you using ASSUME_SORTED? I had problems whereby MarkDuplicates wouldn't recognise a samtools-sorted file as being sorted. The problem disappeared when I sorted with Picard SortSam instead, and I don't have to use the lenient validation anymore either.
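In case it helps, my SortSam invocation looks something like this; the heap size and file names below are placeholders rather than anything I'd insist on:

# coordinate-sort the BAM with Picard so MarkDuplicates recognises the sort order
/usr/bin/java -Xmx4g -jar $PICARD_HOME/picard-1_42/SortSam.jar \
    INPUT=merged_unsorted.bam OUTPUT=merged.bam SORT_ORDER=coordinate

After that, MarkDuplicates accepted the output as sorted for me without ASSUME_SORTED=True or VALIDATION_STRINGENCY=LENIENT.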