I'm trying to improve the performance of MarkDuplicates when processing a BAM file. I am running on a 12-core box with 64 GB of RAM, and I have been using the following Picard command on my BAM file:
/usr/bin/java -Xmx10g -XX:-UseGCOverheadLimit -jar $PICARD_HOME/picard-1_42/MarkDuplicates.jar \
    METRICS_FILE=rmdup_metrics.txt COMPRESSION_LEVEL=1 \
    INPUT=merged.bam OUTPUT=dedup_clpc.bam \
    REMOVE_DUPLICATES=True ASSUME_SORTED=True VALIDATION_STRINGENCY=LENIENT
Are there any threading options that might increase performance? I also tried indexing the BAM file with samtools prior to running MarkDuplicates, using this command:
$SAM_TOOLS_HOME/samtools index merged.bam
which produced a 'merged.bam.bai' file, but this had no effect on performance.
Are there any other options for pre-processing the BAM file that might improve the performance of MarkDuplicates?
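For example, would tuning the standard Picard options MAX_RECORDS_IN_RAM and TMP_DIR be worth trying? Something like the following, where the buffer value and the scratch path are just guesses on my part:

# MAX_RECORDS_IN_RAM trades RAM for fewer temp-file spills; TMP_DIR should point at fast local disk
/usr/bin/java -Xmx10g -XX:-UseGCOverheadLimit -jar $PICARD_HOME/picard-1_42/MarkDuplicates.jar \
    METRICS_FILE=rmdup_metrics.txt COMPRESSION_LEVEL=1 \
    INPUT=merged.bam OUTPUT=dedup_clpc.bam \
    REMOVE_DUPLICATES=True ASSUME_SORTED=True VALIDATION_STRINGENCY=LENIENT \
    MAX_RECORDS_IN_RAM=5000000 TMP_DIR=/scratch/tmp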
Out of curiosity, why are you using ASSUME_SORTED? I had problems whereby MarkDuplicates wouldn't recognise a samtools-sorted file as being sorted. The problem disappeared when I sorted with Picard SortSam instead, and I don't have to use the lenient validation anymore either.
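In case it helps, my SortSam invocation looks something like this; the heap size and file names below are placeholders rather than anything I'd insist on:

# coordinate-sort the BAM with Picard so MarkDuplicates recognises the sort order
/usr/bin/java -Xmx4g -jar $PICARD_HOME/picard-1_42/SortSam.jar \
    INPUT=merged_unsorted.bam OUTPUT=merged.bam SORT_ORDER=coordinate

After that, MarkDuplicates accepted the output as sorted for me without ASSUME_SORTED=True or VALIDATION_STRINGENCY=LENIENT.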