Question

How to speed up Picard MarkDuplicates, or is there any faster software?

2


8.0 years ago

lghust2011 ▴ 110

I use Picard to mark duplicates, and here is my command:

java -d64 -server -XX:+UseParallelGC -XX:ParallelGCThreads=2 -Xms8g  -Xmx16g  -Djava.io.tmpdir=tmp -jar ./picard.jar MarkDuplicates I=input.bam O=out_markdup.bam METRICS_FILE=out.metrics ASO=coordinate VALIDATION_STRINGENCY=LENIENT

It works well, but it becomes very slow as input.bam gets bigger. I found that Picard MarkDuplicates doesn't support multiple threads. So, is there any way to speed up Picard? Alternatively, is there any software that does the same job as Picard MarkDuplicates in less time? I know elprep is another choice, but it needs a very large amount of memory!

Besides, I found that samtools can also remove duplicates, but according to my search, samtools cannot remove duplicates that span different chromosomes, so Picard is more universal.

Any reply will be much appreciated!

genome alignment sequence • 7.6k views

ADD COMMENT • link updated 8.0 years ago by GenoMax 151k • written 8.0 years ago by lghust2011 ▴ 110

0


MarkDuplicates supports multiple garbage-collection (GC) threads:

-XX:ParallelGCThreads=<number of threads>

ADD REPLY • link 8.0 years ago by James Ashmore ★ 3.5k

score 2 · Accepted Answer · 2017-04-26

2


8.0 years ago

Pierre Lindenbaum 166k

So, is there any way to speed up Picard?

1) Picard keeps records in memory until it has to flush them to disk. The more records it can hold in memory (option MAX_RECORDS_IN_RAM), the fewer I/O operations it needs.

2) Use sambamba-markdup http://lomereiter.github.io/sambamba/docs/sambamba-markdup.html (not tested)
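As a rough sketch of both suggestions, the wrapper below builds the original Picard command with MAX_RECORDS_IN_RAM raised, plus a multi-threaded sambamba markdup call. The value 2000000, the thread count, the file names, and the `run`, `picard_markdup` and `sambamba_markdup` helpers are illustrative assumptions, not tested recommendations; a larger MAX_RECORDS_IN_RAM also needs enough Java heap (-Xmx) to hold the records. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 in the environment to actually run the tools.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Illustrative values (assumptions): tune them to your data and machine.
MAX_RECORDS=2000000   # records Picard keeps in RAM before spilling to disk
THREADS=4             # worker threads for sambamba

# usage: picard_markdup in.bam out.bam metrics.txt
picard_markdup() {
  run java -Xmx16g -Djava.io.tmpdir=tmp -jar ./picard.jar MarkDuplicates \
    I="$1" O="$2" METRICS_FILE="$3" \
    ASO=coordinate VALIDATION_STRINGENCY=LENIENT \
    MAX_RECORDS_IN_RAM="$MAX_RECORDS"
}

# usage: sambamba_markdup in.bam out.bam
sambamba_markdup() {
  run sambamba markdup -t "$THREADS" --tmpdir=tmp "$1" "$2"
}

picard_markdup input.bam out_markdup.bam out.metrics
sambamba_markdup input.bam out_markdup_sambamba.bam
```

Running the script as is only shows the two command lines, which makes it easy to check them before committing to a long job.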

ADD COMMENT • link 8.0 years ago by Pierre Lindenbaum 166k


I am currently developing a toolkit, so it is not yet meant to be "browsed too much", but its deduplicate module for paired-end reads is complete. I would be interested in a test from another user, if you want to give it a try! https://github.com/MatteoSchiavinato/SPIQR

The principle is that it removes PCR duplicates by sequence identity rather than by mapping position, so it speeds up the mapping step as well, not only the deduplication. On our data, the overlap between the reads deduplicated by this tool and by Picard was ~99%.
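To make the idea of deduplicating by sequence identity concrete, here is a minimal sketch in shell and awk. It is not the SPIQR implementation; the `dedup_pairs` function and its file arguments are hypothetical. It assumes uncompressed, four-line-per-record paired FASTQ files with reads in the same order and no tab characters, and it keeps the first pair seen for each exact (read 1, read 2) sequence combination. All unique sequence pairs are held in memory, so this only illustrates the principle on modest inputs.

```shell
#!/bin/sh
# usage: dedup_pairs R1.fq R2.fq out_R1.fq out_R2.fq
dedup_pairs() {
  # paste puts line N of R1 and line N of R2 side by side, tab-separated.
  paste "$1" "$2" | awk -F'\t' -v o1="$3" -v o2="$4" '
    # Buffer the current record: index 1=header, 2=sequence, 3=plus, 0=quality.
    { r1[NR % 4] = $1; r2[NR % 4] = $2 }
    # On the fourth line of each record, decide whether to keep the pair.
    NR % 4 == 0 {
      key = r1[2] "\t" r2[2]          # exact sequences of both mates
      if (!(key in seen)) {
        seen[key] = 1
        printf("%s\n%s\n%s\n%s\n", r1[1], r1[2], r1[3], r1[0]) > o1
        printf("%s\n%s\n%s\n%s\n", r2[1], r2[2], r2[3], r2[0]) > o2
      }
    }'
}
```

A call such as `dedup_pairs reads_R1.fq reads_R2.fq dedup_R1.fq dedup_R2.fq` would write the surviving pairs to the two output files. Exact matching means pairs that differ by a single sequencing error are not collapsed.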


2


8.0 years ago

GenoMax 151k

Use clumpify.sh from the BBMap suite (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."). You do not need to align the data. Clumpify can address all types of duplicates (PCR, optical and so on).
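A possible invocation on paired-end FASTQ files is sketched below. The file names and the `run` and `clumpify_dedupe` helpers are assumptions for illustration; only the basic `dedupe` switch is shown, and the BBTools documentation should be checked for the options that control optical-duplicate handling. The command is only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry run by default: print the command instead of executing it.
# Set DRY_RUN=0 in the environment to actually run clumpify.sh.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# usage: clumpify_dedupe in_R1 in_R2 out_R1 out_R2
clumpify_dedupe() {
  # dedupe asks Clumpify to remove duplicate reads while it reorders them.
  run clumpify.sh in="$1" in2="$2" out="$3" out2="$4" dedupe
}

clumpify_dedupe reads_R1.fq.gz reads_R2.fq.gz dedup_R1.fq.gz dedup_R2.fq.gz
```

Because this works on raw reads, the deduplicated FASTQ files would then go into the aligner, so there is no MarkDuplicates step on the BAM afterwards.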

ADD COMMENT • link 8.0 years ago by GenoMax 151k