Speed up Picard MarkDuplicates or any better software?
7.6 years ago
lghust2011 ▴ 110

I use Picard to mark duplicates, and here is my command:

java -d64 -server -XX:+UseParallelGC -XX:ParallelGCThreads=2 -Xms8g  -Xmx16g  -Djava.io.tmpdir=tmp -jar ./picard.jar MarkDuplicates I=input.bam O=out_markdup.bam METRICS_FILE=out.metrics ASO=coordinate VALIDATION_STRINGENCY=LENIENT

It works well, but when the input.bam file gets bigger it becomes very slow. I found that Picard MarkDuplicates doesn't support multiple threads. So, is there any way to speed up Picard? Alternatively, is there any better software that does the same job as Picard MarkDuplicates in less time? I know elPrep is another choice, but it needs a very large amount of memory.

Besides, I found that samtools can also remove duplicates, but according to my search, samtools cannot remove duplicates whose reads span different chromosomes, so Picard is more general.

Any reply will be much appreciated!

genome alignment sequence • 7.2k views

The JVM that runs MarkDuplicates supports multiple garbage-collection (GC) threads:

-XX:ParallelGCThreads=<number of threads>
7.6 years ago

So, is there any way to speed up Picard?

1) Picard stores records in memory until it needs to flush them to disk. The more records it can keep in memory (option MAX_RECORDS_IN_RAM), the fewer I/O operations it needs.

2) use sambamba-markdup http://lomereiter.github.io/sambamba/docs/sambamba-markdup.html (not tested)
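A minimal sketch of both suggestions, wrapped in shell functions so they can be reused. The function names, file names, the 2,000,000-record buffer, and the thread count are illustrative assumptions, not tuned values. Raising MAX_RECORDS_IN_RAM increases heap use, so -Xmx has to be large enough for it.

```shell
# Sketch only: function names, the record count, and the thread count are
# illustrative assumptions, not tuned recommendations.

# 1) Picard with a larger in-memory record buffer.
#    Arguments: input BAM, output BAM, metrics file.
picard_markdup() {
    java -Xmx16g -Djava.io.tmpdir=tmp -jar ./picard.jar MarkDuplicates \
        I="$1" O="$2" METRICS_FILE="$3" \
        MAX_RECORDS_IN_RAM=2000000 \
        VALIDATION_STRINGENCY=LENIENT
}

# 2) sambamba markdup, which can use several threads (-t).
#    Arguments: input BAM, output BAM.
sambamba_markdup() {
    sambamba markdup -t 4 --tmpdir=tmp "$1" "$2"
}

# Usage (uncomment one):
# picard_markdup input.bam out_markdup.bam out.metrics
# sambamba_markdup input.bam out_markdup.bam
```

It is worth timing both on a subset of the data first, since the gain from a bigger buffer depends on how much of the run is spent spilling to disk.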

7.6 years ago

I have a toolkit that I am currently developing, so it is not yet meant to be "browsed too much", but its deduplication module is complete for paired-end reads. I would be interested in a test from another user, if you want to give it a try: https://github.com/MatteoSchiavinato/SPIQR

The principle is that it removes PCR duplicates by sequence identity and not by mapping position, so it runs before alignment and speeds up the mapping step as well, not only the deduplication. We tried it with our data, and the overlap between the reads deduplicated by this tool and by Picard was ~99%.
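To illustrate the general idea of deduplicating by sequence identity, here is a minimal sketch. It is not the SPIQR code: the function name is hypothetical, it assumes uncompressed 4-line FASTQ files with R1 and R2 in the same order, it keeps the first pair seen for each exact (R1 sequence, R2 sequence) combination, and it holds every distinct pair in memory.

```shell
# Sketch only, not the SPIQR implementation.
# Arguments: R1 input, R2 input, R1 output, R2 output (uncompressed FASTQ).
dedup_pairs_by_sequence() {
    # Make sure both outputs exist even if no pair is written.
    : > "$3"
    : > "$4"
    awk -v r2="$2" -v o1="$3" -v o2="$4" '
    {
        # Current line is the R1 header; read the other three R1 lines.
        h1 = $0
        if ((getline s1) <= 0 || (getline p1) <= 0 || (getline q1) <= 0) exit 1
        # Read the matching four-line record from the R2 file.
        if ((getline h2 < r2) <= 0 || (getline s2 < r2) <= 0 ||
            (getline p2 < r2) <= 0 || (getline q2 < r2) <= 0) exit 1
        # A pair is a duplicate when both sequences match an earlier pair exactly.
        key = s1 SUBSEP s2
        if (!(key in seen)) {
            seen[key] = 1
            printf "%s\n%s\n%s\n%s\n", h1, s1, p1, q1 > o1
            printf "%s\n%s\n%s\n%s\n", h2, s2, p2, q2 > o2
        }
    }' "$1"
}

# Usage:
# dedup_pairs_by_sequence reads_R1.fq reads_R2.fq dedup_R1.fq dedup_R2.fq
```

Exact matching like this misses duplicates that differ by a sequencing error, which is one reason a real tool may not agree completely with a position-based method.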

7.6 years ago
GenoMax 148k

Use clumpify.sh from the BBMap suite (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."). You do not need to align the data. Clumpify can address all types of duplicates (PCR, optical, and so on).
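A minimal sketch of how this might look for paired-end FASTQ files. The CLUMPIFY variable, the function names, and the file names are assumptions for illustration. The dupedist value is platform-specific, so 40 here is only a placeholder to check against the Clumpify documentation for your sequencer.

```shell
# Sketch only: CLUMPIFY, the function names, and dupedist=40 are assumptions.
# CLUMPIFY lets the path to clumpify.sh be overridden.
CLUMPIFY="${CLUMPIFY:-clumpify.sh}"

# Remove duplicate pairs from unaligned paired-end reads.
# Arguments: R1 input, R2 input, R1 output, R2 output.
clumpify_dedupe_paired() {
    "$CLUMPIFY" in1="$1" in2="$2" out1="$3" out2="$4" dedupe
}

# Remove only optical duplicates (pairs that are also close together on the
# flowcell). Same arguments as above.
clumpify_dedupe_optical() {
    "$CLUMPIFY" in1="$1" in2="$2" out1="$3" out2="$4" dedupe optical dupedist=40
}

# Usage:
# clumpify_dedupe_paired reads_R1.fq.gz reads_R2.fq.gz dedup_R1.fq.gz dedup_R2.fq.gz
```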
