Question

Split a BAM file by chromosome to speed up Picard MarkDuplicates

1

8.2 years ago

lghust2011 ▴ 110

I use Picard to mark duplicates and found that Picard does not support multiple threads, so it is very slow. To speed it up, I want to split the BAM file by chromosome and then run Picard on each file. The problem is that, as far as I know, Picard has an advantage over samtools rmdup because it can mark cross-chromosome duplicates. So if I split the BAM file by chromosome, how much will this affect the result? Here is my reasoning:

A pair of reads must come from the same DNA fragment, so the two reads normally map to the same chromosome. Sometimes, however, the two reads map to different chromosomes, perhaps because of a structural variation or a repeat such as a microsatellite. If I only want to call SNVs and indels, can I ignore the cross-chromosome duplicates? Please let me know if there is anything wrong with my reasoning. Any reply will be much appreciated!

markduplicates next-gen sequencing alignment • 3.5k views

ADD COMMENT • link updated 8.2 years ago by Pierre Lindenbaum 166k • written 8.2 years ago by lghust2011 ▴ 110

0

Alternatively, if the effect is significant, how can I compensate for it?

ADD REPLY • link 8.2 years ago by lghust2011 ▴ 110

0

You can alternatively use Clumpify, which does duplicate-marking or duplicate-removal prior to mapping and is extremely fast.

ADD REPLY • link 8.2 years ago by Brian Bushnell 20k

score 1 · Answer 1 · 2017-05-16

1

8.2 years ago

Pierre Lindenbaum 166k

You could split your BAM into both-mapped-chr1.bam, both-mapped-chr2.bam, both-mapped-chr3.bam, (..), each holding the pairs whose two reads map to that chromosome, plus an 'others.bam' for everything else.

However, I don't know whether creating those new BAMs will reduce the computing time.
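
The split suggested above could be sketched roughly as follows. This is only an illustration working on SAM text lines, not a tested pipeline: the function names `bucket_for_sam_line` and `split_sam` are hypothetical, and sending single-end mapped reads to their own chromosome is an assumption. It also does not try to keep secondary or supplementary alignments together with their primary records, which a real split would probably need to handle.

```python
from collections import defaultdict

# SAM FLAG bits, as defined in the SAM specification
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def bucket_for_sam_line(line):
    """Return the chromosome name if the read and its mate both map to it,
    otherwise 'others'."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: FLAG
    rname = fields[2]       # column 3: reference name of this read
    rnext = fields[6]       # column 7: reference name of the mate ('=' means same)

    if flag & UNMAPPED or rname == "*":
        return "others"
    if flag & PAIRED:
        if flag & MATE_UNMAPPED:
            return "others"
        if rnext not in ("=", rname):
            return "others"  # mates on different chromosomes
    # Assumption: a mapped single-end read goes with its own chromosome.
    return rname


def split_sam(lines):
    """Group SAM records by bucket; header lines are returned separately
    so they can be written to every output file."""
    header = []
    buckets = defaultdict(list)
    for line in lines:
        if line.startswith("@"):
            header.append(line)
        else:
            buckets[bucket_for_sam_line(line)].append(line)
    return header, dict(buckets)
```

The input lines could come from something like `samtools view -h in.bam`, with each bucket then written back out together with the header and converted to BAM. Whether this actually ends up faster than a single Picard run is the open question raised above.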

ADD COMMENT • link 8.2 years ago by Pierre Lindenbaum 166k