Question

Samtools: removing PCR duplicates

10.3 years ago

devinliao0918 ▴ 40

Could anyone explain the difference between the options -s and -S for "samtools rmdup"? In addition, is it standard practice to use -sS to remove duplicate reads?

I recently tried to remove the duplicates in one BAM file. After running the command "samtools rmdup -sS in.nameSrt.bam out.bam", the size of the BAM file decreased from 11G to 5.2G, and the log showed that 52.48% of the reads had been removed. I'm really worried about the massive amount of data loss.

By the way, one of my goals in the downstream analysis is to call genotypes and detect SNPs.

next-gen • 14k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by devinliao0918 ▴ 40

Ram · Answer 1 · 2014-07-22

10.3 years ago

dariober 15k

I would recommend using Picard MarkDuplicates. See also http://seqanswers.com/forums/showthread.php?t=5424.

High duplication might be expected if you sequenced quite deep. As an extreme example, duplication in RNA-Seq is quite high, but that is expected. I would look at some regions in a genome browser (e.g. IGV) to get a feel for whether reads are spread fairly uniformly or tend to cluster in stacks of reads, which would suggest over-amplification.
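As a rough numeric companion to the visual check, here is a minimal Python sketch that counts how many reads share the same start position and strand. It works on toy tuples rather than a real BAM file, the function name `duplication_summary` is made up for illustration, and keying on position alone is a simplification: real duplicate-marking tools also consider things like mate position.

```python
from collections import Counter


def duplication_summary(reads):
    """Summarise how often reads share the same (chrom, start, strand).

    `reads` is an iterable of (chrom, start, strand) tuples. This is a toy
    stand-in for alignments read from a BAM file.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    unique = len(counts)
    # Every read beyond the first at a given position counts as a duplicate.
    duplicate_fraction = (total - unique) / total if total else 0.0
    largest_stack = max(counts.values()) if counts else 0
    return {
        "total": total,
        "unique_positions": unique,
        "duplicate_fraction": duplicate_fraction,
        "largest_stack": largest_stack,
    }


# Toy data: one stack of five identical reads plus three singletons.
reads = [("chr1", 100, "+")] * 5 + [
    ("chr1", 150, "+"),
    ("chr1", 180, "-"),
    ("chr1", 220, "+"),
]
summary = duplication_summary(reads)
print(summary)
```

A few very tall stacks point towards over-amplification, whereas many small stacks spread across a deeply sequenced region are less alarming.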

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 10.3 years ago by dariober 15k

The question is about single-end reads (-s). Does MarkDuplicates work with SE data?

ADD REPLY • link 10.3 years ago by Pierre Lindenbaum 164k

Do you know the difference between the two options -s and -S? Could I use -s only to avoid too much data loss?

ADD REPLY • link 10.3 years ago by devinliao0918 ▴ 40

Are you really using single-end data?

ADD REPLY • link 10.3 years ago by Pierre Lindenbaum 164k

No, the sequencing was done on an Illumina HiSeq 2000, which should generate paired-end data.

ADD REPLY • link 10.3 years ago by devinliao0918 ▴ 40

So you don't have to deal with the -s options, and you should use MarkDuplicates. From the samtools manual (http://samtools.sourceforge.net/samtools.shtml):

"Samtools paired-end rmdup does not work for unpaired reads (e.g. orphan reads or ends mapped to different chromosomes). If this is a concern, please use Picard's MarkDuplicates which correctly handles these cases, although a little slower."
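To illustrate why this matters for paired-end data, here is a simplified Python sketch. It is not samtools' actual algorithm, and the dictionaries and function names are hypothetical. It compares a single-end style duplicate key (the read's own position and strand) with a pair-aware key that also includes the mate position. The rmdup usage text describes -S as treating PE reads as SE, so the single-end key is roughly what -sS would apply to every read.

```python
def single_end_key(read):
    # Single-end view: only the read's own position and strand.
    return (read["chrom"], read["pos"], read["strand"])


def paired_end_key(read):
    # Pair-aware view: the mate's position is part of the identity.
    return (
        read["chrom"],
        read["pos"],
        read["strand"],
        read["mate_chrom"],
        read["mate_pos"],
    )


def count_kept(reads, key_func):
    """Number of reads left if one read is kept per distinct key."""
    return len({key_func(r) for r in reads})


# Toy reads: all start at chr1:100 on the + strand, but come from
# fragments of different lengths (different mate positions).
reads = [
    {"chrom": "chr1", "pos": 100, "strand": "+", "mate_chrom": "chr1", "mate_pos": 300},
    {"chrom": "chr1", "pos": 100, "strand": "+", "mate_chrom": "chr1", "mate_pos": 300},
    {"chrom": "chr1", "pos": 100, "strand": "+", "mate_chrom": "chr1", "mate_pos": 350},
    {"chrom": "chr1", "pos": 100, "strand": "+", "mate_chrom": "chr1", "mate_pos": 420},
]

kept_se = count_kept(reads, single_end_key)
kept_pe = count_kept(reads, paired_end_key)
print(f"kept with single-end key: {kept_se}")
print(f"kept with pair-aware key: {kept_pe}")
```

With the single-end key, all four reads collapse into one. The pair-aware key keeps three, because only the first two come from the same fragment. This may be part of why -sS on paired-end data removes so many reads.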

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.3 years ago by Pierre Lindenbaum 164k

I would like to try MarkDuplicates if I could. However, I need to process thousands of Bam files and my pipeline relies heavily on Samtools.

ADD REPLY • link 10.3 years ago by devinliao0918 ▴ 40

In principle I don't see any problem with passing a file to MarkDuplicates instead of samtools. Maybe you should give more details about your pipeline.
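For what it's worth, a minimal sketch of how a batch of BAM files could be handed to MarkDuplicates from a Python driver script. The jar location (`picard.jar`), the output directory, the file names and the helper `markduplicates_command` are all assumptions for illustration. The `I=`, `O=` and `M=` arguments follow Picard's documented MarkDuplicates usage. Picard also expects sorted input, so check that your BAM files meet its requirements first.

```python
from pathlib import Path


def markduplicates_command(bam, out_dir, picard_jar="picard.jar"):
    """Build (but do not run) a Picard MarkDuplicates command for one BAM.

    `picard_jar` and the output naming scheme are assumptions; adjust them
    to match your installation and pipeline conventions.
    """
    bam = Path(bam)
    out_dir = Path(out_dir)
    stem = bam.stem
    return [
        "java", "-jar", str(picard_jar), "MarkDuplicates",
        f"I={bam}",
        f"O={out_dir / (stem + '.markdup.bam')}",
        f"M={out_dir / (stem + '.markdup_metrics.txt')}",
    ]


# Hypothetical input files; in practice this list might come from a glob.
bams = ["sample1.bam", "sample2.bam"]
commands = [markduplicates_command(b, "dedup") for b in bams]

for cmd in commands:
    print(" ".join(cmd))
    # To actually run each command:
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Because the commands are built as plain lists, the same loop can be dropped into an existing samtools-based pipeline or dispatched to a job scheduler.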

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by dariober 15k