Question

How to set a cutoff value when de-duplicating

1

10.5 years ago

chxu02 ▴ 10

I'm doing BS-seq on ChIP DNA. Getting 500M reads from <1 ng of ChIP DNA means the duplication level is huge: FastQC reports duplication rates of 39% and 66% for my two libraries. In my case, I think the proper way to de-duplicate is to set a cutoff value, say 5, so that some duplication is tolerated (some of it PCR duplication, some possibly amplification of distinct DNA fragments with identical ends). How can I do this in a customized way? The reads are paired-end, and I'd prefer to start from an alignment file (BAM/SAM).

sequencing alignment • 2.2k views

updated 3.3 years ago by Ram 45k • written 10.5 years ago by chxu02 ▴ 10

score 1 · Answer 1 · 2015-02-04

There's no generally applicable way to deduplicate targeted sequencing data (the same is true for things like RRBS). You can set a threshold if you want, but then you'll have to tailor it to each experiment and write a program to do it yourself. Traditionally, one simply doesn't deduplicate such datasets, since many of the reads marked as duplicates would be false positives, i.e. reads from distinct fragments that happen to share the same ends.
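If you do want to try a threshold, something along these lines might be a starting point. This is only a rough sketch, not an established tool: the function name `capped_dedup` and the `max_copies` parameter are made up for illustration. It assumes plain SAM text lines as input (e.g. from `samtools view -h`), defines a fragment by chromosome, leftmost position, template length and the strand of read 1, ignores soft-clipping, and drops anything that is not a properly paired alignment. It also holds all lines and kept read names in memory, which will not scale to 500M reads without reworking.

```python
from collections import defaultdict


def capped_dedup(sam_lines, max_copies=5):
    """Keep at most `max_copies` read pairs per fragment signature.

    Hypothetical helper: returns header lines plus every record whose
    read name belongs to a kept pair. Pairs that are not properly
    paired are dropped (an assumption of this sketch).
    """
    sam_lines = list(sam_lines)
    seen = defaultdict(int)   # fragment signature -> pairs kept so far
    keep = set()              # read names of kept pairs

    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Decide per pair using read 1 only: it must be properly paired
        # (0x2) and not unmapped/secondary/supplementary (0x4|0x100|0x800).
        if not (flag & 0x40) or not (flag & 0x2) or (flag & 0x904):
            continue
        rname, pos = fields[2], int(fields[3])
        pnext, tlen = int(fields[7]), int(fields[8])
        # Signature: chromosome, leftmost start, fragment length, and
        # read 1 strand (relevant for bisulfite data).
        key = (rname, min(pos, pnext), abs(tlen), bool(flag & 0x10))
        if seen[key] < max_copies:
            seen[key] += 1
            keep.add(fields[0])

    return [
        line for line in sam_lines
        if line.startswith("@") or line.split("\t", 1)[0] in keep
    ]
```

You would feed it the lines of a SAM file and write the returned lines back out, then pick `max_copies` per experiment by looking at how the per-position copy counts are distributed in your own libraries.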