Hi all,
I have recently been working on a reduced representation bisulfite sequencing (RRBS) project where I had to run 15-18 cycles of PCR to get libraries to a sufficient concentration for sequencing. Because of this, I ended up with a high proportion of duplicated sequences: 75-90% as estimated by FastQC. While 75-90% duplication is obviously an issue, I am having a hard time finding an "expected" range for the percentage of duplicated sequences in RRBS. Given that RRBS data is low-diversity by nature and FastQC works under the assumption that libraries are highly diverse, I am curious at what point other people start considering a deduplication step. All I could find regarding this range was the FastQC example report for RRBS (duplication levels of ~25%) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/RRBS_fastqc.html) and this post on Biostars reporting 40% of reads being duplicated in an RRBS experiment (PCR duplicates in RRBS data).
Does anyone have a reference for a specific range, or any insight into what could be considered "normal" levels of duplication for RRBS?
Thank you in advance!
I have worked with 75-90% duplication (as reported by FastQC), analyzing with Bismark, and have never performed de-duplication, as it is not recommended for RRBS data when using Bismark.
For example, check out this tutorial by Felix Krueger and Simon R. Andrews (the Bismark authors), where they note that duplication levels of up to 95% can be "normal" for RRBS (the tutorial is a bit dated, admittedly).
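To illustrate, here is a minimal sketch of the kind of RRBS workflow I mean, with the deduplication step simply left out. File names and paths are placeholders; the intermediate names follow Trim Galore's and Bismark's default output naming and may differ depending on your versions and options:

    # adapter/quality trimming in RRBS mode (handles end-repair artifacts at MspI sites)
    trim_galore --rrbs --paired sample_R1.fastq.gz sample_R2.fastq.gz

    # bisulfite alignment with Bismark against a prepared genome folder
    bismark --genome /path/to/bismark_genome \
        -1 sample_R1_val_1.fq.gz -2 sample_R2_val_2.fq.gz

    # NOTE: no deduplicate_bismark call here; deduplication is skipped for RRBS

    # extract per-cytosine methylation calls from the paired-end alignment
    bismark_methylation_extractor --paired-end sample_R1_val_1_bismark_bt2_pe.bam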
Another observation: I have handled paired-end RRBS reads, and in my experience read 2 consistently shows higher duplication than read 1.