Question

PCR duplicates in RRBS data

1

5.1 years ago

linelr ▴ 40

Hi!

I am working with DNA methylation in salmon and have recently acquired data from an RRBS experiment. FastQC reports that around 40% of my reads are PCR duplicates, which is quite high. However, I have read that for RRBS data I should not remove duplicates, e.g. by simply removing reads that have the exact same start and end position in the genome, but this did not come with a proper explanation. It sort of makes sense to me because of the way the library prep is performed: MspI cleaves only at CCGG sites and the fragments are then size-selected, so you will probably end up with many fragments that start and end at the same positions, and FastQC might therefore flag them as PCR duplicates of each other. This is of course based on my non-exhaustive understanding of these processes.

I can't seem to find any good explanation of how to perform proper PCR duplicate removal for RRBS data, if that is indeed called for (which I suspect it is).

Does anyone know how to do this or can anyone point me to where I might find this information?

Thanks in advance!

Best, Line

RRBS sequencing • 2.5k views

ADD COMMENT • link 5.1 years ago by linelr ▴ 40

0

Alright! This makes sense. Thanks a lot! I'll keep what you wrote about FastQC in mind for next time.

Have a good day!

ADD REPLY • link 5.1 years ago by linelr ▴ 40

0

Please reply to comments using Add comment and Add reply; that keeps the thread organized. Thank you. Also, please feel free to upvote and accept good answers.

ADD REPLY • link 5.1 years ago by ATpoint 88k

1

Sure! Thanks for the reminder

ADD REPLY • link 5.1 years ago by linelr ▴ 40

score 3 · Answer 1 · 2020-04-16

3

5.1 years ago

Devon Ryan 105k

I strongly recommend that you not remove alleged PCR duplicates in RRBS data processing. In data like this we expect that there should appear to be very high levels of what look like PCR duplicates. These are not real PCR duplicates (for the most part at least). Please note that FastQC's defaults are all intended for whole-genome sequencing and will give warnings that you should ignore if you run it on RRBS datasets.
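To illustrate why identical coordinates are expected here, below is a minimal toy sketch (not part of any RRBS pipeline). It assumes a made-up 24 bp "genome", a hypothetical helper `mspi_fragments`, an arbitrary size-selection window, and five independent genome copies with no PCR step at all. Because MspI always cuts at the same CCGG sites, every copy yields fragments with the same start and end, so a position-based duplicate count looks very high even though each fragment is an independent molecule.

```python
import re
from collections import Counter


def mspi_fragments(seq, min_len, max_len):
    """Toy in-silico MspI digest (cuts C^CGG) followed by size selection.

    Hypothetical helper for illustration only; returns (start, end)
    coordinates of the fragments that pass the size window.
    """
    # MspI cuts between the first and second C of each CCGG site.
    cuts = [m.start() + 1 for m in re.finditer("CCGG", seq)]
    bounds = [0] + cuts + [len(seq)]
    frags = list(zip(bounds, bounds[1:]))
    return [(a, b) for a, b in frags if min_len <= b - a <= max_len]


# Made-up sequence with three CCGG sites (assumption, not real salmon data).
genome = "ATCCGGTTAACCGGATATCCGGTA"

# Five independent genome copies (e.g. five cells), no PCR amplification.
n_copies = 5
reads = []
for _ in range(n_copies):
    reads.extend(mspi_fragments(genome, min_len=6, max_len=10))

position_counts = Counter(reads)
apparent_duplication = 1 - len(position_counts) / len(reads)

print(position_counts)
print(f"apparent duplication rate: {apparent_duplication:.0%}")
```

In this toy case every read shares its coordinates with four others, so removing "duplicates" by position would discard 80% of genuinely independent molecules and with them most of the methylation information.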

ADD COMMENT • link 5.1 years ago by Devon Ryan 105k
