Hi.
I'm running paired-end, single-digest RAD sequencing. I have just finished analysing my sequences, and my pipeline removed 97.7% of all read pairs as putative PCR duplicates. This naturally worries me, but when I think about it, the result doesn't really make sense. The exact message from the pipeline is:
Removed 43288163 read pairs whose insert length had already been seen in the same sample as putative PCR duplicates (97.7%); kept 1026975 read pairs.
But isn't that actually expected, i.e., shouldn't the insert lengths be about the same? All of my forward reads should start at the exact same position (the restriction enzyme cut site), and they are all exactly the same length (100 base pairs). Each is then paired with a reverse read, but I'm assuming this too should produce roughly the same insert sizes, because according to the molecular protocol the fragments are "size selected" prior to sequencing.
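
To make my reasoning concrete, here is a minimal sketch in Python. All numbers are made up for illustration (10,000 genuinely unique fragments at a single locus, a 300-600 bp size-selection window); none of them come from my data or protocol, and the function name is hypothetical. It also assumes the filter keeps one read pair per distinct insert length at each locus, which is only how I read the log message, not something I have verified in the pipeline itself.

```python
import random


def simulate_insert_length_dedup(n_reads, min_insert, max_insert, seed=0):
    """Simulate one locus with NO true PCR duplicates.

    Every read pair starts at the same cut site, so the only thing that
    distinguishes pairs is the insert length, drawn here uniformly from a
    hypothetical size-selection window [min_insert, max_insert].
    """
    rng = random.Random(seed)
    insert_lengths = [rng.randint(min_insert, max_insert) for _ in range(n_reads)]

    # Assumed filter: keep the first pair seen for each insert length and
    # discard every later pair with the same length.
    kept = len(set(insert_lengths))
    removed = n_reads - kept
    return kept, removed, removed / n_reads


# Illustrative values only: 10,000 unique fragments, 300-600 bp window.
kept, removed, fraction_removed = simulate_insert_length_dedup(10_000, 300, 600)
print(f"kept={kept} removed={removed} fraction_removed={fraction_removed:.3f}")

# Same window at much lower depth, for comparison.
kept_low, removed_low, fraction_low = simulate_insert_length_dedup(10, 300, 600)
print(f"kept={kept_low} removed={removed_low} fraction_removed={fraction_low:.3f}")
```

Under these assumptions the window contains only 301 possible insert lengths, so at most 301 pairs per locus can survive no matter how many unique fragments were sequenced. If that is right, the percentage removed would reflect depth per locus at least as much as the true PCR duplication rate.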