Hello,
Mammalian samples were sequenced using Illumina NovaSeq. FastQC reported that 28-35% of reads were duplicates. I wanted to know whether the duplication happened at the PCR stage or the sequencing stage. Two more tools reported different levels of duplication than FastQC, and attributed it to different origins. I do not understand why the discrepancies are so large.
Clumpify: analysing raw, unfiltered reads. Assuming 1% error, and setting a plate distance specific to NovaSeq. Inferred that the duplication rate for one sample is 28.3% (consistent with FastQC), and that among the duplicated reads 96.7% are optical.
Picard Tools: the same raw, unfiltered reads mapped to a high-quality reference. Overall duplication rate for the sample 12.1% (much lower than FastQC and Clumpify). Of the duplicated, only 7.46% optical (opposite of the Clumpify result).
Any insights would be appreciated. We can adjust our laboratory procedures, but need to know if the duplication is PCR or optical...
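To put the two reports on the same scale, the percentages quoted above can be converted into "optical duplicates as a share of all reads". This is only a back-of-the-envelope sketch in Python; the function name is made up for illustration, and the inputs are the figures from this post.

```python
def optical_share_of_all_reads(dup_rate, optical_fraction_of_dups):
    """Fraction of ALL reads that a tool flagged as optical duplicates."""
    return dup_rate * optical_fraction_of_dups


# Figures quoted in the post for the same sample.
clumpify = optical_share_of_all_reads(0.283, 0.967)
picard = optical_share_of_all_reads(0.121, 0.0746)

print(f"Clumpify: {clumpify:.1%} of all reads flagged as optical")
print(f"Picard:   {picard:.2%} of all reads flagged as optical")
```

On these numbers the two tools differ by more than an order of magnitude in the share of reads called optical, which is the gap the rest of the thread tries to explain.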
There is no absolute way to estimate PCR duplicates unless you incorporate UMIs in your library prep. Did you use default options for both programs?
The `clumpify.sh` result is purely sequence-based, and it should be reliable for estimating duplicates in your data (PCR or otherwise).

Do the minimum number of PCR cycles recommended by whichever kit you are using. If you are going over that number then you are certainly introducing PCR duplication. Optical/positional duplicates are controlled by loading concentration. If you do your own sequencing then consult with Illumina support to check on run metrics. If someone else is doing the sequencing then hopefully they know what they are doing and already control optical duplication by optimal loading.
Hi, The reason I verified the levels of duplication with three different tools and tried to distinguish how many duplicates are optical was precisely that: the sequencing manager wanted to optimise loading. The seemingly high levels of duplication in the first round of sequencing came as a surprise. For Clumpify I adjusted the distance parameter, which is specific to the sequencing platform. However, even with an order of magnitude lower value the results were quite similar.
However, I am realising now that a similar parameter (OPTICAL_DUPLICATE_PIXEL_DISTANCE) needs to be changed in Picard, apparently from 100 to 2500. I will test if this brings the Picard results in line. Thanks!
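To illustrate why the distance threshold matters so much, here is a minimal Python sketch that classifies duplicates as optical based on the coordinates in the read name. It is a simplification, not Picard's or Clumpify's actual implementation: it assumes the common Illumina read-name layout `instrument:run:flowcell:lane:tile:x:y`, treats reads with identical sequence as duplicates, and the function names and example reads are invented for illustration.

```python
from collections import defaultdict


def parse_coords(read_name):
    """Return (lane, tile, x, y) from an Illumina-style read name.

    Assumes the layout instrument:run:flowcell:lane:tile:x:y.
    """
    fields = read_name.split()[0].lstrip("@").split(":")
    return fields[3], fields[4], int(fields[5]), int(fields[6])


def count_optical(reads, pixel_distance):
    """Return (n_duplicates, n_optical) for (read_name, sequence) pairs.

    A duplicate is counted as optical if an earlier copy of the same
    sequence lies on the same lane and tile within pixel_distance on
    both the x and y axes.
    """
    groups = defaultdict(list)
    for name, seq in reads:
        groups[seq].append(parse_coords(name))

    n_dup = n_optical = 0
    for coords in groups.values():
        for i in range(1, len(coords)):  # first copy is the representative
            n_dup += 1
            lane, tile, x, y = coords[i]
            for lane2, tile2, x2, y2 in coords[:i]:
                same_tile = (lane, tile) == (lane2, tile2)
                if same_tile and abs(x - x2) <= pixel_distance and abs(y - y2) <= pixel_distance:
                    n_optical += 1
                    break
    return n_dup, n_optical


# Hypothetical reads: two copies on the same tile, one on another tile.
example_reads = [
    ("@A00123:45:HXXXXXXX:1:1101:1000:2000", "ACGTACGT"),
    ("@A00123:45:HXXXXXXX:1:1101:1500:2600", "ACGTACGT"),
    ("@A00123:45:HXXXXXXX:1:2205:1000:2000", "ACGTACGT"),
    ("@A00123:45:HXXXXXXX:1:1101:9000:9000", "TTTTGGGG"),
]

print(count_optical(example_reads, 100))
print(count_optical(example_reads, 2500))
```

With the small threshold none of the duplicates in this toy set count as optical, while the larger one reclassifies the nearby copy. The total number of duplicates does not change, only how they are labelled.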
`clumpify.sh` also allows one error by default. To make those results strict you will need to add `hdist=0` (now `subs=0`) to do only perfect matches. If your starting library already has more than normal duplicates then the problem is going to get compounded during sequencing.

I believe this parameter is now called `subs`, so `subs=0`.
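The effect of allowing substitutions can be seen in a toy Python example. This is only a sketch of the idea, not how Clumpify works internally: it compares every read against previously kept reads, which is quadratic and only sensible for tiny inputs, and the `max_subs` argument is just a stand-in for the tool's setting.

```python
def count_duplicates(seqs, max_subs=0):
    """Count reads matching an earlier kept read within max_subs mismatches.

    Reads of different length are never treated as duplicates here.
    """
    kept = []
    n_dup = 0
    for s in seqs:
        is_dup = any(
            len(s) == len(k) and sum(a != b for a, b in zip(s, k)) <= max_subs
            for k in kept
        )
        if is_dup:
            n_dup += 1
        else:
            kept.append(s)
    return n_dup


# Hypothetical reads: one exact copy and one copy with a single mismatch.
toy_reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA", "TTTTGGGG"]

print(count_duplicates(toy_reads, max_subs=0))
print(count_duplicates(toy_reads, max_subs=1))
```

Allowing a single mismatch adds the near-identical read to the duplicate count, so a sequence-based estimate with a tolerance above zero will be at least as high as a strict one.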