Hello everyone!
I have read the help document from FastQC group, but there is not enough detailed information.
Here is my understanding of this duplicate sequence plot from FastQC:
From the title "Percent of seqs remaining if deduplicated 14.11%", does it mean that if I run a deduplication step on my data, only 14.11% of the sequences will remain? Which would mean the duplication level is very high?
From the red line, can I say that about 60% of the deduplicated sequences are at duplication level "1", and about 25% of the deduplicated sequences are at duplication level ">10"?
From the blue line, can I say that about 10% of the total sequences are at duplication level "1", and about 65% of the total sequences are at duplication level ">10"?
Is this interpretation right?
Can I conclude from this plot that the libraries may contain technical duplication? What other analyses should I do to rule this in or out?
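For what it's worth, the relationship between the two lines and the headline percentage can be sketched in a few lines of Python. This is a toy illustration, not FastQC's actual code, and the histogram counts below are made up: `hist` maps a duplication level to the number of distinct sequences observed that many times.

```python
# Toy illustration of the FastQC duplication plot arithmetic (made-up counts).
# hist: duplication level -> number of DISTINCT sequences seen that many times.
hist = {1: 600, 2: 150, 3: 80, 5: 40, 10: 30, 50: 100}

distinct_total = sum(hist.values())                       # sequences left after dedup
read_total = sum(level * n for level, n in hist.items())  # raw reads before dedup

# Red line: percentage of the deduplicated (distinct) sequences at each level.
red = {level: 100 * n / distinct_total for level, n in hist.items()}

# Blue line: percentage of ALL reads at each level (each distinct sequence
# at level L accounts for L reads).
blue = {level: 100 * level * n / read_total for level, n in hist.items()}

# The headline figure: what fraction of the library would survive deduplication.
pct_remaining = 100 * distinct_total / read_total
print(f"Percent of seqs remaining if deduplicated: {pct_remaining:.2f}%")
```

With these made-up numbers, 60% of the distinct sequences sit at level 1 (the red line) while the level-50 sequences dominate the raw reads (the blue line), which mirrors the shape described above.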
Background can be found here
Thank you very much in advance!
Thank you Kevin!
I have read this answer and the updated interpretation for the new version of the FastQC duplicate sequence plot. My data looks most like Example 3, but with one difference: there, most duplication levels are above thousands of times, whereas in my case most duplication levels are around 10 times. So is my data better? What are the different implications of different duplication levels? If duplication levels in the thousands can indicate a technical error in sequencing, what about 10 times? Some people suggest not trusting the duplicate sequence plot too much and instead considering the per base quality plot to get a realistic assessment of duplication. In my case, my per base sequence quality is great, but I have a high proportion of reads at the 10x duplication level. What does this imply?
Hey,
I think that your plot is more like Example 2, but it is just that you have a greater magnitude of duplication.
Your plot indicates that 65% of your reads are duplicated between 10 and 50 times; the spike may appear there purely because that bin covers 40 different levels of duplication (10x, 11x, 12x, 13x, ... 49x, 50x). Did you run the sample through more than one PCR amplification step?
This level of duplication may not be ideal, but I don't believe it will cause a major problem for you in downstream analyses. As you mentioned, there is disagreement in the sequencing field about the importance of removing duplicate sequences. The best thing to do is to run the analysis both ways, once with duplicates removed and once with them retained, and see what differences you get.
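To make the with/without comparison concrete, here is a toy sketch of exact-sequence deduplication in Python. This is illustrative only: for real data you would normally mark duplicates after alignment with a dedicated tool such as Picard MarkDuplicates or samtools markdup, rather than collapsing identical raw reads like this.

```python
# Toy exact-duplicate removal (illustrative; real pipelines deduplicate
# after alignment so that optical/PCR duplicates are identified properly).
def dedup_exact(reads):
    """Keep the first occurrence of each read sequence, preserving order."""
    seen = set()
    kept = []
    for r in reads:
        if r not in seen:
            seen.add(r)
            kept.append(r)
    return kept

# Made-up example reads: "ACGT" appears 3x, "TTGA" 2x.
reads = ["ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "GGCC"]
deduped = dedup_exact(reads)
print(len(deduped), "of", len(reads), "reads remain after deduplication")
```

You could then run your downstream analysis on both `reads` and `deduped` and compare the results, which is the experiment suggested above.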
If over 20% of my reads were duplicated over 5k times, or even almost 30% duplicated over 10k times, would that be a potential problem for downstream analysis?