I am trying to understand a bit more deeply how duplications occur and how to deal with them in NGS analysis. First of all, I wanted to understand the FastQC read duplication report, for which Istvan Albert's tutorial (Revisiting the FastQC read duplication report) is really good.
Running FastQC on my FASTQ file produced this report:
The title shows the proportion of duplicated reads, which is (as far as I can understand) very high. I have run samtools rmdup and Picard MarkDuplicates on this file, and the proportion of duplicated reads detected and removed/marked is around 15%.
So my question is: shouldn't all duplicated reads be removed when applying a duplicate-removal tool?
My second question: for the simple simulation that Istvan Albert does in his post, I can understand what the red and blue lines are telling me. However, what are my red and blue lines telling me in a more realistic scenario like this one (e.g. why is there a peak between 9 and >10)?
Two comments:
Thanks for your answer. This is for a clinical analysis, and GATK best practices recommend removing duplicates.
I am still wondering whether having ~50% duplicated reads is normal, and why FastQC says I have 48% while rmdup and MarkDuplicates remove only 15%.
Indeed, for such an analysis it is advised to remove duplicates.
The duplication rate also depends on the biological complexity of your sample(s) combined with the sequencing depth.
Why one tool detects or removes more duplicates than another I don't know for sure, but it is quite possible that they use different definitions of duplication and/or have different levels of sensitivity for picking them up.
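To see how the definition alone can change the number, here is a small toy simulation (hypothetical data, not your library): one metric counts a read as a duplicate whenever a single coordinate repeats (roughly what a single-end, sequence-only view sees), the other only when both mates of a pair repeat together (roughly how MarkDuplicates treats proper pairs). Same fragments, very different percentages.

```python
from collections import Counter
import random

random.seed(0)

# Toy model: each fragment is a (start1, start2) pair of mate positions.
# Hypothetical numbers chosen only to make the contrast visible.
fragments = [(random.randrange(100), random.randrange(1000))
             for _ in range(10_000)]

# Definition 1: only read 1 is considered (single-end / sequence-style view).
# A fragment is a duplicate if its start1 was already seen.
c1 = Counter(f[0] for f in fragments)
dup_single = sum(n - 1 for n in c1.values()) / len(fragments)

# Definition 2: both mates must match (paired-end, position-based view,
# similar in spirit to how Picard MarkDuplicates handles proper pairs).
c2 = Counter(fragments)
dup_pair = sum(n - 1 for n in c2.values()) / len(fragments)

print(f"single-end style duplication rate: {dup_single:.1%}")
print(f"paired-end  style duplication rate: {dup_pair:.1%}")
```

The single-end style rate comes out far higher than the paired-end style rate on the same data, which is one plausible reason a read-level estimate (like FastQC's) and a pair-aware, alignment-based tool can disagree so much.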
What type of libraries do you have? Is this RNA-seq, shotgun whole genome, or exome? For RNA-seq and exome it would be normal to have a high duplication rate.
To get a true estimate of duplicates you can use clumpify.sh from the BBMap suite, which does a purely sequence-based analysis; no alignment is needed, unlike the other tools you mention. It can require perfect matches, can work on paired-end reads at the same time, and can also remove the duplicates. See: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files.
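A typical invocation might look like the sketch below (file names are hypothetical; `dedupe` turns on duplicate removal and `subs=0` requires perfect sequence matches):

```shell
# Sequence-based duplicate removal on a paired-end library with clumpify.sh
# from the BBMap suite. Input/output names here are placeholders.
clumpify.sh in=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
            out=dedup_R1.fastq.gz out2=dedup_R2.fastq.gz \
            dedupe subs=0
```

Because this works on the raw FASTQ sequences, the percentage it reports can be compared directly with FastQC's estimate, independent of any alignment.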