With a new release of FastQC, the post titled "So What Does The Sequence Duplication Rate Really Mean In A Fastqc Report" has lost its relevance. This is a followup and a short discussion of the new plots and their interpretation.
The new plots now contain two different curves, and the meaning of the percentage has also changed. The explanations in the docs are a little bit lacking, so to make sure I got it right I wrote a Python implementation (see the end) that produces the same plots.
I found it helpful to use the term "distinct" sequences rather than "unique" sequences, as the latter term seems to imply to some that those sequences are present only once in the data. Distinct sequences are defined as the largest subset of sequences in which no two sequences are identical.
Thus distinct sequences = number of singletons (sequences that appear only once) + number of doubles (sequences that appear twice, each counted only once) + number of triplets (sequences that appear three times, each counted only once) ... and so on.
The percentage in the title is computed as distinct / total * 100.
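As a rough illustration of the definition above, here is a minimal sketch of the distinct count and the percentage. It is not FastQC's own code; the function name `percent_remaining` and the toy reads are made up for this example.

```python
from collections import Counter

def percent_remaining(reads):
    """Return (distinct, total, percent) for a list of read sequences.

    Hypothetical helper, not part of FastQC.
    """
    counts = Counter(reads)        # sequence -> number of occurrences
    distinct = len(counts)         # every sequence counted once, however often it occurs
    total = sum(counts.values())   # every read counted
    return distinct, total, 100.0 * distinct / total

# Toy data: one double, one singleton, one triplet.
reads = ["AAA", "AAA", "CCC", "GGG", "GGG", "GGG"]
print(percent_remaining(reads))
```

Here three distinct sequences out of six reads give 50%.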
The blue line represents the counts of all the sequences that are duplicated at a given rate. The percentage is computed relative to the total number of reads.
The red line represents the number of distinct sequences that are duplicated at a given rate. The percentage is computed relative to the total number of distinct sequences in the data.
Let's take two examples, each containing 20 reads:
- Case 1: 10 unique reads + 5 reads each present twice (duplicates)
- Case 2: 10 unique reads + 1 read present 10 times
Case 1, shown in the upper plot, leads to 15 distinct reads and thus 15/20=75% remaining. The number of reads that are singletons is 1x10=10 and the number of reads that are doubles is 2x5=10, so the blue line sits at 10/20=50% at both of those duplication rates. The 15 distinct sequences are distributed as 10 singletons and 5 doubles, so the red line is at 10/15=67% for a rate of 1 and 5/15=33% for a rate of 2.
Case 2 produces 11 distinct reads, and therefore 11/20=55% is the percent remaining. Again the reads are split equally between two duplication rates (10 reads each), but this time the second peak of the blue line is at 10, since we have one read duplicated 10 times and that accounts for 10 sequences. There are 11 distinct groups in total, of which 10/11=91% are singletons and 1/11=9% sit at a duplication rate of 10x.
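The arithmetic for the two cases can be checked with a short sketch. It is written independently of the plotting script mentioned in this post and only follows the definitions of the blue and red lines given above. The name `duplication_curves` and the toy read labels are assumptions for illustration, and the sketch uses exact duplication levels without the binning FastQC applies on its x axis.

```python
from collections import Counter

def duplication_curves(reads):
    """Return (blue, red): percentages per duplication level.

    blue[level] = share of all reads that belong to sequences seen `level` times
    red[level]  = share of distinct sequences that are seen `level` times
    Hypothetical helper, not FastQC's implementation.
    """
    per_sequence = Counter(reads)                   # sequence -> occurrences
    level_counts = Counter(per_sequence.values())   # level -> number of distinct sequences
    total = len(reads)
    distinct = len(per_sequence)
    blue = {lvl: 100.0 * lvl * n / total for lvl, n in level_counts.items()}
    red = {lvl: 100.0 * n / distinct for lvl, n in level_counts.items()}
    return blue, red

# Case 1: 10 singletons + 5 reads each present twice (20 reads).
case1 = [f"s{i}" for i in range(10)] + [f"d{i}" for i in range(5)] * 2
# Case 2: 10 singletons + 1 read present 10 times (20 reads).
case2 = [f"s{i}" for i in range(10)] + ["d0"] * 10

for name, reads in (("Case 1", case1), ("Case 2", case2)):
    blue, red = duplication_curves(reads)
    print(name)
    for lvl in sorted(blue):
        print(f"  level {lvl:>2}: blue {blue[lvl]:5.1f}%  red {red[lvl]:5.1f}%")
```

In both cases the blue values are 50% at each of the two levels, while the red values are roughly 67%/33% for Case 1 and 91%/9% for Case 2.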
Below is the python code that was used to plot the above.
After going through your post (which is very informative indeed) I went through the FastQC documentation: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/8%20Duplicate%20Sequences.html
It states that "The plot shows the proportion of the library which is made up of sequences in each of the different duplication level bins. There are two lines on the plot. The red line takes the full sequence set and shows how its duplication levels are distributed. In the blue plot the sequences are de-duplicated and the proportions shown are the proportions of the deduplicated set which come from different duplication levels in the original data."
I think they have exchanged the definitions of the red and blue lines, or am I wrong?
One of the reasons that I went ahead and generated these plots (and the code for them) was that I did not understand the explanations in the help module. Note how the red line is also labeled "de-duplicated sequences" on the plot itself. I could not figure out what that meant.
I contacted Simon Andrews on this subject, because I didn't understand the meaning of "de-duplicated sequences" and he gave me a link where there is a good explanation of that:
http://proteo.me.uk/2013/09/a-new-way-to-look-at-duplication-in-fastqc-v0-11/