Note: there is an updated version of FastQC; see Revisiting the FastQC read duplication report.
Recent discussions on the duplication rates reported by FastQC (A: Illumina mate pair read duplication level and RNAseq library quantification) made me realize that I never thought this through, and that perhaps this value may not even be what I think it ought to be.
What really is a duplication rate?
I would intuitively define the duplication rate as 1 - unique reads/total reads.
Of course we'll have to define what a unique read means. Let's see what FastQC reports. Take this dataset (20Mb) from a bioinformatics lecture series. It contains two sample files; the duplication rates reported by FastQC are 11.98% and 32.09%.
The FastQC documentation has a section covering the methodology. The plot it generates is particularly unhelpful: it lumps all reads with more than 10 duplicates into the last bin, and the number of unique reads dwarfs the rest, so usually one can't see much (or, to put it differently, if you are able to notice anything it usually means really bad data). The duplication values are displayed at the top.
Each of our test files contains 100,000 reads. First we compute sequence statistics where the occurrence of each read is counted:
# create unique sequence counts
$ cat sample1.fq | ~/bin/bioawk -c fastx ' { print substr($seq, 1, 50) } ' | sort | uniq -c | sort -rg > seqstats.txt
# the number of unique sequences in the data is then
$ wc -l seqstats.txt
92173 seqstats.txt
# the duplication rate would then be
$ echo '(1- 92173/100000) * 100' | bc -l
7.83
If we take my original definition of uniqueness then the duplication rate is 7.83%, but that is different from what FastQC reports: 11.98%.
It seems FastQC will only consider a read unique if it appears exactly once:
# take reads that only appear once
$ cat seqstats.txt | awk -F ' ' ' $1 == 1 { print $2 }' | wc -l
88023
$ echo '(1- 88023/100000) * 100' | bc -l
11.97700000000000000000
The same math checks out for the second sample as well:
$ cat sample2.fq | ~/bin/bioawk -c fastx '{print substr($seq, 1, 50) }' | sort | uniq -c | sort -rg | awk -F ' ' ' $1 == 1 { print $2 }' | wc -l
67913
$ echo '(1- 67913/100000) * 100' | bc -l
32.08
But this is much larger than the value one would get by looking at the actually unique sequences in the data; that duplication rate would be 17.83%.
In the end I have come to believe that the FastQC-reported duplication rate is not the right representation. Having a sequence duplicated does not mean that the "original" sequence should also be removed from the computation.
Doing so does not allow us to distinguish between cases where, say, out of 100K reads a single read is duplicated 50K times, versus having 25K reads each duplicated 2 times. FastQC would report a 50% duplication rate in both cases, whereas the more intuitive definition would report duplication rates of 50% and 25% respectively.
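To make the arithmetic concrete, here is the same bc computation applied to those two hypothetical cases (the counts below are illustrative, not taken from the sample files):
# case 1: one sequence present 50K times plus 50K sequences present once (100K reads total)
$ echo '(1 - 50000/100000) * 100' | bc -l   # only reads seen exactly once count as unique (FastQC-style)
50.00000000000000000000
$ echo '(1 - 50001/100000) * 100' | bc -l   # all 50,001 distinct sequences count as unique
49.99900000000000000000
# case 2: 25K sequences present twice each plus 50K sequences present once (100K reads total)
$ echo '(1 - 50000/100000) * 100' | bc -l   # FastQC-style, identical to case 1
50.00000000000000000000
$ echo '(1 - 75000/100000) * 100' | bc -l   # all 75,000 distinct sequences count as unique
25.00000000000000000000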
I'll stick to my opinion that duplication should be computed over all unique reads. But perhaps there is a compelling rationale for why this is not the right approach.
@Istvan Albert: Interesting...good to know
Is there a nicer visualization of duplication rate that would be good to show for each sample? Or perhaps there's just one value per sample in a barplot...
I have started using a rarefaction plot where the X axis is the number of reads sequenced and the Y axis is the number of unique reads, with uniqueness defined by the starting position on each end of the read pair. I don't log it.
We do a lot of tech development where we need to know the complexity of the libraries. I have decided that this ends up being more informative than a rarefaction plot based on genes detected, because you can estimate the total size of the library if you can see the asymptote. It uses a good chunk of memory, but I'll stick it up on GitHub if anyone wants it. I have been trying for a couple of weeks to find an equation that fits the line so I can estimate the library size, but I haven't been able to do better with my math than with my eyes. Michaelis-Menten does not suck but it is not a good enough fit to justify using it as a metric. Downsampling reads to the point where you have hit saturation seems to improve the performance of over-sequenced libraries.
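For illustration only (this is not Michele's tool, which is linked further down), a rough shell sketch of that kind of start-position rarefaction curve might look like the pipeline below; it approximates uniqueness by the chromosome and leftmost position of each alignment, and sample.bam is a placeholder file name:
# shuffle mapped reads to simulate sequencing order, then report the cumulative number of unique start positions every 10,000 reads
$ samtools view -F 4 sample.bam | awk '{ print $3":"$4 }' | shuf | awk '!seen[$0]++ { uniq++ } NR % 10000 == 0 { print NR, uniq }' > rarefaction.txt
Plotting the first column against the second gives the reads-sequenced versus unique-reads curve.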
I've also charted scatter plots where the X axis is the total reads aligned to the gene and the Y axis is the total number of duplicate reads aligned to the gene. The points end up going in a diagonal across the chart. A line with a slope of 1 means everything is a duplicate. It's a little goofy because all the points sit on or below the diagonal, but you can compare two different libraries that way.
If your rRNA is high it might make sense to remove those reads before putting the library through, because they are all going to be duplicates (the rRNA regions are short and get lots of reads, so the pigeonhole principle applies), and then your duplication rate isn't a function of your complexity; it's a function of your rRNA contamination. When we did this paper http://www.nature.com/nmeth/journal/v10/n7/full/nmeth.2483.html we removed the rRNA before calculating duplication. They worked out a lot of the methods before I got there, but there are a lot of details that they considered that make a difference in your analysis.
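As a rough illustration of that filtering step (rRNA.bed and aligned.bam are placeholder file names for whatever annotation and alignments you have), reads overlapping rRNA intervals could be dropped before the duplication calculation with something like:
# keep only the alignments that do not overlap any rRNA interval
$ bedtools intersect -abam aligned.bam -b rRNA.bed -v > no_rrna.bam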
I'd be super interested in your script, Michele.
Hi Rory,
I put it up on GitHub here. It's C++. There is a compiled version up there.
https://github.com/mbusby/ComplexityByStartPosition
Thanks!
"Default is 20. If you want it to go faster use 10. If you want a really smooth line use 50. If you are really neurotic use 100." Ha ha.
Thanks for sharing!
I haven't quite thought about it yet; what might work is a plain counts vs duplication rate plot on log-log scales, without collapsing the reads that are highly duplicated. After all, there is a lot of value in seeing where that duplication is coming from.
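As a quick sketch of that idea, the distribution for such a log-log plot can be pulled from the seqstats.txt file computed above; the output has the copy number in the first column and the number of distinct sequences seen that many times in the second:
# copy-number distribution: how many distinct sequences occur k times
$ awk '{ print $1 }' seqstats.txt | sort -g | uniq -c | awk '{ print $2, $1 }' > dupdist.txt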
New post to discuss the latest release of FastQC; see Revisiting the FastQC read duplication report.