I'm working with high-depth, paired-end whole-exome sequencing data. Some samples are targeted at 200X and others at 400X average read depth across the exome. The duplicate rate after alignment is very high. I suspect a good portion of these marked duplicates are not actual PCR duplicates, but are simply synchronous alignments of distinct fragments. However, I'm not sure of the best way to calculate coverage estimates over the exome territory, because I can theoretically justify both omitting and incorporating marked duplicates in the calculation.
I haven't yet found anything in the literature addressing this specific issue. I'm thinking there's probably some middle ground that factors in the territory size, number of reads, and perhaps insert size profile in the case of paired-end sequencing. Does anyone have any guidance?
EDIT: I should add that each sample has a single library preparation.
Hello Dan,
could you please tell us a bit more about your data? How was the library prep done? How high is the duplicate rate? What do you plan to do with the coverage information?
fin swimmer
Is that your feeling, or do you have observations to support it? Have you tried using clumpify.sh from BBMap on this data (A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files)? Was the data sequenced on a patterned or non-patterned flowcell? Unless one uses UMIs that tag individual fragments, it would be difficult to conclusively determine the origin of these reads in practice.
I don't know how exome library prep works: what is the "unit" captured? Exons? Genes? I'll assume exons. So the shorter the exon, the more likely it is to produce false-positive PCR duplicates. The R/Bioconductor package dupRadar investigates this issue by examining PCR duplication as a function of gene length. The same approach could probably be adapted to exome libraries.
@finswimmer This is a standard whole-exome library prep with a tight insert-size distribution around 225 bp. The duplicate rate is between 25-30% for the 200X-targeted samples and 40-45% for the 400X-targeted ones. I'm looking to report an accurate coverage estimate per library which appropriately accounts for these marked duplicates.
@genomax This is something that occurred to me as I was considering the nature of high-depth exome sequencing. Just doing the math, if you have a narrow insert size distribution and a relatively small territory due to whole-exome capture, then there's an elevated chance that truly distinct fragments will end up aligning to the same coordinates, even in a paired-end context.
I haven't done anything in-depth yet though, because I wanted to see what the literature had to say to inform my investigation. I'm not seeing many results for my current search terms (various combinations of the keywords high-depth, exome, duplicate, and coverage).
These were sequenced on a patterned flowcell. Agreed w/r/t UMIs.
@Dan Do you mind trying clumpify.sh on this data? Do both total and optical duplicate estimates.
Oh, right, I didn't address that in my reply. I will give it a try right now.
OK, I ran it, and with the default of two mismatches it found duplicate rates between 38-42% on a set of four paired-end datasets. After thinking on this for a bit, I come back to my core premise: it's entirely possible that these are not duplicates, given the hundreds of millions of input reads and a capture territory on the order of tens of millions of bases. It seems to require a systematic study to answer; I'm hoping it's been done already and the results are available in publication form.
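As a back-of-the-envelope check of this premise, a simple occupancy ("birthday problem") model gives the expected fraction of coincidental coordinate collisions. All numbers below are illustrative assumptions, not values from the actual data:

```python
import math

# Occupancy model for coincidental duplicates. A paired-end duplicate must
# match both the start coordinate and the insert size, so the number of
# distinguishable states is territory size times the effective number of
# distinct insert lengths. All values are assumptions for illustration.
territory = 30_000_000      # ~30 Mb exome capture territory (assumption)
insert_states = 50          # effective distinct insert sizes (assumption)
states = territory * insert_states

n_fragments = 200_000_000   # distinct input fragments sequenced (assumption)

# Expected number of occupied states after n uniform draws:
expected_distinct = states * (1 - math.exp(-n_fragments / states))

# Fraction of fragments colliding with an earlier, distinct fragment:
coincidental = 1 - expected_distinct / n_fragments
print(f"coincidental duplicate fraction: {coincidental:.1%}")
```

Under these assumed numbers, coincidental duplicates land in the single-digit percent range, so they could inflate a 40% marked-duplicate rate somewhat but seem unlikely to account for most of it.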
Hello,
I have no literature, but maybe a suggestion for how to get a "feeling" for how many false-positive duplicates there are: one can create random reads from the reference and then check how many duplicates are found. I did this just for chrM, as my computing power is limited.
For creating random reads one can use randomreads.sh from BBTools. I used the coverage option to simulate how many DNA copies I have at the beginning of my library prep, and repeated it for coverages of 1,000X, 10,000X, and 100,000X. Afterwards I used clumpify to estimate the number of duplicates. Here are the results:
[clumpify duplicate-rate output for 1,000X, 10,000X, and 100,000X]
Of course chrM is much too small to draw conclusions about the other chromosomes. But it should show quite well how the number of duplicate fragments depends heavily on the initial number of DNA fragments.
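The same experiment can be approximated without BBTools (a toy sketch, not the original commands: uniform single-end start positions on chrM, a repeated start counted as a duplicate, and a 150 bp read length assumed):

```python
import random

CHRM_LEN = 16_569   # human chrM length in bp
READ_LEN = 150      # assumed read length

def duplicate_rate(coverage, seed=0):
    """Fraction of reads whose start position repeats an earlier read's."""
    rng = random.Random(seed)
    n_reads = coverage * CHRM_LEN // READ_LEN
    seen, dups = set(), 0
    for _ in range(n_reads):
        start = rng.randrange(CHRM_LEN - READ_LEN)
        if start in seen:
            dups += 1
        else:
            seen.add(start)
    return dups / n_reads

for cov in (1_000, 10_000, 100_000):
    print(f"{cov:>7,}X: {duplicate_rate(cov):.1%}")
```

With only ~16,000 possible start positions, the duplicate rate is already high at 1,000X and approaches 100% as the input coverage grows, matching the trend above.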
fin swimmer
@Dan: You probably want to be strict (dupesubs=0) if you are looking for true duplicates. That would reduce the number further.
@finswimmer: your result is not surprising. A small reference leads to more duplication.
Hello genomax,
yes, it's not surprising that a small reference size leads to more duplicates. Smaller fragments will also produce more duplicates, and the higher the input, the higher the duplicate rate (as one can see above).
But let's take the approach above to investigate the initial hypothesis ("I suspect a good portion of these marked duplicates are not actual PCR duplicates, but are simply synchronous alignments of distinct fragments") a little bit further.
The simulated data represent the initial fragments in the library prep. But during the preparation, we select a subset of all these fragments. The smaller the subset, the higher the chance that we capture just one distinct fragment and not its duplicates.
So, let's start with the initial coverage of 100,000X and select a number of reads that results in an average coverage of 300X:
x = 11047334 * (300 / 100000) = 33142
As we have paired reads, we only need half of that per direction, resulting in 16,571.
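As a quick check of this arithmetic (numbers taken directly from the post):

```python
total_reads = 11_047_334                  # simulated reads at 100,000X
x = round(total_reads * (300 / 100_000))  # reads needed for ~300X
pairs = x // 2                            # half per direction, paired reads
print(x, pairs)  # 33142 16571
```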
Select the reads:
Get the corresponding paired read:
Check for duplicates:
That's the result:
So, in conclusion, one can see that the number of duplicates decreases dramatically. For the other chromosomes we wouldn't have such a high duplicate rate to begin with, as the reference is much longer and the input is lower (I guess around 20,000X). So at the end of our library prep we will have far fewer false-positive duplicates.
To come back to the hypothesis:
I would conclude: No, the number of "simply synchronous alignments of distinct fragments" is negligible.
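For what it's worth, this select-then-recount argument can be reproduced with a toy simulation (pure Python rather than the original BBTools commands; chrM-like numbers and a 150 bp read length are assumptions, and a read is reduced to its start position):

```python
import random

CHRM_LEN = 16_569             # human chrM length in bp
READ_LEN = 150                # assumed read length
N_POS = CHRM_LEN - READ_LEN   # possible start positions

def dup_rate(n_reads, seed):
    """Fraction of uniform start positions that repeat an earlier one."""
    rng = random.Random(seed)
    seen, dups = set(), 0
    for _ in range(n_reads):
        start = rng.randrange(N_POS)
        if start in seen:
            dups += 1
        else:
            seen.add(start)
    return dups / n_reads

# Whole fragment pool at ~100,000X vs. the subset selected for ~300X.
# (A random subset of i.i.d. uniform fragments is itself i.i.d. uniform,
# so the selection step can be simulated as fresh draws.)
pool_rate = dup_rate(100_000 * CHRM_LEN // READ_LEN, seed=0)
subset_rate = dup_rate(300 * CHRM_LEN // READ_LEN, seed=1)

print(f"100,000X pool: {pool_rate:.1%}")
print(f"300X subset:   {subset_rate:.1%}")
```

Even on tiny chrM the selection step cuts the coincidental duplicate rate sharply; on a real exome territory, thousands of times larger, it shrinks further still.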
fin swimmer
Good theoretical exercise. Reasonable conclusion.
While people try to be careful in experimental biology about controlling non-biological variation (there must be SOPs), there are likely many sources of it. Even a single extra PCR cycle, a 5% pipetting error, or a wrongly calibrated instrument can lead to significant effects.
This seems to indicate that the deep coverage is due to reads alone. We don't have any information about how the sample was initially targeted/obtained/prepped.
Having UMIs and/or using little or no amplification and minimal handling would be practical ways of addressing this issue. Since the results will likely be confirmed independently (Sanger sequencing), it should all be well in the end.