ebaldwin · 9 months ago
I often count kmers in reads for genome size and heterozygosity estimation and for trio-binning genome assemblies. I have always assumed that removing PCR duplicates before counting kmers would give a less biased count, and have always done so. However, I do not see deduping recommended in kmer counting tutorials (e.g. https://bioinformatics.uconn.edu/genome-size-estimation-tutorial/). Is this an unnecessary step, or am I right in thinking that deduplication will give a less biased kmer distribution?
Thanks in advance!
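For context, the counting step that tools like jellyfish perform can be sketched in a few lines of Python. This is a toy model, not any tool's actual implementation; the function names and the simple "total kmers / coverage peak" genome-size estimate are my own illustration of the standard approach:

```python
from collections import Counter

COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse
    complement, so both strands count as the same k-mer (like jellyfish -C)."""
    rc = kmer.translate(COMP)[::-1]
    return min(kmer, rc)

def kmer_histogram(reads, k):
    """Count canonical k-mers across reads, then build the multiplicity
    histogram (multiplicity -> number of distinct k-mers at that depth)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    hist = Counter(counts.values())
    return counts, hist

def estimate_genome_size(counts, peak_coverage):
    """Classic estimate: total k-mers observed / k-mer coverage peak."""
    return sum(counts.values()) / peak_coverage
```

PCR duplicates matter here because every duplicated read inflates the multiplicities of exactly the same set of k-mers, which can shift the coverage peak the estimate divides by.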
How are you deciding that the duplicates are due to PCR? There is at least one package (can't remember it immediately; someone else may post) that can estimate PCR dups, but there is no absolute way to tell unless you have an experimental handle, such as unique molecular indexes (UMIs), in your data.

These are paired-end WGS data and I am using fastp to dedup, so both mates of a pair need to be identical for it to be removed. I assumed the likelihood of two identical fragments making it through library prep to sequencing was fairly low at normal (~30X) sequencing depth. You are probably right that, without UMIs, deduplicating via fastp is also throwing away some number of true (non-PCR) duplicates.
I can think of maybe three cases where real PCR duplicates will be a problem: 1. you start with way too little DNA, or it got lost somewhere along the way to sequencing; 2. "problematic genome composition", meaning that for some reason you get preferential amplification of some fragments; 3. a small genome and/or non-random fragmentation. Other than that, assuming you have 2x150 bp reads or longer derived from a non-minuscule genome, I doubt you will see a significant number of identical ~300 bp sequences that are not from genomic repeats.
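That intuition can be checked with a birthday-problem back-of-envelope. Under the (simplifying, uniform-sampling) assumptions that fragment starts are drawn from ~genome_size positions and insert sizes take some number of distinct values, the expected number of read pairs identical by chance is:

```python
def expected_chance_duplicates(n_pairs, genome_size, n_insert_lengths):
    """Birthday-problem estimate of read pairs identical by chance.
    Assumes fragment starts are uniform over ~genome_size positions and
    insert sizes take ~n_insert_lengths distinct values; a pair duplicates
    another only when both coincide. A rough model, not a real dup-rate
    estimator."""
    n_possible_fragments = genome_size * n_insert_lengths
    return n_pairs * (n_pairs - 1) / (2 * n_possible_fragments)
```

For example, ~30X of a 1 Gb genome with 2x150 bp reads is roughly 1e8 pairs; with ~200 plausible insert lengths the formula gives ~2.5e4 expected chance duplicates, i.e. ~0.025% of pairs — consistent with the claim that chance duplicates are negligible for non-minuscule genomes.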