Should you remove PCR duplicates for kmer counting?
9 months ago
ebaldwin ▴ 40

I often count kmers in reads for genome size and heterozygosity estimation and for trio-binning genome assemblies. I have always assumed that removing PCR duplicates before counting kmers would give a less biased count, and have always done so. However, I do not see deduping recommended in kmer counting tutorials (e.g. https://bioinformatics.uconn.edu/genome-size-estimation-tutorial/). Is this an unnecessary step, or am I right in thinking that deduplication will provide a less biased distribution of kmers?

Thanks in advance!

kmer dedup duplicates • 546 views

How are you deciding that the duplicates are due to PCR? There is at least one package (can't remember it immediately, someone else may post) that can estimate PCR dups, but there is no absolute way to tell unless you have an experimental handle such as unique molecular indexes (UMIs) in your data.


These are paired-end WGS data and I am using fastp to dedup, so both reads of a pair need to be identical for the pair to be removed. I assumed that two identical fragments making it through library prep to sequencing was fairly unlikely at normal (~30X) sequencing depth. You are probably right that, without UMIs, deduplicating via fastp throws away some number of true (non-PCR) duplicates.
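To make the "both pairs need to be identical" rule concrete, here is a minimal Python sketch of exact-pair deduplication and its effect on kmer counts. It is only an illustration of the idea, not how fastp is implemented internally; the function names `dedup_exact_pairs` and `count_kmers` and the toy reads are made up for this example.

```python
from collections import Counter


def dedup_exact_pairs(pairs):
    """Keep the first occurrence of each (read1, read2) sequence pair.

    A pair is dropped only if BOTH mates exactly match an earlier pair.
    Illustrative only: this is not fastp's actual algorithm.
    """
    seen = set()
    kept = []
    for r1, r2 in pairs:
        key = (r1, r2)
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


def count_kmers(pairs, k):
    """Count every kmer of length k across both mates of every pair."""
    counts = Counter()
    for pair in pairs:
        for seq in pair:
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return counts


# Toy data (hypothetical): pairs 1 and 2 are exact duplicates;
# pair 3 shares read 1 but has a different read 2, so it is kept.
pairs = [
    ("ACGTAC", "TTGCAA"),
    ("ACGTAC", "TTGCAA"),
    ("ACGTAC", "GGGCCC"),
]

unique_pairs = dedup_exact_pairs(pairs)
raw_counts = count_kmers(pairs, k=4)
dedup_counts = count_kmers(unique_pairs, k=4)
print(len(pairs), "pairs before,", len(unique_pairs), "pairs after dedup")
print("ACGT count before/after:", raw_counts["ACGT"], dedup_counts["ACGT"])
```

Each removed pair lowers the count of every kmer it contains by one, so whether dedup helps or hurts depends on whether the removed pairs were PCR copies or independent fragments.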


I can think of maybe three cases where real PCR duplicates will be a problem: 1. you start with far too little DNA, or it got lost somewhere along the way to sequencing; 2. "problematic genome composition", meaning that for some reason some fragments are preferentially amplified; 3. a small genome or non-random fragmentation. Other than that, assuming you have 2x150 bp or longer reads derived from a non-minuscule genome, I doubt you will see a significant number of identical ~300 bp sequences that are not from genomic repeats.
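A rough back-of-the-envelope check of that intuition, as a Python sketch. It assumes fragments start uniformly at random along a repeat-free genome, on either strand, with insert sizes spread evenly over `n_insert_sizes` distinct lengths; these are simplifying assumptions for illustration (cases 2 and 3 above are exactly where they fail), and the function name and parameters are hypothetical. Under this model the number of fragments and the number of possible fragment "slots" both scale with genome size, so only depth, read length, and insert-size spread remain.

```python
import math


def chance_duplicate_fraction(coverage, read_len, n_insert_sizes):
    """Expected fraction of read pairs that are coincidental (non-PCR) duplicates.

    Model (an assumption, not a measured property of any library):
    every fragment lands in one of 2 * G * n_insert_sizes equally likely
    slots (start position x strand x insert length), and there are
    coverage * G / (2 * read_len) fragments, so the genome size G cancels.
    """
    # Mean number of fragments per slot.
    lam = coverage / (4.0 * read_len * n_insert_sizes)
    # Poisson occupancy: expected occupied slots per slot is 1 - exp(-lam);
    # every fragment beyond the first in a slot counts as a duplicate.
    distinct_per_fragment = -math.expm1(-lam) / lam
    return 1.0 - distinct_per_fragment


# 30X depth with 2x150 bp reads, under two assumed insert-size spreads.
fixed_insert = chance_duplicate_fraction(30, 150, n_insert_sizes=1)
varied_insert = chance_duplicate_fraction(30, 150, n_insert_sizes=200)
print(f"single insert size:     {fixed_insert:.4%}")
print(f"200 insert sizes:       {varied_insert:.4%}")
```

With a realistic spread of insert sizes the expected chance-duplicate rate comes out well below one percent in this model, which is consistent with the point that identical pairs at ordinary depth are mostly either PCR copies or reads from genomic repeats.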
