Hi, so I am working with 30x WGS data for 8 individuals. While running the data through our normal lab pipelines, we came across a curiosity. Each sample had, on average, a 30% optical duplication rate as marked by GATK's MarkDuplicates. On further investigation, many of the reads being marked as duplicates pile up at the same locations across all individuals. For example, at chromosome 2:32916422 each individual has around 200 thousand optical duplicates, and this pattern is consistent at other locations across the genome. I am trying to understand what could cause this, or what I should look into next to troubleshoot it.
You may want to test an alternate method to verify that what you are seeing is accurate. This is a big thread and you will want to read through it completely: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.
I assume you aligned and then marked dups with GATK. clumpify.sh allows you to do this in an alignment-free manner. It seems odd that you would get "optical" duplicates from diverse samples at the same location.
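One quick way to sanity-check the "optical" label is to look at where the flagged reads sit on the flowcell. The sketch below is only an illustration, not how GATK or Clumpify does it internally: it assumes Illumina-style read names with at least seven colon-separated fields (instrument:run:flowcell:lane:tile:x:y), and it counts reads that have a neighbor on the same flowcell, lane, and tile within a pixel distance in both x and y. The function names and the example read names are made up; the default of 100 pixels mirrors the MarkDuplicates default for OPTICAL_DUPLICATE_PIXEL_DISTANCE, and patterned flowcells are usually run with a larger value.

```python
from collections import defaultdict
from itertools import combinations


def parse_read_name(name):
    """Return ((flowcell, lane, tile), x, y) from an Illumina-style read name.

    Assumption: the name looks like instrument:run:flowcell:lane:tile:x:y,
    optionally followed by a space-separated comment or a /1, /2 suffix.
    """
    fields = name.split()[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected read name format: {name!r}")
    flowcell, lane, tile = fields[2], fields[3], fields[4]
    x = int(fields[5])
    y = int(fields[6].split("/")[0])  # drop a trailing /1 or /2 if present
    return (flowcell, lane, tile), x, y


def count_optical_candidates(names, pixel_distance=100):
    """Count reads with at least one neighbor on the same tile within
    pixel_distance in both x and y (a simple box test)."""
    by_tile = defaultdict(list)
    for name in names:
        key, x, y = parse_read_name(name)
        by_tile[key].append((x, y))

    close = set()
    for key, coords in by_tile.items():
        # Pairwise comparison is quadratic, so subsample large pileups first.
        for i, j in combinations(range(len(coords)), 2):
            (x1, y1), (x2, y2) = coords[i], coords[j]
            if abs(x1 - x2) <= pixel_distance and abs(y1 - y2) <= pixel_distance:
                close.add((key, i))
                close.add((key, j))
    return len(close)


# Hypothetical read names taken from duplicates at a single locus.
names = [
    "A00123:45:HXXXXXXXX:1:1101:1000:2000",
    "A00123:45:HXXXXXXXX:1:1101:1050:2030",  # same tile, close to the first
    "A00123:45:HXXXXXXXX:1:1101:9000:9000",  # same tile, far away
    "A00123:45:HXXXXXXXX:2:1101:1000:2000",  # different lane
]
print(count_optical_candidates(names))
```

If only a small fraction of the duplicates at a pileup turn out to be physically close on the flowcell, that would suggest they are not true optical duplicates and that something else (for example a repetitive or problematic region) is driving the pileup.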