Hi, I am working with 30x WGS data for 8 individuals. While running the data through our normal lab pipelines, we came across a curiosity: each sample had, on average, a 30% optical duplication rate as marked by GATK's MarkDuplicates. On further investigation, many of the reads being marked as duplicates pile up at the same locations across all individuals. For example, on chromosome 2:32916422 each individual has around 200 thousand optical duplicates, and the same pattern shows up at other locations across the genome. I am trying to understand what could cause this, or what I should look into next to troubleshoot it.
You may want to test an alternate method to verify that what you are seeing is accurate. This is a big thread and you will want to read through it completely: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.
I assume you aligned and then marked dups with GATK. clumpify.sh allows you to do this in an alignment-free manner.
It seems odd that you would get "optical" duplicates from diverse samples at the same location.
Those sound like PCR duplicates, not optical duplicates.
Agreed. How do you know they are optical duplicates? Did you check those flags?
The methodology used to produce the reads was PCR-free, according to the lab we work with. When marking duplicates using GATK I ran:
gatk MarkDuplicates -I input.bam -O output.mdup.bam -M mdup.txt
This is where I first noticed the high optical duplication rate in the metrics file, so I reran the above with these two added flags:
--TAGGING_POLICY All --TAG_DUPLICATE_SET_MEMBERS true
I pulled out the reads that were tagged specifically as optical duplicates and visualized their placement in IGV.
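For anyone wanting to repeat that step outside IGV, here is a minimal sketch of how the tagged reads could be tallied per position. It assumes the records are available as SAM text (for example from samtools view on the marked BAM) and that MarkDuplicates wrote the DT tag, with DT:Z:SQ for sequencing (optical) duplicates and DT:Z:LB for library duplicates, which is my understanding of the tagging policy. The function name and the example records are made up for illustration.

```python
from collections import Counter


def count_optical_by_position(sam_lines):
    """Count records tagged DT:Z:SQ, keyed by (reference name, 1-based position)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        # Optional tags start at column 12 of a SAM record.
        if "DT:Z:SQ" in fields[11:]:
            counts[(fields[2], int(fields[3]))] += 1
    return counts


# Made-up records for illustration only (not real data from this thread).
EXAMPLE_SAM = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t1123\tchr2\t32916422\t60\t10M\t=\t32916500\t88\tACGTACGTAC\tFFFFFFFFFF\tDT:Z:SQ",
    "r2\t1123\tchr2\t32916422\t60\t10M\t=\t32916500\t88\tACGTACGTAC\tFFFFFFFFFF\tDT:Z:SQ",
    "r3\t1123\tchr2\t32916422\t60\t10M\t=\t32916500\t88\tACGTACGTAC\tFFFFFFFFFF\tDT:Z:LB",
    "r4\t99\tchr1\t100\t60\t10M\t=\t180\t90\tACGTACGTAC\tFFFFFFFFFF",
]

optical_counts = count_optical_by_position(EXAMPLE_SAM)
# The most heavily hit positions come first.
top_positions = optical_counts.most_common(10)
```

Comparing the top positions between individuals would show whether the pileups really sit at the same coordinates in every sample.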
What instrument was this run on? You might want to check that the regex that is identifying optical duplicates works properly on read names from your run.
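A quick way to sanity check this is sketched below. It is only an approximation of the default behaviour as the MarkDuplicates documentation describes it (the last three ':'-separated fields of the read name taken as numeric tile, x and y), not Picard's actual parser. The function names and the example read names are hypothetical.

```python
import re

_DIGITS = re.compile(r"[0-9]+")


def parse_tile_x_y(read_name):
    """Return (tile, x, y) if the last three ':'-separated fields are integers, else None."""
    fields = read_name.split(":")
    if len(fields) < 3:
        return None
    last_three = fields[-3:]
    if not all(_DIGITS.fullmatch(f) for f in last_three):
        return None
    tile, x, y = (int(f) for f in last_three)
    return tile, x, y


def unparsable_fraction(read_names):
    """Fraction of read names from which no tile/x/y could be extracted."""
    names = list(read_names)
    if not names:
        return 0.0
    bad = sum(1 for name in names if parse_tile_x_y(name) is None)
    return bad / len(names)


# Hypothetical read names: one Illumina-style, one renamed (e.g. by an archive).
EXAMPLE_NAMES = [
    "A00123:45:HXXXXXXXX:1:1101:1000:2000",
    "SRR000001.1",
]
parsed = [parse_tile_x_y(name) for name in EXAMPLE_NAMES]
fraction_bad = unparsable_fraction(EXAMPLE_NAMES)
```

If a large fraction of names from the run fail to parse, or the parsed values do not look like tile and pixel coordinates, the optical duplicate calls are worth treating with suspicion.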