Optical Duplicates
14 days ago
ebogen ▴ 10

Hi, I am working with 30x WGS data for 8 individuals. While running the data through our normal lab pipeline we came across a curiosity: each sample had, on average, a 30% optical duplication rate as reported by GATK's MarkDuplicates. On further investigation, many of the reads marked as duplicates pile up at the same locations across all individuals. For example, at chromosome 2, position 32916422, each individual has around 200 thousand optical duplicates, and the same pattern appears at other locations across the genome. I am trying to understand what could cause this, or what I should look into next to troubleshoot it.

bioinformatics MarkDuplicates optialduplicates gatk • 543 views

You may want to test an alternate method to verify that what you are seeing is accurate. This is a long thread and you will want to read through it completely: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.

I assume you aligned and then marked duplicates with GATK. clumpify.sh lets you do this in an alignment-free manner.

It seems odd that you would get "optical" duplicates from diverse samples at the same location.


Those sound like PCR duplicates, not optical duplicates.


Agreed. How do you know they are optical duplicates? Did you check the flags?


The methodology used to produce the reads was PCR-free, according to the lab we work with. When marking duplicates with GATK I ran:

gatk MarkDuplicates -I input.bam -O output.mdup.bam -M mdup.txt

This is where I first noticed the high optical duplication rate in the metrics file, so I reran the above command with these two added flags:

--TAGGING_POLICY All --TAG_DUPLICATE_SET_MEMBERS true

I pulled out the reads that were tagged specifically as optical duplicates and visualized their placement in IGV.
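For anyone wanting to repeat that extraction step, here is a minimal sketch in Python. It assumes the BAM was written with --TAGGING_POLICY All, so that duplicate reads carry a DT tag, and that DT:Z:SQ marks optical (sequencing) duplicates while DT:Z:LB marks library (PCR) duplicates. The function names are hypothetical, and the input is plain SAM text such as the output of samtools view, not the BAM file itself.

```python
from collections import Counter


def duplicate_type(sam_line):
    """Return the DT tag value of one SAM record (e.g. 'SQ' or 'LB'), or None."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start at column 12 (index 11) of a SAM record.
    for tag in fields[11:]:
        if tag.startswith("DT:Z:"):
            return tag[len("DT:Z:"):]
    return None


def count_duplicate_types(sam_lines):
    """Count records per DT value; records without a DT tag are 'untagged'."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        counts[duplicate_type(line) or "untagged"] += 1
    return counts


def optical_pileups(sam_lines):
    """Count DT:Z:SQ records per (reference name, start position)."""
    pileups = Counter()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue
        if duplicate_type(line) == "SQ":
            fields = line.split("\t")
            pileups[(fields[2], int(fields[3]))] += 1
    return pileups
```

One way to use it would be to pipe samtools view output.mdup.bam into a small script that passes sys.stdin to count_duplicate_types, then compare the SQ and LB totals with the metrics file. Calling optical_pileups(...).most_common(10) on the same text would list the positions where the tagged reads stack up.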


What instrument was this run on? You might want to check that the regex that is identifying optical duplicates works properly on read names from your run.
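A rough way to run that check outside of GATK is sketched below in Python. It rests on assumptions that should be verified against the MarkDuplicates documentation for your version: that the tool takes the tile, x and y coordinates from the last three colon-separated fields of the read name, and that two duplicates count as optical when they share a tile and their x and y coordinates each differ by no more than a pixel distance (100 is used here as a default). The function names and the example read name in the usage note are hypothetical.

```python
def parse_location(read_name):
    """Return (tile, x, y) from the last three ':'-separated fields, or None.

    None means the name does not end in three integer fields, for example
    because the reads were renamed or a suffix was appended to the name.
    """
    parts = read_name.split(":")
    if len(parts) < 3:
        return None
    try:
        tile, x, y = (int(p) for p in parts[-3:])
    except ValueError:
        return None
    return tile, x, y


def looks_optical(name_a, name_b, pixel_distance=100):
    """True if both names parse and lie on the same tile within pixel_distance."""
    loc_a = parse_location(name_a)
    loc_b = parse_location(name_b)
    if loc_a is None or loc_b is None:
        return False
    same_tile = loc_a[0] == loc_b[0]
    close_x = abs(loc_a[1] - loc_b[1]) <= pixel_distance
    close_y = abs(loc_a[2] - loc_b[2]) <= pixel_distance
    return same_tile and close_x and close_y


def unparseable_fraction(read_names):
    """Fraction of read names from which no (tile, x, y) could be extracted."""
    names = list(read_names)
    if not names:
        return 0.0
    failed = sum(1 for n in names if parse_location(n) is None)
    return failed / len(names)
```

For a made-up Illumina-style name such as A00000:1:HXXXXXXXX:1:1101:1000:2000, parse_location gives (1101, 1000, 2000). If unparseable_fraction is high on a sample of your real read names, or the parsed values do not look like tile and pixel coordinates, the optical duplicate counts in the metrics file are probably not trustworthy.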
