Question

Optical Duplicates

14 days ago

ebogen ▴ 10

Hi, I am working with 30x WGS data for 8 individuals. While running the data through our normal lab pipelines, we came across a curiosity: each sample had, on average, a 30% optical duplication rate as reported by GATK's MarkDuplicates. On further investigation, many of the reads marked as duplicates pile up at the same locations across all individuals. For example, at chromosome 2:32916422 each individual has around 200 thousand optical duplicates, and the same pattern appears at other locations across the genome. I am trying to understand what could cause this, and what I should look into next to troubleshoot it.

bioinformatics MarkDuplicates optialduplicates gatk • 540 views

updated 13 days ago by swbarnes2 14k • written 14 days ago by ebogen ▴ 10

You may want to test an alternate method to verify that what you are seeing is accurate. This is a long thread and you will want to read through it completely: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.

I assume you aligned and then marked dups with GATK. clumpify.sh allows you to mark duplicates in an alignment-free manner.

It seems odd that you would get "optical" duplicates from diverse samples at the same location.

14 days ago by GenoMax 147k

Those sound like PCR duplicates, not optical duplicates.

14 days ago by swbarnes2 14k

Agreed. How do you know they are optical duplicates? Did you check those flags?

14 days ago by Abieskawa • 0

According to the lab we work with, the methodology used to produce the reads was PCR-free. When marking duplicates using GATK I ran:

gatk MarkDuplicates -I input.bam -O output.mdup.bam -M mdup.txt

This is where I first noticed the high optical duplication rate in the metrics file, so I reran the above with these two added flags:

--TAGGING_POLICY All --TAG_DUPLICATE_SET_MEMBERS true

I pulled out the reads that were tagged specifically as optical duplicates and visualized their placement using IGV.

14 days ago by ebogen ▴ 10
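
For anyone wanting to repeat the tagging check described above, here is a minimal sketch of how the tagged reads could be tallied by position. It assumes plain-text SAM records (for example the output of `samtools view` on the marked BAM) and relies on MarkDuplicates with `--TAGGING_POLICY All` writing a `DT` tag with the value `SQ` for optical (sequencing) duplicates and `LB` for library duplicates. The function names `duplicate_type` and `optical_pileups` are made up for this example and are not part of GATK.

```python
from collections import Counter


def duplicate_type(sam_line):
    """Return the DT tag value (e.g. 'SQ' or 'LB') of a SAM record, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start after the 11 mandatory SAM columns.
    for tag in fields[11:]:
        if tag.startswith("DT:Z:"):
            return tag[5:]
    return None


def optical_pileups(sam_lines):
    """Count records tagged DT:Z:SQ per (reference name, 1-based start position)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        if duplicate_type(line) == "SQ":
            fields = line.split("\t")
            counts[(fields[2], int(fields[3]))] += 1
    return counts
```

Calling `optical_pileups(lines).most_common(20)` on the records from each sample would list the positions with the most optical-tagged reads, which makes it easy to see whether the same hotspots really recur across individuals.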

What instrument was this run on? You might want to check that the regex that is identifying optical duplicates works properly on read names from your run.

13 days ago by swbarnes2 14k
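
One rough way to run the read-name check suggested above is sketched below. It assumes the default MarkDuplicates behaviour of taking the last three ':'-separated fields of the read name as numeric tile, x and y values; this is only an approximation of that parsing, not Picard's actual code. The helper names `tile_x_y` and `unparsable`, and the read names in the usage note, are hypothetical.

```python
def tile_x_y(read_name):
    """Return (tile, x, y) from the last three ':'-separated fields, or None."""
    parts = read_name.split(":")
    if len(parts) < 3:
        return None
    try:
        tile, x, y = (int(p) for p in parts[-3:])
    except ValueError:
        # At least one of the last three fields is not a plain integer.
        return None
    return tile, x, y


def unparsable(read_names):
    """Return the read names whose trailing fields do not parse as tile, x, y."""
    return [name for name in read_names if tile_x_y(name) is None]
```

For example, `tile_x_y("A00123:45:HXXXXXXXX:1:1101:1000:2000")` gives `(1101, 1000, 2000)`, while a renamed read such as `"SRR000001.1"` gives `None`. If a sample of read names from the BAM comes back largely unparsable, or the parsed values do not look like tile and pixel coordinates, the optical duplicate calls are worth treating with suspicion.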