So I have a set of tumor and matched-normal samples. I ran them through Picard duplicate marking to handle PCR duplicates. Afterwards I used MuTect2 to call somatic variants against dbSNP, COSMIC coding mutations, and COSMIC noncoding mutations. For some reason, about 10% of my reads are being filtered out as duplicates.
I suspect that these "duplicates" are not contaminants and was wondering what may be going on. Could they be rRNA reads that were not removed during pre-processing QC?
Of course. I am working with RNA-seq data from TCGA to call somatic variants using Mutect2. I have both the tumors and their matched normals.
From my understanding, deduping is encouraged (according to Broad/GATK) to deal with PCR duplicates. I am not removing any reads, simply marking my duplicates.
Could it be that the duplicates that are being filtered by MuTect2 are actually my marked duplicates?
Unless you specify otherwise, MuTect2 automatically filters marked duplicates. The whole point of marking duplicates is so that they aren't considered by downstream variant callers and other tools.
The GATK/Broad best-practices documents are primarily geared towards working with DNA sequencing data. Many of the steps have not been validated for RNA. RNA-based variant calling has always been considered a bit more problematic than DNA-based calling, as the underlying error rate is higher for individual nucleotides.
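If you want to check whether the roughly 10% of filtered reads matches what Picard marked, one rough approach is to count the records with the duplicate bit (0x400) set in the SAM FLAG field and compare that fraction to what MuTect2 reports. Below is a minimal sketch in plain Python, assuming you feed it SAM-formatted text lines (for example, the output of `samtools view` on your marked BAM). The function name `count_duplicates` is made up for illustration and is not part of Picard or GATK.

```python
from typing import Iterable, Tuple

# SAM FLAG bit meaning "PCR or optical duplicate" (set by duplicate-marking tools).
DUPLICATE_FLAG = 0x400


def count_duplicates(sam_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total_records, duplicate_records) for SAM-formatted text lines.

    Hypothetical helper for illustration: header lines (starting with '@')
    and blank lines are skipped; every alignment record is counted,
    including secondary and supplementary alignments.
    """
    total = 0
    duplicates = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a valid alignment record
        flag = int(fields[1])  # column 2 of a SAM record is the bitwise FLAG
        total += 1
        if flag & DUPLICATE_FLAG:
            duplicates += 1
    return total, duplicates
```

Calling `count_duplicates(sys.stdin)` in a small script and piping `samtools view` output into it would give you the two counts for a whole file. If the duplicate fraction comes out near 10%, that would be consistent with MuTect2 simply dropping the reads you marked. Note that this counts all alignment records, so the number may differ slightly from tools that only count primary alignments.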