I have WES data from which I identified somatic mutations with MuTect2 and annotated them with Oncotator. The protocol captures exons and UTRs, but in my results over 40% of the mutations are annotated as intronic. How is this possible?
Most likely because the exome capture extends beyond the strict exon boundaries. The padding value in GATK also lets you look beyond the capture targets for any reads that cross those areas. You're likely seeing a lot of calls there because the areas just beyond the exon boundaries have much poorer coverage, which makes variant calling less reliable.
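One way to check this without opening every site in IGV is to measure how far each intronic call sits from the nearest exon. Below is a minimal Python sketch of that idea. The function name and the toy coordinates are made up for illustration, and it assumes you already have a sorted, non-overlapping list of exon intervals (1-based, inclusive) for one chromosome.

```python
from bisect import bisect_left


def distance_to_nearest_exon(pos, exons):
    """Return the distance in bp from pos to the nearest exon.

    exons must be a sorted, non-overlapping list of (start, end) tuples,
    1-based and inclusive. Returns 0 if pos is inside an exon and None
    if the exon list is empty.
    """
    starts = [start for start, _ in exons]
    i = bisect_left(starts, pos)  # first exon whose start is >= pos
    best = None
    # Only the exon just before and the exon at index i can be nearest.
    for j in (i - 1, i):
        if 0 <= j < len(exons):
            start, end = exons[j]
            if start <= pos <= end:
                return 0
            dist = start - pos if pos < start else pos - end
            best = dist if best is None else min(best, dist)
    return best


# Toy example with hypothetical coordinates.
exons = [(100, 200), (500, 600)]
for variant_pos in (150, 210, 350, 490):
    print(variant_pos, distance_to_nearest_exon(variant_pos, exons))
```

If most of the intronic calls turn out to be within a few tens of bases of an exon, that would be consistent with the padding explanation above.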
Thanks for your reply.
Do you mean that coverage is poorer in the captured intronic regions, which leads to more false calls and therefore a higher number of calls overall? I find it hard to understand why I am calling so many more intronic mutations.
If you look at your alignment in something like IGV, along with your VCF file, and look at some of your intronic variants, my guess is that they'll be just outside of the exons. You'll get a higher false positive (FP) rate around these areas because they're typically poorly covered. There are a few things to consider: the exon, the capture, and the reads. The capture kit targets the exons, and usually a little bit more upstream and downstream. The reads will mostly cover the capture boundaries, plus around half the read length of the sequencing you're performing (padding). Coverage is almost a bell curve around the capture, with low coverage just outside of the capture. If you're seeing a high intronic ratio of variants, then consider looking at these in IGV, and if you think they're a problem, maybe apply a DP tag filter.
Thanks again for your help. It takes a very long time to run MuTect, even with many cores or nodes, so I do not have time to rerun it because I am trying to finish up soon.
Is there any way to do the filtering from MuTect output?
This would filter out all reads mapping to intronic regions prior to base quality score recalibration and variant calling, based on the genomic intervals annotated as exonic in the assembly's corresponding GTF or GFF file. It might actually be the best way to do it, IMO.
You probably also want to remove reads with quality below Q30 as well, but I'm not exactly sure how the GATK pipeline handles these reads, so I'd check their documentation first.
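For reference, the exonic intervals mentioned above can be pulled out of a GTF with a few lines of Python. This is only a rough sketch: it assumes the standard nine-column, tab-separated GTF layout with 1-based inclusive coordinates, the function name is hypothetical, and it stops at building the merged interval list rather than filtering reads.

```python
from collections import defaultdict


def exon_intervals_from_gtf(lines):
    """Collect 'exon' features per chromosome and merge overlapping ones.

    Returns a dict mapping chromosome name to a sorted list of
    (start, end) tuples, 1-based and inclusive as in the GTF itself.
    """
    raw = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # keep only well-formed exon records
        raw[fields[0]].append((int(fields[3]), int(fields[4])))

    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = out[-1]
            if start <= last_end + 1:  # overlapping or directly adjacent
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged
```

Exons from different transcripts of the same gene usually overlap, which is why the merge step is there. The merged list could then be written out as an interval file for whatever tool does the actual filtering.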
You don't have to redo the processing from MuTect2; instead, use GATK's SelectVariants to filter your existing call set.
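To illustrate the idea of filtering the existing calls by depth, here is a plain-Python stand-in (not SelectVariants itself). It assumes the VCF records carry a DP entry in the INFO column, and the cutoff of 20 is an arbitrary placeholder, not a recommended value. The function names are made up for this example.

```python
def info_depth(info_field):
    """Return the integer DP value from a VCF INFO string, or None if absent."""
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        if key == "DP" and value.isdigit():
            return int(value)
    return None


def filter_by_depth(vcf_lines, min_dp=20):
    """Yield header lines unchanged, plus records whose INFO DP >= min_dp.

    Records without a DP entry are dropped. min_dp=20 is only a placeholder.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        depth = info_depth(fields[7])
        if depth is not None and depth >= min_dp:
            yield line
```

Since this only reads the VCF that MuTect2 already produced, it runs in seconds and does not need the BAM files or a rerun of the caller.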