Question

WES: why are 40% of my mutations annotated as intronic?

1

7.5 years ago

haiying.kong ▴ 360

I have WES data from which I identified somatic mutations with MuTect2 and annotated them with Oncotator. The protocol captures exons and UTRs, but in my results over 40% of the mutations are annotated as Intron. How is this possible?

WES somatic mutation • 4.7k views

ADD COMMENT • link updated 7.5 years ago by andrew.j.skelton73 6.6k • written 7.5 years ago by haiying.kong ▴ 360

GenoMax · Answer 1 · 2017-12-18

2

7.5 years ago

andrew.j.skelton73 6.6k

Most likely because the exome capture extends beyond the strict exon boundaries. The interval padding value in GATK also lets you look beyond the capture targets, at any reads that cross those areas. The reason you're likely seeing a lot of results is because these areas just beyond the exon boundaries will have much poorer coverage, and thus calling variants is harder.
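
A rough way to check this on your own data is to measure how far each call lies from the nearest target. The sketch below is only an illustration: `dist_to_target`, `targets.bed`, and `somatic.vcf` are made-up names, and it assumes an uncompressed VCF and a BED file that use the same chromosome naming. It is a brute-force loop, so it suits a spot check on a subset of calls; a dedicated interval tool will be much faster on a full call set.

```shell
# Print chrom, pos, and distance (bp) to the nearest BED interval for each VCF record.
# Usage: dist_to_target targets.bed somatic.vcf   (both file names are hypothetical)
dist_to_target() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR {                      # first file: BED (0-based start, 1-based end)
      n[$1]++
      s[$1, n[$1]] = $2 + 1          # convert start to 1-based to match VCF POS
      e[$1, n[$1]] = $3
      next
    }
    /^#/ { next }                    # skip VCF header lines
    {
      best = -1                      # stays -1 if the chromosome has no intervals
      for (k = 1; k <= n[$1]; k++) {
        if ($2 < s[$1, k])      d = s[$1, k] - $2
        else if ($2 > e[$1, k]) d = $2 - e[$1, k]
        else                    d = 0
        if (best < 0 || d < best) best = d
      }
      print $1, $2, best
    }
  ' "$1" "$2"
}
```

A distance of 0 means the call is inside a target; small non-zero distances are the calls sitting just outside one.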

ADD COMMENT • link 7.5 years ago by andrew.j.skelton73 6.6k

0

Thanks for your reply.

I do not understand this part: "The reason you're likely seeing a lot of results is because these areas just beyond the exon boundaries will have much poorer coverage, and thus calling variants is harder."

Do you mean that coverage is poorer in the captured intron regions, which leads to a lot of false calls and therefore more calls overall? I find it difficult to understand why I am calling so many more intron mutations.

11                                            Intron 42904
13                                 Missense_Mutation 38080
1                                              3'UTR 24338
22                                            Silent 19292

ADD REPLY • link updated 7.5 years ago by GenoMax 151k • written 7.5 years ago by haiying.kong ▴ 360

0

If you load your alignments and your VCF file into something like IGV and look at some of your intronic variants, my guess is that they'll sit just outside the exons. You'll get a higher false positive (FP) rate in these areas because they're typically poorly covered. There are three things to consider: the exon, the capture, and the reads. The capture kit targets the exons, usually plus a little extra upstream and downstream. The reads mostly cover the capture targets and extend past their boundaries by around half the read length of your sequencing (the padding). Coverage looks almost like a bell curve around each capture target, with low coverage just outside it. If you're seeing a high proportion of intronic variants, look at them in IGV, and if you think they're a problem, consider filtering on the DP (depth) tag.
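
As a minimal sketch of a depth filter on an existing call set: the function below assumes an uncompressed VCF whose INFO column (column 8) carries a `DP=<int>` entry, which you should confirm for your MuTect2 version. The name `filter_by_dp`, the threshold, and the file names are placeholders, and records with no DP entry are dropped.

```shell
# Keep header lines plus records whose INFO DP is at least the given threshold.
# Usage: filter_by_dp 20 < somatic.vcf > somatic.dp20.vcf   (file names hypothetical)
filter_by_dp() {
  awk -F'\t' -v min_dp="$1" '
    /^#/ { print; next }             # pass header lines through unchanged
    {
      n = split($8, info, ";")       # INFO is a semicolon-separated list
      for (k = 1; k <= n; k++) {
        if (info[k] ~ /^DP=/) {
          if (substr(info[k], 4) + 0 >= min_dp) print
          break
        }
      }
    }
  '
}
```

The threshold is a judgment call; looking at the depth of a few suspect calls in IGV first should suggest a sensible value.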

ADD REPLY • link 7.5 years ago by andrew.j.skelton73 6.6k

0

Thanks again for your help. MuTect takes a very long time to run, even with many cores or nodes, so I do not have time to rerun it because I am trying to finish up soon.

Is there any way to do the filtering on the existing MuTect output?

ADD REPLY • link 7.5 years ago by haiying.kong ▴ 360

2

The commands below would filter out all reads mapping to intronic regions before base quality score recalibration and variant calling, keeping only reads that overlap genomic intervals annotated as exonic in the assembly's corresponding GTF or GFF file. It might actually be the best way to do it, IMO.

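# GTF is 1-based inclusive and BED is 0-based half-open, so shift the start by one; ">" overwrites rather than appends
awk -F'\t' -v OFS='\t' '$3 == "exon" { i++; print $1, $4 - 1, $5, "exon" i, 100, $7 }' Homo_sapiens.GRCh38.91.gtf > exon.bed
# Keep only alignments that overlap the exon intervals
samtools view -bh -L exon.bed alignments.bam > filtered_alignments.bam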

You probably also want to remove reads below Q30, but I'm not exactly sure how the GATK pipeline handles these reads, so I'd check its documentation first.

ADD REPLY • link 7.5 years ago by mforde84 ★ 1.4k

0

You don't have to redo the MuTect2 processing; you can use GATK's SelectVariants to filter your existing call set.
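
For example, restricting the calls to the capture targets might look like the sketch below. It assumes GATK4-style syntax (`gatk SelectVariants` with `-V`, `-L`, and `-O`); older GATK releases use a different invocation, so check the documentation for your version. The wrapper name `select_on_target`, the `RUN` switch, and all file names are placeholders. The input VCF will likely need an index for `-L` to work.

```shell
# Subset an existing VCF to the intervals in a BED file (GATK4 syntax assumed).
# Set RUN=echo to print the command instead of running it.
select_on_target() {
  vcf=$1; targets=$2; out=$3
  ${RUN:-} gatk SelectVariants -V "$vcf" -L "$targets" -O "$out"
}

# Example call (file names hypothetical):
# select_on_target somatic.vcf.gz capture_targets.bed somatic.on_target.vcf.gz
```

This only subsets the calls that already exist, so MuTect2 does not need to be rerun.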

ADD REPLY • link 7.5 years ago by andrew.j.skelton73 6.6k