Hi all,
First post on the site, although I've been benefitting from the questions on here for a while. I've been thinking about some SNP filtering and was hoping for some feedback on my ideas. I've been using MuTect to identify somatic SNPs in tumor samples using a matched control and a panel of normal mutations. I'm still getting quite a few SNPs called with a low MAF (<0.1) and I'm keen to remove false positives where possible.
One idea I had was to run MuTect with the control and tumor BAMs swapped around and investigate the SNPs which are not rejected by the Bayesian classifier or subsequent filtering (Control SNPs). Assuming that these are all false positives (the frequency of back-mutations should be vanishingly small), I'd expect the attributes of these SNPs (read depth, alternate allele count, power to call strand imbalance, etc.) to be similar to those of false-positive SNPs in the tumor sample. I could therefore use these attributes to classify the tumor mutations as low or high confidence depending on how much they resemble the Control SNPs. In essence this would be similar, but opposite, to GATK's Variant Quality Score Recalibration (VQSR), which scores variants based on how much they resemble SNPs that are also found in a database of high-quality SNPs.
Does this seem a reasonable approach? If so, any advice on how best to do this? I assume some sort of clustering-based approach would be best?
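To make the idea a bit more concrete, here is a rough sketch of the simplest version I can think of: standardize a few per-call attributes and flag any tumor call that sits close to a Control SNP in that feature space. Everything here is an assumption for illustration: the function names, the three features (read depth, alt allele count, allele fraction), the toy numbers and the distance threshold are all made up, and nearest-neighbour distance is just a stand-in for whatever clustering or classification method turns out to be appropriate.

```python
import math
from statistics import mean, pstdev

def zscore_params(rows):
    """Per-feature (mean, sd) over all rows; sd falls back to 1.0 if zero."""
    return [(mean(col), pstdev(col) or 1.0) for col in zip(*rows)]

def standardize(row, params):
    """Put one feature vector on the z-score scale."""
    return [(x - m) / s for x, (m, s) in zip(row, params)]

def label_calls(tumor_rows, control_rows, threshold=1.0):
    """Label each tumor call 'low' confidence if it lies within `threshold`
    (hypothetical cut-off, in z-score units) of any Control SNP."""
    params = zscore_params(list(tumor_rows) + list(control_rows))
    controls = [standardize(r, params) for r in control_rows]
    labels = []
    for row in tumor_rows:
        point = standardize(row, params)
        nearest = min(math.dist(point, c) for c in controls)
        labels.append("low" if nearest < threshold else "high")
    return labels

# Toy feature vectors: (read depth, alt allele count, allele fraction).
control_calls = [(30, 2, 0.07), (25, 2, 0.08), (40, 3, 0.075)]  # swapped run
tumor_calls = [(28, 2, 0.07), (120, 45, 0.375)]                 # normal run

labels = label_calls(tumor_calls, control_calls)
print(labels)
```

With these toy values the first tumor call looks like the Control SNPs and the second does not. On real data the threshold would need tuning, and with only a handful of Control SNPs any such cut-off would be shaky.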
Cheers,
Tom
EDIT: In response to Malachi's questions, some more details: I have 8 tumor and matched control exomes from non-smoking lung cancer patients. Our initial aim is to identify somatic SNPs and INDELs, although I would like to extend this to identification of susceptibility factors and mutation hotspots later if at all possible.
For the MuTect analysis, I am employing the HC + PON mode by first running MuTect on my 8 normal samples in single-sample mode and then generating a VCF containing all variants found in at least 2 normal samples. All parameters are at their defaults except min_qscore 20 and gap_events_threshold 2.
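For clarity, the "found in at least 2 normal samples" step amounts to something like the sketch below. It assumes each normal's calls have already been read into (chrom, pos, ref, alt) tuples; that representation, the function name `build_panel` and the toy variants are purely illustrative and not how MuTect or any particular tool stores them.

```python
from collections import Counter

def build_panel(per_sample_calls, min_samples=2):
    """Return variants seen in at least `min_samples` normal samples."""
    counts = Counter()
    for calls in per_sample_calls:
        counts.update(set(calls))  # set(): each sample counts a variant once
    return sorted(v for v, n in counts.items() if n >= min_samples)

# Toy single-sample call sets for three normals (made-up variants).
normal_calls = [
    [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")],
    [("chr1", 1000, "A", "G")],
    [("chr3", 42, "G", "A"), ("chr2", 500, "C", "T"), ("chr2", 500, "C", "T")],
]

panel = build_panel(normal_calls)
print(panel)
```

The duplicate entry in the third sample is deliberate: a variant listed twice in one sample should still count as one sample.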
Hi Tom,
This may not be the proper answer to your question, but here is what I found to work best. The problem with MuTect is that it is too sensitive and catches a lot of low-frequency mutations, as you are observing in your data. That is also a good thing, since some of them are rare mutations, which usually occur at low VAF. So what you could do is run VarScan2 somatic on the same dataset and combine both sets of results. You could later remove potential false positives using this excellent script by Cyriac Kandoth.
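As a rough illustration of the combining step, the sketch below splits calls into those made by both callers and those unique to each. It assumes the calls from each tool have already been reduced to (chrom, pos, ref, alt) tuples; the function name and the toy variants are hypothetical, and whether you keep the intersection or the union afterwards is a judgment call that depends on how much sensitivity you are willing to trade.

```python
def compare_callers(mutect_calls, varscan_calls):
    """Partition variant keys by which caller(s) reported them."""
    mutect, varscan = set(mutect_calls), set(varscan_calls)
    return {
        "both": sorted(mutect & varscan),
        "mutect_only": sorted(mutect - varscan),
        "varscan_only": sorted(varscan - mutect),
    }

# Toy call sets (made-up variants).
mutect_calls = [("chr1", 1000, "A", "G"), ("chr7", 250, "C", "T")]
varscan_calls = [("chr1", 1000, "A", "G"), ("chr9", 77, "G", "A")]

result = compare_callers(mutect_calls, varscan_calls)
print(result["both"])
```

Calls in "both" are the natural high-confidence set; the caller-specific ones are where a false-positive filter would be most useful.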
@poisonAlien
Thanks. I was thinking of calling with another tool and comparing the outputs. I'll definitely have a look at the script too. I'm not great with Perl, but it seems well commented throughout, which helps!
Can you elaborate a bit more on the analysis scenario? Do you have only tumor/normal pairs (matched)? Are you also using a pool of unrelated normals? What tumor type are you studying?