Question

Identifying false positives in MuTect output

0

11.0 years ago

smithtomsean ▴ 220

Hi all,

First post on the site, although I've been benefiting from the questions on here for a while. I've been thinking about some SNP filtering and was hoping for some feedback on my ideas. I've been using MuTect to identify somatic SNPs in tumor samples using a matched control and a panel of normal mutations. I'm still getting quite a few SNPs called with a very low MAF (<0.1) and I'm keen to remove false positives where possible.

One idea I had was to run MuTect with the control and tumor BAMs swapped around and investigate the SNPs which are not rejected by the Bayesian classifier or subsequent filtering (Control SNPs). Assuming that these are all false positives (the frequency of back-mutations should be vanishingly small), I'd expect the attributes of these SNPs (read depth, alternate allele count, power to call strand imbalance, etc.) to be similar to those of false-positive SNPs in the tumor sample. I can therefore use these attributes to classify the tumor mutations as low or high confidence depending on how much they resemble the Control SNPs. In essence this would be similar, but opposite, to the GATK Variant Quality Score Recalibration (VQSR) tool, which scores variants based on how much they resemble SNPs which are also found in a database of high-quality SNPs.

Does this seem a reasonable approach? If so, any advice on how best to do this? I assume some sort of clustering-based approach would be best?
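
As a rough illustration of the idea above (not a validated method), the following Python sketch scores each tumor call by its average distance to the k nearest Control SNPs in standardized feature space. It uses a nearest-neighbour distance rather than a true clustering step. The choice of features (read depth, alternate allele count, allele fraction), the function name and the value of k are all assumptions made for the example; extracting the features from MuTect output is left out.

```python
import math
from statistics import mean, pstdev


def knn_distance_to_controls(control_rows, tumor_rows, k=3):
    """Mean distance from each tumor call to its k nearest control calls.

    Each row is a list of numeric features for one call. A LOW score means
    the tumor call resembles the presumed false positives (Control SNPs).
    """
    # Standardize every feature using the pooled calls so that no single
    # feature (e.g. read depth) dominates the Euclidean distance.
    pooled = control_rows + tumor_rows
    columns = list(zip(*pooled))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # guard against zero spread

    def standardize(row):
        return [(x - m) / s for x, m, s in zip(row, means, sds)]

    controls = [standardize(row) for row in control_rows]
    scores = []
    for row in tumor_rows:
        z = standardize(row)
        nearest = sorted(math.dist(z, c) for c in controls)[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores


# Hypothetical feature rows: [read depth, alt allele count, allele fraction]
control = [[30, 3, 0.05], [28, 2, 0.06], [35, 3, 0.04]]
tumor = [[32, 3, 0.05], [120, 40, 0.35]]
scores = knn_distance_to_controls(control, tumor)
```

Tumor calls could then be ranked by score, with the cut-off between "low" and "high" confidence chosen from validated calls if any are available.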

Cheers,
Tom

EDIT: In response to Malachi's questions, some more details: I have 8 tumor and matched control exomes from non-smoking lung cancer patients. Our initial aim is to identify somatic SNPs and INDELs, although I would like to extend this to identification of susceptibility factors and mutation hotspots later if at all possible.

For the MuTect analysis, I am employing the HC + PON mode by first running MuTect on my 8 normal samples in single-sample mode and then generating a VCF containing all variants found in at least 2 normal samples. All parameters are default except min_qscore 20 and gap_events_threshold 2.
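
The "found in at least 2 normal samples" rule can be sketched in Python as below. This is only an illustration: it assumes the per-sample calls have already been parsed into (chrom, pos, ref, alt) tuples, and the function name is hypothetical, not part of MuTect.

```python
from collections import Counter


def build_panel_of_normals(per_sample_calls, min_samples=2):
    """Return the sites called in at least `min_samples` normal samples.

    `per_sample_calls` holds one list of (chrom, pos, ref, alt) tuples
    per normal sample.
    """
    counts = Counter()
    for calls in per_sample_calls:
        # set() ensures a site is counted once per sample, even if it
        # appears twice in that sample's call list.
        counts.update(set(calls))
    return sorted(site for site, n in counts.items() if n >= min_samples)


# Hypothetical calls from three normal samples
normals = [
    [("chr1", 100, "A", "T"), ("chr2", 50, "G", "C")],
    [("chr1", 100, "A", "T")],
    [("chr3", 7, "C", "A"), ("chr3", 7, "C", "A")],
]
panel = build_panel_of_normals(normals)
```

Here only the chr1 site is retained, because it is the only one seen in two different samples.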

sequencing SNP next-gen • 8.8k views

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 11.0 years ago by smithtomsean ▴ 220

1

Hi Tom,

This may not be the proper answer to your question, but here is what I have found to work best. The problem with MuTect is that it is too sensitive and catches a lot of low-frequency mutations (as you are observing in your data; this is also a good thing, since some of them are rare mutations, which usually occur at low VAF). So what you could do is run VarScan2 somatic on the same dataset and combine both sets of results. You could later remove potential false positives using this excellent script by Cyriac Kandoth.
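
A minimal sketch of the combining step, assuming the calls from each tool have already been parsed into (chrom, pos, ref, alt) tuples; the function name is hypothetical and this does not reproduce the filtering script mentioned above. Recording which caller supports each site keeps both the union and the intersection available later.

```python
def combine_calls(mutect_calls, varscan_calls):
    """Map each site in the union of both call sets to its supporting callers."""
    mutect, varscan = set(mutect_calls), set(varscan_calls)
    combined = {}
    for site in mutect | varscan:
        callers = []
        if site in mutect:
            callers.append("MuTect")
        if site in varscan:
            callers.append("VarScan2")
        combined[site] = callers
    return combined


# Hypothetical parsed calls
mutect_calls = [("chr1", 100, "A", "T"), ("chr2", 50, "G", "C")]
varscan_calls = [("chr1", 100, "A", "T"), ("chr5", 9, "T", "G")]
combined = combine_calls(mutect_calls, varscan_calls)

# Sites supported by both callers (the intersection)
both = sorted(site for site, callers in combined.items() if len(callers) == 2)
```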

ADD REPLY • link updated 3.7 years ago by Ram 45k • written 11.0 years ago by poisonAlien ★ 3.2k

0

@poisonAlien

Thanks. I was thinking of calling with another tool and comparing the outputs. I'll definitely have a look at the script too. I'm not great with Perl, but it seems suitably commented throughout, which helps!

ADD REPLY • link updated 3.7 years ago by Ram 45k • written 11.0 years ago by smithtomsean ▴ 220

0

Can you elaborate a bit more on the analysis scenario? Do you have only tumor/normal pairs (matched)? Are you also using a pool of unrelated normals? What tumor type are you studying?

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 11.0 years ago by Malachi Griffith 20k

Ram · Answer 1 · 2014-08-05

2

11.0 years ago

Christian ★ 3.1k

This sounds like a good idea, but I would be even more aggressive with filtering: just throw out any tumor variant that shows up on reads in your normal panel. Depending on the size of your normal panel, you might require the variant to be present in two or more normal samples, on reads with good base quality, to account for sequencing errors. In my experience this is one of the most effective filters of false positives and reduces the number of MuTect variants by more than 50% without much compromise in sensitivity.
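
The filter described above could look roughly like this in Python. It assumes that, for every tumor variant, the base qualities of alt-supporting reads have already been collected per normal sample (for example from pileups); the function names and the base-quality threshold of 20 are assumptions for illustration only.

```python
def seen_in_normals(alt_base_quals_by_sample, min_base_qual=20, min_samples=2):
    """True if enough normal samples carry a good-quality alt read.

    `alt_base_quals_by_sample` holds one list per normal sample, containing
    the base qualities of reads supporting the alt allele at this site.
    """
    supporting = sum(
        1
        for quals in alt_base_quals_by_sample
        if any(q >= min_base_qual for q in quals)
    )
    return supporting >= min_samples


def filter_tumor_variants(tumor_variants, normal_evidence, **thresholds):
    """Keep tumor variants that are NOT seen in the normal panel."""
    return [
        v
        for v in tumor_variants
        if not seen_in_normals(normal_evidence.get(v, []), **thresholds)
    ]


# Hypothetical evidence: variant -> per-normal lists of alt-read base qualities
evidence = {
    "v1": [[30], [25, 10], []],  # good-quality alt reads in two normals
    "v2": [[10], [12]],          # only low-quality alt reads
    "v3": [[35], []],            # alt read in a single normal only
}
kept = filter_tumor_variants(["v1", "v2", "v3"], evidence)
```

With these inputs only v1 is removed; v2 is kept because its normal reads fall below the quality threshold, and v3 because it appears in just one normal.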

ADD COMMENT • link 11.0 years ago by Christian ★ 3.1k

3

One would want to be careful with this strategy in cases where there is a prior expectation of tumor DNA contaminating the normal DNA sample (assuming the normal samples are matched of course). In AML, for example, contamination of the normal DNA with tumor DNA can seriously affect sensitivity for determining somatic variations. Analysis strategy may need to be altered based on this factor and others that vary between tumor types.

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 11.0 years ago by Malachi Griffith 20k

0

@ Malachi

That's definitely something worth considering, and hence why I've only included variants in more than one normal sample in my "panel of normal" vcf. In this case, I'm not overly worried about contamination from tumor to normal tissue.

ADD REPLY • link 11.0 years ago by smithtomsean ▴ 220

0

@ Christian

Thanks. I have indeed used the panel of normal option to remove tumor variants in my normal sample, with every variant found in any 2 of 8 normal samples retained in the "panel of normal" vcf. As you say, this removed a considerable number of variants.

I could remove more variants by including all normal variants. However, my approach was more to try and characterise the false positives from mutect to inform a post-mutect filtering step.

As an aside, what sort of validation rates are you getting with MuTect?

ADD REPLY • link 11.0 years ago by smithtomsean ▴ 220

0

We have not performed any systematic validation of MuTect variants yet, but I would be interested in validation rates as well. In part, we cannot validate by Sanger sequencing because the allelic frequencies are well below 20%.

ADD REPLY • link 10.9 years ago by Christian ★ 3.1k

0

I would like to ask some questions related to MuTect, which I have been using for 2 of my tumor samples. One of my tumor samples is 25% contaminated. I have paired normal/tumor samples. I used --fraction_contamination 0.25, but it only gives me 3 variants with the KEEP flag, whereas VarScan gives me over 800 with that contamination fraction. Can you tell me what the parameter should be in this case? Below is the command I am using:

java \
  -Xmx14g \
  -jar /scratch/GT/softwares/mutect/muTect-1.1.4.jar \
  --analysis_type MuTect \
  --reference_sequence /scratch/GT/vdas/test_exome/exome/hg19.fa \
  --cosmic /data/PGP/exome/mutect/hg19/hg19_cosmic_v54_120711.vcf \
  --dbsnp /scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf \
  --input_file:normal /scratch/GT/vdas/pietro/exome_seq/results/N_S8981/N_S8981.realigned.recal.bam \
  --input_file:tumor /scratch/GT/vdas/pietro/exome_seq/results/T_S7999/T_S7999.realigned.recal.bam \
  --out /scratch/GT/vdas/pietro/exome_seq/results/mutect/exonic_call/mutect_S_333soma_t_ex25.txt \
  --coverage_file /scratch/GT/vdas/pietro/exome_seq/results/mutect/exonic_call/LGex25.coverage.wig.txt \
  --vcf /scratch/GT/vdas/pietro/exome_seq/results/mutect/exonic_call/mutect_S_333soma_t_ex25.vcf \
  --intervals /scratch/GT/vdas/referenceBed/hg19/ss_v4/Exon_V4_clean.list \
  --fraction_contamination 0.25 \
  -ip 50 \
  -tdf /scratch/GT/vdas/pietro/exome_seq/results/mutect/exonic_call/LGex_25.tdf

I have results which are interesting, especially when removing the fraction contamination. I am using the variants from both VarScan2 and MuTect, combining them by taking their union rather than their intersection, since MuTect gives a lot more variants at low frequency. I would like to ask: if my tumor is contaminated with 30% normal cells, which option should I use in MuTect? Is my option correct? I have only 1 normal/tumor pair, and it is impure. I would appreciate some suggestions.
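
To see what 30% normal cells does to the allele fractions one should expect, here is a small Python sketch. It assumes a clonal mutation and, by default, a heterozygous variant in a copy-neutral diploid region; the function and parameter names are hypothetical. It does not settle which MuTect option is appropriate. As far as I understand the MuTect documentation, --fraction_contamination refers to contamination with reads from other individuals rather than to normal-cell admixture, but that should be checked against the documentation for the version in use.

```python
def expected_vaf(purity, mutant_copies=1, tumor_copy_number=2, normal_copy_number=2):
    """Expected variant allele fraction of a clonal somatic mutation.

    `purity` is the fraction of tumor cells in the sample (0-1).
    """
    mutant_alleles = purity * mutant_copies
    total_alleles = purity * tumor_copy_number + (1.0 - purity) * normal_copy_number
    return mutant_alleles / total_alleles


# 30% normal cells means a purity of 0.7
vaf_at_70_percent_purity = expected_vaf(0.7)
```

Under these assumptions a heterozygous clonal mutation would be expected at an allele fraction of about 0.35 instead of 0.5, and subclonal mutations lower still.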

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by ivivek_ngs ★ 5.2k