Hello,
I'm currently testing GATK's tumor-only variant calling pipeline and came across an issue that I'm stuck on. I wanted to see how different the results would be if I ran Mutect2 with and without matched normal by comparing the two outputs (command used available below) using TCGA data. I'm following the current GATK workflow in order to filter the Mutect2 (i.e. FilterMutectCalls) calls and only consider variants with 'PASS' for comparison.
For a given sample, its results were summarized in venn diagram:
I assume Mutect2 is trying to do its best without paired normal sample to reduce errors in its calls, but the small overlap between the results really concern me as I prepare to perform tumor-only calling on my samples. Does anyone have a suggestion on how to impose more stringent filtering (i.e. retrieve variants that fall in both matched and unmatched workflow from the venn diagram)?
I seeked for assistance on GATK forums, but found out they only handle errors and issues with their tools.
Thanks in advance.
# Mutect2 w/ matching normal
gatk Mutect2 -R hg38.fa \
-I input_tumor.bam -I input_normal.bam \
-tumor tumor_sample -normal normal_sample \
-pon gatk4_mutect2_4136_pon.vcf.gz --germline-resource af-only-gnomad.hg38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
-L exome_autoXYM.intervals -O mt2_matched.vcf.gz
# Mutect2 w/ no matching normal
gatk Mutect2 -R hg38.fa \
-I input_tumor.bam \
-pon gatk4_mutect2_4136_pon.vcf.gz \
--germline-resource af-only-gnomad.hg38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--genotype-germline-sites \
-L exome_autoXYM.intervals \
-O mt2_unmatched.vcf.gz
That is quite a worrying finding, but not unexpected. Can you please paste a link to the thread on GATK Forum, please?
Discussion thread can be found here. A developer responded asking how I compiled panel of normals, but not much to go on off of yet.
Are you seeing this with multiple tumor-normal pairs or just one?
Have you manually checked some of the variants in different categories in IGV (or a similar tool)? Do they make sense? For example, do the unmatched-only variants show up in the normal?
newbio17?
Sorry, I didn't get the notifications for the comments. I ran a couple of other matched data and the overlap is still low (10-15%) between the two workflows.
For igor's second comment, I did try running the germline variant calling on the normal samples and tried to see if the germline variants overlap with the unmatched-only variants from tumor-only somatic call. The results show that two sets do not overlap (0-2%).
I think you misunderstood my second comment. Did you manually inspect the variants? I mean visually inspect the BAM files. A lot of problems become very obvious when you actually see the reads.
Ah okay. I did briefly take a look at couple of files I tested, but nothing too out of ordinary. However, I only examined a couple cases such as intron variants in order to make sure the region was consistent with Agilent's target region. What do you recommend I look for specifically (aside from potential misalignments)?
Also, one of GATK developers replied to above-mentioned GATK forum post commenting on tumor-only variant calling and issues associated with it for those who are curious.
Are you targeting non-coding regions?
Do the variants make sense in terms of where they are on your venn diagram? For example, do the unmatched-only variants show up in the normal?
Thanks for letting us know.
I'm not specifically interested in non-coding regions, but I examined them since they showed up to make sure they weren't due to some alignment errors.
Unmatched variants rarely shows up, if at all in normal. Therefore, they are new variant calls that unique to tumor-only protocol.
If I have any updates regarding this, I will follow up on this post. Thanks everyone.
Just to clarify, you see variants in your tumor sample, but not in your normal sample, but they are not being called when you are using that matched normal.
In that case, you can substantially simplify your original question to something like "I see clear variants in my tumor BAM file and not in the matched normal, but why are they not being called?".
Let me clarify what I was doing with the tumor-only approach. I first call somatic variants using Mutect2 on the tumor BAM file. Then I call germline variants using HaplotypeCaller on the normal BAM file. Between the VCF files generated by both tools, there was almost no overlap. Therefore, I concluded that 3911 (from above venn diagram) are not germline variants that were failed to get filtered out in tumor-only approach.
To reiterate, I have the option to either provide normal & tumor BAM files or provide only tumor BAM file to get a single VCF file. I asked my original question in the hopes of receiving some feedback after providing evidence that two different workflows for calling somatic variants display significantly different results when they should be (ideally) similar. So I'm not sure if I fully agree with your simplified version of my question since that is not what I intended to ask in the first place (why they are not in normal seems self-explanatory if you consider that they are two different variant callers serving different purposes).
I realize that is not what you intended to ask, which is why I suggested that should be the first question you should be asking. Your final question has a lot of variables. You should resolve each variable separately. The reason you have poor overlap is because your true somatic variants are not being called in the matched analysis (at least based on what you said in the comments). You need to figure out why they are not being called before you ask why they are not overlapping with variants from a different type of analysis.
Now you are adding yet another variable to your question. Why is HaplotypeCaller relevant here? When you say that the variants are not in the normal, did you actually look at the BAM files yourself (as I suggested) or are you looking at HaplotypeCaller results?
Yeah. I mentioned that I did this for handful of coding regions with no noticeable oddities.
I agree, but what I'm concerned with aren't the missed calls (false negatives), but the false positives.
This became relevant when I tried to examine the false positives were actually germline variants.
Thanks again for your comments and I am more than interested in other suggestions/recommendations if you have any additional ones.