Question

Population genetics with mutect2 data

4 months ago

slzr_ ▴ 10

Hey guys, I am currently evaluating nearly 50 genes in a group of samples, and the variant calling was performed using Mutect2. Is it possible to conduct population genetics analyses, such as Fst, Tajima's D, etc., with data obtained from Mutect2? I know that most analyses are more reliable when performed using HaplotypeCaller.

haplotypecaller mutect2 • 600 views

updated 4 months ago by LauferVA 4.8k • written 4 months ago by slzr_ ▴ 10

score 2 · Answer 1 · 2025-03-17

Hi @slzr_

In short, no. What I mean is, it’s generally not recommended to use Mutect2 calls directly for classical population genetics analyses (like Fst, Tajima's D, etc.).

Analyses that employ metrics like Fst, Tajima's D, etc. almost always assume germline variation rather than somatic variation, whereas Mutect2 is optimized for detecting somatic mutations in tumor–normal comparisons.

Why is Mutect2 inappropriate? Because of this difference, Mutect2 applies filters and heuristics that differ significantly from those of tools dedicated to germline calling. For instance, Mutect2 may aggressively filter out certain sites based on tumor/normal differences. This makes sense for somatic variant calling, but in a germline context you would not necessarily even have such paired samples, which makes the purpose of using Mutect2 unclear. Stated a bit differently, with Mutect2 you would NOT expect to have high-confidence genotypes at every site in every sample the way you would with germline pipelines, yet that is exactly what you want for downstream pop genetics studies.
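To make the "every site in every sample" point concrete, here is a minimal sketch of a Hudson-style Fst estimate (ratio of averages) in plain Python. The function name `hudson_fst` and the input layout are assumptions for illustration, not any library's API. Each site needs an ALT allele count and a count of called alleles in both populations, which is exactly the information that gets lost when a somatic caller filters a site or leaves samples ungenotyped.

```python
def hudson_fst(counts_pop1, counts_pop2):
    """Hudson-style Fst as a ratio of summed numerators and denominators.

    Each argument is a list with one (alt_allele_count, called_alleles)
    tuple per site; both lists must cover the same sites in the same order.
    This input layout is an assumption made for this sketch.
    """
    numerator = 0.0
    denominator = 0.0
    for (alt1, n1), (alt2, n2) in zip(counts_pop1, counts_pop2):
        # Sites without at least two called alleles per population are skipped;
        # heavy filtering therefore silently shrinks the data set.
        if n1 < 2 or n2 < 2:
            continue
        p1 = alt1 / n1
        p2 = alt2 / n2
        numerator += (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
        denominator += p1 * (1 - p2) + p2 * (1 - p1)
    return numerator / denominator if denominator > 0 else float("nan")


# Toy data: one fixed difference, then one site with identical frequencies.
fixed = hudson_fst([(10, 10)], [(0, 10)])
same = hudson_fst([(5, 10)], [(5, 10)])
print(fixed, same)
```

With only a handful of usable sites the estimate can swing widely, and small negative values (as in the identical-frequency toy case) are normal for this estimator.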

So what would you use? For classical pop genetics you would likely run something like HaplotypeCaller followed by joint genotyping to obtain accurate germline calls, then calculate statistics like those above from these calls to get robust population genetics analyses.
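A rough sketch of that germline workflow, written as a small Python helper that only builds the GATK4 command lines (it does not run them). The helper name `germline_commands`, the file names, and the interval file for the ~50 genes are placeholders assumed for illustration. The steps shown are the usual per-sample GVCF calling, combining, and joint genotyping; check the options against the GATK version you actually have installed.

```python
import shlex


def germline_commands(reference, bams, intervals, cohort_prefix="cohort"):
    """Build GATK4 command lines: per-sample GVCFs, combine, joint genotyping.

    `bams` maps a sample name to its BAM path. All paths are placeholders.
    """
    commands = []
    gvcfs = []
    for sample, bam in bams.items():
        gvcf = f"{sample}.g.vcf.gz"
        gvcfs.append(gvcf)
        # Per-sample calling in GVCF mode, restricted to the target genes.
        commands.append([
            "gatk", "HaplotypeCaller",
            "-R", reference, "-I", bam, "-L", intervals,
            "-ERC", "GVCF", "-O", gvcf,
        ])
    # Merge the per-sample GVCFs into one cohort GVCF.
    combine = ["gatk", "CombineGVCFs", "-R", reference]
    for gvcf in gvcfs:
        combine += ["-V", gvcf]
    combine += ["-O", f"{cohort_prefix}.g.vcf.gz"]
    commands.append(combine)
    # Joint genotyping gives a genotype for every sample at every variant site.
    commands.append([
        "gatk", "GenotypeGVCFs",
        "-R", reference, "-V", f"{cohort_prefix}.g.vcf.gz",
        "-O", f"{cohort_prefix}.vcf.gz",
    ])
    return commands


cmds = germline_commands(
    "ref.fasta",
    {"sampleA": "sampleA.bam", "sampleB": "sampleB.bam"},
    "genes.bed",
)
for cmd in cmds:
    print(shlex.join(cmd))
```

Building the commands as lists keeps quoting safe if you later pass them to `subprocess.run`. For larger cohorts, GenomicsDBImport is commonly used in place of CombineGVCFs.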

Additional problems suggested by your question: please keep in mind that having only ~50 genes may limit statistical power for measures like Fst and Tajima's D, which typically benefit from larger genomic regions. It is also possible, even likely, that depending on the identity of those genes, you would generate results that don't generalize to the whole genome.
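To see why a small number of segregating sites matters, here is a minimal sketch of the standard Tajima's D formula in plain Python. The function name `tajimas_d` and its inputs are assumptions for illustration. It also assumes germline genotypes with no missing calls, so that every site has the same number of sampled chromosomes `n`.

```python
import math


def tajimas_d(alt_counts, n):
    """Tajima's D from per-site ALT allele counts.

    alt_counts: ALT allele count at each site (0..n).
    n: number of sampled chromosomes, assumed identical at every site.
    Returns NaN when there are no segregating sites or the variance is not positive.
    """
    # Only sites that are polymorphic in the sample contribute.
    segregating = [k for k in alt_counts if 0 < k < n]
    s = len(segregating)
    if s == 0 or n < 2:
        return float("nan")

    # Mean pairwise differences (pi), summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in segregating)

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return float("nan")
    # Difference between pi and Watterson's estimator (S / a1), standardized.
    return (pi - s / a1) / math.sqrt(variance)


# Toy data with n = 4 chromosomes: all singletons vs. all intermediate-frequency sites.
d_singletons = tajimas_d([1, 1, 1], 4)
d_intermediate = tajimas_d([2, 2, 2], 4)
print(d_singletons, d_intermediate)
```

Both the numerator and the variance term depend on the number of segregating sites, so a gene panel that yields only a few such sites gives a D value that is hard to interpret. An excess of rare alleles pushes D negative and an excess of intermediate-frequency alleles pushes it positive, so low-frequency calls that are not germline variants can be expected to skew the result.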