I have a consensus fasta file generated by samtools/bcftools/vcfutils as discussed in a previous post here: How to generate a consensus fasta sequence from SAM tools pileup?
In some sites, I see e.g. R, which means that site was heterozygous A/G.
How does it determine it's heterozygous? E.g. if you see 50 A's and 39 G's, how do you know it's a het? By using a minor allele % threshold? Or using a probability model? If it uses probability, are there any references out there that describe the math?
Thanks!
For those interested in the math, see the heading "Consensus genotype calling" in the "Methods" section of the paper.
Basically, it's a Bayesian probability model where you want P(genotype g | observed data D), which you can get from Bayes' theorem if you have P(D | g), which the paper explains how to get. Then you estimate g as g* = argmax[g] P(g | D)