Question

How Is Heterozygosity Determined In A Consensus Fasta?

0

Entering edit mode

11.6 years ago

Justin ▴ 470

I have a consensus fasta file generated by samtools/bcftools/vcfutils as discussed in a previous post here: How to generate a consensus fasta sequence from SAM tools pileup?

In some sites, I see e.g. R, which means that site was heterozygous A/G.

How does it determine it's heterozygous? E.g. if you see 50 A's and 39 G's, how do you know it's a het? By using a minor allele % threshold? Or using a probability model? If it uses probability, are there any references out there that describe the math?

consensus samtools • 3.9k views

ADD COMMENT • link updated 11.6 years ago by Sean Davis 27k • written 11.6 years ago by Justin ▴ 470

score 1 · Answer 1 · 2013-09-18

1

Entering edit mode

11.6 years ago

Sean Davis 27k

See this link:

http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_FAQ#How_are_SNPs_and_indels_called_and_filtered_by_SAMtools.3F

In short, the MAQ paper is where the method is described:

http://www.ncbi.nlm.nih.gov/pubmed/18714091

ADD COMMENT • link 11.6 years ago by Sean Davis 27k

0

Entering edit mode

Thanks!

For those interested in the math, see the heading "Consensus genotype calling" in the "Methods" section of the paper.

Basically, it's a Bayesian probability model where you want P(genotype g | observed data D), which you can get from Bayes' theorem if you have P(D | g), which the paper explains how to get. Then you estimate g as g* = argmax[g] P(g | D)

ADD REPLY • link 11.6 years ago by Justin ▴ 470