I recently tried to apply a simplified version of the equations from Chapter 3.1.3 of http://www.broadinstitute.org/gsa/wiki/images/1/10/Samtools.pdf (modelling sequencing errors/ practical calculation), but the equations don't make sense to me:
Either we're calculating the likelihood of a sequence of observations, in that case P(D|1)
should simply be (2^-n)
, or we're aggregating the data before into k
reference alleles and (n-k)
alternate alleles, in which case P(D|0)
and P(D|2)
should also contain the term (n over k)
. GATK seems to agree with my view: http://www.broadinstitute.org/gsa/wiki/index.php/File:UGgenotypeLikelhoods.jpg.
Intuitively, the samtools method seems to make sense, too. But GATK and samtools can give wildly different results (not just from the different error model, this equation alone can make a difference of many orders of magnitude!), so they cannot both be right. My gut feeling is that samtools would give lots of false positive heterozygous calls. And when I applied the equations to estimate the fraction of heterozygous sites in a sample, I got a gross overstimate.
Has anyone else noticed this? What are you using to call heterozygotes and why? Or am I simply confused?
To clarify further: P(D|0) and P(D|2) express the probability of seeing a specific sequence of n alleles of two kinds, but P(D|1) expresses the probability of seeing a specific set of k alleles of one kind and (n-k) alleles of another kind. Either works, but not in combination.
If something like this is the reasoning behind the equations, they are wrong. And if not, what is the reasoning, because I can't figure it out.
The error modeling is omitted due to the page limit, so this question is not answered in that paper.
If you have not seen it, this paper by Heng Li gives additional/updated details on the samtools model http://bioinformatics.oxfordjournals.org/content/early/2011/09/08/bioinformatics.btr509.short
If you have not seen it, you might be interested in this paper by Heng Li with more/updated details on the samtools model: Genotype Likelihood Calculation In Samtools Is Wrong?
As I misunderstood your question... The paper jts refers to gives the likelihood computation in mutli-ploidy cases. That formula is reduced to the GATK one for diploid samples. Nonetheless, that formula is not used in samtools.
Hi Marvin,
Could you please send me the articles in this link and thisone? I can't open the website right now. My email address is devinliao0918@gmail.com. Thanks
By the way, did you ever try to implement the formula to calculate the genotype likelihood by yourself?