I am struggling to understand the first two of the following equations, which are from the Supplementary Materials of the Conpair paper and describe how genotypes are calculated:
For the first equation, is the probability of D|AA really much lower when all my reads are A than when all my reads are B? Does e_j mean what I think it means? I would think a low e_j means that the call is more reliable. Example: my error rate is .01 and D={A,A,A,A}:
P(D|AA) = (.01^4)(.99^0) = 1e-8
But if D={B,B,B,B}, the calculation comes to:
P(D|AA) = (.01^0)(.99^4) ~ .96
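The arithmetic above can be checked with a short Python sketch. It only reproduces the reading of the first equation described in this post, where a read matching the genotype contributes e and a mismatching read contributes 1 - e; the function name `likelihood_as_read` is made up for illustration and is not from the Conpair code.

```python
def likelihood_as_read(reads, e):
    """P(D|AA) under the reading described in the post:
    each A read contributes e, each non-A read contributes 1 - e."""
    p = 1.0
    for base in reads:
        p *= e if base == "A" else (1.0 - e)
    return p


all_a = likelihood_as_read("AAAA", 0.01)  # every read supports AA
all_b = likelihood_as_read("BBBB", 0.01)  # no read supports AA
print(all_a, all_b)
```

Under this reading the likelihood of AA goes up as the evidence for A goes down, which is why the equation looks suspicious.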
For the second equation, the occurrences of A are not considered at all? The index and upper bound for both operators are exactly the same, if I'm reading that right? Is it just me or are there a bunch of typos here?
Citation: Bergmann EA, Chen BJ, Arora K, Vacic V, Zody MC. Conpair: concordance and contamination estimator for matched tumor-normal pairs. Bioinformatics. 2016 Oct 15;32(20):3196-3198. doi: 10.1093/bioinformatics/btw389. Epub 2016 Jun 26. PMID: 27354699; PMCID: PMC5048070.
Are you suggesting reviewers are doing a lousy job? SHOCKING!
More seriously, if you go to Heng's note (p20), yes they got it wrong. They mislabeled AA and BB compared to Heng's 0 and m (plus other typos).
http://lh3lh3.users.sourceforge.net/download/samtools.pdf
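For comparison, here is a minimal Python sketch of the genotype likelihood as I recall it from Heng's note, so please check it against the PDF before relying on it. The assumptions: m is the ploidy, g is the number of reference alleles in the genotype (so g = m plays the role of AA if A is the reference allele), and each read carries its own error rate. The function and argument names are made up for this example.

```python
def genotype_likelihood(ref_errors, alt_errors, g, m=2):
    """Likelihood of a genotype carrying g reference alleles (ploidy m).

    ref_errors: error rates e_j of the reads showing the reference allele
    alt_errors: error rates e_j of the reads showing the other allele
    """
    k = len(ref_errors) + len(alt_errors)  # total number of reads
    lik = 1.0 / m**k
    for e in ref_errors:
        # a reference read is either correct (prob 1 - e) from a reference
        # allele or an error (prob e) from a non-reference allele
        lik *= (m - g) * e + g * (1.0 - e)
    for e in alt_errors:
        # the mirror case for a read showing the other allele
        lik *= (m - g) * (1.0 - e) + g * e
    return lik


four_a_reads = [0.01] * 4
hom_ref = genotype_likelihood(four_a_reads, [], g=2)  # AA
het = genotype_likelihood(four_a_reads, [], g=1)      # AB
hom_alt = genotype_likelihood(four_a_reads, [], g=0)  # BB
print(hom_ref, het, hom_alt)
```

With four A reads at an error rate of .01, this gives about .96 for AA and 1e-8 for BB, i.e. the two values from the question with the genotype labels swapped.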
I thought only I was allowed to be lousy!
Thanks for the reference, so far the equations make more sense.