According to plink documentation, a hardcall is saved when the distance from the nearest hardcall, defined as 0.5 * sum_i |x_i - round(x_i)| (where the x_i's are 0..2 allele dosages), is not greater than 0.1.
Truthfully, this doesn't make sense to me (especially why x_i is being summed) and was hoping if someone could help break down math.
Many thanks!
So lets say you had single dosages representing only 2 alleles.
Example 1:
A/T = 1.900
0.5* (|1.900-2| + |0.100 - 0|)
= 0.5 * (0.100 + 0.100)
= 0.5 * 0.200
= 0.100
As this is not greater than 0.1, we would code the dosage as 2 (for TT).
Example 2:
A/T = 0.980
0.5 * (|0.980 -1| + |1.02 - 1|)
=0.5 * (0.020 + 0.020)
=0.5 * 0.040
= 0.020
In this example, we would code it as A and T (heterogeneous).
My concern is how do you "decide" which alleles are present. Looking at the examples and my first example, it seems obvious. But I want to be able to program logic in a python script to check my understanding and compare my results against Plink.
I appreciate your help!
Just look at when round(x_i) isn't zero.
Clever. =)
Thank you