I have been trying to get my head around the different methods used for detecting gene-gene interactions, using "A survey of statistical methods for gene-gene interaction in case-control genome-wide association studies" among other sources. In the section on regression-based models, logistic regression is mentioned. It's easy enough to run this in PLINK, but I'm struggling to understand exactly what the raw dataset would look like if, for example, I wanted to run this on simulated data in R.
It's quite hard to display nice equations here, so I will try to simplify somewhat:
logit ( P (Y=1 | (Xi, Xj) = (xi, xj) )) = B0 + B1.I(xi=A) + B2.I(xj=B) + B3.I((xi,xj)=(A,B))
Where the link function is logit, xi is SNP1, xj is SNP2, B0 (beta 0) is the intercept and so on. A and B refer to the alleles of SNP1 and SNP2 respectively, but I'm not sure whether this is a direct allele count (presumably for the disease susceptibility locus, DSL) or something else. What is the value of I(xi=A)? From the typesetting in the paper mentioned above, I appears to be an indicator function. I think in this context that implies that the value of I for SNP xi (denoted Ixi) would equal 1 if the allele was 'minor' (e.g. A).
I believe, though, that this GLM is for additive genetic disease, and there is talk of allele dosage, which makes me think Ixi may be the number of minor alleles instead (e.g. 0, 1, 2, assuming the locus is diallelic).
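To make the two readings concrete, here is a rough sketch of how each would code a single genotype. It is written in Python purely for illustration. The function names, the two-letter genotype strings and the choice of "carries at least one minor allele" for the indicator are my own assumptions, not something taken from the paper.

```python
# Two candidate codings for one SNP whose minor allele is "A" (assumed).
# Genotypes are written as two-letter strings such as "aa", "Aa", "AA".

def indicator_coding(genotype, minor="A"):
    """Reading 1: 1 if the genotype carries at least one minor allele, else 0."""
    return int(minor in genotype)

def dosage_coding(genotype, minor="A"):
    """Reading 2: number of minor alleles (0, 1 or 2 at a diallelic locus)."""
    return genotype.count(minor)

genotypes = ["aa", "Aa", "AA"]
coding_table = [(g, indicator_coding(g), dosage_coding(g)) for g in genotypes]

print("genotype | indicator | dosage")
for genotype, indicator, dosage in coding_table:
    print(f"{genotype:8} | {indicator:9} | {dosage}")
```

The two codings only differ for the minor-allele homozygote, which is coded 1 under the indicator reading and 2 under the dosage reading.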
I think the issue would be easy enough to resolve with a mock dataset but I can't find any good examples online.
Subject (case = 1, control = 0) | SNP1 | SNP2 | Interaction |
Could someone clarify what the values for SNP1 and SNP2 should be for this model? Is it DSL allele counts or something else? Presumably the interaction term is just SNP1 * SNP2 but I'd be grateful if this could also be clarified.
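For what it's worth, this is the kind of mock dataset I have in mind if the dosage reading is the right one. It is only a sketch in Python under my own assumptions: the minor allele frequencies and the beta values are made up, the interaction column is simply SNP1 * SNP2, and subjects are sampled from a population rather than with a proper case-control design.

```python
import math
import random

def simulate_subject(rng, maf1=0.3, maf2=0.2, b0=-1.0, b1=0.3, b2=0.2, b3=0.5):
    """Simulate one subject; all parameter values are hypothetical."""
    # Minor-allele dosage (0, 1 or 2): two independent allele draws per SNP.
    snp1 = sum(rng.random() < maf1 for _ in range(2))
    snp2 = sum(rng.random() < maf2 for _ in range(2))
    interaction = snp1 * snp2  # assumed product coding of the interaction term
    # logit(P(Y = 1)) = b0 + b1*snp1 + b2*snp2 + b3*interaction
    eta = b0 + b1 * snp1 + b2 * snp2 + b3 * interaction
    p_case = 1.0 / (1.0 + math.exp(-eta))
    status = int(rng.random() < p_case)  # 1 = case, 0 = control
    return {"status": status, "snp1": snp1, "snp2": snp2, "interaction": interaction}

rng = random.Random(1)  # fixed seed so the table is reproducible
dataset = [simulate_subject(rng) for _ in range(10)]

print("status | snp1 | snp2 | interaction")
for row in dataset:
    print(f"{row['status']:6} | {row['snp1']:4} | {row['snp2']:4} | {row['interaction']}")
```

If this layout is right, I assume the R equivalent would be a data frame with the same columns passed to glm with a binomial family, but that is exactly the part I would like confirmed.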
Thanks for your time!