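You include n-1 genotype columns in your regression, where n is the number of alleles at the SNP. (One allele, usually the highest-frequency one, must be omitted to avoid linear dependence in the regression: in a diploid sample the per-allele counts always sum to 2, so a full set of columns would duplicate the intercept.)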
Suppose you have 4 samples; let's label them A, B, C, and D. Sample A has genotype T/T at this SNP, and phenotype value 175. Sample B has genotype C/T and phenotype value 160; sample C has genotype C/C and phenotype value 155; and sample D has genotype T/T and phenotype value 173.
A standard GWAS is based on [phenotype] ~ [genotype, intercept, other predictors] regressions. Ignoring "other predictors" for now, the data matrices for the regression at this SNP would look like

Sample  Phenotype  #C  Intercept
A       175        0   1
B       160        1   1
C       155        2   1
D       173        0   1
I've labeled the single genotype column "#C" here, representing "number of copies of the C allele".
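To make this concrete, here is a minimal sketch of the single-SNP fit in plain Python, using only the four samples above. It uses the closed-form ordinary least squares solution for one predictor plus an intercept; the variable names are my own, and a real GWAS tool would also handle the "other predictors" and compute a test statistic.

```python
# Toy single-SNP regression: phenotype ~ #C + intercept.
# Data are the four samples from the example; names are illustrative.
genotypes = {"A": "T/T", "B": "C/T", "C": "C/C", "D": "T/T"}
phenotypes = {"A": 175.0, "B": 160.0, "C": 155.0, "D": 173.0}

samples = sorted(genotypes)
# "#C" column: number of copies of the C allele in each genotype.
x = [genotypes[s].split("/").count("C") for s in samples]
y = [phenotypes[s] for s in samples]

n = len(samples)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form OLS for one predictor plus an intercept.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx              # estimated effect per copy of C
intercept = mean_y - slope * mean_x

print(f"effect per C allele: {slope:.2f}, intercept: {intercept:.2f}")
```

On this toy data the estimated effect is about -9.9 phenotype units per copy of C, which is the kind of per-allele coefficient the regression is testing.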
Now change sample D's genotype to A/T. On its own, this would leave the original data matrices unchanged, since neither A/T nor T/T has any copies of C. That may actually be fine for detecting whether the C allele has a noticeable effect, but we're now also interested in whether the A allele does. We investigate that by adding a #A column:

Sample  Phenotype  #C  #A  Intercept
A       175        0   0   1
B       160        1   0   1
C       155        2   0   1
D       173        0   1   1
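As a rough sketch of how those columns could be built from genotype strings, the snippet below counts alleles per sample and also shows why the T allele gets no column of its own. The `allele_count` helper is a hypothetical name for illustration, not something from a particular GWAS tool.

```python
# Build [#C, #A, intercept] rows after changing sample D to A/T.
genotypes = {"A": "T/T", "B": "C/T", "C": "C/C", "D": "A/T"}
samples = sorted(genotypes)

def allele_count(genotype, allele):
    """Number of copies of `allele` in a genotype string like 'C/T'."""
    return genotype.split("/").count(allele)

# T is the omitted allele, so it gets no column.
design = [[allele_count(genotypes[s], "C"),
           allele_count(genotypes[s], "A"),
           1] for s in samples]

# Adding #T would be redundant: #C + #A + #T is 2 for every sample,
# i.e. exactly 2 x the intercept column.
totals = [allele_count(genotypes[s], "C")
          + allele_count(genotypes[s], "A")
          + allele_count(genotypes[s], "T") for s in samples]

for s, row in zip(samples, design):
    print(s, row)
print("allele counts per sample:", totals)
```

Because every sample's counts total 2, a #T column would be an exact linear combination of the other columns, which is the linear dependence that omitting one allele avoids.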
Of course, with only 4 samples, we can't conclude much. But (with a good choice of "other predictors") this approach becomes quite effective as your sample size increases.
The model is usually linear, so 0, 1, 2 is the number of minor alleles in the genotype (0 = homozygous major, 1 = heterozygous, 2 = homozygous minor), and the assumption is that two minor alleles have twice the effect of one. This doesn't have to hold for every test and tool, but it is what I've seen. If there are alternative minor alleles, they could be treated as two different SNPs, assumed to have the same effect, or avoided altogether.
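For the "assumed to have the same effect" option, one possible sketch is to collapse every non-major allele into a single 0/1/2 count. The toy genotypes and the `non_major_count` helper below are assumptions for illustration only.

```python
from collections import Counter

# Toy genotypes at one multiallelic SNP (illustrative data).
genotypes = {"A": "T/T", "B": "C/T", "C": "C/C", "D": "A/T"}

# Pick the major allele as the most frequent one across all samples.
allele_freq = Counter(a for g in genotypes.values() for a in g.split("/"))
major = allele_freq.most_common(1)[0][0]

def non_major_count(genotype, major_allele):
    """0 = homozygous major, 1 = one non-major allele, 2 = no major allele."""
    return sum(1 for a in genotype.split("/") if a != major_allele)

codes = {s: non_major_count(g, major) for s, g in sorted(genotypes.items())}
print(major, codes)
```

With this coding, C and A are indistinguishable in the regression, so it only makes sense if you are willing to assume they act alike.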
One way of dealing with epistasis could be to multiply the two SNPs' values and divide by 2 (to stay in the 0-2 range). I don't know a tool that can do this, but statistically it should be valid (assuming linear interaction and additive effects).
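A small sketch of that interaction term, assuming both SNPs are already coded 0/1/2; `interaction_term` is a made-up helper name, and the result would be added as one more column next to the two SNP columns.

```python
from itertools import product

def interaction_term(snp1, snp2):
    """Product of two 0/1/2 codes, halved so the result stays in 0..2."""
    return snp1 * snp2 / 2

# All nine genotype-code combinations for two hypothetical SNPs.
table = {(a, b): interaction_term(a, b) for a, b in product(range(3), repeat=2)}
for (a, b), value in table.items():
    print(a, b, value)
```

The term is zero whenever either SNP has no minor alleles, and reaches 2 only when both are homozygous minor.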
Hi, I'm not so clear about your answer, could you please explain a bit more? Thank you
Ah, it's clear to me now, thank you