Consider some toy data that describes two SNPs, sex, and a disease state for some individuals
The SNPs are represented as the alternate allele dosage that ranges from 0 to 2 since they imputed.
I want to test the association between the SNPs and the disease with sex as a covariate.
data <- data.frame("snp1"=c(runif(n=50, min=0, max=.2),
runif(n=50, min=1.5, max=2),
runif(n=50, min=0, max=.2),
runif(n=50, min=1.5, max=2)),
"snp2"=c(runif(n=50, min=0, max=.2),
runif(n=50, min=0, max=.2),
runif(n=50, min=1.5, max=2),
runif(n=50, min=1.5, max=2)),
"sex"=rbinom(200, 1, 0.5),
"disease"=c(rbinom(150, 1, 0.1),
rbinom(50, 1, 0.9))
)
I think I understand that I could do a single SNP association test like this:
single_snp_test <- glm(disease ~ snp1 + sex, data=data, family="binomial")
Lest say that instead of a single snp test I wanted to use these dosage data to test the association of a haplotype that contains both snp1
AND snp2
. How would I go about doing this haplotype-based type of association test? in psudocode something like this:
snp1_and_snp2_haplotye_test <- glm(disease ~ (snp1 AND snp2) + sex, data=data, family="binomial")
I guess you have to manually combine snp1+snp2 sample into snp by dplyr mutate.
This is really getting at the part that I am confused about and I can't find any examples of this in the literature.
Since dosages are the probability of the alternate allele at a given site, would you really just add them up? Ignoring linkage disequilibrium (which I am thinking you probably can't do), I would think the probability of getting a haplotype that has snp1 and snp2 would be the product of the dosages for snp1 and snp2. Could you clarify why add them?
I don't think they're added per se. snp1+snp2 is combinatorical notation in model expressions. You're looking for all pairs of each factor. It will result in a dataset bigger than your computer's RAM.
It is not a genome wide scan, I am just testing a few pre-selected snps that define a few haplotypes.
I am not sure exactly sure how to incorporate this into my analysis then and whether it would be a change to the glm formula or change to how the data are wrangled before the regression (as pappu seemingly suggested). from my understanding just putting snp1 + snp2 like this:
snp1_and_snp2_haplotye_test<- glm(disease ~ snp1 + snp2, data=data, family="binomial")
would just add snp2 as an additional predictor variable and give me a p value for snp1 and snp2 separately. I want to know what the probability of getting disease is when both snp1 and snp2 are present (based on the dosage data as input).
So far the only way I can see to get one p value for snp1 and snp2 together would be to set up an interaction term:
snp1_and_snp2_haplotye_test <- glm(disease ~ snp1*snp2, data=data, family="binomial")
Although this gives me a p value for
snp1:snp2
, I am really unsure if this actually measures the outcome of disease given the presence of snp1 and snp2, which is what I really want to know.Any help is greatly appreciate, thank you for everything so far
You should be good with snp1*snp2 since it will do each term separately and the interaction. Interpreting all the various results may be tricky, so it could be easier to invent snp3 as the concatenation or paste() of each snp.