Hi Sven,
Seems like a very interesting project. I would do the following:
- prune your SNP dataset based on linkage disequilibrium (LD) so that
you are only looking at the most informative SNPs and also to reduce
your variable load (OPTIONAL)
- for each methylation region, take SNPs within a defined window
surrounding the region and test each independently
- for each methylation region, take the statistically significant SNPs
and put those in the final model
- reduce the final model further through stepwise regression (OPTIONAL)
- test the final reduced model's robustness via r-squared shrinkage,
ROC analysis, and cross-validation
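As a rough illustration of the windowing step, here is a minimal sketch in base R. The data frame `snps`, the `region` list, the helper `snps_in_window`, and the 100 kb window are all made up for the example; use whatever annotation and window size suit your data.

```r
# Hypothetical SNP annotation: one row per SNP with chromosome and position.
snps <- data.frame(
  id  = c("rs1", "rs2", "rs3", "rs4", "rs5"),
  chr = c("chr1", "chr1", "chr1", "chr2", "chr1"),
  pos = c(10000, 48000, 120000, 50000, 700000),
  stringsAsFactors = FALSE
)

# Hypothetical methylation region (coordinates are invented).
region <- list(chr = "chr1", start = 50000, end = 52000)

# Return the IDs of SNPs on the same chromosome that lie within
# `window` base pairs of either edge of the region.
snps_in_window <- function(snps, region, window = 100000) {
  keep <- snps$chr == region$chr &
    snps$pos >= region$start - window &
    snps$pos <= region$end + window
  snps$id[keep]
}

candidates <- snps_in_window(snps, region, window = 100000)
print(candidates)
```

The SNPs returned for each region are the ones you would then test one at a time.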
----------------------------------------------
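In step 2, when I say 'test each independently', I mean fitting one model per SNP (the backticks are needed because meth% is not a syntactically valid R name):
glm(`meth%` ~ SNP1)
glm(`meth%` ~ SNP2)
glm(`meth%` ~ SNP3)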
et cetera
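The per-SNP tests can be run in a loop rather than typed out. The sketch below uses simulated data, so the genotypes, effect sizes, and the column name `meth_pct` (standing in for meth%) are assumptions for illustration only. The multiple-testing adjustment with `p.adjust` is my addition and not something you strictly have to do at this stage.

```r
set.seed(1)
n <- 200

# Simulated genotypes coded as minor-allele counts (0, 1, 2).
geno <- data.frame(
  SNP1 = rbinom(n, 2, 0.3),
  SNP2 = rbinom(n, 2, 0.4),
  SNP3 = rbinom(n, 2, 0.2)
)

# Simulated methylation percentage; only SNP2 has a real effect here.
geno$meth_pct <- 50 + 8 * geno$SNP2 + rnorm(n, sd = 5)

snp_names <- c("SNP1", "SNP2", "SNP3")

# Fit one single-SNP model per SNP and extract that SNP's p-value.
pvals <- sapply(snp_names, function(snp) {
  fit <- glm(reformulate(snp, response = "meth_pct"), data = geno)
  coef(summary(fit))[snp, "Pr(>|t|)"]
})

# Adjust for the number of SNPs tested in this region (Benjamini-Hochberg).
padj <- p.adjust(pvals, method = "BH")
significant <- names(padj)[padj < 0.05]
print(padj)
```

The SNPs left in `significant` are the candidates for the combined model.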
In step 3, if SNP2, SNP3, SNP8, and SNP9 were your statistically significant SNPs, then the final model would be:
final <- glm(`meth%` ~ SNP2 + SNP3 + SNP8 + SNP9)
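To show the combined model and the optional stepwise reduction together, here is a small sketch on simulated data. The data frame `dat`, the column `meth_pct` (standing in for meth%), and the effect sizes are invented; `step()` from base R does backward selection by AIC, which is only one of several ways to reduce a model.

```r
set.seed(2)
n <- 300

# Simulated minor-allele counts for the four SNPs that passed the single-SNP tests.
dat <- data.frame(
  SNP2 = rbinom(n, 2, 0.40),
  SNP3 = rbinom(n, 2, 0.30),
  SNP8 = rbinom(n, 2, 0.25),
  SNP9 = rbinom(n, 2, 0.35)
)

# In this simulation only SNP2 and SNP8 truly affect methylation.
dat$meth_pct <- 40 + 6 * dat$SNP2 - 5 * dat$SNP8 + rnorm(n, sd = 5)

# Combined model with all candidate SNPs.
final <- glm(meth_pct ~ SNP2 + SNP3 + SNP8 + SNP9, data = dat)

# Backward stepwise selection by AIC; trace = 0 silences the step-by-step log.
reduced <- step(final, direction = "backward", trace = 0)

print(formula(reduced))
```

Stepwise selection can overfit, which is why the robustness checks in the last step matter.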
Regarding your SNP encoding, you can have these as:
- continuous variables (counts of minor alleles)
- categorical variables (HomMinor, HomMajor, Het)
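The two encodings lead to different models. A minimal sketch, with simulated genotypes and made-up variable names (`minor_count`, `genotype`, `meth_pct`): the numeric coding assumes an additive effect per minor allele, while the factor coding estimates a separate effect for Het and HomMinor relative to HomMajor.

```r
set.seed(3)
n <- 150

# Simulated minor-allele counts and a methylation percentage that depends on them.
minor_count <- rbinom(n, 2, 0.35)
meth_pct <- 45 + 5 * minor_count + rnorm(n, sd = 4)

# Continuous (additive) encoding: a single slope per copy of the minor allele.
fit_additive <- glm(meth_pct ~ minor_count)

# Categorical (genotypic) encoding: HomMajor is the reference level.
genotype <- factor(minor_count, levels = c(0, 1, 2),
                   labels = c("HomMajor", "Het", "HomMinor"))
fit_genotypic <- glm(meth_pct ~ genotype)

print(coef(fit_additive))
print(coef(fit_genotypic))
```

The categorical model costs one extra parameter per SNP, so it needs enough samples in the rarest genotype group to be worthwhile.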
Regarding your outcome, you can equally encode this as continuous or categorical.
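How you encode the outcome decides which model family you fit. The sketch below is on simulated data; the 40% cutoff used to define `meth_high` is purely hypothetical, and `snp` and `meth_pct` are invented names. A continuous outcome goes into a Gaussian model, and a dichotomised one into a logistic (binomial) model, which is also the setting where ROC analysis applies.

```r
set.seed(4)
n <- 250

# Simulated minor-allele counts and a methylation percentage bounded to [0, 100].
snp <- rbinom(n, 2, 0.3)
meth_pct <- pmin(pmax(30 + 10 * snp + rnorm(n, sd = 8), 0), 100)

# Continuous outcome: ordinary linear model via glm.
fit_cont <- glm(meth_pct ~ snp, family = gaussian)

# Categorical outcome: dichotomise at a hypothetical 40% cutoff, then fit
# a logistic regression.
meth_high <- as.integer(meth_pct > 40)
fit_bin <- glm(meth_high ~ snp, family = binomial)

print(coef(fit_cont))
print(coef(fit_bin))
```

Dichotomising throws away information, so the continuous form is usually the better default unless there is a biologically meaningful threshold.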
Instead of glm, you could also do lasso-penalised regression. You can also build multiple models in various ways and then compare them, as I do here:
I go over more on these things here:
There's a lot of other material on Biostars and elsewhere, too.
Kevin
Is there any particular reason or hypothesis suggesting that SNPs' effects on DNA methylation are local? I would guess that methylation status in a region could well be influenced by variants very far away.
That is true, Vitis.