Hi,
I have about 75 methylation profiles from diseased subjects that differ in several variables, e.g. a continuous variable that indicates disease severity. There are no classical discrete groups. The data stem from Illumina 450k arrays. I have done QC, normalization etc. with minfi, and ended up with a matrix of beta values.
I looked into different ways to assess differential methylation related to these variables. I am, however, unsure what the most appropriate way to tackle this problem would be.
I am concerned about the distribution of the betas, which, from what I remember, makes this problem unsuitable for ordinary linear methods, so I cannot use limma or simply fit a lot of linear models. Or am I mistaken?
beta ~ variable1 + variable2 + (1|subject) (450k times, possibly inappropriate?)
An alternative would be to dichotomize the data (1/0), e.g. by calling every beta below 0.5 'unmethylated' and everything above 'methylated'. I could then use (a lot of) logistic regressions to check for variables related to methylation. Still, this approach would lose me quite a lot of detail.
Methylated(0,1) ~ variable1 + variable2 + (1|subject) (450k times, takes a long time)
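To make the 1/0 idea concrete, here is a minimal sketch of the thresholding step only (not the regression). The probe names and beta values are made-up toy data, and treating a beta of exactly 0.5 as 'unmethylated' is an assumption, since the cutoff rule above does not say which side it falls on.

```python
# Hypothetical toy data: one list of beta values (one per sample) for each probe.
betas = {
    "cg_a": [0.10, 0.45, 0.80, 0.55],
    "cg_b": [0.92, 0.88, 0.49, 0.51],
}

def dichotomize(values, cutoff=0.5):
    """Return 1 ('methylated') for betas above the cutoff, else 0 ('unmethylated').

    Assumption: a beta exactly equal to the cutoff counts as unmethylated.
    """
    return [1 if v > cutoff else 0 for v in values]

# One 0/1 vector per probe, ready to be used as the response of a logistic model.
calls = {probe: dichotomize(vals) for probe, vals in betas.items()}
```

In this toy example 0.92 and 0.51 both end up as 1, which is exactly the loss of detail I am worried about.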
What do you think?
Many thanks!
Many thanks for your input. Limiting the number of probes is probably very smart, and I will definitely do so. It also helps a lot with the Bonferroni correction. So you would go for a logistic regression, but with three levels? I am using minfi at the moment to read in the data and get betas, and take it from there with handmade code.
Yes, a three-level logistic regression on a subset of highly variable probes seems sensible, although I've never done it as formally as all that. My approach has generally been to:
1. Sort loci by decreasing between-samples variance
2. Select the most-variant 5, 10, 25% of loci
3. Plot a heatmap with hierarchical clustering applied to both the loci and the samples
4. Compare several heatmaps and take note of which associations between clusters of samples and clinical features persist across multiple resolutions.
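Steps 1 and 2 can be sketched roughly as follows. This is only an illustration on made-up toy data, not the code I actually use: the function name `top_variable_loci` and the probe names are hypothetical, and the heatmap and clustering steps are left out.

```python
import math
from statistics import variance

# Hypothetical toy data: one list of beta values (one per sample) for each probe.
betas = {
    "cg_a": [0.1, 0.9, 0.2, 0.8],
    "cg_b": [0.5, 0.5, 0.5, 0.5],
    "cg_c": [0.4, 0.6, 0.4, 0.6],
    "cg_d": [0.3, 0.7, 0.3, 0.7],
}

def top_variable_loci(betas, fraction):
    """Return probe names in the top `fraction`, sorted by decreasing
    between-samples variance (sample variance across the columns)."""
    ranked = sorted(betas, key=lambda probe: variance(betas[probe]), reverse=True)
    n_keep = max(1, math.ceil(len(ranked) * fraction))  # always keep at least one probe
    return ranked[:n_keep]

# Several resolutions, so the resulting groupings can be compared across subsets.
subsets = {frac: top_variable_loci(betas, frac) for frac in (0.25, 0.5)}
```

Each entry of `subsets` would then go into its own heatmap, so you can see which sample groupings survive when the fraction changes.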
In the case of continuous clinical variables, it can be tricky because the samples sometimes cluster quite differently with slight changes in the number of probes you consider. My experience has been that if you take enough subsets of the most-variant loci you'll soon see consistent groupings of at least most of your samples - but choosing to focus on a particular subset of probes (e.g. the 17% most-variant loci) because they associate nicely with your clinical variable(s) is obviously problematic.
Another thing you may want to consider is to exclude loci that show high methylation in your normal samples - not sure how appropriate that would be for your research questions, but I have mostly used methylation data to look at tumors from a population showing unusual etiology, and it has made finding associations easier.
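A rough sketch of that filter, assuming you have normal samples and know which columns they are. The 0.8 cutoff on the mean beta in normals, the function name and the toy data are all assumptions made for illustration, not values I would recommend as-is.

```python
from statistics import mean

# Hypothetical toy data: columns 0-1 are normal samples, columns 2-3 are tumors.
betas = {
    "cg_a": [0.90, 0.95, 0.40, 0.30],  # highly methylated in normals
    "cg_b": [0.10, 0.20, 0.70, 0.85],
    "cg_c": [0.50, 0.60, 0.55, 0.65],
}
normal_columns = [0, 1]

def drop_high_in_normals(betas, normal_columns, max_normal_mean=0.8):
    """Keep only probes whose mean beta across the normal samples is at or
    below `max_normal_mean` (an arbitrary, assumed cutoff)."""
    return {
        probe: values
        for probe, values in betas.items()
        if mean(values[i] for i in normal_columns) <= max_normal_mean
    }

filtered = drop_high_in_normals(betas, normal_columns)
```

Here `cg_a` is dropped because it is already highly methylated in the normals, and the remaining probes go on to the variance ranking.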