Question

DNA Methylation - Which Measure Of Central Tendency For A DMR

12.5 years ago

scottwilliamrobinson ▴ 130

I am writing a script to look at differentially methylated regions (DMR) and notice that while a lot of papers seem to use the mean of the CpGs as a measure of central tendency for a DMR, the default for the R package 'IMA' is the median. I was wondering which one would be best (or geometric mean or whatever)?

I am working with Illumina Beadchip data, my samples are hypertensive human patients and I am using beta value rather than M-value.

score 3 · Answer 1 · 2013-03-04

This is a good question and one, I think, with no one correct answer.

You're correct that the default option for the IMA function indexregionfunc is median; it also offers two other options:

For each specific region of a gene, IMA will collect the loci within it and derive an index of overall region methylation value. Currently, there are three different index metrics implemented in IMA: mean, median, and Tukey's Biweight robust average. By default, the median beta values will be used as the region's methylation index for further analysis.
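To make the three index metrics concrete, here is a minimal Python sketch that computes each one for the beta values of a single region. The function name `tukey_biweight`, the tuning constants `c` and `eps`, and the example beta values are all assumptions made for illustration. The biweight shown is a simple one-step version, so IMA's own implementation may well differ in detail.

```python
from statistics import mean, median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight average.

    c and eps are illustrative tuning constants (assumptions),
    not values taken from IMA.
    """
    m = median(values)
    # Median absolute deviation from the median, used as a robust scale.
    mad = median([abs(v - m) for v in values])
    weights = []
    for v in values:
        u = (v - m) / (c * mad + eps)
        # Points far from the median get zero weight.
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical beta values for the CpGs in one region (one sample).
region_betas = [0.82, 0.85, 0.79, 0.88, 0.15]

print("mean:    ", round(mean(region_betas), 3))
print("median:  ", round(median(region_betas), 3))
print("biweight:", round(tukey_biweight(region_betas), 3))
```

With one discordant CpG in the region, the mean is pulled towards it, while the median and the biweight stay close to the bulk of the probes.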

The problem that we want to address is: how best to summarize the measurements from methylation array probes that are associated with a transcript into a single value, indicative of a DMR. This is a rather different problem from the one posed by other kinds of array. For example, to summarize exon expression probesets to a transcript, we might take the median RMA value of core probesets. With the methylation array we have probes located in different types of region (CpG island, shore, shelf, in-gene), and the genomic annotation is frequently less well characterized.

To be frank, I suspect that many papers use the mean of CpG-associated probes because the authors are biologists who have not given much thought to the statistics. The mean is certainly a way to summarize multiple probes, but is it meaningful? Bear in mind that beta-values are strongly bimodal in distribution whereas the mean describes one feature of a normal distribution. Likewise, some people probably use the median because of a vague notion that it is "better than the mean" - but again, only in the context of a normal distribution.

I've also seen people choose, for transcripts with multiple methylation probes, the probe with the highest variance. Or the probe with the lowest p-value after analysing differences between 2 conditions. Or take moving averages across N bases upstream of the gene, where N is anything from 500-2000 bp. In summary: I don't think anyone yet has a good handle on how to summarize methylation probe values to DMRs and what people are doing is justifying essentially arbitrary decisions.
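As an illustration of one of these ad hoc choices, the sketch below picks, for a single transcript, the probe whose beta values vary most across samples. The probe names, the beta values and the function name `most_variable_probe` are hypothetical. This shows what such a selection rule does, and is not a recommendation.

```python
from statistics import variance

def most_variable_probe(probe_betas):
    """Return the probe ID with the highest sample variance of beta values."""
    return max(probe_betas, key=lambda probe: variance(probe_betas[probe]))

# Hypothetical beta values: probe ID -> one value per sample.
probe_betas = {
    "probe_a": [0.80, 0.82, 0.81, 0.79],
    "probe_b": [0.20, 0.60, 0.35, 0.75],
    "probe_c": [0.50, 0.55, 0.45, 0.50],
}

chosen = most_variable_probe(probe_betas)
print(chosen, probe_betas[chosen])
```

Note that the rule discards every other probe for the transcript, which is part of why the choice feels arbitrary.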

score 0 · Answer 2 · 2013-03-04

Bear in mind that beta-values are strongly bimodal in distribution whereas the mean describes one feature of a normal distribution.

The distribution for a large number of pooled sites is bimodal, but for individual sites/CpGs it is apparently logistic. I have therefore been looking at my sites with a logit transformation, and was planning to assume that my regions follow the same pattern. For some regions the distribution may be somewhere between logistic and bimodal if there is differential methylation within a group, but in those cases I am less likely to find significant differential methylation between groups anyway because of the high SD, so I shouldn't get false positives, although it might produce some false negatives. Do you think this might be the best approach, though?
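For reference, the logit transformation mentioned above can be sketched as follows. The base-2 logit of a beta value is the usual M-value, M = log2(beta / (1 - beta)). The function names, the clamping constant `eps` and the example values are assumptions for illustration only. Averaging on the logit scale and transforming back is just one possible region score.

```python
import math
from statistics import mean

def beta_to_m(beta, eps=1e-6):
    """Base-2 logit of a beta value; eps (an assumption) keeps 0 and 1 finite."""
    b = min(max(beta, eps), 1 - eps)
    return math.log2(b / (1 - b))

def m_to_beta(m):
    """Inverse of the base-2 logit."""
    return 2 ** m / (2 ** m + 1)

def region_score_logit(betas):
    """Average the CpGs of a region on the logit scale, then map back to a beta."""
    return m_to_beta(mean([beta_to_m(b) for b in betas]))

# Hypothetical beta values for the CpGs in one region.
region_betas = [0.70, 0.85, 0.90, 0.95]

print("mean of betas:        ", round(mean(region_betas), 3))
print("back-transformed mean:", round(region_score_logit(region_betas), 3))
```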

(1) Back to the original question, though: since we have all the observations (CpGs) in a particular region, maybe it is best to think of the average as simply a score rather than an estimate of a population parameter. In that case I would have thought the mean more appropriate, since it takes 'full account' of the observations. That still isn't to say that the mean would work best as a score in practice.

(2) I suppose the 'safe bet' would be the geometric mean since it is conceptually somewhere in between the arithmetic mean and the median.

(3) Lastly, is there anything to be said for trying them all and seeing which gives the most differentially methylated regions, or the highest percentage which 'look right'? I realise this might be seen as circular, but it's not entirely different to using a 'training data set'.

Which of these three would seem most reasonable or are they all fairly equally flawed/arbitrary?
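As a quick sanity check on point (2), the sketch below compares the three summaries on a small set of made-up beta values. For positive values the geometric mean can never exceed the arithmetic mean, so it need not land between the mean and the median. The example values are hypothetical.

```python
from statistics import geometric_mean, mean, median

# Hypothetical beta values for the CpGs in one region.
region_betas = [0.1, 0.8, 0.9]

am = mean(region_betas)
med = median(region_betas)
gm = geometric_mean(region_betas)

print("arithmetic mean:", round(am, 3))
print("median:         ", round(med, 3))
print("geometric mean: ", round(gm, 3))
```

Here the geometric mean falls below both of the other two summaries, because it is dominated by the single low value.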