Question

Which statistical test for CpG DMRs?

2

6.4 years ago

alorsonmethyle ▴ 50

Hi to all,

I'm relatively new to bioinformatics, and I was very interested in a remark Devon made in the following thread about NOT using Fisher's test for DMRs (Question: Dmr (Differentially Methylated Regions) Identification Software):

There's no reason to expect methylation metrics at a given site to follow a hypergeometric distribution across biological replicates. This is effectively the same reason we use negative binomial distributions with RNAseq count data rather than just sequencing a single sample and assuming a Poisson distribution. While I don't think anyone has really nailed the perfect way to analyze bisulfite sequencing data, bsseq is a good step in the right direction.

I regularly use MethylKit to assess DMRs, and it says that it will choose the most appropriate test. Regardless of that, I would like to understand more about the distribution of methylation. As, again, I'm not a specialist in statistics, I wonder about the following:

Is it now widely accepted that Fisher's test is not appropriate for testing DMRs / DMCs?
I guess the reason Devon gave for the distribution not being hypergeometric is not that methylation is a 1/0 outcome, but that there is no reason for a CpG following a first one to be less methylated? (I quote Wikipedia below.)

The following conditions characterize the hypergeometric distribution:

The result of each draw (the elements of the population being sampled) can be classified into one of two mutually exclusive categories (e.g. Pass/Fail or Employed/Unemployed).

The probability of a success changes on each draw, as each draw decreases the population (sampling without replacement from a finite population).

If that is the reason, I guess it should be sourced or at least supported by the literature. As a first argument for why it could be understood as a hypergeometric distribution: could an allosteric interaction between two spatially close CpGs lower the probability of having a methyl group on a C? (I acknowledge it is a specific case, although it seems important to take it into account in CGIs?) So far I haven't found any paper, but I would be very happy if you had one about the distribution.
I thought (but again, keep in mind my lack of statistics) that a Fisher's test could also be designed for independent events? If it is a hypergeometric distribution, it doesn't sound fully independent to me?
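To make the hypergeometric link concrete, here is a minimal sketch of a two-sided Fisher's exact test built directly from the hypergeometric probabilities, applied to read counts pooled over replicates. The read counts and the function names are hypothetical, chosen only for illustration. It shows that the test sees only the pooled 2x2 table, so replicates that agree and replicates that disagree give exactly the same p-value.

```python
from math import comb

def hypergeom_pmf(k, total, successes, draws):
    """P(k successes in `draws` draws without replacement from `total` items)."""
    return comb(successes, k) * comb(total - successes, draws - k) / comb(total, draws)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher p-value for the table [[a, b], [c, d]].

    Rows are groups, columns are methylated / unmethylated read counts.
    """
    total = a + b + c + d      # all reads
    successes = a + c          # all methylated reads
    draws = a + b              # all reads from group 1
    p_obs = hypergeom_pmf(a, total, successes, draws)
    lowest = max(0, draws - (total - successes))
    highest = min(successes, draws)
    p_value = 0.0
    for k in range(lowest, highest + 1):
        p_k = hypergeom_pmf(k, total, successes, draws)
        # Sum every table that is at most as likely as the observed one.
        if p_k <= p_obs * (1 + 1e-9):
            p_value += p_k
    return p_value

def pooled_p_value(group1, group2):
    """Pool (methylated, unmethylated) counts over replicates, then test."""
    a = sum(m for m, _ in group1)
    b = sum(u for _, u in group1)
    c = sum(m for m, _ in group2)
    d = sum(u for _, u in group2)
    return fisher_exact_two_sided(a, b, c, d)

# Hypothetical counts at one CpG, three replicates per group.
# Replicates agree within each group:
p_consistent = pooled_p_value([(15, 5), (15, 5), (15, 5)],
                              [(8, 12), (8, 12), (8, 12)])
# Same pooled totals, but the replicates disagree strongly:
p_inconsistent = pooled_p_value([(20, 0), (20, 0), (5, 15)],
                                [(0, 20), (4, 16), (20, 0)])

print(p_consistent, p_inconsistent)
```

Both calls print the same small p-value, because the variation between replicates never enters the calculation.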

Sorry for those questions for dummies :)

Best,

methylation statistical test epigenetics DMR • 2.5k views

0

Thanks for your answers, it's really interesting :)

I'll have a look at metilene.

It would actually be pretty interesting to see, on the same samples, how the results differ between the various software packages/libraries available. Does that exist already?

6.4 years ago by alorsonmethyle ▴ 50

0

Usually that's part of a figure that each software package will have in the paper describing it. Please take those with a large grain of salt though, since a paper on X will always show that X is better on the datasets presented. In RNA-seq there have been a LOT of comparison papers, but that trend hasn't carried over to BS-seq :( I guess I can't really complain though, since I haven't bothered to write such a comparison paper either.

6.4 years ago by Devon Ryan 105k

score 4 · Answer 1 · 2018-08-23

Since you linked to a comment from me I suppose I should respond :) To your bullet points in order:

1. Yes. In fact I think this was always the case; it's just that early on sequencing costs were so absurdly high that most people sequenced single samples and called it done.
2. No, the reason is more that a hypergeometric distribution doesn't model biological variance. The most appropriate distribution is a beta-binomial and you'll find a few packages using that (I think bsseq uses that). The biggest issue with a beta-binomial distribution is that it's really hard to fit unless you have decently high sequencing depth. A number of programs (e.g., Metilene) instead use segmentation-based methods to avoid that. In practice these work pretty well (we teach Metilene in our Galaxy WGBS training sessions).
3. See above. Note also that a hypergeometric test isn't testing for DMRs, but only for DMCs (a beta-binomial is the same in this regard). Typically people use "bump hunting" and similar methods to go from individual CpGs to DMRs (unless they use the segmentation-based methods like Metilene that I mentioned above).
4. CpG methylation is actually not independent; there's a strong local correlation in methylation levels (many DMR calling programs will make a "correlogram" to model this).
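As a rough illustration of the biological-variance point, the sketch below simulates methylation fractions at a single CpG across replicates, once with plain binomial sampling and once with a beta-binomial, where each replicate's true methylation level is itself drawn from a Beta distribution. The depth, mean level and overdispersion value `rho` are assumptions picked for the example, and this is not how bsseq, Metilene or any other package is implemented.

```python
import random
from statistics import variance

def simulate_fractions(n_reps, depth, mean, rho, rng):
    """Simulate methylated fractions at one CpG across replicates.

    rho = 0 gives binomial sampling around a fixed level `mean`.
    rho > 0 gives a beta-binomial: each replicate gets its own true level,
    drawn from Beta(alpha, beta) with alpha + beta = 1/rho - 1.
    """
    fractions = []
    for _ in range(n_reps):
        if rho > 0:
            scale = 1.0 / rho - 1.0
            level = rng.betavariate(mean * scale, (1.0 - mean) * scale)
        else:
            level = mean
        # Each read is methylated with probability `level`.
        methylated = sum(rng.random() < level for _ in range(depth))
        fractions.append(methylated / depth)
    return fractions

def theoretical_variance(depth, mean, rho):
    """Variance of the methylated fraction under a beta-binomial model."""
    return mean * (1.0 - mean) / depth * (1.0 + (depth - 1) * rho)

rng = random.Random(42)      # fixed seed so the run is repeatable
DEPTH, MEAN, RHO = 30, 0.7, 0.1   # hypothetical values

binomial_fracs = simulate_fractions(2000, DEPTH, MEAN, 0.0, rng)
betabinom_fracs = simulate_fractions(2000, DEPTH, MEAN, RHO, rng)

var_binomial = variance(binomial_fracs)
var_betabinom = variance(betabinom_fracs)

print("binomial      simulated:", var_binomial,
      "theory:", theoretical_variance(DEPTH, MEAN, 0.0))
print("beta-binomial simulated:", var_betabinom,
      "theory:", theoretical_variance(DEPTH, MEAN, RHO))
```

With these assumed settings the beta-binomial variance is several times the binomial one at the same depth and mean level. A test that assumes only sampling noise, as a pooled Fisher's test does, would ignore that extra spread.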

BTW, your questions are in no way dumb. There are a LOT of tools in this field but not a lot of real best practices or further discussion on the various limitations of many of the methods (if BS-seq were as popular as RNA-seq this probably wouldn't be the case).