Hi to all,
Being relatively new to bioinformatics, I was very interested in a remark Devon made in the thread linked below about NOT using Fisher's test for DMRs: Question: Dmr (Differentially Methylated Regions) Identification Software
There's no reason to expect methylation metrics at a given site to follow a hypergeometric distribution across biological replicates. This is effectively the same reason we use negative binomial distributions with RNAseq count data rather than just sequencing a single sample and assuming a Poisson distribution. While I don't think anyone has really nailed the perfect way to analyze bisulfite sequencing data, bsseq is a good step in the right direction.
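To make the overdispersion argument in the quote concrete, here is a minimal simulation sketch in Python (standard library only). It assumes that each biological replicate has its own true methylation level at the site, drawn from a Beta distribution, and compares the variance of the methylated read counts with what pure binomial sampling would give. The coverage, mean level, and `precision` values are hypothetical and chosen only for illustration, and the function name is made up here, not taken from methylKit or bsseq.

```python
import random
import statistics


def simulate_methylated_counts(rng, n_replicates, coverage, mean_level, precision=None):
    """Simulate methylated read counts at one CpG across replicates.

    precision=None: every replicate shares the same methylation level,
    so counts are purely binomial.
    precision=<float>: each replicate's level is drawn from
    Beta(mean_level * precision, (1 - mean_level) * precision), which gives
    beta-binomial counts; a smaller precision means more biological variability.
    """
    counts = []
    for _ in range(n_replicates):
        if precision is None:
            level = mean_level
        else:
            level = rng.betavariate(mean_level * precision,
                                    (1 - mean_level) * precision)
        # Each read is methylated with probability `level`.
        counts.append(sum(rng.random() < level for _ in range(coverage)))
    return counts


rng = random.Random(1)
coverage, mean_level = 30, 0.5   # hypothetical coverage and methylation level
n_replicates = 5000              # unrealistically many, just to estimate variances

binomial_counts = simulate_methylated_counts(rng, n_replicates, coverage, mean_level)
overdispersed_counts = simulate_methylated_counts(
    rng, n_replicates, coverage, mean_level, precision=5.0)

binomial_theory = coverage * mean_level * (1 - mean_level)
var_binomial = statistics.pvariance(binomial_counts)
var_overdispersed = statistics.pvariance(overdispersed_counts)

print("binomial variance (theory):      ", binomial_theory)
print("variance without biological var.:", round(var_binomial, 2))
print("variance with biological var.:   ", round(var_overdispersed, 2))
```

Under these assumed settings the variance with replicate-to-replicate variability should come out several times larger than the binomial value. That is the same kind of extra variance that motivates a negative binomial rather than a Poisson model for RNA-seq counts.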
I regularly use methylKit to assess DMRs, and it says that it will choose the most appropriate test. Regardless of that, I would like to understand more about the distribution of methylation. As I'm (again) not a specialist in statistics, I have a few questions:
- Is it now widely accepted that Fisher's test is not appropriate for testing DMRs / DMCs?
- The reason Devon gave, that the distribution is not hypergeometric, is I guess not because the outcome is 1/0, but because there's no reason for a CpG following a first one to be less methylated? (I quote Wikipedia below.)
The following conditions characterize the hypergeometric distribution:
The result of each draw (the elements of the population being sampled) can be classified into one of two mutually exclusive categories (e.g. Pass/Fail or Employed/Unemployed).
The probability of a success changes on each draw, as each draw decreases the population (sampling without replacement from a finite population).
- If that is the reason, it should be sourced, or at least supported by the literature, I guess. As a first argument for why it could be understood as a hypergeometric distribution: could an allosteric interaction between two spatially close CpGs lower the probability of having a methyl group on a C? (I acknowledge this is a specific case, although it seems important to take into account in CGIs.) So far I haven't found any paper, but I would be very happy if you had one about the distribution.
- I thought (but again, keep in mind my lack of statistics) that Fisher's test could also be designed for independent events? If it is a hypergeometric distribution, it doesn't sound fully independent to me?
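For what it's worth, a sketch of where the hypergeometric distribution enters Fisher's exact test may help with the questions above. In the usual 2x2 table for one CpG (two samples by methylated/unmethylated reads), the row and column totals are treated as fixed, and the count in one cell then follows a hypergeometric distribution under the null hypothesis. So the "sampling without replacement" refers to reads within that table, not to neighbouring CpGs. The read counts below are hypothetical, and the helper names are made up for this example rather than taken from any package.

```python
import math


def hypergeom_pmf(k, row1, row2, col1):
    """P(top-left cell == k) for a 2x2 table with fixed margins.

    row1, row2: total reads in sample 1 and sample 2.
    col1: total methylated reads across both samples.
    """
    return (math.comb(row1, k) * math.comb(row2, col1 - k)
            / math.comb(row1 + row2, col1))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [hypergeom_pmf(k, row1, row2, col1) for k in range(lo, hi + 1)]
    p_obs = hypergeom_pmf(a, row1, row2, col1)
    # Small relative tolerance so ties with the observed table are included.
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-7)))


# Hypothetical counts at one CpG:
#            methylated  unmethylated
# sample 1       18           12
# sample 2        9           21
p_value = fisher_exact_two_sided(18, 12, 9, 21)
print("two-sided p-value:", p_value)
```

Note that this treats every read as an independent draw with one shared methylation probability per sample. Pooling reads from several biological replicates into one row breaks that assumption, which is, as far as I understand it, the point of the remark quoted at the top.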
Sorry for those questions for dummies :)
Best,
Thanks for your answers, it's really interesting :)
I'll have a look at metilene.
It would actually be pretty interesting to see, on the same samples, how the results differ between the available software packages/libraries. Does that exist already?
Usually that's part of a figure that each software package will have in the paper describing it. Please take those with a large grain of salt though, since a paper on X will always show that X is better on the datasets presented. In RNA-seq there have been a LOT of comparison papers, but that trend hasn't carried over to BS-seq :( I guess I can't really complain though, since I haven't bothered to write such a comparison paper either.