Hi everyone,
I've recently been analyzing DNA methylation data and have run into a problem:
As we know, the distribution of DNA methylation can vary across genomic features (core promoters, enhancers, CpG islands, etc.). I want to measure the distribution bias among these genomic features.
In other words, I want to quantify the deviation between the expected and observed numbers of DNA methylation sites.
I read some papers and found various methods used for this kind of analysis, for example the independent t-test, Chi-square test, Mann-Whitney U test, and permutation test, which left me confused about which one to choose.
I have tried the independent t-test and calculated ratio = log2(mean of observed / mean of expected) for plotting a heatmap (in this result, if the ratio > 0, I say DNA methylation occurs more often in that region, and vice versa). However, someone told me that the Chi-square test may be better at measuring the difference between observed and expected, so I tried that too. The problem is that I only get a chi-square statistic for each genomic feature, and these values vary enormously (from about 300 to 40,000,000), which makes them difficult to visualize.
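For context, here is roughly what I'm doing, as a minimal sketch; the feature names and counts are made up, and the per-feature chi-square compares in-feature vs. out-of-feature site counts against expectation:

```python
import numpy as np
from scipy.stats import chisquare

# Observed methylation-site counts per genomic feature, and the counts
# expected if sites were distributed proportionally (all values made up).
features = ["core_promoter", "enhancer", "CpG_island"]
observed = np.array([120_000.0, 45_000.0, 300_000.0])
expected = np.array([80_000.0, 60_000.0, 150_000.0])

# log2 enrichment ratio used as the heatmap value:
# > 0 means methylation occurs more often than expected in that feature.
log2_ratio = np.log2(observed / expected)

# Per-feature chi-square goodness of fit on (in-feature, out-of-feature)
# counts; n_total is the genome-wide total number of methylation sites.
n_total = 5_000_000.0
for name, obs, exp, ratio in zip(features, observed, expected, log2_ratio):
    stat, p = chisquare([obs, n_total - obs], f_exp=[exp, n_total - exp])
    print(f"{name}: log2(obs/exp) = {ratio:+.2f}, chi2 = {stat:.1f}, p = {p:.3g}")
```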
So, I have several questions:
- Which method do you think is better for this kind of problem?
- If the Chi-square test is used, how should I handle the chi-square statistic for visualization (normalize the chi-square of each region against a random region?)? One idea I've tried is converting it to an effect size; see the first sketch after this list.
- I noticed the p-values are typically tiny (around 1e-100 is common, and sometimes even exactly 0 is reported). I looked at some answers on statistical testing with very large datasets and found no clear conclusion. So, when you run a statistical test on a very large dataset (sample sizes on the order of 10^6 are usual in bioinformatics), how do you handle the extremely small p-values? My current workaround is in the second sketch below.
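For question 2, one option I've been experimenting with (instead of normalizing against a random region) is converting each chi-square statistic into Cohen's w effect size, which divides out the sample size so that features with wildly different chi-square values land on one scale; a minimal sketch, with a made-up total site count:

```python
import numpy as np

def cohens_w(chi2_stat, n_total):
    # Effect size for a chi-square test: w = sqrt(chi2 / n).
    # Unlike the raw statistic, w does not grow with the number of sites,
    # so it is comparable across features and usable as a heatmap value.
    return np.sqrt(chi2_stat / n_total)

n = 5_000_000  # made-up genome-wide total of methylation sites
for stat in (300, 40_000_000):
    print(f"chi2 = {stat:>10,}: w = {cohens_w(stat, n):.4f}")
```

Alternatively, I keep the signed log2(observed/expected) ratio as the heatmap value and use the test only to flag which features are significant.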
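For question 3, my workaround so far is to compute the p-value in log space so it never underflows to exactly 0, and to plot a (possibly capped) -log10(p). This sketch assumes df = 1 (a two-category in/out comparison) and uses the identity sf_chi2(x; df=1) = 2·Φ(−√x) together with scipy.special.log_ndtr, which stays accurate far beyond double-precision limits; the helper name is my own:

```python
import numpy as np
from scipy.special import log_ndtr

def neg_log10_p_chi2_df1(stat):
    # -log10 of the p-value for a df=1 chi-square statistic, without
    # underflow: log(p) = log(2) + log(Phi(-sqrt(stat))), computed directly.
    log_p = np.log(2.0) + log_ndtr(-np.sqrt(stat))  # natural log of p
    return -log_p / np.log(10.0)

print(neg_log10_p_chi2_df1(300.0))         # ~66.5
print(neg_log10_p_chi2_df1(40_000_000.0))  # ~8.7e6, reported as "p = 0" naively

# For plotting, cap the values so one extreme feature doesn't flatten the rest:
cap = 300.0
plot_value = min(neg_log10_p_chi2_df1(40_000_000.0), cap)
```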
Thanks for your time, really appreciate any answers!