Dear all,
I have a more general question (anchored in genomics and related to ChIP-seq) about which statistical tests can show the specificity of a phenomenon.
Let's consider an example: someone did a ChIP-seq for H3K27me3 and wants to show that, after cell treatment, the H3K27me3 mark increases only on the genes involved in autophagy.
What type of analysis would you recommend in order to show that the phenomenon (i.e. the increase in H3K27me3) is specific to a set of genes (i.e. autophagy genes):
A -- taking random sets of non-autophagy genes (practically, the rest of the genes in the genome) -- and using parametric and non-parametric tests when comparing SET 1 (autophagy genes) with SET 2 (non-autophagy genes)
or
B -- using hypergeometric / Fisher's exact tests on a 2x2 matrix (autophagy/non-autophagy genes vs increase/no increase in H3K27me3)?
thanks a lot, and happy weekend ;) !
bogdan
Thank you Jean for your comments. On a side note, I was just thinking, as an alternative to enrichment tests, could someone just use the following procedure :
A-- take the SET of genes with a specific effect (in this case, H3K27me3 increase on autophagy genes)
B-- take a few random SETS of genes
C-- make boxplots of A vs B, and if the difference is statistically significant by a t-test (or a Wilcoxon rank-sum test),
would this support the hypothesis that "the H3K27me3 increase on autophagy genes" is not random ?
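For illustration only, here is a rough sketch of a closely related resampling check, not the t-test/Wilcoxon comparison itself: draw many random sets of non-autophagy genes, each the same size as the autophagy set, and count how often their mean change is at least as large as the observed one. Everything here is an assumption made for the example: the function name `empirical_p_value`, the gene names, and the simulated per-gene changes (standing in for something like a log2 fold change in H3K27me3 signal).

```python
import random
from statistics import mean


def empirical_p_value(changes, gene_set, n_draws=2000, seed=0):
    """Compare the mean change of gene_set with random same-size background sets.

    changes  : dict mapping gene name -> change in signal (hypothetical measure)
    gene_set : genes of interest (here, the autophagy genes)
    Returns (observed mean, one-sided empirical p-value).
    """
    rng = random.Random(seed)
    gene_set = set(gene_set)
    # Background = every gene that is NOT in the set of interest.
    background = sorted(g for g in changes if g not in gene_set)
    observed = mean(changes[g] for g in gene_set)
    k = len(gene_set)
    hits = 0
    for _ in range(n_draws):
        draw = rng.sample(background, k)  # random set of the same size
        if mean(changes[g] for g in draw) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_draws + 1)


# Simulated data, NOT real measurements: 2000 genes, the first 50 play the
# role of autophagy genes and receive an artificial upward shift.
sim = random.Random(1)
genes = [f"gene{i}" for i in range(2000)]
autophagy = genes[:50]
changes = {g: sim.gauss(0.0, 1.0) for g in genes}
for g in autophagy:
    changes[g] += 1.0

observed_mean, p_emp = empirical_p_value(changes, autophagy)
print(f"observed mean change = {observed_mean:.3f}, empirical p = {p_emp:.4f}")
```

Note that a small p-value here only says the autophagy set differs from random background sets of the same size; it does not by itself show that the increase happens only on autophagy genes.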
You're dealing with counts here and the question is about enrichment, so the standard way to answer it is with Fisher's exact test (or the Chi-squared test). If you're looking for a parametric alternative, you could formulate the question in terms of a difference between proportions to make it more obvious: is the fraction of methylated genes in the autophagy set different from the fraction of methylated genes among the other genes? This can be tested in a parametric way using a two-proportion Z-test. However, this is essentially equivalent to the Chi-squared test (the squared Z statistic is the Chi-squared statistic without continuity correction) and approximates Fisher's test, with the added assumption that the binomial distribution can be approximated by a normal distribution. Note that these are actually tests for equality of proportions (i.e. equality is the null hypothesis). What I don't get is why you would take random samples of the genes.
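To make the comparison concrete, here is a minimal sketch using only the Python standard library. It computes a one-sided Fisher's exact test from the hypergeometric distribution, a pooled two-proportion Z-test, and the Pearson Chi-squared statistic without continuity correction, so the relation between the last two can be checked numerically. The function names and the counts in the 2x2 table are made up for the example and are not from any real dataset.

```python
from math import comb, erfc, sqrt


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test (enrichment) for the 2x2 table

                      increase   no increase
        autophagy        a            b
        other genes      c            d

    Returns P(X >= a) under the hypergeometric null distribution.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    # Sum exact integer terms first, divide once at the end.
    tail = sum(comb(col1, x) * comb(n - col1, row1 - x)
               for x in range(a, upper + 1))
    return tail / comb(n, row1)


def two_proportion_z(a, b, c, d):
    """Pooled two-proportion Z-test; returns (z, two-sided p-value)."""
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    pooled = (a + c) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # erfc keeps precision in the far tail of the normal distribution.
    return z, erfc(abs(z) / sqrt(2))


def chi2_statistic(a, b, c, d):
    """Pearson Chi-squared statistic for a 2x2 table, no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical counts: 20 of 100 autophagy genes show an increase,
# versus 2000 of 19900 other genes.
a, b, c, d = 20, 80, 2000, 17900

p_fisher = fisher_exact_greater(a, b, c, d)
z, p_z = two_proportion_z(a, b, c, d)
chi2 = chi2_statistic(a, b, c, d)

print(f"Fisher one-sided p = {p_fisher:.3g}")
print(f"Z = {z:.3f}, two-sided p = {p_z:.3g}")
print(f"Z^2 = {z * z:.6f}, Chi-squared = {chi2:.6f}")
```

In practice one would more likely call an existing implementation (for example `fisher.test` and `prop.test` in R); the point of the sketch is only to show that all three tests work on the same table of counts.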