I am confused about an approach to evaluate enrichment and was wondering if you could help me understand if what I am doing makes sense.
I have counts from regions that overlap histone markers in a dataset and I would like to know if histone markers are enriched in my dataset compared to a random set using a randomization approach.
I have created 1000 similar datasets and counted the number of regions overlapping histones in each of these null datasets. In some cases, this distribution is normal, but in others it is not.
In cases where the distribution of the proportions from the null datasets is normal:
I can use these data to find the mean, standard deviation, and degrees of freedom and compare this distribution (mean, sd and df) to my observed count. Is this correct?
# "sim.null" is a normal distribution of counts of overlaps that I get from 1000 simulations (this is matched on my original dataset for some features)
sim.null= rnorm(sd=0.001, mean=0.01, n=1000)
# I would like to compare it with the counts I get from my dataset
observed = 0.0125
# parentheses around the numerator are needed: "/" binds tighter than "-"
t = (mean(sim.null)-observed) / (sd(sim.null)/sqrt(1000))
2*pt(-abs(t),df=999)
# Is this the same as doing this?
t.test(sim.null, mu=observed, alternative="two.sided")$p.value
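One way to check whether the manual calculation and t.test agree is to compare the two test statistics directly. This is only a sketch: the seed and the names t.manual, t.builtin and same are assumptions added for illustration. In R, "/" is evaluated before "-", so the numerator has to be wrapped in parentheses for the manual version to be a one-sample t statistic.

```r
set.seed(1)  # assumption: fixed seed so the check is reproducible
sim.null <- rnorm(n = 1000, mean = 0.01, sd = 0.001)
observed <- 0.0125

# Manual one-sample t statistic: (sample mean - mu) / standard error of the mean
n <- length(sim.null)
t.manual <- (mean(sim.null) - observed) / (sd(sim.null) / sqrt(n))
p.manual <- 2 * pt(-abs(t.manual), df = n - 1)

# Built-in one-sample t-test against the same mu
res <- t.test(sim.null, mu = observed, alternative = "two.sided")
t.builtin <- unname(res$statistic)

# TRUE if the two statistics agree up to numerical tolerance
same <- isTRUE(all.equal(t.manual, t.builtin))
print(same)
```

The statistics are compared rather than the p-values because the statistic here is very large in magnitude, so both p-values are vanishingly small and comparing them says little.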
In cases where the null dataset is not normal: would maybe a Fisher exact test be appropriate?
Thank you very much, any suggestions are very appreciated!
Actually shouldn't the empirical p-value be:
And not
since we want the number of occurrences more or as extreme as the observed?
My version counts the occurrences, yours doesn't :)
Yes sorry! I meant to write
length(sim.null[sim.null>=observed])
which gives the same answer, and yours is better! Thanks so much for your help!!
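For reference, the counting discussed above could be turned into a one-sided empirical p-value roughly as follows. This is a sketch and not necessarily the exact code from the answer: the seed, the names n.extreme and p.emp, and the add-one correction (a common convention that keeps the p-value from being exactly zero) are all assumptions.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
sim.null <- rnorm(n = 1000, mean = 0.01, sd = 0.001)
observed <- 0.0125

# Number of null values at least as large as the observed value;
# equivalent to length(sim.null[sim.null >= observed])
n.extreme <- sum(sim.null >= observed)

# One-sided empirical p-value with an add-one correction (assumption)
p.emp <- (n.extreme + 1) / (length(sim.null) + 1)
print(p.emp)
```

This makes no normality assumption, so it can also be used when the null distribution is not normal.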
From the answers I understood that, under a normal distribution, these should all be giving a similar finding, is this correct?
I don't think the t.test conversion is correct, because it gives me very different values compared to the p-value computed from the normal distribution. I don't know why, but ignoring the t.test, I hope I can compute the 95% CI directly with this:
This does not include the observed value 0.0125, so the result would be significant. I hope this is correct.
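A possible reason for the mismatch is that t.test divides by the standard error of the mean (sd/sqrt(n)), so it tests whether the mean of the null values equals the observed value. The normal-distribution p-value divides by the sd itself, which asks how extreme the observed value is relative to the spread of the null. A minimal sketch of the sd-based p-value and a quantile-based 95% interval, assuming the same simulated data as in the question (the seed and the names z, p.norm, null.interval and outside are assumptions for illustration):

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
sim.null <- rnorm(n = 1000, mean = 0.01, sd = 0.001)
observed <- 0.0125

# z-score of the observed value relative to the spread of the null distribution
z <- (observed - mean(sim.null)) / sd(sim.null)
p.norm <- 2 * pnorm(-abs(z))  # two-sided p-value under a normal approximation

# Central 95% interval of the null values, taken from empirical quantiles
null.interval <- quantile(sim.null, probs = c(0.025, 0.975))

# TRUE if the observed value falls outside that interval
outside <- observed < null.interval[1] || observed > null.interval[2]
print(p.norm)
print(null.interval)
print(outside)
```

Strictly speaking, the quantile range is the central 95% interval of the null distribution rather than a confidence interval for a parameter, but it answers the same practical question of whether the observed value lies in the tails.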
Thank you so much for helping me understand. This is extremely helpful and hard to find in textbooks (at least for a non-statistician who doesn't know what to look for).