Dear Biostars,
This might be one of the most obvious statistics-related questions in high-throughput sequencing data analysis: how can one calculate the enrichment of real versus random region/peak overlaps?
For example: is the overlap between sox2 peaks and oct4 peaks statistically significant or not?
Total number of sox2 peaks = 4000
Number of sox2 peaks that overlap oct4 = 2500
Number of random sox2 peaks that overlap oct4 = 20
I agree that the above example doesn't even need a statistical test to confirm the enrichment of 2500 over 20. But how can one show the significance of this enrichment statistically, as a p-value?
I have been doing something like the following. Do you think it is correct? If not, could you please suggest a better way? Many thanks in advance!
= log(((number of sox2 peaks that overlap oct4 - number of random sox2 peaks that overlap oct4) / total number of sox2 peaks) * 100)
= log(((2500 - 20) / 4000) * 100)
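For reference, here is a small sketch that evaluates the expression above with the numbers from this post. The base of the logarithm is not stated, so both the natural and the base-10 logarithm are computed; the variable names are only illustrative. Note that the result is a log-scaled percentage, not a p-value.

```python
import math

# Counts quoted in the post
total_sox2 = 4000       # total number of sox2 peaks
real_overlap = 2500     # sox2 peaks that overlap oct4
random_overlap = 20     # random sox2 peaks that overlap oct4

# Percentage of sox2 peaks overlapping oct4 beyond the random expectation
percent_excess = (real_overlap - random_overlap) / total_sox2 * 100

# The log base is an assumption, so compute both common choices
score_ln = math.log(percent_excess)
score_log10 = math.log10(percent_excess)

print(f"percent excess: {percent_excess:.2f}")
print(f"natural log:    {score_ln:.4f}")
print(f"log10:          {score_log10:.4f}")
```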
Take a look at the Kolmogorov-Smirnov (KS) test: http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
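The KS test compares two distributions of values. If only the overlap counts are available, a different and commonly used option is a one-sided Fisher's exact test on a 2x2 table of real versus random peaks. The sketch below is one possible way to do that using only the Python standard library. It assumes the random set also contains 4000 peaks, which the question does not state, and the function names are hypothetical. The tail probability is computed in log space because the p-value for these counts is far too small for an ordinary float.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_fisher_upper_tail(a, b, c, d):
    """Log of the one-sided Fisher's exact p-value P(X >= a) for the table
    [[a, b], [c, d]], i.e. the upper tail of a hypergeometric distribution."""
    n_row1 = a + b            # real peaks
    n_col1 = a + c            # all overlapping peaks (real + random)
    n_total = a + b + c + d
    lo = max(a, n_row1 - (n_total - n_col1), 0)
    hi = min(n_col1, n_row1)
    if lo > hi:
        return float("-inf")
    log_terms = [
        log_comb(n_col1, x)
        + log_comb(n_total - n_col1, n_row1 - x)
        - log_comb(n_total, n_row1)
        for x in range(lo, hi + 1)
    ]
    # log-sum-exp to add the probabilities without underflow
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

# Counts from the question; the random set size is an ASSUMPTION
n_real, real_overlap = 4000, 2500
n_random, random_overlap = 4000, 20

log_p = log_fisher_upper_tail(
    real_overlap, n_real - real_overlap,
    random_overlap, n_random - random_overlap,
)
log10_p = log_p / math.log(10)
print(f"log10(p-value) = {log10_p:.1f}")
```

If the random overlap comes from many shuffles rather than a single one, an empirical p-value (the fraction of shuffles with at least as many overlaps as the real data) may be more appropriate than this table-based test.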