I am interested in calculating the significance of the overlap between unlimited amount of gene sets.
I was able to use python's scipy hypergeom distribution for the case of 2 gene sets using the cdf function.
How could I achieve the same for an unlimited amount of gene sets?
This question becomes much easier to answer by using a simple resampling test. The null hypothesis you are testing against is that the overlap of the gene sets is indistinguishable from what you would get by random sampling. To test this perform 10,000 iterations of the following approach. For each gene set, randomly select genes from the genome equal to the number of actual genes in each set, and then compute the overlap of genes among all resampled sets. The p-value is then the number of iterations where your simulated overlap value was greater than or equal to the observed overlap (plus one), over the number of simulations (plus one).
This procedure is simulating a hypergeometric distribution for 2 samples, and a multivariate hypergeometric distribution for for 3 or more samples. You could indeed go the hypergeometric route directly if you wanted. I only suggested the simulation because it's really simple to code compared to the alternative and will give you comparable results.
Would the simulation approach give me a two sided p value?
I just need the probability of having equal or less than the size of intersection (cdf) - without regarding to the equally unlikeable intersection (1-cdf / survival function)
How would a population size parameter fit in this approach?
Doesn't the p value should be affected by the total 'available genes' ?
For example,. if I have two gene sets A, B.
|A| = 250, |B| = 220, |A&B| = 65
As far as I understand, the p value should change dramatically if the population size is 300 or 30,000.
Could you please explain where/how does it fit in this approach?
You are sampling from the total population, which is how it is taken into consideration. So for your example, each round of iteration you would sample 250 genes from the total pool of 30,000 genes to make a simulated set A, and then sample another 220 genes from the 30,000 to make a simulated set B. You then check how many genes overlap between the simulated A and simulated B sets, and compare that to your observed of 65. I talk about this in my tutorial here.
Thanks rpolicastro. Just to check, if the observed number falls out of the stimulated distribution (sum(null>=observed is zero)), then the p value depends solely on the number of iterations?
Is there another approach to solve this? would a multivariate hypergeometric distribution help?
This procedure is simulating a hypergeometric distribution for 2 samples, and a multivariate hypergeometric distribution for for 3 or more samples. You could indeed go the hypergeometric route directly if you wanted. I only suggested the simulation because it's really simple to code compared to the alternative and will give you comparable results.
Would the simulation approach give me a two sided p value? I just need the probability of having equal or less than the size of intersection (cdf) - without regarding to the equally unlikeable intersection (1-cdf / survival function)
The simulation itself simply generates an appropriate null dataset.
right tail = (sum(null >= observed) + 1) / (n_simulations + 1)
left tail = (sum(null <= observed) + 1) / (n_simulations + 1)
How would a population size parameter fit in this approach? Doesn't the p value should be affected by the total 'available genes' ?
For example,. if I have two gene sets A, B. |A| = 250, |B| = 220, |A&B| = 65 As far as I understand, the p value should change dramatically if the population size is 300 or 30,000.
Could you please explain where/how does it fit in this approach?
You are sampling from the total population, which is how it is taken into consideration. So for your example, each round of iteration you would sample 250 genes from the total pool of 30,000 genes to make a simulated set A, and then sample another 220 genes from the 30,000 to make a simulated set B. You then check how many genes overlap between the simulated A and simulated B sets, and compare that to your observed of 65. I talk about this in my tutorial here.
Thanks rpolicastro. Just to check, if the observed number falls out of the stimulated distribution (sum(null>=observed is zero)), then the p value depends solely on the number of iterations?