I have two gene lists derived from two independent datasets, and I want to compute the significance of the overlap between two subgroups, namely the lists of differentially expressed genes in each dataset. I want to know whether the overlap between the two subgroups could be explained by chance. For instance:
Dataset1 total: 500
Dataset1 subgroup: 100
Dataset2 total: 300
Dataset2 subgroup: 50
Intersection between subgroups: 25
Union 1 and 2 (no duplicates): 600
I want to compute how significant the overlap between the subgroups is compared with what would be expected by chance. How would you do this in Python? I was looking at Fisher's exact test and the hypergeometric test, but I am having trouble fitting my data into these analyses.
From what I understand, the contingency table would be:
               Dataset1   Dataset2
In_subgroup    100        50
Not_subgroup   400        250
Total          500        300
Note that the universe comprises 600 unique elements, not 500+300 (as some elements are shared between dataset1 and dataset2). Given this, and based on another post, I would do this in R:
phyper(24, 100, 500, 50, lower.tail = FALSE)
[1] 9.15e-19
Translating this into Python with SciPy, I would use:
>>> scipy.stats.hypergeom.cdf(24, 600, 500, 300)
0.0
Can I assume that the difference between the two results is due to numerical error?
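For comparison, here is a minimal sketch of a SciPy call that mirrors the parameterization of the R `phyper` line above. It relies on two facts about the libraries. First, `scipy.stats.hypergeom` takes its arguments in the order (k, M, n, N): observed count, population size, number of "success" items in the population, and number of draws. Second, R's `lower.tail = FALSE` corresponds to the survival function `sf`, not to `cdf`. The variable names are made up for illustration, and treating the 600 unique genes as the background is an assumption carried over from the question. The one-sided Fisher's exact test on the corresponding 2x2 table should give the same tail probability.

```python
from scipy.stats import hypergeom, fisher_exact

# Numbers taken from the question; using 600 as the background is an assumption.
universe = 600    # unique genes across both datasets
subgroup1 = 100   # differentially expressed genes in dataset 1
subgroup2 = 50    # differentially expressed genes in dataset 2
overlap = 25      # genes present in both subgroups

# P(X >= overlap) is the survival function evaluated at overlap - 1.
# Argument order is (k, M, n, N): population size, successes, draws.
p_hyper = hypergeom.sf(overlap - 1, universe, subgroup1, subgroup2)

# The same question posed as a one-sided Fisher's exact test.
# Rows: in subgroup 1 or not; columns: in subgroup 2 or not.
table = [
    [overlap, subgroup1 - overlap],
    [subgroup2 - overlap, universe - subgroup1 - subgroup2 + overlap],
]
odds_ratio, p_fisher = fisher_exact(table, alternative="greater")

print(p_hyper, p_fisher)
```

With the arguments in the order used in the question, `hypergeom.cdf(24, 600, 500, 300)` describes a different experiment (300 draws from a population with 500 successes) and returns the lower tail, so a result of 0.0 there does not look like a rounding issue.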
Hypergeometric Test On Gene Set
Similarity of equally-sized distributions can be assessed using a Kolmogorov-Smirnov test. Since your distributions are unequal in size, you may want to look at the Mann–Whitney U test. The scipy.stats module has functions for both tests.
@Mensur, thanks for your comment. Perhaps I'm missing something, but I do not want to compare two distributions; I want to assess the likelihood of getting the overlap between subgroups by chance.
Your two subgroups can be thought of as distributions of numbers. Start with zeros for each gene in the genome, and put 1 when they occur in your list. Do the same for your other list, and you have two distributions of numbers. Not sure whether that's a better test than what you are doing already, but it can be done.
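The encoding described in this comment can be sketched as follows. The gene IDs and list memberships are invented toy data, not values from the question, and the sketch only shows how the 0/1 vectors are built. It also shows that cross-tabulating the two vectors recovers the same overlap counts that a hypergeometric or Fisher's exact test would use.

```python
import numpy as np

# Hypothetical toy data: gene IDs and memberships are invented for illustration.
universe = ["g1", "g2", "g3", "g4", "g5", "g6"]
list1 = {"g1", "g2", "g3"}
list2 = {"g2", "g3", "g6"}

# One entry per gene in the universe: 1 if the gene is in the list, else 0.
vec1 = np.array([1 if g in list1 else 0 for g in universe])
vec2 = np.array([1 if g in list2 else 0 for g in universe])

# Genes flagged in both vectors form the overlap.
overlap = int(np.dot(vec1, vec2))

# 2x2 table: rows are in/not in list1, columns are in/not in list2.
table = [
    [overlap, int(vec1.sum()) - overlap],
    [int(vec2.sum()) - overlap, int(((vec1 == 0) & (vec2 == 0)).sum())],
]

print(vec1, vec2, overlap, table)
```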