This question is related to this one.
As in the above post, I have two lists of gene names and I am calculating the intersection between them.
I am trying to calculate the p value that the intersection of these lists occurs by chance.
I wrote this code in Python, but I am not sure how to validate that it actually performs the calculation I intend.
from scipy.stats import hypergeom as hg
import pandas as pd

def main(gene_path1, gene_path2, pop_size):
    genes1 = pd.read_csv(gene_path1, sep='\n', header=None)
    genes2 = pd.read_csv(gene_path2, sep='\n', header=None)
    intersection = pd.merge(genes1, genes2, how='inner').drop_duplicates([0])
    len_genes1 = genes1[0].count()
    len_genes2 = genes2[0].count()
    len_intersection = intersection[0].count()
    pvalue2 = hg.cdf(int(len_intersection)-1, int(len_genes1)+int(pop_size),
                     int(len_genes1), int(len_genes2))
    print(f'Genes1 len: {len_genes1}, Genes2 len: {len_genes2}, Intersection: {len_intersection}, pvalue: {pvalue2}')
For example, if len_genes1 = 62, len_genes2 = 52, pop_size = 120, and len_intersection = 11, then I get a p-value of 0.0768.
Could you provide a way I could validate that this is indeed the right calculation?
*Note that I'm calculating a one-sided p-value.
Another question: for the lists I have (of sizes 62 and 52), what is a reasonable population size? I noticed that as I increase the population size, the p-value goes to 1 very quickly.
As you probably can tell, I'm a newbie in the field of statistics - try and be as explicit as you can :)
There is no "reasonable" value for the population size; it is something you need to calculate.
In general the population size is the universe of genes that could possibly have been in both your gene lists.
So, for example, if genes1 and genes2 were RNAseq DE genes from two experiments, the population size is the number of genes that were expressed highly enough to allow a DE test in both experiments.
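As a rough way to check the calculation once the universe is known, here is a minimal sketch. The function name `overlap_pvalue` and the toy gene names are made up for illustration; the only library calls are `scipy.stats.hypergeom.sf` and `scipy.stats.fisher_exact`. It assumes the one-sided question is whether the overlap is larger than expected by chance, in which case the upper tail `sf(k - 1, M, n, N)` is the quantity of interest (`cdf(k - 1, ...)` gives the lower tail instead). A one-sided Fisher's exact test on the matching 2x2 table should give the same number, which makes it a handy independent check.

```python
from scipy.stats import fisher_exact, hypergeom


def overlap_pvalue(genes1, genes2, universe):
    """One-sided (enrichment) p-value for the overlap of two gene lists.

    `universe` is the set of genes that could have appeared in both lists.
    """
    universe = set(universe)
    # Genes outside the universe could never overlap, so drop them first.
    a = set(genes1) & universe
    b = set(genes2) & universe

    k = len(a & b)     # observed overlap
    M = len(universe)  # population size
    n = len(a)         # genes in list 1 ("successes" in the population)
    N = len(b)         # genes in list 2 (number of draws)

    # P(X >= k): sf(k - 1) is the upper tail, including k itself.
    p_hyper = hypergeom.sf(k - 1, M, n, N)

    # Cross-check: one-sided Fisher's exact test on the same counts.
    table = [[k, n - k], [N - k, M - n - N + k]]
    _, p_fisher = fisher_exact(table, alternative="greater")
    return k, p_hyper, p_fisher


# Toy data (hypothetical): 200 testable genes, lists of 62 and 52 sharing 11.
universe = [f"g{i}" for i in range(200)]
genes1 = [f"g{i}" for i in range(0, 62)]
genes2 = [f"g{i}" for i in range(51, 103)]

k, p_hyper, p_fisher = overlap_pvalue(genes1, genes2, universe)
print(k, p_hyper, p_fisher)
```

If the two p-values agree, the hypergeometric parameters are at least wired up consistently. Whether the universe itself is right is a separate question that only the experimental design can answer.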
Okay so that would be a union of both gene lists, correct? Without filtering based on fold change and adj. p-value.