Hypergeometric test on gene lists
4.3 years ago

This question is related to this one.
As in the above post, I have two lists of gene names and I am calculating the intersection between them.
I am trying to calculate the p-value for the intersection of these lists occurring by chance. I wrote this code in Python, and I am not sure how I can validate that it is actually doing the calculation I intend.

from scipy.stats import hypergeom as hg
import pandas as pd
def main(gene_path1, gene_path2, pop_size):
    genes1 = pd.read_csv(gene_path1, sep='\n', header=None)
    genes2 = pd.read_csv(gene_path2, sep='\n', header=None)

    intersection = pd.merge(genes1, genes2, how='inner').drop_duplicates([0])

    len_genes1 = genes1[0].count()
    len_genes2 = genes2[0].count()
    len_intersection = intersection[0].count()

    pvalue2 = hg.cdf(int(len_intersection)-1, int(len_genes1)+int(pop_size), int(len_genes1), int(len_genes2))
    print(f'Genes1 len: {len_genes1}, Genes2 len: {len_genes2}, Intersection: {len_intersection}, pvalue: {pvalue2}')

For example, if:
len_genes1 = 62,
len_genes2 = 52,
pop_size = 120,
len_intersection = 11
Then I get a pvalue = 0.0768

Could you suggest a way to validate that this is indeed the right calculation?
*Note that I'm calculating a one-sided p-value.
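
One way to check a calculation like this is to compute the same tail probability by independent routes and confirm that they agree. The sketch below assumes that `pop_size` is the total number of genes in the background and that an enrichment p-value, P(overlap >= k), is what is wanted; the function names are made up for illustration. SciPy's `hypergeom` takes its arguments in the order `(k, M, n, N)` (population size, number of marked items, number drawn), so the population size goes in the second position on its own. `sf(k - 1, ...)` gives the upper tail P(overlap >= k), whereas `cdf(k - 1, ...)` gives the lower tail P(overlap < k).

```python
from math import comb

from scipy.stats import fisher_exact, hypergeom


def overlap_pvalue(k, pop_size, n1, n2):
    """Upper tail P(overlap >= k) via scipy's hypergeometric distribution."""
    # Argument order is (k, M, n, N): M = population, n = marked, N = drawn.
    return hypergeom.sf(k - 1, pop_size, n1, n2)


def overlap_pvalue_bruteforce(k, pop_size, n1, n2):
    """Same tail, summed directly from binomial coefficients."""
    total = comb(pop_size, n2)
    upper = min(n1, n2)
    ways = sum(comb(n1, i) * comb(pop_size - n1, n2 - i)
               for i in range(k, upper + 1))
    return ways / total


def overlap_pvalue_fisher(k, pop_size, n1, n2):
    """Same tail again, as a one-sided Fisher's exact test on the 2x2 table."""
    table = [[k, n1 - k],
             [n2 - k, pop_size - n1 - n2 + k]]
    _, p = fisher_exact(table, alternative="greater")
    return p


# Numbers from the example in the question.
args = (11, 120, 62, 52)
print(overlap_pvalue(*args))
print(overlap_pvalue_bruteforce(*args))
print(overlap_pvalue_fisher(*args))
```

If the three values match, the parameterization is at least internally consistent. With the example numbers, the overlap expected by chance is 62 × 52 / 120 ≈ 26.9 genes, so an observed overlap of 11 is below expectation and the upper-tail p-value should come out close to 1.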

Another question: for the lists I have (of size 62 and 52), what is a reasonable population size? I noticed that if I scale up the population size, the p-value goes to 1 really fast.
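
A likely explanation for that behaviour is which tail is being computed. The sketch below, which assumes the list sizes and overlap from the example and uses a made-up helper name `tails`, prints both tails for a few arbitrary population sizes. As the population grows, the overlap expected by chance (n1 × n2 / population) shrinks, so the lower tail P(overlap < k) climbs towards 1 while the upper tail P(overlap >= k) falls towards 0.

```python
from scipy.stats import hypergeom


def tails(pop_size, n1=62, n2=52, k=11):
    """Return (expected overlap, P(overlap < k), P(overlap >= k))."""
    expected = n1 * n2 / pop_size
    lower = hypergeom.cdf(k - 1, pop_size, n1, n2)  # P(X <= k - 1)
    upper = hypergeom.sf(k - 1, pop_size, n1, n2)   # P(X >= k)
    return expected, lower, upper


# Population sizes chosen arbitrarily, for illustration only.
for pop_size in (120, 500, 2000, 20000):
    expected, lower, upper = tails(pop_size)
    print(f"M={pop_size:>6}  expected={expected:6.2f}  "
          f"P(X<k)={lower:.3g}  P(X>=k)={upper:.3g}")
```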

As you probably can tell, I'm a newbie in the field of statistics - try and be as explicit as you can :)

python p value statistics • 1.2k views

There is no "reasonable" value for the population size; it is something you need to calculate.

In general the population size is the universe of genes that could possibly have been in both your gene lists.

So, for example, if genes1 and genes2 were RNAseq DE genes from two experiments, the population size is the number of genes that were expressed highly enough to allow a DE test in both experiments.
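
A minimal sketch of that idea, assuming you can get the full set of genes that were tested in each experiment: the background is the set of genes tested in both, and each gene list is restricted to that background before counting. The function name and the toy gene names are hypothetical.

```python
from scipy.stats import hypergeom


def overlap_inputs(tested1, tested2, hits1, hits2):
    """Return (overlap, population size, list 1 size, list 2 size)."""
    # Genes that could have appeared in both lists.
    background = set(tested1) & set(tested2)
    # Drop genes that were not testable in both experiments.
    list1 = set(hits1) & background
    list2 = set(hits2) & background
    return len(list1 & list2), len(background), len(list1), len(list2)


# Toy data: placeholder names, not real genes.
tested1 = {"A", "B", "C", "D", "E", "F"}
tested2 = {"B", "C", "D", "E", "F", "G"}
hits1 = {"A", "B", "C"}
hits2 = {"C", "D", "G"}

k, pop_size, n1, n2 = overlap_inputs(tested1, tested2, hits1, hits2)
# Upper tail P(overlap >= k); scipy's argument order is (k, M, n, N).
p = hypergeom.sf(k - 1, pop_size, n1, n2)
print(k, pop_size, n1, n2, p)
```

Restricting the lists to the background keeps the counts consistent: a gene that could not be tested in one experiment cannot be in the overlap, so it should not count towards either list size.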


"the population size is the number of genes that were expressed highly enough to allow a DE test in both experiments"

Okay so that would be a union of both gene lists, correct? Without filtering based on fold change and adj. p-value.
