Question

Hypergeometric test on gene lists

4.3 years ago

Eliran Turgeman ▴ 10

This question is related to this one.
As in the above post, I have two lists of gene names and I am calculating the intersection between them.
I am trying to calculate the p-value for the intersection of these lists occurring by chance. I wrote this code in Python, and I am not sure how I can validate that it is actually doing the calculation I intend.

from scipy.stats import hypergeom as hg
import pandas as pd


def main(gene_path1, gene_path2, pop_size):
    # One gene name per line, no header row.
    genes1 = pd.read_csv(gene_path1, sep='\n', header=None)
    genes2 = pd.read_csv(gene_path2, sep='\n', header=None)

    # Genes present in both lists, each counted once.
    intersection = pd.merge(genes1, genes2, how='inner').drop_duplicates([0])

    len_genes1 = genes1[0].count()
    len_genes2 = genes2[0].count()
    len_intersection = intersection[0].count()

    # scipy signature: hypergeom.cdf(k, M, n, N), where M is the population size,
    # n the number of "success" items in the population and N the number of draws.
    # cdf(k - 1, ...) evaluates P(X <= k - 1).
    pvalue2 = hg.cdf(int(len_intersection)-1, int(len_genes1)+int(pop_size), int(len_genes1), int(len_genes2))
    print(f'Genes1 len: {len_genes1}, Genes2 len: {len_genes2}, Intersection: {len_intersection}, pvalue: {pvalue2}')

For example, if:
len_genes1 = 62,
len_genes2 = 52,
pop_size = 120,
len_intersection = 11
Then I get a pvalue = 0.0768

Could you provide a way I could validate that this is indeed the right calculation?
*Note that I'm calculating a one sided p value
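One way to check a calculation like the one above is to compute the same tail probability by independent routes and confirm they agree. The sketch below is only an illustration under stated assumptions: it assumes the quantity of interest is the upper tail P(X >= k), the usual one-sided enrichment p-value, and that pop_size is the full gene universe containing both lists. The function names are hypothetical. Note that scipy's cdf(k - 1, ...) gives the lower tail P(X <= k - 1), whereas sf(k - 1, ...) gives P(X >= k).

```python
from math import comb

from scipy.stats import fisher_exact, hypergeom


def overlap_pvalue(k, pop_size, n1, n2):
    """Upper-tail p-value P(X >= k) from scipy's hypergeometric distribution.

    Assumption: list 2 (n2 genes) is treated as a random draw from a
    universe of pop_size genes, n1 of which belong to list 1.
    """
    # scipy argument order is (k, M, n, N); sf(k - 1) = P(X > k - 1) = P(X >= k).
    return hypergeom.sf(k - 1, pop_size, n1, n2)


def overlap_pvalue_bruteforce(k, pop_size, n1, n2):
    """Same tail, summed directly from the hypergeometric formula."""
    total = comb(pop_size, n2)
    upper = min(n1, n2)
    favourable = sum(
        comb(n1, i) * comb(pop_size - n1, n2 - i) for i in range(k, upper + 1)
    )
    return favourable / total


def overlap_pvalue_fisher(k, pop_size, n1, n2):
    """Same tail via a one-sided Fisher's exact test on the 2x2 table."""
    table = [
        [k, n1 - k],                           # in list 1: in list 2 / not in list 2
        [n2 - k, pop_size - n1 - n2 + k],      # not in list 1: in list 2 / not in list 2
    ]
    _, p = fisher_exact(table, alternative="greater")
    return p


if __name__ == "__main__":
    # Values from the example in the question.
    args = (11, 120, 62, 52)
    print(overlap_pvalue(*args))
    print(overlap_pvalue_bruteforce(*args))
    print(overlap_pvalue_fisher(*args))
```

If the three numbers match, the scipy call is at least computing the tail it is meant to compute; whether that tail and that population size are the right ones for the biological question is a separate matter.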

Another question: for the lists I have (of sizes 62 and 52), what is a reasonable population size? I noticed that if I scale the population size, the p-value goes to 1 very quickly.

As you probably can tell, I'm a newbie in the field of statistics - try and be as explicit as you can :)

4.3 years ago by Eliran Turgeman ▴ 10

There is no "reasonable" value for the population size; it is something you need to calculate.

In general, the population size is the universe of genes that could possibly have been in both your gene lists.

So, for example, if genes1 and genes2 were RNAseq DE genes from two experiments, the population size is the number of genes that were expressed highly enough to allow a DE test in both experiments.
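As a rough illustration of the point above, the universe can be built as the set of genes that were testable in both experiments, and its size used as the population size. Everything in this sketch is hypothetical: the gene names, the function names, and the step of restricting each list to the universe before counting the overlap, which is a common convention rather than something stated in this thread.

```python
def gene_universe(tested_in_exp1, tested_in_exp2):
    """Genes that could have appeared in both lists: tested in both experiments."""
    return set(tested_in_exp1) & set(tested_in_exp2)


def restrict_to_universe(gene_list, universe):
    """Drop genes that fall outside the shared universe (assumed convention)."""
    return set(gene_list) & universe


if __name__ == "__main__":
    # Hypothetical inputs: genes that passed the expression filter in each experiment.
    tested_exp1 = {"GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"}
    tested_exp2 = {"GENE_B", "GENE_C", "GENE_D", "GENE_E", "GENE_F"}

    # Hypothetical DE gene lists from each experiment.
    de_exp1 = {"GENE_A", "GENE_B", "GENE_C"}
    de_exp2 = {"GENE_C", "GENE_D", "GENE_F"}

    universe = gene_universe(tested_exp1, tested_exp2)
    list1 = restrict_to_universe(de_exp1, universe)
    list2 = restrict_to_universe(de_exp2, universe)

    pop_size = len(universe)        # population size for the hypergeometric test
    overlap = len(list1 & list2)    # observed intersection
    print(pop_size, len(list1), len(list2), overlap)
```

The four counts printed at the end are the inputs a hypergeometric test needs: population size, the two list sizes, and the overlap.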

4.3 years ago by i.sudbery 20k

"the population size is the number of genes that were expressed highly enough to allow a DE test in both experiments"

Okay, so that would be the union of both gene lists, correct? That is, without filtering on fold change and adjusted p-value.

4.2 years ago by VBer ▴ 200