Question

Usefulness of hypergeometric test in pathway enrichment analysis

0

2.8 years ago

Francois Piumi ▴ 70

Hi, I have a question about the usefulness of the hypergeometric test in pathway enrichment analysis.

Suppose I have a background containing 12000 genes, of which 600 are annotated as "cell cycle".

In my experiment, I have 60 genes associated with an annotation, of which 20 are associated with "cell cycle".

If my experiment had been done at random, I would have expected 3 of the 60 genes to be associated with cell cycle (60 * (600/12000) = 3).

I have 20 genes in my experiment instead of 3, so I know that I enriched my experiment in "cell cycle" genes.

Could you please explain to me why a statistical test is useful for reaching this conclusion? I don't see the relationship between the urn problem and pathway enrichment, since genes are not selected at random in my experiment and I'm not interested a priori in the probability of drawing white or black balls.

enrichment statistics hypergeometric test • 1.6k views

score 3 · Answer 1 · 2022-02-01

3

2.8 years ago

Papyrus ★ 3.0k

It is precisely to confirm that your genes were not selected at random that you do the test. Observing an enrichment (which does not need a statistical test) is one thing; checking whether you could have observed that enrichment by chance alone is another. The latter is the function of the test. The test gives you a probability (p-value) of observing numbers at least as extreme as yours if the genes were truly selected at random from the background (H0). So, imagine you had 4 of your 60 genes in the "cell cycle" annotation. Is that enough to say they are enriched (you expected 3)? What about 10 out of 60? The p-value gives you an "objective" scale on which to say, "from this point onwards I consider this enrichment statistically significant".
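To make the scale concrete, here is a minimal sketch of the one-sided hypergeometric p-value for the numbers in the question (12000 background genes, 600 annotated, 60 drawn). It uses only the Python standard library; the function name `hypergeom_upper_tail` is made up for this illustration, and in practice you would more likely call a statistics package.

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k): chance of seeing at least k annotated genes when n genes
    are drawn without replacement from N genes, K of which are annotated."""
    total = comb(N, n)       # all possible draws of n genes
    upper = min(K, n)        # the overlap cannot exceed either set size
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# Numbers from the question: background, annotated genes, genes in the experiment
N, K, n = 12000, 600, 60

for k in (4, 10, 20):
    print(f"P(X >= {k}) = {hypergeom_upper_tail(k, N, K, n):.3g}")
```

Under these assumptions, 4 out of 60 is unremarkable, while 20 out of 60 is extremely unlikely under H0. Where exactly to draw the line between those is what the p-value threshold decides.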

2

I think a better formulation is that the test informs us about the likelihood that the observed number of genes (20 out of 60 in the example above) could be occurring by random chance alone considering the frequency of annotations in the database.

The test is about the enrichment, not the selection of the genes. The gene selection process may have been valid and with no error whatsoever. The selection does not factor into whether the genes are enriched for a given annotation or not.

The annotation may fail to pass for either reason: the selection was incorrect, or the genes are not actually enriched for the annotation.

I think the latter part of the answer has the same interpretation that I give, but you do start out talking about a random selection of the genes.

ADD REPLY • link 2.8 years ago by Istvan Albert 101k

1

Thanks for the input! Yes, I may have worded it a bit confusingly (I was trying to follow the wording of the OP). By "genes selected at random" I was referring to genes being independently sampled (with no bias) from the background. But you're totally right: that is the assumption of the test, not what is being tested. What is being computed is the probability of observing a certain proportion/enrichment.
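For intuition, the "sampled with no bias from the background" assumption can also be simulated directly. The sketch below is only an illustration (the function name, the number of trials and the seed are arbitrary choices of mine): genes are numbered 0 to 11999, the first 600 are treated as "cell cycle", and 60 genes are drawn without replacement many times to see what overlap chance alone produces.

```python
import random

def simulate_overlaps(N, K, n, trials, seed=0):
    """Draw n genes without replacement from N, `trials` times, and count
    how many of each draw fall among the K annotated genes (ids 0..K-1)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = []
    for _ in range(trials):
        drawn = rng.sample(range(N), n)
        counts.append(sum(1 for gene in drawn if gene < K))
    return counts

counts = simulate_overlaps(N=12000, K=600, n=60, trials=2000)
mean_overlap = sum(counts) / len(counts)
frac_ge_20 = sum(c >= 20 for c in counts) / len(counts)

print(f"mean overlap under random sampling: {mean_overlap:.2f}")
print(f"fraction of random draws with >= 20 annotated genes: {frac_ge_20}")
```

The average overlap should sit near the expected 3 genes, and the fraction of random draws reaching 20 is the empirical counterpart of the p-value. The hypergeometric test gives that same quantity exactly, without simulation.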

Overall, I think the main question being asked was "why do I need a test if I can directly observe that my genes' pathway is in higher proportion in my data (a.k.a. "enriched") versus the proportion I see in the background?". So the OP may have been conflating "observing a certain effect" with "having evidence that an observed effect is statistically relevant".

ADD REPLY • link 2.8 years ago by Papyrus ★ 3.0k

0

Thanks very much Papyrus and Istvan for these clarifications. Papyrus, your last sentence is absolutely true. I didn't catch that statistical relevance is based on random selection of genes, while experimental data are not selected at random. In my experimental data, I expect the same proportion of "cell cycle" genes as in the background. If I don't see that, I can say that I got an enrichment in my data (a higher proportion, based on simple observation). In the statistical test, I compare what I get (20 genes) with what I would expect "at random" (3 genes), through the probability of getting such a result by chance.

ADD REPLY • link 2.8 years ago by Francois Piumi ▴ 70