Hi, I have a question about the usefulness of the hypergeometric test in pathway enrichment.
Suppose I have a background containing 12000 genes, of which 600 are annotated as "cell cycle".
In my experiment, I have 60 genes associated with an annotation, of which 20 are annotated as "cell cycle".
If my experiment had been done at random, I would have expected 3 of the 60 genes to be associated with cell cycle (60 * (600/12000) = 3).
I have 20 genes in my experiment instead of 3, so I know that my experiment is enriched in "cell cycle" genes.
Could you please explain to me the usefulness of a statistical test for reaching this conclusion? I don't see the relationship between the urn problem and pathway enrichment, since genes are not selected at random in my experiment and I'm not interested a priori in the probability of drawing white or black balls.
I think a better formulation is that the test tells us how likely it is that the observed number of genes (20 out of 60 in the example above), or more, would occur by random chance alone, given the frequency of annotations in the database.
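To make this concrete, here is a minimal Python sketch of that calculation using the numbers from the question. It assumes a one-sided test (the probability of seeing 20 or more annotated genes) and uses only the standard library; the function name `hypergeom_upper_tail` is made up for illustration and is not from any particular enrichment tool.

```python
from math import comb

def hypergeom_upper_tail(N, K, n, x):
    """P(X >= x) when n genes are drawn without replacement from a
    background of N genes, K of which carry the annotation."""
    total = comb(N, n)  # all possible gene sets of size n
    # gene sets of size n containing k annotated genes, for k = x .. max possible
    favorable = sum(comb(K, k) * comb(N - K, n - k)
                    for k in range(x, min(n, K) + 1))
    return favorable / total

background = 12000   # genes in the background
annotated = 600      # background genes annotated as "cell cycle"
selected = 60        # genes in the experiment
observed = 20        # selected genes annotated as "cell cycle"

expected = selected * annotated / background
p_value = hypergeom_upper_tail(background, annotated, selected, observed)

print(f"expected by chance: {expected:.1f}")
print(f"P(X >= {observed}) = {p_value:.3g}")
```

The printed probability is what the test reports: how often a gene set of this size, drawn with no preference from the background, would contain at least as many "cell cycle" genes as observed.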
The test is about the enrichment, not the selection of the genes. The gene selection process may have been valid and with no error whatsoever. The selection does not factor into whether the genes are enriched for a given annotation or not.
The annotation may fail to pass for either reason: the selection was incorrect, or the genes are not actually enriched for the annotation.
I think the latter part of the answer has the same interpretation that I give, but you do start out talking about a random selection of the genes.
Thanks for the input! Yes, I may have worded it a bit confusingly (I was trying to follow the wording of the OP). By "genes selected at random" I was referring to genes being independently sampled (with no bias) from the background. But you're totally right: that is the assumption of the test, not what is being tested. What is being computed is the probability of observing a certain proportion/enrichment.
Overall, I think the main question being asked was "why do I need a test if I can directly observe that my genes' pathway is present in a higher proportion in my data (a.k.a. "enriched") than in the background?". The OP may have been conflating "observing a certain effect" with "having evidence that an observed effect is statistically relevant".
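A small sketch of that distinction, under the same background as in the question (12000 genes, 600 annotated): the two hypothetical gene lists below both show the same observed proportion (one third "cell cycle" versus 5% in the background), but only the larger one gives strong evidence. The 3-gene list is invented purely for illustration, and `enrichment_p` is a made-up helper name.

```python
from math import comb

def enrichment_p(N, K, n, x):
    """One-sided hypergeometric p-value: P(at least x annotated genes
    among n genes drawn without replacement from N, K annotated)."""
    hits = sum(comb(K, k) * comb(N - K, n - k)
               for k in range(x, min(n, K) + 1))
    return hits / comb(N, n)

# Same observed proportion (1/3), very different sample sizes.
small = enrichment_p(12000, 600, 3, 1)     # 1 of 3 genes (hypothetical list)
large = enrichment_p(12000, 600, 60, 20)   # 20 of 60 genes (the OP's numbers)

print(f"1 of 3 genes:   p = {small:.3g}")
print(f"20 of 60 genes: p = {large:.3g}")
```

With 1 hit out of 3 genes the p-value is roughly 0.14, so chance alone would produce that "enrichment" fairly often, whereas 20 out of 60 is extremely unlikely by chance. The observed proportion is identical in both cases; the test is what separates them.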
Thanks very much Papyrus and Istvan for these clarifications. Papyrus, your last sentence is absolutely true. I didn't catch that statistical relevance is based on a random selection of genes, while experimental data are not random. In my experimental data, I expect the same proportion of "cell cycle" genes as in the background. If I don't see that, I can say that I got an enrichment in my data (a higher proportion, based on simple observation). In the statistical test, I will compare the probability of getting "at random" what I expect (3 genes) to what I get (20 genes).