I am looking at the DAVID documentation here and specifically the contingency table: https://david.ncifcrf.gov/content.jsp?file=functional_annotation.html
In DAVID annotation system, Fisher Exact is adopted to measure the gene-enrichment in annotation terms.
In human genome background (30,000 gene total), 40 genes are involved in p53 signaling pathway. A given gene list has found that 3 out of 300 belong to p53 signaling pathway. Then we ask the question if 3/300 is more than random chance comparing to the human background of 40/30000.
However, the contingency table shown adds up to 30,300. Isn't it supposed to add up to 30,000? Is that a typo or are there multiple ways to perform the Fisher Exact test?
Looks like a typo. Please let them know.
I will. Thanks for the suggestion.
I was mostly just concerned that maybe I am not fully understanding Fisher's exact test. I found that page while looking for examples that involve gene set enrichment.
It still shows the same contingency table.
It does not look like a typo to me.
It says the human genome has 30,000 genes, and that is 29,960 + 40 (second column). Then it says a given gene list has 300 genes, and that is 3 + 297 (first column). Pathway and Not In Pathway give how many genes are (and are not) involved in p53 signalling, in the set of 300 genes (first column) and in the whole genome (second column). So all the numbers add up. If you sum all 4 numbers in the middle cells you of course get 30,300, because you are summing the 2 sets you are comparing (30,000 is the genome, 300 is the candidate gene list). The 300 genes are of course included in the 30,000, but that does not matter here: think of them as 2 sets, where the 300 genes are your candidate set and the 30,000 are the background.
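To make the arithmetic concrete, here is a minimal sketch that computes a one-sided Fisher p-value for the table exactly as displayed (3 and 297 in the list, 40 and 29,960 in the genome). The function name `fisher_one_sided` is made up for illustration, it uses only the Python standard library, and it is not a claim about how DAVID computes its score internally.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided (enrichment) Fisher p-value for the 2x2 table [[a, b], [c, d]].

    Rows: in pathway / not in pathway. Columns: gene list / background.
    Returns P(X >= a) with all row and column totals held fixed.
    """
    in_pathway = a + b       # first row total
    list_size = a + c        # first column total
    total = a + b + c + d    # grand total of the four cells
    largest = min(in_pathway, list_size)
    # Count the tables at least as extreme as the observed one.
    favourable = sum(
        comb(in_pathway, k) * comb(total - in_pathway, list_size - k)
        for k in range(a, largest + 1)
    )
    return favourable / comb(total, list_size)

# The table exactly as displayed: grand total 30,300 (an assumption that the
# four cells are used as-is, with the list counted inside the genome as well).
p_displayed = fisher_one_sided(3, 40, 297, 29960)
print(f"one-sided p = {p_displayed:.4f}")
```

The grand total used inside the function is 30,300, which is the sum the question was asking about.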
Another example, maybe easier:
An organism has 100 fully annotated genes and I am studying local adaptation. I find that 10 of these 100 genes are involved in adaptation (according to my results).
What the Fisher exact test in DAVID does is look for functional enrichment against a background (which could be all the genes of your organism).
For example, of the 10 candidate genes I identified, 8 may be involved in p53 signalling and the remaining 2 not. Of course these 8 genes could also be involved in another pathway, but here we test one pathway at a time. Then imagine that in my organism 30 genes (out of the total 100, which also include my 10 adaptation genes) are involved in p53 signalling, so the remaining 70 are not. What the test does is check whether 8/10 is statistically "more" than 30/100. If it is, then my adaptation genes are enriched in the p53 pathway, meaning that my set of 10 genes contains more genes involved in p53 than you would expect by chance. You could also conclude (given my stupid example) that the p53 pathway has a strong role in adaptation. Of course this is a very stupid example and p53 is not involved in adaptation.
DAVID does this for every functional category (or pathway, or GO term) in your dataset, and then it outputs all the p-values. Like this for example: p53 signaling pathway (P = 1.24 × 10^-5, Fisher's test, Bonferroni adjusted)
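Because one test is run per category, the raw p-values are usually corrected for multiple testing. A minimal sketch of a plain Bonferroni adjustment follows. The pathway names other than p53 and all the raw p-values are hypothetical, and whether DAVID applies the correction in exactly this way is an assumption.

```python
def bonferroni(p_values):
    """Multiply each raw p-value by the number of tests, capping at 1."""
    n_tests = len(p_values)
    return {term: min(1.0, p * n_tests) for term, p in p_values.items()}

# Hypothetical raw p-values, one per tested pathway (not real DAVID output).
raw = {
    "p53 signaling pathway": 1.0e-6,
    "hypothetical pathway A": 0.02,
    "hypothetical pathway B": 0.4,
}

adjusted = bonferroni(raw)
for term, p in adjusted.items():
    print(f"{term}: adjusted p = {p:.3g}")
```

With many categories tested, a pathway needs a very small raw p-value to stay significant after this adjustment.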
So the contingency table shown above is just a small example, looking at only 1 pathway (p53 signalling). Like Devon said below, you only test one pathway (or GO term) at a time. With my silly example the contingency table would be:
                   Candidate list   Genome (background)
In pathway         8                30
Not in pathway     2                70
For a classic Fisher’s exact test, you are counting some genes twice. There is another example here: https://www.pathwaycommons.org/guide/primers/statistics/fishers_exact_test/
However, reading that page again now, it seems they are not showing the Fisher’s exact test, but their modified version of it. I guess that is where the confusion comes in.
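To see how much the double counting matters, here is a minimal sketch using the toy numbers from the answer above (8 of 10 candidates in the pathway, 30 of 100 genes overall). The function name `enrichment_p` is made up for illustration. The classic test draws the 10 candidates from the 100 genes. The table 8, 30, 2, 70 has margins that instead describe 110 genes with 38 in the pathway, because the candidates appear in both columns.

```python
from math import comb

def enrichment_p(hits, list_size, pathway_size, population):
    """P(X >= hits) when list_size genes are drawn without replacement from
    `population` genes, of which `pathway_size` belong to the pathway."""
    upper = min(list_size, pathway_size)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, list_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(population, list_size)

# Classic test: the 10 candidates are part of the 100 genes,
# 30 of which are in the pathway.
p_classic = enrichment_p(8, 10, 30, 100)

# Table with the list counted twice (8, 30, 2, 70): its margins
# describe 110 genes, 38 of them in the pathway.
p_double_counted = enrichment_p(8, 10, 38, 110)

print(f"classic:        p = {p_classic:.2e}")
print(f"double counted: p = {p_double_counted:.2e}")
```

With these toy numbers the double-counted table gives the larger, more conservative p-value, and both constructions still point to enrichment.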