DAVID Fisher Exact Test contingency table
7.6 years ago
igor 13k

I am looking at the DAVID documentation here and specifically the contingency table: https://david.ncifcrf.gov/content.jsp?file=functional_annotation.html

In DAVID annotation system, Fisher Exact is adopted to measure the gene-enrichment in annotation terms.

In human genome background (30,000 gene total), 40 genes are involved in p53 signaling pathway. A given gene list has found that 3 out of 300 belong to p53 signaling pathway. Then we ask the question if 3/300 is more than random chance comparing to the human background of 40/30000.

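[Screenshot: DAVID contingency table. First column (user gene list): 3 in pathway, 297 not in pathway. Second column (genome): 40 in pathway, 29,960 not in pathway.]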

However, the contingency table shown adds up to 30,300. Isn't it supposed to add up to 30,000? Is that a typo or are there multiple ways to perform the Fisher Exact test?

pathway stats david • 8.0k views

Looks like a typo. Please let them know.

I will. Thanks for the suggestion.

I was mostly just concerned that maybe I am not fully understanding Fisher's exact test. I found that page while looking for examples that involve gene set enrichment.

It still shows the same contingency table.

It does not look like a typo to me.

It says the human genome has 30,000 genes, and that is 29,960 + 40 (second column). Then it says a given gene list has 300 genes, and that is 3 + 297 (first column). "Pathway" and "Not In Pathway" give the number of genes that are (and are not) involved in p53 signalling in the set of 300 genes (first column) and in the whole genome (second column). So all the numbers add up. If you sum all 4 numbers in the middle cells, of course you get 30,300, because you are summing the two sets you are comparing (30,000 is the genome, 300 is the candidate gene list). Of course the 300 genes are included in the 30,000, but that does not matter here. It is like two sets: one is your candidate set of 300 and the other is the background of 30,000.
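To see how much the two readings of the table differ in practice, here is a minimal sketch in plain Python. It assumes a one-sided (enrichment) Fisher's exact test computed from the hypergeometric distribution; the function name `fisher_right_tail` is my own and is not part of DAVID. It compares the table as displayed (cells sum to 30,300) with a version where the 300 list genes are removed from the background (cells sum to 30,000).

```python
from math import comb

def fisher_right_tail(a, b, c, d):
    """One-sided Fisher's exact test, P(X >= a), for the 2x2 table
    [[a, b], [c, d]] with rows = in pathway / not in pathway and
    columns = the two groups being compared."""
    in_pathway = a + b        # first row total
    group1 = a + c            # first column total
    n = a + b + c + d         # grand total
    upper = min(in_pathway, group1)
    tail = sum(
        comb(in_pathway, k) * comb(n - in_pathway, group1 - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n, group1)

# Table as displayed: gene list and genome treated as two separate sets.
p_displayed = fisher_right_tail(3, 40, 297, 29960)

# Table with the list genes removed from the background:
# 40 - 3 = 37 pathway genes and 29960 - 297 = 29663 other genes remain.
p_disjoint = fisher_right_tail(3, 37, 297, 29663)

print(f"as displayed (sum 30,300): p = {p_displayed:.4g}")
print(f"disjoint     (sum 30,000): p = {p_disjoint:.4g}")
```

With numbers this size the two p-values end up close to each other, which may be why the distinction is easy to overlook.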

Another example, maybe easier:

An organism has 100 fully annotated genes and I am studying local adaptation. I find 10 genes out of these 100 that are involved in adaptation (according to my results).

What the Fisher exact test in DAVID does is look for any functional enrichment against a background (which could be all the genes of your organism).

For example, of the 10 candidate genes I identified, 8 may be involved in p53 signalling and the remaining 2 not. Of course these 8 genes could also be involved in another pathway, but here we test one pathway at a time. Then imagine that in my organism 30 genes (out of the total 100, which also include my 10 adaptation genes) are involved in p53 signalling, so the remaining 70 are not. What the test does is check whether 8/10 is statistically "more" than 30/100. If it is, my adaptation genes are enriched in the p53 pathway, meaning that my set of 10 genes contains more genes involved in p53 than you would expect by chance. You could also conclude (given my stupid example) that the p53 pathway has a strong role in adaptation. Of course this is a very stupid example and p53 is not involved in adaptation.

DAVID does this for every functional category (or pathway, or GO term) in your dataset, and then it outputs all the p-values, for example: p53 signaling pathway (P = 1.24 × 10^-5, Fisher's test, Bonferroni adjusted)
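As a rough illustration of the "Bonferroni adjusted" part, here is a minimal sketch of the textbook Bonferroni correction in Python. The term names and raw p-values are made up for illustration and are not DAVID output, and I am not claiming this is exactly how DAVID implements its adjustment.

```python
def bonferroni(p_values):
    """Multiply each raw p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return {term: min(1.0, p * m) for term, p in p_values.items()}

# Hypothetical raw p-values, one per tested pathway or GO term.
raw = {"term_A": 1e-7, "term_B": 0.004, "term_C": 0.5}
adjusted = bonferroni(raw)

for term, p in adjusted.items():
    print(term, p)
```

The more terms you test, the larger the multiplier, so a term needs a smaller raw p-value to stay significant after adjustment.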

So the contingency table shown above is just a small example, looking only at 1 pathway (p53 signalling). Like Devon said below, you only test one pathway (or GO term) at a time. With my silly example the contingency table would be:

                  Candidate genes   Genome
In pathway               8            30
Not in pathway           2            70
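The toy example can also be checked by hand. Here is a minimal sketch in plain Python, assuming the question is phrased as a draw without replacement: 10 genes are picked from 100, of which 30 are in the pathway, and we ask how likely it is to see 8 or more pathway genes by chance. The function name `enrichment_p` is my own.

```python
from math import comb

def enrichment_p(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which belong to the pathway (hypergeometric upper tail)."""
    tail = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(K, n) + 1)
    )
    return tail / comb(N, n)

# Expected number of pathway genes among 10 candidates if there is no enrichment.
expected = 10 * 30 / 100

# Probability of observing 8 or more pathway genes among the 10 candidates.
p = enrichment_p(k=8, n=10, K=30, N=100)

print(f"expected by chance: {expected}")
print(f"one-sided p-value:  {p:.3g}")
```

Seeing 8 pathway genes where about 3 are expected by chance gives a small p-value, which is what "enriched" means here.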

For a classic Fisher’s exact test, you are counting some genes twice. There is another example here: https://www.pathwaycommons.org/guide/primers/statistics/fishers_exact_test/

In this case, the expression of 30 genes has been analyzed: 15 differentially expressed genes were identified and 15 genes were associated with the GO term 'DNA-templated transcription, elongation'. The totals for differential expression and gene set membership are the marginal values, as they lie on the periphery of our 2-by-2 table.

However, reading that page again now, it seems they are not showing the Fisher’s exact test, but their modified version of it. I guess that is where the confusion comes in.

6.5 years ago

Hi!

One gene can be part of more than one pathway, so the total number of genes would be higher than the number of genes in the genome. I recommend looking for the intersection between the genes in "Pathway" and "Not In Pathway".

You only test one pathway at a time, and the cells in the contingency table must sum to the total number of genes (or the relevant subset), or else you're violating assumptions of the test. I don't even know what it would mean for a gene to be both in and not in a given pathway at the same time... it'd be like a Schrödinger gene or something.

My guess is actually that the second column in the original table isn't really meant to be the second column of the contingency table, but rather the row sums. Then everything would add up correctly.
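If the second column really holds the row sums, the four cells can be recovered by subtraction. A minimal sketch in Python, using the numbers from the DAVID example; the helper name `table_from_margins` is my own.

```python
def table_from_margins(list_hits, list_size, genome_hits, genome_size):
    """Build a 2x2 table of disjoint cells from the gene-list counts and
    the genome-wide totals (the list is a subset of the genome)."""
    a = list_hits                      # in list, in pathway
    b = genome_hits - list_hits        # not in list, in pathway
    c = list_size - list_hits          # in list, not in pathway
    d = genome_size - genome_hits - c  # not in list, not in pathway
    return [[a, b], [c, d]]

table = table_from_margins(list_hits=3, list_size=300,
                           genome_hits=40, genome_size=30000)
total = sum(sum(row) for row in table)

print(table)   # [[3, 37], [297, 29663]]
print(total)   # 30000
```

Each gene now falls in exactly one cell, so the cells add up to 30,000 instead of 30,300.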

"Schrödinger gene"... I will have to work that into a presentation somehow
