I'm trying to do a Fisher's exact test. I'd like to learn if the genes that are differentially expressed in a microarray meta-analysis are enriched for genes that are transcriptionally controlled by Gene X.
What I'm trying to understand is: for my Fisher's exact test, what should all of the cells in the contingency table sum to (the "Total Number")?
This could be all the genes in the genome - but then what number do I use for that figure? The number of HUGO symbols? The number of Entrez genes (I'm using Entrez gene identifiers for my analyses)? If so, how can I find those numbers?
It could also be the number of genes covered by the probes used to identify my differentially expressed genes, or by the assay used to find the targets of Gene X. The reasoning behind this is that a microarray probe set only picks up a portion of the genes in the genome. If I am using microarray probes to identify differentially expressed genes, or if the ChIP-on-chip for Gene X only studied a certain number of locations, I will necessarily miss any genes in the unstudied part of the genome (the part for which there are no probes) that might bind Gene X or be differentially expressed. Therefore, the Total Number should represent the portion of the genome I'm studying, not the whole genome.
I know that Fisher's exact test results really depend on what answer I choose for this, so I'd like to have a good argument for whatever I pick.
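To make the "Total Number" concrete, here is a minimal Python sketch of how the four cells follow from a chosen gene universe. It assumes the gene lists are available as sets of identifiers; the function name `contingency_table` and the toy IDs are hypothetical, not taken from any real dataset. Genes outside the universe are dropped before counting, so the cells always sum to the size of the universe.

```python
def contingency_table(universe, de_genes, target_genes):
    """Build a 2x2 table restricted to the chosen gene universe.

    Rows: differentially expressed (DE) vs. not DE.
    Columns: Gene X target vs. not a target.
    """
    universe = set(universe)
    # Genes that were never assayed cannot be counted in any cell.
    de = set(de_genes) & universe
    targets = set(target_genes) & universe

    a = len(de & targets)              # DE and target
    b = len(de - targets)              # DE, not a target
    c = len(targets - de)              # target, not DE
    d = len(universe) - a - b - c      # neither
    return [[a, b], [c, d]]


# Toy example with made-up IDs; "g98" and "g99" are outside the universe.
universe = {f"g{i}" for i in range(1, 11)}
de_genes = {"g1", "g2", "g3", "g99"}
target_genes = {"g2", "g3", "g4", "g5", "g98"}

table = contingency_table(universe, de_genes, target_genes)
total = sum(sum(row) for row in table)
print(table, total)
```

Changing the universe changes only which genes are eligible and the size of the "neither" cell, which is why the choice needs a justification.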
I wasn't sure that you could!
One complication is that this is a meta-analysis of three datasets, each on a different platform with a different probe set. There are only about 10,000 genes in common across the three datasets.
So the ideal strategy is to make the cells of the contingency table sum to 10,000, right?
Taking only the common genes is safe. Alternatively, you could treat the genes not represented on a given platform as missing values. In an ideal world, detection of differentially expressed genes should not depend on the platform used, so you could consider that any gene represented on any platform has been tested. Either way, a small variation in the number of background genes is not going to make a dramatic difference to the result. In addition, don't trust vendor-supplied mappings; they can be wrong or out of date. To be accurate, you should map all the probe sequences from the different platforms to the same reference genome.
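If you want to check how sensitive your own result is to the background size, a rough sketch like the following may help. It computes the one-sided (enrichment) Fisher's exact p-value directly from the hypergeometric tail using only the standard library. The function name `fisher_enrichment_p` and all counts are made up for illustration; only the 10,000 common genes figure comes from the thread, and the 12,000 "any platform" background is an assumption. `scipy.stats.fisher_exact` with `alternative="greater"` should give the same one-sided p-value if SciPy is available.

```python
from math import comb


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact p-value for enrichment.

    a: DE and target, b: DE only, c: target only, d: neither.
    Returns P(overlap >= a) under the hypergeometric null.
    """
    n_total = a + b + c + d
    n_de = a + b
    n_targets = a + c
    denom = comb(n_total, n_de)
    upper = min(n_de, n_targets)
    # Sum the probabilities of every overlap at least as large as observed.
    numer = sum(
        comb(n_targets, k) * comb(n_total - n_targets, n_de - k)
        for k in range(a, upper + 1)
    )
    return numer / denom


# Hypothetical counts: 500 DE genes, 300 Gene X targets, 40 in both.
a, b, c = 40, 460, 260

# Background 1: only the ~10,000 genes common to all platforms.
p_common = fisher_enrichment_p(a, b, c, 10_000 - a - b - c)

# Background 2 (assumed size): every gene present on any platform.
p_union = fisher_enrichment_p(a, b, c, 12_000 - a - b - c)

print(f"common background: p = {p_common:.3g}")
print(f"union background:  p = {p_union:.3g}")
```

Only the "neither" cell changes between the two calls, so comparing the two p-values shows how much the conclusion depends on the background for your particular counts.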