Hey, I’m trying to build my own custom enrichment test, but I’m confused about some details, specifically the definition of the universe. I’m using Fisher’s exact test in Python, and the P-value I’m computing is right-tailed (i.e. the probability of finding this or a greater enrichment by chance).
Condition1 is whether the genes were implicated in my GWAS. Example: [A, B, C] are the implicated genes, while [D – H] are genes in the underlying annotation that were not implicated.
Condition2 is whether the genes are in a biological pathway. Example: [B, C, D] are the genes in the pathway, while [I – Z] are genes in other pathways that are not in this pathway.
So the annotation used for Condition1 has some, but not complete, overlap with the pathway data set used for Condition2. Specifically, genes B – D (N = 3) occur in both data sets, A and E – H (N = 5) occur only in the first data set, and I – Z (N = 18) occur only in the second data set.
Now I can imagine several ways to construct the contingency table used for the Fisher exact test:
Option 1: Use the background of each condition regardless of overlap (universe: A – Z, N = 26).
P = 0.0269, OR = 44.0
Option 2: First filter everything for existence in both data sets (universe: B – D, N = 3).
P = 1, OR = NA
Option 3: Filter everything for existence in one data set (here: filtering for the background from the GWAS, i.e. Condition1; universe: A – H, N = 8).
P = 0.2857, OR = 8
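To make the three universes concrete, here is a minimal sketch of how the contingency tables and right-tailed P-values for the toy example could be computed. It uses only the standard library (the hypergeometric tail is summed by hand); the helper names `contingency` and `right_tailed_fisher` are made up for illustration and are not from any package. `scipy.stats.fisher_exact(table, alternative="greater")` should give the same P-values.

```python
from math import comb

# Gene sets from the toy example above.
implicated = set("ABC")                       # Condition1: implicated in the GWAS
gwas_background = set("ABCDEFGH")             # all genes in the GWAS annotation
pathway = set("BCD")                          # Condition2: genes in the pathway
pathway_background = pathway | set("IJKLMNOPQRSTUVWXYZ")  # all genes in the pathway data set


def contingency(universe):
    """Return (a, b, c, d) for the 2x2 table restricted to `universe`:
    a = implicated and in pathway,     b = implicated, not in pathway,
    c = not implicated but in pathway, d = neither."""
    hits = implicated & universe
    path = pathway & universe
    a = len(hits & path)
    b = len(hits - path)
    c = len(path - hits)
    d = len(universe - hits - path)
    return a, b, c, d


def right_tailed_fisher(a, b, c, d):
    """Right-tailed Fisher P-value, P(X >= a) with fixed margins, and sample odds ratio."""
    n = a + b + c + d
    row1 = a + b          # number of implicated genes in the universe
    col1 = a + c          # number of pathway genes in the universe
    upper = min(row1, col1)
    p = sum(comb(col1, k) * comb(n - col1, row1 - k)
            for k in range(a, upper + 1)) / comb(n, row1)
    if b * c:
        odds_ratio = (a * d) / (b * c)
    else:
        # Division by zero: infinite if the numerator is positive, undefined for 0/0.
        odds_ratio = float("inf") if a * d else float("nan")
    return p, odds_ratio


universes = {
    "Option 1": gwas_background | pathway_background,  # union of both backgrounds
    "Option 2": gwas_background & pathway_background,  # genes present in both
    "Option 3": gwas_background,                       # GWAS background only
}

results = {}
for name, universe in universes.items():
    table = contingency(universe)
    p, odds_ratio = right_tailed_fisher(*table)
    results[name] = (table, p, odds_ratio)
    print(f"{name}: table={table}, P={p:.4f}, OR={odds_ratio}")
```

The resulting tables are (2, 1, 1, 22), (2, 0, 1, 0) and (2, 1, 1, 4), which reproduce the P and OR values listed for the three options. Only the last cell (genes that are neither implicated nor in the pathway) and, in Option 2, the loss of gene A differ between the options, so the choice of universe is the only thing driving the different results.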
I feel like Option 1 might better account for the actual probability of randomly drawing the genes from the respective data sets. It may also be less restrictive: a gene not being annotated in the pathway database (which is my basis for the background of Condition2) may not be grounds for excluding it (we would need to know the actual background of all genes that have been studied with regard to pathways, but that seems impossible). However, I'm not sure whether Option 1 may introduce bias in some scenarios. There are also scenarios where Option 2 would not be possible because we don't know the background at all (e.g. if Condition2 were 'genes that have been described as causal for phenotype X'). Option 3 seems like a possible middle ground (and is particularly easy if we don't know the true background of one of the conditions), and it is how I've always assumed it was done. But I'm not sure which of these is really correct.
What is the proper way to do this? Is it maybe a different option altogether?
Edit: Added an option 3.