While investigating discrepancies between the results of over-representation analyses (one-sided Fisher's exact test, a.k.a. hypergeometric test) performed with enricher() (from the clusterProfiler R package) and with web tools such as MSigDB, I realized there is an ambiguity (unaddressed, at least for me) in how the genes in the query list (e.g. upregulated genes) and the genes in the universe/background are defined:
While other tools and general workshops suggest that k should be the size of the complete query list and N the size of the universe of measurable genes (e.g. the whole transcriptome for RNA-seq), clusterProfiler (I think the most widely used library for pathway analysis in R) restricts both the query list and the universe to genes present in the annotation set in use.
That, of course, generally leads to larger p-values than the conventional approach would give. I feel that restricting the analysis to annotated genes is reasonable and more specific, but I think it's worth opening a discussion about it. Which approach do you usually use or recommend? Do you have any opinions to share?
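To make the difference concrete, here is a minimal sketch of the one-sided hypergeometric test computed under both definitions. It is not clusterProfiler's code, and all gene counts are made up for illustration. The function name and the symbols x (overlap) and M (gene-set size) are my own additions; k and N are used as above.

```python
from math import comb


def hypergeom_upper_tail(x, N, M, k):
    """P(X >= x) when drawing k genes from a universe of N genes,
    M of which belong to the gene set, and observing x of them in the draw.

    Same quantity as R's phyper(x - 1, M, N - M, k, lower.tail = FALSE).
    """
    upper = min(M, k)  # the overlap cannot exceed either list
    favourable = sum(comb(M, i) * comb(N - M, k - i) for i in range(x, upper + 1))
    return favourable / comb(N, k)


# Hypothetical numbers, chosen only to illustrate the effect.
M = 100  # genes in the pathway / gene set
x = 10   # query genes that fall in the gene set

# Conventional definition: full query list, all measurable genes as universe.
p_conventional = hypergeom_upper_tail(x, N=20000, M=M, k=500)

# Restricted definition: drop every gene without any annotation
# from both the query list and the universe.
p_restricted = hypergeom_upper_tail(x, N=8000, M=M, k=300)

print(f"conventional universe: p = {p_conventional:.3g}")
print(f"annotated-only universe: p = {p_restricted:.3g}")
```

With these made-up counts, the expected overlap by chance rises from 500 * 100 / 20000 = 2.5 to 300 * 100 / 8000 = 3.75 once unannotated genes are dropped, so the same observed overlap of 10 looks less surprising and the p-value gets larger.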
P.S. I also opened a discussion on the clusterProfiler GitHub page (https://github.com/YuLab-SMU/clusterProfiler/discussions/478).