Hi, I'm doing an Over Representation Analysis using the clusterProfiler package. When I used the enrichGO function, I obtained a dataframe with the following columns:
_ONTOLOGY: BP (in my case)
_ID: GO ID.
_Description: Description of the Biological Process.
_GeneRatio: ratio of input genes that are annotated in a term.
_BgRatio: ratio of all genes that are annotated in this term.
_pvalue: ....
_p.adjust: p-value with FDR.
_qvalue: this should be the same as the p.adjust??
_geneID: list of gene symbols belonging to each GO ID.
_Count: sum of genes belonging to each GO ID.
The problem is that I don't know how they calculate the p-value. I read their documentation and paper (clusterProfiler 4.0) and I can't find it. I also looked for the calculation in the source code. Is it the possibility that the input genes are overexpressed? Is it based on the hypergeometric test? But theoretically how do they calculate that probability?
Also, I think I found an error. I checked the probability values in the output of the enrichGO function one by one, and they do not match the values that are later plotted by the dotplot function of the "enrichplot" package. But that's another story. For now, I want to know how they calculate the p-values.
Thanks!
You can actually see in the source code how they use phyper() to get the p-values. If you run View(clusterProfiler::enrichGO), you will see that the function calls another function named enricher_internal. This is an internal function of the package; you can view its code by running View(clusterProfiler:::enricher_internal). In that function, they use phyper().
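To make the phyper() step concrete, here is a minimal sketch of a one-sided hypergeometric (over-representation) test for a single GO term. The counts are made up for illustration, and the variable names k, n, M and N are my own labels, not necessarily the ones used inside the package. I am assuming that the numerator and denominator of GeneRatio correspond to k/n and those of BgRatio to M/N.

```r
# Hypothetical counts for one GO term (not taken from real enrichGO output)
k <- 10     # input genes annotated to the term (GeneRatio numerator)
n <- 200    # input genes considered in the test (GeneRatio denominator)
M <- 150    # background genes annotated to the term (BgRatio numerator)
N <- 18000  # all background genes (BgRatio denominator)

# Over-representation p-value: P(X >= k), the probability of seeing at least
# k annotated genes when drawing n genes at random from the background.
# phyper(q, ..., lower.tail = FALSE) returns P(X > q), hence q = k - 1.
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# Same quantity obtained by summing the hypergeometric probability mass
p_check <- sum(dhyper(k:min(n, M), M, N - M, n))

# Same quantity again as a one-sided Fisher's exact test on the 2x2 table
p_fisher <- fisher.test(
  matrix(c(k, n - k, M - k, N - M - n + k), nrow = 2),
  alternative = "greater"
)$p.value

print(c(phyper = p_value, dhyper_sum = p_check, fisher = p_fisher))
```

All three numbers should agree up to floating-point error. So the p-value is the chance of getting at least as many term-annotated genes in your input list as you observed, if the list were a random draw from the background.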