Question

How does the enrichGO function calculate the p-value?

22 months ago

Ezequiel ▴ 10

Hi, I'm doing an Over Representation Analysis using the clusterProfiler package. When I used the enrichGO function, I obtained a dataframe with the following columns:

_ONTOLOGY: BP (in my case)

_ID: GO ID.

_Description: Description of the Biological Process.

_GeneRatio: ratio of input genes that are annotated in a term.

_BgRatio: ratio of all genes that are annotated in this term.

_pvalue: ....

_p.adjust: p-value with FDR.

_qvalue: this should be the same as the p.adjust??

_geneID: list of gene symbols belonging to each GO ID.

_Count: sum of genes belonging to each GO ID.

The problem is that I don't know how they calculate the p-value. I read their documentation and paper (clusterProfiler 4.0) and couldn't find it. I also looked for the calculation in the source code. Is it the probability that the input genes are overexpressed? Is it based on the hypergeometric test? And if so, how exactly is that probability calculated in theory?

I also found a discrepancy. I checked the p-values in the output of the enrichGO function one by one, and they do not match the values that are later plotted by the dotplot function of the "enrichplot" package. But that's another story. For now, I want to know how the p-values are calculated.

Thanks!

clusterProfiler p-value ORA DGE R • 2.3k views

updated 22 months ago by Papyrus ★ 3.0k • written 22 months ago by Ezequiel ▴ 10

You can actually see in the source code how they use phyper() to get the p-values. If you run View(clusterProfiler::enrichGO), you will see that the function calls another function named enricher_internal. This is an internal function of the package, so you need three colons to view its code: View(clusterProfiler:::enricher_internal). In that function, they use phyper().

22 months ago by Papyrus ★ 3.0k

score 3 · Accepted Answer · 2023-02-14

According to the documentation, over-representation analysis in clusterProfiler uses the hypergeometric distribution to calculate the p-value:

5.2 Over Representation Analysis

Over Representation Analysis (ORA) (Boyle et al. 2004) is a widely used approach to determine whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, e.g. a list of differentially expressed genes (DEGs).

The p-value can be calculated by hypergeometric distribution.

$$p = 1 - \sum_{i=0}^{k-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$

In this equation, N is the total number of genes in the background distribution, M is the number of genes within that distribution that are annotated (either directly or indirectly) to the gene set of interest, n is the size of the list of genes of interest and k is the number of genes within that list which are annotated to the gene set. The background distribution by default is all the genes that have annotation. P-values should be adjusted for multiple comparison.
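To make the formula concrete, here is a minimal sketch that evaluates it term by term. It is written in Python with only the standard library so that every binomial coefficient is visible; it is not the package's code. The function names `ora_pvalue` and `parse_ratio` and the example numbers are made up for illustration, and the sketch assumes that GeneRatio is reported as k/n and BgRatio as M/N. In R, the same upper tail P(X >= k) should be what `phyper(k - 1, M, N - M, n, lower.tail = FALSE)` returns.

```python
from fractions import Fraction
from math import comb


def ora_pvalue(k, n, M, N):
    """Upper-tail hypergeometric probability P(X >= k).

    k: genes in the input list annotated to the term
    n: size of the input list
    M: background genes annotated to the term
    N: total number of background genes
    """
    # Sum the numerators of P(X = i) for i = 0 .. k-1, then take 1 - sum / C(N, n),
    # exactly as in the formula. Fractions avoid rounding until the final step.
    lower = sum(comb(M, i) * comb(N - M, n - i) for i in range(k))
    return float(1 - Fraction(lower, comb(N, n)))


def parse_ratio(text):
    """Split a string such as '3/5' into the integer pair (3, 5)."""
    numerator, denominator = text.split("/")
    return int(numerator), int(denominator)


# Hypothetical row of an enrichment table (values invented for illustration).
k, n = parse_ratio("3/5")    # GeneRatio, assumed to be k/n
M, N = parse_ratio("4/10")   # BgRatio, assumed to be M/N
print(ora_pvalue(k, n, M, N))  # 11/42, roughly 0.262
```

With these toy numbers, the tail can be checked by hand: the terms for i = 3 and i = 4 contribute 60 and 6 out of C(10, 5) = 252 possible draws, so p = 66/252 = 11/42. This is the raw p-value only; the multiple-testing adjustment mentioned in the quoted documentation is a separate step.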