Question

Proper way(s) to perform enrichment analysis in R

0

Entering edit mode

3.2 years ago

cwwong13 ▴ 40

I am not sure what is the proper way to carry out over-representation analysis (and also gene set enrichment analysis) for RNAseq data. Ideally, the analysis can be performed in R, otherwise, if the software/ platform can export the output file (also include all the non-statistical-significant term) will also be sufficient.

I come up with this question is mainly because I recently discovered a website: https://maayanlab.cloud/Enrichr/

I found that this website host a list of pathway/ ontology databases (quite comprehensively). However, the p-value calculated there cannot be reproduced with two other methods I previously used: using the R package clusterProfiler and the web of g:Profiler. Even more surprising to me, the results from clusterProfiler and g:Profiler are also not the same, i.e. I got three distinct results from three analysis platforms while I am using the same gene list, and same gmt file.

For reproducibility purposes (only the R section), here are the codes I used for clusterProfiler (I just recently upgraded to the latest version (v4.0.5):

interests <- c("TNF", "PTGS2", "ZFP36", "GHRL", "CDH6", "TNFRSF1A", "IL6", "CXCL8", "CD27", "PTHLH", "IL1B", "CCL2", "PPARA", "CCL3", "AGT", "CCL5", "IL10", "ICAM1", "EDN1", "CAT", "CXCL10", "NOS2", "COL3A1",  "IL1RN", "TIMP1", "FABP4", "MMP3", "RELA", "COL1A1", "VCAM1", "CALCA", "CASP3", "TGFB1", "SERPINE1", "IL18", "IGF1", "VEGFA", "SLC22A1", "SOD2", "LPL", "NOS3", "PTGS1", "IFNG", "FN1", "MMP8")
wp <- read.gmt(url("https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName=GO_Cellular_Component_2021"))
ewp <- enricher(interests, TERM2GENE=wp[,c("term", "gene")], TERM2NAME=wp[,c("term", "term")])
head(ewp)

Then, the most significant term is platelet alpha granule lumen (GO:0031093) with pvalue = 1.658067e-8, p.adjust = 8.456139e-7, and qvalue = 6.283200e-7

I then use the same gene list and query against the g:Profiler by manually uploading the same gmt file. The platelet alpha granule lumen (GO:0031093) has p.adjust = 1.642×10-6.

Finally, the original, enrichr returns the pavlue = 9.476e-9 and p.adjust = 5.780e-7.

Does anyone know what is the underlying cause of the differences? I tried to look up the underlying formula for the clusterProfiler, and find that it uses

pvalues <- apply(args.df, 1, function(n) phyper(n[1], n[2], n[3], n[4], lower.tail = FALSE))

More questions are thus raised:

Should we use a two-tail test for the null hypothesis? I read the article Isabelle Rivals, et al mentioned we probably should not use a one-tail test.
Assuming the one-tail test is justified, how can enrichr get a much lower p-value (I assume maybe the g:Profiler used the two-tail test, but not sure where to check their internal algorithm)?
enrichr stated that they use fisher's exact test, but the results are obviously off. I am not sure why.
Generally, do you have advice on where to perform the GSEA instead of the over-representation analysis I did here.

GSEA analysis enrichment R RNAseq • 2.5k views

ADD COMMENT • link updated 3.2 years ago by bigomics.team ▴ 90 • written 3.2 years ago by cwwong13 ▴ 40

1

Entering edit mode

4.Generally, do you have advice on where to perform the GSEA instead of the over-representation analysis I did here.

Check this post: Gene Set Enrichment Analysis

ADD REPLY • link 3.2 years ago by Hamid Ghaedi 3.3k

score 3 · Answer 1 · 2021-09-22

3

Entering edit mode

3.2 years ago

bigomics.team ▴ 90

If you look only for enrichment a one-tail should be fine (and actually correct), you would need two-tail if you also for test for "depletion". The tests may differ because they use either Fisher's exact or the Chi-squared test (but they should become similar if N is large). Also how they define the "background" is very important and very much affects the p-value.

To perform GSEA, I use the fGSEA (fast GSEA) R package. It is much-much faster than the original GSEA implementation of the Broad. It's a great package and the results are exactly the same as the original.

ADD COMMENT • link 3.2 years ago by bigomics.team ▴ 90

0

Entering edit mode

One follow-up question is: do you think the gene set size also matters? I find that many software has the default set size of 10 - 500 genes per set. I am not sure whether that also needs to be optimized.

ADD REPLY • link 3.2 years ago by cwwong13 ▴ 40

0

Entering edit mode

I find 10-500 reasonable. There is no "optimum" value. Sets that are too small may become unreliable, in most cases the p-value will reflect that but it is mostly unnecessary computation. Also sets that are greater than 500 become very generic and do not say much.

ADD REPLY • link 3.2 years ago by bigomics.team ▴ 90