Hi!
I have a set of 1435 genes with 445 DE and I am using enricher to analyse a specific group of pathways like this.
em <- enricher(genes, minGSSize=1,
maxGSSize = 20,
universe = names(gene_list),
TERM2GENE=hs_kegg_df,
pAdjustMethod = "bonferroni")
The hs_kegg_df is a set of lipid metabolism pathways. I got the results of em I found strange that none of the pathways had significant p values, so I analysed "KEGG_GLYCEROLIPID_METABOLISM" and it contains 59 genes where 2 are in my gene universe and 2 are DE.
I calculated the p value by hand by doing the hypergeometric test.
So N: Background genes -> 1435, n: DGE genes -> 445, M: Genes annotated in the geneset -> 2, S Geneset -> 59, x - Nº of DGE genes in S -> 2.
phyper(q = 2-1, m = 445, n = 1435 - 445, k = 2, lower.tail = FALSE)
The result was 0.09601563 and the enricher p value without adding the universe [EDIT] is 0.267, still not significant but different from enricher.
I wanted to ask how does the Enricher calculate the p value and if I got the formula wrong in any part?
Best Regards, Manuel
Can you post the enricher output for this one, especially the GeneRatio and BgRatio columns? It usually comes down to what people think is the universe compared to what it actually is in terms of intersection with the pathway database. The formula you use appears correct, and the *Ratio columns tell which numbers the test used.
Hello ATpoint, thank you for your feedback.
So the [3] position contains "KEGG_GLYCEROLIPID_METABOLISM" as you can see it has 59 genes included the gene ratio is 2/29 and the Bg Ratio is 2/84.
After reading to bk11 (thank you very much for the extensive answer). The pvalue expected from enricher is actually 0.1165 and not what I mentioned above.
Being given by phyper(2-1, 29, 84-29, 2, lower.tail = F)
84 is the number of genes from the universe that appear on the pathway set.
29 is the number of DGE genes that appear on pathway set.
The last number 2 is the number of genes from specific pathway that appear on universe.
And 2-1 (x=2) are the DGE genes that appear on specific pathway
With this I ask, let's say I am only analyzing one pathway, the p value always be most of the times 1...... , if the number of background genes is equal to the anotated genes the hypergeometric test will give 1-0=1.
This because the number of genes that appear on pathway set can be all DE this way making phyper(2-1, 29, 0, 2, lower.tail = F) which is equal to 1. But if another random pathway is added so that the it compares with one more pathway, since there is more than one pathway-set, all the DGE genes are not equal to all the genes in the universe therefore, statistically speaking, adding one more pathway would make this specific pathway significant... How can I make this test work for one specific pathway? Thank you.
With only 1, 2 or 3 that are DE and intersect these pathways I would not do overrepresentation analysis in the first place. It's so few genes, the pvalues are pointless to me. Either look at the genes manually and decide whether they're known to be critical to the process, such as rate limiting enzymes, or do GSEA to make statements about general shifts of a pathway.
Thank you @ATpoint . However since I have 1500 genes and not the entire Gene set, GSEA cannot be used right?
You do DE analysis, this should be done on all genes from an RNA-seq experiment (maybe after prefiltering), and the DE analysis returns a results table with pvalue, logFC etc. Rank by signed pvalue or similar metrics and run GSEA.
Yes, I understand ATpoint, i did exactly that, first DGE using limma voom between groups, yet since I don't have the entire gene set of possible genes I think is not recommended to use GSEA.
I have no idea what you mean. You have DE results you can rank and there are databases such as kegg and reactome that can serve as gene set sources. That is what gsea needs. I think it would help to define what you think a geneset is, because it seems to me there is a misunderstanding.