Hello everyone,
I've recently performed a GSEA analysis (using fGSEA in R) on an RNA-seq data set, consisting of 4 pairwise comparisons: 1.) drug 1 vs. untreated, 2.) drug 2 vs. untreated, and 3.) drug 1 + 2 vs. untreated).
Using edgeR, I found that there were ~2000 differentially expressed genes (DEGs) for drug 1 vs. untreated, ~7000 DEGs for drug 2 vs. untreated, and >9000 genes for drug 1+2 vs. untreated. After ranking the full list of genes in each of the 3 signatures (by the following formula: -log10(adjusted p-value)/sign(logFC)), I performed GSEA using fGSEA (v. 1.10.1) in R, against the Human GO and Reactome databases (~3700 pathways). After applying a Benjamini-Hochberg p-value adjustment, I obtained the following number of ‘significant’ gene sets/pathways (adj. p < 0.05): 1.) 70 for drug 1 vs. no tx, 2.) 0 for drug 2 vs. no tx and 3.) 106 for drug 1 + 2.
I find it strange that there aren’t any ‘significant’ gene sets obtained for the ‘drug 2 vs. no tx’ signature, given the large number of DEGs found. In fact, this signature shared more than 5000 significant DEGs (with same sign(logFC) direction) with the combo drug 1 + 2 treatment.
I have 2 questions: 1.) Is it possible to have no significant pathways despite having such a large fraction of the transcriptome being differentially expressed? If so, what conditions would account for that? 2.) I ranked my genes using -log10(adjusted p-value)/sign(logFC) (instead of raw p-value, as noted here), as I only had access to logFC and B-H adjusted p-values for each of the gene signatures. Is it appropriate to rank on adj. p-value? Could this account for the few enriched pathways observed for ‘drug 2 vs. no tx’?
Thanks, Andrew
I've upgraded to using fgseaMultilevel (fgsea v. 1.14) on my DEG lists (each of which contain 2000-3000 significant DEGs (adjP<0.05 each)) and am getting few, if any significantly enriched gene sets. Originally I was testing against ~5500 gene sets from REACTOME, KEGG, and GO, so I thought I may be heavily penalized by testing against so many gene sets, so I reduced the number by decreasing "maxsize" to 100, and only running against REACTOME, but the results have not really improved.
I also tried fgseaSimple with nperm as high as 100,000 and that actually made things "worse".
The number of DEGs doesn't correlate that well with the number of pathways. From what we saw, sometimes when you have some disruption of gene expression machinery, like CTCF or chromatin, there are many deferentially expressed genes, but it's not specific to any particular pathway.
That makes sense and is aligned with some of our ideas. My treatments include various retinoids, so I would have expected (at least) some significant enrichment for retinoic acid-specific transcriptional programs. Would a loosening of our p-value (adjP) cut-offs be appropriate (e.g. adjP<0.2)?
No, I don't recommend to loos P-value threshold, unless you do some kind of experimental validation follow-up. GSEA is known to be too sensitive, so if anything it make sense to use more strict threshold.
You can try to check out some other pathway databases. Just try to upload your regulated genes (one direction at a time) into Enrichr (https://maayanlab.cloud/Enrichr/). I had an experience with Vitamin D receptor, which as far as I understand should be similar, and we didn't have much hits in Reactome/KEGG/GO, but there was a nice pathway in WikiPathway and some chip-seq based pathways from ChEA.