Question

GSEA for an RNA-seq dataset


3.6 years ago

1961012 ▴ 20

I have RNA-seq count data. I have already identified differentially expressed genes (DEGs) and run a protein-protein interaction (PPI) analysis on those DEGs.

I would like to perform GSEA for comparison with the previous analysis, and I am confused about what I should do:

Should I take all genes in the RNA-seq count data for GSEA?
Should I take only the DEGs for GSEA?
Should I take DEGs at the same cut-off (p-value) that was used to select the DEGs as input for PPI, or is there a preferable cut-off?

Thank you in advance!

DEGs GSEA RNA-seq • 2.3k views

updated 11 months ago by Ram 44k • written 3.6 years ago by 1961012 ▴ 20

score 2 · Answer 1 · 2021-04-15

I tried to explain my understanding of (f)GSEA, and why one should use all (expressed) genes, in this Bioconductor support post:

https://support.bioconductor.org/p/9135326/#9135328

(...) You should use all genes, or at least all relevant genes. In DESeq2 that might be the genes surviving the independent filtering (i.e. those whose padj is not NA), or in edgeR those that survive filterByExpr. GSEA tests whether a gene set as a whole (rather than individual genes, as we test in a pairwise comparison with the mentioned tools) shows evidence of being over- or underexpressed. A gene set can, as a whole, show evidence of being overexpressed even though no gene in it needs to be individually overexpressed (i.e. significant) in a pairwise comparison. Pairwise DE testing and GSEA simply ask two different types of questions. For DESeq2 I would therefore use all genes surviving the independent filtering, e.g. ranked by the moderated, shrunken LFC obtained from lfcShrink.

As we rank genes for GSEA, we obviously lose the information about the magnitude of the ranking metric (here the fold changes), so GSEA informs about global tendencies. I think it makes sense to always pair GSEA results with other information, like the fold changes from DESeq2. If your GSEA result is significant but the DESeq2 fold changes for the genes of the particular pathway you are fgsea-ing against turn out to be tiny (very close to zero), then it is questionable whether the result is biologically meaningful, even though the analysis was significant in GSEA rank space. But I think the practice of combining different analysis methods to make a confident statement always makes sense, not just in the GSEA context. Does that make sense to you?
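To make the ranking step concrete, here is a minimal Python sketch (the real workflow would be R with DESeq2, so treat this purely as an illustration). The keys `gene`, `log2FoldChange` and `padj` mirror DESeq2's result columns; the function name `rank_by_lfc` and the toy values are made up for this example.

```python
import math

def _is_missing(x):
    """True for None or NaN, i.e. a value that was filtered out upstream."""
    return x is None or (isinstance(x, float) and math.isnan(x))

def rank_by_lfc(results):
    """Rank genes by (shrunken) log2 fold change, highest first.

    `results` is a list of dicts with keys 'gene', 'log2FoldChange', 'padj'.
    Genes with a missing padj are dropped, standing in for genes removed by
    independent filtering. No significance cut-off is applied.
    """
    kept = [
        r for r in results
        if not _is_missing(r["padj"]) and not _is_missing(r["log2FoldChange"])
    ]
    kept.sort(key=lambda r: r["log2FoldChange"], reverse=True)
    return [(r["gene"], r["log2FoldChange"]) for r in kept]

# Toy data (hypothetical): only B and C would pass a typical padj < 0.05 cut-off,
# but A and E stay in the ranked list because GSEA wants all tested genes.
toy = [
    {"gene": "A", "log2FoldChange": 0.4, "padj": 0.30},
    {"gene": "B", "log2FoldChange": -1.2, "padj": 0.01},
    {"gene": "C", "log2FoldChange": 2.0, "padj": 0.001},
    {"gene": "D", "log2FoldChange": 0.1, "padj": None},  # filtered out
    {"gene": "E", "log2FoldChange": -0.3, "padj": 0.60},
]

ranked = rank_by_lfc(toy)
for gene, lfc in ranked:
    print(gene, lfc)
```

Note that the only genes removed are those without an adjusted p-value; non-significant genes remain in the list that would be passed on to GSEA.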

score 1 · Answer 2 · 2021-04-15

You should read up on GSEA; it sounds like you don't have a good grasp of what the process involves, which could lead you to misinterpret the results. The original paper gives a good overview of the theory, and this page gives some good tips on providing a rank statistic.

In short though, you should use all genes.
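One way to see why the full list matters is to look at a simplified running-sum statistic. The sketch below uses the unweighted (Kolmogorov-Smirnov-like) form of the enrichment score, which I am assuming as a stand-in here; it is not what fgsea or the GSEA software compute by default (those weight hits by the ranking metric), and the name `enrichment_score` is hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score over a ranked gene list.

    Walk down the list: step up by 1/n_hit at each gene in the set and
    down by 1/n_miss at each gene outside it. Return the signed value of
    the largest deviation from zero.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        # Without genes both inside and outside the set there is no
        # background to compare against.
        raise ValueError("need genes both inside and outside the gene set")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Ten ranked genes, g1 (most up) to g10 (most down); toy example.
genes = [f"g{i}" for i in range(1, 11)]

print(enrichment_score(genes, {"g1", "g2", "g3"}))   # set concentrated at the top
print(enrichment_score(genes, {"g8", "g9", "g10"}))  # set concentrated at the bottom
```

The down-steps come from genes outside the set, so the score depends on where the set sits relative to everything else that was measured. If only DEGs are supplied, that background is gone.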

score 1 · Answer 3 · 2021-04-15


3.6 years ago

Zhilong Jia ★ 2.2k

Use all genes, ranked by signed p-value (where the sign is taken from the logFC), with the GSEAPreranked module in GSEA.

ref: Gene Ranking of RNA-Seq Data via Discriminant Non-Negative Matrix Factorization
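A minimal sketch of such a ranking in Python, assuming the common form sign(logFC) × −log10(p) for the signed p-value (the answer does not spell out the exact formula). The function names, the `floor` guard against p = 0 and the toy values are hypothetical; the output is the two-column, tab-separated gene/score layout used for .rnk files.

```python
import math

def signed_p_score(log_fc, p_value, floor=1e-300):
    """sign(logFC) * -log10(p); `floor` avoids log10(0) for p-values of exactly 0."""
    p = max(p_value, floor)
    sign = (log_fc > 0) - (log_fc < 0)  # +1, 0 or -1
    return sign * -math.log10(p)

def to_rnk(rows):
    """Build ranked-list text from (gene, logFC, p-value) tuples.

    Every gene is kept (no significance cut-off); rows are sorted from the
    most strongly up-regulated to the most strongly down-regulated.
    """
    scored = sorted(
        ((gene, signed_p_score(lfc, p)) for gene, lfc, p in rows),
        key=lambda t: t[1],
        reverse=True,
    )
    return "\n".join(f"{gene}\t{score:.6g}" for gene, score in scored) + "\n"

# Toy input (hypothetical values): gene C is far from significant but stays in.
rows = [("A", 1.5, 0.01), ("B", -2.0, 0.001), ("C", 0.2, 0.5)]
rnk_text = to_rnk(rows)
print(rnk_text, end="")
```

Saving `rnk_text` to a file with a `.rnk` extension should give something that can be loaded as the ranked list for a preranked analysis.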
