Question

interpreting results from pathway analysis


18 months ago

kng ▴ 40

I am performing pathway analysis on RNA-seq results, using clusterProfiler in R with KEGG pathways. I obtained two sets of results from the following call:

gseKEGG(geneList     = my_gene_list,
               organism     = kegg_organism,
               nPerm        = 50000,
               minGSSize    = 3,
               maxGSSize    = 800,
               pvalueCutoff = 0.05,
               pAdjustMethod = "none",
               keyType       = "ncbi-geneid")

(1) my_gene_list was a filtered list of the top ~400 genes with the highest log2 fold change.

(2) my_gene_list was the entire gene list (~18K genes), sorted by log2 fold change.

I was expecting both methods to report the same top activated/suppressed pathways, but method 1 puts some pathways of interest at the top of the list, while the results from method 2 look mostly like garbage. How do I interpret the results from these two approaches?

RNA-seq kegg clusterprofiler GSEA pathway-analysis • 1.6k views

Updated 16 months ago by Ram 44k • written 18 months ago by kng ▴ 40


You need to use the full ranked gene list for gseKEGG/GSEA analysis. For over-representation analysis, use a subset of the genes with enrichKEGG() (or enrichMKEGG() for KEGG modules).

I have a video on it too, if you want to check it out.

Reply written 18 months ago by Ming Tommy Tang ★ 4.5k
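To make the distinction in the reply above concrete, here is a minimal sketch of the two input shapes. The `res` table, its column names, and the significance cutoffs are invented for illustration and are not from the question. The clusterProfiler calls are left commented out because they need the package and a connection to KEGG; `kegg_organism` is the variable from the question.

```r
# Toy differential-expression table; IDs and values are made up for illustration.
res <- data.frame(
  entrez = c("1001", "1002", "1003", "1004", "1005", "1006"),
  stat   = c(6.2, -4.1, 0.3, -0.1, 7.5, -3.0),   # e.g. a Wald statistic
  log2FC = c(2.5, -1.8, 0.1, -0.05, 3.1, -2.2),
  padj   = c(0.001, 0.01, 0.80, 0.95, 0.0005, 0.03),
  stringsAsFactors = FALSE
)

# GSEA input: ALL genes, as a named numeric vector sorted in decreasing order.
gsea_input <- sort(setNames(res$stat, res$entrez), decreasing = TRUE)

# ORA input: only the IDs of genes called significant (hypothetical cutoffs),
# with every tested gene supplied as the background universe.
ora_genes <- res$entrez[res$padj < 0.05 & abs(res$log2FC) > 1]
universe  <- res$entrez

# With clusterProfiler installed, the calls would look roughly like this:
# gsea_res <- clusterProfiler::gseKEGG(geneList = gsea_input,
#                                      organism = kegg_organism,
#                                      keyType  = "ncbi-geneid")
# ora_res  <- clusterProfiler::enrichKEGG(gene     = ora_genes,
#                                         universe = universe,
#                                         organism = kegg_organism,
#                                         keyType  = "ncbi-geneid")
```

Passing a 400-gene vector as `geneList` to gseKEGG() mixes the two designs: the function still runs, but it only sees the extreme tail of the ranking.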

score 1 · Answer 1 · 2023-07-12

Here are some thoughts:

1- When using GSEA, the input has to be a ranked list of all the genes obtained from RNA-seq, so you should not pass a subset of genes, such as the top ~400, to the function.

2- To rank the genes, use a metric that ranks them effectively, e.g., the Wald statistic (from DESeq2). Ranking on logFC alone is best avoided: it ignores statistical uncertainty, so you may end up assigning higher ranks to genes with less statistical support, and if you rank on its absolute value it also loses the direction of dysregulation. If you would like to use logFC in the ranking, you can define a new metric that combines its sign with the p-value, such as: metric = -log10(p-value) * sign(log2FC)

So neither of the approaches that you mentioned gives you the correct output.
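As a rough illustration of point 2, the signed p-value metric could be computed as in the sketch below. The `res` table stands in for DESeq2-style results; its values and column names are assumptions made up for the example. The product form `-log10(p) * sign(log2FC)` is used here as a common way to write a signed significance score.

```r
# Toy table standing in for DESeq2-style results; values are invented.
res <- data.frame(
  entrez = c("1001", "1002", "1003", "1004"),
  log2FC = c(4.0, -2.0, 1.5, -3.0),
  pvalue = c(0.20, 1e-8, 1e-6, 1e-4),
  stringsAsFactors = FALSE
)

# Drop genes without a p-value (real results tables often contain NA rows).
res <- res[!is.na(res$pvalue), ]

# Guard against p-values of exactly 0, which would give an infinite score.
p <- pmax(res$pvalue, .Machine$double.xmin)

# Signed significance: magnitude from the p-value, direction from the fold change.
res$metric <- -log10(p) * sign(res$log2FC)

# Named vector sorted in decreasing order, the shape gseKEGG() expects for geneList.
ranked <- sort(setNames(res$metric, res$entrez), decreasing = TRUE)
print(ranked)
```

In this toy table, gene 1001 has the largest fold change but a weak p-value, so it ranks below gene 1003, which has a smaller but well-supported change.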