I am performing pathway analysis on RNA-seq results. I am using clusterProfiler in R with KEGG pathways. I obtained two sets of results using
gseKEGG(geneList = my_gene_list,
organism = kegg_organism,
nPerm = 50000,
minGSSize = 3,
maxGSSize = 800,
pvalueCutoff = 0.05,
pAdjustMethod = "none",
keyType = "ncbi-geneid")
(1) my_gene_list was a filtered list of the top ~400 genes with the highest log2 fold change
(2) my_gene_list was the entire gene list (~18K genes) sorted by log2 fold change.
I was expecting both methods to report the same pathways as the top activated/suppressed ones, but method 1 puts some pathways of interest at the top of the list, while the results from method 2 look mostly like garbage. How do I interpret the results from these two approaches?
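You need to use the full ranked gene list for gseKEGG/GSEA analysis. GSEA computes its enrichment score by walking down the whole ranking, so a list truncated to the top ~400 genes no longer gives it a meaningful background. For over-representation analysis, use a subset of the genes with enrichKEGG() (or enrichMKEGG() if you want KEGG modules rather than pathways).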
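To make the difference between the two inputs concrete, here is a minimal sketch, not a definitive workflow. The data frame `res`, its toy values, the |log2FC| > 1 cutoff, the `run_kegg` flag, and the "hsa" organism code are all assumptions for illustration; swap in your own results and your own `kegg_organism`.

```r
# Hypothetical DE results: Entrez gene IDs with log2 fold changes (toy values).
res <- data.frame(
  entrez = c("7157", "1956", "672", "5290", "207", "4609"),
  log2FoldChange = c(2.1, -0.3, -1.8, 0.05, 1.2, -2.6),
  stringsAsFactors = FALSE
)

# GSEA input: ALL tested genes as a named numeric vector, sorted decreasing.
ranked_genes <- res$log2FoldChange
names(ranked_genes) <- res$entrez
ranked_genes <- sort(ranked_genes, decreasing = TRUE)

# ORA input: IDs of a selected subset, plus the background (universe) of tested genes.
# The |log2FC| > 1 cutoff is an arbitrary choice for this sketch.
top_genes <- res$entrez[abs(res$log2FoldChange) > 1]
universe <- res$entrez

# Assumption: set run_kegg to TRUE only if clusterProfiler is installed and
# KEGG is reachable over the network. "hsa" (human) is assumed here.
run_kegg <- FALSE
if (run_kegg && requireNamespace("clusterProfiler", quietly = TRUE)) {
  # Ranked full list goes to gseKEGG.
  gsea_res <- clusterProfiler::gseKEGG(
    geneList = ranked_genes, organism = "hsa",
    keyType = "ncbi-geneid", pvalueCutoff = 0.05
  )
  # Gene subset (with background) goes to enrichKEGG.
  ora_res <- clusterProfiler::enrichKEGG(
    gene = top_genes, universe = universe, organism = "hsa",
    keyType = "ncbi-geneid", pvalueCutoff = 0.05
  )
}
```

With only six toy genes neither call would return anything meaningful; the point is the shape of the inputs: a sorted named vector of every gene for gseKEGG, and a plain vector of selected IDs for enrichKEGG.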
I have a video on it too, if you want to check it out.