Hi, I'm trying to start an R-based project where I input cancer patient data to find DEGs and ultimately search for possible pharmaceutical targets. My question is whether I can feed the same data into both GSEA and a DEG analysis so that the two confirm each other's conclusions. Right now, I'm only using DEG analysis (voom + limma in R) to filter/select significant genes.
I know that these two analyses are completely different: GSEA takes a priori gene sets and reports which gene SETS are significant for each phenotype, whereas DEG analysis looks at individual GENES (not gene sets) and gives a list of differentially expressed genes for each phenotype.
However, I was wondering if these can work together, so that we first use GSEA to select significant gene sets and then use DEG analysis to test the individual genes in those enriched gene sets. I thought this would help because a DEG analysis on its own inherently lacks biological significance, while GSEA has biological significance but cannot detect anything at the level of individual genes. So why not make them work together to complement each other's strengths/weaknesses?
For example, I would run GSEA for two different cancer types (phenotypes) A and B, and find that gene set X is overexpressed. Then I would look into which individual genes contribute the most to the enrichment score for gene set X, and run a DEG analysis on those genes. If I find genes that are significantly overexpressed in a specific type of cancer, each of those could itself be a probable target.
I do recognize the difficulty of running these together: there are many different packages that use different methods (e.g. normalization methods). But putting these problems aside, I'm just asking: if I could get this right, would it be a good idea?
Thank you for your input :)
You should clarify this part. The result is actually an enrichment score with a specific direction (up or down). Not all genes are in the same direction, but there should be an enrichment at one end of the spectrum (see also: "leading-edge genes").
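To make the "direction" and "leading-edge" points concrete, here is a minimal sketch of a GSEA-style running-sum enrichment score. It is an illustration only, not the GSEA software's implementation: the function name `enrichment_score`, the weighting exponent `p` and the toy gene names are all assumptions made for this example.

```python
def enrichment_score(ranked, gene_set, p=1.0):
    """Running-sum enrichment score (illustrative sketch).

    ranked   -- list of (gene, score) pairs sorted by score, highest first
    gene_set -- set of gene names
    p        -- weighting exponent (hypothetical default of 1.0)
    Returns (signed ES, leading-edge genes).
    """
    in_set = [gene in gene_set for gene, _ in ranked]
    n_hits = sum(in_set)
    n_miss = len(ranked) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    hit_total = sum(abs(s) ** p for (_, s), hit in zip(ranked, in_set) if hit)
    if hit_total == 0:
        raise ValueError("all gene-set members have a zero score")

    running, best, best_idx = 0.0, 0.0, 0
    for i, ((_, score), hit) in enumerate(zip(ranked, in_set)):
        if hit:
            running += abs(score) ** p / hit_total  # step up on a set member
        else:
            running -= 1.0 / n_miss                 # step down otherwise
        if abs(running) > abs(best):
            best, best_idx = running, i             # largest deviation from zero

    # Leading edge: set members before the peak (positive ES)
    # or after the trough (negative ES).
    if best >= 0:
        window = zip(ranked[: best_idx + 1], in_set[: best_idx + 1])
    else:
        window = zip(ranked[best_idx:], in_set[best_idx:])
    leading_edge = [gene for (gene, _), hit in window if hit]
    return best, leading_edge


# Toy ranked list (hypothetical genes and scores).
toy = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0), ("E", -2.0), ("F", -3.0)]
es_up, edge_up = enrichment_score(toy, {"A", "B"})      # set sits at the top
es_down, edge_down = enrichment_score(toy, {"E", "F"})  # set sits at the bottom
```

The sign of the score tells you at which end of the ranking the set is concentrated, and the leading-edge list is the subset of genes that drives that score.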
Hey, thanks for your kind input :) I do understand your method and why you would suggest it. However, don't you think both orders would work (GSEA first then DEG, or DEG first then GSEA)? The reason I thought GSEA first then DEG would work better is that running DEG first on tens of thousands of individual genes is simply too inefficient; GSEA would help reduce dimensionality for the subsequent DEG test. Though it would be interesting to compare the results of DEG first then GSEA vs GSEA first then DEG...
Also, when you say the result of a GSEA is not that a specific pathway is up- or down-regulated, you're basically saying that the reason why we do GSEA is just to see which pathway (gene sets) is affected by the phenotype, right? So basically up-regulation and down-regulation isn't significant in GSEA...
Hi! I don't understand how you would do a GSEA without having a list of differentially expressed genes... What would you use as input for the analysis? GSEA starts from a list of DEGs, so in any case you need to do the DEG analysis before running GSEA. So, you do the DEG analysis and find genes up-regulated and down-regulated in tumor vs normal. Now you have a list of DEGs that you can analyse in different ways.
One, for example, is to decide on cutoffs for p-value and logFC to define what is really DE between the two conditions, say p-value < 0.05 and |logFC| > 1.5. This gives you genes that you can analyze further with an enrichment analysis such as GO or KEGG pathways, on only the upregulated genes for example (or only the downregulated ones). In this way you can find pathways that move in the same direction as your genes (+ or -).
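As a rough sketch of that cutoff-then-enrichment route: the snippet below splits results into up and down lists using p-value < 0.05 and |logFC| > 1.5, then scores one pathway with a one-sided hypergeometric test (the usual basis of over-representation analysis). The function names, the `(logFC, p_value)` tuple layout and the toy values are assumptions for illustration; in practice an adjusted p-value is commonly used for the cutoff.

```python
from math import comb


def split_degs(results, p_cut=0.05, lfc_cut=1.5):
    """results maps gene -> (logFC, p_value); returns (up, down) gene sets."""
    up = {g for g, (lfc, p) in results.items() if p < p_cut and lfc > lfc_cut}
    down = {g for g, (lfc, p) in results.items() if p < p_cut and lfc < -lfc_cut}
    return up, down


def hypergeom_upper_tail(k, n_universe, n_pathway, n_selected):
    """P(X >= k): chance of seeing at least k pathway genes among n_selected
    genes drawn without replacement from a universe of n_universe genes,
    of which n_pathway belong to the pathway."""
    total = comb(n_universe, n_selected)
    upper = min(n_pathway, n_selected)
    favourable = sum(
        comb(n_pathway, i) * comb(n_universe - n_pathway, n_selected - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Toy DEG table (hypothetical genes and values).
toy_results = {
    "A": (2.0, 0.01),   # up, passes both cutoffs
    "B": (-2.0, 0.01),  # down, passes both cutoffs
    "C": (2.0, 0.20),   # large change but not significant
    "D": (0.5, 0.001),  # significant but small change
}
up_genes, down_genes = split_degs(toy_results)

# Hypothetical pathway: how surprising is it that the up list hits it?
pathway = {"A", "C"}
p_over = hypergeom_upper_tail(
    k=len(up_genes & pathway),
    n_universe=len(toy_results),
    n_pathway=len(pathway),
    n_selected=len(up_genes),
)
```

Note that the universe here is every gene that was tested, not just the DEGs; the choice of background changes the p-value.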
With GSEA you use all the genes, as I said before, and you obtain a list of pathways/biological processes in which your genes are involved, based on the ranking provided. In these lists you can find both up- and downregulated genes, because, as you know, a pathway is composed of many components. So GSEA gives a general picture of what's going on.
For your aim, both methods can be good. You can find that gene X, at the top of your DEG analysis, is a very important target to block; but you can also find that "epithelial to mesenchymal transition" is enriched in your GSEA analysis (based on the same DEG list), so you can pick one of the many genes involved as a target to block the entire pathway, instead of only a few genes. ;)
It is not recommended to use only differentially expressed genes for GSEA. See previous discussions:
You're right, sorry, I wrote something that can create confusion: with DEG list I mean the list that you obtain after you analyze your two conditions in the RNAseq/microarray, so technically they are just the genes that come out from analysis with pval and logFC. :)
Thank you everyone for the input. After reading igor's comment, I'm guessing I should run a DEG package in R (DESeq2, limma, etc.) and then use GSEA Preranked, is that right?
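For the preranked route, the input is a ranking of ALL tested genes, with no significance cutoff applied. A minimal sketch of building one from DEG results follows. The ranking metric used here, sign(logFC) × −log10(p-value), is only one common choice and is an assumption of this example (the moderated t-statistic from limma or the Wald statistic from DESeq2 are alternatives); the function names and toy values are hypothetical. The output is formatted as two tab-separated columns, gene and score, which is the layout of a `.rnk` file.

```python
import math


def preranked_list(results, min_p=1e-300):
    """results maps gene -> (logFC, p_value).

    Returns every gene as (gene, score), sorted from most up-regulated to
    most down-regulated. min_p guards against log10(0).
    """
    ranked = []
    for gene, (lfc, p) in results.items():
        magnitude = -math.log10(max(p, min_p))
        ranked.append((gene, math.copysign(magnitude, lfc)))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


def to_rnk(ranked):
    """Format as two tab-separated columns: gene, score."""
    return "\n".join(f"{gene}\t{score:.6g}" for gene, score in ranked)


# Toy results (hypothetical); note the non-significant gene G3 is kept.
toy_results = {
    "G1": (2.0, 0.001),
    "G2": (-1.0, 0.01),
    "G3": (0.1, 0.5),
}
ranking = preranked_list(toy_results)
rnk_text = to_rnk(ranking)
```

Genes with no evidence of change end up in the middle of the list, which is what lets the method detect sets that shift coherently even when few members pass a hard cutoff.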
That would make sense.
How about doing a pathway analysis with the R package "ROntoTools"? Would that be a better idea, since it would provide actual biological significance and is more accurate than gene set analysis (DEG, GSEA)?
There isn't any problem with the dimensionality of the DEG test, as more genes lead to better estimation of the per-gene parameters. And how would you do GSEA before having an ordered list?
Yes, usually GSEA methods do not evaluate whether a pathway is up- or down-regulated (how would they know?)
Hi! I know this post is from a while ago. I saw your opinion that "it's better to run the GSEA on ALL your genes, not only over-expressed/under-expressed." Why do you think so? I usually do GSEA on the differentially expressed genes. If my DEGs have a trend, will that bias the GSEA results?
According to the GSEA documentation:
Hi, old topic, but I think the question is still of interest. I'm used to working with RNA-seq data, mainly from mouse liver. Based on my experience, I would say that if there are thousands of DEGs, analysing all genes with GSEA will give satisfying results. But if you run a GSEA on a dataset with very few DEGs, you'll get false positives, because the genes are ranked but the true distance between genes is not taken into account. You can try a GSEA on a dataset with randomly distributed genes (= no DEGs) and you might get some "significant" GSEA-positive categories by chance. So now I prefer to filter my results and run the analysis on DEGs only. If there are thousands of DEGs, there's not much change compared to running the analysis on everything, because there are enough poorly expressed significant genes to produce background noise. But if there are not many DEGs, I get fewer false positives. Sometimes I get "nothing", which can be a satisfying result. In effect, you're losing statistical power. But I find that this loss can help summarise the data, so it might be worthwhile to discard genes that aren't affected by the experimental conditions.
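The "random ranking, no DEGs" experiment suggested above can be sketched as a small simulation. This is a toy under stated assumptions, not a reproduction of the GSEA software: it uses the unweighted (Kolmogorov-Smirnov style) running sum, random gene sets, and a gene-set permutation null; all names and sizes are hypothetical. With a nominal cutoff of p < 0.05, roughly 5% of sets would be expected to pass by chance alone, which is why multiple-testing corrected values (GSEA reports an FDR q-value) are the ones to look at.

```python
import random


def ks_enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score: maximum deviation from zero."""
    n_hits = sum(1 for g in ranked_genes if g in gene_set)
    n_miss = len(ranked_genes) - n_hits
    running = best = 0.0
    for g in ranked_genes:
        running += 1.0 / n_hits if g in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def null_experiment(n_genes=300, n_sets=60, set_size=15, n_perm=100, seed=0):
    """Count random gene sets reaching nominal p < 0.05 on a random ranking."""
    rng = random.Random(seed)
    genes = [f"g{i}" for i in range(n_genes)]
    ranked = genes[:]
    rng.shuffle(ranked)  # the ranking carries no biological signal

    nominal_hits = 0
    for _ in range(n_sets):
        gene_set = set(rng.sample(genes, set_size))
        observed = abs(ks_enrichment_score(ranked, gene_set))
        # Null distribution: scores of other random sets of the same size.
        exceed = sum(
            abs(ks_enrichment_score(ranked, set(rng.sample(genes, set_size))))
            >= observed
            for _ in range(n_perm)
        )
        p_value = (exceed + 1) / (n_perm + 1)
        if p_value < 0.05:
            nominal_hits += 1
    return nominal_hits, n_sets


hits, total = null_experiment()
print(f"{hits} of {total} random sets pass nominal p < 0.05 with no real signal")
```

Whether filtering to DEGs first is the right remedy is a separate question; the simulation only shows that uncorrected p-values produce some "significant" sets even when nothing is going on.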