Question

Combining GSEA (Gene Set Enrichment analysis) and DEG (Differentially Expressed Genes) to confirm results together- is it a good idea?

2

4.4 years ago

chokevin8 ▴ 30

Hi, I'm trying to start a project in R where I take cancer patient data and find DEGs, with the ultimate goal of identifying possible pharmaceutical targets. My question is whether I can run the same data through both GSEA and a DEG analysis so that each confirms the other's conclusions. Right now, I'm only using a DEG analysis (voom + limma in R) to filter/select significant genes.

I know that these two analyses are completely different- GSEA takes in a priori gene sets and gives information relevant to significant gene SETS for each phenotype. DEG will look into individual GENES (not gene sets) and gives us a list of differentially expressed genes for each phenotype.

However, I was wondering if these can work together in harmony so that we can first use GSEA to filter significant gene sets and then use DEG to test individual genes significantly enriched in those gene sets of GSEA. I thought this would help because just performing DEG inherently lacks biological significance. But while GSEA has biological significance, it doesn't have the ability to detect at the level of individual genes. So why not make them work together to complement each other's strengths/weaknesses?

For example, I would run GSEA for two different cancer types (phenotype) A and B, and find gene set X is overexpressed. Then I would look into which group of individual genes are contributing the most to the enrichment score for gene set X. Then I would run a DEG analysis of those individual genes. If I find some genes that are significantly overexpressed for specific types of cancers, that actually itself can be a probable target.

I also do recognize the difficulty of running this together- there are so many different packages that have different methods (ex. normalization methods, etc). But putting these problems aside, I'm just asking that if I could get this right, would this be a good idea?

Thank you for your input :)

RNA-Seq R sequencing • 15k views

ADD COMMENT • link updated 6 weeks ago by arnaud_p • 0 • written 4.4 years ago by chokevin8 ▴ 30

score 4 · Answer 1 · 2020-07-17

4

4.4 years ago

mario.red8976 ▴ 130

Hi @chokevin8, I think that there is a misunderstanding about the methods. You use the GSEA method to analyze your DEG list (or at least, that is what I understand GSEA to be for).

So, you can first find your DEG list, with gene name/symbol/ID, pvalue and log FC. Then you use this list to run a GSEA. Also, it's better to run the GSEA on ALL your genes, not only over-expressed/under-expressed.

What GSEA does is rank your genes based on a value that you provide. Usually this value is the logFC of each gene, but I have also seen calculations like pvalue * logFC, which take the significance of the gene into consideration as well, even though GSEA itself doesn't care about it! ;)
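As a rough illustration of the ranking step, here is a minimal Python sketch. The gene names and numbers are made up, and the function names are hypothetical. Note that instead of a literal pvalue * logFC product, the second ranking uses sign(logFC) * -log10(p-value), a commonly used variant that I am swapping in here; it assumes no p-value is exactly 0.

```python
import math

# Hypothetical differential-expression results: gene -> (logFC, p-value).
# Gene names and numbers are invented for illustration only.
de_results = {
    "GENE_A": (2.1, 0.0004),
    "GENE_B": (-1.8, 0.0100),
    "GENE_C": (0.3, 0.6000),
    "GENE_D": (-0.2, 0.8000),
    "GENE_E": (1.2, 0.0300),
}


def rank_by_logfc(results):
    """Order ALL genes from most positive to most negative logFC."""
    return sorted(results, key=lambda g: results[g][0], reverse=True)


def signed_significance(logfc, pvalue):
    """sign(logFC) * -log10(p): large magnitude for significant genes.

    Assumes 0 < pvalue <= 1 (a p-value of exactly 0 would need clamping).
    """
    return math.copysign(-math.log10(pvalue), logfc)


def rank_by_signed_significance(results):
    """Order ALL genes by the signed significance score, descending."""
    return sorted(
        results,
        key=lambda g: signed_significance(*results[g]),
        reverse=True,
    )


ranked_logfc = rank_by_logfc(de_results)
ranked_signed = rank_by_signed_significance(de_results)
print(ranked_logfc)
print(ranked_signed)
```

Either way, every gene stays in the list; only the ordering metric changes.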

Now, imagine that you have a DEG list with logFC. You load it into the GSEA program (https://www.gsea-msigdb.org/gsea/index.jsp , from the Broad Institute) or into an online tool like Enrichr (https://amp.pharm.mssm.edu/Enrichr/ ). GSEA basically takes the ranked genes (from + to -, according to logFC) and compares them against the specified gene sets.

Then, and this is VERY important, the result is not that a specific pathway is up- or down-regulated, but that the pathway is affected in some way by the condition that you're studying. In fact, you will have enriched genes that are both up- and down-regulated. The result of GSEA is a broad picture of what's going on in your cell line / model.

I hope that this helps you!

ADD COMMENT • link 4.4 years ago by mario.red8976 ▴ 130

1

"the result it is not that a specific pathway is up- or down-regulated"

You should clarify this part. The result is actually an enrichment score with a specific direction (up or down). Not all genes move in the same direction, but there should be an enrichment at one end of the spectrum (see also: "leading-edge genes").
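To make the "score with a direction" point concrete, here is a simplified Python sketch of a weighted running-sum enrichment score and its leading-edge genes. It is only an approximation of what the GSEA software does: there is no permutation-based significance or normalization, and the gene names, scores and function name are all made up for illustration.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Simplified running-sum enrichment score.

    ranked: list of (gene, score) pairs sorted by score, descending.
    Returns (ES, leading_edge), where ES is the running-sum value with
    the largest absolute deviation from zero.
    """
    members = set(gene_set)
    hit_weights = [abs(s) ** weight if g in members else 0.0 for g, s in ranked]
    n_hits = sum(1 for g, _ in ranked if g in members)
    n_miss = len(ranked) - n_hits
    total = sum(hit_weights)
    if n_hits == 0 or n_miss == 0 or total == 0:
        raise ValueError("gene set must partially overlap the ranked list")

    running, es, peak = 0.0, 0.0, -1
    for i, ((gene, _), w) in enumerate(zip(ranked, hit_weights)):
        if gene in members:
            running += w / total        # step up on a gene-set member
        else:
            running -= 1.0 / n_miss     # step down on any other gene
        if abs(running) > abs(es):
            es, peak = running, i

    if es >= 0:
        # Positive ES: members at or before the peak drive the score.
        leading = [g for g, _ in ranked[: peak + 1] if g in members]
    else:
        # Negative ES: members after the trough drive the score.
        leading = [g for g, _ in ranked[peak:] if g in members]
    return es, leading


# Invented ranked list (gene, logFC-like score), already sorted descending.
ranked = [("g1", 3.0), ("g2", 2.5), ("g3", 2.0), ("g4", 1.0),
          ("g5", -0.5), ("g6", -1.0), ("g7", -2.0), ("g8", -3.0)]

es_up, lead_up = enrichment_score(ranked, {"g1", "g2", "g4"})
es_down, lead_down = enrichment_score(ranked, {"g6", "g7", "g8"})
print(es_up, lead_up)
print(es_down, lead_down)
```

The first set gets a positive score (enriched at the top of the list) and the second a negative one (enriched at the bottom), even though "g4" in the first set is not part of its leading edge.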

ADD REPLY • link 4.4 years ago by igor 13k

0

Hey, thanks for your kind input :) I do understand your method and why you would suggest one like that. However, don't you think both would work? GSEA first then DEG/DEG first then GSEA? But the reason why I thought GSEA first then DEG would work better is because if you do DEG first for tens of thousands of individual genes, then it is simply too inefficient- GSEA would help reduce dimensionality for the subsequent DEG test. Though, it would be interesting to see the differences of results of DEG first then GSEA vs GSEA first then DEG...

Also, when you say the result of a GSEA is not that a specific pathway is up- or down-regulated, you're basically saying that the reason why we do GSEA is just to see which pathway (gene sets) is affected by the phenotype, right? So basically up-regulation and down-regulation isn't significant in GSEA...

ADD REPLY • link 4.4 years ago by chokevin8 ▴ 30

3

Hi! I don't understand how you would do a GSEA without having a list of Differentially Expressed Genes... What would you use as input for the analysis? GSEA starts with a list of DEGs, so in any case you need to do the DEG analysis before running GSEA. So, you do the DEG analysis and find genes that are up-regulated and down-regulated in tumor vs. normal. Now you have a list of DEGs that you can analyse in different ways.

One, for example, is to decide on cutoffs for p-value and logFC to define what is really DE between the two conditions, say p-value < 0.05 and |logFC| > 1.5. This gives you genes that you can further analyze with an enrichment analysis such as GO or KEGG on only the upregulated genes, for example (or only the downregulated ones), so that you find pathways that move in the same direction as your genes (+ or -).
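A minimal Python sketch of this cutoff-then-enrichment approach might look like the following. It assumes the thresholds mentioned above (p-value < 0.05, |logFC| > 1.5), uses a plain hypergeometric test as a stand-in for what GO/KEGG enrichment tools typically compute, and all gene names, values and function names are hypothetical.

```python
from math import comb


def split_degs(results, p_cutoff=0.05, lfc_cutoff=1.5):
    """Split genes into up- and down-regulated DEGs using both cutoffs."""
    up = [g for g, (lfc, p) in results.items() if p < p_cutoff and lfc > lfc_cutoff]
    down = [g for g, (lfc, p) in results.items() if p < p_cutoff and lfc < -lfc_cutoff]
    return up, down


def hypergeom_pvalue(n_universe, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when drawing n_selected genes at random
    (without replacement) from a universe containing n_pathway pathway genes."""
    upper = min(n_pathway, n_selected)
    favourable = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_universe, n_selected)


# Invented results: gene -> (logFC, p-value).
results = {
    "G1": (2.4, 0.001), "G2": (1.9, 0.004), "G3": (1.7, 0.02),
    "G4": (1.6, 0.20),   # fails the p-value cutoff
    "G5": (0.4, 0.01),   # fails the logFC cutoff
    "G6": (-2.2, 0.003), "G7": (-1.6, 0.03),
    "G8": (-0.1, 0.9), "G9": (0.2, 0.7), "G10": (-0.3, 0.5),
}
pathway = {"G1", "G2", "G3", "G9"}  # hypothetical pathway within the universe

up, down = split_degs(results)
overlap = [g for g in up if g in pathway]
p_up = hypergeom_pvalue(len(results), len(pathway), len(up), len(overlap))
print(up, down, overlap, p_up)
```

The universe here is every gene that was tested, not just the DEGs; the pathway is assumed to be restricted to genes in that universe.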

With GSEA you use all the genes, as I said before, and you obtain a list of pathways/biological processes in which your genes are involved, based on the ranking provided. In these lists you can find both up- and downregulated genes, because as you know a pathway is composed of many components. So, GSEA is a general picture of what's going on.

For your aim, both methods can be good. You can find gene X, a very important target to block, at the top of your DEG analysis; but you can also find that "epithelial to mesenchymal transition" is enriched in your GSEA analysis (based on the same DEG list), so you can pick one of the many genes involved as a target to block the entire pathway, instead of only a few genes. ;)

ADD REPLY • link 4.3 years ago by mario.red8976 ▴ 130

1

"I don't understand how you would do a GSEA without having a list of Differentially Expressed Genes... What would you insert as input in the analysis? The GSEA starts with a list of DEGs"

It is not recommended to use only differentially expressed genes for GSEA. See previous discussions:

ADD REPLY • link 4.3 years ago by igor 13k

0

You're right, sorry, I wrote something that can create confusion: by DEG list I mean the list that you obtain after you analyze your two conditions in the RNA-seq/microarray experiment, so technically it is just all the genes that come out of the analysis with their p-value and logFC. :)

ADD REPLY • link 4.3 years ago by mario.red8976 ▴ 130

1

Thank you everyone for the input. After reading igor's comment, I'm guessing I should use a DEG package in R (DESeq2, limma, etc.) and then run GSEA Preranked on the result, is that right?
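For what it's worth, the hand-off between the two steps is usually just a two-column, tab-delimited ranked list (gene, score). Here is a small Python sketch of preparing one from exported DE results. The scores are invented, the function name is hypothetical, and keeping the strongest score per duplicated gene symbol is just one possible convention that I am assuming here.

```python
def to_rnk_text(rows):
    """Build tab-delimited 'gene<TAB>score' lines, sorted by score descending.

    rows: iterable of (gene_symbol, score) for ALL tested genes.
    If a symbol appears more than once, the score with the largest
    absolute value is kept (an assumption, not a fixed rule).
    """
    best = {}
    for gene, score in rows:
        if gene not in best or abs(score) > abs(best[gene]):
            best[gene] = score
    ordered = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return "".join(f"{gene}\t{score:.4f}\n" for gene, score in ordered)


# Invented scores (e.g. a t-statistic or logFC exported from the DE step).
rows = [("TP53", 4.2), ("MYC", -3.1), ("EGFR", 0.4),
        ("MYC", -1.0), ("GAPDH", -0.05)]

rnk_text = to_rnk_text(rows)
print(rnk_text, end="")
```

Genes with scores near zero (like GAPDH above) are deliberately kept, since the ranked list is meant to contain everything that was tested.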

ADD REPLY • link 4.3 years ago by chokevin8 ▴ 30

1

That would make sense.

ADD REPLY • link 4.3 years ago by igor 13k

0

How about doing pathway analysis with the R package "ROntoTools"? Would that be a better idea, since it would provide actual biological significance and is more accurate than gene set analysis (DEG, GSEA)?

ADD REPLY • link 4.3 years ago by chokevin8 ▴ 30

1

There isn't any problem with the dimensionality of the DEG test, as more genes lead to better estimation of the per-gene parameters. How would you do GSEA before having an ordered list?

Yes, usually GSEA methods do not evaluate whether a pathway is up- or down-regulated (how would it know?)

ADD REPLY • link 4.3 years ago by Lluís R. ★ 1.2k

0

Hi! I know this post is from a while ago. I saw your opinion that "it's better to run the GSEA on ALL your genes, not only over-expressed/under-expressed." Why do you think so? I usually do GSEA on the differentially expressed genes only. If my DEGs have a trend, will that bias the GSEA results?

ADD REPLY • link 2.8 years ago by Icecrystal • 0

1

According to the GSEA documentation:

The GSEA algorithm does not filter the expression dataset and generally does not benefit from your filtering of the expression dataset. During the analysis, genes that are poorly expressed or that have low variance across the dataset populate the middle of the ranked gene list and the use of a weighted statistic ensures that they do not contribute to a positive enrichment score. By removing such genes from your dataset, you may actually reduce the power of the statistic.

ADD REPLY • link 2.8 years ago by igor 13k

0

Hi, old topic, but I think the question is still of interest. I'm used to working with RNA-seq data, mainly from mouse liver. Based on my experience, I would say that if there are thousands of DEGs, analysing all genes with GSEA will give satisfying results. But if you run GSEA on a dataset with very few DEGs, you'll get false positives, because the genes are ranked but the true distance between genes is not taken into account. You can try GSEA on a dataset with randomly distributed genes (= no DEGs) and you might get some "significant" GSEA-positive categories by chance. So now I prefer to filter my results and run the analysis on DEGs only: if there are thousands of DEGs, there's not much change compared to running the analysis on everything, because there are enough poorly expressed significant genes that produce background noise. But if there are not many DEGs, I get fewer false positives. Sometimes I get "nothing", which can be a satisfying result. In effect, you're losing statistical power. But somehow, I find that this loss can help summarise the data, so it might be interesting to discard genes that aren't affected by the experimental conditions.
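The "random dataset" experiment described above is easy to try in miniature. The Python sketch below is not GSEA itself: it assumes a much simpler mean-score z-test on purely random gene scores and random gene sets (all sizes, the seed and the function names are arbitrary choices), and then applies a Benjamini-Hochberg adjustment to show how many nominal hits survive multiple-testing correction.

```python
import math
import random


def null_pvalues(n_genes=2000, n_sets=200, set_size=25, seed=1):
    """Two-sided p-values for random gene sets on random scores (no real signal).

    Uses a simple z-test on the summed scores, which is only approximately
    valid here and is NOT the statistic used by GSEA.
    """
    rng = random.Random(seed)
    scores = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
    pvals = []
    for _ in range(n_sets):
        members = rng.sample(range(n_genes), set_size)
        z = sum(scores[i] for i in members) / math.sqrt(set_size)
        pvals.append(math.erfc(abs(z) / math.sqrt(2.0)))
    return pvals


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = null_pvalues()
adjusted = benjamini_hochberg(pvals)
n_nominal = sum(p < 0.05 for p in pvals)
n_adjusted = sum(q < 0.05 for q in adjusted)
print(f"nominal p < 0.05: {n_nominal} of {len(pvals)} random sets")
print(f"BH-adjusted < 0.05: {n_adjusted} of {len(pvals)} random sets")
```

With no real signal, roughly 5% of sets are expected to pass a nominal 0.05 threshold by chance, which is why the adjusted values are the ones to look at before deciding whether filtering is needed.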

ADD REPLY • link 6 weeks ago by arnaud_p • 0