Question

GSEA vs GO-enrichment analysis

1

11 months ago

Pegasus ▴ 130

Hi all,

After reading through several posts, I came back to the interesting discussion in the post below:

Combining GSEA (Gene Set Enrichment analysis) and DEG (Differentially Expressed Genes) to confirm results together- is it a good idea?

As discussed, for GSEA, we use two individual lists as input:

The up-regulated genes
The down-regulated genes

We can filter these two lists based on p-value and logFC, or we can set these parameters in the GSEA tool. The result is a set of GO terms enriched by the overrepresentation of up-regulated or down-regulated genes.

Point 1: What log(p-value) in the GSEA output is considered significant? Is it the same cutoff as for DEGs, i.e., 0.05? For example, I have only 40 upregulated genes, but GSEA reported enrichment of 12 GO terms with log p-value > 0.05.

Point 2: How should we interpret the results if the same GO term is sometimes enriched in both the downregulated and upregulated gene lists?

Point 3: If we use a logFC > 1.5 as a threshold in the GSEA settings, not all expressed genes are included. What is the difference between GSEA and regular GO-term enrichment analysis in this context?

As I understood it, "GO term enrichment looks at the frequency of GO terms among a set of genes, while GSEA evaluates the association of entire gene sets with phenotypes based on gene rankings"; however, I found them similar!

I’d appreciate any help you can offer.

RNA-SEQ • 1.9k views

ADD COMMENT • link 11 months ago by Pegasus ▴ 130

1

p-values are about statistical significance and hence are independent of what you are measuring, but once you apply a log to them and plot them on a scale, they become a different concept, though not necessarily a more meaningful one,

it is just that biologists like to (incorrectly) think that a smaller p-value means a more certain result, hence the plotting of p-values

remember that each method depends on annotations - how well a gene is annotated, hence operating on incomplete information

the annotation of a gene is merely a snapshot in time

ADD REPLY • link 11 months ago by Istvan Albert 102k

0

Thanks for your reply.

Let's say the files are well annotated and the alignment showed 95% of reads uniquely mapped to the reference. Filtering based on p-value > 0.05 is a common step. The question here is what the best input is:

For GO enrichment analysis, it is the DEGs.
But for GSEA, is it:

a list of all genes (let's call them expressed genes; I know the name is misleading)
a list of all expressed genes + p-value > 0.05
a list of DEGs filtered on both criteria together (p-value and fold change)
up- and down-regulated gene lists like the ones for goseq (GO terms)

I previously tried a GSEA tool that recommended using the entire list of genes with their logFC values. While the tool did display enriched GO terms, it didn't indicate the direction (up or down) of the regulation. This led to confusion about whether the GSEA tool was including all genes regardless of their direction, which could be misleading.

ADD REPLY • link 11 months ago by Pegasus ▴ 130

1

When trying to get a biological interpretation of a differential gene expression analysis, you have two standard options: overrepresentation analysis or enrichment analysis. The first is based on a hypergeometric test; the second is a functional class scoring technique based on the distribution of all genes annotated to a gene set across a ranking that includes all measured genes. Both are very limited approaches, with the ideal being a constraint-based genome-wide model that accounts for the topology of pathways. That one is probably unrealistic at the moment.
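To make the hypergeometric test behind overrepresentation analysis concrete, here is a minimal sketch in plain Python. The function name `ora_pvalue` and all the numbers are hypothetical, chosen only for illustration; real tools also correct for multiple testing across terms, which this sketch does not do.

```python
from math import comb


def ora_pvalue(universe_size, set_size, deg_count, overlap):
    """Upper-tail hypergeometric p-value for one gene set.

    universe_size: number of measured (background) genes
    set_size:      background genes annotated to the term
    deg_count:     number of genes in the DEG list
    overlap:       DEGs that are annotated to the term
    Returns P(X >= overlap) when deg_count genes are drawn at random.
    """
    max_overlap = min(set_size, deg_count)
    total = comb(universe_size, deg_count)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, deg_count - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical example: 10,000 measured genes, 100 annotated to a GO term,
# 40 DEGs, 5 of which carry the annotation (0.4 expected by chance).
p = ora_pvalue(10_000, 100, 40, 5)
print(f"ORA p-value: {p:.3g}")
```

Note that the result depends on the background (`universe_size`), so the universe should be the genes actually measured in the experiment, not the whole genome.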

The two methods can be complementary, as they answer slightly different questions. The first gives you the biological role of the top changing genes; the second is best at picking up more subtle but coordinated (same direction) changes in all genes involved in a pathway, as you derive a score from all genes annotated to the process. They will often give you different results and point to different cell functions. For GSEA you need to provide a ranking of all genes included in your comparison. Which metric to use for the ranking is largely up to you: a naive and intuitive metric like logFC is often used, but you could rank genes by pairwise comparison p-value, or by correlation with something, if the question you are trying to answer is which cellular processes are correlated with the abundance of a certain gene/protein.
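The idea of scoring a gene set across a ranking of all genes can be sketched as follows. This is a simplified, unweighted running-sum score, not the full GSEA procedure (which weights hits by the ranking metric and estimates significance by permutation). The gene names, logFC values, and the function name `enrichment_score` are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score over a ranked gene list.

    Walk down the ranking, stepping up on genes in the set and down on
    genes outside it; return the largest deviation from zero (signed).
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical logFC values for ALL measured genes (no threshold applied).
log_fc = {"g1": 2.1, "g2": 1.7, "g3": 1.2, "g4": 0.3,
          "g5": -0.1, "g6": -0.8, "g7": -1.5, "g8": -2.4}
ranked = sorted(log_fc, key=log_fc.get, reverse=True)

es_top = enrichment_score(ranked, {"g1", "g2", "g3"})     # set sits at the top
es_bottom = enrichment_score(ranked, {"g6", "g7", "g8"})  # set sits at the bottom
print(es_top, es_bottom)
```

The sign of the score carries the direction: a set concentrated at the top of the ranking scores positive, one concentrated at the bottom scores negative.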

For overrepresentation analysis I would recommend against splitting into up- and down-regulated genes. Genes that work in coordination within the same biological process don't necessarily need to be coexpressed, as there are so many complex regulatory interactions between actors of the same pathway, and gene sets often include "positive" and "negative" contributors towards that cell function. It's common to get upregulated and downregulated genes annotated to the same pathway, which would be strong evidence that the pathway is perturbed and that the activity of the biological process is affected. To guess in which direction it is changing, you should probably take a closer look at the pathway topology and interpret it yourself.

ADD REPLY • link 11 months ago by ultraanfibio ▴ 40

0

Thanks a lot,

What confused me is the mixing of both concepts, like below:

"g:GOSt performs functional enrichment analysis, also known as over-representation analysis (ORA) or gene set enrichment analysis, on input gene list"

Based on your explanation, it seems that many typical RNA-seq analyses might be going in the wrong direction. I completely agree with not splitting upregulation and downregulation, as both directions can contribute to the same pathway. Therefore, it makes sense to start with the pathway itself, rather than just focusing on GO terms, and to examine the role of each individual gene within each pathway.

In this context:

Rearranging and grouping genes into sets and then using GSEA seems like a more valid approach.
Ranking genes: Some discussions suggest using criteria combining both logFC and p-value for ranking. However:

How do we combine both? logFC is ranked from higher to lower, the opposite of p-value, so clearly we cannot simply multiply them!
Most of the web tools use a strict default ranking threshold (for example, logFC = 2), so in such a case how do we select the best threshold to filter on?
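For reference, one commonly used way to put logFC and p-value on a single scale is a signed score, sign(logFC) × -log10(p-value): the p-value sets the magnitude and the logFC only sets the direction. The sketch below assumes that metric; the gene names, values, and the function name `signed_rank_score` are hypothetical, and it ranks all genes without any threshold.

```python
import math


def signed_rank_score(log_fc, p_value, min_p=1e-300):
    """Return sign(logFC) * -log10(p): large positive for strongly
    up-regulated genes, large negative for strongly down-regulated ones."""
    if log_fc == 0:
        return 0.0
    # Clamp p to avoid log10(0) when a tool reports p = 0 (assumed floor).
    strength = -math.log10(max(p_value, min_p))
    return math.copysign(strength, log_fc)


# Hypothetical differential expression results: gene -> (logFC, p-value).
results = {
    "geneA": (2.0, 0.01),
    "geneB": (-1.0, 0.001),
    "geneC": (0.5, 0.5),
    "geneD": (-3.0, 0.2),
}
scores = {gene: signed_rank_score(lfc, p) for gene, (lfc, p) in results.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Because every gene receives a score, the full list can be passed to a ranking-based tool and no logFC cutoff has to be chosen.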

Given this discussion, where does goseq fit? Is it considered an Enrichment Analysis (EA) or Overrepresentation Analysis (ORA) method?

I appreciate any recommendations you might have!

ADD REPLY • link 11 months ago by Pegasus ▴ 130