Question

GSEA: Can a gene set have positive and negative enrichment scores?

5.2 years ago

penny.lane ▴ 20

I had thought I understood the meaning of the positive and negative normalized enrichment scores (NES) produced by GSEA, but it just occurred to me I don't know what to expect from a gene set that is enriched both at high and low ranks. For example, some gene sets (e.g., GO biological processes) contain genes that are co-regulated under different conditions, but actually move in opposite directions (producing a mix of positive and negative fold-changes), such that these genes fall at the very top and bottom of the rank-ordered list that GSEA is using to compute NES. So, do these high- and low-ranking genes just cancel each other out leading to an NES close to zero or does GSEA just report the larger absolute enrichment between the positive and the negative NES?

gsea • 8.8k views

score 4 · Accepted Answer · 2019-09-16

5.2 years ago

shawn.w.foley ★ 1.3k

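The short answer to your question is that GSEA reports a single signed NES (Normalized Enrichment Score) per gene set, so the up- and down-regulated halves of a set do not add up. With an equal split of up- and down-regulated genes, the two ends work against each other and the score is diluted compared with a set whose genes all move in the same direction.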

The medium answer is that the NES is calculated from the ES (Enrichment Score), which is derived from a running sum over the ranked gene list. Effectively, you have a ranked list of all genes, and as you walk from the top to the bottom of that list you update the running sum: for each gene that is in your gene set you add to it, and for each gene that is not in your gene set you subtract from it. In the default weighted mode, each addition is scaled by the magnitude of that gene's ranking metric, so hits at the extremes of the list count for more than hits near the middle. The ES is the value of the running sum where it deviates furthest from zero; a positive ES points to enrichment at the top of the list and a negative ES to enrichment at the bottom. So if you have a gene set of 100 genes, and 50 are right at the top of the list (the 50 most up-regulated genes) and 50 are spread throughout the bottom of the list (all down-regulated but with a wide range of log2FC values), then you may still end up with a strongly positive ES.
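To make the running sum concrete, here is a minimal sketch of the ES calculation as I understand it from the GSEA papers. It is not the GSEA implementation: the function name `enrichment_score`, the toy ranking, and the three toy gene sets are all made up for illustration, and it stops at the raw ES (no permutations, so no NES).

```python
def enrichment_score(ranked, gene_set, p=1.0):
    """Return (ES, running_sum) for one gene set.

    ranked   : list of (gene, score) pairs sorted from highest to lowest score
    gene_set : set of gene names
    p        : exponent on |score| for hit weights (p=0 gives equal steps)
    """
    in_set = [gene in gene_set for gene, _ in ranked]
    weights = [abs(score) ** p if hit else 0.0
               for (_, score), hit in zip(ranked, in_set)]
    total_hit_weight = sum(weights)
    n_miss = in_set.count(False)
    if total_hit_weight == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running, curve = 0.0, []
    for hit, weight in zip(in_set, weights):
        if hit:
            running += weight / total_hit_weight  # step up on a gene-set member
        else:
            running -= 1.0 / n_miss               # step down on any other gene
        curve.append(running)

    # ES is the running-sum value furthest from zero, with its sign kept.
    return max(curve, key=abs), curve


# Hypothetical toy ranking: 1000 genes, scores falling linearly from +5 to -5.
ranked = [(f"g{i}", 5.0 - 10.0 * i / 999) for i in range(1000)]

# Hypothetical gene sets (assumptions for illustration only).
toy_sets = {
    "all_up": {f"g{i}" for i in range(20)},
    "half_up_half_down": {f"g{i}" for i in range(10)}
                         | {f"g{i}" for i in range(990, 1000)},
    "top_50_plus_scattered_bottom_50": {f"g{i}" for i in range(50)}
                                       | {f"g{i}" for i in range(500, 1000, 10)},
}

for name, genes in toy_sets.items():
    es, _ = enrichment_score(ranked, genes)
    print(f"{name}: ES = {es:+.3f}")
```

Plotting the returned running sum for each toy set is a quick way to see where the peak (or trough) falls. The NES additionally divides the ES by the mean ES obtained from permuted data, which this sketch leaves out.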

The long answer would be to dig through the GSEA publications and the (huge amount of) math they use to calculate their scores, here and here.

Thanks for the quick response Shawn! Your explanation makes sense. So, what do people typically do to capture enriched gene sets where the genes in the set tend to change under the same conditions, but actually contain a mix of up- and down-regulated genes? The first thing that comes to mind is to run the analysis with ranks that are determined by the fold-change p-values that have been signed in the direction of the fold-change (which is what I've been doing), but then also run a second analysis where ranks are determined only by the fold-change p-values. It seems a little clunky though to have to run 2 separate analyses, no?
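For concreteness, the two rankings could be built roughly like this. This is only a sketch: the `-log10(p)` transform, the function name `ranking_metrics`, and the toy differential-expression results are assumptions for illustration, not something prescribed by GSEA.

```python
import math


def ranking_metrics(results):
    """Build signed and unsigned ranked lists from differential-expression results.

    results : iterable of (gene, log2fc, pvalue) tuples
    Returns (signed, unsigned), each a list of (gene, score) sorted high to low.
    """
    signed, unsigned = [], []
    for gene, log2fc, pvalue in results:
        # Assumed transform: -log10(p), with a floor so that p == 0 stays finite.
        strength = -math.log10(max(pvalue, 1e-300))
        direction = 1.0 if log2fc >= 0 else -1.0
        signed.append((gene, direction * strength))  # direction-aware ranking
        unsigned.append((gene, strength))            # "changed at all" ranking
    signed.sort(key=lambda pair: pair[1], reverse=True)
    unsigned.sort(key=lambda pair: pair[1], reverse=True)
    return signed, unsigned


# Hypothetical results: (gene, log2 fold-change, p-value).
toy_results = [
    ("A", 2.0, 1e-8),
    ("B", -3.0, 1e-10),
    ("C", 0.1, 0.5),
    ("D", -0.2, 0.2),
]

signed, unsigned = ranking_metrics(toy_results)
print("signed:  ", [gene for gene, _ in signed])
print("unsigned:", [gene for gene, _ in unsigned])
```

In the signed list the strongly down-regulated gene sits at the very bottom, while in the unsigned list it moves to the top alongside the strongly up-regulated gene.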

5.2 years ago by penny.lane ▴ 20

Rather than signing the p-value, it's recommended to rank by the t-statistic. Its magnitude tracks the p-value (a larger |t| means a smaller p-value) and its sign follows the direction of the fold change.

My recommendation would be to perform a hypergeometric test (similar to a GO term enrichment, but using the gene sets of interest). I'd take the significantly up-regulated genes and run the test, then take the significantly down-regulated genes and run it again. If the same gene set comes up as significant in both tests, you have your answer: the set contains both up- and down-regulated genes.
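As a rough sketch of that two-test approach, the upper-tail hypergeometric p-value can be computed with nothing but the standard library. The function names `hypergeom_enrichment_p` and `direction_p_value`, the gene universe, and the toy gene lists are all hypothetical, chosen only to illustrate the idea.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn without
    replacement from n_universe genes, of which n_in_set are in the gene set."""
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_universe, n_selected)


def direction_p_value(universe, gene_set, significant_genes):
    """Enrichment p-value of gene_set among one direction's significant genes."""
    members = gene_set & universe
    selected = significant_genes & universe
    return hypergeom_enrichment_p(
        len(universe), len(members), len(selected), len(members & selected)
    )


# Hypothetical data: 1000 tested genes and one gene set of 40 genes.
universe = {f"g{i}" for i in range(1000)}
gene_set = {f"g{i}" for i in range(40)}

# 100 significantly up-regulated genes, 15 of them in the gene set.
up_genes = {f"g{i}" for i in range(15)} | {f"g{i}" for i in range(100, 185)}
# 100 significantly down-regulated genes, 12 of them in the gene set.
down_genes = {f"g{i}" for i in range(15, 27)} | {f"g{i}" for i in range(200, 288)}

p_up = direction_p_value(universe, gene_set, up_genes)
p_down = direction_p_value(universe, gene_set, down_genes)
print(f"up:   p = {p_up:.3g}")
print(f"down: p = {p_down:.3g}")
```

A gene set that is significant in both runs is a candidate for the mixed up/down behaviour discussed above. When many gene sets are tested this way, the p-values still need a multiple-testing correction.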