I will try to give you my take on it, since I struggled for a while to understand what it means. Hopefully I have it right now.
So, to run GSEA you have your list of genes (L) and two (or more) conditions, e.g. a microarray experiment with normal and tumor samples. The first thing GSEA does is rank the genes in L based on "how well they divide the conditions", using the probe intensity values. At this point you have a list L ranked from 1 to n.
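To make the ranking step concrete, here is a minimal sketch in Python. It assumes a simplified signal-to-noise score (difference of the class means divided by the sum of the class standard deviations), which is the kind of signed metric GSEA uses by default; the real implementation applies extra adjustments that are left out here. The function name `signal_to_noise`, the gene names and the expression values are all made up for illustration.

```python
import numpy as np

def signal_to_noise(expr, is_a):
    """Simplified signal-to-noise score per gene (hypothetical helper).

    expr : genes x samples array of intensities
    is_a : boolean mask, True for samples in condition A
    """
    a, b = expr[:, is_a], expr[:, ~is_a]
    return (a.mean(axis=1) - b.mean(axis=1)) / (a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1))

# Toy data (invented): 4 genes x 6 samples, first three samples are tumor.
genes = np.array(["g1", "g2", "g3", "g4"])
expr = np.array([
    [9.0, 9.5, 10.0, 5.0, 5.5, 6.0],   # clearly higher in tumor
    [5.0, 5.5, 6.0, 9.0, 9.5, 10.0],   # clearly higher in normal
    [7.0, 8.0, 7.5, 7.0, 7.5, 8.0],    # no difference
    [8.0, 8.5, 9.0, 7.0, 7.5, 8.0],    # mildly higher in tumor
])
is_tumor = np.array([True, True, True, False, False, False])

scores = signal_to_noise(expr, is_tumor)
order = np.argsort(-scores)            # descending: most "up in tumor" first
ranked_genes = genes[order]
print(list(ranked_genes))              # ['g1', 'g4', 'g3', 'g2']
```

Note that the score is signed, so the list runs from "most up in tumor" at the top to "most up in normal" at the bottom, with the uninformative genes in the middle.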
Now you want to see whether the genes present in a gene set (S) are at the top or at the bottom of your list, or whether they are just spread around randomly. To do that, GSEA calculates the famous enrichment score (ES). The ES becomes the normalized enrichment score (NES) once it is normalized for gene set size, so that different gene sets can be compared; the correction for multiple testing (FDR) is then computed from the NES.
A positive NES indicates that the genes in set S are mostly found at the top of your list L. A negative NES indicates that the genes in set S are mostly at the bottom of your list L.
Let's say that S1 has a positive NES and S2 has a negative NES. Let's also say that your list of 1000 genes is ordered from the most up-regulated (top: 1,2,3,....) to the most down-regulated (bottom: ....n-3,n-2,n-1,n). A positive NES for S1 means that the genes in that gene set are mostly up-regulated in your dataset. A negative NES for S2 indicates the opposite.
In the results you will also find a heatmap of the subset of your data that belongs to the signature analyzed. In general, what I have seen is that the more significantly enriched the gene set is, the better the separation between the two conditions in the heatmap.
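Here is a rough sketch of how the score can be computed, assuming the weighted running-sum statistic from the GSEA paper: walking down the ranked list, the sum goes up at every gene that is in S (by an amount proportional to its ranking score) and down at every gene that is not. The ES is the largest deviation from zero. The normalization shown here divides by the mean ES of random gene sets of the same size and sign, which is my simplified stand-in for what the software does. The names `running_es` and `nes` and the toy scores are hypothetical.

```python
import numpy as np

def running_es(ranked_scores, in_set, p=1.0):
    """Running sum and ES for one gene set over a ranked list (sketch)."""
    ranked_scores = np.asarray(ranked_scores, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    weights = np.abs(ranked_scores) ** p
    hit = np.where(in_set, weights, 0.0)
    hit = hit / hit.sum()                               # steps up sum to 1
    miss = np.where(in_set, 0.0, 1.0 / (~in_set).sum()) # steps down sum to 1
    running = np.cumsum(hit - miss)
    es = running[np.argmax(np.abs(running))]            # max deviation from 0
    return running, es

def nes(ranked_scores, in_set, n_perm=1000, seed=0):
    """ES divided by the mean null ES of the same sign (simplified)."""
    rng = np.random.default_rng(seed)
    in_set = np.asarray(in_set, dtype=bool)
    _, es = running_es(ranked_scores, in_set)
    # Null distribution: same set size, membership shuffled along the list.
    null = np.array([running_es(ranked_scores, rng.permutation(in_set))[1]
                     for _ in range(n_perm)])
    same_sign = null[np.sign(null) == np.sign(es)]
    return es / np.abs(same_sign).mean()

# Toy ranked list (invented): 10 genes, most up-regulated first.
scores = [5, 4, 3, 2, 1, -1, -2, -3, -4, -5]
top_set = [True, True, False, True, False, False, False, False, False, False]
bottom_set = [False, False, False, False, False, False, True, False, True, True]

running_top, es_top = running_es(scores, top_set)
running_bottom, es_bottom = running_es(scores, bottom_set)
nes_top = nes(scores, top_set)
print(es_top, es_bottom, nes_top)
```

In this toy case the set sitting at the top of the list gets a positive ES and the set sitting at the bottom gets a negative one, and the running sum always returns to zero at the end of the list.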
Hopefully I understood it right and this helps, otherwise, please correct me :)
Thank you for your answer, it has been helpful.
Within your answer, you discuss a ranking of 1000 genes based upon "how well they divide the conditions". Later you refer to a "list of 1000 genes is ordered from the most upregulated (top: 1,2,3,....) to the most downregulated (bottom: ....n-3,n-2,n-1,n)". Is this list what you meant by "how well they divide the conditions"?
I would think that a gene that is the most highly differentially expressed gene between two conditions would "divide the conditions" the best. Since this change can be in either direction (upregulated or downregulated), would the ranked list in the beginning be based on the absolute value of the fold-change? If it was, then how does this relate to the 1000 genes that are ordered from the most upregulated to the most downregulated?
Sorry, I am just trying to better understand my data.
Thank you,
Chris
The explanation is very clear but I have a similar question as Chris.
Say you have two conditions, A and B, you use condition A minus condition B, and gene a is the most upregulated. If gene a is in a gene set S, when you do GSEA and look at the positive gene set, would gene a be at the top of the list because it is the most upregulated? Would it also have the highest running ES, since it is the most correlated with condition A (phenotype A)? However, that is not what I have seen in my data. The first gene in a gene set (ranked highest) does not necessarily have the highest running ES. Am I not understanding it correctly?
Thank you very much.
Yi
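On the running ES question, a small sketch may help. It assumes the same weighted running-sum statistic described in the GSEA paper; the scores and set membership are invented. The running ES at a gene is a cumulative value over everything ranked above it, not a property of that gene alone, so it keeps climbing as long as set members keep turning up and usually peaks some way below the first member.

```python
import numpy as np

# Toy ranked list (invented): 10 genes, most up-regulated first.
scores = np.array([6.0, 5.0, 4.0, 3.0, 2.0, 1.0, -1.0, -2.0, -3.0, -4.0])
# Set members sit at ranks 1, 3, 4 and 8 (indices 0, 2, 3, 7).
in_set = np.array([True, False, True, True, False, False, False, True, False, False])

weights = np.abs(scores) * in_set
# Step up at set members (proportional to score), step down elsewhere.
step = np.where(in_set, weights / weights.sum(), -1.0 / (~in_set).sum())
running = np.cumsum(step)

peak = int(np.argmax(running))                    # position of the ES
leading_edge = np.flatnonzero(in_set[: peak + 1]) # set members up to the peak

print(np.round(running, 3))
print(peak)                  # 3, i.e. the 4th gene in the list
print(list(leading_edge))    # [0, 2, 3]
```

Here the top-ranked gene is in the set, but the running sum at that point is only 0.4; it reaches its maximum of 0.7 three positions later, after two more set members have been counted. The set members up to the peak are what the GSEA report calls the leading edge.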
Thanks, this helps! I do have another question related to this. I pre-ranked my gene list from most upregulated to most downregulated. I saw that some gene sets have an initially positive running ES that becomes negative toward the end. Does it mean that some upregulated genes are enriched in the pathway and that some downregulated genes are enriched in the same pathway as well?
I guess some genes are in the same pathway but may have a positive or negative effect on it (activators and repressors), so it may make sense to divide those genes into activators and repressors and then run GSEA on each group individually? Thanks!
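A toy version of this situation, as a sketch only: it assumes the weighted running-sum statistic from the GSEA paper, and the scores, the set membership and the `is_activator` annotation are all made up. When a set has members at both ends of the ranked list, the running sum rises early and dips below zero late; scoring the two annotated halves separately gives one clearly positive and one clearly negative ES.

```python
import numpy as np

def running_es(ranked_scores, in_set):
    """Running sum and ES for one gene set (hypothetical helper)."""
    weights = np.abs(ranked_scores) * in_set
    step = np.where(in_set, weights / weights.sum(), -1.0 / (~in_set).sum())
    running = np.cumsum(step)
    return running, running[np.argmax(np.abs(running))]

# Toy pre-ranked list (invented): most up-regulated first.
scores = np.array([5.0, 4.0, 3.0, 2.0, 1.0, -1.0, -2.0, -3.0, -4.0, -5.0])
# Pathway members at both ends of the list.
in_set = np.array([True, True, False, False, False, False, False, False, True, True])
# Hypothetical prior annotation: which pathway members are activators.
is_activator = np.array([True, True, False, False, False, False, False, False, False, False])

running_all, es_all = running_es(scores, in_set)
_, es_activators = running_es(scores, in_set & is_activator)
_, es_repressors = running_es(scores, in_set & ~is_activator)

print(np.round(running_all, 3))     # positive early, negative late
print(es_activators, es_repressors)
```

The split should come from independent annotation of the pathway rather than from the direction of change in the same data, otherwise the two halves will look enriched almost by construction.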
Did you figure this out? I'm currently trying to analyze GSEA outputs and the figures/documentation are so confusing.