Question

Gene enrichment analysis clusterProfiler

4.2 years ago

camillab. ▴ 160

Hi,

I want to perform GSEA with the clusterProfiler package and I am following this post. But I have a few doubts (please be patient, I am a biologist):

Is it better to perform GSEA using the Ensembl ID or the gene symbol?

I ran it with both Ensembl IDs and gene symbols just to see whether I would get the same results, but surprisingly I got slightly different ones. I am not sure which one I should choose, since I don't want to be biased towards whichever better supports my hypothesis (which is the run on gene symbols).

In the first figure (dotplot) of the website I posted, the enriched terms are divided into activated and suppressed, which reflects whether the enrichment score (ES) is positive or negative. A positive ES indicates gene set enrichment at the top of the ranked list, whereas a negative ES indicates gene set enrichment at the bottom of the ranked list. But what does that mean in practice? What does it tell me that, e.g., cytosol ribosome (in the figure) has a negative ES and so is at the bottom of the ranked list?

Thank you in advance

Camilla

RNA-Seq R clusterProfiler GSEA • 3.8k views

updated 4.1 years ago by Hannes ▴ 60 • written 4.2 years ago by camillab. ▴ 160

score 3 · Accepted Answer · 2020-10-24

Hi Camilla,

I only recently started working with this package, but I will do my best to answer your questions to the best of my current knowledge. About your first question:

Is it better to perform GSEA using the Ensembl ID or the gene symbol?

I personally would stick to ENSEMBL IDs, as it is very often the case that multiple ENSEMBL IDs share the same gene symbol. I think this might also explain why your results differ between gene symbols and ENSEMBL IDs. You can check this on your input data in R by comparing length(unique(YOURDATA$ENSEMBLID)) with length(unique(YOURDATA$SYMBOL)). This will give you an idea of how much your input varies in size for the GSEA analysis. Ultimately, having fewer genes to compare against in the enrichment model will influence the outcome. So I would say that using the dataset with the most information (I assume that would be the ENSEMBL IDs) is the less biased approach.
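A minimal sketch of that check in base R. The data frame below is a made-up stand-in for YOURDATA: the column names ENSEMBLID and SYMBOL follow the snippet above, while the IDs, symbols, and the log2FC column are invented purely for illustration.

```r
# Toy stand-in for YOURDATA; all IDs, symbols and values are invented.
YOURDATA <- data.frame(
  ENSEMBLID = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D", "ENSG_E"),
  SYMBOL    = c("GENE1", "GENE2", "GENE2", "GENE3", NA),
  log2FC    = c(2.1, -0.4, -1.3, 0.7, 1.5),
  stringsAsFactors = FALSE
)

# Number of distinct identifiers of each type (missing symbols excluded).
n_ensembl <- length(unique(YOURDATA$ENSEMBLID))
n_symbol  <- length(unique(na.omit(YOURDATA$SYMBOL)))

# Symbols that are shared by more than one Ensembl ID.
# table() ignores NA by default, so unmapped genes are not counted here.
symbol_counts  <- table(YOURDATA$SYMBOL)
shared_symbols <- names(symbol_counts[symbol_counts > 1])

cat("unique Ensembl IDs:", n_ensembl, "\n")
cat("unique symbols:    ", n_symbol, "\n")
cat("symbols mapped to several Ensembl IDs:", shared_symbols, "\n")
```

If the two counts differ noticeably on your real data, the two GSEA runs are effectively being given different gene lists, which would be one plausible reason for the slightly different results.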

[...] the ranked list. [...] what does it mean practically?

As far as I understand the concept of the running enrichment score and the approach of clusterProfiler, you have to provide a gene list in which the effect size (i.e. log2(fc)) is sorted from high to low. Hence, your input data is already ranked so that upregulated genes (positive log2(fc)) sit at the top of the list and downregulated genes (negative log2(fc)) at the bottom. This is why a gene set with a negative ES lies towards the bottom of the ranked list compared to one with a positive ES: in practice, its genes are mostly among the downregulated ones.
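As a rough sketch of what such an input looks like: clusterProfiler's GSEA functions take a named numeric vector sorted in decreasing order. The gene IDs and log2 fold changes below are invented for illustration, and the object names de and geneList are only placeholders.

```r
# Invented differential-expression results, for illustration only.
de <- data.frame(
  gene   = c("g1", "g2", "g3", "g4", "g5"),
  log2FC = c(-1.8, 2.4, 0.3, -0.2, 1.1),
  stringsAsFactors = FALSE
)

# Named numeric vector: values are the effect sizes, names are the gene IDs.
geneList <- de$log2FC
names(geneList) <- de$gene

# Sort from most upregulated (top of the list) to most downregulated (bottom).
geneList <- sort(geneList, decreasing = TRUE)

print(geneList)
```

Whichever identifier type you choose (Ensembl ID or symbol) goes into names(geneList), so it has to match the identifier type used by the gene sets you test against.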

I apologize if my explanation is a little cryptic, but you can run the gseaplot() function, which shows the running ES over the position in the ranked list of genes for a particular pathway in your results. Where the curve reaches its maximum deviation from zero, you will see a dashed red line. I think the point where this line meets the x-axis gives the rank position at which this pathway's ES is reached. Please check here for details.
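To make the sign of the ES a bit more concrete, here is a simplified sketch of a running score in base R. It is an unweighted version written only for illustration, not clusterProfiler's actual implementation (which also weights each step by the gene's ranking statistic); the function names running_score and enrichment_score and the toy genes are assumptions of this example.

```r
# Simplified, unweighted running score; NOT clusterProfiler's exact method.
running_score <- function(ranked_genes, gene_set) {
  in_set <- ranked_genes %in% gene_set
  n_hit  <- sum(in_set)
  n_miss <- length(ranked_genes) - n_hit
  # Step up at each gene that belongs to the set, step down otherwise.
  steps <- ifelse(in_set, 1 / n_hit, -1 / n_miss)
  cumsum(steps)
}

# ES = the running score at its largest deviation from zero, sign kept.
enrichment_score <- function(rs) rs[which.max(abs(rs))]

# Ten hypothetical genes, already ordered from most up- to most downregulated.
ranked_genes <- paste0("g", 1:10)

set_top    <- c("g1", "g2", "g3")   # members cluster among upregulated genes
set_bottom <- c("g8", "g9", "g10")  # members cluster among downregulated genes

es_top    <- enrichment_score(running_score(ranked_genes, set_top))
es_bottom <- enrichment_score(running_score(ranked_genes, set_bottom))

cat("ES, set at the top:   ", es_top, "\n")
cat("ES, set at the bottom:", es_bottom, "\n")
```

The set whose members sit among the upregulated genes gets a positive ES, and the set whose members sit among the downregulated genes gets a negative one. Read this way, a term such as cytosol ribosome with a negative ES would suggest that its genes tend to be downregulated in your comparison.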

I hope this helps :)