Question

Need help choosing genes and calculating ranks for GSEA rank file

5

Entering edit mode

5.4 years ago

zlikowski ▴ 50

Dear all,

I am trying to do my first pathway enrichment analysis of a ranked gene list using GSEA, as described in the relatively recent protocol published here: https://www.nature.com/articles/s41596-018-0103-9. I have "successfully" completed the entire protocol, meaning that I have, more or less, learned how to perform the DE analysis (using DESeq2), how to format the rank, class, gmt and expression files for GSEA, and use other tools required by the protocol. However, I have put quotations marks around the word "successfully", because I am not sure whether I have used the correct gene set when preparing rank files for GSEA. Here's what's bothering me:

1) As I understand, an input rank file for running GSEA in preranked mode should contain gene IDs in one column, and gene ranks in the second column. However, I am not certain which genes exactly should such a file contain. Before running the DESeq2 functions, I usually pre-filter low count genes, as suggested in the vignette, but the above-mentioned protocol states that "...all (or most) genes in the given genome need to have a score. What does the term "all genes" refers to in the context of GSEA analysis, then? And does that mean that, if I want to run GSEA, I should not pre-filter (remove) low counts genes before running DESeq2 and obtaining the results of DE analysis? So, briefly, which genes should be included in the rank file for properly running GSEA in preranked mode - all annotated genes for a given genome, DESeq2 pre-filtered genes, only genes for which padj was calculated during DE analysis, or something else?

2) Next, once I do obtain the results of DE analysis, with the proper set of genes for GSEA, a certain number of genes will contain padj="NA" value. Since "NA" cannot be used to calculate rank, my next question is how to deal with those genes? Should I remove them before running GSEA (which might be in conflict with the "all genes requirement"), or should I change their padj value to some number, e.g. 1?

3) Finally, what to do when certain number of genes have padj=0. This also complicates rank calculation, and I was wondering whether it would be "fair" to change the padj values for those genes to the smallest non-zero value (which one?) that can be used for calculations on a "standard" 64-bit computer?

Thanks everyone in advance, help is greatly appreciated!

Mirko

RNA-Seq GSEA Calculating ranks Enrichment analysis • 4.8k views

ADD COMMENT • link updated 5.4 years ago by alserg ▴ 1000 • written 5.4 years ago by zlikowski ▴ 50

score 4 · Answer 1 · 2019-07-17

4

Entering edit mode

5.4 years ago

alserg ▴ 1000

1) I recommend to consider "all genes" as all expressed genes, considering only top 10-12K most expressed genes (by average expression, or baseMean, if it's a result of DESeq2). You can use all the genes as well, but the list will contain a lot of noise.

As for 2) and 3) you can use stat field of the results table for ranking, it is always present (for expressed genes).

Additionally, I would recommend to use fgsea package, it's much faster than Broad's version and compatible with R. See for example: http://bioconductor.org/packages/release/bioc/vignettes/fgsea/inst/doc/fgsea-tutorial.html and https://stephenturner.github.io/deseq-to-fgsea/

ADD COMMENT • link 5.4 years ago by alserg ▴ 1000

0

Entering edit mode

Hello,

my apologies for not responding sooner, I was out of the country. Thank you very much for your suggestions and help. I will try to do GSEA using stat and some other ranking criteria to see how much of a difference that makes, and post the results here once I'm done.

ADD REPLY • link 5.3 years ago by zlikowski ▴ 50