Question

GSEA PreRanked lists from DESeq2 results table

2

4.9 years ago

Assa Yeroslaviz ★ 1.9k

I have several tables of results from different DESeq2 runs. The next step would be to do GO enrichment or GSEA enrichment analysis.

For that I would like to create a ranked list of genes for GSEAPreRanked, but I'm not sure which value to use for the ranking. Do I use the log2FC values, the p-values, or even the adjusted p-values?

I have searched different forums and the opinions vary.

When I use the command sign(resultsObject$log2FoldChange) * -log10(resultsObject$padj), I get Inf whenever padj = 0.

For the GO enrichment I can use the goseq package; for GSEA I wanted to use fgsea, which needs a ranked gene list.

Is it better to rank the list by significance (adj. p-values) or by expression intensity (fold-change)?

I would appreciate your opinions and/or recommendations.

thanks, Assa

gsea deseq2 preranked fgsea • 10.0k views

ADD COMMENT • link updated 4.9 years ago by jomo018 ▴ 730 • written 4.9 years ago by Assa Yeroslaviz ★ 1.9k

3

I know it's very common, but I am personally a little worried about using p-values as the ranking. You can have very strong changes with high p-values and very subtle changes with low p-values.

There is a nice example here where they use the test statistic as the ranking, which is a nice strategy: https://stephenturner.github.io/deseq-to-fgsea/

ADD REPLY • link 4.9 years ago by igor 13k
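
For readers following along, here is a minimal sketch of the stat-based ranking suggested above. It is not the code from the linked post. The data frame `res` is a made-up stand-in for `as.data.frame(results(dds))`, and the `gene` column and the `pathways` object mentioned in the final comment are hypothetical names.

```r
# Toy stand-in for a DESeq2 results table; all values are invented.
res <- data.frame(
  gene = c("A", "B", "C", "D", "E"),
  log2FoldChange = c(2.1, -1.3, 0.2, NA, -0.05),
  stat = c(6.4, -3.9, 0.5, NA, -0.1),
  stringsAsFactors = FALSE
)

# Genes without a test statistic (NA) cannot be ranked, so drop them.
res <- res[!is.na(res$stat), ]

# Named numeric vector, sorted from most up- to most down-regulated.
ranks <- setNames(res$stat, res$gene)
ranks <- sort(ranks, decreasing = TRUE)
print(ranks)

# Assumption: with fgsea installed and `pathways` a named list of gene sets,
# the vector would be passed as fgsea::fgsea(pathways = pathways, stats = ranks).
```

The stat column is already signed and finite for every tested gene, so no extra handling of zeros or Inf is needed here.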

0

Thanks for the link. It is a very good example.

ADD REPLY • link 4.9 years ago by Assa Yeroslaviz ★ 1.9k

1

I'd recommend against using adjusted p-values; use the unadjusted p-values instead. The default FDR adjustment can squash genes to the same adjusted p-value even though their input p-values differ. The distribution of logFC differs between genes with different average expression levels, which is why I tend to rank on the signed p-values rather than the FCs.

ADD REPLY • link 4.9 years ago by russhh 5.7k
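
The squashing effect described above can be reproduced with base R alone. This is a small sketch with five invented p-values, assuming the Benjamini-Hochberg method (`p.adjust(..., method = "BH")`) as the FDR adjustment.

```r
# Five distinct nominal p-values (invented for illustration).
p <- c(0.001, 0.02, 0.021, 0.022, 0.5)

# Benjamini-Hochberg adjustment.
padj <- p.adjust(p, method = "BH")
print(padj)

# The BH procedure takes a running minimum, so neighbouring genes
# can end up with exactly the same adjusted value.
n_unique_p    <- length(unique(p))
n_unique_padj <- length(unique(padj))
cat("distinct p-values:", n_unique_p, "\n")
cat("distinct adjusted p-values:", n_unique_padj, "\n")
```

The three middle genes share one adjusted p-value, so a ranking built from padj would contain ties that the nominal p-values do not have.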

0

Good point about the same adjusted p-values. On a related note, there will also be a lot of adjusted p-values that are 1. Other than that, the adjusted and unadjusted p-values will correlate, so the rank order will be the same.

ADD REPLY • link 4.9 years ago by igor 13k

2

4.9 years ago

jomo018 ▴ 730

I tend to view p-values and adjusted p-values as confidence measures, not enrichment measures, and therefore less fit for GSEA.

Results for genes with high (bad) p-value are simply not reliable and should not be used for further analysis. Once you determine a p-value threshold, I would argue that FC (or log2FC) is then the proper measure for GSEA.

ADD COMMENT • link 4.9 years ago by jomo018 ▴ 730

0

If you are compiling a ranked list of genes, it should include genes that are not significantly changing. One of the benefits of running a ranked list is to aggregate signal from many genes that are not necessarily significant on their own.

ADD REPLY • link 4.9 years ago by igor 13k
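
The aggregation argument can be illustrated with a small simulation. This is only a sketch under stated assumptions: the statistics are drawn from normal distributions, the shift of 0.8 for the gene set is invented, and a Wilcoxon rank-sum test stands in for the GSEA enrichment score, which it is not.

```r
set.seed(1)
n_bg  <- 5000  # background genes, no true change
n_set <- 50    # genes in a hypothetical set, each with a subtle shift

# Simulated per-gene test statistics.
stat   <- c(rnorm(n_bg), rnorm(n_set, mean = 0.8))
in_set <- rep(c(FALSE, TRUE), c(n_bg, n_set))

# Per-gene significance after multiple-testing correction.
pval <- 2 * pnorm(-abs(stat))
padj <- p.adjust(pval, method = "BH")
n_sig_in_set <- sum(padj[in_set] < 0.05)

# Set-level test: are the set's statistics shifted against the background?
set_p <- wilcox.test(stat[in_set], stat[!in_set])$p.value

cat("set members individually significant:", n_sig_in_set, "\n")
cat("set-level p-value:", set_p, "\n")
```

With a shift this small, few if any set members are expected to pass the per-gene threshold, yet the set as a whole stands out. Filtering on significance before ranking would discard exactly those genes.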

0

Yes, I realize that. However, if the p-value is not significant, I am not sure whether my estimate for that gene is correct. So I prefer discarding that gene altogether rather than placing it incorrectly in the ranked list.

ADD REPLY • link 4.9 years ago by jomo018 ▴ 730

score 6 · Accepted Answer · 2020-02-08

6

4.9 years ago

alserg ▴ 1000

Definitely don't use adjusted P-values. The signed log (nominal) P-value or the test statistic (stat column) should be fine. I personally use the latter, but I don't have any arguments for this. From my experience the results are very similar.

ADD COMMENT • link 4.9 years ago by alserg ▴ 1000
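
One possible reason the two rankings behave so similarly is that a two-sided p-value computed from a normal test statistic is a monotone function of its absolute value. The sketch below checks this on simulated data; the normal approximation, the constant standard error of 0.3, and all values are assumptions made for illustration.

```r
set.seed(42)
n <- 1000

# Simulated test statistics and a results-like table built from them.
stat <- rnorm(n, sd = 2)
res <- data.frame(
  stat = stat,
  log2FoldChange = stat * 0.3,  # assumed constant standard error
  pvalue = 2 * pnorm(abs(stat), lower.tail = FALSE)
)

# Signed -log10 nominal p-value.
signed_logp <- sign(res$log2FoldChange) * -log10(res$pvalue)

# Rank agreement between the two candidate ranking metrics.
rho <- cor(res$stat, signed_logp, method = "spearman")
cat("Spearman correlation:", rho, "\n")
```

The gene order is essentially identical under these assumptions. The magnitudes still differ, so methods that weight genes by their score need not return identical enrichment results.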

0

This is exactly what I mean. Some people use these values, others use different ones, sometimes without any reason, especially if the results are similar.

The advantage of using the FC values is that I don't have any 0 in the table.

What do you do with them if you convert to the signed log (nominal) P-value?

ADD REPLY • link 4.9 years ago by Assa Yeroslaviz ★ 1.9k

0

Usually, there is no P-value of exactly one. But as I said, I prefer using the statistic, which is very straightforward.

ADD REPLY • link 4.9 years ago by alserg ▴ 1000

0

Would you recommend removing genes with exactly 0 in the stat column? In my case I am using the F-stat column from edgeR::glmQLFTest, which is zero for a small subset of genes (about 200 out of the 17,000 that survive the filterByExpr filter), so one would have ties. Or doesn't it matter? I would appreciate your comment. If you need more details, please tell me.

ADD REPLY • link 4.9 years ago by ATpoint 86k

0

I'm not that familiar with the edgeR pipeline. Isn't the F-statistic always positive, not signed? If so, it's shady territory. The method will work: it will tell you whether or not a gene set looks uniformly distributed, but you should be careful with the interpretation.

In any case, don't remove genes based on the statistic, even if it's zero. Only remove them based on something uncorrelated, like average expression (that's what filterByExpr does).

ADD REPLY • link 4.9 years ago by alserg ▴ 1000

1

Thanks for the reply! Yes, the F-stat is positive, so I would multiply it by -1 for genes with negative FCs.

ADD REPLY • link 4.9 years ago by ATpoint 86k
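
A minimal sketch of the signing step described above. The data frame `tt` is a made-up stand-in for an edgeR top table, with the column names `logFC` and `F` assumed to match the output of glmQLFTest; the `gene` column is a hypothetical name.

```r
# Toy stand-in for an edgeR results table; all values are invented.
tt <- data.frame(
  gene  = c("g1", "g2", "g3", "g4"),
  logFC = c(1.8, -2.2, 0.0, -0.3),
  F     = c(25.0, 40.0, 0.0, 1.2)
)

# Give the (always non-negative) F-statistic the sign of the fold change.
signed_F <- setNames(sign(tt$logFC) * tt$F, tt$gene)

# Genes with F = 0 are kept, as advised above; they sit in the middle.
signed_F <- sort(signed_F, decreasing = TRUE)
print(signed_F)
```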

0

Thanks, Alexey, for the answers and the help. I'm now following the fgsea package instructions, using the stat column for my ranking.

I still have one question, though. If I do decide to use the pvalue column, I still have some very significant genes for which the signed-log conversion turns the value into Inf. How should one handle this kind of data? Would changing the Inf to a value of 312 (corresponding to a p-value of 10^-312) be a possible solution?
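
One way the Inf values could be handled, offered as a sketch rather than a recommendation from the thread: replace each infinite score with a value just beyond the largest finite score in the same table, keeping its sign. The data frame and its values are invented. A data-driven cap is used instead of the fixed 312 because double precision can represent p-values below 10^-312, so a fixed cap could in principle fall below a finite score.

```r
# Toy results table in which two genes have a p-value of exactly 0.
res <- data.frame(
  gene = c("g1", "g2", "g3", "g4"),
  log2FoldChange = c(3.2, -2.5, 0.4, -0.1),
  pvalue = c(0, 0, 1e-5, 0.7)
)

# Signed -log10 p-value; p = 0 yields Inf or -Inf.
score <- sign(res$log2FoldChange) * -log10(res$pvalue)

# Cap infinite scores just beyond the largest finite magnitude.
finite_max <- max(abs(score[is.finite(score)]))
is_inf <- is.infinite(score)
score[is_inf] <- sign(score[is_inf]) * (finite_max + 1)

names(score) <- res$gene
print(sort(score, decreasing = TRUE))
```

Genes capped this way are tied with each other at each end of the list. Ranking by the stat column, as settled on above, avoids the problem altogether.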