GSEA PreRanked lists from DESeq2 results table
4.9 years ago
Assa Yeroslaviz ★ 1.9k

I have several tables of results from different DESeq2 runs. The next step would be to do GO enrichment or GSEA enrichment analysis.

For that I would like to create a ranked list of genes for GSEAPreRanked, but I'm not sure which value to use for the ranking. Do I use the log2FC values, the p-values, or even the adjusted p-values?

I have searched different forums and the opinions varied.

When I use the command sign(resultsObject$log2FoldChange) * -log10(resultsObject$padj), I get Inf wherever padj = 0.

For the GO enrichment I can use the goseq package. For the GSEA I wanted to use fgsea, which needs a ranked gene list.

Is it better to rank the list by significance (adjusted p-values) or by expression intensity (fold change)?

I would appreciate your opinions and/or recommendations.

Thanks, Assa

gsea deseq2 preranked fgsea • 9.9k views
ADD COMMENT

I know it's very common, but I am personally a little worried about using p-values as the ranking. You can have very strong changes with high p-values and very subtle changes with low p-values.

There is a nice example here where they use the test statistic as the ranking, which is a nice strategy: https://stephenturner.github.io/deseq-to-fgsea/
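
To make the point about effect size versus significance concrete, here is a minimal sketch in plain Python (not R, so it runs without Bioconductor). It assumes the DESeq2 Wald statistic is log2FoldChange divided by lfcSE and that the p-value is the two-sided normal tail probability of that statistic. The gene names and numbers are made up for illustration.

```python
import math

def wald_stat(log2_fold_change, lfc_se):
    # Assumed form of the Wald statistic: estimate divided by its standard error.
    return log2_fold_change / lfc_se

def two_sided_normal_p(z):
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical genes: a large but noisy change and a small but precise one.
genes = {
    "geneX": {"log2FoldChange": 4.0, "lfcSE": 3.0},
    "geneY": {"log2FoldChange": 0.3, "lfcSE": 0.05},
}

for name, row in genes.items():
    row["stat"] = wald_stat(row["log2FoldChange"], row["lfcSE"])
    row["pvalue"] = two_sided_normal_p(row["stat"])
    print(name, round(row["stat"], 2), row["pvalue"])
```

geneX has the larger fold change but the weaker statistic and the higher p-value, so ranking by log2FC and ranking by stat or p-value would order these two genes differently.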

ADD REPLY

Thanks for the link. It is a very good example.

ADD REPLY

I'd recommend against using adjusted p-values; use the unadjusted p-values instead. The default FDR adjustment can give many genes the same adjusted p-value even though their input p-values differ. The distribution of logFC also differs between genes with different average expression levels, which is why I tend to rank on the signed p-values rather than the FCs.
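
The squashing effect can be reproduced with a small sketch. This is a plain-Python re-implementation of the Benjamini-Hochberg adjustment (the method DESeq2 uses by default), written for illustration only; the p-values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Visit the p-values from largest to smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # rank of this p-value in ascending order
        # Enforce monotonicity: an adjusted value never exceeds that of a larger p-value.
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values, all distinct.
pvalues = [0.01, 0.04, 0.041, 0.042, 0.5]
padj = bh_adjust(pvalues)
print(padj)
```

Here the three distinct raw p-values 0.04, 0.041 and 0.042 all end up with the same adjusted value, so a ranking built on padj cannot separate those genes.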

ADD REPLY

Good point about the same adjusted p-values. On a related note, there will also be a lot of adjusted p-values that are 1. Other than that, the adjusted and unadjusted p-values will correlate, so the rank order will be the same.

ADD REPLY
4.9 years ago
alserg ▴ 1000

Definitely don't use adjusted P-values. The signed log of the nominal P-value or the test statistic (the stat column) should be fine. I personally use the latter, but I don't have any arguments for this. In my experience the results are very similar.
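
One possible reason the two rankings come out so similar, sketched in plain Python under an assumption: if the p-value is the two-sided normal tail probability of the stat column (as for a Wald test), then the signed -log10(p) is a monotone transform of stat, and both give the same gene order. The gene names and stat values below are hypothetical.

```python
import math

def two_sided_normal_p(z):
    # Two-sided tail probability of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

def signed_log_p(stat):
    # The sign of stat matches the sign of the log2 fold change.
    return math.copysign(-math.log10(two_sided_normal_p(stat)), stat)

# Hypothetical stat column values.
stats = {"geneA": 6.1, "geneB": -4.2, "geneC": 0.3, "geneD": -0.8, "geneE": 2.5}

order_by_stat = sorted(stats, key=lambda g: stats[g], reverse=True)
order_by_signed_p = sorted(stats, key=lambda g: signed_log_p(stats[g]), reverse=True)

print(order_by_stat)
print(order_by_signed_p)
```

The orders agree here. In real data they can still differ slightly, for example through ties once very small p-values underflow to 0.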

ADD COMMENT

This is exactly what I mean. Some people use these values, others use different ones, sometimes without any stated reason, especially if the results are similar.

The advantage of using the FC values is that I don't have any 0 in the table.

What do you do with them, if you convert to Signed log (nominal) P-value?

ADD REPLY

Usually, there is no P-value of exactly one. But as I said, I prefer using the statistic, which is very straightforward.

ADD REPLY

Would you recommend removing genes with exactly 0 in the stat column? In my case I am using the F-stat column from edgeR::glmQLFTest, which is zero for a small subset of genes (about 200 out of the 17,000 that survive the filterByExpr filter), so there would be ties. Or doesn't it matter? I would appreciate your comment. If you need more details, please tell me.

ADD REPLY

I'm not that familiar with the edgeR pipeline. Isn't the F-statistic always positive, not signed? If so, it's shady territory. The method will work: it will tell you whether or not a gene set looks uniformly distributed, but you should be careful with the interpretation.

In any case, don't remove genes based on the statistic, even if it's zero. Only remove them based on something uncorrelated, like average expression (that's what filterByExpr does).

ADD REPLY

Thanks for the reply! Yes, the F-stat is positive, so I would multiply it by -1 for genes with negative FCs.
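
A minimal sketch of that sign flip in plain Python, with made-up values. It assumes the F-statistic comes from a test of a single coefficient or contrast, so that one log fold change gives it a direction.

```python
import math

# Hypothetical (gene, F statistic, logFC) rows.
rows = [
    ("geneA", 12.5, 1.8),
    ("geneB", 9.0, -2.1),
    ("geneC", 0.0, 0.05),
    ("geneD", 3.2, -0.4),
]

def signed_f(f_stat, log_fc):
    # Give the non-negative F statistic the sign of the fold change.
    return math.copysign(f_stat, log_fc)

signed = {gene: signed_f(f, lfc) for gene, f, lfc in rows}
ranked_genes = sorted(signed, key=signed.get, reverse=True)
print(ranked_genes)
```

Genes with F = 0 stay at 0 and sit in the middle of the ranking, tied with each other.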

ADD REPLY

Thanks, Alexey, for the answers and the help. I'm now following the fgsea package instructions, using the stat column for my ranking.

I still have one question, though. If I do decide to use the pvalue column, I still have some very significant genes for which the sign-log conversion turns the value into Inf. How should one handle this kind of data? Would changing the Inf to 312 (i.e. treating it as a p-value of 10^-312) be a possible solution?

ADD REPLY

Yes, changing Inf to a big number should work fine.
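
A minimal sketch of that replacement in plain Python. The helper names and the example values are hypothetical. Note that R's -log10(0) returns Inf, whereas Python's math.log10(0) raises an error, so the zero case is handled explicitly here.

```python
import math

def signed_neg_log10(pvalue, log2_fold_change):
    # Mimic R, where -log10(0) is Inf.
    magnitude = math.inf if pvalue == 0 else -math.log10(pvalue)
    return math.copysign(magnitude, log2_fold_change)

def cap_infinite(scores, cap=None):
    # Replace +/-Inf with +/-cap; by default use a value just beyond the largest finite score.
    if cap is None:
        finite = [abs(s) for s in scores if math.isfinite(s)]
        cap = max(finite, default=0.0) + 1.0
    return [math.copysign(cap, s) if math.isinf(s) else s for s in scores]

# Hypothetical (pvalue, log2FoldChange) pairs; two p-values have underflowed to 0.
pairs = [(0.0, 2.3), (1e-300, 1.1), (0.0, -1.7), (0.04, -0.5)]
scores = [signed_neg_log10(p, lfc) for p, lfc in pairs]
capped = cap_infinite(scores)
fixed_cap = cap_infinite(scores, cap=312.0)  # the value suggested in the question
print(capped)
```

All genes that were Inf receive the same capped score, so they remain tied with each other at the top or bottom of the list.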

ADD REPLY
4.9 years ago
jomo018 ▴ 730

I tend to view p-values and adjusted p-values as confidence measures, not enrichment measures, and therefore as less suited to GSEA.

Results for genes with high (bad) p-value are simply not reliable and should not be used for further analysis. Once you determine a p-value threshold, I would argue that FC (or log2FC) is then the proper measure for GSEA.
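
The filter-then-rank approach described here could look like the following plain-Python sketch. The threshold of 0.05 and the example rows are assumptions for illustration, not values from this thread.

```python
P_THRESHOLD = 0.05  # hypothetical cutoff

# Hypothetical rows mimicking columns of a DESeq2 results table.
results = [
    {"gene": "geneA", "log2FoldChange": 2.4, "pvalue": 0.001},
    {"gene": "geneB", "log2FoldChange": -3.1, "pvalue": 0.30},
    {"gene": "geneC", "log2FoldChange": -1.2, "pvalue": 0.01},
    {"gene": "geneD", "log2FoldChange": 0.4, "pvalue": 0.04},
]

# Keep only genes passing the threshold, then rank the survivors by log2FC.
kept = [row for row in results if row["pvalue"] < P_THRESHOLD]
ranked = sorted(kept, key=lambda row: row["log2FoldChange"], reverse=True)
ranked_genes = [row["gene"] for row in ranked]
print(ranked_genes)
```

geneB has the largest absolute fold change but is dropped because its p-value is above the cutoff.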

ADD COMMENT

If you are compiling a ranked list of genes, it should include genes that are not significantly changing. One of the benefits of running a ranked list is that it aggregates signal from many genes that are not necessarily significant on their own.

ADD REPLY

Yes, I realize that. However, if the p-value is not significant, I am not sure whether my estimate for that gene is correct. So I prefer discarding that gene altogether rather than placing it incorrectly in the ranked list.

ADD REPLY
