I have several tables of results from different DESeq2 runs. The next step would be to do GO enrichment or GSEA enrichment analysis.
For that I would like to create a ranked list of genes for GSEAPreRanked, but I'm not sure which value to use for the ranking: the log2FC values, the p-values, or even the adjusted p-values?
I have searched different forums, and the opinions vary.
When I use the command sign(resultsObject$log2FoldChange) * -log10(resultsObject$padj), I get Inf if padj = 0.
For the GO enrichment I can use the goseq package; for the GSEA I wanted to use fgsea, which needs a ranked gene list.
Is it better to rank the list by significance (adjusted p-values) or by expression change (fold change)?
I would appreciate your opinions and/or recommendations.
Thanks, Assa
I know it's very common, but I am personally a little worried about using p-values as the ranking. You can have very strong changes with high p-values and very subtle changes with low p-values.
There is a nice example here where they use the test statistic as the ranking, which is a nice strategy: https://stephenturner.github.io/deseq-to-fgsea/
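To make the idea concrete, here is a minimal sketch of ranking by the test statistic. It assumes the results come from the default Wald test, so the table has a signed stat column; the toy data frame, gene names, and values are made up and only stand in for as.data.frame(results(dds)).

```r
# Toy stand-in for a DESeq2 results table (hypothetical values).
res <- data.frame(
  gene = c("geneA", "geneB", "geneC", "geneD"),
  log2FoldChange = c(2.1, -1.4, 0.3, -0.2),
  stat = c(8.5, -6.2, 1.1, NA),
  stringsAsFactors = FALSE
)

# Drop genes without a test statistic (e.g. genes with all-zero counts).
res <- res[!is.na(res$stat), ]

# Named numeric vector, sorted from most up-regulated to most down-regulated.
ranks <- setNames(res$stat, res$gene)
ranks <- sort(ranks, decreasing = TRUE)
print(ranks)
```

A vector like ranks is the kind of input fgsea expects for its stats argument, e.g. fgsea::fgsea(pathways = pathways, stats = ranks), where pathways would be your own named list of gene sets (not defined here).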
Thanks for the link. It is a very good example.
I'd recommend against using adjusted p-values; use the unadjusted p-values instead. The default FDR adjustment can give many genes the same adjusted p-value even though their input p-values differ, which creates ties in the ranking. The distribution of logFC also differs between genes with different average expression levels, which is why I tend to rank on the signed p-values rather than the FCs.
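A rough sketch of the signed p-value ranking, including one possible way to avoid the Inf from the original question. The toy table and its values are invented, and capping p = 0 at the smallest positive double is my own assumption rather than anything DESeq2 or fgsea prescribes; note that genes capped this way will still tie with each other.

```r
# Toy stand-in for a DESeq2 results table (hypothetical values).
res <- data.frame(
  gene = c("geneA", "geneB", "geneC", "geneD"),
  log2FoldChange = c(3.0, -2.0, 0.5, -0.1),
  pvalue = c(0, 1e-8, 0.04, 0.9),
  stringsAsFactors = FALSE
)

# Assumption: replace p = 0 with the smallest positive double so that
# -log10() stays finite instead of returning Inf.
p <- pmax(res$pvalue, .Machine$double.xmin)

# Signed significance: direction from the fold change, magnitude from the raw p-value.
score <- sign(res$log2FoldChange) * -log10(p)

# Named vector sorted from most up-regulated to most down-regulated.
ranks <- sort(setNames(score, res$gene), decreasing = TRUE)
print(ranks)
```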
Good point about the same adjusted p-values. On a related note, there will also be a lot of adjusted p-values that are 1. Other than that, the adjusted and unadjusted p-values will correlate, so the rank order will be the same.
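A small illustration of both points with base R's p.adjust, using made-up p-values and assuming the Benjamini-Hochberg method: the adjustment collapses several distinct p-values onto one adjusted value, but it never reverses their order.

```r
# Hypothetical raw p-values, already sorted for readability.
p <- c(0.001, 0.02, 0.021, 0.022, 0.5, 0.9, 0.95)

# Benjamini-Hochberg adjustment.
padj <- p.adjust(p, method = "BH")

# Fewer distinct values after adjustment means new ties in a ranking.
n_unique_raw <- length(unique(p))
n_unique_adj <- length(unique(padj))

# Adjusted values are non-decreasing along increasing raw p-values,
# so the order is preserved apart from the ties.
order_kept <- all(diff(padj[order(p)]) >= 0)

print(data.frame(p = p, padj = padj))
print(c(raw = n_unique_raw, adjusted = n_unique_adj))
print(order_kept)
```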