Question

Use of Pval vs. Padj Statistic in Pre-Ranked GSEA Analysis


4 months ago

BioTinker • 0

Hi,

I ran into an issue when running a pre-ranked GSEA on my gene list (created by DESeq2). When I ranked the genes by log2FoldChange, I got a large number of enriched gene sets that seemed fine. However, I also wanted to rank the gene list by the statistic sign(log2FoldChange) * -log10(Pvalue), both to verify my other GSEA results and to take statistical significance (the Pvalue) into account.

When doing this, I first used the Padj (adjusted P-values) from the DESeq2 output as the "pvalue" in the ranking and got no enriched gene sets, along with an error saying that 53.68% of all genes were ties: "There are ties in the preranked stats (53.68% of the list)." When I switched to raw Pvalues, the error went away and I got a large number of enriched gene sets again.

In this case, is it alright to use raw Pvalues instead of Padj values, and why? Furthermore, why do most of the Padj values overlap?

Thanks!

[Image: Error with ranking with Padj]

[Image: Success with raw Pval]

Gene Ontology GSEA RNA-Seq • 686 views

updated 4 months ago by i.sudbery 21k • written 4 months ago by BioTinker • 0

score 2 · Answer 1 · 2025-01-22

P-values are effectively a combination of the size of the difference between the means of two groups and the spread of values in those groups. This is most obvious if we think about a t-test, where the statistic is t = (mean(group1) - mean(group2)) / SE, with SE a standard error built from sd(group1), sd(group2) and the group sizes. Within an experiment (same design, same degrees of freedom) there is a direct monotonic relationship between |t| and the p-value: the larger |t|, the smaller p. The p-value in turn is the probability of seeing a statistic at least that extreme under a model where the difference between the groups is 0, so it maps 1:1 onto how surprising the data are under that model. Now, DESeq2 doesn't use a t-test, but the underlying logic is similar, even if the math is different.
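To make the "difference plus spread" point concrete, here is a minimal sketch using Welch's form of the t statistic. This is just one common variant, not what DESeq2 computes, and the numbers are invented for illustration. Both comparisons have the same difference in means, but the noisier one gets a much smaller t.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group1, group2):
    """Difference in means divided by its standard error (Welch's form)."""
    se = sqrt(variance(group1) / len(group1) + variance(group2) / len(group2))
    return (mean(group1) - mean(group2)) / se

# Toy values, made up for this example only.
tight_a = [10.0, 10.5, 9.5, 10.2]
tight_b = [8.0, 8.4, 7.6, 8.1]
noisy_a = [10.0, 14.0, 6.0, 10.2]   # same mean as tight_a, larger spread
noisy_b = [8.0, 12.0, 4.0, 8.1]     # same mean as tight_b, larger spread

t_tight = welch_t(tight_a, tight_b)
t_noisy = welch_t(noisy_a, noisy_b)
print("tight groups: t =", round(t_tight, 2))
print("noisy groups: t =", round(t_noisy, 2))
```

Since a larger |t| means a smaller p-value, the tight comparison ends up far more significant even though the effect size is identical.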

Padj is calculated as a transformation of the p-values, designed so that for a given cutoff q (say 0.05) we expect no more than a fraction q of the discoveries with Padj under that threshold to be false discoveries. This is the only goal of Padj, and it doesn't guarantee a 1:1 relationship with anything else. In particular, the adjustment can map distinct p-values to the same Padj, which is how the ties in your ranking arise.
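As a rough illustration of where ties in Padj come from, here is a small pure-Python sketch of the Benjamini-Hochberg adjustment. I'm assuming that is the adjustment in play, since as far as I know it is the DESeq2 default. The input p-values are made up. Each p-value is scaled by n/rank, and a running minimum is then applied so the adjusted values stay monotone. That running minimum is the step that collapses distinct p-values onto one Padj.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, keeping a running minimum so the
    # adjusted values never decrease as the raw p-value increases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values, all distinct.
raw = [0.001, 0.20, 0.30, 0.35, 0.50, 0.90]
padj = bh_adjust(raw)
print(padj)
print(len(set(raw)), "distinct raw p-values ->", len(set(padj)), "distinct adjusted values")
```

Here the genes with raw p-values 0.20, 0.30 and 0.35 all end up with the same adjusted value, so any ranking built from Padj cannot separate them.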

As Padj isn't designed for ranking genes, only for thresholding them, I'd say it is not suitable for use in a ranking application such as GSEA. In general, P-values are not thought to be a good ranking metric for GSEA either, and I believe Gordon Smyth (of Limma/edgeR fame) recommends ranking by the raw test statistic instead, although it's my understanding that ranking by something like t should give the same ranking as sign(log2FoldChange) * -log10(p-value).
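A quick way to convince yourself of that last point is the sketch below. It assumes a test statistic that is compared against a standard normal (a Wald-type z) and whose sign matches the sign of log2FoldChange. The gene names and values are hypothetical.

```python
from math import copysign, erfc, log10, sqrt

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return erfc(abs(z) / sqrt(2))

# Hypothetical per-gene statistics (sign = direction of change).
stats = {"geneA": 3.1, "geneB": -2.4, "geneC": 0.7, "geneD": -0.2, "geneE": 1.5}

# sign(stat) * -log10(p), the p-value-based ranking metric from the question.
signed_logp = {g: copysign(1.0, z) * -log10(two_sided_p(z)) for g, z in stats.items()}

rank_by_stat = sorted(stats, key=stats.get, reverse=True)
rank_by_signed_logp = sorted(signed_logp, key=signed_logp.get, reverse=True)
print(rank_by_stat)
print(rank_by_signed_logp)
```

Both orderings agree, so the -log10(p) transform adds nothing to the ranking. It can lose information at the extremes, though. For a very large |z| the p-value underflows to 0 and the score is no longer finite, which is one practical reason to rank on the statistic itself.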