fgsea after deseq2, criteria for ranking DEGs.
2.5 years ago
user230613 ▴ 380

Hi there!

I am following this tutorial to perform GSEA using the fgsea R package. In the tutorial, DEGs are ranked using the stat column from DESeq2. I have read in several other places that it is also common to rank by fold change, p-value, or a combination of the two. I would appreciate any guidance or comments on this, as I am quite new to GSEA.

Thank you

GSEA • 7.5k views

For ranking genes, in general either the test statistic (the stat column in this case) or the fold change (the log2FoldChange column in this case) is used.


Exactly, that is what I have read so far. Do you know if there is any recommendation on which one to use? Should I first select only those DEGs with pvalue<0.05?


I would prefer not to filter the genes. Use the entire gene list, but ranked. For ranking, you can use either one (stat or fold change). Go with the stat column, as the tutorial suggests.


Thank you! I am just a bit hesitant about obtaining significant hits in GSEA that are driven by genes with high p-values or an FC close to zero.


If you are not comfortable with stat, you can rank the genes by logFC and use the logFC-ranked genes in the GSEA analysis. Ranking by logFC followed by GSEA is accepted in the scientific community.


Just as an update, I have tried both, and the results are quite similar, so I will follow the tutorial and stick to stat (even if I consider that logFC is more "understandable"). Thanks for the help.


Maybe of interest: GSEA PreRanked lists from DESeq2 results table. It looks like going with the stat column is good. No pre-filtering of genes is necessary, but I'd filter out those with NA p-values.
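To make the "rank everything, but drop NA" advice concrete, here is a minimal sketch. It is written in plain Python rather than R purely for illustration, and the gene names and values are made up (they are not from any real DESeq2 run). `None` stands in for NA. In R you would do the equivalent on the DESeq2 results table before passing a named numeric vector to fgsea.

```python
# Hypothetical rows mimicking a DESeq2 results table: (gene, stat, pvalue).
# None stands in for NA. All names and numbers are invented for illustration.
results = [
    ("GeneA", 5.2, 2.0e-7),
    ("GeneB", -3.1, 1.9e-3),
    ("GeneC", None, None),   # a gene without a usable test result
    ("GeneD", 0.4, 0.69),
    ("GeneE", -0.2, 0.84),
]


def rank_by_stat(rows):
    """Drop rows with a missing stat or p-value, then sort by stat, descending."""
    kept = [
        (gene, stat)
        for gene, stat, pvalue in rows
        if stat is not None and pvalue is not None
    ]
    return sorted(kept, key=lambda item: item[1], reverse=True)


ranked = rank_by_stat(results)
for gene, stat in ranked:
    print(f"{gene}\t{stat}")
```

Note that no significance cutoff is applied: non-significant genes stay in the list and simply end up near the middle of the ranking.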

2.5 years ago
ATpoint 85k

There are basically two options: the fold change, or the pvalue/stat column (which are conceptually basically the same). Both have advantages and disadvantages.

The pvalue (not the padj) is continuous and takes into account all the biases and variables that went into the analysis, such as dispersion, magnitude of counts etc. That makes it a good ranking metric, but since in RNA-seq genes with large counts have more power than those with small counts, you have a power bias: pvalues tend to be smaller for genes with large counts than for genes with small counts at the same effect size (fold change). That is the disadvantage.

Then there is the fold change, which is also continuous and is intuitive because, unlike a pvalue, it gives a direct idea of the magnitude of the change. The problem is that fold changes are biased and tend to be larger for genes with small counts. That is taken into account during stat calculation by the common DE tools, yet for a global ranking the inflated fold changes for genes with small counts or large standard errors are still a problem. DESeq2 offers the lfcShrink procedure to address that, and the authors claim this removes that bias, making the shrunken logFCs a good ranking metric.

When I use DESeq2 with lfcShrink() I usually use the shrunken logFCs, because in my head that makes good sense. You can also use the stat column. If you use the pvalue (not padj, because that has ties), then use -log10(pvalue) and sign it, meaning put a minus in front if the logFC was negative. The edgeR authors regularly recommend the pvalue (browse support.bioconductor.org for threads on that). In the end there is no fixed rule other than not using padj/FDR, as that has many ties, and not using unshrunken fold changes, as these are biased. Results should make some biological sense, as always. Some people have tried to compensate for biased fold changes by using logFC * -log10(pvalue), which kind of penalizes large fold changes that have large (insignificant) pvalues. That is probably option 3 then.
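A small sketch of the pvalue-based options, again in plain Python with invented numbers rather than real DESeq2 output. It computes the signed -log10(pvalue), the product of logFC and -log10(pvalue) (written so that the sign follows the logFC), and shows why padj makes a poor ranking metric in this toy table: several genes share the same padj. The sketch assumes every pvalue is strictly greater than zero; a pvalue of exactly 0 would need separate handling.

```python
import math

# Hypothetical (gene, log2FoldChange, pvalue, padj); values are made up.
rows = [
    ("GeneA", 2.0, 1e-8, 1e-6),
    ("GeneB", -1.5, 1e-4, 5e-3),
    ("GeneC", 0.3, 0.20, 0.9),
    ("GeneD", -0.1, 0.50, 0.9),
    ("GeneE", 0.05, 0.80, 0.9),
]


def signed_neg_log10_p(lfc, pvalue):
    """-log10(pvalue), negated when the fold change is negative."""
    score = -math.log10(pvalue)  # assumes pvalue > 0
    return -score if lfc < 0 else score


def lfc_times_neg_log10_p(lfc, pvalue):
    """logFC weighted by -log10(pvalue); the sign comes from logFC."""
    return lfc * -math.log10(pvalue)  # assumes pvalue > 0


signed_p = {g: signed_neg_log10_p(lfc, p) for g, lfc, p, _ in rows}
weighted = {g: lfc_times_neg_log10_p(lfc, p) for g, lfc, p, _ in rows}

ranked_signed_p = sorted(signed_p, key=signed_p.get, reverse=True)
ranked_weighted = sorted(weighted, key=weighted.get, reverse=True)

# padj collapses several genes onto one value, so their order is arbitrary.
distinct_pvalues = len({p for _, _, p, _ in rows})
distinct_padj = len({padj for _, _, _, padj in rows})

print("signed -log10(p):", ranked_signed_p)
print("logFC * -log10(p):", ranked_weighted)
print("distinct pvalues:", distinct_pvalues, "distinct padj:", distinct_padj)
```

With either metric, upregulated genes with small pvalues land at the top, downregulated genes with small pvalues at the bottom, and everything non-significant sits near zero in the middle.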


(+1) Nitpicking point and I'm not sure I'm getting it right myself. You say:

unshrunken fold changes as these are biased

I think in statistical parlance unshrunk estimates are unbiased but have high variance. This means that if you repeat the experiment many times the average of the unshrunk estimates approaches the true value (it's unbiased) but each individual estimate jumps up and down a lot (it has high variance). Conversely, shrunk estimates sacrifice unbiasedness in order to reduce variance. If you repeat the experiment many times, the average of the shrunk estimates underestimates the true value (it's biased towards zero) but the individual estimates are stable across replicates (low variance). I think the concept here is bias-variance tradeoff.
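A toy simulation of that tradeoff, in plain Python. Everything here is an assumption made for illustration: the true effect, the noise level, and above all the shrinkage rule, which is just "multiply by 0.5". That is a stand-in and not what lfcShrink does (as far as I understand, DESeq2's shrinkage adapts to each gene's information content).

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

TRUE_LFC = 1.0   # hypothetical true log2 fold change
NOISE_SD = 2.0   # hypothetical per-experiment estimation noise
SHRINK = 0.5     # toy fixed shrinkage factor, NOT the lfcShrink algorithm
N_REPEATS = 10_000

# Pretend we repeat the experiment many times and estimate the LFC each time.
raw = [random.gauss(TRUE_LFC, NOISE_SD) for _ in range(N_REPEATS)]
shrunk = [SHRINK * estimate for estimate in raw]

bias_raw = statistics.mean(raw) - TRUE_LFC
bias_shrunk = statistics.mean(shrunk) - TRUE_LFC
var_raw = statistics.pvariance(raw)
var_shrunk = statistics.pvariance(shrunk)

print(f"raw:    bias={bias_raw:+.3f}  variance={var_raw:.3f}")
print(f"shrunk: bias={bias_shrunk:+.3f}  variance={var_shrunk:.3f}")
```

The raw estimates average out close to the true value but scatter widely, while the shrunk ones are pulled towards zero and scatter much less.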


I think in statistical parlance unshrunk estimates are unbiased but have high variance.

Yes, it may well be that I did not choose the proper terminology here. Let's better phrase it as "unshrunken fold changes have high standard errors". In DESeq2 that is, by the way, indicated in the lfcSE column. Large FCs with large SEs are consequently not a good choice for ranking, as the estimate is not reliable.
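To illustrate with invented numbers (plain Python, not real output): for the Wald test, the stat column is, as I understand it, the log2 fold change divided by lfcSE, so a large fold change with a large standard error gets a modest stat and drops down a stat-based ranking.

```python
# Hypothetical (gene, log2FoldChange, lfcSE); values are made up.
rows = [
    ("GeneNoisy", 4.0, 2.5),    # large fold change, large standard error
    ("GeneStable", 1.2, 0.15),  # moderate fold change, small standard error
]

# Wald-style statistic: estimate divided by its standard error.
stat = {gene: lfc / se for gene, lfc, se in rows}
lfc_only = {gene: lfc for gene, lfc, _ in rows}

by_lfc = sorted(lfc_only, key=lfc_only.get, reverse=True)
by_stat = sorted(stat, key=stat.get, reverse=True)

print("ranked by raw logFC:", by_lfc)
print("ranked by stat:     ", by_stat)
```

Ranking by the raw logFC puts the unreliable gene first, while ranking by stat puts the well-estimated gene first.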
