Question

fgsea after DESeq2: criteria for ranking DEGs

5

2.9 years ago

user230613 ▴ 380

Hi there!

I am following this tutorial to perform GSEA using the fgsea R package. In the tutorial, DEGs are ranked using the stat column from DESeq2. I have read in several other places that it is also common to rank by fold change, by p-value, or by a combination of the two. I would appreciate any guidance or comments in this regard, as I am quite new to GSEA.

Thank you

GSEA • 9.5k views

ADD COMMENT • link updated 2.9 years ago by ATpoint 88k • written 2.9 years ago by user230613 ▴ 380

1

For ranking genes, either the test statistic (the stat column in this case) or the fold change (the log2FoldChange column in this case) is generally used.

ADD REPLY • link 2.9 years ago by cpad0112 21k

0

Exactly, that is what I have read so far. Do you know if there is any recommendation on which one to use? Should I first select only those DEGs with pvalue < 0.05?

ADD REPLY • link 2.9 years ago by user230613 ▴ 380

0

I would prefer not to filter the genes. Use the entire gene list, but ranked. For ranking, you can use either one (stat or fold change). Go with the stat column, as the tutorial suggests.

ADD REPLY • link 2.9 years ago by cpad0112 21k

0

Thank you! I am just a bit hesitant about obtaining significant GSEA hits that are driven by genes with high p-values or with a log fold change close to zero.

ADD REPLY • link 2.9 years ago by user230613 ▴ 380

0

If you are not comfortable with stat, you can rank the genes by logFC and use the logFC-ranked list in GSEA. Ranking by logFC followed by GSEA is accepted in the scientific community.

ADD REPLY • link 2.9 years ago by cpad0112 21k

0

Just as an update, I have tried both, and the results are quite similar, so I will follow the tutorial and stick to stat (even if I consider that logFC is more "understandable"). Thanks for the help.

ADD REPLY • link 2.9 years ago by user230613 ▴ 380

1

Maybe of interest: GSEA PreRanked lists from DESeq2 results table. It looks like going with the stat column is a good choice. No pre-filtering of genes is necessary, but I'd filter out those with NA p-values.
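
As a rough sketch of that advice (drop only the genes with a missing p-value, keep everything else, rank by stat), here is the logic in plain Python. The rows, gene names, and values are made up for illustration, and the helper names `is_missing` and `ranked_by_stat` are hypothetical, not part of DESeq2 or fgsea. With a real DESeq2 results table, the same steps would apply to its stat and pvalue columns.

```python
import math

def is_missing(value):
    """True for None or NaN, standing in for NA in a results table."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def ranked_by_stat(rows):
    """Drop genes with a missing pvalue or stat, then sort by stat, highest first.

    No significance cutoff is applied: every gene with a usable statistic stays.
    """
    kept = [r for r in rows if not is_missing(r["pvalue"]) and not is_missing(r["stat"])]
    kept.sort(key=lambda r: r["stat"], reverse=True)
    return {r["gene"]: r["stat"] for r in kept}

# Made-up rows for illustration only.
rows = [
    {"gene": "GeneA", "stat": 5.2, "pvalue": 1e-7},
    {"gene": "GeneB", "stat": -3.1, "pvalue": 0.002},
    {"gene": "GeneC", "stat": 0.4, "pvalue": float("nan")},  # NA p-value: dropped
    {"gene": "GeneD", "stat": 1.1, "pvalue": 0.27},          # not significant: kept
]

ranks = ranked_by_stat(rows)
for gene, stat in ranks.items():
    print(gene, stat)
```

Note that GeneD stays in the list even though its p-value is high; only the gene with a missing p-value is removed.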

ADD REPLY • link 2.9 years ago by iraun 6.2k

score 6 · Answer 1 · 2022-06-09

There are basically two options: the fold change, or the pvalue/stat column (which are conceptually much the same). Both have advantages and disadvantages.

The pvalue (not the padj) is continuous and takes into account all the biases and variables that went into the analysis, such as dispersion, magnitude of counts, etc. That makes it a good ranking metric. However, in RNA-seq, genes with large counts have more power than those with small counts, so there is a power bias: at the same effect size (fold change), pvalues tend to be smaller for genes with large counts than for genes with small counts. That is the disadvantage.
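
To illustrate why the raw pvalue is preferred over padj for ranking, here is a minimal sketch of a Benjamini-Hochberg adjustment in plain Python. It assumes padj is a BH-style adjustment and ignores the independent filtering DESeq2 also applies. The function name `bh_adjust` and the p-values are made up for illustration.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity with a running minimum.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

# Made-up p-values: all five are distinct.
pvalues = [0.001, 0.02, 0.025, 0.03, 0.5]
padj = bh_adjust(pvalues)

print("distinct raw p-values:     ", len(set(pvalues)))
print("distinct adjusted p-values:", len(set(padj)))
```

The running minimum is what creates the ties: several genes with different raw p-values end up sharing one adjusted value, so their relative order in a ranked list becomes arbitrary.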

Then there is the fold change, which is also continuous and is intuitive because, unlike a pvalue, it gives a direct idea of the magnitude of the change. The problem is that fold changes are biased and tend to be larger for genes with small counts. The common DE tools take that into account when calculating the stat, yet for a global ranking the inflated fold changes of genes with small counts or large standard errors are still a problem. DESeq2 offers the lfcShrink procedure to address that, and its authors claim this removes the bias, making the shrunken logFCs a good ranking metric.
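
A small sketch of how the stat accounts for noisy fold changes, assuming the default Wald test, where the statistic is the log2 fold change divided by its standard error (the log2FoldChange and lfcSE columns). The two genes and their numbers are made up for illustration, and this does not reproduce what lfcShrink does.

```python
# Made-up values: a low-count gene with a large but noisy fold change,
# and a high-count gene with a modest but precisely estimated one.
genes = {
    "low_count_gene": {"log2FoldChange": 4.0, "lfcSE": 2.0},
    "high_count_gene": {"log2FoldChange": 1.0, "lfcSE": 0.25},
}

def wald_stat(row):
    """Wald statistic: effect size scaled by its standard error."""
    return row["log2FoldChange"] / row["lfcSE"]

by_lfc = sorted(genes, key=lambda g: genes[g]["log2FoldChange"], reverse=True)
by_stat = sorted(genes, key=lambda g: wald_stat(genes[g]), reverse=True)

print("ranked by log2FoldChange:", by_lfc)
print("ranked by stat:          ", by_stat)
```

Ranking by the unshrunken fold change puts the noisy low-count gene on top, whereas ranking by the stat reverses the order because the large standard error is taken into account.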

When I use DESeq2 with lfcShrink() I usually use the shrunken logFCs, because in my head that makes good sense. You can also use the stat column. If you use the pvalue (not padj, because that has ties), use -log10(pvalue) and sign it, meaning make it negative if the logFC was negative. The edgeR authors regularly recommend the pvalue (browse support.bioconductor.org for threads on that). In the end there is no fixed rule other than not using padj/FDR, as that has many ties, and not using unshrunken fold changes, as these are biased. Results should make some biological sense, as always. Some people have tried to compensate for biased fold changes by using logFC * -log10(pvalue), which penalizes large fold changes that have large (insignificant) pvalues; that is probably option 3.
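
The two pvalue-based metrics mentioned above can be sketched in plain Python as follows. Both use -log10(pvalue), so that upregulated genes keep a positive score. The function names and example genes are made up for illustration, and clamping a pvalue of exactly 0 to the smallest positive float is an assumption of this sketch, not something either DESeq2 or fgsea prescribes.

```python
import math
import sys

def signed_neg_log10_p(log_fc, pvalue):
    """-log10(pvalue), made negative when the fold change is negative."""
    p = max(pvalue, sys.float_info.min)  # assumption: clamp p = 0 to avoid log10(0)
    score = -math.log10(p)
    return -score if log_fc < 0 else score

def lfc_times_neg_log10_p(log_fc, pvalue):
    """logFC weighted by significance; large logFCs with large pvalues shrink toward 0."""
    p = max(pvalue, sys.float_info.min)  # same clamping assumption as above
    return log_fc * -math.log10(p)

# Made-up genes: (logFC, pvalue)
genes = {
    "up_strong": (2.0, 1e-6),
    "up_noisy": (3.0, 0.5),      # largest logFC, but not significant
    "down_strong": (-1.5, 1e-4),
}

signed = {g: signed_neg_log10_p(lfc, p) for g, (lfc, p) in genes.items()}
product = {g: lfc_times_neg_log10_p(lfc, p) for g, (lfc, p) in genes.items()}

print(sorted(signed, key=signed.get, reverse=True))
print(sorted(product, key=product.get, reverse=True))
```

In this toy example both metrics rank up_noisy below up_strong, even though it has the largest logFC, and place down_strong at the bottom of the list.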