This seems to be a simple question.
You have a list of genes from a DEG analysis, with p-values, FDRs, logFCs, etc. Previously, for GSEA I would keep genes with FDR < 0.25 or 0.05, rank them by logFC (in other words, pre-rank the genes by logFC), and then run GSEA. Now I am wondering whether this is a good approach:
- There might be too many genes (typically ~50%). Assuming there are usually 4~5 pathways involved and each pathway has about 500 genes, the top 2,000 genes might be enough to include in the GSEA analysis.
- I am not sure logFC is the best way to rank genes. Maybe use -log(PValue) as the magnitude of the rank score and the sign of logFC as the sign of the score? I.e., use sign(logFC) * (-log(PValue)) as the rank score?
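To make the two candidate rankings concrete, here is a minimal Python sketch. The gene names and values are made up purely for illustration, `rank_score` is a hypothetical helper, and base-10 log is an assumption (any base gives the same ordering). The floor on the p-value is also an assumption, added only so that a p-value reported as 0 does not break the log.

```python
import math

def rank_score(log_fc, p_value, min_p=1e-300):
    """Signed significance score: sign(logFC) * -log10(p-value)."""
    p = max(p_value, min_p)             # guard against p-values reported as 0
    sign = (log_fc > 0) - (log_fc < 0)  # +1, -1, or 0
    return sign * -math.log10(p)

# Made-up DEG results for illustration: (gene, logFC, p-value)
deg = [
    ("GENE_A", 2.5, 1e-2),
    ("GENE_B", 0.8, 1e-8),
    ("GENE_C", -1.2, 1e-5),
    ("GENE_D", -3.0, 5e-2),
]

# Ranking 1: by logFC only (most up-regulated first)
by_logfc = [gene for gene, fc, p in sorted(deg, key=lambda r: r[1], reverse=True)]

# Ranking 2: by sign(logFC) * -log10(p-value)
by_score = [
    gene
    for gene, fc, p in sorted(deg, key=lambda r: rank_score(r[1], r[2]), reverse=True)
]

print(by_logfc)
print(by_score)
```

With these toy numbers the two orderings differ: a gene with a modest logFC but a very small p-value moves to the top of the score-based list, which is exactly the trade-off I am asking about.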
Googled briefly but didn't find a convention.
Thanks.
Your first point asks about a good threshold or filter for your gene list. Typically, this depends on what you're interested in. For example, you could be interested in only the strongest effects and therefore keep only the genes with the most extreme logFC. I could also imagine situations in which you are only interested in certain categories of genes, maybe because you have some prior knowledge.
On the second point, you have to consider what the parameter used for ranking represents: logFC represents the strength of the effect, while -log(p-value) represents "unexpectedness". To me, effect strength is more relevant than the p-value because, without any other information, I wouldn't trust a small variation even if it is associated with a small p-value. Another way of putting it is that statistical significance doesn't imply biological relevance, but a strong effect is likely to have some biological impact.
Please see my reply below to igor -- in my experience, "true signals" (low p-values) should be weighted much more heavily than "big signals" (large abs(logFC) values).