Which Should Take Priority When Selecting DEGs: FDR or Log Fold Change?
9 weeks ago
bioinfo1994 ▴ 20

Hello, I have RNA-seq data from two strains (three replicates each), one of which exhibits a specific phenotype. I aim to explore the genetic background responsible for this phenotype. I've performed differential gene expression analysis using both DESeq2 and edgeR. Now, I'm trying to determine how to select the top differentially expressed genes (DEGs).

Some studies apply a fold-change cutoff (|logFC| > 1) and then rank the remaining genes by p-value (FDR), while others apply a statistical significance threshold (FDR < 0.05) and then rank the significant genes by fold change. What is the best approach for selecting DEGs, and why?

Additionally, how should I proceed with further analysis of the top DEGs? Should I investigate the pathways associated with each gene in the DEG list?

Thank you for your assistance!

9 weeks ago
Gordon Smyth ★ 7.7k

See https://bioconductor.org/packages/release/workflows/vignettes/RnaSeqGeneEdgeRQL/inst/doc/edgeRQL.html for an example workflow, which includes ranking of genes and pathway analyses by GO, KEGG and by focused gene sets.

Unless you have an overwhelming amount of DE, we recommend ranking by FDR and not by fold-change at all.

In my opinion, the use of logFC=1 cutoffs in the literature is arbitrary and counter-productive because it interacts badly with modern empirical Bayes differential expression tests as implemented in limma, edgeR or DESeq2. The 2-fold cutoff comes from the original microarray papers pre-2000 that had no replicates and no statistical tests. Later on, people started to include replicates and ordinary t-tests. It can be shown that ordinary t-tests or Welch t-tests can be improved by adding a logFC cutoff, and that a moderate logFC cutoff can decrease the FDR. This is because a t-test can be highly significant even for very small logFC if the standard deviation is also very small, just by chance.
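To make the last point concrete, here is a minimal sketch of an ordinary pooled-variance t-statistic for two made-up genes with three replicates per group. The log-expression values and the helper name `pooled_t` are hypothetical and chosen only for illustration. The gene with a tiny logFC but an accidentally tiny standard deviation gets the larger t-statistic.

```python
import math
import statistics


def pooled_t(group_a, group_b):
    """Return (logFC, ordinary two-sample t-statistic) using a pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    diff = statistics.mean(group_b) - statistics.mean(group_a)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return diff, diff / se


# Hypothetical log2 expression values, three replicates per strain.
# Gene 1: logFC of about 0.1, replicates that agree almost perfectly by chance.
lfc_small, t_small = pooled_t([5.00, 5.01, 5.02], [5.10, 5.11, 5.12])
# Gene 2: logFC of 2, with a more typical replicate spread.
lfc_large, t_large = pooled_t([4.5, 5.0, 5.5], [6.5, 7.0, 7.5])

print(f"gene 1: logFC={lfc_small:.2f}, t={t_small:.2f}")
print(f"gene 2: logFC={lfc_large:.2f}, t={t_large:.2f}")
```

With these numbers, gene 1 outranks gene 2 on the ordinary t-test despite a 20-fold smaller logFC, which is the situation a logFC cutoff was meant to guard against.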

The empirical Bayes tests implemented by limma and edgeR are different, however. They take into account logFC, standard deviation and expression level all together when evaluating significance. They already achieve a compromise (optimal in a certain sense) between ranking by logFC and ranking by t-test. They put a floor under the posterior variances and do not allow a gene to be significant with an extremely small logFC. So adding a logFC cutoff is no longer necessary and instead becomes counter-productive.
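The variance floor can be sketched with the limma-style posterior variance, s̃² = (d0·s0² + d·s²) / (d0 + d), where d and s² are a gene's residual degrees of freedom and sample variance, and d0 and s0² are the prior degrees of freedom and prior variance. In the sketch below the prior values (d0 = 4, s0² = 0.04), the two genes' summary statistics, and the function name `moderated_t` are all assumptions made for illustration; in practice the prior is estimated from all genes, and edgeR applies the same idea to dispersions rather than to log-scale variances.

```python
import math


def moderated_t(logfc, s2, df, prior_df, prior_s2, n_per_group):
    """Moderated t-statistic for a two-group comparison with equal group sizes.

    The gene-wise variance s2 is squeezed towards the prior variance before
    the standard error is computed, so it can never collapse to near zero.
    """
    posterior_var = (prior_df * prior_s2 + df * s2) / (prior_df + df)
    se = math.sqrt(posterior_var * 2 / n_per_group)
    return logfc / se


# Assumed prior, for illustration only.
PRIOR_DF, PRIOR_S2 = 4, 0.04

# Hypothetical gene 1: logFC 0.1, sample variance 0.0001 (tiny by chance).
t_small = moderated_t(0.10, 0.0001, 4, PRIOR_DF, PRIOR_S2, 3)
# Hypothetical gene 2: logFC 2, sample variance 0.25.
t_large = moderated_t(2.0, 0.25, 4, PRIOR_DF, PRIOR_S2, 3)

print(f"gene 1 moderated t = {t_small:.2f}")
print(f"gene 2 moderated t = {t_large:.2f}")
```

Under these assumed values the gene with the tiny logFC drops to a moderated t below 1, while the gene with logFC of 2 stays clearly significant, so no separate logFC cutoff is needed to demote the first gene.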

Giving priority to genes with very large logFC but moderate FDR has the effect of prioritizing lowly expressed genes over highly expressed genes, which tends to select genes that are of less rather than more biological importance.

Another problem is that the Benjamini-Hochberg algorithm for evaluating FDR is based on tail-probabilities for the ranked gene list. If you filter genes from the list after computing the FDR then the whole FDR calculation is invalidated for genes originally further down in the list. It is easy to show that you can actually increase the overall FDR, sometimes substantially, by removing genes with lower logFC values from the list.
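A small sketch may help here. The function below is a plain-Python version of the Benjamini-Hochberg adjustment, and it shows that each adjusted value depends on the gene's rank and on the length of the whole list. The p-values and the gene counts in the second half are made up purely to show the arithmetic; they are not results from any dataset.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
full = bh_adjust(pvals)
# The same genes minus the last one: every adjusted value changes.
subset = bh_adjust(pvals[:4])
print([round(x, 5) for x in full])
print([round(x, 5) for x in subset])


def false_discovery_proportion(n_false, n_called):
    """Fraction of called genes that are false discoveries."""
    return n_false / n_called


# Hypothetical scenario: 100 genes called at FDR < 0.05, 5 of them false.
before = false_discovery_proportion(5, 100)
# A logFC filter then drops 60 genes, of which only 1 was a false discovery.
after = false_discovery_proportion(5 - 1, 100 - 60)
print(f"before filter: {before:.2f}, after filter: {after:.2f}")
```

In this made-up scenario the filter removes mostly true positives with modest logFC, so the proportion of false discoveries among the genes that remain doubles even though every one of them still carries an FDR value below 0.05.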

Another consideration that highlights the arbitrariness of the logFC cutoff is the fact that logFCs output from limma, edgeR and DESeq2 are not raw log-fold-changes but are rather shrunk to a greater or lesser degree. In edgeR, the amount of shrinkage is user-specified. Which genes will satisfy a logFC=1 cutoff depends on how much shrinkage has been done, but unfortunately people in the literature just seem to apply the logFC=1 cutoff blindly without considering what the estimated logFCs actually mean.
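As a rough illustration of how shrinkage interacts with a fixed cutoff, the sketch below adds a prior count to both group means before taking the log-ratio. This is a deliberately simplified stand-in and not edgeR's actual calculation; the function name `shrunk_logfc`, the mean counts, and the prior counts are hypothetical.

```python
import math


def shrunk_logfc(mean_a, mean_b, prior_count):
    """log2 fold change after adding the same prior count to both group means."""
    return math.log2((mean_b + prior_count) / (mean_a + prior_count))


# Two hypothetical genes with the same raw 2.5-fold change.
low_gene = (2, 5)       # lowly expressed
high_gene = (200, 500)  # highly expressed

for prior in (0.125, 2, 5):
    low = shrunk_logfc(*low_gene, prior)
    high = shrunk_logfc(*high_gene, prior)
    print(f"prior={prior}: low-count logFC={low:.2f}, high-count logFC={high:.2f}")
```

With a small prior count both genes pass a logFC=1 cutoff, but with a prior count of 2 or more the lowly expressed gene no longer does, while the highly expressed gene is barely affected. The cutoff therefore selects a different gene list depending on a tuning choice that is rarely reported.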


Thank you! The guide has been really helpful in teaching me how to use edgeR. However, I'm still puzzled about why ranking by FDR makes more sense than ranking by logFC, especially after filtering for FDR < 0.05. If I rank by FDR, I could end up with genes that have a low FDR (e.g., 0.003) but a relatively low logFC (e.g., 1). On the other hand, if I rank by logFC, I might identify genes with a very high logFC (e.g., 8) but a slightly higher FDR (e.g., 0.04), which would still be statistically significant.

Why is it more important to focus on the smallest FDR in the top X genes list rather than prioritizing those with the highest logFC? Doesn’t it seem more biologically relevant to highlight genes with the greatest logFC, as long as they pass the FDR < 0.05 threshold?


Because it is too simplistic to focus on one factor (logFC) in isolation, instead of considering logFC, dispersion and expression level all together, which is what edgeR does for you. In the scenario you describe, the gene with logFC=8 must be very lowly expressed for it to give such a high FDR and is probably completely absent in one condition. The gene with logFC=1 is probably highly expressed. I have analysed thousands of RNA-seq experiments and published hundreds of papers on cancer, immunology and other diseases, but I've never seen a context in which a very lowly expressed gene that just happens to be undetected in one condition is necessarily more biologically important than a highly expressed gene with a still substantial fold-change.
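The link between a huge logFC, a low count and only modest significance can be sketched with a simple conditional binomial test on total counts per condition. This is a simplifying assumption: it treats counts as Poisson with equal library sizes and ignores biological variation, so it is not what edgeR computes, and real p-values for replicated data would be less extreme. The counts and the function names are hypothetical.

```python
import math


def binomial_two_sided_p(count_a, count_b):
    """Two-sided p-value for an equal split of count_a + count_b reads."""
    n = count_a + count_b
    smaller = min(count_a, count_b)
    tail = sum(math.comb(n, k) for k in range(smaller + 1))
    return min(1.0, 2 * tail / 2**n)


def simple_logfc(count_a, count_b, prior_count=0.5):
    """log2 ratio with a small prior count so that zero counts stay finite."""
    return math.log2((count_b + prior_count) / (count_a + prior_count))


# Hypothetical total counts summed over replicates in each condition.
genes = {
    "low, absent in A": (0, 12),
    "high, 2-fold": (400, 800),
}

for name, (a, b) in genes.items():
    print(f"{name}: logFC={simple_logfc(a, b):.2f}, p={binomial_two_sided_p(a, b):.3g}")
```

Even under this optimistic model the gene with 12 reads in one condition and none in the other has a far larger logFC but a far weaker p-value than the gene that doubles from 400 to 800 reads, simply because there is so little data behind it.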


Ah, I see it now! Thank you so much for taking the time to explain everything in such detail. I truly appreciate your help!
