See https://bioconductor.org/packages/release/workflows/vignettes/RnaSeqGeneEdgeRQL/inst/doc/edgeRQL.html for an example workflow, which includes ranking of genes and pathway analyses by GO, KEGG and by focused gene sets.
Unless you have an overwhelming amount of DE, we recommend ranking by FDR and not by fold-change at all.
In my opinion, the use of logFC=1 cutoffs in the literature is arbitrary and counter-productive because it interacts badly with modern empirical Bayes differential expression tests as implemented in limma, edgeR or DESeq2. The 2-fold cutoff comes from the original microarray papers pre-2000 that had no replicates and no statistical tests. Later on, people started to include replicates and ordinary t-tests. It can be shown that ordinary t-tests or Welch t-tests can be improved by adding a logFC cutoff, and that a moderate logFC cutoff can decrease the FDR. This is because a t-test can be highly significant even for very small logFC if the standard deviation is also very small, just by chance.
The empirical Bayes tests implemented by limma and edgeR are different, however. They take into account logFC, standard deviation and expression level all together when evaluating significance. They already achieve a compromise (optimal in a certain sense) between ranking by logFC and ranking by t-test. They put a floor under the posterior variances and do not allow a gene to be significant with an extremely small logFC. So adding a logFC cutoff is no longer necessary and instead becomes counter-productive.
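To make the contrast concrete, here is a minimal Python sketch comparing an ordinary two-group t-statistic with a moderated one whose variance is squeezed towards a prior value. The posterior variance uses the limma-style form (d0*s0^2 + d*s^2)/(d0 + d). The prior values d0 and s0_sq and the example gene are made-up numbers for illustration only, and this is not the actual limma or edgeR code.

```python
import math

def ordinary_t(logfc, s2, n_per_group):
    """Two-group t-statistic using only the gene's own pooled variance s2."""
    se = math.sqrt(s2 * 2.0 / n_per_group)
    return logfc / se

def moderated_t(logfc, s2, n_per_group, d0, s0_sq):
    """t-statistic using a posterior variance shrunk towards the prior s0_sq.

    d0 and s0_sq are hypothetical prior degrees of freedom and prior variance;
    in practice they would be estimated from all genes together.
    """
    d = 2 * n_per_group - 2                      # residual df for two groups
    s2_post = (d0 * s0_sq + d * s2) / (d0 + d)   # weighted average of variances
    se = math.sqrt(s2_post * 2.0 / n_per_group)
    return logfc / se

# Hypothetical gene: tiny logFC with a tiny variance that arose by chance.
tiny_logfc = 0.05
tiny_s2 = 0.0001
t_ord = ordinary_t(tiny_logfc, tiny_s2, n_per_group=3)
t_mod = moderated_t(tiny_logfc, tiny_s2, n_per_group=3, d0=4.0, s0_sq=0.04)
print(f"ordinary t = {t_ord:.2f}, moderated t = {t_mod:.2f}")
```

With these assumed numbers the ordinary t-statistic is above 6 while the moderated one is below 0.5, because the prior variance acts as a floor under the gene's own very small variance.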
Giving priority to genes with very large logFC but moderate FDR has the effect of prioritizing lowly expressed genes over highly expressed genes, which tends to select genes that are of less rather than more biological importance.
Another problem is that the Benjamini-Hochberg algorithm for evaluating FDR is based on tail-probabilities for the ranked gene list. If you filter genes from the list after computing the FDR then the whole FDR calculation is invalidated for genes originally further down in the list. It is easy to show that you can actually increase the overall FDR, sometimes substantially, by removing genes with lower logFC values from the list.
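The dependence on the ranked list can be sketched in a few lines of Python. The first function is a plain Benjamini-Hochberg adjustment, in which the adjusted value at rank k is n*p/k (made monotone from the bottom of the list), so the denominator k counts every gene ranked at or above that point. The second function is a hypothetical helper, not part of any package, that gives a worst-case figure: it assumes the logFC filter removes only true discoveries, so the expected false discoveries are spread over fewer retained genes. The numbers in the example are invented.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def worst_case_fdr_after_filter(q, n_discoveries, n_kept):
    """Worst-case FDR if a filter keeps n_kept of n_discoveries genes.

    Assumes (pessimistically) that none of the removed genes were false
    discoveries, so roughly q * n_discoveries false ones all remain.
    """
    expected_false = q * n_discoveries
    return min(1.0, expected_false / n_kept)

example_adjusted = bh_adjust([0.01, 0.02, 0.03, 0.04, 0.5])
# Invented scenario: 200 genes at FDR 0.05, of which 40 pass a logFC cutoff.
example_worst_case = worst_case_fdr_after_filter(0.05, 200, 40)
print(example_adjusted, example_worst_case)
```

In the invented scenario, about 10 false discoveries are expected among 200 genes; if all of them survive a filter that keeps only 40 genes, the proportion rises from 0.05 to about 0.25. Real data will usually fall somewhere between the two figures.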
Another consideration that highlights the arbitrariness of the logFC cutoff is the fact that logFCs output from limma, edgeR and DESeq2 are not raw log-fold-changes but are rather shrunk to a greater or lesser degree. In edgeR, the amount of shrinkage is user-specified. Which genes will satisfy a logFC=1 cutoff depends on how much shrinkage has been done, but unfortunately people in the literature just seem to apply the logFC=1 cutoff blindly without considering what the estimated logFCs actually mean.
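A deliberately simplified Python sketch shows why the cutoff depends on the shrinkage. It adds the same prior count to both groups before taking the log-ratio. This is only in the spirit of a prior count and ignores library sizes, dispersion and everything else the real packages do; the function name and the read counts are hypothetical.

```python
import math

def shrunk_logfc(count_a, count_b, prior_count):
    """log2 ratio after adding the same prior count to both groups."""
    return math.log2((count_a + prior_count) / (count_b + prior_count))

# Hypothetical low-count gene: 9 reads versus 4 reads.
light_shrinkage = shrunk_logfc(9, 4, prior_count=0.125)
heavy_shrinkage = shrunk_logfc(9, 4, prior_count=2.0)

# Hypothetical gene absent in one condition: 30 reads versus 0 reads.
absent_light = shrunk_logfc(30, 0, prior_count=0.125)
absent_heavy = shrunk_logfc(30, 0, prior_count=2.0)

print(light_shrinkage, heavy_shrinkage, absent_light, absent_heavy)
```

With these made-up counts the first gene is above logFC=1 under the lighter shrinkage and below it under the heavier one, and the gene with zero reads in one condition moves from a logFC of nearly 8 down to 4. The same data therefore pass or fail a fixed logFC=1 cutoff depending on a tuning choice.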
Thank you! The guide has been really helpful in teaching me how to use edgeR. However, I'm still puzzled about why ranking by FDR makes more sense than ranking by logFC, especially after filtering for FDR < 0.05. If I rank by FDR, I could end up with genes that have a low FDR (e.g., 0.003) but a relatively low logFC (e.g., 1). On the other hand, if I rank by logFC, I might identify genes with a very high logFC (e.g., 8) but a slightly higher FDR (e.g., 0.04), which would still be statistically significant.
Why is it more important to focus on the smallest FDR in the top X genes list rather than prioritizing those with the highest logFC? Doesn’t it seem more biologically relevant to highlight genes with the greatest logFC, as long as they pass the FDR < 0.05 threshold?
Because it is too simplistic to focus on one factor (logFC) in isolation, instead of considering logFC, dispersion and expression level all together, which is what edgeR does for you. In the scenario you describe, the gene with logFC=8 must be very lowly expressed for it to give such a high FDR despite the large fold-change, and it is probably completely absent in one condition. The gene with logFC=1 is probably highly expressed. I have analysed thousands of RNA-seq experiments and published hundreds of papers on cancer, immunology and other diseases, but I've never seen a context in which a very lowly expressed gene that just happens to be undetected in one condition is necessarily more biologically important than a highly expressed gene with a still substantial fold-change.
Ah, I see it now! Thank you so much for taking the time to explain everything in such detail. I truly appreciate your help!