Dear all,
I want to filter DE genes by log2FC > 2/-2
and p-value 0.0001
. Why this two versions of lof2FC and p-value filtering differ in up- and down-regulated genes numbers (from 800 to 1500) and which one is the right one ?
1. cond1 <- results(dds, contrast=c("condition","treated","untreated"), alpha = 0.0001, lfcThreshold = 1)
2. cond1 <- results(dds, contrast=c("condition","treated","untreated")
resLFC <- lfcShrink(dds, contrast=contrast, res=cond1, type="ashr")
up <- cond1[which(resLFC$log2FoldChange > 2 & resLFC$padj < 0.0001),]
down <- cond1[which(resLFC$log2FoldChange < -2 & resLFC$padj < 0.0001),]
I was reading this discussion, but it not clear yet. https://support.bioconductor.org/p/101504/
Consider the following 4 genes:
gene 1 has an adjusted P-value of greater than 0.01, and so is not selected as significant by either method. gene 2's confidence intervals do not overlap 0, so the adjust P-value is less than 0.01, but the MLE logFoldChange estimate is less than 1, so it is not selected as significant by either method either. However, gene 3's confidence interval doesn't overlap 0, so it has a p-value less than 0.01, but in addition, the maximum likelihood estimate of the log2FoldChange is greater than 1, so it is selected as significant by your second method (which filters genes with padj < 0.01 and lfc > 1). However, the first method (using lfcThreshold in results) doesn't do the same thing. It tests agains the null hypothesis that the abs(log2FoldChange) < 1. Because the confidence interval for gene 3 overlaps into the abs(log2foldChange) < 1 region, then the p-value in this case > 0.01. That is, we can be 99% confident the log2FoldChage isn't 0, and our best guess at the log2FoldChange is greater than 1, but we can't be 99% confident that the log2FoldChange is greater than 1. So it would not be selected as significant by the first method. By contrast, the lower bound of gene 4's confidence limit is greater than 1, and so it is selected as significant by both methods. We can be 99% confident* that the log2FoldChange for gene 4 is greater than 1.
* = Yes, I'm aware thats not exactly what confidence limits mean, but I think the simplification makes the principle in question here easier to understand.
So basically from your reply I get that for differential expression the second method is good enough?While the first one is not aimed to do DE and I loss some good genes? right?
The first method gives you genes that are definately DE, and where our best guess of the is that the logFoldChange is greater than 1 (either up or down), but it doesn't guarentee the logFoldChange is greater than 1. As to which you should use? Personally I preffer the first- if you have some biological reason to think that only changes greater than 1 are worth being intersted in, then I'd want to be confident that the changes were greater than 1. On the other hand, the second method has been the standard way of doing things in the field for years and if your purpose is to narrow the set of DE genes down to a more managable size (rather than having some reason to believe lfc>1 has some biologcial meaning), then its probably perfectly acceptable, particularly when combined with lfcShrink.
super clear explanation, thanks a lot!
Yes, I appreciate a lot this simplification explanation method!