Question

DESeq2 log2FC and p-value filtering

3

Entering edit mode

3.2 years ago

User000 ▴ 710

Dear all,

I want to filter DE genes by log2FC > 2/-2 and p-value 0.0001. Why this two versions of lof2FC and p-value filtering differ in up- and down-regulated genes numbers (from 800 to 1500) and which one is the right one ?

1. cond1 <- results(dds, contrast=c("condition","treated","untreated"), alpha = 0.0001, lfcThreshold = 1)

2. cond1 <-  results(dds, contrast=c("condition","treated","untreated")
    resLFC <- lfcShrink(dds, contrast=contrast, res=cond1, type="ashr")
    up <- cond1[which(resLFC$log2FoldChange > 2 & resLFC$padj < 0.0001),]
    down <- cond1[which(resLFC$log2FoldChange < -2 & resLFC$padj < 0.0001),]

DESeq2 RNAseq • 3.8k views

ADD COMMENT • link updated 3.2 years ago by i.sudbery 20k • written 3.2 years ago by User000 ▴ 710

score 0 · Answer 1 · 2021-09-16

0

Entering edit mode

3.2 years ago

Zhilong Jia ★ 2.2k

The difference between you two cond1s is parameter lfcThreshold

From the help Documentation of results:

lfcThreshold: a non-negative value which specifies a log2 fold change threshold. The default value is 0, corresponding to a test that the log2 fold changes are equal to zero. The user can specify the alternative hypothesis using the altHypothesis argument, which defaults to testing for log2 fold changes greater in absolute value than a given threshold. If lfcThreshold is specified, the results are for Wald tests, and LRT p-values will be overwritten.

ADD COMMENT • link 3.2 years ago by Zhilong Jia ★ 2.2k

0

Entering edit mode

I was reading this discussion, but it not clear yet. https://support.bioconductor.org/p/101504/

ADD REPLY • link 3.2 years ago by User000 ▴ 710

3

Entering edit mode

Consider the following 4 genes:

gene 1           |-¦----x------|¦ 
gene 2             ¦ |------x---¦--|
gene 3             ¦        |---¦--x------|
gene 4             ¦            ¦ |------x------|
      --------------------------------------------------
      |            |            |            |
      -1           0            1            2
                     log2FoldChange

      x = Maximum Likelihood Estimate of Log2FoldChange (recorded in log2FoldChange column)
      |----x----|  = 99% confidence interval on the log2foldChange

gene 1 has an adjusted P-value of greater than 0.01, and so is not selected as significant by either method. gene 2's confidence intervals do not overlap 0, so the adjust P-value is less than 0.01, but the MLE logFoldChange estimate is less than 1, so it is not selected as significant by either method either. However, gene 3's confidence interval doesn't overlap 0, so it has a p-value less than 0.01, but in addition, the maximum likelihood estimate of the log2FoldChange is greater than 1, so it is selected as significant by your second method (which filters genes with padj < 0.01 and lfc > 1). However, the first method (using lfcThreshold in results) doesn't do the same thing. It tests agains the null hypothesis that the abs(log2FoldChange) < 1. Because the confidence interval for gene 3 overlaps into the abs(log2foldChange) < 1 region, then the p-value in this case > 0.01. That is, we can be 99% confident the log2FoldChage isn't 0, and our best guess at the log2FoldChange is greater than 1, but we can't be 99% confident that the log2FoldChange is greater than 1. So it would not be selected as significant by the first method. By contrast, the lower bound of gene 4's confidence limit is greater than 1, and so it is selected as significant by both methods. We can be 99% confident* that the log2FoldChange for gene 4 is greater than 1.

* = Yes, I'm aware thats not exactly what confidence limits mean, but I think the simplification makes the principle in question here easier to understand.

ADD REPLY • link 3.2 years ago by i.sudbery 20k

0

Entering edit mode

So basically from your reply I get that for differential expression the second method is good enough?While the first one is not aimed to do DE and I loss some good genes? right?

ADD REPLY • link 3.2 years ago by User000 ▴ 710

1

Entering edit mode

The first method gives you genes that are definately DE, and where our best guess of the is that the logFoldChange is greater than 1 (either up or down), but it doesn't guarentee the logFoldChange is greater than 1. As to which you should use? Personally I preffer the first- if you have some biological reason to think that only changes greater than 1 are worth being intersted in, then I'd want to be confident that the changes were greater than 1. On the other hand, the second method has been the standard way of doing things in the field for years and if your purpose is to narrow the set of DE genes down to a more managable size (rather than having some reason to believe lfc>1 has some biologcial meaning), then its probably perfectly acceptable, particularly when combined with lfcShrink.

ADD REPLY • link 3.2 years ago by i.sudbery 20k

0

Entering edit mode

super clear explanation, thanks a lot!

ADD REPLY • link 3.2 years ago by User000 ▴ 710

0

Entering edit mode

Yes, I appreciate a lot this simplification explanation method!

ADD REPLY • link 3.2 years ago by User000 ▴ 710