Why is getting logFC=20 causing a "problem" for you? You would only get such a large logFC from edgeR if the counts are zero in one group and in the hundreds of thousands in the other. Getting a large logFC seems a fair representation in such cases.
Anyway, you can shrink the log-fold-changes by as much or as little as you want in edgeR. Just use a larger value of prior.count
when you run glmFit
(or glmQLFit
). Or you can use the predFC
function, which will give the same result as glmFit
.
By default, the model fitting functions in edgeR are set to do light shrinkage of the logFCs by setting prior.count = 0.125
. If you increase prior.count
to 5, that will do quite heavy shrinkage. Note that shrinkage is not the same as simply multiplying the logFC by a scaling factor. Increasing prior.count
has much more effect on the logFC for low count genes than on logFCs for highly expressed genes because the logFC for low count genes are less certain. Changing prior.count
only affects the estimated logFCs. All the other output including test statistics and p-values remain the same.
The reason why the argument is called prior.count
is because the shrinkage corresponds to a Bayesian model where the size of the counts in a prior dataset encapsulates the strength of the prior belief that the logFCs should be close to zero. Long ago (15 years ago) we toyed around with the idea of estimating an optimal value for the prior.count automatically from the data (it's a chapter in Belinda Phipson's thesis) but eventually decided it would be safer and better to use preset values in the public edgeR package.
To illustrate the use of prior.count
, here is a toy example where edgeR does return logFC=20:
> counts <- matrix(c(0,1000,135000,2010),2,2)
> colnames(counts) <- c("A","B")
> rownames(counts) <- c("Gene1","Gene2")
> counts
A B
Gene1 0 135000
Gene2 1000 2010
> group <- c("A","B")
> design <- model.matrix(~group)
> fit <- glmFit(counts,design,dispersion=0,lib.size=rep(10e6,2))
> lrt <- glmLRT(fit)
> topTags(lrt)
Coefficient: groupB
logFC logCPM LR PValue FDR
Gene1 20.0426 12.7207 187149.74 0.0000e+00 0.0000e+00
Gene2 1.0071 7.2355 345.57 3.9103e-77 3.9103e-77
Changing the default setting to prior.count=5
shrinks the first logFC considerably while leaving the second almost unaffected:
> fit <- glmFit(counts,design,dispersion=0,prior.count=5,lib.size=rep(10e6,2))
> lrt <- glmLRT(fit)
> topTags(lrt)
Coefficient: groupB
logFC logCPM LR PValue FDR
Gene1 14.7207 12.7207 187149.74 0.0000e+00 0.0000e+00
Gene2 1.0036 7.2355 345.57 3.9103e-77 3.9103e-77
Please check out the post here in Bioconductor. It has been very nicely explained what you could do or should not do.
Thank you for the reply
Thank you for your reply.