Question

Using A Variable As A Blocking Factor Vs Putting It As A Term In The Model Formula?

21

Entering edit mode

12.1 years ago

Ryan Thompson ★ 3.6k

I am doing comparisons of edgeR, DESeq, and limma for RNA-Seq data (yes, I know limma is not really recommended for RNA-Seq, but I'd like to benchmark them). My experimental design has an experimental factor (Timepoint) and a blocking factor (Donor), something like this:

> expdata
       Donor Timepoint
L3_E1   4659        T0
L3_E2   4659      T120
L3_E5   5131        T0
L3_E6   5131      T120
L4_E1   4659        T0
L4_E2   4659      T120
L4_E5   5131        T0
L4_E6   5131      T120
L5_E9   5291        T0
L5_E10  5291      T120
L5_E13  5053        T0
L5_E14  5053      T120
L6_E9   5291        T0
L6_E10  5291      T120
L6_E13  5053        T0
L6_E14  5053      T120

Which you can construct using this code (obtained via `deparse

expdata <- structure(list(Donor = structure(c(1L, 1L, 3L, 3L, 1L, 1L, 3L,  3L, 4L, 4L, 2L, 2L, 4L, 4L, 2L, 2L), .Label = c("4659", "5053",  "5131", "5291"), class = "factor"), Timepoint = structure(c(1L,  2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("T0",  "T120"), class = "factor")), .Names = c("Donor", "Timepoint"), row.names = c("L3_E1",  "L3_E2", "L3_E5", "L3_E6", "L4_E1", "L4_E2", "L4_E5", "L4_E6",  "L5_E9", "L5_E10", "L5_E13", "L5_E14", "L6_E9", "L6_E10", "L6_E13",  "L6_E14"), class = "data.frame")

The way to do this analysis in DESeq and edgeR is to fit a model like ~ Donor + Timepoint and then compare it to a fit of the null model ~ Donor:

## DESeq
alt.formula <- ~Donor + Timepoint
null.formula <- ~Donor
cds <- newCountDataSet(feature.counts, expdata)
cds <- estimateSizeFactors(cds)
cds <- estimateDispersions(cds, modelFormula=alt.formula)
fit0 <- fitNbinomGLMs(cds, null.formula)
fit1 <- fitNbinomGLMs(cds, alt.formula)
pvalsGLM <- nbinomGLMTest(fit1, fit0)
padjGLM <- p.adjust(pvalsGLM, method="BH")
results <- data.frame(row.names=featureNames(cds),
                      LR=fits$null$deviance - fits$alt$deviance,
                      PValue=pvalsGLM, FDR=padjGLM)

## edgeR
design <- model.matrix(~Donor + Timepoint, data=expdata)
dge <- DGEList(counts=feature.counts,
               group=groupvar,
               genes=feature.annot)
dge <- estimateGLMCommonDisp(dge, design)
dge <- estimateGLMTrendedDisp(dge, design)
dge <- estimateGLMTagwiseDisp(dge, design)
fit <- glmFit(dge, design)
lrt <- glmLRT(dge, fit, coef="TimepointT120")
results <- topTags(lrt, n=nrow(lrt))

Now, I can also do the same thing in limma, adding "Donor" as a term in my model formula:

## Limma with Donor in the model
design <- model.matrix(~Donor + Timepoint, data=expdata)
nf <- calcNormFactors(feature.counts)
elist <- voom(feature.counts, design, plot=FALSE,
              lib.size=colSums(feature.counts)*nf)
fit <- lmFit(elist, design)
fit <- eBayes(fit)
results <- topTable(fit, coef="TimepointT120",
                    n=nrow(elist), sort.by="none")

which I understand produces a moderated paired t-test. However, in the limma docs and also in various mailing list postings (e.g. this one) , I see recommentations to use the block argument to limma's lmFit function along with the duplicateCorrelation function:

## Limma with Donor as blocking factor
design <- model.matrix(~Timepoint, data=expdata)
nf <- calcNormFactors(feature.counts)
elist <- voom(feature.counts, design, plot=FALSE,
              lib.size=colSums(feature.counts)*nf)
corfit <- duplicateCorrelation(elist, design, block=expdata$Donor)
fit <- lmFit(elist, design, block=expdata$Donor, correlation=corfit$consensus.correlation)
results <- topTable(fit, coef="TimepointT120",
                    n=nrow(elist), sort.by="none")

which is claimed to give better statistical power. Indeed, when I run both versions of the limma code, the second produces roughly the same gene list (sorted by p-value) only with much more significant p-values.

So now, I am confused because people who are using DESeq and edgeR are saying that you should treat "Donor" as a blocking factor by adding it as a term in your model, while for limma doing exactly the same thing is apparently suboptimal with respect to statistical power. And as far as I can tell, DESeq and edgeR don't have any direct equivalent of limma's "block" argument or duplicateCorrelation function. So is it really true that the analogous procedure (i.e. using ~Donor+Timepoint as the model formula) is correct in edgeR and DESeq but suboptimal in limma? If so, that seems like an odd contrast in what are otherwise quite similar analysis procedures.

So what is the difference between the two blocks of limma code above? Is the second block still doing a paired test? Which of the two blocks is really a closer match to the edgeR and DESeq blocks above?

rna-seq limma edger deseq • 15k views

ADD COMMENT • link updated 10.5 years ago by jdblischak ▴ 110 • written 12.1 years ago by Ryan Thompson ★ 3.6k

1

Entering edit mode

Very interesting question which I would also want to know the answer to. Maybe ask on the Bioconductor mailing list, where edgeR and limma questions are usually answered?

ADD REPLY • link 12.1 years ago by Mikael Huss 4.8k

score 10 · Accepted Answer · 2014-05-20

Gordon Smyth answered this on the Bioconductor mailing list. See his explanation here. Summarizing his answer, the two approaches are similar. `duplicateCorrelation` fits a random effect for the blocking variable. It is especially useful to use when the design is unbalanced.

Also, for the sake of posterity since this is an old post, using limma with voom is a suitable approach to analyzing RNA-seq data. See this comparison of differential expression methods and also the voom publication.