Question

Is there any journal article to reference removeBatchEffect function from Limma?

0

Entering edit mode

2.6 years ago

ev97 ▴ 40

I am trying to correct my RNA data from 3 sources of variation. I wrote this post asking about my problem and one of the users ( ATpoint) has recommended me to use removeBatchEffect from limma.

However, I cannot find any published information apart from this one to see how it works, in which cases you should use it, comparisons between other methods, etc. In general, more information to reference the reason why I chose this method and not another one.

In my case, I know that I have to do this to be able to adjust by 3 variables...

On the other hand, I would like to confirm if it is possible to keep and not touch the biological difference from my data such as age, sex and group and WBC count (several biological variables). Is there a way to preserve it? because the only mention that I always find is about preserving the treatment information.

The design matrix is used to describe comparisons between the samples, for example treatment effects, that should not be removed. The function (in effect) fits a linear model to the data, including both batches and regular treatments, then removes the component due to the batch effects

Note that my 2 main objectives are:

Get adjusted counts for future analyses.
Run linear mixed models (with variancePartition package) in order to assess the contribution of several variables (such as age, diabetes, sex, hypertension, cholesterol, etc) to the expression variation of each gene.

Could anybody help me, please?

Thanks very much in advance

batch removebatcheffect limma rna-seq batch-effect • 1.8k views

ADD COMMENT • link updated 10 months ago by Ram 44k • written 2.6 years ago by ev97 ▴ 40

0

Entering edit mode

Thanks very much for your replies, I really appreciate all the information that you provided. Thanks

ADD REPLY • link 2.6 years ago by ev97 ▴ 40

score 5 · Accepted Answer · 2022-07-25

There is no citation for remoteBatchEffects, mostly because it is so simple.

Here is the complete code:

function (x, batch = NULL, batch2 = NULL, covariates = NULL, 
    design = matrix(1, ncol(x), 1), ...) 
{
    if (is.null(batch) && is.null(batch2) && is.null(covariates)) 
        return(as.matrix(x))
    if (!is.null(batch)) {
        batch <- as.factor(batch)
        contrasts(batch) <- contr.sum(levels(batch))
        batch <- model.matrix(~batch)[, -1, drop = FALSE]
    }
    if (!is.null(batch2)) {
        batch2 <- as.factor(batch2)
        contrasts(batch2) <- contr.sum(levels(batch2))
        batch2 <- model.matrix(~batch2)[, -1, drop = FALSE]
    }
    if (!is.null(covariates)) 
        covariates <- as.matrix(covariates)
    X.batch <- cbind(batch, batch2, covariates)
    fit <- lmFit(x, cbind(design, X.batch), ...)
    beta <- fit$coefficients[, -(1:ncol(design)), drop = FALSE]
    beta[is.na(beta)] <- 0
    as.matrix(x) - beta %*% t(X.batch)
}

In the simplest case of a single batch, and removing the lines to catch edge cases and errors, this can be simplified to:

function (x, batch = NULL, design = matrix(1, ncol(x), 1)) 
{
    batch <- model.matrix(~batch)[, -1, drop = FALSE]
    fit <- lmFit(x, cbind(design, batch))
    beta <- fit$coefficients[, -(1:ncol(design)), drop = FALSE]
    as.matrix(x) - beta %*% t(batch)
}

If you had 8 samples, two conditions (c(0,0,0,0,1,1,1,1)) and two batches c(1,1,0,0,1,1,0,0), then this would simply furhter to:

design <- model.matrix(~condition + batch)
fit <- lmFit(x, design)
corrected <- as.matrix(x) - fit$coefficents[,3] %*% t(batch)

The code basically fits a GLM which has all the terms in your design matrix, plus new ones added for the batches and the covariates. Then it predicts your Y (counts) using only the fitted coefficients, returning the residuals. Its such a common and common sense statistical approach, that I'm not sure anyone would bother to publish on it.

By the way, if you want to correct for more than two factors at once, you just need to use the "covariates" argument to pass the design matrix of all three of your nuisance variables. .