Question

Trying to understand the maths behind one Limma function

5

Entering edit mode

4.7 years ago

Mozart ▴ 330

Hi there,

After having read the vignette, I am really keen to understand in depth how the Limma functions 'removebatcheffect' actually works. Looked at the function:

#  removeBatchEffect.R

#  A refinement would be to empirical Bayes shrink
#  the batch effects before subtracting them.

removeBatchEffect <- function(x,batch=NULL,batch2=NULL,covariates=NULL,design=matrix(1,ncol(x),1),...)
#  Remove batch effects from matrix of expression data
#  Gordon Smyth and Carolyn de Graaf
#  Created 1 Aug 2008. Last revised 1 June 2014.
{
    if(is.null(batch) && is.null(batch2) && is.null(covariates)) return(as.matrix(x))
    if(!is.null(batch)) {
        batch <- as.factor(batch)
        contrasts(batch) <- contr.sum(levels(batch))
        batch <- model.matrix(~batch)[,-1,drop=FALSE]
    }
    if(!is.null(batch2)) {
        batch2 <- as.factor(batch2)
        contrasts(batch2) <- contr.sum(levels(batch2))
        batch2 <- model.matrix(~batch2)[,-1,drop=FALSE]
    }
    if(!is.null(covariates)) covariates <- as.matrix(covariates)
    X.batch <- cbind(batch,batch2,covariates)
    fit <- lmFit(x,cbind(design,X.batch),...)
    beta <- fit$coefficients[,-(1:ncol(design)),drop=FALSE]
    beta[is.na(beta)] <- 0
    as.matrix(x) - beta %*% t(X.batch)
}

My understanding is that when you have just one known source of bias (say, different days in which an experiment was performed), covariates is = null. Thus, the function will be the following:

batch <- as.factor(batch)
contrasts(batch) <- contr.sum(levels(batch))
batch <- model.matrix(~batch)[,-1,drop=FALSE]

once a design matrix has been created..

fit <- lmFit(x,cbind(design,X.batch),...)
beta <- fit$coefficients[,-(1:ncol(design)),drop=FALSE]
beta[is.na(beta)] <- 0
as.matrix(x) - beta %*% t(X.batch)

So that's the hard bit for a lay person... So once you have your 'dds' object (I am using DESeq2), design formula (es. ~condition), Limma's function adds the batch object just created and then it fits a linear model for each gene in the dataset.

Is that correct?
can anyone explain what is the beta object in this case? I guess the following strings will add the beta variable to the equation?

Sorry if my questions may be very naive for you but I am really trying to understand every single steps and in the meantime trying to study a bit of stats.

limma RNA-Seq batch-effect • 4.4k views

ADD COMMENT • link updated 8 months ago by Ram 44k • written 4.7 years ago by Mozart ▴ 330

1

Entering edit mode

Related: https://support.bioconductor.org/p/116821/#116828

ADD REPLY • link 4.7 years ago by Kevin Blighe 88k

score 10 · Accepted Answer · 2020-04-27

10

Entering edit mode

4.7 years ago

Kevin Blighe 88k

Obviously limma should not be used to model the raw or normalised counts from DESeq2 due to the fact that these counts will follow a negative binomial distribution, whereas limma's removeBatchEffect() function [and limma, generally speaking] will expect that your data is normalised and follows a normal distribution (to be specific: that the residuals after fitting the model follow a normal distribution).

In a practical setting, the following would occur:

via whichever means, the user identifies a batch effect in their data, batch
user includes batch in the design formula when using DESeq2, for example, ~ condition + batch
when deriving test statistics for the main condition of interest, condition, DESeq2 will, internally, treat batch as a covariate and 'adjust' the test statistics for condition, including the p-values, obviously. We could just as easily derive test statistics for batch, and these would conversely be adjusted for condition. This is how covariate-adjustments work in regression models.
user then wishes to use this data for downstream applications; so, transforms the normalised counts via rlog(..., blind = FALSE) or vst(..., blind = FALSE)

-----------

Now, we have performed our differential expression analysis using DESeq2 and adjusted for the effect of batch in the test statistics output only - no modification to our expression data has been made for batch, or, indeed, anything else in our design formula. However, for our downstream programs / methods, like clustering, PCA, and other cool stuff, we want to completely eliminate the batch effect directly from our expression data; so, we use:

limma::removeBatchEffect(assay(vst), batch = colData(dds)[,batch'])

Limma, via the code that you have posted above, will estimate the effect of batch for each gene by fitting a model of form ~ batch [for each gene], and then use these estimates ('beta coefficients') to directly modify the expression levels in your assay(vst) object. So, the 'beta' to which you inquire is a measure of the effect of batch for each gene. In a typical regression scenario, we would derive odds ratios from these beta coefficients - the odds ratio is the exponent of the beta coefficient.

Trust that this helps.

Kevin

ADD COMMENT • link 4.6 years ago by Kevin Blighe 88k

4

Entering edit mode

Specifically, a linear model is an equation of the form:

y = beta0 + beta1*x1 + beta2*x2 + beta3 * x3  + .... + epsilon

Lets say you have case where you have gene expression (as assay(vst), as Kevin says), so y is assay(vst). The xs represent the predcitors, so if formula is ~condition + batch, and batch can be "batch1", "batch2" or "batch3", then x1 will be 0 for WT and 1 for treated, x2 will be 1 if batch is "batch2" and x3 will be 1 if batch is "batch3". We often reffer to the x's in the vector form X.

The betas are reffered to as the coefficients. In the sort of gene expression models we are looking at here, where the output is log gene expression, the are equivalent to log fold changes.

episilon represents the random noise in the system - its the bit that isn't accounted for by any of our other predictors.

So we fit this model for each gene, including the batch effects. We then split it into three pieces:

 y = (Condition) + (Batch) + episilon

where

Condition = beta0 + beta1*x1
Batch = beta2*x2 + beta3 *x3

and say that if we only want to only get the condition dependent part of the gene expression, is can do:

Condition + epilison = y - (Batch) = y - beta2 * x2 + beta3*x3

We say that this is the "residual" after regressing out the effect of batch.

this is what is happening in this lines:

    beta <- fit$coefficients[,-(1:ncol(design)),drop=FALSE]

coefficients is a matrix that contains all the betas (beta0, beta1, beta2, beta3) for all the genes. This gets only those columns associated with batch.

then in this line:

as.matrix(x) - beta %*% t(X.batch)

We multiple the betas by the xs (%*% is matrix multiplication in R, so we multiple the 20,000 x 2 beta matrix by the 2 x nt(X.batch)` matrix to get a 20,000 x n matrix batch effects for each of the 20,000 genes in each of the n samples, and then subtract them from the expressions to leave only the Condition effects.

ADD REPLY • link 4.7 years ago by i.sudbery 20k

0

Entering edit mode

Thanks Kevin and i.sudbery for the big help in explaining me the nitty gritty details of this code. Sorry if this sounded like somewhat an offence for someone; for sure, didn't intend to open a debate.

Thanks you all for providing best explanation on the topic.

ADD REPLY • link 4.6 years ago by Mozart ▴ 330

4

Entering edit mode

No problem, trying to understand the math is good. I was a worried earlier that you might get distracted by focusing on code steps you didn't understand to the detriment of understanding the concept and the math (because I see that happen often). But the answers have included concepts so it's a good outcome.

Don't forget that help is available for very function step. E.g., ?lmFit would explain what fit$coefficients is.

ADD REPLY • link 4.6 years ago by Gordon Smyth ★ 7.7k

0

Entering edit mode

Thanks a lot, then. Have a good one.

ADD REPLY • link 4.6 years ago by Mozart ▴ 330

2

Entering edit mode

Sorry if this sounded like somewhat an offence for someone

Don't worry. It is a perfectly valid question and an good example for a question where a user invested time and effort. Nothing wrong with it.