Tutorial:removeBatchEffect explained using base R linear models
1
10
Entering edit mode
7 months ago
nhaus ▴ 420

Hi all,

I always had a some difficulties to understand what removing the effect of a covariates actually means and what happens "under the hood". So I decided to make a small R example that shows what limma::removeBatchEffect does, by "implementing" the method just using base R linear models. It really helped me to understand so I though I would share it:

Batch correction while preserving a biological effect

library(limma)

# Parameters
num_genes <- 100
num_samples <- 12

# Simulate some data: 100 genes and 12 samples
# Base expression levels
base_expression <- matrix(rnorm(num_genes * num_samples, mean = 10, sd = 3), nrow = num_genes, ncol = num_samples)
rownames(base_expression) <- paste("Gene", 1:num_genes)
colnames(base_expression) <- paste("Sample", 1:num_samples)

# Create batch factors
batch <- factor(rep(c("Batch1", "Batch2", "Batch3"), 4))

# Create a biological factor (treatment)
treatment <- factor(rep(c("Control", "Treatment"), 6))

# Apply consistent batch effects
batch_effects <- c(5, -4, 9) # Different consistent effects for each batch
batch_matrix <- matrix(batch_effects[as.integer(batch)], nrow = num_genes, ncol = num_samples, byrow = TRUE)

# Apply consistent treatment effects
treatment_effects <- c(0, 15) # No effect for control, positive effect for treatment
treatment_matrix <- matrix(treatment_effects[as.integer(treatment)], nrow = num_genes, ncol = num_samples, byrow = TRUE)

expression <- base_expression + batch_matrix + treatment_matrix

# Manual adjustment using lm() preserving biological condition
adjusted_expression_manual <- apply(expression, 1, function(gene_expression) {
    full_model <- lm(gene_expression ~ batch + treatment)
    reduced_model <- lm(gene_expression ~ treatment)
    # isolate the component of the gene expression predictions that is due to the batch effect.
    # Is the portion of gene expression that can be attributed solely to the batch, independent of the biological variable (treatment).
    batch_effect <- full_model$fitted.values - reduced_model$fitted.values
    adjusted_expression <- gene_expression - batch_effect # remove the batch effect
    return(adjusted_expression)
})


# Convert adjusted expression to a matrix for easier handling
adjusted_expression_matrix_manual <- matrix(unlist(adjusted_expression_manual), nrow = num_genes, byrow = TRUE)
colnames(adjusted_expression_matrix_manual) <- colnames(expression)
rownames(adjusted_expression_matrix_manual) <- rownames(expression)

# Adjust using limma's removeBatchEffect while preserving biological effects
design_matrix <- model.matrix(~treatment)
adjusted_expression_limma <- removeBatchEffect(expression, batch = batch, design = design_matrix)

# Comparing the two approaches
comparison <- adjusted_expression_matrix_manual - adjusted_expression_limma
print("Difference between Manual and limma Approaches:")
print(head(comparison))

Batch correction if you do not want to preserve anything

# Parameters
num_genes <- 100
num_samples <- 12

# Simulate some data: 100 genes and 12 samples
# Base expression levels
base_expression <- matrix(rnorm(num_genes * num_samples, mean = 10, sd = 3), nrow = num_genes, ncol = num_samples)
rownames(base_expression) <- paste("Gene", 1:num_genes)
colnames(base_expression) <- paste("Sample", 1:num_samples)

# Create batch factors
batch <- factor(rep(c("Batch1", "Batch2", "Batch3"), 4))

# Create a biological factor (treatment)
treatment <- factor(rep(c("Control", "Treatment"), 6))

# Apply consistent batch effects
batch_effects <- c(5, -4, 9) # Different consistent effects for each batch
batch_matrix <- matrix(batch_effects[as.integer(batch)], nrow = num_genes, ncol = num_samples, byrow = TRUE)

# Apply consistent treatment effects
treatment_effects <- c(0, 15) # No effect for control, positive effect for treatment
treatment_matrix <- matrix(treatment_effects[as.integer(treatment)], nrow = num_genes, ncol = num_samples, byrow = TRUE)

expression <- base_expression + batch_matrix + treatment_matrix
# Manual adjustment using lm() preserving biological condition
adjusted_expression_manual <- apply(expression, 1, function(gene_expression) {
    model <- lm(gene_expression ~ batch)
    adjusted_expression <- residuals(model) + mean(gene_expression) # Adding back the mean
    return(adjusted_expression)
})

# Convert adjusted expression to a matrix for easier handling
adjusted_expression_matrix_manual <- matrix(unlist(adjusted_expression_manual), nrow = num_genes, byrow = TRUE)
colnames(adjusted_expression_matrix_manual) <- colnames(expression)
rownames(adjusted_expression_matrix_manual) <- rownames(expression)

# Adjust using limma's removeBatchEffect while preserving biological effects
design_matrix <- model.matrix(~treatment)
adjusted_expression_limma <- removeBatchEffect(expression,
    batch = batch,
    design = matrix(1, ncol(expression), 1) # essentially says, there are no experimental conditions, all belong to same category
)


# Print the results
print("Adjusted Expression Matrix (Manual):")
print(head(adjusted_expression_matrix_manual))

print("Adjusted Expression Matrix (limma):")
print(head(adjusted_expression_limma))

# Comparing the two approaches
comparison <- adjusted_expression_matrix_manual - adjusted_expression_limma
print("Difference between Manual and limma Approaches:")
print(sum(comparison))
removebatcheffects limma batch-effect • 1.1k views
ADD COMMENT
2
Entering edit mode

(+1) "implementing" the method just using base R linear models.: I find implementing from basic starting point to be one of the best, if not the best, way to learn what a procedure does.

ADD REPLY
0
Entering edit mode
8 weeks ago
aUser ▴ 70

What are the use cases where we need to remove everything including biological effect? I thought that we need to preserve biological effect in every situation. I just want to know the conditions/situations where biologcal effect does not matter.

ADD COMMENT
0
Entering edit mode

Biological and batch effects are known effects that are based on prior knowledge, a variable in the design. If we remove both effects, we get the residuals, i.e. data minus model, the model being both effects. There might be unexpected information in these residuals.

The typical workflow of using limma is to perform an exploratory (i.e. without prior) analysis (e.g. plotMDS) before any modelling in order to determine what is the main effect that drives the data and to verifiy that this main effect derives either from the biological conditions or the batch effect, which means that the experiment/data reflects the known effects.

This is only a point of view from an analogy to a simple linear regression.

ADD REPLY
0
Entering edit mode

Thank you for the explanation. In most of the experiments, we are testing the biological effect, for instance, (1) what a treatement intervention is doing or (2) what drives the cancer progress/causes ? In these situation, we need to preserve the biological effect so that we can estimate the intervention/cancer affect.

I wanted to know what are the situations where we do not care about preserving biological effect as in the above tutorial: "Batch correction if you do not want to preserve anything" did to eliminate biologcal effect as well. What are the cases where this strategy is applicable?

ADD REPLY
0
Entering edit mode

I didn't clearly get your point before. I think the given code/example cannot answer it and I have no answer to it.

Batching correcting without the design matrix may produce misleading results because it may remove genuine non-batch effects in the data, if they exist. I do not see any purpose in doing that.

If you have already read this thread, you'd better start a new question.

ADD REPLY

Login before adding your answer.

Traffic: 1950 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6