Question

model.matrix in EdgeR to account for batch effects with multiple samples

0

Entering edit mode

15 months ago

Ann ▴ 40

Hello everyone!

I have data from identical experiments repeated twice, once in 2018 and again in 2023. As expected, there appears to be a batch effect by year.

In these experiments, we have 4 samples (1, 2, 3, 4) with 8 replicates per sample.

On the MDS plot, my samples clustered more by year, indicating a batch effect (see attached MDS plot). MDS

According to the EdgeR manual, including the variable 'Year' in the model can account for batch effects. However, since I have four samples to compare, I am unsure if my formula for the design matrix is correct.

# Design accounting for year (=batch) ===================
sample_year <- metadata$sample_year # Year
sample_year
# [1] 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023
# [16] 2023 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018
# [31] 2018 2018
# Levels: 2018 2023
cpn <- metadata$cpn # Samples
cpn
#  [1] 2 2 4 4 3 3 1 1 2 2 4 4 3 3 1 1 4 4 4 4 3 3 3 3 2 2 2 2 1 1 1 1
# Levels: 1 2 3 4

# Design =============================================
design <- model.matrix(~cpn + sample_year)
rownames(design) <- colnames(y)
design

     (Intercept) cpn2 cpn3 cpn4 sample_year2023
id1            1    1    0    0               1
id2            1    1    0    0               1
id3            1    0    0    1               1
id4            1    0    0    1               1
id5            1    0    1    0               1
id6            1    0    1    0               1
id7            1    0    0    0               1
id8            1    0    0    0               1
id9            1    1    0    0               1
id10           1    1    0    0               1
id11           1    0    0    1               1
id12           1    0    0    1               1
id13           1    0    1    0               1
id14           1    0    1    0               1
id15           1    0    0    0               1
id16           1    0    0    0               1
id17           1    0    0    1               0
id18           1    0    0    1               0
id19           1    0    0    1               0
id20           1    0    0    1               0
id21           1    0    1    0               0
id22           1    0    1    0               0
id23           1    0    1    0               0
id24           1    0    1    0               0
id25           1    1    0    0               0
id26           1    1    0    0               0
id27           1    1    0    0               0
id28           1    1    0    0               0
id29           1    0    0    0               0
id30           1    0    0    0               0
id31           1    0    0    0               0
id32           1    0    0    0               0
attr(,"assign")
[1] 0 1 1 1 2
attr(,"contrasts")
attr(,"contrasts")$cpn
[1] "contr.treatment"

attr(,"contrasts")$sample_year
[1] "contr.treatment"

Where is the cpn1 column in my design matrix?
Is model.matrix correct for dealing with batch effects and future pairwise comparisons of all possible combinations across my samples (1vs2, 1vs3...3vs4)?
Would it be better to import into EdgeR and analyse only one pair at a time (e.g. 1vs2)?
Next I plan to use the following pipeline. Is it okay?
- estimateDisp()
- glmQLFit(y, design, robust = TRUE)
- makeContrasts()
- tr.c1_2 <- glmTreat (fit, contrast = my.contrasts[, 'c1_2'], lfc = log2(LFC))
- topTags(tr.c1_2, Inf)

EdgeR design batch • 780 views

ADD COMMENT • link 15 months ago by Ann ▴ 40

score 3 · Accepted Answer · 2024-02-17

Where is the cpn1 column in my design matrix?

It is the intercept. You could use ~sample_year + cpn instead so the batch term becomes the intercept and it is more straight-forward to form the correct contrasts as the cpn coefficients are not directly available in the model matrix.

Is model.matrix correct for dealing with batch effects...

Yes, absolutely.

Would it be better to import into EdgeR and analyse only one pair at a time (e.g. 1vs2)?

There is no universal rule. Generally, if groups are reasonably similar then keep them in one analysis rather than splitting. If not, then splitting might make sense. For example, say you have treatment A vs treatment B in several completely different organs, like spleen, brain, liver, heart and lung ... and your goal is to compare treatments, not organs. Then you might maybe want to split the analysis since expression levels between organs is likely to be very different, hence estimated mean-variance trend could not be representative to either of the organs. But in general, most of the time you leave groups in one analysis.

Next I plan to use the following pipeline. Is it okay?

See edgeR manual for corrent best practices. Be sure that the glmTreat threshold is moderate and you have the power to really test against a threshold rather than 0.