Hello everyone!
I have data from identical experiments repeated twice, once in 2018 and again in 2023. As expected, there appears to be a batch effect by year.
In these experiments, we have 4 samples (1, 2, 3, 4) with 8 replicates per sample.
On the MDS plot, my samples clustered more by year, indicating a batch effect (see attached MDS plot).
According to the EdgeR manual, including the variable 'Year' in the model can account for batch effects. However, since I have four samples to compare, I am unsure if my formula for the design matrix is correct.
# Design accounting for year (=batch) ===================
sample_year <- metadata$sample_year # Year
sample_year
# [1] 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023
# [16] 2023 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018
# [31] 2018 2018
# Levels: 2018 2023
cpn <- metadata$cpn # Samples
cpn
# [1] 2 2 4 4 3 3 1 1 2 2 4 4 3 3 1 1 4 4 4 4 3 3 3 3 2 2 2 2 1 1 1 1
# Levels: 1 2 3 4
# Design =============================================
design <- model.matrix(~cpn + sample_year)
rownames(design) <- colnames(y)
design
(Intercept) cpn2 cpn3 cpn4 sample_year2023
id1 1 1 0 0 1
id2 1 1 0 0 1
id3 1 0 0 1 1
id4 1 0 0 1 1
id5 1 0 1 0 1
id6 1 0 1 0 1
id7 1 0 0 0 1
id8 1 0 0 0 1
id9 1 1 0 0 1
id10 1 1 0 0 1
id11 1 0 0 1 1
id12 1 0 0 1 1
id13 1 0 1 0 1
id14 1 0 1 0 1
id15 1 0 0 0 1
id16 1 0 0 0 1
id17 1 0 0 1 0
id18 1 0 0 1 0
id19 1 0 0 1 0
id20 1 0 0 1 0
id21 1 0 1 0 0
id22 1 0 1 0 0
id23 1 0 1 0 0
id24 1 0 1 0 0
id25 1 1 0 0 0
id26 1 1 0 0 0
id27 1 1 0 0 0
id28 1 1 0 0 0
id29 1 0 0 0 0
id30 1 0 0 0 0
id31 1 0 0 0 0
id32 1 0 0 0 0
attr(,"assign")
[1] 0 1 1 1 2
attr(,"contrasts")
attr(,"contrasts")$cpn
[1] "contr.treatment"
attr(,"contrasts")$sample_year
[1] "contr.treatment"
- Where is the cpn1 column in my design matrix?
- Is model.matrix correct for dealing with batch effects and future pairwise comparisons of all possible combinations across my samples (1vs2, 1vs3...3vs4)?
- Would it be better to import into EdgeR and analyse only one pair at a time (e.g. 1vs2)?
- Next I plan to use the following pipeline. Is it okay?
- estimateDisp()
- glmQLFit(y, design, robust = TRUE)
- makeContrasts()
- tr.c1_2 <- glmTreat (fit, contrast = my.contrasts[, 'c1_2'], lfc = log2(LFC))
- topTags(tr.c1_2, Inf)
Thank you very much!