Let's suppose your various phenotypic factors are stored in a data.frame
called treatments
library(magrittr)
treatments <- data.frame(
pheno = rep(c('Ctrl', 'Ctrl', 'Ctrl', 'Disease', 'Disease'), each = 4),
cell = rep(paste('A', 1:4, sep = ''), times = 5),
pt = rep(paste('ID', 1:5, sep = ''), each = 4)
)
treatments %>% head
I find it easier to separate out the cellular / disease parts from the patient-matching part when building designs for this type of study (I've done them quite a lot for leukaemia datasets with various cell subpopulations over the last 18 months). So we define
pheno.design <- model.matrix(~ cell * pheno, data = treatments)
patient.design <- model.matrix(~ -1 + pt, data = treatments)
> pheno.design %>% head(3)
> patient.design %>% head
Now, you can use the following as your design matrix in limma, if you want to then struggle with setting up your contrast matrix:
binary.design <- cbind(pheno.design, patient.design[, c(2,3,5)])
colnames(binary.design) <- c("Intercept", make.names(colnames(binary.design))[-1])
Why will your contrasts matrix be ugly if you use this design?
Let's suppose you're interested in comparing disease vs control in cell type A2.
Algebraically, the contrast you want is:
(fitted val for cell type A2, Dis) - (fitted val for cell type A2, Ctrl)
The design takes [cell type A1, control status, patient ID1] as baseline (the value that is fitted when only the intercept term is included). If you include the coefficient for 'phenoDisease' as well, you will fit the disease-baseline sample [cell type A1, disease status, patient ID4].
To get the fitted value for [cell type A2, disease status] we compute the mean of the fitted values for the three different control individuals:
[Fitted val: A2, Control, ID1] = Intercept + cellA2;
[A2, Ctrl, ID2] = Intercept + cellA2 + ptID2
[A2, Ctrl, ID3] = Intercept + cellA2 + ptID3
=> [A2, Ctrl] = Intercept + cellA2 + (ptID2 + ptID3)/3
Similarly, for cell type A2 in the disease samples:
=> [A2, Disease] = Intercept + cellA2 + phenoDisease + cellA2.phenoDisease + ptID5 / 2
So the contrast you want to define is
[Contrast: Dis vs Ctrl in A2] =
[A2, Disease] - [A2, Control] =
phenoDisease + cellA2.phenoDisease + (ptID5 / 2) - (ptID2 + ptID3) / 3
You don't see these kinds of ugly contrasts for binary model matrices in studies where there's inter-treatment matching (say you had three patients, 4 cell subpops from each and treated the 4 cellpops +/- drug giving 24 samples; the ptID coefficients would cancel out in the definition of your contrast).
You might find it neater to encode the patient matrix as orthogonal differences between the patients in each experimental arm. This removes the ptID terms from the final contrast matrix (they are put inside the model matrix, rather than in the contrast matrix; and is much easier to explain to the bench)
pt.delta.matrix <- patient.design %*% matrix(
c(c(1, -1/2, -1/2, 0, 0),
c(0, 1, -1, 0, 0),
c(0, 0, 0, 1, -1)),
ncol = 3,
dimnames = list(c(), c("pt.delta1.23", "pt.delta2.3", "pt.delta4.5"))
)
> pt.delta.matrix
delta.design <- cbind(pheno.design, pt.delta.matrix)
colnames(delta.design) <- c("Intercept", make.names(colnames(delta.design))[-1])
The contrast I tried to explain above then comes out as:
[Contrast: Dis vs Ctrl in A2] =
[A2, Disease] - [A2, Control] =
phenoDisease + cellA2.phenoDisease
Apologies for the long-winded explanation.
Thanks for the great answer and time you put into this response! I have one question, can I still use this if patients have a systemic disease, so 4 cell types and all disease vs 4 cell types and all control? Again, thanks for all the effort it helped me a ton!
Yeah you could do that. If I've understood correctly, and you've set up the design as in delta.design, I think your contrast would be phenoDisease + (cellA2.phenoDisease + cellA3.phenoDisease + cellA4.phenoDisease)/4.