Hi,
I'm re-analyzing an Affymetrix array with 200+ samples distributed across 38 groups. They refer to multiple cell types (CD8+, CD4+, Granulocytes, etc.), so there isn't a "control" condition on which the groups are to be compared against. I've implemented code to RMA-normalize the CEL files and perform pair-wise comparisons for each two cell types of interest, but I'm stuck on how to extract DEGs among 3 or more groups?
My design formula considers batch effects (as originally reported by the study's authors), and is as following:
design <- model.matrix(~0+group+batch)
where group refers to samples cell types.
I then go on to perform pairwise comparisons and extract DEGs, examples of which are:
contr.matrix <- makeContrasts(
BASOPHILSvsEARLYBCELLS = Basophils - EarlyB, # (coef = 1)
CD8NAIVEvsCD8EM = CD8Naive - CD8EM, # (coef = 2)
EOSINOPHILSvsMONOCYTES = Eosino - Monocyte, # (coef = 3)
levels = colnames(design)
)
and to extract, for instance, DEGs between basophils and early B-cells I do:
fit <- lmFit(expr.rma, design)
fit2 <- contrasts.fit(fit, contr.matrix)
fit2 <- eBayes(fit2)
tt <- topTable(fit2, coef = 1, adjust.method = "BH", n=Inf)
but how would I proceed to extract DEGs among any of these groups?
Yes, that's how I'm extracting the DEGs from each 2-way comparison. But I'd like to know whether it would be possible to calculate DEGs all at once among e.g Basophils X Early B-cells X CD8-naive X CD8EM x Monocytes X Eosinophils, as one would do by an ANOVA analysis.
My design is quite big because there are a lot of samples and groups (I don't want to perform all-vs-all comparison, just these groups of interest). I pasted it here https://pastebin.com/Nf4D3PSJ (note that HTAxxx columns correspond to batches).
Edit: I guess I could perform all pairwise comparisons and just calculate the union of DEGs across comparisons. But I'm not sure type 1 error would be properly controlled using this strategy.
Section 3.2.6 of the manual https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf might be of interest to make ANOVA-like tests that detect whether there are any DEGs among all indicated groups.
Thank you for pointing me to this guide (I was only reading limma since I'm working on array data)!
I've adapted my code to read:
However, this yields very low P-values with probes that present strikingly similar expression profiles. This is the first few lines of
tt
:Also, when I run summaryTests(fit2), everyone is significant in all comparisons but the batch columns! See: https://pastebin.com/jHiYNYuN
What am I missing here? It seems I'm actually testing for expression, thus all probes appear significant - but I have included an intercept in my design formula. Should I only include the groups to be compared in the design formula?
EDIT: my
design
:I realize now that I should be using a
design
of the formdesign <- model.matrix(~group+batch)
, where(Intercept)
now shows in the first column, and now I haveMy question now is whether I should subset my whole dataset prior to perform this only to retain groups of interest (in this case, samples related to the comparisons between Basophils x CD4CM x CD4EM x CD8CM x CD8EM), or if using
coef = 2:4
is enough to obtain these DEGs across these specific groups, which are only a portion of all groups present in this dataset? I also have a lot of other coefficients calculated for both the remaining groups as well as the batches.Is it really of interest for your research to know which genes are differential between a basophil and a CD8+? These are fundamentally different cell types. I would really try and only perform the meaningful comparisons that help driving your research and avoid making unnecessary comparisons. I'd stick with the pairwise comparisons instead of these Anova-like tests. What question do you want to answer?
This is a quite complete dataset of cellular lineages. I used basophile as an example for simplicitiy because it was at the start of my sample array. What I would like to really see was differences between the (somewhat related) Megakaryocyte / erythroid progenitor VS granulocyte/monocyte progenitor VS Common myeloid progenitor VS Naive CD4+ VS Naive CD8+.
Create new combined levels / factors in your input metadata (via
paste()
orpaste0()
) such that you can then conduct the comparisons that you want.Yes, I have each of the groups from my previous message as a factor within the
group
variable that compose my design. At this moment I'm leaning towards @ATPoint's suggestion of performing solely 2-group comparisons, although I would like to know whether my previous reasoning and code is appropriate for an ANOVA-like comparison, particularly in the presence of thebatch
covariate as all examples that I saw in other posts had only a simple design of the formdesign <- model.matrix(~ group)