Hi all, I am working to analyze some microarray data (GSE116560) with hopes of comparing gene expression between patients on a ventilator versus those not on a ventilator for three different timepoints (ARDS diagnosis day 1, 4, or 8). My experimental matrix can be represented by design:
set.seed(123)
design<-data.frame("id"=1:68,"ARDS_day"=c(rep(1,30),rep(4,22),rep(8,16)),"age"=sample(18:84,replace=
T,size = 68),"gender"=rep(c("male","female"),68/2),"vfd_days"=rep(c(0,10,0,25),17))
design$vent_status<-ifelse(design$vfd_days==0,"on_vent","off_vent")
design$condition<-paste0(design$vent_status,"_",design$ARDS_day)
design<-model.matrix(~age+gender+condition,design)
design
then I will fit the linear model and run emperical bayes as follows:
fit <- lmFit(y,design)
cont.matrix <- makeContrasts(day4_on_vs_off_vent = conditionon_vent_4 - conditionoff_vent_4,
levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
topTable(fit2, adjust="BH", n = Inf)
I am quite confused however. If anyone could help me with these questions I'd be greatly appreciative!
Should I use an intercept ("model.matrix(~1+age+gender+...")? I understand that this is usually best for numeric covariates (e.g. age) and shouldn't matter for gender, but feel unsure what I should do if I have both types.
How do I make the appropriate contrast for on vent vs. off vent for ARDS day1? Design has a column "conditionon_vent_1" but no column "conditionoff_vent_1." Does this mean "conditionoff_vent_1" is represented by (Intercept)?
I have been looking at the limma userguide extensively and still feel confused. Thanks for any guidance!
Brian
This is extremely helpful, thank you. Do you have any insight into why it's best to specify the variable of interest first?
I think I have figured out what I needed to with the help of your answer!
When there are multiple categorial variables, at most one can retain all of its levels in the model. If two or more variables are "one-hot encoded" (made into indicator variables) it results in a condition called "multicollinearity" which means the linear model cannot be fit, because effects are non-identifiable (see https://discourse.pymc.io/t/real-life-example-on-housing-price-regression-advice-requested/3050/5).
To address this common issue,
model.matrix
will retain all levels of the 1st variable; and drop the 1st level of every subsequent variable.Therefore, to ensure your variable of interest retains all of its levels, specify it first.