I have a linear regression model on someone else's unpublished microarray data in which I'm testing the significance of certain second- and third-order variable interactions via ANOVA on nested models (i.e., with and without the specific interactions). For many of the interactions, however, I am obtaining a p-value distribution that is skewed towards high p-values. Can incorrect model selection lead to something like this? Could it be because I encoded the model matrix incorrectly? Might the skewed distribution reflect some underlying structure in the data? Am I choosing the wrong model vs. null model?
The Model
There are three cell types, dummy-encoded with the two variables cell.type.b and cell.type.c, and there are three drug treatment conditions (drug1 alone, drug2 alone, and drug1+drug2), plus an untreated control, encoded by the presence or absence of each drug.
intercept  cell.type.b  cell.type.c  drug1  drug2
        1            0            0      0      0
        1            0            0      0      1
        1            0            0      1      0
        1            0            0      1      1
        1            1            0      0      0
        1            1            0      0      1
        1            1            0      1      0
        1            1            0      1      1
        1            0            1      0      0
        1            0            1      0      1
        1            0            1      1      0
        1            0            1      1      1
I also have (but did not include in the table for simplicity) the 6 interaction parameters
(cell.type.b + cell.type.c)*(drug1 + drug2 + drug1*drug2).
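For concreteness, here is a rough sketch of how I could build such a design matrix in R. The replicate count (two arrays per condition) and the variable names (design, mod, mod0) are placeholders for illustration, not the actual experimental layout:

# Hypothetical layout: 3 cell types x 4 drug conditions x 2 replicate arrays
# (the replicate count is an assumption for illustration only)
design <- expand.grid(rep   = 1:2,
                      drug2 = c(0, 1),
                      drug1 = c(0, 1),
                      cell  = c("a", "b", "c"))
cell.type.b <- as.numeric(design$cell == "b")
cell.type.c <- as.numeric(design$cell == "c")
drug1       <- design$drug1
drug2       <- design$drug2

# Reduced (null) model matrix: intercept + main effects only
mod0 <- cbind(intercept = 1, cell.type.b, cell.type.c, drug1, drug2)

# Full model matrix: add the six cell.type x drug interaction columns
mod <- cbind(mod0,
             b.d1   = cell.type.b * drug1,
             b.d2   = cell.type.b * drug2,
             b.d1d2 = cell.type.b * drug1 * drug2,
             c.d1   = cell.type.c * drug1,
             c.d2   = cell.type.c * drug2,
             c.d1d2 = cell.type.c * drug1 * drug2)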
The null hypothesis is that a particular cell.type x drug interaction has no effect, i.e. that the model without that interaction fits the data no worse than the model including it.
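For a single gene this is just the usual nested-model F-test, which can be checked with lm() and anova(); here y is a simulated stand-in for one gene's expression values, using the matrices sketched above (dropping a single interaction column instead of all six works the same way):

# Purely illustrative: one simulated gene
set.seed(1)
y <- rnorm(nrow(mod))       # stand-in for one gene's expression values
fit0 <- lm(y ~ mod0 - 1)    # reduced model (mod0 already contains the intercept)
fit1 <- lm(y ~ mod - 1)     # full model including the interaction columns
anova(fit0, fit1)           # F-test that the interaction coefficients are zero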
Calculating p-values
To calculate the p-values, I use the following R function, essentially copied and pasted from Leek and Storey's sva R package. It's just a vectorized version of using the hat matrix to extract residuals. Models are input as matrices like the table above; my full model has additional columns for the interactions, of course. A separate F statistic is calculated for each row of the data matrix (i.e. each gene of the microarray data set):
f.pvalue <- function(dat, mod, mod0){
  # This is a function for performing parametric F-tests on the data
  # matrix dat, comparing the null model mod0 to the alternative model mod.
  n   <- dim(dat)[2]    # number of arrays (columns of dat)
  m   <- dim(dat)[1]    # number of genes (rows of dat)
  df1 <- dim(mod)[2]    # number of parameters in the full model
  df0 <- dim(mod0)[2]   # number of parameters in the null model
  p   <- rep(0, m)
  Id  <- diag(n)

  # Residuals via the hat matrix: dat %*% (I - X (X'X)^{-1} X') for each model
  resid  <- dat %*% (Id - mod  %*% solve(t(mod)  %*% mod)  %*% t(mod))
  resid0 <- dat %*% (Id - mod0 %*% solve(t(mod0) %*% mod0) %*% t(mod0))

  # Residual sums of squares, one per gene (row sums of squared residuals)
  rss1 <- resid^2  %*% rep(1, n)
  rss0 <- resid0^2 %*% rep(1, n)

  # One F statistic and one p-value per gene
  fstats <- ((rss0 - rss1) / (df1 - df0)) / (rss1 / (n - df1))
  p <- 1 - pf(fstats, df1 = (df1 - df0), df2 = (n - df1))
  return(p)
}
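As a usage sketch (with simulated data standing in for the real expression matrix, since I can't share it), the skew is easiest to see in a histogram of the resulting p-values:

# Simulated stand-in for the genes x arrays matrix; its columns must be in the
# same order as the rows of mod / mod0 constructed above
dat <- matrix(rnorm(1000 * nrow(mod)), nrow = 1000)

p <- f.pvalue(dat, mod, mod0)

# Under the null the p-values should be roughly uniform on [0, 1];
# a skew towards high p-values shows up immediately here
hist(p, breaks = 50)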
Hi @Eric, I have seen this pattern in two different projects while analyzing second-level interactions using the R-Maanova package. I did not find any particular reason or explanation, so any sensible reply to your question is also of high interest to me :) Did you use any particular package for the analysis, or did you simply use a standard ANOVA test implemented in R? Cheers
Hi @Eric. I'm using the hat matrix to extract residuals from the microarrays and then calculating the F statistics directly, so I'm using the standard R function
pf(q, df1, df2)
to get p-values.