I am using propeller (https://rdrr.io/github/Oshlack/speckle/man/propeller.anova.html) to test whether cell type proportions differ significantly between hypertrophy and control. My design matrix is built like this:
design <- model.matrix(~0 + grp + md_subset$batch)
I understand that propeller uses a design without an intercept, so that there is no baseline level and the categorical variables are represented as separate columns (as stated on the page linked above). Altogether I have 10 samples, 3 controls and 7 hypertrophy, and they were processed in 4 batches. My design matrix looks like this:
I have a couple of questions regarding the design matrix. First, why don't I have just 2 columns in the design matrix: one column for condition, with 0 for control and 1 for hypertrophy, and one column for batch, with values 1, 2, 3 or 4 since there are 4 batches? What I see instead is that each condition has its own column, and so does each batch. It looks to me like one-hot encoding, but why is that necessary in this case? Second, if one-hot encoding is necessary and that is the reason each level has a separate column, then why is there no batch 1 column, given that my sample 3 (row 3) is in batch 1? Does this design matrix not take batch 1 effects into account?
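To make the question reproducible, here is a minimal sketch of the same kind of design in base R. The data frame `md` and its batch assignments are made up for illustration (only "sample 3 is control and in batch 1" matches my real data), so this shows the structure I am asking about rather than my actual matrix.

```r
# Hypothetical metadata: 10 samples, 3 control and 7 hypertrophy, 4 batches.
# The batch assignments below are invented for illustration only.
md <- data.frame(
  grp   = factor(c(rep("control", 3), rep("hypertrophy", 7))),
  batch = factor(c(2, 3, 1, 1, 2, 2, 3, 3, 4, 4))
)

# Same formula structure as in my call: no intercept, group first, then batch.
design <- model.matrix(~0 + grp + batch, data = md)

# Column names of the resulting design matrix.
print(colnames(design))

# Row 3 is the control sample in batch 1.
print(design[3, ])
```

In this sketch both levels of `grp` get a column, while `batch` only gets columns for levels 2, 3 and 4, which is the same pattern I see with my real data.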
Lastly, when I run this, I get an output like this:
propeller.anova(prop.transformed, design=design, coef=c(1,2,3,4,5), robust=TRUE,
trend=FALSE, sort=TRUE)
First, does running this with the design above mean that I am testing whether the proportion of each cell type differs significantly between conditions while controlling for batch effects? I ask because, as I understand it, in ordinary regression the first variable after ~ is the one you are testing and any additional variables are confounders you want to control for. However, in packages like DESeq2 or edgeR I have seen the confounders listed first, with the last variable being the one whose relationship to the response is tested. Also, do the p-value and F statistic here show how much of an effect the condition has on the proportion and whether that effect is significant? And do the columns labeled PropMean.* give the coefficients for that variable for the different cell types?
I appreciate any help and clarification.