I've been reading on the usage of Combat since I want to apply it to some expression data (not for differential expression analysis) and I've seen some mixed information regarding the variables to include in the model matrix.
The standard call to Combat in R:
ComBat(dat=edata, batch=batch, mod=modcombat)
Now consider the situation where you have three variables of interest treatment, sex and age. Leaving as much of the variation in the data due to these covariates is desirable. Consider also you have a large list of confounder variables conf1, conf2 ... confn whose effect on the data would be best removed. From what I understand, modcombat
is the design matrix for the linear model which indicates the variables that explain the observed expression AND that we want to "remove". So it should technically be constructed like this:
model.matrix(~ conf1 + conf2 ... + confn)
Now I've also been told by a colleague that its the other way around, that the variables included in the design matrix should be the variables of interest to "keep":
model.matrix(~ treatment + sex + age)
Then the SVA package tutorial states that:
Just as with sva, we then need to create a model matrix for the adjustment variables, including the variable of interest
Which contradicts the other two approaches and makes no sense to me because if you include everything in the model (i.e. model.matrix(~ treatment + sex + age + conf1 + conf2 ... + confn)
) then how would Combat know what you want to remove and what you want to keep. Does anyone know what the correct usage actually is?
EDIT: so Ive been doing more digging around and there seems to be this systematical problem where people tend to not know how to actually use SVA or ComBat. Take for example this bioconductor post where a user asks a similar question about SVA. He even goes on to analyze how related the inferred surrogate variables are with a known confounder (batch) when he runs SVA with and without including the confounder as an "adjustment variable". He finds that the results completely contradict what initially is understood from the SVA/ComBat tutorial. As it is usual with these questions no one has responded clearly.
I think part of the problem is that the tutorial handles certain concepts loosely such as "variable of interest" and "adjustment variable". Can adjustment variables be further divided into known but "desired" confounders (e.g. Sex and Age) and known undesired confounders (e.g. Batch, RNA integrity index, etc.). Wouldn't the "desired confounders" be the same as additional variables of interest? Can there be multiple variables of interest or does it strictly have to be one?. Essentially I think a clear definition of "variable of interest" and "adjustment variable" is lacking along with examples that go beyond the usage of 1 variable of interest and no adjustment variables where no issues and doubts come up.
Thank you for your reply, it does make sense in a way for the usage to be like you described, however we aren't 100% sure. There seems to be a fundamental problem with the documentation of both SVA and ComBat as I have mentioned in an edit to my question
It may not be clear as, in this workflow by Jeff Leak, who is renowned for his work in batch effects,
batch
is provided in the data being passed tomod
: http://jtleek.com/genstats/inst/doc/02_13_batch-effects.htmlThis does contradict the actual
ComBat()
documentation, though.Is mod even needed? I would prefer to just have the corrected matrix returned to me, with which I would then proceed with the usual workflow.
NB for others arriving here: use ComBat-seq for correcting bulk RNA-seq raw counts.
Hi Kevin, thanks for the link,
I had seen it before (it is a very nice workflow), but in the only call to
ComBat
I'm seeing in the workflow is this one:Isn't he inputting the batch as
batch
, and the model.matrix (in this case without variable of interest) tomod
? Or am I missing something here?Yes, but, to me, it seems that
batch
would still be in themodcombat
object, no?modcombat
is constructed usingpheno
...pheno
appears to still containbatch
(never removed)Yes, it is inside
pheno
, but by constructingmodcombat
with this formula:~1
, you generate a null model matrix full of 1s, not with batch information. I'm referring to this (my output):(or maybe I'm confusing things)
Ah, indeed, I missed the
~ 1
. So, we are all aligned here:yeah passing
pheno
tomodel.matrix()
does nothing here except ensuring the resulting vector has the right length. I've also seen this link you posted before, but I don't think it clears any doubts as the examples are too simplified.