Question

What exactly should and shouldn't go in Combat's model matrix?

6

Entering edit mode

4.0 years ago

n,n ▴ 390

I've been reading on the usage of Combat since I want to apply it to some expression data (not for differential expression analysis) and I've seen some mixed information regarding the variables to include in the model matrix.

The standard call to Combat in R:

 ComBat(dat=edata, batch=batch, mod=modcombat)

Now consider the situation where you have three variables of interest treatment, sex and age. Leaving as much of the variation in the data due to these covariates is desirable. Consider also you have a large list of confounder variables conf1, conf2 ... confn whose effect on the data would be best removed. From what I understand, modcombat is the design matrix for the linear model which indicates the variables that explain the observed expression AND that we want to "remove". So it should technically be constructed like this:

model.matrix(~ conf1 + conf2 ... + confn)

Now I've also been told by a colleague that its the other way around, that the variables included in the design matrix should be the variables of interest to "keep":

model.matrix(~ treatment + sex + age)

Then the SVA package tutorial states that:

Just as with sva, we then need to create a model matrix for the adjustment variables, including the variable of interest

Which contradicts the other two approaches and makes no sense to me because if you include everything in the model (i.e. model.matrix(~ treatment + sex + age + conf1 + conf2 ... + confn)) then how would Combat know what you want to remove and what you want to keep. Does anyone know what the correct usage actually is?

EDIT: so Ive been doing more digging around and there seems to be this systematical problem where people tend to not know how to actually use SVA or ComBat. Take for example this bioconductor post where a user asks a similar question about SVA. He even goes on to analyze how related the inferred surrogate variables are with a known confounder (batch) when he runs SVA with and without including the confounder as an "adjustment variable". He finds that the results completely contradict what initially is understood from the SVA/ComBat tutorial. As it is usual with these questions no one has responded clearly.

I think part of the problem is that the tutorial handles certain concepts loosely such as "variable of interest" and "adjustment variable". Can adjustment variables be further divided into known but "desired" confounders (e.g. Sex and Age) and known undesired confounders (e.g. Batch, RNA integrity index, etc.). Wouldn't the "desired confounders" be the same as additional variables of interest? Can there be multiple variables of interest or does it strictly have to be one?. Essentially I think a clear definition of "variable of interest" and "adjustment variable" is lacking along with examples that go beyond the usage of 1 variable of interest and no adjustment variables where no issues and doubts come up.

batch effects expression gene combat • 7.3k views

ADD COMMENT • link updated 4.0 years ago by Papyrus ★ 3.1k • written 4.0 years ago by n,n ▴ 390

score 3 · Answer 1 · 2021-06-18

I still do not see too much confusion regarding the use of the ComBat function, though.

The vignette, the manual for the ComBat function, and the Jeff Leek tutorial we have commented agree with this use (i.e. batch being inputted in batch, and the rest, including interest, if desired, in mod).

At the start of the tutorial, the description states:

The sva package assumes there are two types of variables that are being considered: (1) adjustment variables and (2) variables of interest. For example, in a gene expression study the variable of interest might an indicator of cancer versus control. The adjustment variables could be the age of the patients, the sex of the patients, and a variable like the date the arrays were processed. Two model matrices must be made: the “full model” and the “null model”. The null model is a model matrix that includes terms for all of the adjustment variables but not the variables of interest. The full model includes terms for both the adjustment variables and the variables of interest.

In my reading, this means that adjustment variables include all other variables that one is interested in adjusting for within a experiment (which could even be batch). Then, for the specific case of there being batch variable within those variables, the package provides a ComBat function to "remove" that adjustment variable from the data, while leaving the other variables (quote: "adjustment variables, including the variable of interest") protected.

I agree with the OP that the starting definitions could be better improved by underscoring that there may be some adjustment variables (e.g. batch) that one may want to remove from the data, while not others. This (non-explicit) general idea comes (I believe) from literature on the field which differentiates between types of adjustment variables/confounding factors, which may then be handled differently. For example, those which are known to the researcher and not subject to measurement error, such as batch, are different from variables which have been measured.

Try taking a look at Chapter 3, section 3.4 from this book which defines types of confounding variables and discusses when/how to use ComBat and other methods such as SVA (it is by A. E. Teschendorff, who developed ISVA, among other things).

score 1 · Answer 2 · 2021-06-02

1

Entering edit mode

4.0 years ago

Papyrus ★ 3.1k

I don't normally use ComBat (I use limma::removeBatchEffect), but looking at the documentation, I'd say that in batch you include your batch variable and in the mod argument you include a design with any other variables (NOT batch) you want to "preserve" while removing batch.

This is because when you fit the multivariate model, the estimation of batch will change if you also have other covariates (because they can explain common variance). So you could say that batch is "being adjusted for" the other covariates and that may be the reason for the wording in the tutorial.

I'm not sure ComBat lets you adjust for multiple batches though, so you either combine them into one or use another method.

ADD COMMENT • link 4.0 years ago by Papyrus ★ 3.1k

0

Entering edit mode

Thank you for your reply, it does make sense in a way for the usage to be like you described, however we aren't 100% sure. There seems to be a fundamental problem with the documentation of both SVA and ComBat as I have mentioned in an edit to my question

ADD REPLY • link 4.0 years ago by n,n ▴ 390

0

Entering edit mode

It may not be clear as, in this workflow by Jeff Leak, who is renowned for his work in batch effects, batch is provided in the data being passed to mod: http://jtleek.com/genstats/inst/doc/02_13_batch-effects.html

This does contradict the actual ComBat() documentation, though.

Is mod even needed? I would prefer to just have the corrected matrix returned to me, with which I would then proceed with the usual workflow.

NB for others arriving here: use ComBat-seq for correcting bulk RNA-seq raw counts.

ADD REPLY • link 4.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin, thanks for the link,

I had seen it before (it is a very nice workflow), but in the only call to ComBat I'm seeing in the workflow is this one:

batch = pheno$batch
modcombat = model.matrix(~1, data=pheno)
modcancer = model.matrix(~cancer, data=pheno)
combat_edata = ComBat(dat=edata, batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)

Isn't he inputting the batch as batch, and the model.matrix (in this case without variable of interest) to mod? Or am I missing something here?

ADD REPLY • link 4.0 years ago by Papyrus ★ 3.1k

0

Entering edit mode

Yes, but, to me, it seems that batch would still be in the modcombat object, no?

modcombat is constructed using pheno... pheno appears to still contain batch (never removed)

ADD REPLY • link 4.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Yes, it is inside pheno, but by constructing modcombat with this formula: ~1, you generate a null model matrix full of 1s, not with batch information. I'm referring to this (my output):

library(bladderbatch)
data(bladderdata)

pheno = pData(bladderEset)
batch = pheno$batch
modcombat = model.matrix(~1, data=pheno)

head(modcombat)

(Intercept)
GSM71019.CEL           1
GSM71020.CEL           1
GSM71021.CEL           1
GSM71022.CEL           1
GSM71023.CEL           1
GSM71024.CEL           1

[...]

(or maybe I'm confusing things)

ADD REPLY • link 4.0 years ago by Papyrus ★ 3.1k

0

Entering edit mode

Ah, indeed, I missed the ~ 1. So, we are all aligned here:

ADD REPLY • link 4.0 years ago by Kevin Blighe 89k

0

Entering edit mode

yeah passing pheno to model.matrix() does nothing here except ensuring the resulting vector has the right length. I've also seen this link you posted before, but I don't think it clears any doubts as the examples are too simplified.

ADD REPLY • link 4.0 years ago by n,n ▴ 390

score 1 · Answer 3 · 2021-06-17

I've found that in combatSeq's github (a version of combat for raw rna-seq counts) the model matrix passed to combat is specifically stated to be comprised of variables to be preserved:

In ComBat-seq, user may specify biological covariates, whose signals will be preserved in the adjusted data. If the user would like to specify one biological variable, they may use the group parameter

If users wish to specify multiple biological variables, they may pass them as a matrix or data frame to the covar_mod parameter

It still is unclear if this is the same intended usage for legacy combat though.