removing batch effects using ComBat and SVA
2
8
Entering edit mode
8.5 years ago
LJ ▴ 280

Dear all,

I have a problem in processing my microarrays data.And I know how the batch effect is structured,soI would like to remove batch effect using ComBat. But I also want to adjust for unknown sources of noise using sva. Is there a way of combining these two methods? And is it feasible that I run ComBat first, get the ComBat-adjusted matrix, and run sva on this? Is it bad practice to run a second adjustment on data which has already been adjusted by a different package.

Thanks in advance.

SVA ComBat R Batch-effect • 30k views
ADD COMMENT
0
Entering edit mode

Hello, i am checking SVA for estimating batches in my expression atlas. this expression atlas is composed my multiple experiments (different dates) and multiple tissues analysed. Contrarily to the general aim of this analysis, i am not interested in performing a DE analysis, but i am interested in normalizing this data for a following a network analysis. For this reason i'll consider the two methods:

(1) SVA : identifying and estimating surrogate variables for unknown sources of variation in high-throughput experiments. (2) ComBat: directly removing known batch effects using ComBat in SVA.

For the 1st method what will be the full model matrix that i have to set?

Hope it is clear, Thanks in advance.

ADD REPLY
0
Entering edit mode

Hi, I cannot download the bladderdata example and I am having problems with the format of the matrices to run ComBat.

How dat, batch and mod should be formatted?

Thanks.

ADD REPLY
9
Entering edit mode
8.5 years ago
ATRX ★ 1.1k

Ideal way would be: Get the batches from sva (if you are using microarray) or svaseq (RNA-Seq) and provide the batch vector to ComBat. ComBat will give you Batch Corrected matrix. Do the PCA plot and check if the samples are clustering based on genotypes or batches.

And is it feasible that i run ComBat first, get the ComBat-adjusted matrix, and run sva on this? For, running the ComBat you need batch information. There is no need to run sva on Batch adjusted matrix. The difference between ComBat and sva apart from their method is that ComBat does Batch Correction however, sva finds the batches. You may do Batch Correction using sva's surrogate variable by doing simple linear algebra. However, ComBat works fine too.

ADD COMMENT
0
Entering edit mode

Thanks for your reply.

(1)My microarrays data was collected in two experiments,and I do the PCA plot, finding the samples are clustering based on the two experiment batches.

So,i just set the batch information as the two experiment batches(1,1,1,...,2,2,2,...),then run the ComBat with the batches.The Batch adjusted matrix was the corrected expression data.And there is no need to run sva for removing the latent factors. Is it right?

(2)Or i run sva for identifying the surrogate variables,instead of using the two experiments batches. And the following code here is the sva package example:

mod = model.matrix(~as.factor(cancer), data=pheno)
 mod0 = model.matrix(~1,data=pheno)
 n.sv = num.sv(edata,mod,method="leek")
 svobj = sva(edata,mod,mod0,n.sv=n.sv)#### apply the sva function to estimate the surrogate variables

What does the variable of interest mean? Does it mean that we correct the data by removing cancer factor effects and batch effects? And how do i creat the full model matrix,cause i do not have other variable of interest like the example.

Looking forward to your reply,thanks in advance.

ADD REPLY
0
Entering edit mode

Yeah, the first one is correct. Check the PCA plot on corrected batch matrix and you may use limma to do DEGs analysis. I think even if you run sva and check the plots, you will see two batches but its worth trying. I didn't understand your question regarding variable of interest.

ADD REPLY
1
Entering edit mode

Thanks.One more question for you,and the following code is what i used in ComBat:

> pheno
              subgroups batch
    sample1           N     1
    sample2           N     1
    sample3           N     1
    sample4           N     1
    sample5           T     1
    sample6           N     1
    sample7           N     1
    sample8           S     1
    sample9           N     1
    sample10          N     1
    sample11          N     1
    sample12          N     1
    sample13          N     1
    sample14          S     1
    sample15          S     1
    sample16          N     1
    sample17          N     1
    sample18          T     1
    sample19          N     1
    sample20          N     1
    sample21          N     1
    sample22          T     1
    sample23          N     1
    sample24          N     1
    sample25          N     1
    sample26          N     1
    sample27          T     1
    sample28          N     1
    sample29          N     1
    sample30          T     1
    sample31          T     1
    sample32          T     1
    sample33          N     1
    sample34          T     1
    sample35          T     1
    sample36          S     1
    sample37          S     1
    sample38          N     1
    sample39          T     1
    sample40          S     1
    sample41          N     1
    sample42          T     2
    sample43          T     2
    sample44          T     2
    sample45          T     2
    sample46          S     2
    sample47          T     2
    sample48          T     2
    sample49          T     2
    sample50          T     2
    sample51          N     2
    sample52          N     2
    sample53          N     2
    sample54          N     2
    sample55          S     2
    sample56          S     2
    sample57          T     2
    sample58          T     2
    sample59          S     2
    sample60          T     2
    sample61          T     2
    sample62          T     2
    sample63          T     2
    sample64          N     2
    sample65          N     2
    sample66          N     2
    sample67          N     2
    sample68          N     2
    sample69          N     2
    sample70          N     2
    sample71          N     2
    sample72          N     2
    sample73          N     2
    sample74          N     2
    sample75          S     2
    sample76          S     2
    sample77          S     2
    sample78          S     2
    sample79          S     2
    sample80          S     2
    sample81          S     2
    sample82          S     2
    sample83          S     2
    sample84          S     2
    sample85          S     2
    sample86          S     2
    sample87          S     2
    sample88          S     2
    sample89          T     2
    sample90          T     2
    sample91          T     2
    sample92          T     2
    sample93          T     2
    sample94          T     2
    sample95          T     2
    sample96          T     2
    sample97          T     2
    sample98          T     2
    sample99          T     2
    sample100         T     2

batch<-pheno$batch
modcombat<-model.matrix(~1, data=pheno)
combat_mydata= ComBat(dat=mydata, batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)

And if i set

modecombat1<-model.matrix(~subgroups, data=pheno) 
combat_mydata= ComBat(dat=mydata, batch=batch, mod=modcombat1, par.prior=TRUE,prior.plots=FALSE)

so what is the difference between the two codes? And this is what i understand:(1)the former means we suppose the variances are all caused by batches,we correct data by removing batches effects,but at the same time,the variances caused by subgroups are eliminated ;(2)the latter means we think the variances are caused by batches and subgroups,so we correct data with the subgroups variances retained.

So,i should choose modecombat1<-model.matrix(~subgroups, data=pheno),am i right? Thanks.

ADD REPLY
2
Entering edit mode

You should do :

batch<-pheno$batch
modcombat<-model.matrix(~1, data=pheno)
combat_mydata= ComBat(dat=mydata, batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)

because you are regressing the batches and trying to removing any of the unwanted variation and not biological variation. In case of

modecombat1<-model.matrix(~subgroups, data=pheno)

you are regressing on subgroups or trying to remove the biological variation from the subgroups. Now, doing that your batch corrected matrix will more-o-less have no variation between them and therefore your Diff expression genes (DEGs) analysis will yield no statistically significant genes. You use subgroups information when you are performing limma / DEGs analysis and not when you are trying to remove the batch effects. The goals of two tasks are different.

ADD REPLY
0
Entering edit mode

So the code

modcombat<-model.matrix(~1, data=pheno)
combat_mydata<-ComBat(dat=mydata, batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)

only removes batch effects but subgroups variation retained. However,when i add a covariate level--subgroups,the following code

modcombat<-model.matrix(~subgroups, data=pheno)
combat_mydata<-ComBat(dat=mydata, batch=batch, mod=modcombat1, par.prior=TRUE, prior.plots=FALSE)

not only removes batch effects(that's what i want),but also removes biological variation from the subgroups.'cause i just only want to remove batch effects,so the former should be used? And when adding a covariate level to the mod,it means that you want to remove batch effects and some other variation caused by the covariate level simultaneously?

Many thanks.

ADD REPLY
0
Entering edit mode

Yes, use this code only.

modcombat<-model.matrix(~1, data=pheno)
combat_mydata<-ComBat(dat=mydata, batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)
ADD REPLY
2
Entering edit mode

ok,but i just saw a explanation for Covariates in ComBat here from Dr.Johnson (the author of "sva" package),it seems that he gave a opposite meaning of covariates, and the following is the answer to the purpose of including covariates in the mode:

Sorry about the late response. However, this is the most common question I answer on this forum.

As far as including covariates: covariates should ALWAYS be added unless you have a completely balanced factorial design, in which it won't make much difference whether you add them or not. However, if you have an unbalanced proportion of treatment/controls or if some treatments are missing from one of the batches, you MUST include covariates or you risk introducing bias or removing biological signal from the data.

Just to shed so light on why this is true, just think of a simple regression or ANOVA scenario where you have two possibly correlated predictors or factors predicting a response. My favorite example (that I use for my undergraduate students) is the relationship between shark bites (response) and ice cream sales (explanatory). If you fit the regression, you will clearly see a strong correlation between the two variables. However, if you introduce a second explanatory variable, temperature, you will then see no relationship between ice cream and shark bites. Now, if you "adjust" shark bites for ice cream without taking temperature into account, you will remove part of the relationship of interest (shark v. temp) and may not see any correlation between shark bites and temperature at all.

Now completing the analogy, let shark bites be gene expression, ice cream be batch, and temp be the biological condition of interest. I hope it is clear now why its important to include covariates!

Evan

So,i am geting more confused, am i missing something improtant?

ADD REPLY
0
Entering edit mode

Probably you are right. Thanks for letting me know. With my limited experience in case of Batch Correction using ComBat, I have not seen models including covariates. I could be wrong because in my case, I may have balanced proportion of treatment/controls. You may try both the models and see how your PCA plots look like and see which one is giving you better clusters. However, I would believe the authors more than my experience.

ADD REPLY
0
Entering edit mode

Now completing the analogy, let shark bites be gene expression, ice cream be batch, and temp be the biological condition of interest.

I think the temp should be barch or a covariate in a linear model, while the ice cream is the independent variable(gene expression), and shark bites is the dependent variable(biological condition of interest). Does anyone tell me if this understanding is appropriate?

ADD REPLY
0
Entering edit mode

I think Dr. Johnson's words are correct. In batch correction, the dependent variable is the gene expression, the independent variable of interest is the batch, and the unwanted covariate to be controlled is the biological condition. Thus, shark bites=gene expression, ice cream=batch, temperature=biological condition.

Dr. Johnson said:

adjust <shark bites> for <ice cream> without taking <temperature> into account, you will remove part of the relationship of interest (<shark> v. <temp>) and may not see any correlation between <shark bites> and <temperature> at all.

And it can be converted into:

adjust <gene expression> for <batch> without taking <biological condition> into account, you will remove part of the relationship of interest (<gene expr> v. <biological condition>) and may not see any correlation between <gene expression> and <biological condition> at all (in the result of further analyses).

ADD REPLY
0
Entering edit mode
4.4 years ago
zhao03 ▴ 70

Hi, i also used combat to remove the batch bais, however, there are some negative numbers in the result matrix. what is the means of those negative number, can i use this matrix to find the different gene? i inputted FPKM for running combat.

ADD COMMENT

Login before adding your answer.

Traffic: 1944 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6