Question

SVA vs limma model

21 months ago

rk.khayami94 ▴ 10

I'm trying to remove batch effects from my data using the sva package. The process described here goes like this:

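library(sva)           # num.sv(), sva()
library(limma)         # lmFit()
library(bladderbatch)  # example data; also loads Biobase for pData()/exprs()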
data(bladderdata)
pheno = pData(bladderEset)
edata = exprs(bladderEset)

# The null model contains only the adjustment variables. Since we are not adjusting
# for any other variables in this analysis, only an intercept is included in
# the model.
mod0 = model.matrix(~1,data=pheno)

# The full model includes both the adjustment variables
# and the variable of interest
mod <- model.matrix(~1+ cancer, data=pheno)

# Identify the number of latent factors that need to be estimated.
n.sv = num.sv(edata,mod,method="leek")

# estimate the surrogate variables
svobj = sva(edata,mod,mod0,n.sv=n.sv)

# include the surrogate variables in both the null and full models
modSv = cbind(mod,svobj$sv)
mod0Sv = cbind(mod0,svobj$sv)

# Adjusting for surrogate variables using the limma package
fit = lmFit(edata,modSv)

Now I want to do a one-vs-all comparison. According to this post by Michael Love, my model should look like this:

limma_mod <- model.matrix(~0 + cancer, data=pheno)

I'm confused about which of these two approaches is correct: (a) use model.matrix(~1+ cancer, data=pheno) as my full model and model.matrix(~1, data=pheno) as the null model for sva, and then, for limma, merge model.matrix(~0+ cancer, data=pheno) with the surrogate variables; or (b) use model.matrix(~0+ cancer, data=pheno) as my full model and model.matrix(~1, data=pheno) as the null model?

Note: using mod0 <- model.matrix(~0, data=pheno) and mod <- model.matrix(~0+ cancer, data=pheno) results in this error:

Error in solve.default(t(mod0) %*% mod0) : 'a' is 0-diml
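For context, a minimal base-R sketch of where this error most likely comes from. The toy `pheno` table below is hypothetical and only stands in for the real phenotype data: a formula of `~0` has no intercept and no terms, so the resulting design matrix has zero columns, and `t(mod0) %*% mod0` is then a 0 x 0 matrix that cannot be inverted.

```r
# Hypothetical toy phenotype table (not the bladderbatch data).
pheno <- data.frame(
  cancer = factor(c("Normal", "Normal", "Cancer", "Cancer"))
)

# ~0 removes the intercept and adds nothing else: zero columns.
mod0_empty <- model.matrix(~0, data = pheno)
print(ncol(mod0_empty))

# The cross-product of a zero-column matrix is 0 x 0.
xtx <- t(mod0_empty) %*% mod0_empty
print(dim(xtx))

# An intercept-only null model has one column and can be inverted.
mod0_ok <- model.matrix(~1, data = pheno)
print(ncol(mod0_ok))
```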

r combat sva microaray batch-effect • 1.8k views



I don't really see how your post title relates to the body of your post. As described here, if you have known batch effects you want to either use ComBat (maybe to get CPM values for heatmaps) or simply account for them in your limma design matrix (for DE analysis), whereas SVA is useful if you have unknown sources of variation. Do you still wish to use SVA? If so, I can dig through some old code and probably help out for real.

ADD REPLY • link 21 months ago by bkleiboeker ▴ 370


Sorry about the title. I am really confused about this topic. Working with ComBat is much easier, but I don't actually know all the sources of variation in my data. I would be grateful if you could help me.

ADD REPLY • link 21 months ago by rk.khayami94 ▴ 10

score 2 · Accepted Answer · 2023-02-22

Ok, line by line:

Your code for generating mod0 and mod looks good. I can see you are working from the sva user guide (which is great), so just to make sure you understand how to adapt this to your real problem: if you know of any other covariates or batch information, you should add them to the formulas for both mod0 and mod (explanation in section 2 of the sva user guide).

For example, let's say you know batch and sex info for the samples. Then your code should read:

mod0 <- model.matrix(~1+batch+sex, data=pheno)
mod <- model.matrix(~1+cancer+batch+sex, data=pheno)
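To make the difference between the two models concrete, here is a small base-R sketch. The `pheno` table, its column names (`cancer`, `batch`, `sex`) and their values are made up for illustration; substitute your own phenotype data. The point is that the only columns present in mod but not in mod0 are the ones for the variable of interest.

```r
# Hypothetical phenotype table; the values are illustrative only.
pheno <- data.frame(
  cancer = factor(c("Normal", "Normal", "Cancer", "Cancer", "Normal", "Cancer"),
                  levels = c("Normal", "Cancer")),  # "Normal" is the reference level
  batch  = factor(c(1, 1, 1, 2, 2, 2)),
  sex    = factor(c("F", "M", "F", "M", "F", "M"))
)

# Null model: known adjustment variables only.
mod0 <- model.matrix(~1 + batch + sex, data = pheno)

# Full model: adjustment variables plus the variable of interest.
mod <- model.matrix(~1 + cancer + batch + sex, data = pheno)

print(colnames(mod0))
print(colnames(mod))

# Columns that exist only in the full model.
interest_cols <- setdiff(colnames(mod), colnames(mod0))
print(interest_cols)
```

Both matrices would then be passed to sva() in place of the intercept-only versions from the user guide.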

Everything else looks good up to fitting the linear model with lmFit(). I'm now seeing that your confusion is really just in the limma steps (but I'll keep the above point because your talk of batch makes me think you might be wanting to correct for both batch effects and surrogate variables).

I am almost certain that you do not actually want an "all vs. one" comparison because comparing against the surrogate variables introduced by sva would be meaningless. If, in your example above, you wanted to find genes differentially expressed in cancer samples relative to control samples, you'd simply need to

fit <- lmFit(edata,modSv)
fit <- eBayes(fit)
topTable(fit, coef="cancer")

Although actually the coef won't be exactly "cancer": it'll be "cancer" with a factor level name appended. The reference (first) level is absorbed into the intercept, so every other level of cancer gets its own column. Look at colnames(modSv), find the column(s) starting with "cancer..." and put the one you want into the coef argument in topTable().
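A quick way to find those column names, sketched in base R with made-up data. The three-level `cancer` factor and the random `sv` matrix are assumptions that only stand in for your real phenotype data and for svobj$sv. Naming the surrogate variable columns is optional but makes the design matrix easier to read.

```r
# Hypothetical phenotype data with a three-level factor.
pheno <- data.frame(
  cancer = factor(c("Normal", "Normal", "Cancer", "Cancer", "Biopsy", "Biopsy"),
                  levels = c("Normal", "Cancer", "Biopsy"))
)
mod <- model.matrix(~1 + cancer, data = pheno)

# Random stand-in for svobj$sv (two surrogate variables).
set.seed(1)
sv <- matrix(rnorm(nrow(pheno) * 2), ncol = 2)
colnames(sv) <- paste0("SV", 1:2)

modSv <- cbind(mod, sv)
print(colnames(modSv))

# Coefficients that belong to the variable of interest.
cancer_coefs <- grep("^cancer", colnames(modSv), value = TRUE)
print(cancer_coefs)

# With a fitted limma object you would then call, for example:
# topTable(fit, coef = cancer_coefs[1])
```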

The "all vs. one" Bioconductor thread you linked is making a fundamentally different comparison than you want to make here. There, they are looking for genes expressed in only one tissue, so they want to contrast TissueA vs. TissueB and vs. TissueC and vs. TissueD. But for you this wouldn't make sense, because all vs. one would imply contrasting against surrogate variables introduced to the design matrix by sva (look at your object modSv to see what I mean).