SVA : Setting up the data from an ExpressionSet and applying sva function to extimate batches
0
4
Entering edit mode
7.5 years ago
lessismore ★ 1.4k

Hello,

I am checking SVA for estimating batches in an expression atlas.

This expression atlas is composed my multiple experiments (different dates) and multiple tissues analysed from wild type samples.

Contrarily to the general aim of this analysis, I am not interested in performing a DE analysis, but I am interested in normalizing this data for a following a network analysis.

For this reason I'll consider the two methods:

  1. SVA : identifying and estimating surrogate variables for unknown sources of variation.
  2. ComBat: directly removing known batch effects using ComBat in SVA.

Potentially I have no adjustment variables nor variables of interest. For the 1st method what will be the full model matrix that I have to set?

Hope it is clear,

Thanks in advance.

sva combat batch-effect • 12k views
ADD COMMENT
1
Entering edit mode

it depends on what you are interested in, for Combat if you are aware of the confounders then you directly use them in your model matrix and pass them through combat for reducing the effect of known confounders or batches. With SVA you are trying to identify the most number of confounding variables that you might use to model your data, however, take a note that you will not be seeing all the SVA values to be giving the best fit adjustment probably some will overfit or some will not have a proper differential estimate so you might have to make a trial for the values to see which is the best fit that keeps the order of your hypothesis. This is sample code but we need to understand understand your design

dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
##calculating the variables
n.sv <- num.sv(dat,mod,method="leek") #gives 11
## using 4
lnj.corr <- svaBatchCor(dat,mod,mod0,n.sv=4)
co <- lnj.corr$corrected
plPCA(co)
ADD REPLY
0
Entering edit mode

Thank you very much for your comment. I would like to test the two ways. I merged an amount of NGS experiments (basically on the same tissues) but carried out on several dates. A preprocessing has been done removing the background signal and quantile normalizing all data. Now i know there are batch-effects and i know the number of batches i have (so i am ready for the analysis with ComBat). But i also want to perform SVA because i want to check other possible confounders that i may miss using only the first approach. Then i will probably correct for the SVA results because, as i mentioned, i'm only interested in adjusting for the technical souces of variation and i wouldnt consider the biological ones for my purposes. is it more clear now? what would you suggest? thanks in advance!

ADD REPLY
2
Entering edit mode

SVA will only give you a numerical value of the confounders but you will not know which are actual attributes of this confounders unless you adjust for each attribute and plot PCA to see how the adjustment changes and what is reduced as an effect of the adjustments to avoid i) over fitting and ii) accounting for the exact confounders of your choice. Otherwise, SVA will give you a number and you will model everything around it and it will not tell you which each of the confounder effect it is. I hope it is now clear?

ADD REPLY
0
Entering edit mode

Thanks, i calculated my variables.

> mod = model.matrix(~as.factor(stage), data=Study)
> mod0 = model.matrix(~1,data=Study)
> n.sv = num.sv (data, mod,method = "leek")
> n.sv
[1] 7
> svobj = sva(data,mod,mod0,n.sv=n.sv)
Number of significant surrogate variables is:  7 
Iteration (out of 5 ):1  2  3  4  5

Now, i was expecting 6 surrogate variables. When you say "adjusting for each attribute" you mean reducing the dataset to check (with PCA) which attributes (observations) were causing this, correct? A second question: i don't understand when you write :

## using 4
lnj.corr <- svaBatchCor(dat,mod,mod0,n.sv=4)
co <- lnj.corr$corrected
plPCA(co)

There's no "svaBatchCor" function from what i have seen. Sorry for the petulance, i am following the tutorial (https://www.bioconductor.org/packages/devel/bioc/vignettes/sva/inst/doc/sva.pdf) and it's a completely new topic for me.

ADD REPLY
1
Entering edit mode

ah my bad, its a small function , you can find it in other posts but am providing it here.

svaBatchCor <- function(dat, mmi, mm0,n.sv=NULL){
    dat <- as.matrix(dat)
    Y <- t(dat)
    library(sva)
    ifis.nulln.sv))   n.sv <- num.sv(dat,mmi,method="leek")
    o <- sva(dat,mmi,mm0,n.sv=n.sv)
    W <- o$sv
    alpha <- solve(t(W) %*% W) %*% t(W) %*% Y
    o$corrected <- t(Y - W %*% alpha)
    return(o)
}

Usage of the function is as you have already specified and I also mentioned. So if you have 7 SV you can start from 3 or 4 and correct for them and try to see the PCA, however I hope you know how to model the matrix with SVA if you are using DESeq2 or edgeR. Since you will have to use them when you make your new model. Good luck

ADD REPLY
0
Entering edit mode

Thanks for your answer. I am starting to get things more clear. Could you please tell if your function is equivalent to Remove Surrogate Variables and Batch ?? From what i understood in this link they remove all surrogate variables from the matrix. Is it a not adviceable approach?? As i was mentioning, my purpose is not DE analysis but a gene co-expression network (GCN) analysis. Could you please link some relevant papers/tutorials/guidelines where i can follow reading about this? Thanks in advance.

ADD REPLY
0
Entering edit mode

If it's gene co expression network then you take a look at WGCNA. That's a better way to find coexpressed modules. I have not used surrogate variable analysis in WGCNA so am not well informed in that one. But if it works on normalized expression value you can always use SVA and create newly corrected normalized expression matrix and use WGCNA. Take a look at WGCNA tutorials and it's use with modeling of batch effects or surrogate variable. You will definitely find something. Good luck.

ADD REPLY
0
Entering edit mode

Hey @vchris_ngs, considering ComBat as the best method to correct the batch-effect biased matrix (read some papers about it), i decided to go like this:

library(batchQC)
sample <- pheno$Sample.Name
batch <- pheno$Batch
condition <- pheno$Covariate.1
pdata <- data.frame(sample, batch, condition)
modmatrix <- model.matrix(~as.factor(condition), data=pdata)
combat_data_matrix <- ComBat(dat=datamatrix, batch=batch, mod=modmatrix)

i referred to http://bioconductor.org/packages/release/bioc/html/BatchQC.html The density and Q-Q plots curves for the mean and the variance of the normal parametric assumption compared the empirical distribution let me decide for a parametric combat. Could you give me an opinion? Could you give me an example of overfitting?

ADD REPLY
2
Entering edit mode

I will not go over the debate of which method is good or better for batch correction. If you are trying to visualize the effects of a batch inside of a linear model, then you can specify the design matrix (without the batch factor) as well as the batch factor when calling removeBatchEffect() from the edgeR package. This will also do the work. Similarly you can perform the same with combat. All these will work if apriori you know which are the factors that contribute to your batches. If you do not have the confounding factors then you can use SVA. This is the clarification. Since SVA at times produces too many variables , then modelling around all of them might give a classical case of overfitting. So it is not always advisable to model all the variables that sva gives. With combat you can model only around the batches that you are sure about. If you want to know about over fitting, try this link. If I remember correctly that you are using WGCNA and want to use batch effect reduction there. Then according to WGCNA you can use comBat since that is what they use most of the time.

Excerpts from WGCNA:

"If data come from different batches, we recommend to check for batch effects and, if needed, adjust for them. We use ComBat for batch effect removal but other methods should also work. Finally, we usually check quantile scatterplots to make sure there are no systematic shifts between samples; if sample quantiles show correlations (which they usually do), quantile normalization can be used to remove this effect. "

"My data are heterogeneous. Can I still use WGCNA? Data heterogeneity may affect any statistical analysis, and even more so an unsupervised one such as WGCNA. What, if any, modifications should be made to the analysis depends crucially on whether the heterogeneity (or its underlying driver) is considered "interesting" for the question the analyst is trying to answer, or not. If one is lucky, the main driver of sample differences is the treatment/condition one studies, in which case WGCNA can be applied to the data as is. Unfortunately, often the heterogeneity drivers are uninteresting and should be adjusted for. Such factors can be technical (batch effects, technical variables such as post-mortem interval etc.) or biological (e.g., sex, tissue, or species differences). If one has a categorical source of variation (e.g., sex or tissue differences) and the number of samples in each category is large enough (at least 30, say) to construct a network in each category separately, it may be worthwhile to carry out a consensus module analysis (Tutorial II, see WGCNA Tutorials). Because this analysis constructs a network in each category separately, the between-category variation does not affect the analysis. If it is desired to construct a single network for all samples, the unwanted or uninteresting sources of large variation in the data should be adjusted for. For categorical (ordinal) factors we recommend using the function ComBat (from the package sva). Users who have never used ComBat before should read the help file for ComBat and work through the sva vignette (type vignette("sva") at the R prompt) to make sure they use ComBat correctly. For continuous sources of variation (e.g., postmortem interval), one can use simple linear regression to adjust the data. There may be more advanced methods out there that also allow the use of covariates and protect from over-correction. Whichever method is used, we caution the user that removal of unwanted sources of variation is never perfect and it can, in some cases, lead to removal of true interesting signal, and in rare cases it may introduce spurious association signal. Thus, only sources of relatively large variation should be removed. "

I think these explanations are pretty good enough to decide if you want to use combat and how you want to use them. It is already available in SVA package as well. So follow some examples and perform it. But just to understand one thing. If you do a plot and do not see variability due to batches in data in PCA plot based on batches then no point doing it even if you know there were batches. This will over fit.

ADD REPLY
0
Entering edit mode

I already came across this good explanation on WGCNA manual, it was quite clear as well as your answer. Thanks a lot!

ADD REPLY
0
Entering edit mode

Could you please explain why starting with 3 or 4 for correction when there are 7 SVs? Thanks.

ADD REPLY
0
Entering edit mode

Hi ivivek_ngs:

I've used your function and found that it gives the same result as obtaining the residuals from this linear model: Y ~ SVs

However: I think maybe this solution doesn't take into account the influence of SVs on our variable of interest?

For example, I found in this paper: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0808-5 that the function they use to regress SVs out of their data is (see the github code):

        cleaningP = function(y, mod, svaobj,  P=ncol(mod)) {
        X=cbind(mod,svaobj$sv)
        Hat=solve(t(X)%*%X)%*%t(X)
        beta=(Hat%*%t(y))
        cleany=y-t(as.matrix(X[,-c(1:P)])%*%beta[-c(1:P),])
        return(cleany)
}

If I modify your function this way it gives the same result as in that paper:

    svaBatchCor2 <- function(dat, mmi, mm0,n.sv=NULL){
        dat <- as.matrix(dat)
        Y <- t(dat)
        library(sva)
        if is.nulln.sv))   {n.sv <- num.sv(dat,mmi,method="leek")}
        o <- sva(dat,mmi,mm0,n.sv=n.sv)
        W <- cbind(mmi,o$sv)
        alpha <- (solve(t(W) %*% W) %*% t(W)) %*% Y
        o$corrected <- t(Y - W[,-c(1:ncol(mmi))] %*% alpha[-c(1:ncol(mmi)),])
        return(o)
      }

I am far from being an expert in this matter. Could you clarify the difference between the two approaches?

thanks

ADD REPLY
1
Entering edit mode

Apologies for the delayed response, the code you showed should also work, the code I showed gives your SVs and using those residuals you are just fitting them for the batch correction but that does not mean you are performing differential analysis. It is just taking care of the normalized dataset applying the residual adjustments that can be used for any visual inspection. Now including them in a model matrix for downstream differential analysis with regression should be done via multi-factorial design, that I did not show here. Hence that might be the reason you might not see it.

  1. Code snippet 1 as far as I understand is using a beta prior to just cleanse the effect nominally so that the residuals do not remove the non-linear or any secondary effects. (this is what I can understand from rough reading of the paper you cited)
  2. Code Snippet 2 is more or less similar on the lines I am doing directly giving you a corrected normalized matrix totally as a return of the function. But then again here discounted to the factor which residual optimally should be used to entirely regress your data and still be not reducing the non-linearity or secondary effects. To my knowledge , bulk RNASeq does not include that much on non-linearity effects given your data is high dimensional. However, batch effect correction for visualization is different from using batch SV residuals as factors for differential analysis.

One should use the same residual factor for both for consistency. This is what I can say. I can only feel you are including the probability of Type 1 into account but not much of a difference. However, both the approaches of the code does not tell you which is the optimal residual that gives best fit visualization and not run into over-fitting. This is my hunch. I might be not entirely correct. Let me know if I was clear enough?

ADD REPLY
0
Entering edit mode

Thanks for the detailed and thorough response.

Yes, I think my confusion relates to what you say here: "However, batch effect correction for visualization is different from using batch SV residuals as factors for differential analysis".

In my case I was playing with a linear model: Y ~ tumor/control + sva1 + sva2 to detect differential expressed genes (DEGs) between control and tumor. I fit the model and detect some significant DEGs. Thus, when writing a report on the results and making plots about the DEGs, the coefficients in the model do NOT match the differences in means of the genes for the original expression data (of course, because the model includes SVAs).

However, they also do NOT match the differences in means of the expression corrected for SVAs, as returned by the function you showed (which returns the residuals from Y ~ sva1 + sva2).

They YES match the values corrected by the other proposed function, which is the same as calculating the residuals from Y ~ sva1 + sva2 but substituting the SVA coefficients for the SVA coefficients of Y ~ tumor/control + sva1 + sva2.

This is why in this case, I think maybe the last solution is more appropriate, because what I want to represent is "the expression values that are being evaluated in my final model" (the real differences in means that are being tested across the coefficients)

ADD REPLY
0
Entering edit mode

Can you show some plots for the scenario. It sounds interesting. However, I also don’t expect a linear model using residuals multifactors for regressing the data giving differential gene expression will exactly match with batch corrected value based on mean expression for the same genes. I have doubts there but I might be mistaken. We do use priors or shrinkages in our differential model if you are using deseq2, so you might not get directly 1:1 relationship of mean values of degs similar to mean value of that same gene pulled from normalized & batch corrected gene expression. What’s the coefficient correlation? What model for differential are you using? voom/edgeR/ DESeq2? They do have a bit of differences as to how the entire flow of differential value for a gene is reached.

ADD REPLY
0
Entering edit mode

I'm using limma for the tests and such.

Yes, that's exactly what I was trying to observe: see If I got the same results from limma from doing: Y ~ tumor/control + sva1 + sva2 Versus doing the model with Y and tumor/control pre-adjusted for the linear influence of the SVAs: ** Y.adj ~ tumor/control.adj

What I find is that I get the same coefficients for those models, and these coefficients also match the differences in means for the adjusted values for visualization when adjusted using:

svaBatchCor2 <- function(dat, mmi, mm0,n.sv=NULL){
    dat <- as.matrix(dat)
    Y <- t(dat)
    library(sva)
    if is.nulln.sv))   {n.sv <- num.sv(dat,mmi,method="leek")}
    o <- sva(dat,mmi,mm0,n.sv=n.sv)
    W <- cbind(mmi,o$sv)
    alpha <- (solve(t(W) %*% W) %*% t(W)) %*% Y
    o$corrected <- t(Y - W[,-c(1:ncol(mmi))] %*% alpha[-c(1:ncol(mmi)),])
    return(o)
  }

I'm actually using DNA methylation data. I talked about expression because I didn't want the message to be too confusing. So I use "raw" limma (not with voom method). I think limma does not do more that approximate variance measurements to the data but should the model coefficients should be differences between mean of the compared groups in the absence of covariates. Incidentally I was unable to find the correlation coefficients from the limma output.

I could send you the code plus the example data if you are interested.

ADD REPLY
1
Entering edit mode

Wwell, limma is used for DNA methlylation. Should not be an issue, it is just not the voom transformation. Well in voom you will have the difference since it is for RNASeq, while traditionally limma is for array based works so DNAmethylation fits that bill, in that case your query should be valid. The only thing that I fear is, about the data type. Are you using the beta values or M values? There are dedicated workflows to take care of this. Now for removing or taking care of batch effects, the best guidance is first look at PCA on your data to understand what is capture in PC1, if its other than biological then your SVA/combat should come into play and anything you want to use for taking care of normalization and downstream differential.

A cross-package Bioconductor workflow for analysing methylation array data

Underlying premise, either use coefficients in your model matrix to fit them as covariates to make the idfferential or apriori use them in your data and perform the differential. Mostly both should be on similar lines if there are not a lot of SVs and not much of transformation as in voom or any prior shrinkage.

ADD REPLY
1
Entering edit mode

I'm using M-values as recommended because they're more gaussian. Indeed, I was looking at PCA before and after adjustment to see how the variance was shrinking in some dimensions related or not to known variables.

Regarding the use of the coefficients before the differential I think I will incorporate them to the model and just do these adjustments for visualization. Jeff Leek did not recommend pre-adjusting in this answer as one would change degrees of freedom: answer on cleaning data before differential testing

But, yes, as you said, I checked and usually there are not many changes in the number of detected significant features when pre-adjusting or adjusting inside the models.

Thanks for your comprehensive advice.

ADD REPLY
1
Entering edit mode

Glad it worked out now at your end.

I should have pointed out the part of Jeff Leek advice( I was indicating the same but I was not that comprehensive out there), since pre-adjustment is only advisable for a visualization perspective but for model they should be incorporated as covariates to not impact the degrees of freedom. I guess I mentioned the term residuals but I should have been more clear.

ADD REPLY
0
Entering edit mode

Hi @ivivek_ngs,

Excuse me, I have two questions. 1. When you write "plPCA(co)", but I can't seem to find the function "plPCA" anywhere. Is it "patch-based local PCA"? Could you please tell me which package is the "plPCA" function from; 2. Could you please link some relevant papers where i can follow reading about your method? Thanks in advance.

ADD REPLY
1
Entering edit mode

Hi, Apologies for the late reply. plPCA() was basically a wrapper around the PCA function developed in my former lab. Any standard PCA function or packages having PCA should work. There are currently multiple exploratory packages available in BioConductor or CRAN for performing PCA or any other standard dimension reduction that can be used for performing that step.

ADD REPLY
0
Entering edit mode

Thank you very much!!

ADD REPLY

Login before adding your answer.

Traffic: 1506 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6