For RNA-seq data analysis using DESeq2, a recommended method for batch effect removal is to include the batch in the design of the experiment, as design = ~ batch + condition.
The presence of a batch effect was already known from the experimental design and was also detected by a PCA biplot of the log-transformed raw counts. After running DESeq2, how can we verify that the batches have been correctly taken care of? A PCA plot on counts(dds, normalized=TRUE) did not show any changes.
Correcting the batch effect with ComBat brings out the desired grouping; however, I understand that the output from ComBat should not be used with DESeq2. Is there any other suggested tool for differential gene expression analysis after batch correction with ComBat?
Hello Kevin,
I am one of the confused ones regarding removing batch effects. How is 'accounting for' a batch effect different from 'removing' it? When is it correct to use the 'accounted for' approach (i.e. the ~ batch + cond formula), and when is it correct to remove the batch effect? Thank you for shedding light on these dark areas.
Hello jfo, when we 'account for' the batch effect, batch is being treated as a covariate in the statistical model. In regression, when we want to adjust for covariates, say smoking status, we usually include them in our design formula, e.g. ~ smokingStatus + condition. Then, when we derive our test statistics for each gene, we know that these test statistics will have been adjusted for the effect of smoking status. The same logic applies when we are adjusting for batch and when we are designing the formula for DESeq2, limma, edgeR, etc.

However, in none of this is the underlying data modified. We are merely adjusting our statistical inferences based on smokingStatus, batch, etc. So, accounting for batch in this way is only really applicable when we are performing a differential expression analysis, in the case of expression studies.
The above is all well and good until we actually want to take our data and use it downstream. It would make no sense to perform 'machine learning' stuff if we knew that there was a batch effect in the data. The downstream program may have a way to account for batch just like the above, but usually it does not. In this scenario, we have to actually aim to modify our underlying data. This usually involves estimating the effect of batch and then directly adjusting the underlying data so that the batch effect is effectively wiped away.

Thanks @Kevin, I do like your way of clarifying issues! I would appreciate it if you could answer my question. I don't want to do differential expression analysis; I just want to normalize my RNA-seq data and then do some clustering and classification. Based on your explanation, I realized that after adding the batch to the design formula, I also have to remove the batch effect, right?
Yes
Here is the code that I tried for clustering on DESeq2-normalized counts:
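For reference, a sketch of one common way to do this (not necessarily the exact code used here): variance-stabilise with vst() and then strip the batch effect from the transformed values with limma::removeBatchEffect(), which modifies the matrix used for clustering/PCA without touching the DE analysis. It assumes a DESeqDataSet dds whose colData has batch and condition columns:

```r
library(DESeq2)
library(limma)

# 'dds' is a DESeqDataSet built with design = ~ batch + condition.
dds <- DESeq(dds)

# Transform to an approximately homoskedastic, log2-like scale.
vsd <- vst(dds, blind = FALSE)

# Remove the batch effect from the transformed values only; passing the
# condition design protects the biological signal from being removed too.
mat <- limma::removeBatchEffect(assay(vsd),
                                batch  = vsd$batch,
                                design = model.matrix(~ condition, colData(vsd)))

# Cluster on the batch-corrected matrix (samples in columns).
hc <- hclust(dist(t(mat)))
```

This keeps the two uses separate: the ~ batch + condition design handles batch for the differential expression testing, while the corrected matrix mat is what you feed to clustering or classification.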
I have a similar question, I have read many posts on this subject and tried analyzing my data but got an error that I don't fully understand. I would greatly appreciate your help in clarifying what I am doing wrong.
I have RNA-seq data across different tissues, and I want to identify genes that are uniquely expressed in a tissue of interest. Some of these data were generated in the lab and sequenced in the same lane, and other data come from public databases. Can I use DESeq2 normalization in this case? My understanding is that I can, and I tried the following, with type encoding the batch:
Hello again. Based on the data that you have, these 2 designs are exactly the same:
design = ~ condition
design = ~ type
In this case, you can only choose one or the other.
If you are interested in uniquely expressed genes, a differential expression analysis does not directly test for that. You may try, in addition, something more simple, such as calculating gene Z-scores in each tissue.
Hi Kevin, thank you very much. The idea was to use DESeq2 normalization and extract the normalized counts for downstream analysis. To identify tissue-specific genes (the lung transcriptome, for example), I was planning to do a correlation using the normalized counts. Is there a reference for calculating gene Z-scores in each tissue, and does this require batch correction? Thanks again!
Hi!
What do you mean?
Z-scores are used generally for this purpose - these are intuitive to use as they represent standard deviations above or below the mean value.
Perhaps we first need to define what is truly "unique". Is a unique gene one whose expression is literally 0 (zero) in one tissue but not another? Or is it one whose fold-change difference in one tissue over another passes a pre-specified threshold, and that also has a statistically significant adjusted p-value?
So, perhaps first define unique?
Hi Kevin, you are right: what I want to show is actually heatmaps of Z-scores comparing expression levels across tissues, similar to Figure 4 of Brown et al., 2014. I am still unsure of how I should do this analysis using data generated in two different labs.
https://www.nature.com/articles/nature12962 Thank you so much for your help!!
Your heart and liver samples are from different studies?
Yes, I have data generated in the lab (heart) and public data (liver). I also have other data that were generated in my lab but in different batches.
In that case, there's not much that you can do, in terms of batch correction. The best way forward is to just independently transform each dataset to Z-scores, and then compare that way.
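As a sketch of that 'independent Z-score' idea in base R (the toy matrices here are made up for illustration; in practice you would first normalise and log-transform the counts within each dataset):

```r
set.seed(1)

# Z-score each gene (row) across the samples of one dataset.
# scale() works on columns, so transpose in and back out.
zscore_rows <- function(mat) {
  t(scale(t(mat)))
}

# Toy stand-ins for two independently generated datasets.
heart <- matrix(rpois(20, lambda = 50), nrow = 4,
                dimnames = list(paste0("gene", 1:4), paste0("h", 1:5)))
liver <- matrix(rpois(20, lambda = 50), nrow = 4,
                dimnames = list(paste0("gene", 1:4), paste0("l", 1:5)))

# Transform each dataset independently; no cross-dataset batch
# correction is attempted.
z_heart <- zscore_rows(heart)
z_liver <- zscore_rows(liver)
```

After this, each gene has mean 0 and standard deviation 1 within each dataset, so "standard deviations above the mean" is comparable across the two studies even though their absolute counts are not.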
Okay, thanks. Is there a need to normalize (B), or just transform the dataset to Z-scores (A)?
Thanks!!
You definitely need to first normalise the data if it is raw counts, then transform it, and then apply a final transformation to Z-scores.
Possibilities include:
Method 1
Method 2
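As an illustration of the DESeq2 route just described (normalise, transform, then Z-score), here is a hedged sketch; counts and coldata are placeholder names for a raw count matrix and its sample table:

```r
library(DESeq2)

# Step 1: raw counts into a DESeqDataSet.
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ condition)

# Step 2: DESeq() normalises for library size and models dispersion.
dds <- DESeq(dds)

# Step 3: variance-stabilising transformation brings the normalised
# counts onto an approximately normal, log2-like scale.
vsd <- assay(vst(dds, blind = FALSE))

# Step 4: per-gene Z-scores for heatmaps / cross-tissue comparison.
z <- t(scale(t(vsd)))
```

The step numbers match the raw counts → normalise → transform → Z-score ordering discussed in the replies below; rlog() is a common alternative to vst() at Step 3.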
I'm confused about Method 1, step 2. DESeq2 will only take raw counts, so what kind of normalization can I apply in step 2 that will be compatible with DESeq2?
DESeq2 normalises the raw counts for you via the DESeq() function.

Right, that is step 3, so why do I need step 2 in Method 1?
In fact, that is not step 3.
There is a Step 2, where, in the case of DESeq2, the raw counts are normalised for library size and where dispersion is modelled. It is on these normalised counts that the statistical inferences are made, i.e., the differential expression analysis.
As the normalised counts still follow a negative binomial distribution, they are not amenable to the majority of downstream analyses, including clustering, PCA, etc.; this is why we then apply a further transformation, Step 3, to bring these normalised counts closer to a normal distribution.
I see. Thank you so much for taking the time to explain this so well!!
In step 2, before normalising with DESeq2, would it also help to filter out the low-read-count genes? Something like:
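A typical pre-filter looks like the following (a sketch; the thresholds of 10 reads in 3 samples are arbitrary and illustrative, and dds is an already-constructed DESeqDataSet):

```r
library(DESeq2)

# Keep genes with at least 10 reads in at least 3 samples; this mainly
# reduces object size and speeds things up, since DESeq2's own
# independent filtering already down-weights low-count genes.
keep <- rowSums(counts(dds) >= 10) >= 3
dds  <- dds[keep, ]
```

This is done on the raw counts before DESeq() is run, so it is fully compatible with the normalisation in step 2.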
Does that mean that we cannot perform differential expression analysis?