I am working on a meta-analysis involving 16 different RNA-seq datasets, each of which is under biotic stress. My goal is to perform batch correction to address technical variability across datasets. However, I am encountering an issue with the batch correction tool ComBat, which requires at least two levels of the batch variable.
Problem Description: For each dataset, my colData structure includes a batch variable that currently only has one level for each dataset. Here’s a simplified view of the structure for two representative datasets:
Dataset 1
Sample_ID treatment Batch
Sample1 control leaf
Sample2 control leaf
Sample3 control leaf
Sample4 treated leaf
Sample5 treated leaf
Sample6 treated leaf
Dataset 2
Sample_ID treatment Batch
Sample7 control root
Sample8 control root
Sample9 treated root
Sample10 treated root
Sample11 treated root
Sample12 treated root
Sample13 treated root
Sample14 treated root
In both cases, each dataset has only one level of the batch variable. ComBat requires at least two levels within the batch variable, and therefore, I am unable to use it for batch correction.
Questions:
- What alternative methods or tools can handle a single-level batch variable? I am particularly interested in methods that can address this issue effectively without requiring multiple batch levels.
- Are there best practices for combining batch variables or creating synthetic batch variables when only one level is present? For instance, could combining multiple factors from my colData (e.g., treatment type and batch) into a composite batch variable be a viable solution?
- How can I ensure that my batch correction effectively removes technical variability while preserving the biological signals?
Any advice or suggestions on how to proceed would be greatly appreciated!
Thank you.
Code used:
counts_dataset1 <- read.delim("counts_dataset1.csv", header = TRUE, row.names =1, sep = ",", check.names=FALSE)
counts_dataset2 <- read.delim("counts_dataset2.csv", header = TRUE, row.names =1, sep = ",", check.names=FALSE)
colData_dataset1 <- read.delim("colData_dataset1.csv", header= TRUE,sep = ",")
colData_dataset2 <- read.delim("colData_dataset2.csv", header= TRUE,sep = ",")
batch_dataset1<-colData_dataset1$batch
batch_dataset2<-colData_dataset2$batch
batch_dataset1_df <- data.frame(batch = batch_dataset1)
batch_dataset2_df <- data.frame(batch = batch_dataset2)
modcombat_dataset1 <- model.matrix(~1, data=batch_dataset1_df)
modcombat_dataset2 <- model.matrix(~1, data=batch_dataset2_df)
combat_mydata_dataset1= ComBat(dat=counts_dataset1, batch=batch_dataset1_df, mod=modcombat_dataset1, par.prior=TRUE, prior.plots=FALSE)
combat_mydata_dataset2= ComBat(dat=counts_dataset2, batch=batch_dataset2_df, mod=modcombat_dataset2, par.prior=TRUE, prior.plots=FALSE)
Error:
Error in ComBat(dat = counts_dataset1, batch = batch_dataset1_df, mod = modcombat_dataset1, :
This version of ComBat only allows one batch variable
Error in ComBat(dat = counts_dataset2, batch = batch_dataset2_df, mod = modcombat_dataset2, :
This version of ComBat only allows one batch variable
If you want to merge the datasets in my opinion you should work with all the datasets in a single object (table or SummarizedExperiment) and use the batch effect of at least the origin of each sample from each dataset:
That is not possible, a batch is a unit, if you only have one unit there can't be a batch effect on that unit. You might have other confounding variables but not a batch-effect one.
You could use
sva::svn
function to estimate other variables that might be confounding the analysis. But yes, sometimes a batch effect is a result of a combination of other factors, but these other factors must have more than one level.You can't unless you have some ground truth that you probably can not get in a meta-analysis: for example, adding known samples in the datasets when sequencing them.