Hi everyone,
I am conducting a meta-analysis of RNA-seq data using 10 different datasets and analyzing their PCA plots before and after batch correction. I have noticed significant differences in how these datasets cluster, and I am looking for insights into why this might be happening.
Individual PCA plotting before batch correction: I run each dataset individually, apply the variance stabilizing transformation (VST), and plot a PCA for each dataset. The combined plot is then built by binding together the PCA data points from each dataset.
Combined PCA after individual VST: I transform each dataset individually with VST, combine the transformed matrices, and run a single PCA on the merged matrix. This differs from the first approach in that PCA is computed once on the combined data rather than separately per dataset.
PCA after batch correction: After VST, I apply ComBat to remove batch effects (e.g., platform, library layout, or a combination of both) and then plot a PCA of the batch-corrected data.
I have shared three sets of code:
1. Code for PCA plotting before batch correction (Individual PCA Plotting Before Batch Correction)
library(DESeq2)
library(ggplot2)

counts_012 <- read.delim("counts_012.csv", header = TRUE, row.names = 1, sep = ",", check.names = FALSE)
counts_040_R <- read.delim("counts_040_R.csv", header = TRUE, row.names = 1, sep = ",", check.names = FALSE)
and so on...
colData_012 <- read.delim("colData_012.csv", header= TRUE,sep = ",")
colData_040_R <- read.delim("colData_040_R.csv", header= TRUE,sep = ",")
and so on...
dds_012 <- DESeqDataSetFromMatrix(countData = counts_012, colData = colData_012, design = ~ treatment)
dds_040_R <- DESeqDataSetFromMatrix(countData = counts_040_R, colData = colData_040_R, design = ~ treatment)
and so on...
aqp_vsd_012 <- varianceStabilizingTransformation(dds_012, blind = TRUE)
aqp_vsd_040_R <- varianceStabilizingTransformation(dds_040_R, blind = TRUE)
and so on...
pca_012 <- plotPCA(aqp_vsd_012, intgroup = "treatment")
pca_040_R <- plotPCA(aqp_vsd_040_R, intgroup = "treatment")
and so on...
pca_012_data <- pca_012$data
pca_040_R_data <- pca_040_R$data
and so on...
combined_data <- rbind(
  cbind(pca_012_data, dataset = "012"),
  cbind(pca_040_R_data, dataset = "040_R")
  # and so on...
)
pca_012_manual <- prcomp(t(assay(aqp_vsd_012)))
pca_040_R_manual <- prcomp(t(assay(aqp_vsd_040_R)))
and so on...
pca_012_var_explained <- pca_012_manual$sdev^2 / sum(pca_012_manual$sdev^2)
pca_040_R_var_explained <- pca_040_R_manual$sdev^2 / sum(pca_040_R_manual$sdev^2)
and so on...
# Combine PCA results into a single data frame for ggplot
pca_combined_data <- rbind(
  data.frame(PC1 = pca_012_manual$x[, 1], PC2 = pca_012_manual$x[, 2], treatment = colData_012$treatment, dataset = "012"),
  data.frame(PC1 = pca_040_R_manual$x[, 1], PC2 = pca_040_R_manual$x[, 2], treatment = colData_040_R$treatment, dataset = "040_R")
  # and so on...
)
# Calculate labels for axes using the variance explained
x_label <- paste0("PC1 (", round(100 * mean(c(
pca_012_var_explained[1], pca_040_R_var_explained[1], and so on...
)), 1), "% variance)")
y_label <- paste0("PC2 (", round(100 * mean(c(
pca_012_var_explained[2], pca_040_R_var_explained[2], and so on...
)), 1), "% variance)")
custom_shapes <- 0:(length(unique(combined_data$dataset)) - 1)
ggplot(pca_combined_data, aes(x = PC1, y = PC2, color = treatment, shape = dataset)) +
geom_point() +
theme_classic() +
labs(x = x_label, y = y_label, color = "Treatment", shape = "Dataset") +
scale_shape_manual(values = custom_shapes)
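A variant of this plot that avoids overlaying coordinates from different PCA rotations (a rough sketch, reusing pca_combined_data and the axis labels built above) is to give each dataset its own panel:
# One panel per dataset; the PCs of separately computed PCAs are not in a shared space
ggplot(pca_combined_data, aes(x = PC1, y = PC2, color = treatment)) +
geom_point() +
facet_wrap(~ dataset, scales = "free") +
theme_classic() +
labs(x = x_label, y = y_label, color = "Treatment")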
2. Code for PCA plotting before batch correction (all datasets combined before PCA)
library(DESeq2)
library(ggplot2)

counts_012 <- read.delim("counts_012.csv", header = TRUE, row.names = 1, sep = ",", check.names = FALSE)
counts_040_R <- read.delim("counts_040_R.csv", header = TRUE, row.names = 1, sep = ",", check.names = FALSE)
and so on...
colData_012 <- read.delim("colData_012.csv", header= TRUE,sep = ",")
colData_040_R <- read.delim("colData_040_R.csv", header= TRUE,sep = ",")
and so on...
colData_012$dataset <- "012"
colData_040_R$dataset <- "040_R"
and so on...
combined_colData <- rbind(
  colData_012,
  colData_040_R
  # and so on...
)
dds_012 <- DESeqDataSetFromMatrix(countData = counts_012, colData = colData_012, design = ~ treatment)
dds_040_R <- DESeqDataSetFromMatrix(countData = counts_040_R, colData = colData_040_R, design = ~ treatment)
and so on...
aqp_vsd_012 <- varianceStabilizingTransformation(dds_012, blind = TRUE)
aqp_vsd_040_R <- varianceStabilizingTransformation(dds_040_R, blind = TRUE)
and so on...
combined_vsd <- cbind(
  assay(aqp_vsd_012),
  assay(aqp_vsd_040_R)
  # and so on...
)
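# Note: cbind() assumes every VST matrix has exactly the same genes in the same row order.
# If the datasets were quantified against different annotations (an assumption on my part),
# it may be safer to restrict each matrix to the shared genes first, for example:
# common_genes <- Reduce(intersect, list(rownames(assay(aqp_vsd_012)),
#                                        rownames(assay(aqp_vsd_040_R))))   # and so on...
# combined_vsd <- cbind(assay(aqp_vsd_012)[common_genes, ],
#                       assay(aqp_vsd_040_R)[common_genes, ])               # and so on...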
pca_corrected <- prcomp(t(combined_vsd))
pca_df <- as.data.frame(pca_corrected$x)
colnames(pca_df) <- paste0("PC", 1:ncol(pca_df))
pca_df$treatment <- combined_colData$treatment
pca_df$dataset <- combined_colData$dataset
x_label <- paste0("PC1 (", round(pca_corrected$sdev[1] / sum(pca_corrected$sdev) * 100, 2), "%)")
y_label <- paste0("PC2 (", round(pca_corrected$sdev[2] / sum(pca_corrected$sdev) * 100, 2), "%)")
custom_shapes <- 0:(length(unique(combined_colData$dataset)) - 1)
ggplot(pca_df, aes(x = PC1, y = PC2, color = treatment, shape = dataset)) +
geom_point() +
theme_classic() +
labs(x = x_label, y = y_label, color = "Treatment", shape = "Dataset") +
scale_shape_manual(values = custom_shapes)
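To quantify how strongly the leading PCs track the dataset label versus treatment, a quick check on pca_df can help (a rough sketch; pc_r2 is a small helper introduced here only for illustration):
# R^2 of each PC regressed on a grouping factor: values near 1 mean that PC is dominated by the factor
pc_r2 <- function(pc, grp) summary(lm(pc ~ grp))$r.squared
sapply(pca_df[, c("PC1", "PC2")], pc_r2, grp = factor(pca_df$dataset))
sapply(pca_df[, c("PC1", "PC2")], pc_r2, grp = factor(pca_df$treatment))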
3. Code for PCA plotting after batch correction using ComBat
library(DESeq2)
library(sva)
library(ggplot2)

counts_012 <- read.delim("counts_012.csv", header = TRUE, row.names = 1, sep = ",", check.names = FALSE)
counts_040_R <- read.delim("counts_040_R.csv", header = TRUE, row.names = 1, sep = ",", check.names = FALSE)
and so on...
colData_012 <- read.delim("colData_012.csv", header= TRUE,sep = ",")
colData_040_R <- read.delim("colData_040_R.csv", header= TRUE,sep = ",")
and so on...
colData_012$dataset <- "012"
colData_040_R$dataset <- "040_R"
and so on...
combined_colData <- rbind(
  colData_012,
  colData_040_R
  # and so on...
)
dds_012 <- DESeqDataSetFromMatrix(countData = counts_012, colData = colData_012, design = ~ treatment)
dds_040_R <- DESeqDataSetFromMatrix(countData = counts_040_R, colData = colData_040_R, design = ~ treatment)
and so on...
aqp_vsd_012 <- varianceStabilizingTransformation(dds_012, blind = TRUE)
aqp_vsd_040_R <- varianceStabilizingTransformation(dds_040_R, blind = TRUE)
and so on...
combined_vsd <- cbind(
  assay(aqp_vsd_012),
  assay(aqp_vsd_040_R)
  # and so on...
)
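# Note: 'batch' is not created anywhere above. Assuming the colData files contain columns
# such as 'platform' and 'library_layout' (hypothetical names), a combined batch factor
# could be built like this:
# combined_colData$batch <- factor(paste(combined_colData$platform,
#                                        combined_colData$library_layout, sep = "_"))
# With 10 datasets it may be simpler to treat the dataset itself as the batch:
# combined_colData$batch <- factor(combined_colData$dataset)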
batch_corrected_vsd <- ComBat(
dat = combined_vsd,
batch = combined_colData$batch,
mod = model.matrix(~ treatment, data = combined_colData),
par.prior = TRUE,
prior.plots = FALSE
)
pca_corrected <- prcomp(t(batch_corrected_vsd))
pca_df <- as.data.frame(pca_corrected$x)
colnames(pca_df) <- paste0("PC", 1:ncol(pca_df))
pca_df$treatment <- combined_colData$treatment
pca_df$dataset <- combined_colData$dataset
x_label <- paste0("PC1 (", round(pca_corrected$sdev[1] / sum(pca_corrected$sdev) * 100, 2), "%)")
y_label <- paste0("PC2 (", round(pca_corrected$sdev[2] / sum(pca_corrected$sdev) * 100, 2), "%)")
custom_shapes <- 0:(length(unique(combined_colData$dataset)) - 1)
ggplot(pca_df, aes(x = PC1, y = PC2, color = treatment, shape = dataset)) +
geom_point() +
theme_classic() +
labs(x = x_label, y = y_label, color = "Treatment", shape = "Dataset") +
scale_shape_manual(values = custom_shapes)
Questions:
1. Why is there a significant difference between the PCA plots generated using the first and second code snippets? Which code is the correct or valid one?
2. Why does the third plot (after batch correction) still show clustering by dataset? Could it be related to the batch variable I used (e.g., platform or library layout or a combination of both)?
3. Are there alternative batch correction methods or strategies I should consider to better address hidden variables?
Thanks in advance for any kind of assistance!
I did not go through this wall of code and plots, but generally: a meta-analysis means analyzing the datasets separately and then aggregating the results to reach a consensus, testing which per-dataset findings are reproducible and robust. I don't see the point of merging 10 datasets.
Thank you for your reply. I appreciate the reminder about the traditional approach to meta-analysis. Just to clarify: my intention in merging the datasets is exploratory analysis (i.e., PCA). I wanted to create a combined PCA plot to visualize the distribution of samples and detect potential batch effects before conducting separate analyses on the individual datasets. I'm not trying to merge the datasets for downstream differential expression analysis, but rather to identify any technical biases that might influence my separate analyses.
Do you have suggestions for alternative ways to visualize the distribution of samples across multiple datasets without merging them? Or is there a way to adjust my approach so that the exploratory step is valid while maintaining the integrity of the meta-analysis? Your input would be greatly valued.
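One adjustment I am considering for the exploratory step only (a rough sketch, reusing combined_vsd and combined_colData from the second snippet; limma's removeBatchEffect would be used here purely for visualization, not for downstream analysis):
library(limma)
# Remove the dataset effect for plotting only, while protecting the treatment term via 'design'
design <- model.matrix(~ treatment, data = combined_colData)
vsd_for_plot <- removeBatchEffect(combined_vsd,
                                  batch = combined_colData$dataset,
                                  design = design)
pca_plot <- prcomp(t(vsd_for_plot))  # then plot PC1/PC2 exactly as in the snippets above
Does something like this seem a reasonable way to keep the exploratory step valid without affecting the per-dataset analyses?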