Question

PCA Plotting Differences Before and After Batch Correction in RNA-seq for Meta-Analysis

0

Entering edit mode

6 weeks ago

Nelo ▴ 20

Hi everyone,

I am conducting a meta-analysis of RNA-seq data using 10 different datasets and analyzing their PCA plots before and after batch correction. I have noticed significant differences in how these datasets cluster, and I am looking for insights into why this might be happening.

Individual PCA Plotting Before Batch Correction I run each dataset individually, apply variance stabilizing transformation (VST), and plot PCA for each dataset. The combined PCA plot is generated by combining individual PCA data points from each dataset.
Combined PCA Plot After Individual VST but Combined PCA Calculation I transform the datasets individually with VST, then combine the transformed data matrices and perform a combined PCA. This approach differs from the first code in that it applies PCA after combining the transformed data instead of calculating PCA for each dataset separately.
PCA Plot After Batch Correction I perform batch correction using ComBat to remove batch effects (e.g., platform or library layout or combination of both) after VST. The PCA is then plotted using the batch-corrected data.

I have shared three sets of code:

1. Code for PCA plotting before batch correction (Individual PCA Plotting Before Batch Correction)

    counts_012 <- read.delim("counts_012.csv", header = TRUE, row.names =1, sep = ",", check.names=FALSE)
    counts_040_R <- read.delim("counts_040_R.csv", header = TRUE, row.names =1, sep = ",", check.names=FALSE)
    and so on...

    colData_012 <- read.delim("colData_012.csv", header= TRUE,sep = ",")
    colData_040_R <- read.delim("colData_040_R.csv", header= TRUE,sep = ",")
    and so on...

    dds_012 <- DESeqDataSetFromMatrix(countData = counts_012, colData = colData_012, design = ~ treatment)
    dds_040_R <- DESeqDataSetFromMatrix(countData = counts_040_R, colData = colData_040_R, design = ~ treatment)
    and so on...

    aqp_vsd_012 <- varianceStabilizingTransformation(dds_012, blind=T)
    aqp_vsd_040_R <- varianceStabilizingTransformation(dds_040_R, blind=T)
    and so on...

    pca_012 <- plotPCA(aqp_vsd_012, intgroup = "treatment")
    pca_040_R <- plotPCA(aqp_vsd_040_R, intgroup = "treatment")
    and so on...

    pca_012_data <- pca_012$data
    pca_040_R_data <- pca_040_R$data
    and so on...

    combined_data <- rbind(
      cbind(pca_012_data, dataset = "012"),
      cbind(pca_040_R_data, dataset = "040_R"),
     and so on...
    )

    pca_012_manual <- prcomp(t(assay(aqp_vsd_012)))
    pca_040_R_manual <- prcomp(t(assay(aqp_vsd_040_R)))
    and so on...

    pca_012_var_explained <- pca_012_manual$sdev^2 / sum(pca_012_manual$sdev^2)
    pca_040_R_var_explained <- pca_040_R_manual$sdev^2 / sum(pca_040_R_manual$sdev^2)
    and so on...

    # Combine PCA results into a single data frame for ggplot
    pca_combined_data <- rbind(
      data.frame(PC1 = pca_012_manual$x[, 1], PC2 = pca_012_manual$x[, 2], treatment = colData_012$treatment, dataset = "012"),
      data.frame(PC1 = pca_040_R_manual$x[, 1], PC2 = pca_040_R_manual$x[, 2], treatment = colData_040_R$treatment, dataset = "040_R"),
      and so on...
    )

    # Calculate labels for axes using the variance explained
    x_label <- paste0("PC1 (", round(100 * mean(c(
      pca_012_var_explained[1], pca_040_R_var_explained[1], and so on...
    )), 1), "% variance)")

    y_label <- paste0("PC2 (", round(100 * mean(c(
      pca_012_var_explained[2], pca_040_R_var_explained[2], and so on...
    )), 1), "% variance)")

    custom_shapes <- 0:(length(unique(combined_data$dataset)) - 1)

    ggplot(pca_combined_data, aes(x = PC1, y = PC2, color = treatment, shape = dataset)) +
      geom_point() +
      theme_classic() +
      labs(x = x_label, y = y_label, color = "Treatment", shape = "Dataset") +
      scale_shape_manual(values = custom_shapes)

enter image description here

2. Code for PCA plotting before batch correction (all datasets combined before PCA)

    counts_012 <- read.delim("counts_012.csv", header = TRUE, row.names =1, sep = ",", check.names=FALSE)
    counts_040_R <- read.delim("counts_040_R.csv", header = TRUE, row.names =1, sep = ",", check.names=FALSE)
    and so on...

    colData_012 <- read.delim("colData_012.csv", header= TRUE,sep = ",")
    colData_040_R <- read.delim("colData_040_R.csv", header= TRUE,sep = ",")
    and so on...

    colData_012$dataset <- "012"
    colData_040_R$dataset <- "040_R"
    and so on...

    combined_colData <- rbind(
      colData_012,
      colData_040_R,
    and so on...
    )

    dds_012 <- DESeqDataSetFromMatrix(countData = counts_012, colData = colData_012, design = ~ treatment)
    dds_040_R <- DESeqDataSetFromMatrix(countData = counts_040_R, colData = colData_040_R, design = ~ treatment)
    and so on...

    aqp_vsd_012 <- varianceStabilizingTransformation(dds_012, blind=T)
    aqp_vsd_040_R <- varianceStabilizingTransformation(dds_040_R, blind=T)
    and so on...

    combined_vsd <- cbind(
      assay(aqp_vsd_012),
      assay(aqp_vsd_040_R),
    and so on...
    )

    pca_corrected <- prcomp(t(combined_vsd))
    pca_df <- as.data.frame(pca_corrected$x)
    colnames(pca_df) <- paste0("PC", 1:ncol(pca_df))
    pca_df$treatment <- combined_colData$treatment
    pca_df$dataset <- combined_colData$dataset
    x_label <- paste0("PC1 (", round(pca_corrected$sdev[1] / sum(pca_corrected$sdev) * 100, 2), "%)")
    y_label <- paste0("PC2 (", round(pca_corrected$sdev[2] / sum(pca_corrected$sdev) * 100, 2), "%)")
    custom_shapes <- 0:(length(unique(combined_colData$dataset)) - 1)
    ggplot(pca_df, aes(x = PC1, y = PC2, color = treatment, shape = dataset)) +
      geom_point() +
      theme_classic() +
      labs(x = x_label, y = y_label, color = "Treatment", shape = "Dataset") +
      scale_shape_manual(values = custom_shapes)

enter image description here

3. ![Code for PCA plotting after batch correction using ComBat]

    counts_012 <- read.delim("counts_012.csv", header = TRUE, row.names =1, sep = ",", check.names=FALSE)
    counts_040_R <- read.delim("counts_040_R.csv", header = TRUE, row.names =1, sep = ",", check.names=FALSE)
    and so on...

    colData_012 <- read.delim("colData_012.csv", header= TRUE,sep = ",")
    colData_040_R <- read.delim("colData_040_R.csv", header= TRUE,sep = ",")
    and so on...

    colData_012$dataset <- "012"
    colData_040_R$dataset <- "040_R"
    and so on...

    combined_colData <- rbind(
      colData_012,
      colData_040_R,
    and so on...
    )

    dds_012 <- DESeqDataSetFromMatrix(countData = counts_012, colData = colData_012, design = ~ treatment)
    dds_040_R <- DESeqDataSetFromMatrix(countData = counts_040_R, colData = colData_040_R, design = ~ treatment)
    and so on...

    aqp_vsd_012 <- varianceStabilizingTransformation(dds_012, blind=T)
    aqp_vsd_040_R <- varianceStabilizingTransformation(dds_040_R, blind=T)
    and so on...

    combined_vsd <- cbind(
      assay(aqp_vsd_012),
      assay(aqp_vsd_040_R),
    and so on...
    )

    batch_corrected_vsd <- ComBat(
      dat = combined_vsd,
      batch = combined_colData$batch,
      mod = model.matrix(~ treatment, data = combined_colData),
      par.prior = TRUE,
      prior.plots = FALSE
    )

    pca_corrected <- prcomp(t(batch_corrected_vsd))

    pca_df <- as.data.frame(pca_corrected$x)

    colnames(pca_df) <- paste0("PC", 1:ncol(pca_df))

    pca_df$treatment <- combined_colData$treatment

    pca_df$dataset <- combined_colData$dataset

    x_label <- paste0("PC1 (", round(pca_corrected$sdev[1] / sum(pca_corrected$sdev) * 100, 2), "%)")

    y_label <- paste0("PC2 (", round(pca_corrected$sdev[2] / sum(pca_corrected$sdev) * 100, 2), "%)")

    custom_shapes <- 0:(length(unique(combined_colData$dataset)) - 1)

    ggplot(pca_df, aes(x = PC1, y = PC2, color = treatment, shape = dataset)) +
      geom_point() +
      theme_classic() +
      labs(x = x_label, y = y_label, color = "Treatment", shape = "Dataset") +
      scale_shape_manual(values = custom_shapes)

enter image description here

Questions: Why is there a significant difference between the PCA plots generated using the first and second code snippets? Which code is correct or valid one? Why does the third plot (after batch correction) still show clustering by dataset? Could it be related to the batch variable I used (e.g., platform or library layout or combination of both)? Are there alternative batch correction methods or strategies I should consider to better address hidden variables?

Thanks in advance for any kind of assistance!

batch RNA-seq correction PCA • 475 views

ADD COMMENT • link updated 6 weeks ago by Basti ★ 2.0k • written 6 weeks ago by Nelo ▴ 20

0

Entering edit mode

I did not go through this wall of code and plots, but generally: A meta-analysis means to analyze datasets separately and then aggregate results to achieve a consensus, testing which per-dataset findings are reproducible and robust. I don't see the point merging 10 datasets.

ADD REPLY • link 6 weeks ago by ATpoint 86k

0

Entering edit mode

Thank you for your reply. I appreciate the reminder about the traditional approach to meta-analysis. Just to clarify here, my intention with merging the datasets is for exploratory analysis (i.e., PCA ). I wanted to create a combined PCA plot to visualize the distribution of samples and detect potential batch effects before conducting separate analyses on individual datasets. I’m not trying to merge the datasets for downstream differential expression analysis but rather to identify any technical biases that might influence my separate analyses.

Do you have suggestions for alternative ways to visualize the distribution of samples across multiple datasets without merging them? Or is there a way to adjust my approach to ensure the exploratory step is valid while maintaining the integrity of the meta-analysis? Your input would be greatly valued

ADD REPLY • link 6 weeks ago by Nelo ▴ 20

score 0 · Answer 1 · 2024-11-14

Why is there a significant difference between the PCA plots generated using the first and second code snippets? Which code is correct or valid one?

The first one is in my opinion statistically incorrect : you merge multiple independent PCA. Each PC from each PCA is a linear combination of the initial variables so all PC1 and PC2 from each dataset is different, you cannot compare them and represent them in the same projection (PC1 from dataset1 is different from PC1 from dataset2 : you cannot take a mean PC1 because they are different PC1). It is better to plot PCA for each dataset one by one and see if controls and treated separate.

The second one is statistically valid but as you have seen, batch effect exists (and it is inherent to the nature of each dataset).

Why does the third plot (after batch correction) still show clustering by dataset?

Even if you correct for technical aspects, there still exists differences in the way data were obtained (teams, patients, technique, sampling). This is why I agree with ATpoint answer, you should go with the meta-analysis and limit analysis of datasets as a whole.