Question

Batch correction of two RNA-Seq data sets

0

Entering edit mode

13 days ago

gogeni5529 ▴ 50

I have got two RNA-Seq data sets to analyze, which were made by the same person, but a year apart. I was surprised to see such a big difference between the two data sets. Is it even possible to compensate for that?

As I was analyzing them, I also added the time difference effect into the mix. Unfortunately, I don't see any improvement in the correlation of the two data sets.

The analysis code and the before-After PCA are added below.

I would like to know, if this mean, I can't analyze the two data sets together, or this only mean to be careful when interpreting the results

As I am using DESeq2, would it maybe be better to use a different tool to compensate for this bias?

cts <- inner_join(rownames_to_column(cts_p203), rownames_to_column(cts_p405), by = "rowname" )  |> column_to_rownames("rowname")
...
coldata <- data.frame(
  name = paste0(rep(x = c("Wt_p203_", "WT_p405_", "Cys_", "Thr_", "Arg_", "Leu_"), each =3), rep(1:3, times = 3) ),
  condition = c(rep("control", times = 6), rep(c("Cystein", "Threonin", "Arginin", "Leucin"), each = 3)),
  batch = c(rep("p203", times =3), rep("p405", times = 9), rep("p203", times = 6) )
)

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ batch +condition)
dds

keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep,]

vsd <- vst(dds, blind=FALSE)
...
pcaData <- plotPCA(vsd, intgroup = c("condition", "batch"), returnData = TRUE)

ggplot(pcaData, aes(x = PC1, y = PC2, color = condition, shape = batch)) +
  geom_point() + 
  scale_color_manual(values = c25[15:19]) +
  scale_alpha_manual(values=c(0.3, 0.7), guide=F) +
  geom_text_repel(aes(PC1, PC2, label = pcaData$name),size = 2)  +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  ggtitle("PCA with vst-transformed data, colored by conditions") +
  labs(shape = "Group", color = "Condition" )

Thanks

DESeq2 batch-correction • 657 views

ADD COMMENT • link 10 days ago by gogeni5529 ▴ 50

0

Entering edit mode

which were made by the same person, but a year apart

Same kit/method of extraction/lib prep/sequencer (2- or 4- color)/length of reads?

ADD REPLY • link 13 days ago by GenoMax 148k

0

Entering edit mode

as far as I know yes. It was a long time ago, and the person responsible for it is unfortunately already gone. but the core facility says it is the case

ADD REPLY • link 13 days ago by gogeni5529 ▴ 50

score 0 · Answer 1 · 2024-12-13

0

Entering edit mode

13 days ago

gogeni5529 ▴ 50

ok, after using limma::removeBatchEffect I can remove the batch effect

But how can I make sure, I use the correct values also for the differential expression?

Can I read the vst-corrected values into a matrix and run DESeq2 with it?

Or is the addition of the function ~batch + condition already takes care of that?

Would it be better to continue analyzing this in limma instead?

batch_corrected

ADD COMMENT • link 13 days ago by gogeni5529 ▴ 50

1

Entering edit mode

Yes, including batch when specifying the design formula for the analysis does the same batch removal as running limma::removeBatchEffects .

ADD REPLY • link 13 days ago by i.sudbery 20k

2

Entering edit mode

Given you have sperate wildtypes from the eaceh of the two batches, and the only comparisons you are likely to make are Treatment vs wildtype, I'd be tempted just to treat them as two entirely independent experiments.

ADD REPLY • link 13 days ago by i.sudbery 20k

0

Entering edit mode

I understand now, that running the formula would result in the same as plotting the samples with the limma function. I was nwondering though, why it looks different, when I run the formula and then plot the Vst-transformed values in a PCA using ggplot like I did in my original post. Why do the two plots looks so different, if using the formula gives me the same effect?

ADD REPLY • link 12 days ago by gogeni5529 ▴ 50

0

Entering edit mode

Adding the design formula to the VST call doesn't remove the batch effect. The design formula in VST is simply used to select the grouping of samples to use when determining the variance.

ADD REPLY • link 12 days ago by i.sudbery 20k

0

Entering edit mode

No I mean, I run the DEseq command with the formula design = ~ batch +condition before calculating the vst (the code is above). S why don't I see the batch effect removed in the PCA, like it is done when using the limma command?

ADD REPLY • link 11 days ago by gogeni5529 ▴ 50

1

Entering edit mode

I'm pretty sure the VST doesn't use any information that is calculated by running DESeq2. It just uses the counts, and calculates its own size factors and dispersions I think. THe differential testing framework implemented by the DESeq function and the exploratory analysis approaches implement in VST are more or less independent.

Its worth noting that DESeq doesn't perform a batch effect removal, and then calculate differential expression on your condition of interest - rather it calculates the effect of your condition of interest while simulataneously accounting for the effect of batch. The output of this is a log fold change between conditions that accounts for batch, not batch corrected counts.