Hello everyone.
I am trying to conduct a differential expression analysis (DEA) between cancer subtypes on TCGA RNA-Seq raw counts data. In particular, I am interested in breast cancer molecular subtypes, colon cancer molecular subtypes, and kidney cancer subtypes (KIPAN cohort with KICH, KIRC, and KIRP).
I am aware of the presence of batch effects on TCGA data, so, to perform the DEA I want to account for them. If I am not wrong, the batch IDs for TCGA data can be found at the barcodes of sample names. In this way, I can correct by different plates or tissue source sites.
For the first case, the breast cancer cohort, I have separated groups of samples based on the PAM50 classification: basal, her2-enriched, luminal-A, and luminal-B. Then, I have realized a plotMDS with edgeR package to see the distribution of the data, obtaining the image shown below:
I can not see batch effects here due to plates or tss. The code used to this end is:
dge <- DGEList(tcga, group = groups)
head(groups)
[1] Basal-like Basal-like Basal-like Basal-like Basal-like Basal-like
Levels: Basal-like HER2-enriched Luminal A Luminal B
design <- model.matrix(~0+groups)
colnames(design) <- c("basal", "her2", "luma", "lumb")
keep <- filterByExpr(dge, design)
dge <- dge[keep, , keep.lib.sizes = FALSE]
dge.n <- calcNormFactors(dge)
CPM <- cpm(dge.n, log = TRUE)
plotMDS(CPM, labels = groups, col = as.numeric(groups))
Then, I used the removeBatchEffect() function from edgeR package and plotted the result in the next image. However, I do not see any improvement:
The code used to this part:
head(colnames(tcga))
[1] "TCGA.A2.A0T2.01A.11R.A084.07" "TCGA.A2.A04P.01A.31R.A034.07" "TCGA.A1.A0SK.01A.12R.A084.07"
[4] "TCGA.A2.A0CM.01A.31R.A034.07" "TCGA.AR.A1AR.01A.31R.A137.07" "TCGA.B6.A0WX.01A.11R.A109.07"
col_sep <- strsplit(colnames(CPM), split = "\\.")
plates <- c()
for(i in 1:length(col_sep)){
plates <- c(plates, col_sep[i][[1]][6])
}
plates <- as.factor(plates) # 10 different plates
tss <- c()
for(i in 1:length(col_sep)){
tss <- c(tss, col_sep[i][[1]][2])
}
tss <- factor(tss) # 13 different tss
CPM_nBatch <- removeBatchEffect(CPM, batch = plates, batch2 = tss)
plotMDS(CPM_nBatch, labels = groups, col = as.numeric(groups))
At this point I am a bit confused, is there a clear batch effect due to plates and tss? The only thing that I see is that the basal samples are separated from the rest samples. However, the rest of the samples seem to overlap (for example, luminal-A and luminal-B samples). Should I use packages like SVA to detect some underlying variation instead of using the batches from plates and tss?
On the other hand, in the case of the DEA, should the design matrix be something like this to account for batch effects?
design <- model.matrix(~0+groups+final+tss)
colnames(design)[1:4] <- c("basal", "her2", "luma", "lumb")
head(design)
basal her2 luma lumb platesA034 platesA056 platesA084 platesA109 platesA10J platesA115 platesA12D
1 1 0 0 0 0 0 1 0 0 0 0
2 1 0 0 0 1 0 0 0 0 0 0
3 1 0 0 0 0 0 1 0 0 0 0
4 1 0 0 0 1 0 0 0 0 0 0
5 1 0 0 0 0 0 0 0 0 0 0
6 1 0 0 0 0 0 0 1 0 0 0
platesA12P platesA137 tssA2 tssA7 tssA8 tssAN tssAO tssAQ tssAR tssB6 tssBH tssC8 tssD8 tssE2
1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 1 0 0 0 0 0 0 0 0 0 0 0
5 0 1 0 0 0 0 0 0 1 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 1 0 0 0 0
If the design matrix is correct, the DEA with batch effects in the desing matrix should look like this, right?:
dge.disp <- estimateDisp(dge.n, design = design, robust = TRUE)
fit <- glmQLFit(dge.disp, design, robust = TRUE)
basher2 <- makeContrasts(basal-her2, levels=design)
basluma <- makeContrasts(basal-luma, levels=design)
baslumb <- makeContrasts(basal-lumb, levels=design)
her2luma <- makeContrasts(her2-luma, levels=design)
her2lumb <- makeContrasts(her2-lumb, levels=design)
lumalumb <- makeContrasts(luma-lumb, levels=design)
trbasher2 <- glmTreat(fit, contrast = basher2, lfc = log2(1.5))
trbasluma <- glmTreat(fit, contrast = basluma, lfc = log2(1.5))
trbaslumb <- glmTreat(fit, contrast = baslumb, lfc = log2(1.5))
trher2luma <- glmTreat(fit, contrast = her2luma, lfc = log2(1.5))
trher2lumb <- glmTreat(fit, contrast = her2lumb, lfc = log2(1.5))
trlumalumb <- glmTreat(fit, contrast = lumalumb, lfc = log2(1.5))
totalcontrasts <- list(trbasher2, trbasluma, trbaslumb, trher2luma, trher2lumb, trlumalumb)
genes <- c() # Vector with all genes differentially expressed
for(c in totalcontrasts){
out <- topTags(c, n ="Inf")$table
out <- out[which(out$FDR <= 0.05),]
genes <- c(genes, rownames(out))
genes <- unique(genes)
}
Finally, can I do the same for the others TCGA cohorts (TCGA-COADREAD and TCGA-KIPAN)? I think that, in the case of these two cohorts, the samples of, for example, KICH, do not have the same batch IDs as KIRC since they are separated cohorts merged into a single one. If this is the case, is there any way to perform a DEA on that data? Thanks in advance.
EDIT: I have added the MDS plots for the two batches, before and after correction of each one.
Batch1 (plates) Before correction
Batch1 (plates) After correction
Batch2 (tss) Before correction
Batch2 (tss) After correction
You should color your PCA/MDS plots by the variable (in this case batch or tss) that you want to check is influencing your data. Especially to check if your removeBatchEffect of that variable is doing something in the before-after plots, across the two first dimensions or maybe further ones.
Thanks for the advice Papyrus. I have edited the post to show the MDS plots before and after remove batch effects. However, I still can not see further improvements or a clear batch effect being removed.