Question

Cellular deconvolution does not work without batch effect correction

0

Entering edit mode

2.7 years ago

JACKY ▴ 160

I want to conduct a cell type enrichment analysis (Cellular deconvolution) to uncover the cellular differences between sick lung cancer samples and healthy samples. The problem is, I have a column about smoking. When I don't try to correct it as a batch effect, the deconvolution does not work and gives me zero results. The p values are all non-significant.

When I correct batch effect (using limma's correctbatcheffect or ComBat) the deconvolution works and gives good p.values, but it gives negative RNA-seq expression values for some genes, which doesn't make any sense, and it also wont work for the DESeq2 analysis I wanna conduct later on.

I tried to correct the batch effects on the counts, and on the normalized TPM, in both cases there were negative values.

Code:

library(data.table)
library(dplyr)
library(xCell)
library(DESeq2)
library(limma)
library(DESeq2)
library(apeglm)
library(goseq)
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(EnsDb.Hsapiens.v86)
library(AnnotationHub)
library(AnnotationDbi)
library(EnhancedVolcano)
library(devtools)
library(clusterProfiler)
library(readxl)
library(annotate)

set.seed(42)

## Article - E-GEOD-81089 - non small cell lung cancer

# RNA seq

counts1 <- fread('E-GEOD-81089-raw-counts.tsv', data.table = F)

A = duplicated(counts1$`Gene ID`) #check for duplications
counts1 = counts1[!A,]

rownames(counts1) <- counts1$`Gene ID` # using Ensemble IDs for now because we will need it for the normalization step

counts1 <- counts1[,-c(1,2)]

# Metadata
metadata1 <- fread('E-GEOD-81089-experiment-design.tsv', data.table = F)
rownames(metadata1) <- metadata1$Run
metadata1 <- metadata1[,-1]
colnames(metadata1)[c(3,13,15)] <- c('Smoking','Tissue','Sex')


### Normalization

exons <- exonsBy(EnsDb.Hsapiens.v86, by="gene")
exons <- reduce(exons)
len <- sum(width(exons))
A = intersect(rownames(counts1),names(len))
geneLengths1 =  len[A]

counts1 = counts1[A,]

# Normalization: TPM  
rpkm <- apply(X = subset(counts1),
               MARGIN = 2,
               FUN = function(x) {
                 10^9 * x / geneLengths1 / sum(as.numeric(x))
               })

TPM1 <- apply(rpkm, 2, function(x) x / sum(as.numeric(x)) * 10^6) %>% as.data.frame()


# Get back to HUGO Symbols - They are more comfortable to use and xCell accepts only gene symbols.

ensembl.genes <- as.vector(rownames(TPM1))
geneIDs <- ensembldb::select(EnsDb.Hsapiens.v86, keys= ensembl.genes, keytype = "GENEID", columns = c("SYMBOL","GENEID"))
A = duplicated(geneIDs$GENEID) # check for duplications
geneIDs = geneIDs[!A,]

A = duplicated(geneIDs$SYMBOL)
geneIDs = geneIDs[!A,]

rownames(geneIDs) <- geneIDs$GENEID

ind <- intersect(rownames(geneIDs), rownames(TPM1))
TPM1 <- TPM1[ind,]
rownames(TPM1) <- geneIDs$SYMBOL


# Correct batch effect
TPM1 = limma::removeBatchEffect(x = TPM1, batch = metadata1$Smoking)


# Run deconvolution 
scores1 <- xCellAnalysis(TPM1, signatures = NULL, genes = NULL, spill
                        = NULL, rnaseq = TRUE, file.name = NULL, scale = TRUE, alpha =
                          0.5, save.raw = FALSE, parallel.sz = 4, parallel.type = "SOCK",
                        cell.types.use = cells_2use)

I don't know what to do, how should I fix this problem?

deconvolution batch-effect r • 1.3k views

ADD COMMENT • link updated 3 months ago by Ram 44k • written 2.7 years ago by JACKY ▴ 160

score 0 · Answer 1 · 2022-03-25

0

Entering edit mode

2.7 years ago

i.sudbery 20k

Because of the way removeBatchEffects works, it is likely to leave some negative: you we think that batch 1 is on average 10 higher than batch2, but one sample only has a count of 5, then the corrected value is going to be -5. There isn't really any way to avoid that with batch regression based on linear models. If you were working in log space, then at least logTPMs < 0 make some sort of sense.

If xCellAnalysis requires TPMs, rather than log TPMs, you could take the log, removeBatch effects, and then unlog.

DESeq always requires raw, unnormalised, unbatch corrected counts. Not only will it not work with batch corrected data (using removeBatchEffects), it won't work (or work properly) with TPMs either.

To remove batch effects in a DESeq DE analysis (of an edgeR or limma::voom DE analysis), instead of using removeBatchEffect we include the batch covariate in the design formula for the DE model.

ADD COMMENT • link 2.7 years ago by i.sudbery 20k

1

Entering edit mode

If explicit correction is strictly necessary for RNA-seq data one can try ComBat-Seq. It operates on raw counts preserving the integer nature and keeping zeros as zeros and avoiding negative values. You could then use these counts directly for DESeq2 even though the DESeq2 developer recommends the inclusion if batch into the design.

ADD REPLY • link 2.7 years ago by ATpoint 85k

0

Entering edit mode

Is it normal that after runnung ComBat-seq, the dataframe is now 559.2 GB ? It's HUGE and was not even near this number before. Even if I want just to view it, it can't

Error in View : memory exhausted (limit reached?)

This is the code I wrote, the smoking is the batch and the important feature is the tissue:

counts1 <- sva::ComBat_seq(counts1, metadata1$Smoking, group = metadata1$Tissue)

ADD REPLY • link 2.7 years ago by JACKY ▴ 160

0

Entering edit mode

No, that's not normal. It should basically be the same size as the input, as it's only some integer conversion.

ADD REPLY • link 2.7 years ago by ATpoint 85k