I want to conduct a cell type enrichment analysis (Cellular deconvolution) to uncover the cellular differences between sick lung cancer samples and healthy samples. The problem is, I have a column about smoking. When I don't try to correct it as a batch effect, the deconvolution does not work and gives me zero results. The p values are all non-significant.
When I correct batch effect (using limma's correctbatcheffect
or ComBat
) the deconvolution works and gives good p.values, but it gives negative RNA-seq expression values for some genes, which doesn't make any sense, and it also wont work for the DESeq2 analysis I wanna conduct later on.
I tried to correct the batch effects on the counts, and on the normalized TPM, in both cases there were negative values.
Code:
library(data.table)
library(dplyr)
library(xCell)
library(DESeq2)
library(limma)
library(DESeq2)
library(apeglm)
library(goseq)
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(EnsDb.Hsapiens.v86)
library(AnnotationHub)
library(AnnotationDbi)
library(EnhancedVolcano)
library(devtools)
library(clusterProfiler)
library(readxl)
library(annotate)
set.seed(42)
## Article - E-GEOD-81089 - non small cell lung cancer
# RNA seq
counts1 <- fread('E-GEOD-81089-raw-counts.tsv', data.table = F)
A = duplicated(counts1$`Gene ID`) #check for duplications
counts1 = counts1[!A,]
rownames(counts1) <- counts1$`Gene ID` # using Ensemble IDs for now because we will need it for the normalization step
counts1 <- counts1[,-c(1,2)]
# Metadata
metadata1 <- fread('E-GEOD-81089-experiment-design.tsv', data.table = F)
rownames(metadata1) <- metadata1$Run
metadata1 <- metadata1[,-1]
colnames(metadata1)[c(3,13,15)] <- c('Smoking','Tissue','Sex')
### Normalization
exons <- exonsBy(EnsDb.Hsapiens.v86, by="gene")
exons <- reduce(exons)
len <- sum(width(exons))
A = intersect(rownames(counts1),names(len))
geneLengths1 = len[A]
counts1 = counts1[A,]
# Normalization: TPM
rpkm <- apply(X = subset(counts1),
MARGIN = 2,
FUN = function(x) {
10^9 * x / geneLengths1 / sum(as.numeric(x))
})
TPM1 <- apply(rpkm, 2, function(x) x / sum(as.numeric(x)) * 10^6) %>% as.data.frame()
# Get back to HUGO Symbols - They are more comfortable to use and xCell accepts only gene symbols.
ensembl.genes <- as.vector(rownames(TPM1))
geneIDs <- ensembldb::select(EnsDb.Hsapiens.v86, keys= ensembl.genes, keytype = "GENEID", columns = c("SYMBOL","GENEID"))
A = duplicated(geneIDs$GENEID) # check for duplications
geneIDs = geneIDs[!A,]
A = duplicated(geneIDs$SYMBOL)
geneIDs = geneIDs[!A,]
rownames(geneIDs) <- geneIDs$GENEID
ind <- intersect(rownames(geneIDs), rownames(TPM1))
TPM1 <- TPM1[ind,]
rownames(TPM1) <- geneIDs$SYMBOL
# Correct batch effect
TPM1 = limma::removeBatchEffect(x = TPM1, batch = metadata1$Smoking)
# Run deconvolution
scores1 <- xCellAnalysis(TPM1, signatures = NULL, genes = NULL, spill
= NULL, rnaseq = TRUE, file.name = NULL, scale = TRUE, alpha =
0.5, save.raw = FALSE, parallel.sz = 4, parallel.type = "SOCK",
cell.types.use = cells_2use)
I don't know what to do, how should I fix this problem?
If explicit correction is strictly necessary for RNA-seq data one can try ComBat-Seq. It operates on raw counts preserving the integer nature and keeping zeros as zeros and avoiding negative values. You could then use these counts directly for DESeq2 even though the DESeq2 developer recommends the inclusion if batch into the design.
Is it normal that after runnung ComBat-seq, the dataframe is now 559.2 GB ? It's HUGE and was not even near this number before. Even if I want just to view it, it can't
This is the code I wrote, the smoking is the batch and the important feature is the tissue:
No, that's not normal. It should basically be the same size as the input, as it's only some integer conversion.