I am performing a meta-analysis and used ComBat-Seq for batch adjustment. The PCA plot indicated good data homogenization. I performed DESeq2, including tumor purity as a covariate. However, every time I try to visualize the top 10 DEGs, I notice that one sample consistently appears as an outlier. I attempted to exclude this sample from the analysis, but when I re-run DESeq2 and visualize the results, another sample becomes an outlier. Essentially, regardless of what I do, one sample always shows the highest intensity in visualized the dds.norm data.
Do you know why this is happening. I didn't realized the same effect without including tumor.purity as covariance (maybe it still exist but not so obvious).
Is it expected to be happened? Is there anything I can do?
Barplot of top 2 DEGs
I am comparing R/NR and I am expecting to see difference between two group, In this case it seems that whole statistics is driven by one outlier.
Code I used:
dds0 <- DESeqDataSetFromMatrix(countData = count.table, colData = Pheno, design = ~Tumor.purity+ Pathological.response)
dds.norm <- estimateSizeFactors(dds0)
sizeFactors(dds.norm)
## Performing estimation of dispersion parameter
dds.disp <- estimateDispersions(dds.norm)
alpha <- 0.0001
wald.test <- nbinomWaldTest(dds.disp)
res.DESeq2 <- results(wald.test, alpha=alpha, pAdjustMethod="BH")
#############################################################################################
png("Normalized counts plot.4.png", width=50, height=15, units="in", res=600)
par(mfrow = c(3, 3)) # 2 reda, 5 kolona
par(mar = c(4, 4, 8, 2))
gn.most.sign <- list()
for (i in 1:9) {
gn.most.sign[[i]] <- rownames(res.DESeq2.pvalue.05.lfc.1)[i]
gn.most.diff.val <- counts(dds.norm, normalized=T)[gn.most.sign[[i]],]
barplot(gn.most.diff.val, col=Pheno$Color2, main=gn.most.sign[[i]], las=2, cex.names=0.5,)
}
dev.off()
Dear i.sudbery,
Thank you for your response and valuable input! I had some additional thoughts. As I am using the ESTIMATE algorithm to calculate tumor purity, do you think it's a significant assumption to consider the algorithm 100% accurate, and should it be included as a covariate in the DE analysis? I’m not entirely sure if I should include it or not. On one hand, it could help account for heterogeneity caused by differences in tumor purity, but on the other hand, it might introduce additional bias.
Thank you in advance, Aleksandra