I have 9 samples for bulk RNA-seq analysis with DESeq2. I have 2 conditions and 2 batches. Below is the sample table.
My design is based on both batch and condition and the PCA plot shows that C1,C2, and C3 samples are separated. below is the code and the PCA plot.
dds <- DESeqDataSetFromTximport(txi.rsem, sampleTable, ~Batch+condition)
dds <- DESeq(dds)
vsd <- vst(dds, blind=FALSE)
plotPCA(vsd, intgroup=c("condition", "Batch"))+ geom_text(aes(label=name),vjust=2)
I know that it is recommended not to remove the batch effect but I was trying to check and see what would be the results after removing the batch effect. I again see that the control samples C1,C2,C3 are not clustered with the other control C4 from another batch. Below is the code and PCA plot after removing the batch effect.
mat <- assay(vsd)
mat2 <- limma::removeBatchEffect(mat, vsd$Batch)
assay(vsd) <- mat2
plotPCA(vsd, intgroup=c("condition", "Batch")) + geom_text(aes(label=name),vjust=2)
My question is does it make sense to go further and do the DEG analysis? or my samples are not clustered/separated rationally and the results would not make any sense? I think I cannot remove C4 because that is the only connection between the samples of different batches (if that makes sense!). I was hoping that C1,C2, and C3 would cluster together with C4 and then the comparison between the Control and Treatment of different batches would be accurate.
Please search the site for "batch effect" and read posts. Your batches are unevenly distributed, with all treatment samples found in the same batch. If you were to adjust for batch effect, DESeq2 will not be able to differentiate changes owing to treatment and changes owing to batch effect, even with the lone control in B2. You should definitely use batch as a covariate, but results probably will not have a lot of statistical power.
Thank you for your response. I am reading other posts. based on your comment, do you suggest removing the batch effect step or just adjusting results based on batch effect and including batch as a covariate? (I assume the second one?)
"Removing" batch effect would be a lot more damaging that accounting for it as a covariate, especially in this case where your dataset might end up losing most of the significant differences even before DE analysis begins. Covariate assumption is the lesser of two evils here, but the true problem is experiment design.
Will DESeq allow batch to be a covariate if it's totally confounded in one condition? The second example seems applicable
http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#model-matrix-not-full-rank
Good catch - DESeq cannot account for the batch effect here. Looks like OP is going to have to operate without accounting for batch effects.