I am analyzing publicly available bulk RNA-seq data to identify differential expression between control and stress-treated samples. The metadata does not explicitly mention batch information, but my PCA plot suggests possible batch effects.
PCA plot is attached here:
I am using DESeq2 for differential expression analysis, and I plan to use the sva package for batch effect correction.
My current code:
counts <- read.delim("counts.csv", header = TRUE, row.names =1, sep = ",", check.names=FALSE)
colData <- read.delim("colData.csv", header= TRUE,sep = ",")
dds <- DESeqDataSetFromMatrix(countData = counts, colData = colData, design = ~ treatment)
vsd <- varianceStabilizingTransformation(dds, blind=T)
dds <- DESeq(dds)
res <- results(dds)
res_ordered <- res[order(res$padj),]
de_genes <- rownames(res_ordered)[which(res_ordered$padj <= 0.01 & abs(res_ordered$log2FoldChange) >= 1)]
de_expression <- assay(vsd)[de_genes, ]
upregulated_genes <- rownames(res_ordered)[res_ordered$log2FoldChange >= 1 & res_ordered$padj <= 0.01]
downregulated_genes <- rownames(res_ordered)[res_ordered$log2FoldChange <= -1 & res_ordered$padj <= 0.01]
My specific question is:
How do I correctly use sva to detect and adjust for hidden batch effects given that I do not have explicit batch information?
Are there any additional steps or considerations I should be aware of when using sva? For example, how can I validate that the surrogate variables identified by sva are effectively correcting for batch effects? Is it necessary to run ComBat in conjunction with sva, or is sva sufficient in this scenario?
Thank you very much in advance!