Greetings!
I’m doing differential expression analysis with DESeq2 and would appreciate advice on batch effects. I have one experimental factor with four levels (“condition”: A, B, C, D). In the PCA plot, samples separate by condition along PC1 (~34% of variance). There is a batch effect (two tissue sampling dates), but it only separates samples along PC2; no separation by batch was observed along PC1. I was therefore planning to run the DE analysis with batch as a covariate in the model (~batch + condition), and then use batch-corrected variance-stabilised counts from limma’s removeBatchEffect() for downstream work such as heatmaps and gene expression boxplots, as documented in the DESeq2 vignette.
# Extract the variance-stabilised matrix, then remove the batch effect from it
mat <- assay(vsd)
mat <- limma::removeBatchEffect(mat, batch = vsd$batch)
However, my problem is that the batches are not evenly distributed among the groups. I realise this is not optimal (group-batch assignments below), but it is the data I have been given. The design is possibly not completely confounded, but condition D is not great, since all of its samples come from batch 1. I would rather not discard data if possible. So my question is: given the unbalanced design, is it valid to perform the DE analysis and generate the batch-corrected counts as I’ve described? Or, since the batch effect is along PC2 rather than PC1, is it less risky to skip batch correction than to batch correct with an unbalanced design (I'm thinking probably not)?
Any advice would be much appreciated, thanks.
condition  batch1  batch2
A          3       2
B          1       4
C          1       4
D          5       0
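One way to check whether this layout is completely confounded is to build the design matrix for ~batch + condition and test whether it is full rank. The following is only a minimal sketch in base R: the sample counts are copied from the table above, while the variable names (condition, batch, design, full_rank) and the batch labels are made up for illustration.

```r
# Sample layout copied from the table in the question (5 replicates per condition)
condition <- factor(rep(c("A", "B", "C", "D"), each = 5))
batch <- factor(c(
  rep(c("batch1", "batch2"), times = c(3, 2)),  # condition A
  rep(c("batch1", "batch2"), times = c(1, 4)),  # condition B
  rep(c("batch1", "batch2"), times = c(1, 4)),  # condition C
  rep("batch1", 5)                              # condition D: batch 1 only
))

# Cross-tabulate to confirm the layout matches the table
print(table(condition, batch))

# Design matrix for the additive model ~ batch + condition
design <- model.matrix(~ batch + condition)

# The batch and condition effects are separately estimable only if the
# design matrix has full column rank
full_rank <- qr(design)$rank == ncol(design)
print(full_rank)
```

If full_rank is TRUE, the additive model can be fitted. The batch effect is then estimated from conditions A, B and C only, and the comparisons involving D rely on the assumption that the same batch effect would have applied to D.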
Assuming you have at least three replicates per condition, you can only compare B vs C, since A and D are confounded with batches 3 and 5 respectively.
Wait, I am looking at that table at the bottom of the question... Are those the replicates per batch 1 and batch 2? If that is the case, then perhaps there is no complete confounding.
You should relay back to them that an unbalanced study like this is not good.
Yes, but what percentage of the variation is explained by PC2? In any case, the PCA implies that the primary source of variation (PC1) is not batch-related.
Hi Kevin and Andres, thanks for your replies. In the table, there are 5 replicates per condition in total, so for condition A, 3 samples are from batch 1 and 2 samples are from batch 2. PC2 explained 9% of the variation.
...and did you conduct the PCA before or after removeBatchEffect()?
PC2 = 9% before removeBatchEffect().
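To compare the PCA before and after correction, something like the sketch below could be used. It is an illustration only: the matrix is simulated as a stand-in for assay(vsd), the effect sizes are arbitrary, and pct_var is a hypothetical helper name. It assumes the limma package is installed. Passing the condition design to removeBatchEffect(), as the DESeq2 vignette does, asks limma to preserve the condition differences while removing the batch term, which matters when batches are unbalanced across conditions.

```r
library(limma)  # assumes the Bioconductor package limma is installed

set.seed(1)

# Sample layout from the table in the question
condition <- factor(rep(c("A", "B", "C", "D"), each = 5))
batch <- factor(c(
  rep(c("batch1", "batch2"), times = c(3, 2)),  # condition A
  rep(c("batch1", "batch2"), times = c(1, 4)),  # condition B
  rep(c("batch1", "batch2"), times = c(1, 4)),  # condition C
  rep("batch1", 5)                              # condition D
))

# Simulated stand-in for assay(vsd): 200 genes x 20 samples (not real data)
mat <- matrix(rnorm(200 * 20), nrow = 200)
mat[, batch == "batch2"] <- mat[, batch == "batch2"] + 1.5      # additive batch shift
mat[1:50, condition == "D"] <- mat[1:50, condition == "D"] + 3  # condition signal

# Hypothetical helper: percent of variance explained by each principal component
pct_var <- function(m) {
  p <- prcomp(t(m))
  round(100 * p$sdev^2 / sum(p$sdev^2), 1)
}

# Remove the batch term while protecting the condition effect via the design
mm <- model.matrix(~ condition)
corrected <- removeBatchEffect(mat, batch = batch, design = mm)

# Variance explained by PC1 and PC2, before and after correction
print(pct_var(mat)[1:2])
print(pct_var(corrected)[1:2])
```

On real data, mat would be assay(vsd) and condition and batch would come from colData(vsd). The corrected matrix is intended for visualisation only; the DE test itself would still use the raw counts with ~batch + condition in the design.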