Dear all,
I received RNA-seq gene expression data that show batch effects from several technical confounders. I am performing a differential expression analysis with DESeq2 and have tried adding these effects as terms in my design (~batch1+batch2+condition), but some of them are linear combinations of the others, so the model matrix is not full rank.
It has been suggested that I use tools such as ComBat or SVA, but I am aware that the transformed values are no longer integer counts, and I wonder whether using these values will affect the DESeq2 results.
What is the correct way to remove batch effects in this case?
Thank you very much!
I have GC content, date of sequencing, and primer index. Unfortunately, on some days only a few (2-3) samples were sequenced.
It appears to me that GC content is a gene-level confounding variable, while date of sequencing and primer index are sample-level confounders. Therefore it is not clear to me how you would account for GC content at the sample level (as the DESeq2 design ~batch1+batch2+condition would indicate). However, DESeq2 allows you to control for gene-level confounders when estimating the size factors. Please see the DESeq2 documentation for details on how to do that.
Sorry if I answered too quickly without thinking. The main problem is that I have two conditions (cases and controls), and for about 80% of the samples only cases or only controls were sequenced on a given day (and primer index makes things worse). I know that this is a really poor design, but that's the way it is. This yields a model matrix that is not full rank when I introduce the batches in the design.
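To illustrate why this kind of design loses rank, here is a minimal sketch in plain Python. It is not DESeq2 code: the sample lists, the day labels, and the helper names `design_matrix` and `matrix_rank` are all hypothetical, and the confounding is made complete (every day has only cases or only controls) to keep the toy example small. It builds a treatment-coded matrix for a design like ~day+condition and counts how many columns are linearly independent.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # this column depends on the columns already processed
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(samples):
    """Columns: intercept, one dummy per non-reference day, case indicator."""
    days = sorted({day for day, _ in samples})
    return [
        [1] + [int(day == d) for d in days[1:]] + [int(cond == "case")]
        for day, cond in samples
    ]


# Hypothetical samples: each sequencing day contains only cases or only controls.
confounded = [
    ("d1", "case"), ("d1", "case"),
    ("d2", "control"), ("d2", "control"),
    ("d3", "case"), ("d3", "case"),
    ("d4", "control"), ("d4", "control"),
]

# Hypothetical samples: each sequencing day contains both a case and a control.
balanced = [
    ("d1", "case"), ("d1", "control"),
    ("d2", "case"), ("d2", "control"),
    ("d3", "case"), ("d3", "control"),
    ("d4", "case"), ("d4", "control"),
]

for name, samples in [("confounded", confounded), ("balanced", balanced)]:
    X = design_matrix(samples)
    print(name, "columns:", len(X[0]), "rank:", matrix_rank(X))
```

In the confounded layout the case indicator equals the sum of the indicators for the days on which cases were run, so the matrix has five columns but rank four, and the condition effect cannot be separated from the day effect. In the balanced layout the same five columns are independent.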