I have RNASeq data that I am analysing using DESeq2. I have 2 genotypes (WT and mutant). The RNASeq was run in 3 separate batches (Batch1, 2 and 3; all batches contained some WT and mutant samples). I ran DESeq using design=~genotype + batch, re-leveled to genotype=WT. I then tried including surrogate variables based on the batch using SVA as follows below. I get a different number of DE genes in each analysis.
When I run a PCA and sample distances, the samples cluster more by batch than by genotype (particularly the 3rd batch where I know the quality of the RNA was poor).
I'm not sure which results I should be actually using. It seems that using SVA is a better way to exclude technical variation. Am I correct? When applying SVA, should the batch be included in the original design, or just the genotype?
I appreciate any help! Thanks in advance!
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ genotype, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)
# output:
# Number of significant surrogate variables is: 2
ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + genotype
ddssva <- DESeq(ddssva)
ressva <- results(ddssva,contrast=c("genotype","HET","WT"))
Thank you for your reply.
As I understand, you are suggesting I use
~ genotype,svseq
instead of genotype, colData in the model matrix as I did above?The different number is something like 40 to 400. The numbers aren't huge to start, but it still is a 10 fold difference. Is that unreasonable?
Many thanks for your help
best wishes
Chris
Hi Vivek,
Sorry to bother again, but when I try to do the above it won't let me add svseq to the design.
error:
My colData contains the information on sample, genotype and batch:
Yes, since I assumed a very simple colData design for demo. I updated my answer now. But you don't need to put "sample" as a column in there.. You wouldn't need even the "Run" column if you are estimating batches in unsupervised way.
Thank you Vivek. Very helpful.