I'm trying to perform differential gene expression analyses using DESeq, on a pre-existing dataset, where 8 cell types (A to H) were identified from MARS-Seq on whole organism. Gene expression counts are given for
A x 9 replicates
B x 1
C x 1
D x 3
E x 1
F x 1
G x 4
H x 1
In order to find genes differentially expressed in A, I contrast A against "notA" (replacing B to H with "notA")
dds<-DESeqDataSetFromMatrix(countData = cts, colData = coldata,design=~condition)
dds$condition<-factor(dds$condition,levels=c("A","notA"))
dds<-DESeq(dds)
alpha<-0.1
res<-results(dds,contrast=c("condition","A","notA"),alpha=alpha)
reslfc<-lfcShrink(dds,contrast=c("condition","A","notA"),res=res, alpha=alpha)
and so on for the other 7 cell types.
However, I found that I get much more DEGs for A D & G, i.e. samples with more reps. In fact, I have no DEGs at all for B, C E, F and H.
Anyway I can overcome this? Or should I not be performing the analysis this way?
Statistical power is (among other factors) a function of the replication number. Please use google and the search function for DESeq2-related questions on replication, replicate numbers and experiments with no replicates. You'll literally find dozens of posts, blogs, bioconductor questions and responses from (among others) Mike Love (the first author and maintainer of DESeq2).
Thanks ATpoint. Besides treating it as an exploratory analysis / not drawing firm conclusions from the dataset, I'm wondering how exactly to augment the test / result, by accepting the assumption that variance-mean estimates from replicated samples can be applied to unreplicated ones?
I think (but I am really not sure) that DESeq2 uses the dispersion estimates between the replicated group and projects them to the unreplicated one, but again, this is only something I faintly remember from a bioconductor post so better double check this.