Hello all,
I have barcode count data corresponding to the viability of 25+ pooled bacterial strains under various conditions. The marginal distribution of untreated strain counts appears to be Negative Binomial.
I'm trying to use DESeq2 to analyze these data, using a matrix of strains ("genes") as rows and conditions as columns. Since the variation of counts between most conditions for most strains is very large, but between replicates is relatively small, it seems sensible to estimate dispersions (in this case) on a gene- and condition-wise basis.
The language in the DESeq2 vignettes and pre-print seems to suggest the dispersion estimates are "gene-wise". So if you run DESeq() followed by plotDispEsts(), does each point correspond to the variance estimate of a gene (in my case, strain) across conditions, or to the variance estimate between replicates of a gene under one condition?
I think the conceptual difference I'm talking about is the same as that between blind=TRUE and blind=FALSE in the rlog() and varianceStabilizingTransformation() functions.
Finally, if DESeq2 does estimate dispersions on a solely gene-wise basis, would it be reasonable for me to estimate the dispersions of my data subsetted by each condition in turn, and then feed those results into my whole DESeqDataSet object using dispersions()?
Many thanks for taking the time to read, and for any suggestions you might have.
Eachan
Indeed, one should not split things by group before estimating dispersions. There is still a power increase with 25 rows when using DESeq2 (or similar) versus a straight GLM, though it's fairly small.
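To make the "don't split by group" point concrete, here is a minimal sketch on simulated data (the strain names, condition labels, and simulation settings are all made up for illustration, and fitType = "mean" is just a choice that seems sensible with only 25 rows). With the condition in the design formula, DESeq2 reports one dispersion per row, estimated from the scatter of replicates around the fitted condition means rather than from the spread between conditions:

```r
library(DESeq2)

set.seed(1)
n_strains <- 25
lv <- c("untreated", "drugA", "drugB", "drugC")   # hypothetical conditions
condition <- factor(rep(lv, each = 2), levels = lv)

# Toy counts: strain means differ a lot between conditions, but every
# strain has the same true NB dispersion of 0.1 (size = 10).
base_mean <- round(runif(n_strains, 200, 2000))
fold <- matrix(2^rnorm(n_strains * length(lv), sd = 1.5), nrow = n_strains)
fold[, 1] <- 1                                    # untreated is the reference
mu <- base_mean * fold[, as.integer(condition)]   # strains x samples
counts <- matrix(rnbinom(length(mu), mu = mu, size = 10),
                 nrow = n_strains,
                 dimnames = list(paste0("strain", seq_len(n_strains)),
                                 paste0("s", seq_along(condition))))

coldata <- data.frame(condition = condition, row.names = colnames(counts))
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                              design = ~ condition)

# Size factors, dispersions and Wald tests in one call.
dds <- DESeq(dds, fitType = "mean")

# One gene-wise estimate and one final (shrunken) value per strain.
disp <- cbind(genewise = mcols(dds)$dispGeneEst,
              final    = dispersions(dds))
head(disp)

plotDispEsts(dds)   # each point is one strain
```

In this sketch the large between-condition fold changes are absorbed by the condition coefficients, so they should not inflate the per-strain dispersion estimates.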
Thanks, Devon, for your advice. To add to the reason why I'm using DESeq2, I'm also trying to avoid reinventing the wheel.
That's usually a pretty compelling reason :)
BTW, you might instead consider limma. I'm not entirely sure how well DESeq2 scales to the number of samples you have (limma has historically had better luck there, given how it works).
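If you want to try it, a rough sketch of one common limma route for counts is voom followed by lmFit and eBayes. I'm assuming voom here, and the toy data and condition labels are invented for the example, so treat it as a starting point rather than a recipe:

```r
library(limma)

set.seed(1)
n_strains <- 25
condition <- factor(rep(c("untreated", "drugA"), each = 3),
                    levels = c("untreated", "drugA"))

# Hypothetical toy counts: 25 strains x 6 samples, no true differences.
counts <- matrix(rnbinom(n_strains * length(condition), mu = 500, size = 10),
                 nrow = n_strains,
                 dimnames = list(paste0("strain", seq_len(n_strains)),
                                 paste0("s", seq_along(condition))))

design <- model.matrix(~ condition)

# voom converts counts to log2-CPM and attaches precision weights
# derived from the mean-variance trend.
v <- voom(counts, design)

# Weighted linear model per strain, then empirical Bayes moderation.
fit <- eBayes(lmFit(v, design))

res <- topTable(fit, coef = "conditiondrugA", number = Inf)
head(res)
```

With tens of thousands of conditions the design matrix gets very large, so it may be worth checking memory use on a subset of conditions first.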
DESeq2 is slower than in the transpose situation, but not unmanageably so - it takes about a day on our server, which I can live with once I'm sure the output is meaningful for my application. I will be sure to investigate limma as well. Thanks, again!
Thanks for your helpful input, Vivek. I'll re-read the DESeq2 paper.
To answer your question, the number of columns is the issue more than the number of rows. The experiment I'm working with has 50,000 conditions in duplicate, and I'm interested in differential outgrowth of the 25 strains.
@Eachan wow this sounds interesting..
Out of curiosity, what experimental design or protocol allows you to test so many conditions? And yes, sounds interesting!