Dear Michael Love and the DESeq2 Community,
I have recently encountered a problem (or limitation) with DESeq2 when working with a large dataset. I wanted to model the effects of condition (mutant vs. non-mutant) and genotype. I constructed the model as follows:
~genotype + condition + genotype:condition+Replicate
As I understand, this model should capture genotype-specific effects between conditions as well as a general response (mutant vs. non-mutant), regardless of genotype. My dataset consists of ~300 genotypes, 2 conditions, and 3 replicates, which results in approximately 2K samples and 30K genes. A simpler model with one factor, coded like this:
~condition+Replicate
finishes in about 15 minutes. However, the full model with the interaction term has been running for 4 days without completing.
I've checked similar posts, and the general recommendations were to either use parallelization or switch to limma+voom for large datasets. I tried parallelization, but it did not improve the speed due to issues with the BiocParallel backend on our machine. Interestingly, for smaller test datasets, running on multiple cores actually took longer.
To summarize, my questions are as follows:
Do I understand correctly that it is necessary to feed the entire expression matrix into the model
To summarize, my questions are: 1) Do I understand correctly that it is necessary to feed a whole expression matrix into the model
~genotype + condition + genotype:condition+Replicate
to capture genotype-specific and condition-specific differentially expressed genes? In other words, I cannot run the analysis separately for each genotype, because the dispersion estimates would differ.
2) If parallelization is not working, is switching to limma+voom the only viable solution for such a large dataset and complex design?
At this sample size and with many levels per covariate I would always try limma first. It will be much faster and probably give similar inference.