Question

Modeling an interaction in DESeq2 with >1K samples

15 days ago

L.Grigoreva • 0

Dear Michael Love and the DESeq2 Community,

I have recently encountered a problem (or limitation) with DESeq2 when working with a large dataset. I wanted to model the effects of condition (mutant vs. non-mutant) and genotype. I constructed the model as follows:

~genotype + condition + genotype:condition+Replicate

As I understand, this model should capture genotype-specific effects between conditions as well as a general response (mutant vs. non-mutant), regardless of genotype. My dataset consists of ~300 genotypes, 2 conditions, and 3 replicates, which results in approximately 2K samples and 30K genes. A simpler model with one factor, coded like this:

~condition+Replicate

finishes in about 15 minutes. However, the full model with the interaction term has been running for 4 days without completing.

I've checked similar posts, and the general recommendations were to either use parallelization or switch to limma+voom for large datasets. I tried parallelization, but it did not improve the speed due to issues with the BiocParallel backend on our machine. Interestingly, for smaller test datasets, running on multiple cores actually took longer.

To summarize, my questions are: 1) Do I understand correctly that it is necessary to feed a whole expression matrix into the model

~genotype + condition + genotype:condition+Replicate

to capture genotype-specific and condition-specific differentially expressed genes? In other words, I cannot run the analysis separately for each genotype, because the dispersion estimates would differ.

2) If parallelization is not working, is switching to limma+voom the only viable solution for such a large dataset and complex design?

DeSeq2 interaction large sample set design • 430 views

ADD COMMENT • link updated 14 days ago by i.sudbery 20k • written 15 days ago by L.Grigoreva • 0

At this sample size and with many levels per covariate I would always try limma first. It will be much faster and probably give similar inference.
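For reference, a minimal limma+voom sketch of the same design could look like the following. The simulated `counts` and `coldata` objects are hypothetical stand-ins (4 genotypes instead of ~300) so that the snippet runs on its own, and it assumes every genotype has both conditions so the design matrix is full rank.

```r
library(edgeR)  # DGEList, filterByExpr, calcNormFactors
library(limma)  # voom, lmFit, eBayes, topTable

# Hypothetical toy data: 4 genotypes x 2 conditions x 3 replicates, 200 genes
set.seed(1)
coldata <- expand.grid(Replicate = factor(1:3),
                       condition = c("non_mutant", "mutant"),
                       genotype  = paste0("g", 1:4))
counts <- matrix(rnbinom(200 * nrow(coldata), mu = 100, size = 5), nrow = 200,
                 dimnames = list(paste0("gene", 1:200),
                                 paste0("s", seq_len(nrow(coldata)))))

# Same formula as in the question
design <- model.matrix(~ genotype + condition + genotype:condition + Replicate,
                       data = coldata)

# Filter low-count genes, normalize, then fit with voom weights
dge  <- DGEList(counts = counts)
keep <- filterByExpr(dge, design)
dge  <- dge[keep, , keep.lib.sizes = FALSE]
dge  <- calcNormFactors(dge)
v    <- voom(dge, design)
fit  <- eBayes(lmFit(v, design))

# Condition effect in the reference genotype
tt_cond <- topTable(fit, coef = "conditionmutant", number = Inf)

# F-test across all genotype:condition interaction coefficients
int_coefs <- grep(":", colnames(design), value = TRUE)
tt_int    <- topTable(fit, coef = int_coefs, number = Inf)
```

With the real data the design matrix would have several hundred columns, but the linear model fit in limma should still be quick compared with the per-gene GLM fits in DESeq2.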

ADD REPLY • link 15 days ago by ATpoint 86k

score 0 · Answer 1 · 2024-12-07

In their discussion of single-cell RNA-seq analysis, which also often involves a large number of samples, the DESeq2 authors recommend using glmGamPoi for the model fitting. You can use the glmGamPoi package directly, or you can pass fitType="glmGamPoi" to the DESeq function (Gamma-Poisson and negative binomial are the same distribution).
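As a rough sketch of the second option, assuming the glmGamPoi package is installed alongside DESeq2: the toy `counts` and `coldata` below are hypothetical stand-ins for the real data. As far as I recall, the DESeq2 documentation suggests pairing this fit type with the likelihood ratio test rather than the Wald test, so the sketch tests the interaction term by dropping it from the reduced formula; treat that choice as an assumption to check against the documentation.

```r
library(DESeq2)  # fitType = "glmGamPoi" also needs the glmGamPoi package installed

# Hypothetical toy data: 4 genotypes x 2 conditions x 3 replicates, 200 genes
set.seed(1)
coldata <- expand.grid(Replicate = factor(1:3),
                       condition = c("non_mutant", "mutant"),
                       genotype  = paste0("g", 1:4))
counts <- matrix(rnbinom(200 * nrow(coldata), mu = 100, size = 5), nrow = 200,
                 dimnames = list(paste0("gene", 1:200),
                                 paste0("s", seq_len(nrow(coldata)))))
rownames(coldata) <- colnames(counts)  # sample names must line up

dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData   = coldata,
  design    = ~ genotype + condition + genotype:condition + Replicate
)

# LRT comparing the full model against one without the interaction:
# a small p-value means the condition effect differs between genotypes
dds <- DESeq(dds,
             test    = "LRT",
             fitType = "glmGamPoi",
             reduced = ~ genotype + condition + Replicate)

res <- results(dds)
resultsNames(dds)  # lists the fitted coefficients
```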

Failing this, if your goal is to generate separate lists of differentially expressed genes between conditions for each genotype (i.e. one list for genotypeA and one list for genotypeB), then I would feed the count matrices in separately. Yes, the results will not be identical to those from fitting everything at once, but if you pass, say, batches of 10 genotypes to DESeq2 at a time, my guess is that the results would be pretty similar, as you'll have enough samples to get a good estimate of the dispersion. If you are worried about it, you could try splitting your 300 genotypes into groups of 10 in more than one way and check that the results are similar.
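A minimal sketch of the batching idea, in base R apart from the DESeq2 calls inside the helper. The names `make_batches` and `run_batch` are hypothetical helpers written for this illustration, and `counts` / `coldata` are assumed to be a genes x samples count matrix and a sample table with factor columns genotype, condition and Replicate.

```r
# Hypothetical helper: randomly partition genotype labels into batches
make_batches <- function(genotypes, batch_size = 10, seed = 1) {
  set.seed(seed)
  shuffled <- sample(unique(as.character(genotypes)))
  split(shuffled, ceiling(seq_along(shuffled) / batch_size))
}

# Hypothetical helper: fit the interaction model on one batch of genotypes
run_batch <- function(counts, coldata, batch) {
  keep <- coldata$genotype %in% batch
  cd   <- droplevels(coldata[keep, , drop = FALSE])  # drop unused genotype levels
  dds  <- DESeq2::DESeqDataSetFromMatrix(
    countData = counts[, keep, drop = FALSE],
    colData   = cd,
    design    = ~ genotype + condition + genotype:condition + Replicate
  )
  DESeq2::DESeq(dds)
}

# Two different random partitions of 300 genotype labels into groups of 10
genotypes <- paste0("g", 1:300)
batches_a <- make_batches(genotypes, seed = 1)
batches_b <- make_batches(genotypes, seed = 2)
length(batches_a)  # 30 batches
```

Calling something like `fits_a <- lapply(batches_a, run_batch, counts = counts, coldata = coldata)` and the same for `batches_b` would then give two sets of fits; comparing the per-genotype results for genotypes that landed in different batches is one way to do the consistency check. Use `resultsNames()` on each fit to find the coefficient names before extracting results.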