Hi Folks.
I am conducting a differential gene expression analysis using RNA-seq. My experimental design is blocked and repeated, so I need to fit mixed effects models and cannot use standard DGE packages such as DESeq, edgeR, etc. This is not a problem when the count data can be modeled with a negative binomial (or Poisson, etc.) distribution; however, for many of the genes, the counts are highly zero-inflated or essentially binary. For example, for many of the genes, there are 0 counts for one parent and >5 counts for the other parent. Please advise on the best way to analyze genes that behave this way.
Thanks, John
Thanks Devon. This comment has come up in many of the posts that I have read.
For me, when an experiment is designed with blocking and replication within the individual, the individual and the experimental block must be analyzed as random effects. This is a pretty standard quantitative genetics design. Furthermore, we have a ton of replication within the experimental factors we are comparing, so I am not convinced that shrinkage is a particularly good method for estimating within-group variances.
Anyway, even if I did use fixed effects, I am still unsure about the best way to analyze these highly zero-inflated and binary gene expression phenotypes. Thanks again.
Certainly, if you were to compare a straight GLM and a GLMM on your dataset, the GLMM would work better. But of course a GLMM is just doing shrinkage in a different way than DESeq2 et al., which aren't straight GLMs.
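To illustrate the shrinkage point in the simplest setting: for a balanced Gaussian random-intercept model with known variance components, the mixed-model estimate of each group mean is the raw group mean pulled toward the grand mean by the factor tau2 / (tau2 + sigma2 / n). The sketch below is only that textbook case. The function name, the block labels, and the variance values are made up for illustration, and it is neither a GLMM fit nor what DESeq2 does internally.

```python
from statistics import mean

def shrunken_group_means(groups, tau2, sigma2):
    """Partial-pooling estimates of group means (hypothetical helper).

    Assumes a balanced random-intercept model y_ij = mu + b_i + e_ij with
    known between-group variance tau2 and residual variance sigma2.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    shrunken = {}
    for name, values in groups.items():
        # Weight is near 1 with many replicates (little shrinkage)
        # and near 0 when the group mean is noisy (strong shrinkage).
        weight = tau2 / (tau2 + sigma2 / len(values))
        shrunken[name] = grand_mean + weight * (mean(values) - grand_mean)
    return shrunken

# Made-up data: two blocks with raw means 2.0 and 6.0, grand mean 4.0.
groups = {"block_a": [1.0, 3.0], "block_b": [5.0, 7.0]}
result = shrunken_group_means(groups, tau2=1.0, sigma2=2.0)
print(result)  # {'block_a': 3.0, 'block_b': 5.0}
```

With many replicates per group the weight approaches 1, so the estimates barely move. That is consistent with the point above that heavy replication reduces how much shrinkage matters.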
Regarding the zeroes, it depends a bit on exactly what you mean by zero-inflated and where the problem is. If you have absolutely 0 expression in all but one sample, then that can be problematic. I suppose how to deal with that depends on whether you find those cases biologically interesting. For most people they wouldn't be, but I can think of counterexamples (e.g., single-cell sequencing).
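One way to pin down "where the problem is" is to tabulate, gene by gene, where the zeros fall before choosing a model. The sketch below is a rough Python illustration, not an established method. The function names, the category labels, and the `min_on_count=5` cutoff (borrowed from the ">5 counts" example in the question) are all assumptions, and the counts are invented.

```python
from collections import defaultdict

def zero_fraction_by_group(counts, groups):
    """Return {group: fraction of samples with a zero count} for one gene."""
    by_group = defaultdict(list)
    for count, group in zip(counts, groups):
        by_group[group].append(count)
    return {g: sum(c == 0 for c in vals) / len(vals)
            for g, vals in by_group.items()}

def describe_zero_pattern(counts, groups, min_on_count=5):
    """Label one gene's zero pattern (labels are illustrative only)."""
    nonzero_samples = sum(c > 0 for c in counts)
    if nonzero_samples == 0:
        return "all_zero"
    if nonzero_samples == 1:
        # Expression in a single sample: the problematic case above.
        return "single_sample"
    fractions = zero_fraction_by_group(counts, groups)
    all_zero_groups = [g for g, f in fractions.items() if f == 1.0]
    # A group counts as "on" if every sample exceeds the cutoff.
    on_groups = [
        g for g in fractions
        if all(c > min_on_count for c, gg in zip(counts, groups) if gg == g)
    ]
    if all_zero_groups and on_groups:
        # Presence/absence between groups, e.g. one parent vs. the other.
        return "on_off"
    if max(fractions.values()) > 0:
        return "scattered_zeros"
    return "no_zeros"

# Made-up counts for three samples per parent.
groups = ["P1", "P1", "P1", "P2", "P2", "P2"]
print(describe_zero_pattern([0, 0, 0, 8, 12, 9], groups))   # on_off
print(describe_zero_pattern([0, 0, 0, 0, 0, 7], groups))    # single_sample
print(describe_zero_pattern([0, 3, 10, 0, 6, 2], groups))   # scattered_zeros
```

Genes labeled `on_off` are the presence/absence case from the original question, and `single_sample` genes are the ones flagged above as problematic. How each category should then be modeled is a separate decision that this sketch does not make.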