Hi all, I have a basic question about differential gene expression analysis with DESeq2 between two conditions (say 1 and 2). If I have 3 samples for condition 1 and 60 samples for condition 2, is it fine to run the analysis on all of them? Or do I need to randomly select fewer samples from condition 2 to have a more "balanced" analysis? Are there any statistical problems associated with this imbalance? If I do need to select fewer samples, how many samples from condition 2 should I select?
Thanks in advance for any suggestions. I really appreciate your help.
I see this issue has been raised on Bioconductor (e.g. here). I'm not a statistician and am interested to hear other views, but I'd say the DE methods in DESeq2 are valid for unbalanced groups; they will likely just have less power than a balanced design with the same total sample size. Your imbalance is very large, so I'd guess the per-gene variance (dispersion) estimates will be driven mostly by the larger n=60 group. Having said that, DESeq2 also shares variance information across genes. I think you could certainly proceed with all samples rather than down-sampling to equalize group sizes. But I would visualize your data carefully, using MA-plots etc., to confirm you are not seeing any group-size-driven artifacts among the genes found to be DE.
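To put rough numbers on the "keep all samples" advice, here is a minimal sketch. It assumes a simple equal-variance comparison of group means, not DESeq2's negative binomial model, so treat it only as intuition. Under that assumption the standard error of the difference between two group means scales with sqrt(1/n1 + 1/n2). The function name `se_factor` and the three designs are illustrative choices, not anything from DESeq2.

```python
import math

def se_factor(n1, n2):
    """Standard error of a difference in group means, in units of the
    common per-sample standard deviation: sqrt(1/n1 + 1/n2).
    Assumption: both groups share one variance (a simplification;
    DESeq2 itself fits a negative binomial GLM)."""
    return math.sqrt(1.0 / n1 + 1.0 / n2)

# Hypothetical designs, chosen to mirror the question.
designs = {
    "all samples (3 vs 60)": (3, 60),
    "down-sampled (3 vs 3)": (3, 3),
    "balanced, same total (31 vs 32)": (31, 32),
}

for label, (n1, n2) in designs.items():
    # Smaller factor = more precise estimate of the group difference.
    print(f"{label}: SE factor = {se_factor(n1, n2):.3f}")

# Prints:
# all samples (3 vs 60): SE factor = 0.592
# down-sampled (3 vs 3): SE factor = 0.816
# balanced, same total (31 vs 32): SE factor = 0.252
```

In this simplified picture, throwing away 57 samples from condition 2 makes the estimate less precise, not more, while the real limitation is the 3 samples in condition 1: a balanced design with the same total would do much better, but down-sampling cannot get you there.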
Thanks so much for your suggestions. They are very helpful.