I’m working with an imbalanced set of cell line RNA-seq data with small group sizes. For example, I have four samples exposed to condition A, two samples exposed to condition B, two samples exposed to condition C, etc. Group sizes range from two to five. Each group was run as a single batch, but different groups were run in different labs. Obviously this isn’t ideal, but some reads had to be discarded due to significant quality issues and it wasn’t possible to run all samples in the same lab.
I’m looking for literature that a) provides insight into best practices for balancing this type of data, and b) perhaps describes a way to characterize dataset balance before and after balancing.
In case it matters, my ultimate goal is differential expression analysis. I know that tools like DESeq2 are supposed to be valid for imbalanced data, but I’m wondering at what point a dataset becomes too imbalanced and either requires correction or is simply not usable.
I could also extend this question to batch effect correction. How do you decide that a dataset is too imbalanced for a tool like ComBat? What’s the cutoff? Note that I am aware that including batch as a covariate in the DE analysis is preferred over removing batch effects with ComBat. I’m just interested in learning the best practices for defining how balanced an RNA-seq dataset is.
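To make "how balanced" a bit more concrete, here is a minimal sketch of the kind of check I have in mind. It is not an established metric. The group sizes follow the example above (A=4, B=2, C=2), while the lab names and the helper functions `crosstab`, `size_ratio`, and `fully_confounded` are hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample sheet of (condition, lab) pairs. Group sizes follow the
# example in the question; the lab names are made up for illustration.
samples = [
    ("A", "lab1"), ("A", "lab1"), ("A", "lab1"), ("A", "lab1"),
    ("B", "lab2"), ("B", "lab2"),
    ("C", "lab3"), ("C", "lab3"),
]


def crosstab(samples):
    """Count samples for each (condition, lab) pair."""
    return Counter(samples)


def size_ratio(samples):
    """Largest group size divided by smallest group size (1.0 means equal sizes)."""
    sizes = Counter(condition for condition, _ in samples)
    return max(sizes.values()) / min(sizes.values())


def fully_confounded(samples):
    """True if every lab processed only one condition.

    In that case no two conditions can be compared within a lab, so condition
    differences cannot be separated from lab differences.
    """
    conditions_per_lab = defaultdict(set)
    for condition, lab in samples:
        conditions_per_lab[lab].add(condition)
    return all(len(conds) == 1 for conds in conditions_per_lab.values())


if __name__ == "__main__":
    for (condition, lab), n in sorted(crosstab(samples).items()):
        print(condition, lab, n)
    print("size ratio:", size_ratio(samples))
    print("fully confounded:", fully_confounded(samples))
```

If my understanding is right, a layout like this one (each lab handling exactly one condition) is the more serious problem than unequal group sizes, since lab and condition then cannot be told apart by either a batch covariate or ComBat.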
I would ask this on the Bioconductor support forum: https://support.bioconductor.org/t/Latest/
When you ask it there, please mention that you first asked here, and provide the link.