Question

Methods for analyzing and correcting RNA-seq dataset balance

1

6.6 years ago

abe ▴ 30

I’m working with an imbalanced set of cell line RNA-seq data with small group sizes. For example, I have four samples exposed to condition A, two samples exposed to condition B, two samples exposed to condition C, etc. Group sizes range from two to five. Each group was run as a single batch, but different groups were run in different labs. Obviously this isn’t ideal, but some reads had to be discarded due to significant quality issues and it wasn't possible to run all samples in the same lab.

I’m looking for literature that a) provides insight into best practices for balancing this type of data, and b) perhaps a way to characterize dataset balance before/after balancing.

If necessary to know, I am ultimately seeking to conduct differential expression analysis. I do know that tools like DESeq2 are supposed to be valid for imbalanced data, but I’m wondering to what extent a dataset is just too imbalanced and requires correction or is simply not usable.

I could also extend this question to batch effect correction. How do you characterize a dataset as being too imbalanced for a tool like ComBat? What’s the cutoff? Note that I am aware that including batch as a covariate in DE analysis is preferred over removing batch effects with ComBat. I’m just interested in learning the best practices for defining how balanced an RNA-seq dataset is.

RNA-Seq • 1.9k views

updated 6.6 years ago by Charles Warden 8.3k • written 6.6 years ago by abe ▴ 30

0

I would ask this on Bioconductor forum: https://support.bioconductor.org/t/Latest/

When you ask it there, please mention that you first asked here, and provide the link.

6.6 years ago by Kevin Blighe 89k

score 0 · Answer 1 · 2018-12-24

I think these things can be hard to determine precisely. However, if you have a highly asymmetric gene list (far more genes called in one direction than the other), then it might be worth checking whether you have substantially more samples in the group that is relatively up-regulated.
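As a rough illustration of that check, here is a minimal Python sketch that counts significant genes in each direction so the split can be compared against the group sizes. The function name, the input lists, and the 0.05 cutoff are assumptions made for this example; they are not tied to the output format of any particular DE tool.

```python
def direction_counts(log2_fold_changes, adjusted_p_values, alpha=0.05):
    """Count significant genes that go up and down.

    Hypothetical helper: both inputs are plain lists with one entry per gene.
    Genes with a missing (None) adjusted p-value are skipped.
    """
    up = 0
    down = 0
    for lfc, padj in zip(log2_fold_changes, adjusted_p_values):
        if padj is None or padj >= alpha:
            continue  # not significant at the chosen cutoff
        if lfc > 0:
            up += 1
        elif lfc < 0:
            down += 1
    return up, down


# Toy values, made up for illustration only.
up, down = direction_counts([1.2, -0.5, 2.0, -1.1], [0.01, 0.2, 0.03, 0.04])
print(f"up: {up}, down: {down}")
```

A strongly lopsided split is not proof of a problem on its own, but it is a reason to look at which group has more samples.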

I would also recommend testing different methods for every project (so, for your particular set of samples, maybe some methods can better handle your unbalanced design than others).

That said, you need to critically assess your results: if you don't have a way to test the effect of the ComBat correction, then I would be skeptical of that result (and I would be cautious that the normalization methods may sometimes show over-fitting that can actually add bias into your results).

To be honest, I typically use multivariate differential expression models instead of ComBat. If you have discrete groups, I would also try visualizing expression that is centered within each group you want to correct for (although you could instead visualize expression before and after the ComBat adjustment). The covariate-centered visualization won't make sense if you only have one representative sample per group. More generally, I would be suspicious of any method that claims to correct for a variable that isn't randomized across whatever you want to adjust for (or that has only one sample to represent the interaction of two variables).
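The two ideas above (checking whether the variable to correct for is spread across the groups of interest, and centering expression within each group) can be sketched in plain Python. This is only a sketch under stated assumptions: the function names, sample labels, and expression values are hypothetical, and the centering works on one gene at a time using values that are assumed to be already normalized and log-transformed.

```python
from collections import Counter, defaultdict


def design_table(conditions, batches):
    """Count samples for each (condition, batch) pair."""
    return Counter(zip(conditions, batches))


def single_condition_batches(conditions, batches):
    """Return the batches that contain samples from only one condition.

    Such batches carry no within-batch comparison between conditions,
    so batch and condition cannot be separated for them.
    """
    seen = defaultdict(set)
    for cond, batch in zip(conditions, batches):
        seen[batch].add(cond)
    return sorted(b for b, conds in seen.items() if len(conds) == 1)


def center_by_group(values, groups):
    """Subtract each group's mean from the values of its members (one gene)."""
    sums = defaultdict(float)
    counts = Counter()
    for value, group in zip(values, groups):
        sums[group] += value
        counts[group] += 1
    means = {g: sums[g] / counts[g] for g in sums}
    return [value - means[group] for value, group in zip(values, groups)]


# Toy design, made up for illustration: each lab ran exactly one condition.
conditions = ["A", "A", "B", "B"]
labs = ["lab1", "lab1", "lab2", "lab2"]
print(design_table(conditions, labs))
print(single_condition_batches(conditions, labs))

# Toy log-expression values for a single gene, centered within each lab.
print(center_by_group([1.0, 3.0, 10.0, 14.0], labs))
```

In the toy design every lab holds a single condition, so centering by lab would also remove the difference between conditions. That is the situation where I would not trust a correction.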