Hi everyone, I'm dealing with an RNA-seq dataset that has a very unbalanced gender distribution between the 2 classes I need to compare. In detail, the "control" class has 11 males and 2 females, while the "case" class has 1 male and 8 females. I am wondering whether there is an adequate and simple method to mitigate this imbalance when performing differential expression analysis. I am considering using the batch correction option in DESeq2: design = ~ Sex + Type. However, I do not know what to expect, given that the "confounder" is so disproportionately distributed, or whether this option is appropriate. As you can tell from the basic level of the question, I am new to this field. Thank you for the help.
Thank you Kevin for the detailed answer. Unfortunately, in the PCA, samples separate by gender almost the same way they separate by condition. I think this could be expected, since the condition groups are very heavily biased towards one gender each. I will try the other analyses suggested, but I think the stratified one would be tricky: if I understood correctly, I would end up comparing 1 case vs 11 controls for males, and 8 cases vs 2 controls for females.
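To make the stratified counts and the degree of confounding concrete, here is a minimal sketch in plain Python (standard library only). It tabulates the per-sex group sizes from the numbers given in the question and runs a two-sided Fisher exact test on the sex-by-condition table. The names `counts`, `stratum_sizes` and `fisher_exact_two_sided` are made up for this illustration and are not part of DESeq2 or any other package.

```python
from math import comb

# Sample counts taken from the question: (condition, sex) -> number of samples
counts = {
    ("control", "male"): 11,
    ("control", "female"): 2,
    ("case", "male"): 1,
    ("case", "female"): 8,
}


def stratum_sizes(counts, sex):
    """Return (n_case, n_control) for a sex-stratified comparison."""
    return counts[("case", sex)], counts[("control", sex)]


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)

    def prob(k):
        # Probability that the top-left cell equals k, given fixed margins
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


for sex in ("male", "female"):
    n_case, n_control = stratum_sizes(counts, sex)
    print(f"{sex}: {n_case} case(s) vs {n_control} control(s)")

p_value = fisher_exact_two_sided(
    counts[("control", "male")], counts[("control", "female")],
    counts[("case", "male")], counts[("case", "female")],
)
print(f"Fisher exact p-value, sex vs condition: {p_value:.4g}")
```

A small p-value here only confirms that sex and condition are strongly associated in this cohort; it says nothing about which genes are affected by which factor.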
Yes, the imbalance will result in exaggerated / biased p-values and fold-change estimates, but I thought it would be interesting to do for your own investigation. If there is definite separation, then perhaps try to control for gender by including gender in the design formula. Either way, as you can see, there is no 'one size fits all' solution.
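As a rough way to gauge what adjusting for gender costs with these group sizes, the sketch below computes the correlation between the two binary covariates (the phi coefficient) and the corresponding variance inflation factor, 1 / (1 - phi^2). This is an ordinary linear-model approximation used here as an assumption; DESeq2 fits a negative binomial GLM, so treat the number only as an indication. The function name `phi_coefficient` is made up for this illustration.

```python
from math import sqrt


def phi_coefficient(a, b, c, d):
    """Pearson correlation between two binary variables, from the 2x2 table [[a, b], [c, d]]."""
    numerator = a * d - b * c
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


# Rows: control, case; columns: male, female (counts from the question)
phi = phi_coefficient(11, 2, 1, 8)

# With two correlated covariates in a linear model, the variance of each
# coefficient is inflated by this factor relative to a balanced design.
vif = 1.0 / (1.0 - phi ** 2)

print(f"phi = {phi:.3f}, variance inflation factor = {vif:.2f}")
```

With the counts from the question, phi comes out at about 0.73 and the inflation factor at about 2.1. All four sex-by-condition cells contain at least one sample, so both effects remain estimable, but the condition effect is estimated with noticeably less precision than in a balanced design.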
In any epidemiological study, we would report both covariate-adjusted and non-adjusted test statistics, so I see no reason why we cannot do the same for omics data.