I am facing an issue with the design of my experiment. I am performing differential gene expression (DEG) analyses on my RNA-seq samples as part of my PhD research. My aim is to evaluate the differences in gene expression between cribriform and non-cribriform prostate cancer.
The problem I am encountering is that my control group (non-cribriform prostate cancer) has 205 samples, whereas my treatment group (cribriform prostate cancer) has only 65 samples. I understand that this imbalance can affect the performance of the methods, but I would like some suggestions on how to adjust the control group without biasing my data.
Could I randomly select 65 samples from the control group? Or could I use a methodology to cluster the count data from the control samples and choose representatives from each cluster to reduce this discrepancy? These ideas have already crossed my mind.
I also tried filtering by another clinical variable, but this cuts my treatment group (cribriform) roughly in half, from 65 samples. Since it is my limiting group, I do not want to lose samples from it. That is why I am considering reducing only the control group. I look forward to your suggestions. Thank you!
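Both DESeq2 and edgeR can handle unequal group sizes, so you should not need to discard control samples to balance the design. See these threads: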
DESeq2 https://support.bioconductor.org/p/101416/
edgeR https://support.bioconductor.org/p/113521/
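
To give a rough feel for why dropping controls tends to hurt rather than help, here is a minimal Python sketch. It is not what DESeq2 or edgeR compute: it assumes a simplified two-group comparison of means with the same per-sample variance in both groups, where the standard error of the difference scales with sqrt(1/n_a + 1/n_b). The function name `se_factor` is just an illustrative helper.

```python
import math


def se_factor(n_a: int, n_b: int) -> float:
    """Factor multiplying sigma in the standard error of a difference
    of two group means, assuming equal per-sample variance (a
    simplification; DESeq2/edgeR fit negative binomial models)."""
    return math.sqrt(1.0 / n_a + 1.0 / n_b)


n_cribriform = 65   # limiting group from the question
n_control = 205     # non-cribriform group from the question

# Keep every control sample.
full = se_factor(n_cribriform, n_control)

# Randomly subsample controls down to 65 (only the group size matters here).
subsampled = se_factor(n_cribriform, n_cribriform)

# How much larger the standard error becomes after subsampling.
inflation = subsampled / full

print(f"all {n_control} controls: SE factor = {full:.4f}")
print(f"subsampled to {n_cribriform}: SE factor = {subsampled:.4f}")
print(f"inflation from subsampling: {inflation:.3f}x")
```

Under these assumptions, subsampling to 65 controls inflates the standard error by about 1.23x, so the extra controls add precision instead of causing a problem. The real gain in a count-based model will differ, but the direction should be the same.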