Hi guys, I have a question similar to this post: https://stackoverflow.com/questions/56840541/differential-expression-with-very-imbalanced-groups
I am trying to perform differential gene expression analysis on public RNA-seq data with 775 samples in total. I would like to compare 15 samples of interest against the rest of the samples, so the two groups are very imbalanced. My boss suggested using a method similar to the one in this paper: https://www.nature.com/articles/nature25171.pdf?origin=ppub To sum up, their method, called "ee-MWW", splits the bigger group into multiple smaller subsets with the same sample size as the smaller group, performs the Mann-Whitney-Wilcoxon test on each of them, and produces a value that can be ranked to select the significant genes. I also tried DESeq2, but neither set of results makes biological sense to us: both contain a lot of microRNA genes and pseudogenes.
So does anyone know a more appropriate way to do differential gene expression analysis with very unbalanced sample groups? Any suggestions and ideas would be much appreciated.
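The subsampling idea described above can be sketched roughly as follows for a single gene. This is only an illustration under stated assumptions, not the paper's exact ee-MWW procedure: the function names, the number of draws, and the choice to average a scaled U statistic (U divided by the number of sample pairs, so 0.5 means no shift) are all hypothetical.

```python
import random
from statistics import mean


def rank_with_ties(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def scaled_u(x, y):
    """Mann-Whitney U of x versus y, divided by len(x) * len(y)."""
    ranks = rank_with_ties(list(x) + list(y))
    n_x, n_y = len(x), len(y)
    u = sum(ranks[:n_x]) - n_x * (n_x + 1) / 2
    return u / (n_x * n_y)


def subsampled_score(small, large, n_draws=100, seed=0):
    """Average scaled U over random subsets of `large` of size len(small).

    ASSUMPTION: averaging is an illustrative way to combine the draws;
    the paper may combine them differently.
    """
    rng = random.Random(seed)
    draws = [scaled_u(small, rng.sample(large, len(small)))
             for _ in range(n_draws)]
    return mean(draws)
```

Running `subsampled_score` per gene gives one value per gene that can be ranked; values near 1 or 0 suggest the gene is consistently higher or lower in the small group, while values near 0.5 suggest no shift.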
Does the result make sense if you filter for protein-coding genes? Do you know of genes you expect to be differentially expressed, and if so, how do they behave?
When you have so many samples, technical artefacts can be a huge problem. Are you controlling for confounding effects (batch, gender, age, etc.)?
Could you post a PCA plot of the samples coloured by your groups?
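As a rough sketch of the filtering step, assuming you have a gene-to-biotype mapping from your annotation (GENCODE-style annotations label these genes `protein_coding`), something like the following would do. The gene identifiers, the `results` layout, and the function name are all hypothetical placeholders.

```python
def keep_protein_coding(results, biotype_by_gene):
    """Keep only result rows whose gene is annotated as protein_coding.

    Genes missing from the mapping are dropped as well.
    """
    return [row for row in results
            if biotype_by_gene.get(row["gene"]) == "protein_coding"]


# Hypothetical example data: placeholder IDs, not real results.
biotype_by_gene = {
    "GENE_A": "protein_coding",
    "GENE_B": "miRNA",
    "GENE_C": "processed_pseudogene",
}
results = [
    {"gene": "GENE_A", "padj": 0.001},
    {"gene": "GENE_B", "padj": 0.0005},
    {"gene": "GENE_C", "padj": 0.02},
    {"gene": "GENE_D", "padj": 0.03},  # not in the annotation mapping
]

filtered = keep_protein_coding(results, biotype_by_gene)
```

Note that filtering after testing leaves the adjusted p-values computed over all genes; filtering the count matrix before running the test would change the multiple-testing correction.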