5.7 years ago
bimlay2
I have gene-wise expression data for 35 cell lines with ~3 runs per cell line. I also have a binary classification for each cell line associated with a biological phenomenon.
I am interested in finding the genes that are most associated with the binary classification. I have tested several approaches, but I wanted to ask if anyone had insight into these sorts of problems.
So far I have:
- Generated univariate AUC scores for each gene, which essentially gives a measure of how separated the binary groups are for each gene.
- Used an array of binary classifiers and subsequent variable importance analysis to generate ranked gene importance.
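For readers who want to see what the univariate AUC step amounts to, here is a minimal sketch in plain Python. The gene names, expression values, and labels are made up for illustration, and the helper `auc` is a hypothetical name, not taken from any package. It treats AUC as the probability that a randomly chosen class-1 sample has a higher value than a randomly chosen class-0 sample, with ties counted as one half.

```python
def auc(values, labels):
    """Probability that a random class-1 value exceeds a random class-0 value.

    Ties contribute 0.5. Labels are assumed to be coded as 0 and 1.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): one expression value per sample.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "geneA": [9.1, 8.7, 9.4, 5.2, 5.9, 6.1],  # classes fully separated
    "geneB": [5.0, 7.0, 6.0, 6.5, 5.5, 7.5],  # classes overlap
}

scores = {gene: auc(vals, labels) for gene, vals in expr.items()}

# Rank by distance from 0.5 so genes that are lower in class 1 also rank highly.
ranked = sorted(scores, key=lambda g: abs(scores[g] - 0.5), reverse=True)
print(ranked, scores)
```

Ranking by distance from 0.5 rather than by raw AUC is a design choice: an AUC near 0 separates the groups just as well as an AUC near 1, only in the opposite direction.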
Am I missing an obvious method? Do my approaches so far make sense?
You say you are interested in finding the genes most associated with the binary classification, as opposed to building a predictor of the binary class. If this is a gene selection question, one alternative is a differential expression approach: for example, limma or an equivalent, with your binary classes as the contrast, ranking genes by the largest and/or most significant differences between the two classes.
Thanks for your comment. I actually used DESeq2 to generate DE results. The mean-dispersion trend looked weird, and I got super, super low p-values. I wasn't sure if any DE method was suited for 35 cell lines lumped into two groups.
Ah, OK. If 'biological.phenomenon' is a binary label on each cell line (not an experimental factor that you modulated), then I suppose it is heavily confounded with cell.line effects. If cell.line effects are large (they probably are) but there are still 'biological.phenomenon' main effects large enough to observe against that background, then a rank-based approach, such as a per-gene univariate Mann-Whitney test comparing the two levels of 'biological.phenomenon', would be worth a try.
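A rough sketch of that per-gene Mann-Whitney test, for one gene, in plain Python. Two things here are assumptions rather than part of the suggestion above: the replicate runs are first averaged so each cell line contributes a single value (so that runs from the same line are not treated as independent samples), and the p-value uses the normal approximation with tie correction and no continuity correction. The data and the helper names `collapse_runs` and `mann_whitney` are made up for illustration. In practice `scipy.stats.mannwhitneyu` would be the usual choice.

```python
import math
from collections import defaultdict


def collapse_runs(values, cell_lines):
    """Average replicate runs so each cell line contributes one value."""
    grouped = defaultdict(list)
    for v, c in zip(values, cell_lines):
        grouped[c].append(v)
    return {c: sum(vs) / len(vs) for c, vs in grouped.items()}


def mann_whitney(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected)."""
    pooled = sorted(x + y)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                           # size of this group of tied values
        rank[pooled[i]] = (i + 1 + j) / 2   # midrank of positions i+1 .. j
        tie_term += t ** 3 - t
        i = j
    u = sum(rank[v] for v in x) - n1 * (n1 + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return u, 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(var)
    return u, math.erfc(abs(z) / math.sqrt(2))


# Toy data (made up): two runs per cell line for a single gene.
cell_lines = ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E", "F", "F"]
values = [8.0, 8.4, 7.5, 7.9, 9.0, 9.2, 5.0, 5.4, 6.1, 5.9, 4.8, 5.2]
phenotype = {"A": 1, "B": 1, "C": 1, "D": 0, "E": 0, "F": 0}

means = collapse_runs(values, cell_lines)
group1 = [m for c, m in means.items() if phenotype[c] == 1]
group0 = [m for c, m in means.items() if phenotype[c] == 0]
u, p = mann_whitney(group1, group0)
print(u, p)
```

With only three cell lines per group, as in this toy example, the normal approximation is crude and an exact test would be preferable. With 35 cell lines it should be more reasonable. When the test is run across all genes, the resulting p-values would still need a multiple-testing correction.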