Hi everyone!
I am working with a mouse cancer model that produces tumors, and we have performed gene expression profiling on all of them. I would like to build a classifier that identifies human tumors which, based on their gene expression, are similar to my model (i.e. "mouse-like"). The microarrays have 27000+ features, and I suspect that I don't need that many. Is there a methodology for choosing how many features, and which ones, to use? I know this seems counter-intuitive, because I shouldn't look at the data before I apply machine learning. I am currently reading papers.
Thank you for your input!
It IS safe to filter genes to those with high variance across all samples, because the variance is computed without looking at the class labels; this would be a quick and easy way to get a reasonable set for classification.
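To make the idea concrete, here is a minimal sketch of an unsupervised variance filter. The function name `top_variance_genes`, the gene names, and the toy values are made up for illustration; real data would normally sit in a matrix and be handled with a numerical library rather than plain dictionaries.

```python
from statistics import variance

def top_variance_genes(expr, k):
    """Return the names of the k genes with the highest sample variance.

    expr maps gene name -> list of expression values, one per sample.
    Class labels are never used, so the filter is unsupervised.
    """
    ranked = sorted(expr, key=lambda g: variance(expr[g]), reverse=True)
    return ranked[:k]

# Toy data (hypothetical): 4 genes measured on 5 samples.
expr = {
    "geneA": [5.0, 5.1, 4.9, 5.0, 5.0],
    "geneB": [2.0, 8.0, 3.0, 9.0, 1.0],
    "geneC": [6.0, 6.5, 7.0, 5.5, 6.0],
    "geneD": [1.0, 1.0, 1.0, 1.0, 1.2],
}
selected = top_variance_genes(expr, k=2)
print(selected)  # ['geneB', 'geneC']
```

The point is only that the ranking depends on the expression values alone, not on which tumor belongs to which class.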
Hi! Thank you for your quick response. Could I use a nonparametric ranking test (e.g. Wilcoxon) to get the genes with the highest variance?
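No. A test like Wilcoxon ranks genes by how they differ between the classes, so it uses the class labels. You may not use any measure of variability that includes the classes.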
Thank you! Is there a way to choose a cutoff for the variance? I am asking because the variance values will be continuous. Bootstrap resampling?
There is no "cutoff". I suspect that you'll find that there is a pretty broad range that can result in similar performance.
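One way to check this on your own data is to try several values for the number of retained genes and compare cross-validated accuracy. The sketch below is only an illustration under stated assumptions: the data are simulated, the nearest-centroid classifier is a stand-in for whatever classifier you end up using, and the helper names (`top_variance_idx`, `nearest_centroid_predict`, `loo_accuracy`) are hypothetical.

```python
import random
from statistics import mean, variance

def top_variance_idx(X, k):
    """Indices of the k highest-variance genes (columns of X); labels are not used."""
    n_genes = len(X[0])
    var = [variance([row[j] for row in X]) for j in range(n_genes)]
    return sorted(range(n_genes), key=lambda j: var[j], reverse=True)[:k]

def nearest_centroid_predict(X_train, y_train, x, genes):
    """Predict the label whose class centroid (over `genes`) is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, y in zip(X_train, y_train) if y == label]
        centroid = [mean(r[j] for r in rows) for j in genes]
        dist = sum((x[j] - c) ** 2 for j, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy when only the top-k variance genes are kept."""
    # The filter ignores the labels, so it is computed once on all samples.
    genes = top_variance_idx(X, k)
    hits = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(X_tr, y_tr, X[i], genes) == y[i]
    return hits / len(X)

# Simulated data (assumption): 20 samples, 50 genes, first 5 genes differ by class.
rng = random.Random(0)
y = [0] * 10 + [1] * 10
X = [[rng.gauss(3.0 * label if j < 5 else 0.0, 1.0) for j in range(50)]
     for label in y]

results = {k: loo_accuracy(X, y, k) for k in (5, 10, 20, 50)}
for k, acc in results.items():
    print(f"top {k:2d} genes: leave-one-out accuracy = {acc:.2f}")
```

If the accuracies are similar across a wide range of k, the exact choice matters little and any value in that range is defensible.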