Hello everyone
I have a dataset with dimensions of 330 × 45000 (330 samples and 45000 features: reads in peaks).
I am looking for a way to select the best features for binary classification. So far I have only kept features with a covariance higher than 0.5 or lower than -0.5, which reduced the dimension to 14,000. I know I should reduce the dimensionality further, but I'm not sure whether I can use a random forest at this stage. Do you have any suggestions or tips?
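For context, this is roughly the filtering step I used so far (a minimal sketch with toy data; it assumes the ±0.5 threshold is applied to the Pearson correlation between each feature and the binary label, and `X`/`y` are placeholder names):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.poisson(5, size=(330, 5000)))   # toy stand-in for reads in peaks
y = pd.Series(rng.integers(0, 2, size=330))          # binary class labels

corr_with_label = X.corrwith(y)                      # one correlation value per feature
keep = corr_with_label.abs() > 0.5                   # keep |r| > 0.5 on either side
X_reduced = X.loc[:, keep]
print(X_reduced.shape)
```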
Thank you for your response; it was indeed very helpful. I have a question, though: in the case of bigger datasets with more features, say about 500,000 features and 1000 samples, what is the best preprocessing method for classification? I'm looking for a method, like variance filtering, that doesn't look at the sample labels.
I don't know that anything will work well on half a million features. Using variance and correlation (see here and here) is likely to be most productive.
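A minimal sketch of that label-free approach (variance cut first, then drop one member of each highly correlated pair among the survivors); the function name, thresholds, and toy data are placeholders, not something specific to your dataset:

```python
import numpy as np
import pandas as pd

def variance_then_correlation_filter(X: pd.DataFrame,
                                     var_quantile: float = 0.75,
                                     corr_cutoff: float = 0.9) -> pd.DataFrame:
    # 1) keep only features whose variance is above the chosen quantile
    variances = X.var(axis=0)
    X_var = X.loc[:, variances > variances.quantile(var_quantile)]

    # 2) among the survivors, drop one feature of every highly correlated pair
    corr = X_var.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_cutoff).any()]
    return X_var.drop(columns=to_drop)

rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 200)))    # small toy matrix
print(variance_then_correlation_filter(demo).shape)
```

Note that the pairwise-correlation step is only feasible on whatever survives the variance cut; computing a full correlation matrix on 500K raw features is not practical.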
Multicollinearity between features can be determined by calculating the variance inflation factor (VIF), but that is also too slow for 500K features. I just did a quick simulation with 1000 samples and 500 features, and that still took 2 hours on a fast, 12-CPU computer. See a notebook for the VIF implementation, but I don't think that will really help you.
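For reference, a VIF loop can be sketched with statsmodels as below (synthetic toy data, and not necessarily the same implementation as the notebook); the per-feature regression loop is exactly why it does not scale to hundreds of thousands of features:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 40)),
                 columns=[f"f{i}" for i in range(40)])   # small toy matrix

exog = add_constant(X).to_numpy()            # VIF needs an intercept column
vif = pd.Series(
    [variance_inflation_factor(exog, i) for i in range(1, exog.shape[1])],
    index=X.columns,
)
print(vif.sort_values(ascending=False).head())
```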
Maybe try something like gradient boosting that is multithreaded and can handle large datasets. More than anything, I would suggest you re-think the strategy that gives you 500K features. In other words, work on reducing the number of features before attempting to classify. Whatever measurements you are making, make fewer than half a million.
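One possible starting point for a multithreaded gradient-boosting baseline is scikit-learn's HistGradientBoostingClassifier (LightGBM or XGBoost would be similar choices); this is just a sketch with synthetic data, and the hyperparameters are placeholders:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2000))            # toy stand-in for a large feature matrix
y = rng.integers(0, 2, size=1000)            # binary labels

# the estimator itself parallelizes over CPU threads via histogram binning
clf = HistGradientBoostingClassifier(max_iter=100, learning_rate=0.05)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```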