Question

Feature Selection For Regression

0

13.9 years ago

User 1933 ▴ 360

From a single sequence, one can derive thousands of features, but which features are the most predictive?

Basically, we can start with a simple PCA to see which features are the most explanatory and then project the others onto them. But sometimes PCA reveals no structure, and all features seem to contribute equally.

Also, in a classification problem, we can start with a simple tree-based classifier, prune the tree, and evaluate the model. But how would you proceed when the problem is regression? Say, predicting the solubility of a protein from its sequence.

Is there any baseline or standard procedure among bioinformaticians?

feature prediction sequence • 3.7k views

updated 8.0 years ago by Biostar 20 • written 13.9 years ago by User 1933 ▴ 360

Ram · Answer 1 · 2011-05-31

Put simply, feature selection can be "manual" (e.g. through PCA inspection) or "automated" (some algorithm which selects the most predictive features). Classification can be "supervised" (e.g. linear discriminant analysis, where classes and features are supplied) or "unsupervised" (again, some algorithm is used to evaluate the classification).

There is no single "baseline" or procedure; you need to consult the statistical literature and decide on an appropriate method, based on the data that you have and what precisely you want to do.

I'd recommend starting with this review: "Penalized feature selection and classification in bioinformatics." It's a good overview. Penalized feature selection is a commonly used approach; it's an iterative procedure which tests features and then, as the name implies, penalizes them with a score depending on how well they perform as predictors.
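For the regression case asked about in the question, one common penalized method is the lasso (L1-penalized least squares). The sketch below is only an illustration and is not taken from the review: it is a plain-Python coordinate descent on made-up data, and the function names, the penalty value, and the toy data set are all assumptions. The point it shows is that the penalty shrinks weak coefficients to exactly zero, so the surviving features are the selected ones.

```python
import random


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; anything within [-penalty, penalty] becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=200):
    """Minimise (1/2n)*||y - Xw||^2 + penalty*||w||_1 by cyclic coordinate descent.

    Assumes the columns of X and the target y are already centred (no intercept).
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant column carries no information
            # Covariance of feature j with the residual that ignores feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


# Hypothetical toy data: 6 features, only features 0 and 3 drive the response.
random.seed(0)
n_samples, n_features = 200, 6
X = [[random.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [3.0 * row[0] - 2.0 * row[3] + random.gauss(0.0, 0.1) for row in X]

# Centre every column and the target so no intercept term is needed.
col_means = [sum(row[j] for row in X) / n_samples for j in range(n_features)]
X = [[row[j] - col_means[j] for j in range(n_features)] for row in X]
y_mean = sum(y) / n_samples
y = [value - y_mean for value in y]

weights = lasso_coordinate_descent(X, y, penalty=0.2)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print("selected features:", selected)
```

In practice the penalty would be chosen by cross-validation rather than fixed by hand, and a library implementation would be preferable to this hand-rolled loop.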

You may also want to look at "Classification of gene microarrays by penalized logistic regression" (PLR). PLR provides estimates of the underlying probability that a feature is a good classifier. That paper also describes recursive feature elimination (the name is self-explanatory).
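Recursive feature elimination carries over to regression with little change. The following is a minimal sketch, not the procedure from the paper: it assumes ordinary least squares as the model, ranks features by absolute coefficient on standardised inputs, and uses made-up data. The function name and the NumPy-based setup are illustrative choices.

```python
import numpy as np


def recursive_feature_elimination(X, y, n_keep):
    """Refit least squares repeatedly, dropping the feature with the smallest |coefficient|.

    Assumes the columns of X are standardised so coefficient sizes are comparable.
    Returns (indices kept, indices dropped in elimination order).
    """
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        weakest = int(np.argmin(np.abs(coef)))
        eliminated.append(remaining.pop(weakest))
    return remaining, eliminated


# Hypothetical toy data: 6 features, only features 0 and 3 drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Standardise the features and centre the target (no intercept is fitted).
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

kept, dropped = recursive_feature_elimination(X, y, n_keep=2)
print("kept:", kept, "dropped in order:", dropped)
```

With thousands of sequence-derived features, dropping a fraction of the features per round instead of one at a time keeps the number of refits manageable.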

Another interesting paper: "penalizedSVM: a R-package for feature selection SVM classification", describes methods which "provide automatic feature selection for SVM classification tasks."

score 2 · Answer 2 · 2011-05-31

13.9 years ago

Sean Davis 27k

You could look into using random forests, which do feature selection and regression natively. There is a good implementation in the randomForest R package.
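As a rough illustration of the idea, here is a sketch in Python using scikit-learn's RandomForestRegressor rather than the randomForest R package named above. That swap, the toy data, and the parameter values are all assumptions made for the example. It fits a regression forest and ranks features by the impurity-based importances the fitted model exposes.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: 6 features, only features 0 and 3 drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

# Fit a regression forest; random_state is fixed only to make the run repeatable.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalised by scikit-learn to sum to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]  # most important feature first
top_two = sorted(int(j) for j in ranking[:2])

for j in ranking:
    print(f"feature {int(j)}: importance {importances[j]:.3f}")
```

Impurity-based importances can be biased when features are strongly correlated or differ a lot in scale, so it is worth cross-checking the ranking against permutation importance on held-out data.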
