Question

How to do feature selection using PCA?

6.7 years ago

ahmad mousavi ▴ 800

Hi

I have read about PCA and its power for dimension reduction, but for a specific project I need to do feature selection with PCA. I know PCA might not be the best choice for this, but I need the PCA results for feature selection.

Can anyone give me some advice or R code for that purpose?

My data contain both clinical and genomic information, with a Target variable that labels each sample as disease or healthy.

Thanks.

RNA-Seq R pca • 4.6k views

ADD COMMENT • link 6.7 years ago by ahmad mousavi ▴ 800

Are you looking to predict your Target variable (disease/healthy) from the clinical and genomic data? It sounds like you have labeled training data where the disease/healthy classification is known. If so, you might want to use supervised approaches for feature selection rather than unsupervised approaches like PCA, because your disease/healthy classes might not be driven by the features that drive overall variance in the data, which are the ones PCA would pick out.

If you do want to explore PCA, you could take your feature matrix (assuming it contains, or could be converted to, continuous-valued variables) and apply the R prcomp() function to derive principal components. You could then check what percentage of the variance in the data your principal components explain and, if that looks reasonable, select one or more of the principal components as your features for further modeling.
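The prcomp() workflow described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the matrix `X` is simulated stand-in data (not the poster's clinical/genomic data), the variable names are hypothetical, and the 80% cumulative-variance cutoff is an arbitrary choice made for the example.

```r
# Simulated stand-in for a real feature matrix (hypothetical data):
# 100 samples (rows) x 20 continuous features (columns).
set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)
colnames(X) <- paste0("feature", seq_len(ncol(X)))

# Centre and scale so that features measured on different scales
# contribute comparably to the components.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each PC, and the running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Keep the smallest number of PCs whose cumulative variance reaches
# the (arbitrary, assumed) 80% threshold.
n_keep <- which(cum_var >= 0.80)[1]

# PC scores for the retained components: one row per sample.
scores <- pca$x[, seq_len(n_keep), drop = FALSE]
```

Note that `scores` holds linear combinations of all the original features, so this reduces dimension rather than picking a subset of the original variables; summary(pca) prints the same variance breakdown in table form.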

ADD REPLY • link 6.7 years ago by Ahill ★ 2.0k

Thanks @Ahill for the good explanation. I don't want to group the variables based on their information in PCs.

Yes, I already know the outcome label in my training set. How can I combine the PCA scores with a linear model?

Do you have any ideas for combining the PCA results with modeling (lm/GLM)?

ADD REPLY • link 6.7 years ago by ahmad mousavi ▴ 800

score 1 · Answer 1 · 2018-11-03

6.7 years ago

Ahill ★ 2.0k

It can be valid to use PC scores as the independent variables in linear models, instead of the underlying primary variables. For predicting a binary outcome like disease/healthy, you could look at a logistic model, which in R is fitted with glm() and family = binomial rather than lm(), such as:

disease.status ~ PC1 + PC2 + ...

Of course, the usefulness of this will depend on the structure of your clinical and genomic predictors, and on whether the PCs are correlated with disease status.
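As a rough sketch of the model formula above, the PC scores from prcomp() can be passed to glm() with a binomial family. Everything about the data here is an assumption for illustration: the predictors and the 0/1 `disease.status` vector are simulated, and using the first three PCs is an arbitrary choice.

```r
# Simulated stand-in data (hypothetical): 200 samples x 10 continuous predictors.
set.seed(42)
n <- 200
X <- matrix(rnorm(n * 10), nrow = n)

# Hypothetical binary outcome (1 = disease, 0 = healthy), made to depend
# on the first two predictors so there is some signal to recover.
disease.status <- rbinom(n, size = 1, prob = plogis(X[, 1] + X[, 2]))

# Derive principal components from the centred, scaled predictors.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Combine the outcome with the scores of the first three PCs
# (prcomp names the score columns PC1, PC2, ...).
dat <- data.frame(disease.status = disease.status, pca$x[, 1:3])

# Logistic regression of disease status on the PC scores.
fit <- glm(disease.status ~ PC1 + PC2 + PC3, data = dat, family = binomial)

# Fitted probabilities of disease for the training samples.
pred <- predict(fit, type = "response")
```

summary(fit) then shows which PCs, if any, are associated with the outcome. To score new samples, the same rotation would need to be applied first, for example with predict(pca, newdata = ...), before calling predict() on the fitted model.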

ADD COMMENT • link 6.7 years ago by Ahill ★ 2.0k

Thanks, it works for me.

ADD REPLY • link 6.7 years ago by ahmad mousavi ▴ 800

Moved this to answer. Thank you Ahill. Ahmad, feel free to up-vote and/or accept the answer if it helps.

ADD REPLY • link 6.7 years ago by Kevin Blighe 89k