Question

Rationale For PCA Analysis

18

13.9 years ago

Deena ▴ 280

Hello, I would appreciate comments/advice on when to use Principal Component Analysis and what PCA data represents. My understanding of the algorithm is that a set of correlated variables are represented as uncorrelated variables, from which one can derive an understanding of variation in the data. BUT,

Once you have represented your data as a set of principal components, is there some way to determine which features are actually represented in each principal component? In other words, what does each principal component (PC) actually represent? If I understand correctly, the first 2 PCs will always be the most important to show the variation of the dataset, but how can I tie that back to the actual variables/features that I was mapping in the first place?
I wish to determine which features, from a set of features (ex: hydrophobicity, amino acid composition, etc.) are the "best" to predict whether a protein sequence will adopt a desired protein fold (a specific fold I have in mind). Accordingly, if PCA does not do that, what is the best technique to use? I have heard of "feature selection" but I am not very familiar with it. If anyone can elaborate on if/how it differs from PCA that would be very appreciated. Are there known examples (ex: articles, reviews) in protein structure prediction that address this?

I am intending to use R for this analysis, so any suggestions for R libraries that will do the job are most welcome! Thank you very much for your advice and responses!

-Deena

pca feature • 7.2k views

ADD COMMENT • link updated 13.9 years ago by Deena Gendoo • 0 • written 13.9 years ago by Deena ▴ 280

Thank you all very much for your fantastic and detailed responses!

ADD REPLY • link 13.9 years ago by Deena Gendoo • 0

score 21 · Answer 1 · 2011-02-24

"Is there some way to determine which features are actually represented in each principal component?"

Yes. Each PC is basically a linear combination of the original variables. A loading plot is typically used to plot the old variables in the new space. When combined with a score plot (where the old objects are plotted in the new space), you get a so-called biplot.
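To make the "linear combination" point concrete, here is a minimal sketch in Python with NumPy (the thread is about R, where `prcomp` returns the same two pieces as `rotation` and `x`). The data and the variable names `x1`, `x2`, `x3` are made up for illustration. The loadings matrix holds the weight of every original variable in every PC, and the scores are the objects expressed in PC coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up): 100 objects, 3 variables; x2 is nearly a copy of x1.
x1 = rng.normal(size=100)
x2 = x1 + 0.1 * rng.normal(size=100)
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3"]

# Center each variable, then decompose: Xc = U @ diag(S) @ Vt.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

loadings = Vt.T          # column j holds the weights of the variables in PC j+1
scores = Xc @ loadings   # objects expressed in the new PC coordinates

# Reading down the first column tells you which variables PC1 "represents".
for name, w in zip(names, loadings[:, 0]):
    print(f"{name}: weight in PC1 = {w:+.2f}")
```

Plotting the first two columns of `loadings` as arrows gives the loading plot; overlaying the first two columns of `scores` as points gives the biplot.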

"If I understand correctly, the first 2 PCs will always be the most important to show the variation of the dataset"

Correct. This is the whole purpose of PCA. The method finds orthogonal axes that explain the most variation. It finds the first PC by finding a line in the original space along which the variation in the data is maximal. This line is the first principal component. Each subsequent PC is the line that again maximizes the variance, given that it must be orthogonal to all previous components. This is the graphical explanation; the matrix-operation formulation is equivalent and is what software uses.
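The matrix version of this picture can be sketched as an eigendecomposition of the covariance matrix. This is an illustration in Python with NumPy on made-up data, not what any particular package does internally. The eigenvectors are the orthogonal axes, and each eigenvalue is the variance of the data along its axis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (made up): 200 points with very different spread along
# three hidden orthogonal directions.
base = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.3])
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
X = base @ rotation.T

Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# eigh returns eigenvalues in ascending order, so reorder to put PC1 first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of the total variance carried by each PC, largest first.
explained = eigvals / eigvals.sum()
print("fraction of variance per PC:", np.round(explained, 3))

# The variance of the data projected onto each axis equals its eigenvalue.
projected_var = (Xc @ eigvecs).var(axis=0, ddof=1)
print("variance along each PC:", np.round(projected_var, 3))
```

If the first two entries of `explained` are small, a 2D plot of PC1 against PC2 hides most of the structure in the data.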

"how can I tie that back to the actual variables/features that I was mapping in the first place?"

Via the loading plot or biplot.

"if PCA does not do that, what is the best technique to use?"

The regression variant of PCA is PCR, Principal Component Regression. However, mind you, there is no single best way. You cannot say beforehand, nor on theoretical grounds, what the optimal approach is; it depends on your data, representation, preprocessing, etc.
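As a rough sketch of what PCR does, assuming a numeric response and made-up data: run PCA on the descriptors, keep the first few score columns, and fit ordinary least squares on those. The helper names `pcr_fit` and `pcr_predict` are hypothetical, written here in Python with NumPy; they are not the API of any package.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data (made up): 5 descriptors, response depends on the first two.
X = rng.normal(size=(150, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=150)

def pcr_fit(X, y, n_components):
    """PCA on X, then least squares of y on the first PC scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T                  # loadings of the retained PCs
    T = Xc @ V                               # scores on the retained PCs
    gamma, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)
    beta = V @ gamma                         # coefficients on original variables
    return x_mean, y_mean, beta

def pcr_predict(model, X):
    x_mean, y_mean, beta = model
    return (X - x_mean) @ beta + y_mean

for k in (1, 3, 5):
    model = pcr_fit(X, y, k)
    rmse = np.sqrt(np.mean((pcr_predict(model, X) - y) ** 2))
    print(f"{k} components: training RMSE = {rmse:.3f}")
```

With all components retained, PCR reduces to ordinary least squares. In practice the number of components would be chosen by cross-validation rather than by training error.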

"I have heard of "feature selection" but I am not very familiar with it."

There are very many feature selection methods. Step-forward selection, backward-elimination, genetic algorithms, just to name a few that do the selection independently from the modeling. Again, your best choice depends on your data.
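Step-forward selection, the first method named above, can be sketched as a greedy loop: start with no features, repeatedly add the one that most reduces the error on held-out data, and stop when nothing helps. The sketch below is Python with NumPy on made-up data and uses a plain least-squares model as a stand-in for whatever model you actually use. The function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data (made up): 6 candidate features, only 0 and 3 drive the response.
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] + 1.5 * X[:, 3] + 0.5 * rng.normal(size=300)

X_train, X_val = X[:200], X[200:]
y_train, y_val = y[:200], y[200:]

def validation_error(features):
    """Mean squared validation error of a least-squares fit on `features`."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train[:, features]])
    A_val = np.column_stack([np.ones(len(X_val)), X_val[:, features]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return np.mean((A_val @ coef - y_val) ** 2)

def forward_selection(n_features, max_selected):
    selected = []
    best_error = np.mean((y_val - y_train.mean()) ** 2)  # intercept-only baseline
    while len(selected) < max_selected:
        candidates = [j for j in range(n_features) if j not in selected]
        errors = {j: validation_error(selected + [j]) for j in candidates}
        best_j = min(errors, key=errors.get)
        if errors[best_j] >= best_error:   # no candidate improves the model
            break
        selected.append(best_j)
        best_error = errors[best_j]
    return selected

selected = forward_selection(X.shape[1], max_selected=3)
print("selected features, in order:", selected)
```

Backward elimination runs the same loop in reverse: start with all features and drop the one whose removal hurts least.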

"If anyone can elaborate on if/how it differs from PCA that would be very appreciated."

PCA does feature selection only in the sense that it decides how important a variable is for maximizing the variance in the independent data (your sequence descriptors). When employing feature selection, however, you are usually more interested in how important a variable is with respect to some dependent property (your structures).
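A small made-up example of this difference, in Python with NumPy: one descriptor has a large variance but no relation to the response, the other has a small variance and determines the response. PCA never sees `y`, so PC1 follows the high-variance descriptor. The names `noisy` and `informative` are illustrative only, and the descriptors are deliberately left unscaled to make the effect obvious.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 500
noisy = 10.0 * rng.normal(size=n)      # large variance, unrelated to the response
informative = rng.normal(size=n)       # small variance, drives the response
X = np.column_stack([noisy, informative])
y = informative + 0.1 * rng.normal(size=n)

# PCA on the descriptors alone; the response is not used anywhere here.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_weights = np.abs(Vt[0])

# A response-aware measure: absolute correlation of each descriptor with y.
corr_with_y = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(2)])

print("PC1 weights          (noisy, informative):", np.round(pc1_weights, 3))
print("|correlation with y| (noisy, informative):", np.round(corr_with_y, 3))
```

Standardizing the descriptors first would remove the scale effect, but PC1 would still be chosen without any reference to the response.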

"any suggestions for R libraries that will do the job are most welcome"

I would recommend the pls package, which provides Partial Least Squares Regression (PLSR) and Principal Component Regression (PCR).

score 9 · Answer 2 · 2011-02-24

PCA aims to capture the maximal variance in a dataset in a single variable, the first Principal Component. Variance that cannot be captured is then put in the 2nd PC and so on, thus your statement that

My understanding of the algorithm is that a set of correlated variables are represented as uncorrelated variables

is exactly right.

Now, PCA is one of the so-called "exploratory" techniques of multivariate analysis. You can easily plot a quick overview of the variance in your dataset, but only if it is captured in the first two PCs (for a 2D plot). PCA by itself cannot be used for classification, but there are PCA-based classification algorithms.

There is, however, a way to visualize how much your features contribute to the PCs, which is done by plotting columns of the loadings matrix (analogous to plotting the PCs from the scores matrix; see the example below, represented by the red arrows: the directions and magnitude in PC1 and PC2 are their respective contributions - again, this is a 2D plot so only the first PCs are shown).

[Figure: PCA biplot]

For the 2nd part, "feature selection" algorithms are also what you should be looking for. Generally, they run a classification algorithm (e.g. simple Bayesian statistics or SVMs) on a subset of your features and compare its performance to that on the whole set or on another subset in order to obtain the optimal features. There seem to be quite a few R libraries that are able to do that and, as always, there is also a Wikipedia article including helpful references.

edit: here are some articles employing SVMs (there are other possibilities as well!):

Very good introduction to classification using SVMs

Feature selection experiment for determining catalytic residues

R library for feature selection by penalizing

edit2: addressing Egon's point

score 4 · Answer 3 · 2011-02-24

13.9 years ago

Rajarshi Guha ▴ 880

If you're looking for variable importance, random forests can be useful. I use randomForest in R. The variable importance measure is derived by scrambling descriptors individually and looking at how predictive performance degrades. In that sense the importance is in the context of the predictive ability of the model.
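The scrambling idea can be sketched independently of random forests. The example below is Python with NumPy on made-up data and uses a least-squares model as a stand-in, so it shows the permutation step only, not the randomForest implementation. Each descriptor is shuffled in turn on held-out data, and the importance is the resulting increase in prediction error.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data (made up): 4 descriptors; the response uses only 0 and 2.
X = rng.normal(size=(400, 4))
y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + 0.2 * rng.normal(size=400)

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Stand-in model: least squares with an intercept (NOT a random forest).
A_train = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

def mse(X_eval):
    """Mean squared error of the fitted model on the held-out responses."""
    A = np.column_stack([np.ones(len(X_eval)), X_eval])
    return np.mean((A @ coef - y_test) ** 2)

baseline = mse(X_test)
importance = []
for j in range(X.shape[1]):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # scramble one descriptor
    importance.append(mse(X_perm) - baseline)      # how much the error grows

for j, imp in enumerate(importance):
    print(f"descriptor {j}: increase in MSE = {imp:.3f}")
```

A descriptor the model does not rely on gives an increase near zero, which is why this measure reflects importance for that particular model rather than for the data in general.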

ADD COMMENT • link 13.9 years ago by Rajarshi Guha ▴ 880

Rajarshi, please add a link for your favorite RF package; perhaps with a link to a vignette? I guess it is based on how often variables show up in the RF?

ADD REPLY • link 13.9 years ago by Egon Willighagen 5.4k

Ram · Answer 4 · 2011-02-24

3

13.9 years ago

Jeremy Leipzig 22k

The best layman's explanation of PCA I've read was recently posted on CrossValidated.

ADD COMMENT • link updated 5.3 years ago by Ram 44k • written 13.9 years ago by Jeremy Leipzig 22k

score 2 · Answer 5 · 2011-02-24

I'll second Rajarshi's comment on random forests (randomForest package in R). I think it will help with what you're after. I don't think there's a vignette included with the package, but here's a very short demo of randomForest.

This paper offers the best explanation I've come across of exactly what RF is doing.

This is a more lay explanation of what RF is doing.