Hello, I would appreciate comments/advice on when to use Principal Component Analysis (PCA) and what PCA output represents. My understanding of the algorithm is that a set of correlated variables is re-expressed as a set of uncorrelated variables (the principal components), from which one can derive an understanding of the variation in the data. However:
Once you have represented your data as a set of principal components, is there some way to determine which of the original features contribute to each principal component? In other words, what does each principal component (PC) actually represent? If I understand correctly, the first two PCs always capture the largest share of the variance in the dataset, but how can I tie that back to the actual variables/features that I measured in the first place?
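To make the question concrete, here is a minimal sketch of what I mean by tying PCs back to features. It is written in Python with NumPy rather than R, purely for illustration. The feature names and the data values are made up, and the approach (eigendecomposition of the covariance matrix of standardized features) is only one common way to compute PCA. Each column of `loadings` holds the weights of the original features in one PC; as far as I know, R's `prcomp` returns the equivalent matrix as `rotation`.

```python
import numpy as np

# Hypothetical toy data: 6 proteins x 3 features (values invented for illustration).
feature_names = ["hydrophobicity", "helix_propensity", "net_charge"]
X = np.array([
    [0.2, 0.3,  1.0],
    [0.4, 0.5, -1.0],
    [0.6, 0.7,  0.5],
    [0.8, 0.9, -0.5],
    [1.0, 1.2,  0.0],
    [1.2, 1.3,  0.2],
])

# Standardize each feature (zero mean, unit variance) so scale does not dominate.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via eigendecomposition of the covariance matrix of the standardized data.
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder so PC1 has the largest variance
eigvals = eigvals[order]
loadings = eigvecs[:, order]             # column k = feature weights of PC(k+1)

scores = Z @ loadings                    # the data expressed in PC coordinates
explained = eigvals / eigvals.sum()      # fraction of total variance per PC

# Report which original features weigh most heavily in each PC.
for k in range(loadings.shape[1]):
    weights = ", ".join(
        f"{name}: {w:+.2f}" for name, w in zip(feature_names, loadings[:, k])
    )
    print(f"PC{k + 1} ({explained[k]:.0%} of variance): {weights}")
```

A large absolute weight means that feature contributes strongly to that PC. Note that nothing in this computation uses the fold label, so a feature that dominates PC1 is not necessarily a good predictor of the fold.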
I wish to determine which features, from a set of features (e.g., hydrophobicity, amino acid composition, etc.), are the "best" for predicting whether a protein sequence will adopt a desired protein fold (a specific fold I have in mind). If PCA does not do that, what is the best technique to use? I have heard of "feature selection" but I am not very familiar with it. If anyone can elaborate on whether and how it differs from PCA, that would be much appreciated. Are there known examples (e.g., articles, reviews) in protein structure prediction that address this?
I intend to use R for this analysis, so any suggestions for R libraries that will do the job are most welcome! Thank you very much for your advice and responses!
-Deena
Thank you all very much for your fantastic and detailed responses!