Hi all,
I need to run a PCA on a microarray dataset to find out which features are the most relevant. How can I do it? Is it possible?
I tried with scikit-learn, but I was unable to come up with the relevant genes. I did it like this:
from sklearn.decomposition import PCA
import numpy as np
# X is the transposed matrix (n samples on the rows, m features on the columns), represented as a numpy array
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2], [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]])
# suppose we want the 2 most relevant features
nf = 2
pca = PCA(n_components=nf)
pca.fit(X)
X_proj = pca.transform(X)
With X
array([[-1, -2,  5,  1],
       [-3, -1,  1,  0],
       [-3, -2,  0,  2],
       [ 1,  1,  1,  3],
       [ 2,  1,  1,  4],
       [ 3,  2,  0,  5]])
it returns X_proj
array([[-2.9999967 ,  3.26498171],
       [-3.53939268, -1.18864266],
       [-2.77013188, -2.15637734],
       [ 1.67612209,  0.03059917],
       [ 2.87464655,  0.35674472],
       [ 4.75875261, -0.30730559]])
How can I tell which features were selected? Is there another way to do it (in R, for example)?
Thanks
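A note on the question above: PCA does not select features as such. Each principal component is a linear combination of all the original features, and the weights (loadings) are stored in the fitted model's components_ attribute. A minimal sketch of how one might rank the features by their contribution to each component, reusing the toy X from the question (the names loadings and ranking are just illustrative, and ranking by absolute loading is one possible heuristic, not the only one):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy matrix from the question: 6 samples (rows) x 4 features (columns).
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]])

pca = PCA(n_components=2).fit(X)

# components_ has shape (n_components, n_features): one row of loadings per PC.
loadings = pca.components_

# For each component, order the feature indices by absolute loading, largest first.
ranking = np.argsort(-np.abs(loadings), axis=1)

for k, order in enumerate(ranking):
    share = pca.explained_variance_ratio_[k]
    print(f"PC{k + 1} explains {share:.1%} of the variance; "
          f"features by |loading|: {order.tolist()}")
```

The feature indices printed first for each component are the columns of X (genes) that weigh most on it. With real microarray data the features are usually on different scales, so standardizing the columns before fitting is probably worth considering.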
Hi Giovanni,
Your answer and Brent's are very useful. I did not understand PCA well before reading the example from PSU.
Thanks
Just to add: what is the preferred way to correlate a feature (that is, a gene) with a principal component? Can I use Pearson correlation or cosine similarity, or are there constraints for this data type? Thanks
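For what it is worth, here is a minimal sketch of the Pearson option, assuming the same toy X as in the original question and using np.corrcoef between each feature column and each column of component scores (the names scores and corr are illustrative only):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy matrix from the question: 6 samples (rows) x 4 features (columns).
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)  # shape (n_samples, n_components)

n_features = X.shape[1]
n_components = scores.shape[1]

# corr[j, k] = Pearson correlation between feature j and the scores on PC k.
corr = np.empty((n_features, n_components))
for j in range(n_features):
    for k in range(n_components):
        corr[j, k] = np.corrcoef(X[:, j], scores[:, k])[0, 1]

print(np.round(corr, 3))
```

Since PCA centers the data, the component scores have zero mean, so the cosine similarity between a mean-centered feature and a component's scores is the same number as the Pearson correlation. The result is also just the loading rescaled: the loading times the square root of the component's variance, divided by the feature's standard deviation.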