I have a data set containing mutation data for each gene (mutation type, mutation site, and heterozygous/homozygous status) across several cell lines.
Now I want to find out which mutations contribute the most to the phenotypic differences between cell lines.
Would PCA give me some hints? And how should I deal with non-linear data points?
Thanks James. I don't have an external measure of phenotype in this case so maybe some clustering would be fine.
But I'm quite confused by what you said: "PCA and Kernel PCA are good for dimensionality reduction but they aren't great at explaining which features (e.g. mutations) contribute to the variance." In another project, I'm thinking of using PCA for dimensionality reduction on log2(FPKM+1) values, hoping to find the genes that contribute the most to a certain phenotypic difference across cell lines. Did you mean I can't rely on PCA or other dimensionality reduction algorithms (e.g. LDA and MDS) to pick out my candidate genes?
It's been a while since I've thought about LDA, so I'm rusty on that.
MDS is hard to interpret as it is a nonlinear method.
The PCA procedure is: 1) Center (and usually scale) the data matrix. 2) Compute the eigendecomposition of the covariance matrix of the centered and scaled data (equivalently, take the SVD of the centered and scaled matrix). 3) Multiply the centered and scaled matrix by the first k eigenvectors to get a lower-dimensional representation.
This won't give you candidate genes; it'll just give you a smaller data set to look through.
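For what it's worth, the three PCA steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not a recommended pipeline: the function name `pca_project`, the `scale` flag, and the random 6 x 20 matrix (standing in for cell lines x genes) are all made up for the example.

```python
import numpy as np

def pca_project(X, k, scale=True):
    """Project the rows of X (samples x features) onto the first k principal components."""
    X = np.asarray(X, dtype=float)

    # 1) Center each feature (column) and optionally scale it to unit variance.
    Z = X - X.mean(axis=0)
    if scale:
        std = Z.std(axis=0, ddof=1)
        std[std == 0] = 1.0  # constant features stay at zero instead of dividing by zero
        Z = Z / std

    # 2) Eigendecomposition of the covariance matrix of the centered/scaled data.
    cov = (Z.T @ Z) / (Z.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # reorder so the largest variance comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 3) Multiply by the first k eigenvectors to get the lower-dimensional representation.
    scores = Z @ eigvecs[:, :k]
    return scores, eigvecs[:, :k], eigvals[:k]

# Hypothetical data: 6 cell lines x 20 genes of random values, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 20))
scores, loadings, variances = pca_project(X, k=2)
print(scores.shape)
```

Here `scores` is the smaller data set (one row per cell line, one column per component). Each column of `loadings` mixes all of the original features, which is why the output on its own is not a list of candidate genes.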
If you can cluster the data, maybe look into gene set enrichment analysis (GSEA) to pick up differences.
Seems like I misunderstood what PCA can do for me...
Thank you!