How to conduct PCA (principal component analysis) on a set of mutation detection data/non-linear data?
2.3 years ago
Brisket ▴ 10

I have a dataset containing the mutation data for each gene (mutation type, mutation site, and heterozygous/homozygous status) across several cell lines.

Now I want to find out which mutations contribute the most to the phenotypic differences between the cell lines.

Would PCA give me some hints? And how should I deal with non-linear data points?

mutation PCA biostatistics exon • 1.2k views
2.3 years ago
James ▴ 30

Do you have an external measure of phenotype? That is, do you have corresponding response variables (e.g. treatment responsive/non-responsive)?

If so, try using a regularized regression method (e.g. lasso or elastic net) to identify the mutations that contribute the most.
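As a rough sketch of what a regularized regression could look like here: the example below fits a lasso (one such method, not necessarily the best one for your data) by proximal gradient descent using only NumPy. The binary cell-line-by-mutation matrix, the continuous phenotype score, the penalty `alpha`, and the helper names are all assumptions made up for illustration; for a responsive/non-responsive label you would want a penalized logistic regression instead.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t (the proximal step for the L1 penalty)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, alpha, n_iter=5000):
    """Minimize (1/(2n)) * ||y - Xb||^2 + alpha * ||b||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)   # centering stands in for an intercept
    yc = y - y.mean()
    # Step size 1/L, with L the Lipschitz constant of the gradient of the squared-error term.
    L = np.linalg.norm(Xc, 2) ** 2 / n
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = Xc.T @ (Xc @ b - yc) / n
        b = soft_threshold(b - grad / L, alpha / L)
    return b

# Hypothetical toy data: 40 cell lines x 20 mutations (1 = mutated, 0 = wild type).
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(40, 20)).astype(float)
# Assumed phenotype score, driven by mutations 3 and 7 plus a little noise.
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + rng.normal(0.0, 0.1, size=40)

coef = lasso_ista(X, y, alpha=0.1)
ranked = np.argsort(-np.abs(coef))  # mutations ordered by absolute coefficient
print(ranked[:5], coef[ranked[:5]])
```

The L1 penalty pushes most coefficients to exactly zero, so the mutations left with nonzero coefficients are the candidates worth a closer look. In practice `alpha` would be chosen by cross-validation.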

If not, the answer is "not really." PCA and its nonlinear cousin, kernel PCA, are good for dimensionality reduction, but they aren't great at explaining which features (e.g. mutations) contribute to the variance, because each component is a weighted mix of all the features.

A better method might be to cluster the data and then do some exploratory analysis of the mutations in different clusters.
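Here is a minimal sketch of the cluster-then-explore idea, assuming the mutations have been encoded as a binary cell-line-by-mutation matrix. The Jaccard distance, the naive average-linkage clustering, the choice of two clusters, and the toy matrix are all illustrative assumptions rather than a recommendation.

```python
import numpy as np

def jaccard_distances(X):
    """Pairwise Jaccard distance between the rows of a binary matrix."""
    Xb = X.astype(bool)
    inter = (Xb[:, None, :] & Xb[None, :, :]).sum(axis=2)
    union = (Xb[:, None, :] | Xb[None, :, :]).sum(axis=2)
    # Rows with no mutations at all have an empty union; they get distance 1 here.
    return 1.0 - inter / np.maximum(union, 1)

def average_linkage(D, n_clusters):
    """Naive agglomerative clustering: merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)   # b > a, so index a is still valid
        clusters[a].extend(merged)
    labels = np.empty(len(D), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Hypothetical toy data: 12 cell lines x 15 mutations, two groups with different mutations.
X = np.zeros((12, 15))
X[:6, :5] = 1      # first six lines carry mutations 0-4
X[6:, 5:10] = 1    # last six lines carry mutations 5-9
X[0, 12] = 1       # a few private or missing mutations as noise
X[7, 13] = 1
X[3, 2] = 0

labels = average_linkage(jaccard_distances(X), n_clusters=2)

# Exploratory step: per-cluster mutation frequency, and where the clusters differ most.
freq = np.array([X[labels == k].mean(axis=0) for k in range(2)])
diff = np.abs(freq[0] - freq[1])
top = np.argsort(-diff)
print(labels)
print(top[:10], diff[top[:10]])
```

Mutations with a large frequency difference between clusters are the ones to inspect first. With real data you would also want to check how stable the clusters are before reading much into them.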


Thanks James. I don't have an external measure of phenotype in this case, so maybe some clustering would be fine.

But I'm quite confused by what you said: "PCA and Kernel PCA are good for dimensionality reduction but they aren't great at explaining which features (e.g. mutations) contribute to the variance." In another project, I'm thinking of using PCA for dimension reduction on log2(FPKM+1) values, hoping to find the genes that contribute the most to certain phenotypic differences across cell lines. Did you mean I can't rely on PCA or other dimension reduction algorithms (e.g. LDA and MDS) to pick out my candidate genes?


It's been a while since I've thought about LDA, so I'm rusty on that.

MDS is hard to interpret: it works from pairwise distances rather than the features themselves, and its non-metric variants are nonlinear.

The PCA procedure is: 1) center (and usually scale) the data matrix; 2) compute the eigendecomposition of the covariance matrix of the centered and scaled data; 3) multiply the centered and scaled matrix by the first k eigenvectors to get a lower-dimensional representation.
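Those three steps can be sketched in NumPy roughly as follows. The function name `pca`, the toy matrix, and the handling of constant columns are my own assumptions for illustration.

```python
import numpy as np

def pca(X, k):
    """Project X (samples x features) onto its first k principal components."""
    # 1) Center each feature and scale it to unit variance.
    Xc = X - X.mean(axis=0)
    std = Xc.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant features (e.g. a mutation seen nowhere) stay at zero
    Z = Xc / std
    # 2) Eigendecomposition of the covariance matrix of the standardized data.
    cov = Z.T @ Z / (Z.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 3) Multiply by the first k eigenvectors to get the lower-dimensional scores.
    scores = Z @ eigvecs[:, :k]
    return scores, eigvecs[:, :k], eigvals[:k]

# Hypothetical toy data: 12 cell lines x 30 features.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(12, 30)).astype(float)

scores, components, variances = pca(X, k=2)
print(scores.shape, components.shape, variances)
```

The columns of `components` are the loadings: each one assigns a weight to every input feature, which is why a single component rarely points at one clean candidate.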

This won't give you candidate genes; it'll just give you a smaller data set to look through.

If you can cluster the data, maybe look into gene set enrichment analysis (GSEA) to pick up differences.


Seems like I misunderstood what PCA can do for me...

Thank you!
