Question

How to conduct PCA (principal component analysis) on a set of mutation detection data/non-linear data?

0

2.3 years ago

Brisket ▴ 10

I have a data set recording the mutations in each gene (mutation type, mutation site, and whether the mutation is heterozygous or homozygous) across several cell lines.

Now I want to find out which mutations contribute the most to the phenotypic differences between the cell lines.

Would PCA give me some hints? And how should I deal with non-linear data points?

mutation PCA biostatistics exon • 1.2k views

updated 2.1 years ago by Ram 44k • written 2.3 years ago by Brisket ▴ 10

score 2 · Answer 1 · 2022-08-17

2

2.3 years ago

James ▴ 30

Do you have an external measure of phenotype? That is, do you have corresponding response variables (e.g. treatment responsive/non-responsive)?

If so, try a regularized regression method (e.g. lasso or elastic net) to identify the mutations that contribute the most to that response.

If not, the answer is "not really." PCA and its nonlinear cousin Kernel PCA are good for dimensionality reduction but they aren't great at explaining which features (e.g. mutations) contribute to the variance.

A better method might be to cluster the data and then do some exploratory analysis of the mutations in different clusters.
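For the case where a phenotype label is available, the regularized-regression suggestion could be sketched roughly as below. This is only an illustration under my own assumptions: the 0/1 mutation matrix, the phenotype labels, the function name `l1_logistic`, and the penalty strength are all hypothetical, and the solver is a plain proximal-gradient (ISTA) loop written with NumPy rather than an established implementation such as scikit-learn or glmnet, which you would normally use on real data.

```python
import numpy as np

def l1_logistic(X, y, lam=0.05, n_iter=2000):
    """L1-penalized logistic regression fitted by proximal gradient (ISTA).

    X: samples x features matrix (here: cell lines x mutations, coded 0/1).
    y: binary phenotype labels (0/1). The intercept is not penalized.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    # Step size 1/L, where L bounds the curvature of the mean logistic loss.
    A = np.hstack([X, np.ones((n, 1))])
    step = 4.0 * n / (np.linalg.norm(A, 2) ** 2)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad_w = X.T @ (prob - y) / n
        grad_b = np.mean(prob - y)
        # Gradient step, then soft-thresholding (the proximal map of the L1 penalty).
        w = w - step * grad_w
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)
        b = b - step * grad_b
    return w, b

# Hypothetical toy data: 8 cell lines x 5 mutations (1 = mutated).
X = np.array([
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1],
])
# Assumed phenotype: in this made-up example it simply follows mutation 0.
y = X[:, 0].copy()

w, b = l1_logistic(X, y)
ranking = np.argsort(-np.abs(w))  # mutations ordered by absolute coefficient
print("coefficients:", np.round(w, 3))
print("mutation ranking:", ranking)
```

The L1 penalty pushes the coefficients of uninformative mutations toward zero, so the mutations that keep a large absolute coefficient are the candidates worth a closer look. With only a handful of cell lines the ranking will be unstable, so treat it as a starting point for follow-up rather than a result.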

2.3 years ago by James ▴ 30

1

Thanks James. I don't have an external measure of phenotype in this case so maybe some clustering would be fine.

But I'm quite confused about what you meant by "PCA and Kernel PCA are good for dimensionality reduction but they aren't great at explaining which features (e.g. mutations) contribute to the variance." In another project, I'm thinking of using PCA for dimension reduction on log2(FPKM+1) values, hoping to find the genes that contribute the most to certain phenotype differences across cell lines. Did you mean I can't rely on PCA or other dimension reduction algorithms (e.g. LDA and MDS) to pick out my candidate genes?

2.3 years ago by Brisket ▴ 10

0

It's been a while since I've thought about LDA, so I'm rusty on that.

MDS is hard to interpret as it is a nonlinear method.

The PCA procedure is: 1) Center (and usually scale) the data matrix 2) Compute the eigendecomposition of the covariance matrix of the centered and scaled data 3) Multiply the centered and scaled matrix by the first k eigenvectors to get a lower dimensional representation.

This won't give you candidate genes; it'll just give you a smaller data set to look through.
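In case it helps, here is a minimal NumPy sketch of the three PCA steps listed above. The function name `pca` and the random toy matrix are my own placeholders, and I decompose the covariance matrix of the centered and scaled data, which is the usual way to carry out step 2.

```python
import numpy as np

def pca(X, k, scale=True):
    """Project the rows of X (samples x features) onto the first k principal components."""
    X = np.asarray(X, dtype=float)
    # Step 1: center each feature and (optionally) scale it to unit variance.
    Z = X - X.mean(axis=0)
    if scale:
        sd = Z.std(axis=0, ddof=1)
        sd[sd == 0] = 1.0  # constant features stay at zero instead of dividing by 0
        Z = Z / sd
    # Step 2: eigendecomposition of the covariance matrix of the processed data.
    cov = (Z.T @ Z) / (Z.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
    components = eigvecs[:, order]
    # Step 3: multiply the processed matrix by the first k eigenvectors.
    scores = Z @ components
    return scores, components, eigvals[order]

# Hypothetical toy data: 6 cell lines x 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
scores, components, variances = pca(X, k=2)
print(scores.shape)      # lower dimensional representation of the cell lines
print(components.shape)  # one column of feature weights per component
```

The `scores` matrix is the "smaller data set" mentioned above: one row per cell line, k columns.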

If you can cluster the data, maybe look into gene set enrichment analysis (GSEA) to pick up differences.

2.3 years ago by James ▴ 30

0

Seems like I misunderstood what PCA can do for me...

Thank you!

2.3 years ago by Brisket ▴ 10