Question

Interpretation of Principal component analysis

1


4.2 years ago

xqyn ▴ 60

Hi all,

I understand that in PCA, PC1 and PC2 explain the two largest shares of variance among the data points in a dataset. Do we have any threshold for the percentage of variance explained to conclude that PC1 is 'good' or 'bad'? What is the relation between the percentage of variance explained and clustering?

In one particular case, a dataset shows 2 clearly separated clusters with PC1 explaining ~20% of variance, while another shows 2 clusters that are not really separated but with PC1 explaining ~70%. Can we conclude that one is more trustworthy than the other?

[Image: the two PCA plots being compared]

Thanks for your help!

clustering • 15k views


score 5 · Accepted Answer · 2021-03-25



4.2 years ago

Mensur Dlakic ★ 29k

PCA is not a clustering technique; its purpose is dimensionality reduction. In many cases data points end up grouping into clusters after dimensionality reduction, so it is easier to see that they are related, but that is a secondary use of PCA. In the same way, the purpose of autoencoders is not clustering, but their latent representations are useful for clustering.

The more variance the principal components explain, the better PCA serves its intended purpose of dimensionality reduction. So PC1 and PC2 explaining 20% and 15% of variance would be an inferior solution to PC1 and PC2 explaining 70% and 25%, respectively. In the former case you would need more than 2 PCs to confidently represent your original data, while in the latter PC1 and PC2 would most likely be enough. However, it can happen, exactly as you showed, that a solution with superior PCs (on the right) gives less clean clusters than an inferior PCA solution. That has to do with the intrinsic separability of the data points, or whether they are intrinsically clusterable, if you will. Not all data will give clearly separated clusters, even when PCA is able to explain most or all of the variance with only 2 components.
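To make the point about variance explained versus separability concrete, here is a minimal sketch on made-up data (not the poster's dataset). Everything in it is an assumption for illustration: the helper name `pca_scores_and_ratio`, the sample size, the 50 features, and the shifts applied to the two groups. Dataset A is built so that PC1 explains well under half of the variance while the groups sit far apart; dataset B is built so that PC1 explains most of the variance while the groups overlap.

```python
import numpy as np

def pca_scores_and_ratio(X):
    """Return PC scores and the fraction of variance explained by each PC."""
    Xc = X - X.mean(axis=0)                        # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * S                                 # projections onto the PCs
    ratio = S**2 / np.sum(S**2)                    # variance fraction per PC
    return scores, ratio

rng = np.random.default_rng(0)                     # fixed seed, illustrative only
n = 200
labels = np.repeat([0, 1], n // 2)                 # two hypothetical groups

# Dataset A: groups far apart along one feature, plus 49 pure-noise features.
A = rng.normal(size=(n, 50))
A[:, 0] += 8.0 * labels

# Dataset B: one feature with very large variance shared by both groups,
# and only a tiny shift between the groups.
B = rng.normal(size=(n, 50))
B[:, 0] = 10.0 * rng.normal(size=n) + 0.5 * labels

results = {}
for name, X in [("A", A), ("B", B)]:
    scores, ratio = pca_scores_and_ratio(X)
    pc1 = scores[:, 0]
    # Distance between group means on PC1, in units of within-group SD.
    gap = abs(pc1[labels == 1].mean() - pc1[labels == 0].mean())
    spread = np.sqrt((pc1[labels == 0].var() + pc1[labels == 1].var()) / 2)
    results[name] = (ratio[0], gap / spread)
    print(f"{name}: PC1 explains {ratio[0]:.0%}, "
          f"group separation on PC1 = {gap / spread:.1f} SDs")
```

The separation score used here is just one simple choice; any measure of cluster quality would make the same point, namely that it is a property of the data and not of the percentage on the axis label.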


0


Thank you. Do we have any threshold for percentages of variances that we can trust? For example, if PCA results in PC1 and PC2 explaining (only) 20% and 15% of variance even with 2 clean clusters, can we claim that there are two distinct cell types?

Reply • 4.2 years ago by xqyn ▴ 60

3


"Do we have any threshold for percentages of variances that we can trust?"

It depends on what you want to do with PCA results. When PCA is used for dimensionality reduction - which is its intended purpose - for any downstream analysis I would take enough PCs to explain at least 90-95% of data variance.
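As a rough sketch of how that rule of thumb could be applied, the snippet below picks the smallest number of PCs whose cumulative explained variance reaches a chosen threshold. The function name `n_components_for_variance` and the toy data are assumptions for illustration only, and it uses plain NumPy rather than any particular PCA package.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.95):
    """Smallest number of PCs whose cumulative explained variance >= threshold."""
    Xc = X - X.mean(axis=0)                        # center each feature
    S = np.linalg.svd(Xc, compute_uv=False)        # singular values only
    ratio = S**2 / np.sum(S**2)                    # variance fraction per PC
    cumulative = np.cumsum(ratio)
    # searchsorted returns the first index where cumulative >= threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(ratio)), cumulative          # guard against rounding

# Toy data: two high-variance features and eight low-variance ones.
rng = np.random.default_rng(0)
scales = np.array([10.0, 5.0] + [1.0] * 8)
X = rng.normal(size=(500, 10)) * scales

k90, cumulative = n_components_for_variance(X, 0.90)
k95, _ = n_components_for_variance(X, 0.95)
print("PCs needed for 90%:", k90)
print("PCs needed for 95%:", k95)
print("cumulative variance:", np.round(cumulative, 3))
```

If you use scikit-learn, passing a fraction such as `PCA(n_components=0.95)` asks for the same kind of selection, so the manual version is mainly useful for seeing what is going on.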

There is a greater chance that data points will form cleaner clusters in a 2D plot if the first two PCs explain 70% of variance than when they combine for 35%, but it is possible to get poorly separated clusters even when the first two PCs combine for 70% of variance. Like I said, the separation of points in a PCA plot is an intrinsic property of the data (e.g., how many informative features are present, the distribution of classes/clusters, etc.). If your data points form clean clusters when the first two PCs explain only 35% of variance, it simply means that those groups of data points are so distinct that even two PCs that combine for relatively low variance are sufficient to separate them. I would take it.

The opposite can happen, as you implied in your original post. Below I show a PCA plot where the first two PCs explain almost 93% of variance, yet that doesn't guarantee clean cluster separation.

[Image: PCA plot in which the first two PCs explain almost 93% of variance but the clusters are not cleanly separated]

Reply • 4.2 years ago by Mensur Dlakic ★ 29k

0


That's clear! Thank you so much.

Reply • 4.2 years ago by xqyn ▴ 60