Hi all,
I understand that in PCA, PC1 and PC2 capture the largest and second-largest shares of the variance in a dataset. Is there a threshold on the percentage of variance explained for concluding that PC1 is 'good' or 'bad'? How does the percentage of variance relate to clustering?
As a particular case, suppose one dataset shows 2 clearly separated clusters with PC1 explaining ~20% of variance, while another shows 2 clusters that are not really separated but with PC1 explaining ~70%. Can we conclude that one is more trustworthy than the other?
Thanks for your help!
Thank you. Is there a threshold on the percentage of variance explained that we can trust? For example, if PCA results in PC1 and PC2 explaining (only) 20% and 15% of variance even with 2 clean clusters, can we claim that there are two distinct cell types?
It depends on what you want to do with PCA results. When PCA is used for dimensionality reduction - which is its intended purpose - for any downstream analysis I would take enough PCs to explain at least 90-95% of data variance.
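As a rough sketch of how one could pick the number of PCs for a given variance target, assuming NumPy and a samples-by-features matrix `X` (the function names and the default 0.90 cutoff are illustrative choices, not a standard):

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each PC, from the SVD of centered data."""
    Xc = X - X.mean(axis=0)                  # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, largest first
    var = s ** 2                             # proportional to per-PC variance
    return var / var.sum()

def n_components_for_variance(X, threshold=0.90):
    """Smallest number of leading PCs whose cumulative variance reaches `threshold`."""
    cumulative = np.cumsum(explained_variance_ratio(X))
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, cumulative.size)           # guard against floating-point shortfall
```

Calling `n_components_for_variance(X, 0.95)` then gives the number of PCs to carry into downstream analysis under that cutoff.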
There is a greater chance that data points will form cleaner clusters in a 2D plot if the first two PCs explain 70% of variance than when they combine for 35% variance, but it is possible to get poorly separated clusters even when the first two PCs combine for 70% of variance. Like I said, the separation of points in a PCA plot is the intrinsic property of data (e.g., how many informative features are present in data, the distribution of classes/clusters, etc). If your data points form clean clusters when the first two PCs explain only 35% of variance, it simply means that those groups of data points are so distinct that even two PCs that combine for relatively low variance are sufficient to separate them. I would take it.
The opposite can happen, as you implied by your original post. Below I show a PCA plot where the first two PCs explain almost 93% of variance, yet that doesn't guarantee clean cluster separation.
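Both situations can be reproduced on made-up data. The sketch below assumes NumPy and uses synthetic Gaussian data only (it is not the data behind any plot in this thread); the group sizes, feature counts, and shifts are arbitrary choices for illustration. Case A has two groups that differ strongly in a few features buried among many noisy ones, so PC1 and PC2 explain little variance but PC1 still splits the groups. Case B is dominated by two high-variance factors unrelated to the groups, so PC1 and PC2 explain most of the variance but the groups overlap.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Return the leading PC scores and the explained-variance ratio of every PC."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = s ** 2 / np.sum(s ** 2)
    return U[:, :n_components] * s[:n_components], ratio

def separated_on(scores, labels):
    """True if the two groups do not overlap at all along this score axis."""
    a, b = scores[labels == 0], scores[labels == 1]
    return bool(a.max() < b.min() or b.max() < a.min())

rng = np.random.default_rng(0)
n = 100                                  # points per group (arbitrary)
labels = np.repeat([0, 1], n)

# Case A: 200 noisy features; the groups differ strongly in only 10 of them.
X_a = rng.normal(size=(2 * n, 200))
X_a[labels == 1, :10] += 4.0
scores_a, ratio_a = pca_scores(X_a)

# Case B: 20 features driven by two shared high-variance factors that are
# unrelated to group membership; the groups differ slightly in one feature.
factors = rng.normal(scale=10.0, size=(2 * n, 2))
loadings, _ = np.linalg.qr(rng.normal(size=(20, 2)))  # orthonormal columns
X_b = factors @ loadings.T + rng.normal(size=(2 * n, 20))
X_b[labels == 1, 0] += 1.0
scores_b, ratio_b = pca_scores(X_b)

for name, scores, ratio in [("A", scores_a, ratio_a), ("B", scores_b, ratio_b)]:
    print(f"Case {name}: PC1+PC2 explain {ratio[:2].sum():.0%}, "
          f"groups fully separated on PC1: {separated_on(scores[:, 0], labels)}")
```

The point of the comparison is that the variance percentages describe how much of the total spread the plot shows, not whether that spread has anything to do with the groups.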
That's clear! Thank you so much.