This might be a stupid question, but if I've already done PCA on the genes from my data (genes as rows and cancer/normal samples as columns), as well as k-means clustering with k=4, is there any way to do a scatterplot in R with the first two PCs, using different colors for the two sample types (cancer vs normal) and different shapes to indicate the 4 clusters? Or would I need to consider the first four PCs for the plot (and then, what function would plot a 4D figure)? Again, I apologize if this is poorly phrased or unclear; I am just now learning PCA and clustering. Thank you for any feedback or examples of how I might achieve this!
Oh sure, sorry about that. Here is my code:
This tutorial is very good and should answer your questions: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/#compute-pca-in-r-using-prcomp
I see, I mean, a simple scatterplot that you want can be done by just subsetting / dividing your data and then using plot(). For colour, you could assign a vector to col, and, for shape, you would assign a vector to pch. Maybe I'm not visualising exactly what you want to do. You can do scatterplots in ggplot2, too, via geom_point().
You may also find Amar's link of use.
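To make the col / pch suggestion concrete, here is a minimal base R sketch. Everything in it is made up for illustration: the simulated matrix `expr`, the `type` labels, and the assumption that each point in the plot is a sample (so the matrix is transposed before PCA and k-means). Adapt it to your own objects.

```r
set.seed(1)

# Simulated expression matrix (hypothetical): 200 genes (rows) x 20 samples (columns)
expr <- matrix(rnorm(200 * 20), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("s", 1:20)))
type <- factor(rep(c("cancer", "normal"), each = 10))

# Transpose so that samples are rows: one point per sample in the plot
pca <- prcomp(t(expr), scale. = TRUE)
km  <- kmeans(t(expr), centers = 4, nstart = 10)

# One colour per sample type, one plotting symbol per k-means cluster
cols   <- c(cancer = "firebrick", normal = "steelblue")[as.character(type)]
shapes <- c(15, 16, 17, 18)[km$cluster]

plot(pca$x[, 1], pca$x[, 2], col = cols, pch = shapes,
     xlab = "PC1", ylab = "PC2")
legend("topright",
       legend = c(levels(type), paste("cluster", 1:4)),
       col    = c("firebrick", "steelblue", rep("black", 4)),
       pch    = c(16, 16, 15, 16, 17, 18))
```

Only PC1 and PC2 are plotted here; the four clusters show up as four symbols on the same two axes.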
Right, I already did a scatterplot with ggplot, for the first two principal components.
I guess my question was, and what I am trying to figure out is, is there any connection between the number of clusters I use for kmeans and the results of PCA? That is, is there any rule saying that if I used k=4, instead of k=2 for clustering, then my scatterplot for PCA has to be of the first 4 PCs, instead of just 2 PCs?
More specifically, if I do kmeans with k=4 and then do PCA on genes, and I want a scatterplot of the PCA results using different colors for the two sample types and different shapes for the different clusters, does that mean that I have to use the first four principal components, given that I have 4 clusters? Or can I just draw the first two principal components? But then, if I draw the first 2 PCs, how would I get four different shapes for the four clusters on the same scatterplot? I'm very confused about that. I hope I managed to explain this a bit more clearly this time.
There would not necessarily be any connection between the k-means clusters and the clusters that you see from a PCA bi-plot. So, k=4 is not equivalent in any way to PCs 1 to 4.
In your above code, you could bring k$cluster into your ggplot2 input data and then supply it to the shape aesthetic. I would expect the k-means clusters to be more or less brought out on PC1 vs PC2, but not necessarily so.
I'm now plugging my own package here but you could also do PCA using PCAtools and pass the k-means cluster assignments as your metadata, and then correlate these [k-means clusters] back to each PC via this function: 4.5 Correlate the principal components back to the clinical data
That would essentially tell you which PC, in particular, statistically significantly correlates to the k-means clusters.
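As a rough sketch of the ggplot2 route, assuming simulated data and hypothetical names (`expr`, `type`, `df`), and assuming samples are the points being plotted: the cluster assignment goes into the data frame as a factor and is mapped to shape, while sample type is mapped to colour. The last lines are only a crude base R stand-in for the idea of relating clusters to individual PCs (R-squared of each PC regressed on cluster); they are not the PCAtools function mentioned above.

```r
library(ggplot2)
set.seed(1)

# Simulated expression matrix (hypothetical): 200 genes x 20 samples
expr <- matrix(rnorm(200 * 20), nrow = 200)
colnames(expr) <- paste0("s", 1:20)
type <- rep(c("cancer", "normal"), each = 10)

pca <- prcomp(t(expr))                        # samples as rows
k   <- kmeans(t(expr), centers = 4, nstart = 10)

# Bring k$cluster into the plotting data as a factor
df <- data.frame(PC1     = pca$x[, 1],
                 PC2     = pca$x[, 2],
                 type    = type,
                 cluster = factor(k$cluster, levels = 1:4))

# Colour = sample type, shape = k-means cluster, still only two PCs
p <- ggplot(df, aes(x = PC1, y = PC2, colour = type, shape = cluster)) +
  geom_point(size = 3)
print(p)

# Crude check of which PCs track the clusters: R^2 of each PC on cluster
r2 <- sapply(1:4, function(i) summary(lm(pca$x[, i] ~ df$cluster))$r.squared)
names(r2) <- paste0("PC", 1:4)
print(round(r2, 2))
```

The number of shapes comes from the number of clusters, not from the number of PCs, so k=4 works on a PC1 vs PC2 plot just as well as k=2 would.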
Oh I see what you're saying. I'll look into this and play around with some options. Thank you for the link by the way, it's really neat and helpful!