Question

PCA scatterplot with different shapes for different clusters

0

5.8 years ago

myyid68 ▴ 30

This might be a stupid question, but I've already done PCA on the genes from my data (genes as rows and cancer/normal samples as columns), as well as k-means clustering with k=4. Is there any way to do a scatterplot in R of the first two PCs, using different colors for the two sample types (cancer vs normal) and different shapes for the 4 clusters? Or would I need to consider the first four PCs for the plot (and, if so, what function would plot a 4D figure)? Again, I apologize if this is poorly phrased or unclear; I am just now learning PCA and clustering. Thank you for any feedback or examples of how I might achieve this!

rna-seq R clustering PCA • 13k views

ADD COMMENT • link 5.8 years ago by myyid68 ▴ 30

Kevin Blighe · Answer 1 · 2019-11-05

2

5.8 years ago

Kevin Blighe 89k

There are many ways to perform PCA in R. Please show the exact code that you used for both PCA and k-means.

ADD COMMENT • link 5.8 years ago by Kevin Blighe 89k

0

Oh sure, sorry about that. Here is my code:

library(ggplot2)

# k-means on the genes-by-samples matrix (this clusters the rows, i.e. the genes)
k <- kmeans(df, 4)
data1 <- cbind(df, cluster = k$cluster)

# PCA on the transposed matrix, so the samples are the observations
pca <- prcomp(t(df), scale. = TRUE)

# percentage of variance explained by each PC
pca.var <- pca$sdev^2
pca.var.per <- round(pca.var / sum(pca.var) * 100, 1)

pca.data <- data.frame(Sample = rownames(pca$x), X = pca$x[, 1], Y = pca$x[, 2])
# samples 1-10 are cancer, samples 11-20 are normal
pca.data$group <- rep("cancer", 20)
pca.data$group[11:20] <- rep("normal", 10)

ggplot(data = pca.data, aes(x = X, y = Y, label = group, colour = group, shape = group)) +
  geom_point(size = 2, stroke = 1, alpha = 0.8) +
  xlab(paste("First principal component - ", pca.var.per[1], "%", sep = "")) +
  ylab(paste("Second principal component - ", pca.var.per[2], "%", sep = "")) +
  theme_bw() +
  ggtitle("Scatterplot")

ADD REPLY • link updated 5.8 years ago by Kevin Blighe 89k • written 5.8 years ago by myyid68 ▴ 30

1

This tutorial is very good and should answer your questions: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/#compute-pca-in-r-using-prcomp

ADD REPLY • link 5.8 years ago by Mark ★ 1.7k

1

I see, I mean, a simple scatterplot that you want can be done by just subsetting / dividing your data and then using plot(). For colour, you could assign a vector to col, and, for shape, you would assign a vector to pch. Maybe I'm not visualising exactly what you want to do.
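The col / pch idea can be sketched in base R roughly as follows. This is only a minimal illustration on simulated data: the matrix `df`, the 10 cancer / 10 normal layout, and the colour and symbol choices are all assumptions. It also assumes that k-means is run on the samples (the transposed matrix), so that every plotted point has a cluster label.

```r
# Simulated stand-in data (an assumption): 100 genes (rows) x 20 samples (columns).
set.seed(1)
df <- matrix(rnorm(100 * 20), nrow = 100,
             dimnames = list(paste0("gene", 1:100), paste0("S", 1:20)))

# PCA on samples: transpose so that samples are the rows.
pca <- prcomp(t(df), scale. = TRUE)

# Assumed layout: first 10 samples are cancer, last 10 are normal.
group <- factor(rep(c("cancer", "normal"), each = 10))

# Cluster the samples (not the genes) so each plotted point gets a cluster.
km <- kmeans(t(df), centers = 4, nstart = 25)

# One colour per sample type, one plotting symbol per cluster.
cols <- c(cancer = "firebrick", normal = "steelblue")[as.character(group)]
pchs <- c(15, 16, 17, 18)[km$cluster]

plot(pca$x[, 1], pca$x[, 2], col = cols, pch = pchs,
     xlab = "PC1", ylab = "PC2", main = "PC1 vs PC2")
legend("topright",
       legend = c(levels(group), paste("cluster", 1:4)),
       col = c("firebrick", "steelblue", rep("black", 4)),
       pch = c(16, 16, 15, 16, 17, 18))
```

Both vectors have one entry per sample, in the same order as the rows of `pca$x`, which is what lets plot() pair each point with its colour and symbol.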

You can do scatterplots in ggplot2, too, via geom_point()

You may also find Amar's link of use.

ADD REPLY • link 5.8 years ago by Kevin Blighe 89k

0

Right, I already did a scatterplot with ggplot, for the first two principal components.

I guess my question was, and what I am trying to figure out is, is there any connection between the number of clusters I use for kmeans and the results of PCA? That is, is there any rule saying that if I used k=4, instead of k=2 for clustering, then my scatterplot for PCA has to be of the first 4 PCs, instead of just 2 PCs?

More specifically, if I do kmeans with k=4 and then do PCA on genes and I want a scatterplot of PCA results using different colors for the two different sample types and different shapes for the different clusters, does that mean that I have to use the first four principal components, given that I have 4 clusters? Or can I just draw the first two principal components? But then, if I draw the first 2 PCs, how would I get four different shapes for the four clusters on the same scatterplot? I'm very confused about that. I hope I managed to explain this a bit more clearly this time.

ADD REPLY • link 5.8 years ago by myyid68 ▴ 30

0

There would not necessarily be any connection between the k-means clusters and the clusters that you see from a PCA bi-plot. So, k=4 is not equivalent in any way to PCs 1 to 4.

In your above code, you could bring k$cluster into your ggplot2 input data and then supply it to the shape aesthetic. I would expect the k-means clusters to be more or less brought out on PC1 vs PC2, but not necessarily so.
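As a rough sketch of that suggestion, the example below adds the cluster assignment to the plotting data frame and maps it to shape, with sample type mapped to colour. The data are simulated, and the sample layout is an assumption. Note also one deliberate change from the code posted above: k-means is run on `t(df)`, so that it clusters samples. Clusters of genes could not be attached to points that represent samples.

```r
# Simulated stand-in data (an assumption): 100 genes (rows) x 20 samples (columns).
set.seed(1)
df <- matrix(rnorm(100 * 20), nrow = 100,
             dimnames = list(paste0("gene", 1:100), paste0("S", 1:20)))

pca <- prcomp(t(df), scale. = TRUE)            # samples as observations
km  <- kmeans(t(df), centers = 4, nstart = 25) # cluster the samples, k = 4

pca.data <- data.frame(
  Sample  = rownames(pca$x),
  X       = pca$x[, 1],
  Y       = pca$x[, 2],
  # assumed layout: first 10 samples cancer, last 10 normal
  group   = rep(c("cancer", "normal"), each = 10),
  # match clusters to samples by name; a factor gives discrete shapes
  cluster = factor(km$cluster[rownames(pca$x)])
)

# Draw the plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(pca.data, aes(x = X, y = Y, colour = group, shape = cluster)) +
    geom_point(size = 3) +
    xlab("PC1") +
    ylab("PC2") +
    theme_bw()
  print(p)
}
```

Four clusters therefore show up as four shapes on a single two-dimensional plot of PC1 vs PC2; the number of clusters places no constraint on how many PCs are drawn.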

I'm now plugging my own package here, but you could also do PCA using PCAtools, pass the k-means cluster assignments as your metadata, and then correlate these k-means clusters back to each PC, as described in the section "4.5 Correlate the principal components back to the clinical data".

That would essentially tell you which PC, in particular, correlates statistically significantly with the k-means clusters.
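The same idea can be approximated in base R. To be clear, this is not the PCAtools function, just a hedged stand-in on simulated data: for each of the first five PCs it reports the R-squared from a one-way linear model on cluster membership and a Kruskal-Wallis p-value. The data, the choice of five PCs, and the choice of tests are all assumptions made for illustration.

```r
# Simulated stand-in data (an assumption): 100 genes (rows) x 20 samples (columns).
set.seed(1)
df <- matrix(rnorm(100 * 20), nrow = 100,
             dimnames = list(paste0("gene", 1:100), paste0("S", 1:20)))

pca <- prcomp(t(df), scale. = TRUE)            # samples as observations
km  <- kmeans(t(df), centers = 4, nstart = 25) # cluster the samples, k = 4
cluster <- factor(km$cluster)

# For each of the first five PCs, measure association with the clusters.
assoc <- t(sapply(1:5, function(i) {
  pc <- pca$x[, i]
  c(r.squared = summary(lm(pc ~ cluster))$r.squared,  # variance explained
    p.value   = kruskal.test(pc, cluster)$p.value)    # rank-based test
}))
rownames(assoc) <- paste0("PC", 1:5)

print(round(assoc, 3))
```

A PC with a high R-squared and a small p-value is one along which the k-means clusters separate; with several PCs tested, the p-values would normally be adjusted for multiple testing (for example with p.adjust()).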

ADD REPLY • link 5.8 years ago by Kevin Blighe 89k

1

Oh I see what you're saying. I'll look into this and play around with some options. Thank you for the link by the way, it's really neat and helpful!

ADD REPLY • link 5.8 years ago by myyid68 ▴ 30