Hello everyone!
I'm doing PCA (Principal Component Analysis) on a set of 1000 genes in 4 different samples to see if there's any split in the data. My data looks like this:
id sample1 sample2 sample3 sample4
gene1 2 0 1 1
gene2 1 2 0 3
gene3 2 2 4 2
gene4 3 1 7 0
My code is very simple:
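data <- read.csv("exp.csv")
# column 1 is the gene id; columns 2 to 5 hold the four samples
matrix <- data.matrix(data[, 2:5])
rownames(matrix) <- data$id
pca <- prcomp(matrix, scale. = TRUE)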
library(ggplot2)
# create data frame with scores
scores = as.data.frame(pca$x)
# plot of observations
# label each point with its gene id (rownames(exp) pointed at the exp() function, not the data)
ggplot(data = scores, aes(x = PC1, y = PC2, label = data$id)) +
  geom_hline(yintercept = 0, colour = "gray65") +
  geom_vline(xintercept = 0, colour = "gray65") +
  geom_text(colour = "tomato", alpha = 0.8, size = 4) +
  ggtitle("PCA plot")
When I plot PC1 against PC2 I clearly see a separation: the genes are divided into 2 big groups. How can I see which genes make up each of these 2 clusters? In the plot lots of genes overlap with each other, so it's difficult to make out the gene names from the plot alone. How can I extract the cluster members from the PCA results and save them as a text file?
EDIT: For the above code, can someone tell me how to colour the dots in the plot according to the sample? I tried changing the colour parameter in ggplot but it's not working.
Thanks!!
FYI, no one receives a notice when you edit a post. So the likelihood of someone responding to the edit when there are already answers present is low.
Regarding the edit, you can specify colors by adding a new column to the scores data.frame that contains either sample names or even just factor(c(1:nrow(scores))). Then map that column to the color aesthetic (well, "colour", since ggplot2 uses the British spelling).
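As a rough illustration of that suggestion, here is a minimal sketch. It assumes toy data built from the four example genes in the question; the column name `group` is made up for the example. One possible reason the original attempt did not work is that a fixed colour set inside a geom (such as colour = "tomato") overrides any colour mapped in aes(), so the sketch maps the column inside aes() and sets no fixed colour.

```r
library(ggplot2)

# Toy data copied from the four example genes in the question
data <- data.frame(
  id      = c("gene1", "gene2", "gene3", "gene4"),
  sample1 = c(2, 1, 2, 3),
  sample2 = c(0, 2, 2, 1),
  sample3 = c(1, 0, 4, 7),
  sample4 = c(1, 3, 2, 0)
)
expr <- as.matrix(data[, -1])   # numeric sample columns only
rownames(expr) <- data$id

pca <- prcomp(expr, scale. = TRUE)
scores <- as.data.frame(pca$x)

# Hypothetical grouping column: here simply one level per row (gene)
scores$group <- factor(seq_len(nrow(scores)))

# Map the column to colour inside aes(); no fixed colour in the geom
p <- ggplot(scores, aes(x = PC1, y = PC2, colour = group)) +
  geom_point(size = 3) +
  ggtitle("PCA plot")
print(p)
```

Any other factor with one entry per row of scores could be used in place of `group`.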
Hi Devon, thanks for letting me know about the edit. I tried factor(c(1:nrow(scores))), but that colours every gene differently, whereas I wanted to colour each gene by the sample it contributes to most. In the final PCA plot I do see 2 big clusters of genes, so I wanted to colour them and see which sample each gene was coming from...
Ah, I see. The genes are coming from all of the samples at the same time, so it's unclear what you actually mean.
Sorry, maybe I didn't explain properly. Yes, the genes come from all the samples at the same time, but they have different values in each sample. So is there no way to colour them according to sample? For example, red for genes with the highest expression in sample1, green for those with the highest expression in sample2, and so on?
Just create a vector with that information. You have a matrix of values, so just process it to determine what sample to assign it to.
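One way this could look, as a sketch only: it assumes the toy data from the question and assumes that "most expression" means the largest raw value in each row, with ties going to the first sample. The name `top_sample` is invented for the example.

```r
library(ggplot2)

# Toy data copied from the four example genes in the question
data <- data.frame(
  id      = c("gene1", "gene2", "gene3", "gene4"),
  sample1 = c(2, 1, 2, 3),
  sample2 = c(0, 2, 2, 1),
  sample3 = c(1, 0, 4, 7),
  sample4 = c(1, 3, 2, 0)
)
expr <- as.matrix(data[, -1])   # numeric sample columns only
rownames(expr) <- data$id

# For each gene (row), take the name of the column holding its largest value.
# which.max() returns the first maximum, so ties go to the earliest sample.
top_sample <- colnames(expr)[apply(expr, 1, which.max)]
# top_sample is: "sample1" "sample4" "sample3" "sample3"

pca <- prcomp(expr, scale. = TRUE)
scores <- as.data.frame(pca$x)
scores$top_sample <- factor(top_sample, levels = colnames(expr))

# Colour each gene by the sample in which it has the highest value
p <- ggplot(scores, aes(x = PC1, y = PC2, colour = top_sample)) +
  geom_point(size = 3) +
  ggtitle("PCA plot")
print(p)
```

Whether the raw maximum is a sensible criterion depends on how the samples were normalised, so treat the assignment rule as a placeholder.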
OK. Thanks!