Question

Extracting information from Principal component analysis

3

10.1 years ago

Diana ▴ 930

Hello everyone!

I'm doing PCA (Principal Component Analysis) on a set of 1000 genes in 4 different samples to see if there's any split in the data. My data looks like this:

id       sample1     sample2     sample3     sample4
gene1    2           0           1           1
gene2    1           2           0           3 
gene3    2           2           4           2
gene4    3           1           7           0

My code is very simple:

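# read the table; the id column becomes the row names
data <- read.csv("exp.csv", row.names = 1)

# numeric matrix: genes in rows, samples in columns
mat <- data.matrix(data)

# PCA on all four sample columns, scaled to unit variance
pca <- prcomp(mat, scale. = TRUE)

library(ggplot2)

# create data frame with scores (row names are the gene ids)
scores <- as.data.frame(pca$x)

# plot of observations
ggplot(data = scores, aes(x = PC1, y = PC2, label = rownames(scores))) +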
  geom_hline(yintercept = 0, colour = "gray65") +
  geom_vline(xintercept = 0, colour = "gray65") +
  geom_text(colour = "tomato", alpha = 0.8, size = 4) +
  ggtitle("PCA plot")

When I plot PC1 against PC2 I clearly see a separation: the genes fall into 2 big groups. But how can I see which genes make up each of these 2 clusters? In the plot, lots of genes overlap with each other, so it's difficult to make out the gene names. How can I extract these from the PCA results and save them as a text file?

EDIT: For the above code, can someone tell me how I can colour the dots in the plot according to the sample? I tried changing the colour parameter in ggplot but it's not working.

Thanks!!

PCA R RNA-Seq • 13k views

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by Diana ▴ 930

0

FYI, no one receives a notice when you edit a post. So the likelihood of someone responding to the edit when there are already answers present is low.

Regarding the edit, you can specify colors by adding a new column to the scores data.frame that contains either sample names or even just factor(1:nrow(scores)). Then specify that column as the color (well, "colour", since ggplot2 uses the British spelling).

ADD REPLY • link 10.1 years ago by Devon Ryan 105k

0

Hi Devon...thanks for letting me know about the edit. I tried the factor(c(1:nrow(scores))) but that colours all the genes differently whereas I wanted to colour them based on the sample that the gene is most contributing to? In the final PCA plot I do see 2 big clusters of genes so I wanted to colour and see which sample each gene was coming from...

ADD REPLY • link 10.1 years ago by Diana ▴ 930

0

Ah, I see. The genes are coming from all of the samples at the same time, so it's unclear what you actually mean.

ADD REPLY • link 10.1 years ago by Devon Ryan 105k

0

Sorry, maybe I didn't explain properly. Yes, the genes are coming from all the samples at the same time, but they have different values in each sample. So is there no way to colour them according to samples? For example, red for genes with the highest expression in sample1, green for those with the highest expression in sample2, and so on?

ADD REPLY • link 10.1 years ago by Diana ▴ 930

0

Just create a vector with that information. You have a matrix of values, so just process it to determine what sample to assign it to.
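A minimal sketch of one way to build such a vector: for each gene, take the sample in which it has its highest value and use that as the colour. The toy matrix `mat` and the column name `top_sample` are made up for illustration, and "highest raw value" is only one possible definition of the sample a gene belongs to.

```r
library(ggplot2)

# Toy expression matrix (made-up values standing in for exp.csv):
# genes in rows, samples in columns.
mat <- rbind(
  gene1 = c(2, 0, 1, 1),
  gene2 = c(1, 2, 0, 3),
  gene3 = c(2, 2, 4, 2),
  gene4 = c(3, 1, 7, 0),
  gene5 = c(0, 5, 1, 2),
  gene6 = c(6, 1, 0, 1)
)
colnames(mat) <- paste0("sample", 1:4)

pca <- prcomp(mat, scale. = TRUE)
scores <- as.data.frame(pca$x)

# For each gene, the sample in which it has its highest value
# (ties go to the first such sample).
top_idx <- max.col(mat, ties.method = "first")
scores$top_sample <- factor(colnames(mat)[top_idx], levels = colnames(mat))

# Map the new column to colour; ggplot2 picks one colour per sample.
p <- ggplot(scores, aes(x = PC1, y = PC2, colour = top_sample)) +
  geom_point(size = 3) +
  ggtitle("PCA plot, coloured by sample with highest expression")
print(p)
```

If the samples differ a lot in overall scale, it may make more sense to normalise the columns before taking the per-gene maximum.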

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by Devon Ryan 105k

0

OK. Thanks!

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by Diana ▴ 930

Ram · Answer 1 · 2015-01-20

4

10.1 years ago

Jeremy Leipzig 23k

The keyword you might be searching for is "loadings":

http://stackoverflow.com/questions/12760108/principal-components-analysis-how-to-get-the-contribution-of-each-paramete
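For reference, a small sketch of where the loadings live in a prcomp() result. The matrix below is simulated, not the data from the question. Note that with genes in rows and samples in columns, the loadings in `pca$rotation` describe how each sample contributes to a component; the per-gene coordinates are the scores in `pca$x`.

```r
set.seed(1)

# Simulated matrix (an assumption, not the real data): 20 genes x 4 samples.
mat <- matrix(rpois(80, lambda = 5), nrow = 20,
              dimnames = list(paste0("gene", 1:20), paste0("sample", 1:4)))

pca <- prcomp(mat, scale. = TRUE)

# Loadings: one row per input column (sample), one column per component.
loadings <- pca$rotation
print(round(loadings, 3))

# Share of each sample in PC1, as squared loadings (they sum to 1 per PC).
contrib_pc1 <- loadings[, "PC1"]^2
print(round(sort(contrib_pc1, decreasing = TRUE), 3))

# Proportion of variance explained by each component.
print(summary(pca)$importance["Proportion of Variance", ])
```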

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by Jeremy Leipzig 23k

Ram · Answer 2 · 2015-01-20

1

10.1 years ago

Devon Ryan 105k

If you have a clear separation, then you can simply threshold the scores data.frame according to that. I don't recall if prcomp() adds row names to its output, but if not then things should be in the same order as the input.
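A rough sketch of that thresholding, on simulated data with two obvious groups. When the input matrix has row names, prcomp() carries them over to `pca$x`, so the gene ids are available directly. The cutoff of 0 on PC1 and the output file name are assumptions; the real cutoff would be read off your own plot.

```r
set.seed(42)

# Simulated data (an assumption): two groups of 10 genes, one high in
# samples 1-2 and the other high in samples 3-4.
group_a <- cbind(matrix(rnorm(20, mean = 10), 10), matrix(rnorm(20, mean = 2), 10))
group_b <- cbind(matrix(rnorm(20, mean = 2), 10), matrix(rnorm(20, mean = 10), 10))
mat <- rbind(group_a, group_b)
dimnames(mat) <- list(paste0("gene", 1:20), paste0("sample", 1:4))

pca <- prcomp(mat, scale. = TRUE)
scores <- as.data.frame(pca$x)  # row names are carried over from `mat`

# Split the genes on PC1; the cutoff is a placeholder value.
cutoff <- 0
scores$group <- ifelse(scores$PC1 > cutoff, "cluster1", "cluster2")

# Write gene id, coordinates and group to a tab-separated text file.
out <- data.frame(gene = rownames(scores), scores[, c("PC1", "PC2", "group")])
out_file <- file.path(tempdir(), "pca_groups.txt")  # hypothetical file name
write.table(out, out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Gene names per group.
print(split(out$gene, out$group))
```

The sign of a principal component is arbitrary, so which side ends up as "cluster1" can differ between runs or data sets.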

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by Devon Ryan 105k

0

Thanks it worked!

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by Diana ▴ 930

Ram · Answer 3 · 2015-01-20

1

10.1 years ago

Jean-Karim Heriche 27k

You could cluster the genes in PCA space i.e. use the scores as input to the clustering algorithm.
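As an illustration of this idea, here is a sketch using k-means on the first two components. The choice of k-means, of two components, and of `centers = 2` are all assumptions made to match the two groups described in the question; the data are simulated.

```r
set.seed(42)

# Simulated data (an assumption): two groups of 10 genes, one high in
# samples 1-2 and the other high in samples 3-4.
group_a <- cbind(matrix(rnorm(20, mean = 10), 10), matrix(rnorm(20, mean = 2), 10))
group_b <- cbind(matrix(rnorm(20, mean = 2), 10), matrix(rnorm(20, mean = 10), 10))
mat <- rbind(group_a, group_b)
dimnames(mat) <- list(paste0("gene", 1:20), paste0("sample", 1:4))

pca <- prcomp(mat, scale. = TRUE)

# Cluster the genes in PCA space, using the scores on PC1 and PC2.
km <- kmeans(pca$x[, 1:2], centers = 2, nstart = 25)

# Gene names per cluster; km$cluster follows the row order of pca$x.
clusters <- split(rownames(pca$x), km$cluster)
print(clusters)
```

Unlike a manual cutoff, this needs no threshold, but the number of clusters still has to be chosen.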

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by Jean-Karim Heriche 27k

score 0 · Answer 4 · 2015-01-20

0

10.1 years ago

The ▴ 180

Check if this helps:
http://stats.stackexchange.com/questions/115032/how-to-find-which-variables-are-most-correlated-with-the-first-principal-compone

ADD COMMENT • link 10.1 years ago by The ▴ 180

score 0 · Answer 5 · 2017-03-31

0

7.9 years ago

benoit.tessoulin ▴ 30

Hi, check FactoMineR, a very useful package for PCA (and MCA, MFA, FAMD...) in R. It gives great outputs, both statistical and graphical.

ADD COMMENT • link 7.9 years ago by benoit.tessoulin ▴ 30