Hi,
I am computing PCA with prcomp
and I wanted to extract the loadings/rotations for PC1.
My dataset (the dataset has been converted into Z scores):
# A tibble: 5 x 38
gene P4_MU_Sepithelia P16_MU_Sepithel~
<chr> <dbl> <dbl>
1 ADAM10 1.45 0.973
2 ADAM12 -1.43 -1.62
3 ADAM17 -0.0191 -0.0544
4 APH1A 0.866 1.04
5 ATOH1 0.162 -0.545
my script for the PCA:
SE_notch3 <- SE_notch %>% drop_na() #remove rows with NA
rnames <- SE_notch3$gene #select character variable (aka gene names)
SE_notch3 <- SE_notch3[-c(1)] # remove gene symbol
SE_notch.matrix <-(as.matrix(SE_notch3)) #convert into matrix
rownames(SE_notch.matrix) <- rnames # assign row names
all <- t(SE_notch.matrix) #transpose
#pca
res.pca2 <- prcomp(all, center = T, scale=F)
then I tried two different approaches to get the loadings (which in prcomp resides into the rotation parameter):
method1:
loading.scores <- res.pca2$rotation[,1] #PC1
genes.scores <- abs(loading.scores) #convert in absolute values
tes <- sort(genes.scores, decreasing=T) #decreasing orders
top <- tes[1:10] #show the first 10 genes
top
the 10 top gene responsible to drive my PC1 according to the previous code are:
HES7 HES1 HEY2 HES6 GATA3
0.99819716 0.02058775 0.01975547 0.01965922 0.01623756
ADAM12 HEYL DLL4 NOTCH1 ATOH1
0.01535992 0.01517099 0.01410263 0.01406065 0.01274545
but if I compute the PCA on those 10 genes, instead of seeing the sample to cluster far apart they cluster ALL together so I thought I wort the wrong code and I tried this method:
library(broom)
library(tidyr)
top_genes <- res.pca2 %>% # extract variable (gene) loadings
tidy(matrix = "variables") %>%
filter(PC %in% c(1)) %>% # retain only PC1
group_by(PC) %>%
arrange(desc(abs(value))) %>% # sort descending value
slice(1:10) %>% # take top 10
pull(column) %>% # extract the column (gene name) from the table
unique() # retain unique gene names only
top_genes
and the top genes are:
[1] "PSEN1" "JAG2" "CREBBP" "DTX2" "MAML1"
[6] "LNX1" "NCSTN" "ADAM12" "YY1" "ITCH"
What is the right method? Why do I get different output if the PCA is computed in the same way? Where is the error?
I have also tried to compute the PCA with FactoMineR
package that doesn`t show me the loadings directly but it gives me the plot of the contribution when I set which Pc I want:
library("FactoMineR") # for computing PCA
library("factoextra") # for graph
#PCA
res.pca <- PCA(all, graph = FALSE, scale.unit = FALSE)
#contribution variables
top10<- fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
top10
and the genes with most contributions to PC1 are those that I found with the "first" code :
HES7 HES1 HEY2 HES6 GATA3
ADAM12 HEYL DLL4 NOTCH1 ATOH1
Thank you for your help!
Camilla
thank you for your answer. the PC1 on the dataset contributes to the 90.9% of the variation in my dataset. I tried to use your package since it gives very nice plots but I don`t understand how to get the plot of the loadings only for the PC1 and when I run
I get different variables retained (compare to the previous one) which I am assuming is due to the fact that it consider variables responsible for PC1 and PC2:
also, I wanted to re-run the PCA only on those 10variables just to confirm that I was writing right the code and I was identifying correctly the genes responsible for PC1, (e.g if X number genes are the most responsible to drive the PC1 then I am assuming that re-running the PCA on them will give me a plot with all my samples far away from each other, am I wrong?)
I see, then you'd probably have to include a lot more than just 10 genes if you want to re-create the same structure. While PC1 is ~90%, those 10 genes that you chose could only represent a small portion of that. I cannot see your data or the bi-plots, so, I a not to know.
My plotloadings function, for example, may be a better way to do this, as, by default, it takes the genes that fall into the top 5% of the loadings range. If you choose PC1, and PC2, it will do this separately for PC1 and PC2, and then bind the results for each into a unique vector (so,
top 5% PC1
+top 5% PC2
).