PCA loadings/roation : different variables depending on the code
1
0
Entering edit mode
4.3 years ago
camillab. ▴ 160

Hi,

I am computing PCA with prcomp and I wanted to extract the loadings/rotations for PC1. My dataset (the dataset has been converted into Z scores):

# A tibble: 5 x 38
  gene  P4_MU_Sepithelia P16_MU_Sepithel~
  <chr>            <dbl>            <dbl>
1 ADAM10          1.45             0.973 
2 ADAM12          -1.43            -1.62  
3 ADAM17          -0.0191          -0.0544
4 APH1A           0.866            1.04  
5 ATOH1           0.162           -0.545

my script for the PCA:

SE_notch3 <- SE_notch  %>% drop_na() #remove rows with NA
rnames <- SE_notch3$gene #select character variable (aka gene names)
SE_notch3 <- SE_notch3[-c(1)] # remove gene symbol
SE_notch.matrix <-(as.matrix(SE_notch3)) #convert into matrix
rownames(SE_notch.matrix) <- rnames # assign row names
all <- t(SE_notch.matrix) #transpose

#pca
res.pca2 <- prcomp(all, center = T, scale=F)

then I tried two different approaches to get the loadings (which in prcomp resides into the rotation parameter):

method1:

loading.scores <- res.pca2$rotation[,1] #PC1
genes.scores <- abs(loading.scores) #convert in absolute values
tes <- sort(genes.scores, decreasing=T) #decreasing orders
top <- tes[1:10] #show the first 10 genes
top

the 10 top gene responsible to drive my PC1 according to the previous code are:

  HES7       HES1       HEY2       HES6      GATA3 
0.99819716 0.02058775 0.01975547 0.01965922 0.01623756 
    ADAM12       HEYL       DLL4     NOTCH1      ATOH1 
0.01535992 0.01517099 0.01410263 0.01406065 0.01274545

but if I compute the PCA on those 10 genes, instead of seeing the sample to cluster far apart they cluster ALL together so I thought I wort the wrong code and I tried this method:

 library(broom)
    library(tidyr)
    top_genes <- res.pca2 %>%  # extract variable (gene) loadings
      tidy(matrix = "variables") %>%   
      filter(PC %in% c(1)) %>%   # retain only PC1 
      group_by(PC) %>%
      arrange(desc(abs(value))) %>%  # sort descending value  
      slice(1:10) %>% # take top 10
      pull(column) %>%  # extract the column (gene name) from the table
     unique() # retain unique gene names only

    top_genes

and the top genes are:

[1] "PSEN1"  "JAG2"   "CREBBP" "DTX2"   "MAML1" 
 [6] "LNX1"   "NCSTN"  "ADAM12" "YY1"    "ITCH"

What is the right method? Why do I get different output if the PCA is computed in the same way? Where is the error?

I have also tried to compute the PCA with FactoMineR package that doesn`t show me the loadings directly but it gives me the plot of the contribution when I set which Pc I want:

library("FactoMineR") # for computing PCA
library("factoextra") # for graph
#PCA
res.pca <- PCA(all, graph = FALSE, scale.unit = FALSE)
#contribution  variables
top10<- fviz_contrib(res.pca, choice = "var", axes = 1, top = 10) 
top10

and the genes with most contributions to PC1 are those that I found with the "first" code :

HES7       HES1       HEY2       HES6      GATA3 
ADAM12       HEYL       DLL4     NOTCH1      ATOH1

Thank you for your help!

Camilla

RNA-Seq PCA prcomp FactoMineR loading • 1.8k views
ADD COMMENT
0
Entering edit mode
4.3 years ago

If all samples cluster together after you performed PC using just those 10 genes, then it tells me that they (and PC1) do not contribute too much to the overall explained variation in your dataset as you may think.

PC1 is just the PC that contributes most variation to your dataset - it may only contribute 15%, though. Also, there may not actually be any genes along PC1 that are contributing strongly to that 15% explained variation.

Ultimately, I am not sure why you want to perform PCA on a tiny subset of just 10 genes.

Kevin

ADD COMMENT
0
Entering edit mode

thank you for your answer. the PC1 on the dataset contributes to the 90.9% of the variation in my dataset. I tried to use your package since it gives very nice plots but I don`t understand how to get the plot of the loadings only for the PC1 and when I run

p <- pca(SE_notch.matrix, scale=F) #don`t need to transpose
    plotloadings(p,components = getComponents(p, seq_len(2)))

I get different variables retained (compare to the previous one) which I am assuming is due to the fact that it consider variables responsible for PC1 and PC2:

 -- variables retained:
HES7, ADAM10, ADAM12, ADAM17, APH1A, ATOH1, CBL, CNTN1, CNTN6, CREBBP, CTBP1, CTBP2, DLL4, DNER, DTX2, DTX3L, DTX4, DVL1, DVL3, ENO1, EPS15, FBXW7, FURIN, FUT8, GATA3, HDAC1, HDAC2, HES1, HES6, HEY1, HEY2, HEYL, ITCH, JAG1, JAG2, KAT2A, LFNG, LNX1, LNX2, MAML1, MAML3, MDK, MIB1, MYC, MYCBP, NCSTN, NLE1, NOTCH1, NOTCH2, NUMB, PSEN1, PSEN2, PTPRZ1, RBPJ, RFNG, SKP2, UPF2, YY1

also, I wanted to re-run the PCA only on those 10variables just to confirm that I was writing right the code and I was identifying correctly the genes responsible for PC1, (e.g if X number genes are the most responsible to drive the PC1 then I am assuming that re-running the PCA on them will give me a plot with all my samples far away from each other, am I wrong?)

ADD REPLY
1
Entering edit mode

I see, then you'd probably have to include a lot more than just 10 genes if you want to re-create the same structure. While PC1 is ~90%, those 10 genes that you chose could only represent a small portion of that. I cannot see your data or the bi-plots, so, I a not to know.

My plotloadings function, for example, may be a better way to do this, as, by default, it takes the genes that fall into the top 5% of the loadings range. If you choose PC1, and PC2, it will do this separately for PC1 and PC2, and then bind the results for each into a unique vector (so, top 5% PC1 + top 5% PC2).

ADD REPLY

Login before adding your answer.

Traffic: 1492 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6