Question

Extracting features or gene from PCA after calculating PCA for downstream analysis

6

Entering edit mode

7.0 years ago

1769mkc ★ 1.2k

Paper link Im quoting their lines which is as such

" We performed principal component analysis (PCA) of low-coverage sequencing data to identify genes explaining variation across cells. PCA separated the cells into groups corresponding to the source populations (Fig. 2c and Supplementary Figs. 3–5), and the genes distinguishing each group reflected the biological properties of the cell types (Supplementary Fig. 5 and Supplementary Table 3). PCA of low- and high-coverage sequencing data revealed a remarkably similar graphical distribution of analyzed cells, and the majority (78%) of the top 500 genes determined by PCA were shared between PCA performed on low- and high-coverage data (Supplementary Figs. 4 and 6 and Supplementary Table 4). "

My question is they how they are performing PCA and then finding out the components which contributes to the strongest PC and then taking out the genes for downstream analysis lets say there are gene involved in PC1 and PC2 which defines a certainly cell type and that distinguishes it, as I think for biology person like me finally i want to get the genes or gene list that are getting involved lineage or cell type

As a to test what they might be doing i did this

test<- read.csv('NON_CODING.csv',header = T,row.names = 1)

# preform PCA
pca = prcomp(t(test), center=TRUE, scale=TRUE)

then after calculating PCA i do see this in my console when I type pca

> pca
Standard deviations (1, .., p=16):
 [1] 3.826851e+01 2.080405e+01 1.568739e+01 1.349256e+01 1.119348e+01 9.980547e+00 8.365034e+00
 [8] 8.098841e+00 7.519507e+00 6.184505e+00 5.880260e+00 5.139609e+00 4.851091e+00 4.335606e+00
[15] 3.870918e+00 3.173450e-14

Rotation (n x k) = (2899 x 16):
                                       PC1           PC2           PC3           PC4
5S_rRNA                      -1.090574e-02 -2.412665e-03  1.637689e-02 -3.603865e-02
AB019441.29                  -1.928250e-02  1.821083e-03 -9.713724e-03 -1.978737e-03
ABBA01017803.1               -1.823266e-02 -3.727144e-03 -9.790131e-04 -3.937062e-02
ABC14-1080714F14.1            2.438024e-02  3.816019e-03  1.657784e-04 -7.141480e-03
ABC7-481722F1.1               2.467403e-02  2.873432e-03  3.781153e-03  3.790824e-03
AC000036.4                   -1.066777e-02  2.838305e-02  2.642673e-02 -7.998608e-03
AC000089.3                   -1.249026e-02 -7.870864e-04 -2.569602e-02 -3.165073e-03
AC000120.7                   -1.313696e-02  5.923245e-03  1.633255e-02  4.077602e-02
AC000123.4                    2.415996e-02  1.029732e-02  8.930792e-03 -1.118660e-02
AC000403.4                    8.692835e-03 -3.060170e-02  3.746839e-02 -6.968971e-03
AC002064.4                   -1.799590e-02  2.147160e-02 -2.303742e-02  7.647937e-03

Now how do i find which are the list or set of genes that gives me PC1 or PC2 so on

Any suggestion or help how to get the genes that make most difference between two major component that would be very helpful

R • 15k views

ADD COMMENT • link 2.4 years ago by 1769mkc ★ 1.2k

score 1 · Answer 1 · 2017-12-12

1

Entering edit mode

7.0 years ago

swbarnes2 14k

The numbers in the PC1 column are the contributions of those genes to PC1. The most extreme numbers go with the genes that are pushing that component the most. So look for the genes whose values have the highest absolute value.

ADD COMMENT • link 7.0 years ago by swbarnes2 14k

0

Entering edit mode

You (the OP) asked me to comment from our other thread ( C: Adding Percentage in PCA ), however, what I have been saying is 100% in line with what swbarnes2 says. If you are wondering how to choose the 'best' genes that contribute to variation in each PC, then there is no standard cut-off to use. You can just choose the top 50, for example.

ADD REPLY • link 7.0 years ago by Kevin Blighe 88k

0

Entering edit mode

okay...there is no definite method to do so...I was wondering how did they arrive at 500 genes in that paper ...Then will try ...do you have any tutorial for that or answered similar question how to take out genes contributing to PCA , if you had answered in the past it would be of great help to look into it, your resources are very helpful i must say

ADD REPLY • link 7.0 years ago by 1769mkc ★ 1.2k

1

Entering edit mode

i have not seen any main tutorial for this, as it is just a matter of ordering the loadings to each principal component and then choosing the top N genes. I am sure that this has been used in many published manuscripts but that many authors would just not fully explain what exactly they did.

ADD REPLY • link 6.0 years ago by Kevin Blighe 88k

0

Entering edit mode

Yes in paper they tell they did but won't write about the exact method

so how to order those eigenvalues ?

ADD REPLY • link 7.0 years ago by 1769mkc ★ 1.2k

2

Entering edit mode

This will give you top 50 genes to PC4:

data.frame(sort(abs(project.pca$rotation[,"PC4"]), decreasing=TRUE)[1:50])

ADD REPLY • link 5.7 years ago by Kevin Blighe 88k

0

Entering edit mode

data.frame(sort(abs(project.pca$rotation[,"PC4"]), decreasing=TRUE)[1:50])

why you have mentioned "PC4"? isn;t that supposed to be PC 1 i was just wondering

ADD REPLY • link 7.0 years ago by 1769mkc ★ 1.2k

1

Entering edit mode

Apologies, copy/paste error from a previous project where I was interested in PC4. You just need to change this to PC1. [now edited]

ADD REPLY • link 5.7 years ago by Kevin Blighe 88k

0

Entering edit mode

okay I was wondering that, as always your suggestion and codes are really helpful

ADD REPLY • link 7.0 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

@Kevin back to the question again some doubts cropped up..so what i understand is my PC is equal to the number of variables which in my case is my sample ,so if i see my PC1 and PC2 are the one making most of the variance , so how does it change even i consider all the PC together because the same of set of genes would be having different rotation in all the sample ,im bit confused ,because even if Im taking the genes with PC1 and PC2 for my downstream analysis I take all the data set yes now I have genes which are most variable , whatever im doing is that conceptually correct ??

ADD REPLY • link 6.2 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

The accumulative % variation of all PCs will be 100%. PC1 will always explain the highest % variation.

If you are interested in using the loadings from PC1 and PC2, then, ideally, your samples are segregated in a way that is of clinical interest along these 2 PCs (?)

ADD REPLY • link 6.0 years ago by Kevin Blighe 88k

0

Entering edit mode

certainly yes as of now i haven;t included in my analysis ,I mean im just taking the expression values for my analysis .they are segregated in terms of mutation ..

ADD REPLY • link 6.2 years ago by 1769mkc ★ 1.2k

2

Entering edit mode

Okay, yes, each and every gene will have a value to PC1, and also to PC2, PC3, et cetera. These values are called the component loadings.

ADD REPLY • link 4.3 years ago by Kevin Blighe 88k

2

Entering edit mode

hi Kevin i have doubt what is the difference with the codes as the output im getting is not consistent

eigenvalues3  <-data.frame(sort(abs(m_pca$rotation[,"PC1"]), decreasing=TRUE)[1:500]) this is what you did here

this is what doing

Loadings <- as.data.frame(m_pca$rotation[,1:3]) taking first three

then

t <- Loadings[order(Loadings$PC1,decreasing = TRUE)[1:500],]

then this

hm = cbind(gene = rownames(t), t)
head(hm)
hmm = melt(hm, id = c("gene"))
head(hmm)
write_tsv(hm,"gene_hm_PC1.txt")

I should be getting same output in both the case...but im not not sure what am i doing wrong..

ADD REPLY • link 6.1 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

The code that I gave converts the loadings to absolute values with abs(); however, you do not appear to do that with your code (?)

ADD REPLY • link 6.0 years ago by Kevin Blighe 88k

0

Entering edit mode

okay...so the only purpose is to get the absolute values ..but the gene would remain the same. isn't it?

ADD REPLY • link 6.1 years ago by 1769mkc ★ 1.2k

3

Entering edit mode

To understand the difference between positive and negative loadings, you have to view the original PCA bi-plot (of samples). Positive values on PC1 may relate to group X, but negative values may relate to group Y.

Ordering by absolute values just says: 'this gene is important to either group X or group Y'

ADD REPLY • link 6.0 years ago by Kevin Blighe 88k

0

Entering edit mode

okay that is important point...which i was missing ...i will do as you suggested

ADD REPLY • link 6.1 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

Hi Kevin,

I wonder which data should I upload for PCA, total matrix or just the matrix for differential expression genes?

If I upload total gene matrix, then when I want to find out the most contributing genes for PC1, should I exclude the undifferential expression genes?

Thanks in advance!

ADD REPLY • link 4.6 years ago by Yi • 0

0

Entering edit mode

Upload to where?

ADD REPLY • link 4.6 years ago by Kevin Blighe 88k

0

Entering edit mode

I mean which data should be used for PCA? Just the DEGs or total genes? Thanks!

ADD REPLY • link 4.6 years ago by Yi • 0

0

Entering edit mode

yes DEG ..

ADD REPLY • link 2.4 years ago by 1769mkc ★ 1.2k

score 0 · Answer 2 · 2020-05-08

0

Entering edit mode

4.6 years ago

The ▴ 180

https://stats.stackexchange.com/questions/115032/how-to-find-which-variables-are-most-correlated-with-the-first-principal-compone

ADD COMMENT • link 4.6 years ago by The ▴ 180

0

Entering edit mode

Hey, I want to ask a follow up question. Is it possible to do the same for MDS? Multi-dimensional scaling? Thanks!

ADD REPLY • link 4.4 years ago by junli1988 • 0