Question

Single-cell RNA-seq analysis with PCA: how to choose the variables that contribute most to the components?

4

10.7 years ago

zhenyisong ▴ 160

I read the paper from the Quake lab on using single-cell RNA-seq to find new cell lineage markers in lung development. Their method uses PCA (principal component analysis) to select genes for unsupervised hierarchical clustering (HC). They state that "Genes with highest loadings in the first four components were analysed by unsupervised hierarchical clustering as well as PCA". I think a loading is equivalent in concept to an eigenvector entry. Hence, for this analysis, they would have generated an m × 4 matrix (m = number of genes; the loading matrix?). So my question is: how do we choose the genes with the highest loadings?

(1) Select the genes with the largest sum of weights (that is, sum each row to get an m × 1 vector, then order the genes by it); or
(2) Select the genes that have one of the largest weights in any of the four columns.

Is the solution (1) or (2)? Or do I misunderstand the concept of PCA?
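To make the two options concrete, here is a minimal sketch in base R on simulated data. It is not the authors' method: the toy matrix, the variable names, and the use of absolute values (the sign of an eigenvector is arbitrary) are all assumptions made for illustration.

```r
# Toy data (simulated, not real expression): 30 cells (rows) x 8 genes (columns).
set.seed(1)
expr <- matrix(rnorm(30 * 8), nrow = 30,
               dimnames = list(NULL, paste0("gene", 1:8)))

pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# m x 4 matrix of eigenvector entries for the first four components.
loadings <- pca$rotation[, 1:4]

# Option 1: rank genes by the row sum of absolute weights.
score_sum <- rowSums(abs(loadings))
rank_sum <- names(sort(score_sum, decreasing = TRUE))

# Option 2: rank genes by their largest absolute weight on any single component.
score_max <- apply(abs(loadings), 1, max)
rank_max <- names(sort(score_max, decreasing = TRUE))

head(rank_sum, 3)
head(rank_max, 3)
```

The two rankings contain the same genes but generally in a different order, so the choice between them changes which genes make the cut.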

There is a similar post here, but I think it describes an n × 1 loading matrix.

BTW, is there another way to infer new cell lineages or classify groups of cells? Is there an evaluation report on those methods? TIA

PCA • 7.5k views

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.7 years ago by zhenyisong ▴ 160

0

I'm having the same questions and was wondering if you have made any progress on this?

ADD REPLY • link 10.4 years ago by gaelgarcia ▴ 280

1

No. Someone suggested that the first way is OK (add the weights together and then order them), but I did not find this explanation in any textbook. I wrote to the authors asking for the source code but got no response. Anyway, if you find the answer, do let me know.

ADD REPLY • link 10.4 years ago by zhenyisong ▴ 160

Ram · Answer 1 · 2015-02-23

1

10.4 years ago

Jean-Karim Heriche 27k

From the methods section of the paper:

... genes with the highest PC loadings (highest absolute correlation coefficient with one of the first three to four principal components) were identified.

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.4 years ago by Jean-Karim Heriche 27k

0

I mean, how were the highest PC loadings calculated? I want to know whether the weights on "the first three or four" components are added up per row and the genes ordered by that sum, or whether the single largest weight among those four components is picked for each gene and the genes ordered by that. My understanding is that the description in the Methods section can still be interpreted in both ways. It is a bit confusing for a non-native English speaker. Thanks.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.4 years ago by zhenyisong ▴ 160

0

My interpretation of this is that they went with option 2 you mentioned. The code used is in supplementary data 2 of the paper and here is what I believe to be the relevant section from file Ranalysis_scRNAseq_E18_80cells_paper.txt:

PCA.allgenes = PCA(PCA.data.log2.single, ncp=4, graph=T)
#PCA(PCA.data.log2.single, axes=c(3, 4))
dimension.PCA.allgenes<-dimdesc(PCA.allgenes, axes=c(1,2,3,4))
dim4<-as.data.frame(dimension.PCA.allgenes[[4]])
dim3<-as.data.frame(dimension.PCA.allgenes[[3]])
dim2<-as.data.frame(dimension.PCA.allgenes[[2]])
dim1<-as.data.frame(dimension.PCA.allgenes[[1]])
genes.corr.dim<-unique(c(row.names(dim1[c(1:18),]),row.names(dim1[(nrow(dim1)-10):nrow(dim1),]),row.names(dim2[c(1:18),]),row.names(dim2[(nrow(dim2)-18):nrow(dim2),]),row.names(dim3[1:18,]),row.names(dim3[(nrow(dim3)-18):nrow(dim3),]),row.names(dim4[(nrow(dim4)-18):nrow(dim4),])))

PCA(PCA.data.log2.single[,c(genes.corr.dim)], axes=c(1, 2))

#Hierarchical clustering with genes identified by PCA to correlate strongly with principal components:
data.cluster.candidates<-cbind(data.cast.log2.single[,1:7],data.cast.log2.single[,c(genes.corr.dim)])

hc.candidates <- hclust(as.dist(1-abs(cor(data.cluster.candidates[,8:ncol(data.cluster.candidates)],method="spearman"))), method="ward")

However, the code is not well documented to say the least and I find it unreadable but maybe that's just me not being a strong R programmer. I suspect the job is done in the dimdesc function but there's no way to know which package provides any of the functions used. My guess is that it's all in the FactoMineR package.
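For readers who want to follow the logic without the original data, here is a rough base-R sketch of what the quoted code appears to do: correlate each gene with the first four components, take the most positively and most negatively correlated genes per component, pool them, and cluster. It is an approximation under stated assumptions, not a reproduction: the data are simulated, `top_genes_for_pc` and `n_top` are hypothetical names, and any filtering that `dimdesc` applies internally is not replicated.

```r
# Simulated data: 40 cells (rows) x 50 genes (columns).
set.seed(2)
n_cells <- 40
n_genes <- 50
expr <- matrix(rnorm(n_cells * n_genes), nrow = n_cells,
               dimnames = list(NULL, paste0("gene", seq_len(n_genes))))

pca <- prcomp(expr, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:4]

# Correlation of every gene with each of the first four components (genes x 4).
gene_pc_cor <- cor(expr, scores)

# Hypothetical helper: the n_top most positively and the n_top most
# negatively correlated genes for one component.
top_genes_for_pc <- function(cors, n_top) {
  ordered <- names(sort(cors, decreasing = TRUE))
  unique(c(head(ordered, n_top), tail(ordered, n_top)))
}

# Pool the per-component selections (this corresponds to option 2).
n_top <- 5
selected <- unique(unlist(lapply(seq_len(ncol(gene_pc_cor)), function(k) {
  top_genes_for_pc(gene_pc_cor[, k], n_top)
})))

# Cluster the selected genes by 1 - |Spearman correlation|, as in the quoted code.
gene_dist <- as.dist(1 - abs(cor(expr[, selected], method = "spearman")))
hc <- hclust(gene_dist, method = "ward.D")
```

Note that since R 3.1.0 the `hclust` method written as `"ward"` in the quoted code is called `"ward.D"`.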

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.4 years ago by Jean-Karim Heriche 27k

0

I greatly appreciate your help. Thanks again. However, I am wondering whether this approach (option 2) is an empirical method or whether there is a convincing reason for it (a reference?).

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.4 years ago by zhenyisong ▴ 160

0

The loadings can be viewed as the correlations between the genes and the components, so by selecting in this way you select genes that are strongly associated (positively or negatively) with a component. This makes sense if you want to characterize genes specific to a disease associated with a given component. If you take option 1, you would end up with non-specific genes, because any gene with a strong association with more than one component would rank high.
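The "loadings as correlations" view can be checked numerically. The sketch below uses base R on simulated data and assumes the genes are standardized before PCA (`scale. = TRUE`); under that assumption, each eigenvector entry multiplied by the component's standard deviation equals the correlation between the gene and that component's scores. Variable names are illustrative.

```r
# Simulated data: 60 cells (rows) x 6 genes (columns).
set.seed(3)
expr <- matrix(rnorm(60 * 6), nrow = 60,
               dimnames = list(NULL, paste0("gene", 1:6)))

pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Eigenvector entries scaled by each component's standard deviation
# (the square root of its eigenvalue).
scaled_loadings <- sweep(pca$rotation, 2, pca$sdev, `*`)

# Direct correlation between each gene and each component's scores.
gene_pc_cor <- cor(expr, pca$x)

# The two matrices agree up to floating-point error.
max(abs(scaled_loadings - gene_pc_cor))

# With all components kept, each gene's squared correlations sum to 1.
rowSums(gene_pc_cor^2)
```

So correlation-type loadings already carry each component's weight through its standard deviation, whereas the raw eigenvector entries in `pca$rotation` do not.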

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.4 years ago by Jean-Karim Heriche 27k

0

I remember that each principal component (PC) has its own weight. Should we multiply the loadings by each component's weight and then compare across the four components before selecting the maximum?

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.4 years ago by zhenyisong ▴ 160