Question

ConsensusClusterPlus: How to extract most contributing features for each cluster

3.7 years ago

komal.rathi ★ 4.1k

Hi,

I am using the R package ConsensusClusterPlus. Here is an example with the ALL data:

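library(ConsensusClusterPlus)
library(ALL)
data(ALL)
d <- exprs(ALL)  # expression matrix: genes in rows, samples in columns

# Consensus clustering of the samples for k = 2..10
res <- ConsensusClusterPlus(d,
                     clusterAlg = "pam",
                     finalLinkage = "average",
                     distance = "spearman",
                     plot = NULL,
                     reps = 1000,
                     maxK = 10,
                     pItem = 0.8,     # fraction of samples drawn in each resampling
                     pFeature = 1,    # fraction of genes drawn in each resampling
                     seed = 100)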

So if I want to get the cluster membership of each sample when k = 5, I would use:

cluster5 <- res[[5]]
head(cluster5$consensusClass, n = 10)
# 01005 01010 03002 04006 04007 04008 04010 04016 06002 08001 
#     1     2     1     2     1     1     2     1     1     3

My question is: how do I extract the most contributing features (or genes in this case) in each cluster?

R consensusclusterplus • 2.1k views

ADD COMMENT • link updated 14 months ago by LChart 4.5k • written 3.7 years ago by komal.rathi ★ 4.1k

Since you are clustering patients/samples using expression values, my best guess would be to separate the patients by cluster membership. For example, for cluster 1, take the matrix of patients assigned only to cluster 1 and compare their gene expression against the patients in the other clusters. You can use something like a Wilcoxon rank-sum test, then sort the results by fold change or p-value.
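A minimal sketch of that one-vs-rest comparison, using base R only. The matrix `d` and the label vector `cl` below are simulated stand-ins (assumptions) for `exprs(ALL)` and `res[[5]]$consensusClass`, and `one_vs_rest` is a hypothetical helper name, not part of ConsensusClusterPlus. On log2-scale data the difference in means plays the role of a log2 fold change.

```r
set.seed(1)
# Simulated stand-ins (assumptions): 'd' mimics exprs(ALL) (genes x samples),
# 'cl' mimics res[[5]]$consensusClass (named vector of cluster labels).
n_genes <- 200; n_samples <- 30
d <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
            dimnames = list(paste0("gene", seq_len(n_genes)),
                            paste0("s", seq_len(n_samples))))
cl <- setNames(rep(1:3, each = 10), colnames(d))
d[1:10, cl == 1] <- d[1:10, cl == 1] + 3   # plant a signal for cluster 1

# Hypothetical helper: Wilcoxon rank-sum test of cluster k vs. all other samples,
# run gene by gene, with Benjamini-Hochberg adjustment.
one_vs_rest <- function(d, cl, k) {
  in_k <- cl[colnames(d)] == k
  p <- apply(d, 1, function(x) wilcox.test(x[in_k], x[!in_k])$p.value)
  diff <- rowMeans(d[, in_k, drop = FALSE]) - rowMeans(d[, !in_k, drop = FALSE])
  out <- data.frame(gene = rownames(d), cluster = k, diff = diff,
                    p = p, padj = p.adjust(p, method = "BH"))
  # Most significant first; ties broken by the larger absolute difference
  out[order(out$padj, -abs(out$diff)), ]
}

top1 <- one_vs_rest(d, cl, k = 1)
head(top1, 10)
```

Repeating the call for each value of k gives one ranked gene table per cluster.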

ADD REPLY • link 3.7 years ago by halo22 ▴ 300

Hi, this is not an answer, but I would like to know whether you have found a way to extract the most contributing features for each cluster. I am also stuck at this point.

ADD REPLY • link 14 months ago by aUser ▴ 70

score 0 · Answer 1 · 2023-09-19

Hi,

You might have already figured out how to extract the most contributing features for each cluster, but since someone else might stumble upon this, I am writing a few lines.

You cannot extract the most contributing features directly from the clustering result. The reason is that ConsensusClusterPlus uses all the data points to build a sample-to-sample correlation (distance) matrix and then clusters on that matrix. In doing so, the individual value of each gene is absorbed into a single number per pair of samples (the correlation/distance between the two samples). Thus, the ConsensusClusterPlus output alone cannot tell you which features contributed most.

To extract these features, one way is, as pointed out by @halo22, to take the samples belonging to each cluster and calculate the differential expression against the rest. Sort the genes by log2FC or p-value and then select them using an arbitrary criterion (e.g. |log2FC| > 1, adjusted p-value < 0.01, or both). This has one drawback: the same gene may be selected for more than one cluster, e.g. gene X for both cluster 1 and cluster 2. You can then add a further criterion, such as assigning the gene to the cluster where it has the higher expression or the greater significance.
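A small sketch of the filtering and tie-breaking step. The table `de` is entirely made up for illustration (hypothetical genes and values); in practice it would be the stacked per-cluster output of whatever differential-expression method you use. The tie-break shown here (largest |log2FC| wins) is just one possible choice.

```r
# Hypothetical one-vs-rest results stacked over clusters (made-up values).
de <- data.frame(
  gene    = c("A", "B", "C", "A", "D", "B"),
  cluster = c(1, 1, 1, 2, 2, 3),
  log2FC  = c(2.5, -1.4, 0.3, 1.2, -2.0, 1.8),
  padj    = c(1e-6, 5e-4, 1e-8, 2e-3, 1e-5, 0.2),
  stringsAsFactors = FALSE
)

# Keep genes that pass both the effect-size and the significance cutoff
sig <- de[abs(de$log2FC) > 1 & de$padj < 0.01, ]

# A gene selected for several clusters is assigned to the cluster
# where its absolute log2 fold change is largest
sig <- sig[order(sig$gene, -abs(sig$log2FC)), ]
markers <- sig[!duplicated(sig$gene), ]
markers <- markers[order(markers$cluster, markers$padj), ]
markers
```

Here gene A passes the cutoffs in clusters 1 and 2 and is kept only for cluster 1, where its fold change is larger.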

I was trying to figure out another way, e.g. remove one feature (gene) at a time and see whether the clusters stay intact. But the number of genes is usually very high, and it is really difficult to re-check cluster integrity that many times.
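To make the leave-one-gene-out idea concrete, here is a rough sketch on a small simulated matrix. It deliberately swaps in a cheap proxy (a single hierarchical clustering on 1 - Spearman correlation) instead of rerunning ConsensusClusterPlus, because a full consensus run per gene is what makes the idea so expensive. `cluster_samples` and `rand_index` are hypothetical helper names, and the data are simulated.

```r
set.seed(3)
k <- 3
# Simulated stand-in for a (pre-filtered) expression matrix: 40 genes x 30 samples
d <- matrix(rnorm(40 * 30), nrow = 40,
            dimnames = list(paste0("gene", 1:40), paste0("s", 1:30)))
grp <- rep(1:k, each = 10)
d[1:5,  grp == 1] <- d[1:5,  grp == 1] + 3   # genes 1-5 mark group 1
d[6:10, grp == 2] <- d[6:10, grp == 2] + 3   # genes 6-10 mark group 2

# Cheap proxy for the clustering step (NOT consensus clustering)
cluster_samples <- function(m, k) {
  cutree(hclust(as.dist(1 - cor(m, method = "spearman")), method = "average"), k = k)
}

# Rand index: fraction of sample pairs on which two partitions agree
rand_index <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  ut <- upper.tri(same_a)
  mean(same_a[ut] == same_b[ut])
}

base <- cluster_samples(d, k)

# Impact of each gene = how much the partition changes when that gene is dropped
impact <- sapply(seq_len(nrow(d)), function(i) {
  1 - rand_index(base, cluster_samples(d[-i, , drop = FALSE], k))
})
names(impact) <- rownames(d)
head(sort(impact, decreasing = TRUE))
```

Even with this shortcut the cost grows linearly with the number of genes, so in practice one would probably restrict the loop to a pre-filtered candidate set.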

Another way could be, after clustering the samples, to cluster the genes into the same number of groups as the samples. The problem then is how to link a gene cluster to a sample cluster, i.e. how can I say that sample cluster 2 is driven by gene cluster 2? Maybe someone else can enlighten us here.
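One possible way to make that link, sketched under assumptions: cluster the genes, then average the row-scaled expression within every (gene cluster, sample cluster) pair and see where each gene cluster is most elevated. The data and labels below are simulated stand-ins for `exprs(ALL)` and `consensusClass`, and plain hierarchical clustering is used for the genes only as an example.

```r
set.seed(2)
k <- 3
# Simulated stand-ins: 'd' is genes x samples, 'cl' plays the role of consensusClass
d <- matrix(rnorm(90 * 30), nrow = 90,
            dimnames = list(paste0("gene", 1:90), paste0("s", 1:30)))
cl <- setNames(rep(1:k, each = 10), colnames(d))
for (j in 1:k) {                       # genes 1-30 mark cluster 1, 31-60 cluster 2, ...
  rows <- ((j - 1) * 30 + 1):(j * 30)
  d[rows, cl == j] <- d[rows, cl == j] + 3
}

# Cluster the genes on 1 - Spearman correlation between gene profiles
gene_dist <- as.dist(1 - cor(t(d), method = "spearman"))
gene_cl <- cutree(hclust(gene_dist, method = "average"), k = k)

# Scale each gene, then average within every (gene cluster, sample cluster) pair
z <- t(scale(t(d)))
module_means <- sapply(sort(unique(cl)), function(s) {
  tapply(rowMeans(z[, cl == s, drop = FALSE]), gene_cl, mean)
})
dimnames(module_means) <- list(paste0("geneCluster", 1:k),
                               paste0("sampleCluster", 1:k))
round(module_means, 2)

# For each gene cluster, the sample cluster in which it is most elevated
linked <- apply(module_means, 1, which.max)
linked
```

A heatmap of `module_means` gives the same information visually. On real data the mapping will rarely be one-to-one, so treat it as a descriptive summary rather than proof that a gene cluster drives a sample cluster.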