Question

How to determine the number of clusters in hierarchical clustering?

7.4 years ago

John ▴ 270

Hi, I got the following R code from a previously published paper and used it to produce the graph linked below. How do I interpret the graph to determine the number of clusters?

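# Expression matrix: genes in rows, samples in columns
all <- read.table(file="Single_TPM.txt", header=TRUE)

# Pearson correlation between samples
cor_mat <- cor(all, method="pearson")

# ASSUMPTION: hr was not defined in the posted snippet. Here it is taken to be
# a hierarchical clustering of the samples on 1 - correlation.
hr <- hclust(as.dist(1 - cor_mat))

# To determine the number of groups: within-cluster sum of squares for each k
distance_sum <- c()
for (k in 1:11) {
    branch <- cutree(hr, k=k)
    group_ids <- split(names(branch), branch)
    all_avg_matrix <- all

    for (group.n in seq_along(group_ids)) {
        group.idx <- which(colnames(all) %in% group_ids[[group.n]])
        # drop=FALSE keeps a one-sample cluster two-dimensional so rowMeans() works
        avg_exp <- rowMeans(all[, group.idx, drop=FALSE])
        all_avg_matrix[, group.idx] <- matrix(rep(avg_exp, length(group.idx)), ncol=length(group.idx), byrow=FALSE)
    }

    # Squared deviation of every sample from its cluster's mean profile
    distance_sum <- c(distance_sum, sum((all - all_avg_matrix)^2))
}
plot(seq_along(distance_sum), distance_sum, type="l")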

If any other method would be better suited, please let me know. I am clustering TPM values from RSEM output.

image link

score 3 · Answer 1 · 2018-03-06

It looks like you're using the elbow method to determine the ideal number of clusters, in which case I agree with Johannes that 3 or 4 is the ideal number, because that is where the curve bends and adding more clusters stops reducing the within-cluster sum of squares by much.

Other methods that you can use to determine ideal cluster number include:

Silhouette coefficient
Gap statistic (a parallel-processing version of this is available on my GitHub page: https://github.com/kevinblighe/clusGapKB)
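
As a rough illustration of both methods, here is a minimal sketch using the `cluster` package (`silhouette()`, `clusGap()` and `maxSE()`). The data are simulated and purely hypothetical, with samples in rows; for the matrix in the question you would need `t(all)` or a correlation-based distance instead. The average linkage, `K.max = 8` and `B = 50` are arbitrary choices for the example, not recommendations.

```r
library(cluster)  # provides silhouette(), clusGap(), maxSE()

set.seed(1)
# Simulated stand-in data (hypothetical): 30 samples (rows) x 10 features,
# drawn from three groups with shifted means
x <- rbind(matrix(rnorm(10 * 10, mean = 0),  nrow = 10),
           matrix(rnorm(10 * 10, mean = 5),  nrow = 10),
           matrix(rnorm(10 * 10, mean = 10), nrow = 10))

d  <- dist(x)
hr <- hclust(d, method = "average")

# Average silhouette width for k = 2..8; a higher value means tighter,
# better separated clusters
ks <- 2:8
avg_sil <- sapply(ks, function(k) {
    sil <- silhouette(cutree(hr, k = k), d)
    mean(sil[, "sil_width"])
})
best_k_sil <- ks[which.max(avg_sil)]

# Gap statistic: clusGap() expects a function(x, k) that returns a list
# with a 'cluster' component
hc_cut <- function(x, k) {
    list(cluster = cutree(hclust(dist(x), method = "average"), k = k))
}
gap <- clusGap(x, FUNcluster = hc_cut, K.max = 8, B = 50)

# One common rule for picking k from the gap curve and its standard errors
best_k_gap <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
                    method = "Tibs2001SEmax")

print(best_k_sil)
print(best_k_gap)
```

The two criteria do not always agree on real data, so it is worth comparing them with the elbow plot rather than trusting any single number.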

I employed all of these methods in my recently published work: Vitamin D prenatal programming of childhood metabolomics profiles at age 3 y.

score 2 · Answer 2 · 2018-03-06

7.4 years ago

caggtaagtat ★ 1.9k

Hi, I think you have to look for the knee of the curve. In this case I would try 3 or 4 clusters.
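
If you would rather not judge the knee by eye, one simple heuristic (only one of several, and not taken from the original code) is to pick the point that lies farthest from the straight line joining the first and last points of the curve. A minimal sketch, with made-up values standing in for `distance_sum`:

```r
# Hypothetical within-cluster sums of squares for k = 1..11,
# standing in for distance_sum from the question
distance_sum <- c(1000, 600, 350, 300, 270, 250, 235, 222, 212, 204, 198)
k <- seq_along(distance_sum)

# Rescale both axes to [0, 1] so that neither dominates the distance
x <- (k - min(k)) / (max(k) - min(k))
y <- (distance_sum - min(distance_sum)) /
     (max(distance_sum) - min(distance_sum))

# Perpendicular distance of each point from the line through the
# first and last points of the curve
n  <- length(x)
dx <- x[n] - x[1]
dy <- y[n] - y[1]
dist_to_line <- abs(dy * x - dx * y + x[n] * y[1] - y[n] * x[1]) /
                sqrt(dx^2 + dy^2)

knee <- k[which.max(dist_to_line)]
print(knee)
```

The result depends on the range of k you scan, so treat it as a starting point and still look at the plot.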

score 2 · Answer 3 · 2018-03-06

7.4 years ago

arta ▴ 670

In R, there is a nice package called ConsensusClusterPlus (available on Bioconductor) which helps determine the optimal number of clusters for unsupervised algorithms.
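
A minimal sketch of how it can be called, assuming the package is installed from Bioconductor. The matrix is simulated and hypothetical (genes in rows, samples in columns, as the function expects), and `maxK`, `reps` and the other settings are arbitrary example values:

```r
library(ConsensusClusterPlus)  # Bioconductor package

set.seed(1)
# Simulated stand-in data (hypothetical): 200 genes x 30 samples,
# three groups of samples that each follow their own gene profile
n_genes  <- 200
profiles <- matrix(rnorm(n_genes * 3, sd = 2), nrow = n_genes)
group    <- rep(1:3, each = 10)
d <- profiles[, group] + matrix(rnorm(n_genes * 30), nrow = n_genes)
colnames(d) <- paste0("sample", seq_len(ncol(d)))

# Repeatedly subsample 80% of the samples, cluster them hierarchically on a
# Pearson-correlation distance, and record how often each pair of samples
# ends up together, for every k from 2 to maxK
res <- ConsensusClusterPlus(d, maxK = 6, reps = 50, pItem = 0.8, pFeature = 1,
                            clusterAlg = "hc", distance = "pearson", seed = 1)

# Consensus cluster assignment of each sample at k = 3
table(res[[3]]$consensusClass)
```

The function also draws consensus matrices plus CDF and delta-area plots for each k. The usual reading is to choose the k after which the CDF stops changing much, so the choice is still made by inspection rather than returned as a single number.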