This is Rembrandt Glioma Data Analysis (PART II) - "Are Gender specific genes related to cancer?" In PART I, I showed that the Rembrandt glioma data is clusterable. The next question is: if the data is clusterable, how do we determine the optimal number of clusters?
# Open the R program
# required library
library(factoextra)
library(cluster)
library(NbClust)
I use the data set mydata_filtered_scale_1 from Rembrandt Glioma Data Analysis (PART I) - "Are Gender specific genes related to cancer?".
The three most popular methods for determining the optimal number of clusters are the elbow, silhouette, and gap statistic methods (1).
The elbow and silhouette methods can both be computed with the function fviz_nbclust() from the factoextra package; the silhouette widths themselves come from the cluster package (1).
The gap statistic is computed with clusGap() from the cluster package and can be visualized with the function fviz_gap_stat() from the factoextra package.
#I. Results of Elbow method (1) - K-means
fviz_nbclust(mydata_filtered_scale_1,kmeans,method="wss")+geom_vline(xintercept=2,linetype=2)
![enter image description here][1]
#II. Results of Elbow method (1) - PAM
fviz_nbclust(mydata_filtered_scale_1,pam,method="wss")+geom_vline(xintercept=4,linetype=2)
![enter image description here][2]
#III. Results of Elbow method (1) - hierarchical clustering
fviz_nbclust(mydata_filtered_scale_1,hcut,method="wss")+geom_vline(xintercept=4,linetype=2)
![enter image description here][3]
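To make the elbow criterion concrete, here is a minimal sketch of the quantity that method="wss" plots: the total within-cluster sum of squares for each k. It runs on a small simulated matrix called toy (a hypothetical stand-in, not the Rembrandt data), and the range of k and nstart = 25 are illustrative assumptions.

```r
# Sketch only: `toy` is simulated data, not mydata_filtered_scale_1.
set.seed(123)
toy <- rbind(matrix(rnorm(50 * 2, mean = 0), ncol = 2),
             matrix(rnorm(50 * 2, mean = 5), ncol = 2))
toy <- scale(toy)

# Total within-cluster sum of squares for k = 1..8 (k-means)
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which the curve flattens out
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The elbow is read off the plot by eye, which is why the same data can suggest a different k for k-means, PAM, and hierarchical clustering.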
#IV. Results of silhouette method (1) - K-means
fviz_nbclust(mydata_filtered_scale_1,kmeans,method="silhouette")
![enter image description here][4]
#V. Results of silhouette method (1) - PAM
fviz_nbclust(mydata_filtered_scale_1,pam,method="silhouette")
![enter image description here][5]
#VI. Results of silhouette method (1) - hierarchical clustering
fviz_nbclust(mydata_filtered_scale_1,hcut,method="silhouette",hc_method="complete")
![enter image description here][6]
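The silhouette criterion behind these plots can be sketched by hand: for each k, cluster the data, compute every sample's silhouette width, and average them. The example below uses simulated data (toy is a hypothetical stand-in for mydata_filtered_scale_1), k-means only, and an illustrative range of k.

```r
library(cluster)  # silhouette()

# Sketch only: `toy` is simulated data, not the Rembrandt matrix.
set.seed(123)
toy <- rbind(matrix(rnorm(50 * 2, mean = 0), ncol = 2),
             matrix(rnorm(50 * 2, mean = 5), ncol = 2))
toy <- scale(toy)
d <- dist(toy)  # Euclidean distances between samples

# Average silhouette width for k = 2..8 (not defined for k = 1)
k_values <- 2:8
avg_sil <- sapply(k_values, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- k_values

# The suggested k is the one with the largest average width
best_k <- k_values[which.max(avg_sil)]
print(round(avg_sil, 3))
print(best_k)
```

Unlike the elbow plot, this gives a single number to maximize, so the suggested k does not depend on a visual judgment.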
#VII. Results of Gap statistic (1) - K-means
Optimal number of clusters: k = 10
#VIII. Results of Gap statistic (1) - PAM
Optimal number of clusters: k = 11
#IX. Results of Gap statistic (1) - hierarchical clustering
Optimal number of clusters: k = 11
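The post gives no code for the gap statistic results, so the following is only a sketch of how such numbers are usually obtained with clusGap() and fviz_gap_stat(). It runs on simulated data (toy is a hypothetical stand-in), and K.max, B, and nstart are illustrative assumptions, not the settings used for the results above.

```r
library(cluster)     # clusGap(), maxSE()
library(factoextra)  # fviz_gap_stat()

# Sketch only: `toy` is simulated data, not mydata_filtered_scale_1.
set.seed(123)
toy <- rbind(matrix(rnorm(60 * 5, mean = 0), ncol = 5),
             matrix(rnorm(60 * 5, mean = 4), ncol = 5))
toy <- scale(toy)

# Gap statistic for k = 1..K.max with B reference data sets (assumed values)
gap_km <- clusGap(toy, FUNcluster = kmeans, K.max = 8, B = 50, nstart = 25)

# Pick k with the "firstSEmax" rule (one of several rules maxSE() offers)
k_hat <- maxSE(gap_km$Tab[, "gap"], gap_km$Tab[, "SE.sim"],
               method = "firstSEmax")
print(k_hat)

# Plot the gap curve with its standard-error bars
p <- fviz_gap_stat(gap_km)
print(p)
```

To reproduce values such as 10 or 11 clusters on the real data, K.max would have to be at least that large. For PAM, a small wrapper such as function(x, k) list(cluster = pam(x, k, cluster.only = TRUE)) can be passed as FUNcluster.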
The NbClust package provides 30 indices for determining the relevant number of clusters.
nb <- NbClust(mydata_filtered_scale_1, distance = "euclidean", min.nc = 2, max.nc = 10, method = "complete", index = "all")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 10 proposed 2 as the best number of clusters
* 1 proposed 3 as the best number of clusters
* 1 proposed 4 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 5 proposed 6 as the best number of clusters
* 1 proposed 7 as the best number of clusters
* 2 proposed 10 as the best number of clusters
***** Conclusion *****
**According to the majority rule, the best number of clusters is 2**
*******************************************************************
> fviz_nbclust(nb)+theme_minimal()
Among all indices:
===================
* 2 proposed 0 as the best number of clusters
* 10 proposed 2 as the best number of clusters
* 1 proposed 3 as the best number of clusters
* 1 proposed 4 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 5 proposed 6 as the best number of clusters
* 1 proposed 7 as the best number of clusters
* 2 proposed 10 as the best number of clusters
* 3 proposed NA's as the best number of clusters
Conclusion
=========================
**According to the majority rule, the best number of clusters is 2**
# The 549 Rembrandt samples, described by 43 genes, may therefore fall into 2 clusters.
Reference
(1) Determining the optimal number of clusters: 3 must known methods - Unsupervised Machine Learning (http://www.sthda.com)