I have a dataset of 150 genes and 100 samples. I want to cluster the genes with k-means, but I don't know how many clusters to use. How can I select the best number?
Some of the ways to find "k" in k-means are:
Here is the implementation of K-means and elbow method in R
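For readers who want to see the idea behind the elbow method spelled out, here is a minimal sketch in plain Python rather than R. It runs k-means for several values of k and records the within-cluster sum of squares (WSS); the "elbow" is the k after which WSS stops dropping sharply. Everything here is an assumption for illustration: the function names (`kmeans_wss`, `farthest_first`), the toy 2-D data with three planted groups, and the deterministic farthest-first initialization, which is used only to make the run reproducible and is not how R's `kmeans` picks its starting centers.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic start: take the first point, then repeatedly add the
    point that is farthest from its nearest already-chosen centroid."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        for p in points
    ]


def kmeans_wss(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centroids = farthest_first(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:  # assignments no longer change
            break
        centroids = updated
    labels = assign(points, centroids)
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


# Toy data (an assumption): three well-separated 2-D groups of 20 points each.
rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
    for cx, cy in centers
    for _ in range(20)
]

wss_by_k = {k: kmeans_wss(points, k) for k in range(1, 7)}
for k, wss in wss_by_k.items():
    print(f"k={k}  WSS={wss:.1f}")
```

With real expression data each point would be one gene's vector of 100 sample values instead of a 2-D tuple. On noisy data the curve often has no sharp bend, which is one reason the elbow is usually read off a plot rather than computed.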
I usually first build a couple of dendrograms with hierarchical clustering using different methods e.g. complete linkage and Ward's linkage to get an idea of the structures present in the data. The problem is that k-means will give you the clusters you requested no matter whether there's structure in the data or not. Once you know there's structure you can either cut the tree or use k-means with the number of clusters found in the dendrogram. Also with 100-dimensional vectors, you should probably not use Euclidean distance if the data is noisy. Finally, there are also a few clustering algorithms that don't require the number of clusters as input e.g. DBSCAN (dbscan package in R) although they often require setting other parameters which are not necessarily easier to estimate.
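To make the point about Euclidean distance concrete, here is a small Python sketch comparing it with a correlation-based distance (1 minus the Pearson correlation), which is one alternative often used for expression profiles. The choice of this particular distance is an assumption for illustration, not something prescribed above, and the gene vectors and function names are made up.

```python
import math


def euclidean(a, b):
    """Ordinary Euclidean distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_a * var_b)


def correlation_distance(a, b):
    """0 for perfectly correlated profiles, 2 for perfectly anti-correlated."""
    return 1.0 - pearson(a, b)


# Hypothetical profiles over five samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [10.0, 20.0, 30.0, 40.0, 50.0]  # same pattern as gene_a, larger scale
gene_c = [5.0, 4.0, 3.0, 2.0, 1.0]       # opposite pattern, same scale

print("Euclidean   a-b:", euclidean(gene_a, gene_b), " a-c:", euclidean(gene_a, gene_c))
print("Correlation a-b:", correlation_distance(gene_a, gene_b),
      " a-c:", correlation_distance(gene_a, gene_c))
```

Euclidean distance puts `gene_a` closer to the anti-correlated `gene_c` than to `gene_b`, while the correlation distance groups the two genes that share a pattern. Which behaviour is wanted depends on the biological question.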
Thank you, Jean-Karim Heriche. So in your opinion, it is better to first build a dendrogram with hierarchical clustering and then choose the number of clusters for k-means? I don't know the mathematics behind these methods very well. Would it be better to use hierarchical clustering instead of k-means?
It is always good to start with some visual exploration of the data before clustering. The goals are first to find out whether there are some detectable structures and second to try to get an idea of their shapes. This last bit is important because k-means can only find spherical-shaped clusters. Hierarchical clustering is a quick and easy way to go about looking for structures. You could also try various kinds of plots e.g. PCA, MDS. If you can validate the clusters a posteriori then you could try different similarity/distance measures with different clustering approaches and select what gives you the best results.
There's an interesting modification to k-means where, instead of setting the number of clusters explicitly, you minimize the expression
$\sum_i \lVert x_i - \mu_i \rVert^2 + \sum_{i,j} \lVert \mu_i - \mu_j \rVert$
(IIRC). The $\mu$s represent cluster centroids, and the minimization forces them to be as few as possible, while minimizing the distance from the data points $x_i$ to the corresponding centroid. I can see if I can find a reference, if it's of interest.
The silhouette approach is implemented in R in the package cluster.
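For anyone curious what the silhouette score measures, here is a minimal Python sketch of the mean silhouette width for a given labeling. For each point, a is the mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster; its silhouette is (b - a) / max(a, b). The toy points, the two candidate labelings, and the name `mean_silhouette` are assumptions for illustration, and singletons are given a silhouette of 0 by convention.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette width over all points for one clustering."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster: silhouette 0 by convention
        # Distance from p to itself is 0, so dividing by len(own) - 1
        # gives the mean distance to the *other* members.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy data (an assumption): two tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
two_clusters = [0, 0, 0, 1, 1, 1]    # matches the planted groups
three_clusters = [0, 0, 2, 1, 1, 1]  # splits one group unnecessarily

print("k=2:", round(mean_silhouette(points, two_clusters), 3))
print("k=3:", round(mean_silhouette(points, three_clusters), 3))
```

The labeling with the higher mean silhouette is preferred, so in practice one would compute this for the k-means result at each candidate k and compare.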
Yes, and it suggests two clusters, but the Calinski approach suggests 11.
Thank you, Ar. I tried all the methods in this link, http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters , but I got more confused, because each method suggested a different number, like 2, 5, or 11! Dear Ar, I can't upvote your answer. I think you should comment on my post, and then I can upvote it.
You get a different number of clusters with different methods because they look at different things. However, in the case of very well defined clusters, they would tend to give the same number but this is rather rare because real data is noisy.
Perhaps the clustering makes sense at different levels. Maybe at k=2 the hypothetical genes split by "healthy" and "cancer" labels, say. At k=5, they split into two "healthy" subgroups and three "cancer" subgroups, and so on. These groupings might be biologically relevant in different respects. Further exploration of the clusterings is useful to see how and why things are falling out.