Background: I have multiple data matrices, each with 2,000 columns and ~21,0000 rows. I am performing a K-means analysis and then producing a heatmap of the cluster data. I am working in R.
Problem: Rather than pre-selecting a K-means cluster number and choosing by trial and error which plot "looks best", I am trying to use a tool that applies something like the elbow or silhouette method to determine the optimal cluster number. I have tried nclust (before running nclust, I used the amap package to calculate the distance matrix). My problem is that it has not finished after ~5 hours, and I receive no errors or warnings. I am remoting into a server for this data and eventually lose the connection, so I cannot wait many hours, quite apart from the practical considerations.
Question: Is there a practical solution or tool that can handle a large matrix when determining the optimal cluster number for a k-means analysis?
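For reference, here is a minimal sketch of the elbow method mentioned above, written so that it never needs a full distance matrix. It assumes that estimating the cluster number on a random subsample of rows is acceptable, which is an approximation. The toy matrix `mat`, the helper `elbow_wss()`, and the values of `k_max` and `n_sub` are all hypothetical and only for illustration; it uses base R `kmeans()` only.

```r
# Elbow sketch: total within-cluster sum of squares (WSS) for k = 1..k_max,
# computed on a random subsample of rows so that it finishes quickly.
set.seed(1)

# Toy stand-in for one real data matrix (hypothetical; replace with your own).
# Three groups of rows with different means, 20 columns.
mat <- rbind(
  matrix(rnorm(300 * 20, mean = 0), ncol = 20),
  matrix(rnorm(300 * 20, mean = 4), ncol = 20),
  matrix(rnorm(300 * 20, mean = 8), ncol = 20)
)

# Hypothetical helper: returns one WSS value per candidate k.
elbow_wss <- function(x, k_max = 10, n_sub = 500, nstart = 5, iter_max = 50) {
  # Subsample rows; the result only approximates the full-data curve.
  idx <- sample(nrow(x), min(n_sub, nrow(x)))
  sub <- x[idx, , drop = FALSE]
  vapply(seq_len(k_max), function(k) {
    kmeans(sub, centers = k, nstart = nstart, iter.max = iter_max)$tot.withinss
  }, numeric(1))
}

wss <- elbow_wss(mat, k_max = 8)
```

Plotting the result with `plot(seq_along(wss), wss, type = "b")` should show the bend where adding clusters stops reducing the WSS much. Repeating the call with a few different seeds or subsample sizes gives a rough check on whether the bend is stable.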
Thank you, I will look into this.
Coincidentally, there was a recent similar question on Bioconductor: "Optimal cluster number identification using buildSNNgraph and igraph clusters"
Regarding cluster::clusGap(), it can be terribly slow - I have enabled it for parallel processing.
Actually, that is my question on Bioconductor. I wanted to ask Aaron about this because it relates to his scran package. The OP here was just quicker than me; however, this question is about k-means, whereas I was more interested in optimizing graph-based clustering. Sorry if it felt like a double post!
No, no problem. The question in this thread is from crcarroll. As I mentioned in my own answer on Bioconductor, there's no right or wrong answer. Aaron's knowledge definitely supersedes mine in this area, though.