Optimal K-means cluster # for large data (R)
4.6 years ago
crcarroll ▴ 90

Background: I have multiple data matrices, each with 2,000 columns and ~21,0000 rows. I am performing a K-means analysis and then producing a heatmap of the cluster data. I am working in R.

Problem: Rather than pre-selecting a number of clusters and choosing by trial and error whichever plot "looks best", I am trying to use a tool that applies something like the elbow or silhouette method to determine the optimal number of clusters. I have tried nclust (before running nclust, I used the amap package to calculate the distance matrix). My problem is that it has not finished running after ~5 hours, and I receive no errors or warnings. I am remoting into a server for this data and eventually lose the connection, so waiting many hours is not practical.

Question: Is there a practical solution or tool that can handle a large matrix for determining optimal cluster # for a k-means analysis?

R kmeans • 2.7k views
4.6 years ago
piyushjo ▴ 710

What about the gap statistic? See the clusGap() function in the cluster package. I think with single-cell data you first run PCA and then use the reduced dimensions for k-means clustering. Have a look at the link below; it is written for single-cell data, but I am sure it can be applied to your problem.

https://osca.bioconductor.org/clustering.html
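A minimal sketch of the PCA-then-gap-statistic idea described above, in case it helps. The toy matrix, the number of principal components (10), the subsample size (300), and the values of K.max and B are all assumptions chosen so the example runs quickly; they are not recommendations for the real data. Running clusGap() on a random subset of rows is only one possible way to keep the run time down.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(1)

# Toy stand-in for the real matrix (assumption): 600 rows x 50 columns,
# rows 1-200, 201-400 and 401-600 drawn around means 0, 5 and 10
mat <- matrix(rnorm(600 * 50, mean = rep(c(0, 5, 10), each = 200)),
              nrow = 600, ncol = 50)

# Reduce dimensionality first: keep the top 10 principal components
pcs <- prcomp(mat, center = TRUE, scale. = FALSE, rank. = 10)$x

# Estimate the gap statistic on a random subset of rows to limit run time
idx <- sample(nrow(pcs), 300)
sub <- pcs[idx, , drop = FALSE]

# Gap statistic for k = 1..6; B is the number of simulated reference data sets.
# nstart and iter.max are passed through to kmeans()
gap <- clusGap(sub, FUNcluster = kmeans, K.max = 6, B = 20,
               nstart = 10, iter.max = 50, verbose = FALSE)

# Pick k with the rule from Tibshirani et al. (2001)
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
                method = "Tibs2001SEmax")

# Cluster all rows (in PCA space) with the chosen k
fit <- kmeans(pcs, centers = best_k, nstart = 10, iter.max = 50)
table(fit$cluster)
```

The selection rule passed to maxSE() can change which k is picked, so it is worth looking at the full gap$Tab table (or plot(gap)) rather than relying on best_k alone.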

Thank you, I will look into this.

Coincidentally, there was a recent similar question on Bioconductor: Question: Optimal cluster number identification using buildSNNgraph and igraph clusters

Regarding cluster::clusGap(), it can be terribly slow - I have enabled it for parallel processing:
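The parallel clusGap() code mentioned above is not included in this thread. As a separate and much cheaper check, and not a substitute for that code, the elbow curve (total within-cluster sum of squares for each k) can be computed with one k per worker, since the runs are independent. This is only a sketch: the toy matrix, the range of k, and the choice of 2 workers are assumptions, and parallel::mclapply() relies on forking, so it falls back to a single core on Windows.

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")  # RNG type suited to reproducible forked workers
set.seed(1)

# Toy stand-in for the real (ideally PCA-reduced) matrix (assumption):
# 600 rows x 20 columns in three groups around means 0, 5 and 10
mat <- matrix(rnorm(600 * 20, mean = rep(c(0, 5, 10), each = 200)),
              nrow = 600)

ks <- 1:8

# Forking is unavailable on Windows, so use one core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One k-means run (with 10 random starts) per value of k, spread over workers
wss <- unlist(mclapply(ks, function(k) {
  kmeans(mat, centers = k, nstart = 10, iter.max = 50)$tot.withinss
}, mc.cores = n_cores))
names(wss) <- ks

print(round(wss))
```

Plotting wss against ks (for example with plot(ks, wss, type = "b")) and looking for the point where the curve flattens gives the usual elbow estimate.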

Actually, that is my question on Bioconductor. I wanted to ask Aaron about this since it relates to his scran package. The OP here was just quicker than me; however, this question is about k-means, whereas I was more interested in optimizing graph-based clustering. Sorry if it felt like a double post!

No, no problem. The question in this thread here is from crcarroll. As I mentioned in my own answer on Bioconductor, there's no right or wrong answer. Aaron's knowledge definitely supersedes mine in this area though.
