Question

K-means With Many Clusters: Quick-TRANSfer steps exceeded

6.7 years ago

Ark ▴ 90

Hello!

I am attempting to run the default kmeans function in RStudio on TPM-normalized and feature-scaled RNA-seq data. I am varying the number of centers I present to kmeans (logarithmically increasing 'k' from 1 - 10,000) but I have hit a snag.

I am calling the kmeans function in the following way:

clust <- kmeans(my_data, k, iter.max = 40)

And I have found that unless I specify a particular seed value I receive the warning:

Quick-TRANSfer stage steps exceeded maximum (= 729900)

This also happens if I increase the number of random starts (nstart) from the default of 1.

I believe that I know why this is happening, but I am not entirely sure, and even if my hunch is correct I still don't know how to fix this.

What I think is happening:

I believe this warning occurs because too many points are too similar in value, so kmeans has difficulty placing them in one particular cluster. Basically, I think that some points are being reassigned back and forth between clusters without ever "settling" on one cluster in particular.
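One rough way to check this hunch is to count how many rows of the matrix are exact duplicates of one another. The sketch below uses a synthetic stand-in for my_data (the real data is not shown in this post), so the matrix and its dimensions are assumptions made purely for illustration. Note that it only detects exact duplicates, not rows that are merely very close.

```r
# Synthetic stand-in for my_data (hypothetical): 200 rows drawn from only
# 20 distinct base rows, so many rows are exact duplicates.
set.seed(1)
base_rows <- matrix(rnorm(20 * 5), nrow = 20)
my_data <- base_rows[sample(20, 200, replace = TRUE), ]

# unique() and duplicated() operate row-wise on a matrix by default.
n_distinct <- nrow(unique(my_data))   # number of distinct rows
n_dup <- sum(duplicated(my_data))     # rows that repeat an earlier row

cat("rows:", nrow(my_data),
    "distinct:", n_distinct,
    "duplicates:", n_dup, "\n")
```

If the number of distinct rows is small relative to the largest k being tested, that would be consistent with the explanation above, although it does not prove it.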

What I have tried to fix the problem:

I have tried different seed values, and the warning appears for some seeds but not others, with no obvious pattern.
I have tried varying the iter.max value (from the default of 10 up to a maximum of 80) without any luck.
I have tried calling the garbage collector (gc()) before the kmeans function, as some users had reported the warning disappearing after clearing memory, but this did not work for me.
I have tried using a different algorithm (Lloyd); however, even with iter.max set to 80, it still failed to converge. On top of that, I would really prefer to use Hartigan-Wong (H-W) if at all possible, as I am analyzing the way kmeans is generally used and therefore need to stay close to the default settings.

I am not sure what else I can try to resolve the issue. Any suggestions would be appreciated!

Thank you!

kmeans rna-seq R • 10k views


6.7 years ago

Chirag Parsania ★ 2.0k

Hi,

You can try a recently published clustering method called "Clust". Here is the paper. With this method, the user does not need to define the number of clusters. The method detects the number of clusters itself and also removes observations that do not contribute to the variability across the samples. There is also an online version, where the user just needs to upload the matrix as a .txt file.


Thanks for the link! This looks interesting and I will definitely try it out on my data!

Reply by Ark ▴ 90, 6.7 years ago

score 1 · Accepted Answer · 2018-11-18

I have been working with this some more and have come to the conclusion that this particular warning is practically unavoidable when I push the number of clusters as high as I am. From what I have read, with an extreme number of clusters and very similar values among the data (many practically equivalent), the algorithm will have trouble converging in a reasonable amount of time. I think my initial hunch in the original post was correct.

For anyone with the same issue: My solution was simply to run kmeans repeatedly for every desired number of clusters (I did 10 runs per k value) and to completely disregard any run that returns an "ifault" value of 4. This value accompanies the Quick-TRANSfer warning and indicates that the algorithm couldn't converge in what it considers a reasonable number of steps. Admittedly, as k increases, kmeans takes longer and longer to run, and compounding that with repeated runs is not ideal. However, I have not found another way around this particular issue in the extreme cases where very large numbers of clusters need to be used. Using another algorithm may help (Lloyd, MacQueen, etc.), but in my case I really needed to use the Hartigan-Wong algorithm.
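The repeat-and-filter approach described above can be sketched roughly as follows. This is a minimal illustration, not the exact script I ran: the helper name run_kmeans_filtered, the toy matrix, and the choice of k = 5 are all assumptions made for the example. As far as I can tell from the stats::kmeans source, the step limit reported in the warning is 50 times the number of rows, which would match the 729900 shown in the original post.

```r
# Hypothetical helper: run kmeans n_runs times for a single k and keep only
# the fits whose ifault is not 4 (i.e. the Quick-TRANSfer limit was not hit).
run_kmeans_filtered <- function(x, k, n_runs = 10, iter.max = 40) {
  fits <- lapply(seq_len(n_runs), function(i) {
    # suppressWarnings() only silences the message; the ifault code is
    # still stored in the returned object.
    suppressWarnings(
      kmeans(x, centers = k, iter.max = iter.max,
             algorithm = "Hartigan-Wong")
    )
  })
  Filter(function(fit) fit$ifault != 4L, fits)
}

set.seed(42)
toy <- matrix(rnorm(300 * 4), nrow = 300)  # synthetic stand-in data
kept <- run_kmeans_filtered(toy, k = 5)

cat("runs kept:", length(kept), "of 10\n")

# Among the retained runs, pick the one with the lowest total
# within-cluster sum of squares (if any run survived the filter).
if (length(kept) > 0) {
  wss <- vapply(kept, function(fit) fit$tot.withinss, numeric(1))
  best <- kept[[which.min(wss)]]
}
```

Filtering on ifault rather than catching the warning text keeps the check independent of the message wording. For large k, some k values may end up with no surviving runs at all, so it is worth checking length(kept) before using the results.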

Thanks to anyone who read this! I'm sure I'll have more questions for you all soon!