Hello!
I am attempting to run the default kmeans function in RStudio on TPM-normalized and feature-scaled RNA-seq data. I am varying the number of centers I pass to kmeans (increasing k logarithmically from 1 to 10,000), but I have hit a snag.
I am calling the kmeans function in the following way:
clust <- kmeans(my_data, k, iter.max = 40)
And I have found that unless I specify a particular seed value I receive the warning:
Quick-TRANSfer stage steps exceeded maximum (= 729900)
This also happens if I increase the number of random starts (nstart) from the default of 1.
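For reference, here is a minimal sketch of how the warning could be recorded per value of k instead of only being printed. The names toy_data, k_values and run_kmeans are hypothetical, and the random matrix is just a stand-in for my_data; only kmeans, set.seed and withCallingHandlers are base R.

```r
# Hypothetical stand-in for my_data: 500 samples x 20 scaled features.
set.seed(1)
toy_data <- scale(matrix(rnorm(500 * 20), nrow = 500))

# Log-spaced candidate values of k (here 1 to 100 to keep the toy run fast).
k_values <- unique(round(10^seq(0, 2, length.out = 5)))

# Run kmeans once with a fixed seed and collect any warning messages.
run_kmeans <- function(x, k, seed, iter.max = 40) {
  set.seed(seed)
  warned <- character(0)
  fit <- withCallingHandlers(
    kmeans(x, centers = k, iter.max = iter.max),
    warning = function(w) {
      # Store the message and stop it from being printed.
      warned <<- c(warned, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  list(fit = fit, warnings = warned)
}

results <- lapply(k_values, function(k) run_kmeans(toy_data, k, seed = 42))

# One row per k, with the number of warnings that run produced.
summary_table <- data.frame(
  k = k_values,
  n_warnings = vapply(results, function(r) length(r$warnings), integer(1))
)
print(summary_table)
```

Looping the same helper over several seeds would show whether the warning is tied to particular values of k or appears across the whole range.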
I believe that I know why this is happening, but I am not entirely sure, and even if my hunch is correct I still don't know how to fix this.
What I think is happening:
I believe this warning appears because too many points are too similar in value, so kmeans has difficulty placing them in one particular cluster. Basically, I think that some points are being reassigned back and forth between clusters without ever "settling" on one cluster in particular.
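One rough way to check this hunch is to count exact duplicate rows and look at the smallest nonzero pairwise distance. The sketch below uses a hypothetical toy matrix with ten rows duplicated on purpose; it is not the real data, and dist() needs memory quadratic in the number of rows, so on a large matrix it would probably have to be run on a subsample.

```r
# Hypothetical stand-in for my_data, with 10 rows deliberately duplicated.
set.seed(1)
base_rows <- matrix(rnorm(50 * 5), nrow = 50)
toy_data <- rbind(base_rows, base_rows[1:10, ])

n_rows <- nrow(toy_data)
n_distinct <- nrow(unique(toy_data))     # unique() on a matrix drops duplicate rows
n_duplicated <- sum(duplicated(toy_data))

cat("rows:", n_rows,
    " distinct rows:", n_distinct,
    " duplicated rows:", n_duplicated, "\n")

# Smallest nonzero pairwise distance as a crude indicator of near-ties.
d <- as.vector(dist(toy_data))
min_nonzero <- min(d[d > 0])
cat("smallest nonzero pairwise distance:", min_nonzero, "\n")
```

If the number of distinct rows turns out to be smaller than, or close to, the largest k being requested, that alone could explain unstable assignments at the high end of the range.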
What I have tried to fix the problem:
- I have tried different seed values and they seem to produce the warning randomly
- I have tried to vary the iter.max value (from the default of 10 up to a max of 80) without any luck
- I have tried calling the garbage collector (gc()) before the kmeans function as some users had reported the warning disappearing after clearing memory, but this did not work for me
- I have tried using a different algorithm (Lloyd), but even with iter.max set to 80 it still failed to converge. On top of that, I would really prefer to use Hartigan-Wong if at all possible, as I am analyzing the way kmeans is generally used and therefore need to stay close to the default settings
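A possible way to stay on Hartigan-Wong is to retry over a list of seeds and keep the first fit whose ifault component is 0. This is only a sketch under assumptions: fit_with_retries and toy_data are hypothetical names, and the data is random. As far as I can tell from the R sources, the reported maximum is 50 times the number of rows (729900 = 50 x 14598) and is not controlled by iter.max, which would explain why raising iter.max had no effect; please verify that against your R version.

```r
# Hypothetical stand-in for my_data.
set.seed(1)
toy_data <- scale(matrix(rnorm(300 * 10), nrow = 300))

# Try each seed in turn; return the first Hartigan-Wong fit that reports
# no algorithm problem (ifault == 0), or NULL if every seed fails.
fit_with_retries <- function(x, k, seeds, iter.max = 40) {
  for (s in seeds) {
    set.seed(s)
    fit <- suppressWarnings(
      kmeans(x, centers = k, iter.max = iter.max, algorithm = "Hartigan-Wong")
    )
    if (fit$ifault == 0L) {
      return(list(fit = fit, seed = s))
    }
  }
  NULL
}

res <- fit_with_retries(toy_data, k = 8, seeds = 1:20)
if (is.null(res)) {
  cat("no seed gave a clean fit\n")
} else {
  cat("clean fit with seed", res$seed,
      "after", res$fit$iter, "iterations\n")
}
```

Recording the seed that was used keeps each run reproducible, which matters if the results are meant to reflect default kmeans behaviour.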
I am not sure what else I can try to resolve the issue. Any suggestions would be appreciated!
Thank you!
See also: https://stackoverflow.com/questions/21382681/kmeans-quick-transfer-stage-steps-exceeded-maximum
Yes, thank you. That was basically what I assumed my issue was. The solution they proposed for using a different algorithm was not applicable in my case. Also, I do actually want an extreme number of clusters, as I am running some tests that need both very low and very high numbers of clusters.
Thank you for your reply!