I am not sure whether this would be more appropriate on a stats forum, but I feel some here may have knowledge of this.
It is said that K-means clustering "does not work well with non-globular clusters."
However, is this a hard-and-fast rule, or is it just that it often does not work well?
I have a 2-d data set (specifically, depth of coverage and breadth of coverage of genome sequencing reads across different genomic regions). In short, I am expecting two clear groups in this dataset (with notably different depth and breadth of coverage), and by defining the two groups I can avoid making an arbitrary cut-off between them.
The procedure appears to successfully identify the two expected groupings; however, the clusters are clearly not globular. Is this a valid application? I am not sure whether I am violating any assumptions (if there are any?), or whether it is simply that k-means often does not work well with non-spherical clusters.
If the question being asked is whether the data can be partitioned such that, for both parameters, members of a group are closer to the mean of their own group than to the mean of the other group, then the answer appears to be yes. But is it valid? Or is it simply a case of "if it works, it's OK"?
(Apologies, I am very much a stats novice.)
Edit: below is a visual of the clusters. The breadth of coverage runs from 0 to 100% of the region being considered. The depth runs from 0 to essentially infinity (I have log-transformed this parameter because some regions of the genome are repetitive, so reads from other areas of the genome may map to them, resulting in very high depth; again, please correct me if this is not the right thing to do, statistically, prior to clustering).
As you can see, the red cluster is now reasonably compact thanks to the log transform; however, the yellow (gold?) cluster is not. Despite this, and without going into detail, the two groups make biological sense, both given their resulting members and the fact that two distinct groups would be expected prior to the test. Since the clustering minimizes the within-group variance (equivalently, maximizes the between-group variance), surely this is the best place to make the cut-off between regions tending towards zero coverage (never exactly zero, due to incorrect mapping of reads?) and those with distinctly higher breadth/depth of coverage. The algorithm converges very quickly, in fewer than 10 iterations.
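For reference, here is a minimal sketch of the kind of workflow I am describing, using scikit-learn's KMeans. The data and variable names are made up for illustration, not my actual pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up example data: one group hugging zero coverage, one with high coverage.
rng = np.random.default_rng(0)
depth = np.concatenate([rng.exponential(0.5, 200), rng.exponential(50.0, 200)])
breadth = np.concatenate([rng.uniform(0, 20, 200), rng.uniform(60, 100, 200)])

# Log-transform depth (a pseudocount keeps zero-depth regions finite).
log_depth = np.log10(depth + 1)

# k-means uses Euclidean distance, so putting the two features on a common
# scale stops breadth (0-100) from swamping the much smaller log-depth values.
X = StandardScaler().fit_transform(np.column_stack([log_depth, breadth]))

# Two expected groups, so k = 2.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
print(np.bincount(km.labels_))
```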
Thanks
Thanks Karl, I have updated my question with a graph of the clusters. It would be great if you could comment on whether the log transform is appropriate and whether the clusters seem reasonable, given that two sets would be expected: one relating to regions which in reality should have zero coverage, and the other to regions with distinctly higher depth and breadth of coverage.
k-means is a type of unsupervised clustering, so it will find clusters wherever they may lie. In this case it looks like you expect one group towards the upper right and the other towards the lower left, so a diagonal line delineating them might be a good idea. Both SVM and LDA are techniques that will quickly define that diagonal line for you; an SVM in particular will keep it as far away from any one point as possible.
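To illustrate the idea, here is a rough sketch (on made-up two-group data, not your coverage values) of fitting a linear SVM to an existing two-group labelling, e.g. the k-means labels, and reading off the separating line:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Tiny made-up example: two separated blobs in (log_depth, breadth) space.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.2, 10], 0.5, (50, 2)),
               rng.normal([1.8, 80], 0.5, (50, 2))])
labels = np.repeat([0, 1], 50)

# Linear SVM decision boundary (approximately the maximum-margin line
# when the two groups are cleanly separable).
svm = LinearSVC(C=1.0).fit(X, labels)
w0, w1 = svm.coef_[0]
b = svm.intercept_[0]
print(f"boundary: {w0:.3f}*log_depth + {w1:.3f}*breadth + {b:.3f} = 0")
```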
I still don't really understand your reason for using or needing any kind of clustering, but that's irrelevant if we take the "two clusters" as an assumption, as you seem to be doing. Assuming a biological reason for two distinct clusters is outside the scope of an algorithmic discussion. Other algorithms could tell you how many clusters exist and whether they are well defined, and in that case log-transforming the depth would seriously affect the results. For this instance, I think the log transform is a good idea, because once the depth is above some limit you don't really care how high it gets.
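As a sketch of what I mean by other algorithms telling you how many clusters exist and how well defined they are, one simple diagnostic is to compare silhouette scores across a few values of k (again on made-up data, just to show the shape of the check):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up coverage-like data, only to illustrate the diagnostic.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0.2, 10], 0.5, (100, 2)),
               rng.normal([1.8, 80], 0.5, (100, 2))])

# Higher silhouette = more compact, better-separated clusters for that k.
for k in range(2, 6):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels_k), 3))
```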
Yes, when the depth gets really high I don't really care how high it gets, so this seems OK given that I am expecting two "clusters" (see below). But, as you point out, they are not very cluster-like; the data points are simply separated, so it may be that I could apply something simpler.
Thanks for the edit, that really helps to explain things. As Jean-Karim points out, perhaps any clustering would work here because of the separation.