Dear all,
I am a bit confused about hierarchical clustering of non-normal data. I have read posts saying that the underlying distribution does not matter when choosing a distance measure, and others saying that it does. There are more posts, but from the latter post I gathered that non-normal data should be used in conjunction with Spearman correlation as a distance measure. Can someone explain this to me? Also, how robust is this choice to violations? I couldn't find a paper or a textbook on this.
I generated a heatmap with hierarchical clustering in R and tried Euclidean distance and Spearman correlation as distance measures. I get a nice, sensible plot with Euclidean and a horrible one with Spearman (both in conjunction with complete linkage as the agglomeration method). As I am interested in the magnitude of the values when clustering the samples, a distance metric such as Euclidean (and not correlation) makes sense to me (as also mentioned in this post). For clustering genes, it makes sense to use correlation, as there one is mostly interested in patterns; this is the example I see everywhere. But in my case I would like to cluster by the magnitude of the values, and I want to cluster the samples, not the variables. So, what do I do with non-normal data that I want to cluster without using correlation?
I have continuous data that is scaled before clustering.
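In case it helps, here is a minimal sketch of the two workflows I compared, with placeholder data standing in for my actual matrix (the object names and dimensions are made up):

```r
# Placeholder data: 20 samples (rows) x 10 variables (columns)
mat <- matrix(rnorm(200), nrow = 20)
mat <- scale(mat)  # column-wise centering and scaling, as described above

# Euclidean distance between samples, complete linkage
d_euc  <- dist(mat, method = "euclidean")
hc_euc <- hclust(d_euc, method = "complete")

# Spearman correlation turned into a dissimilarity between samples
# (t() so that cor() correlates samples, not variables)
d_spe  <- as.dist(1 - cor(t(mat), method = "spearman"))
hc_spe <- hclust(d_spe, method = "complete")

# e.g. heatmap(mat, Rowv = as.dendrogram(hc_euc))
```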
I also had a look at the distribution of the data. My question here is – what actually needs to be normally distributed for clustering?
- All values combined (all data points)? Here I get a wonderfully bell-shaped curve and a very nice Q-Q plot (generated with the `qqnorm` function).
- Or each sample individually? Here many of the samples exhibit an okay-ish Q-Q plot, but not all.
- Or the variables? Here many of the variables exhibit an okay-ish Q-Q plot; the data is not heavily skewed. (A sketch of these three checks follows below.)
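For concreteness, these are the three checks I did, sketched on placeholder data:

```r
# Placeholder data: 20 samples (rows) x 10 variables (columns)
mat <- matrix(rnorm(200), nrow = 20)

# 1. All values combined
qqnorm(as.vector(mat), main = "All values"); qqline(as.vector(mat))

# 2. A few samples individually
par(mfrow = c(2, 2))
for (i in 1:4) { qqnorm(mat[i, ], main = paste("Sample", i)); qqline(mat[i, ]) }

# 3. A few variables individually
for (j in 1:4) { qqnorm(mat[, j], main = paste("Variable", j)); qqline(mat[, j]) }
par(mfrow = c(1, 1))
```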
The clustering with Euclidean distance generates nice clusters that make sense and also fit the underlying values (I visually compared boxplots). So, my final question: is it OK to use Euclidean distance in my case?
Any advice is very much appreciated.
Thank you for your answer! Would 'ward.D2' as a linkage method also be ok? Or is there more to consider?
Ward's linkage also assumes globular clusters. In my experience, Ward's, complete, and average linkages often produce similar outcomes. If you have non-globular clusters, you could try single linkage, but in my experience density-based clustering or clustering after a change of representation (e.g. dimensionality reduction) may work better in such cases.
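If you want to see how similar the linkages are on your data, you can fit several of them on the same distance matrix; a sketch on placeholder data:

```r
# Placeholder data; replace with your scaled matrix
mat <- scale(matrix(rnorm(200), nrow = 20))
d   <- dist(mat)  # Euclidean by default

# Same distances, different linkage methods
par(mfrow = c(2, 2))
for (m in c("complete", "average", "ward.D2", "single")) {
  plot(hclust(d, method = m), main = paste("Linkage:", m), xlab = "", sub = "")
}
par(mfrow = c(1, 1))
```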
Again thank you for your answer! The clustering is indeed very similar. How can I check the assumption of spherical/globular clusters?
I am not sure what you mean. The assumption is built into the model of cluster structure that the algorithm has. If you mean visualizing whether you have globular clusters, this isn't easily done in high dimensions, but it is suggested by the fact that you get nice clusters (for some definition of nice) using a method that looks for spherical clusters. Also, if you're using hierarchical clustering, you need to cut your tree first to actually get clusters.
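For example, with `cutree` (on placeholder data; replace with your own tree):

```r
# Placeholder tree from placeholder data
hc <- hclust(dist(scale(matrix(rnorm(200), nrow = 20))), method = "complete")

clusters <- cutree(hc, k = 3)    # cut into a fixed number of clusters...
# clusters <- cutree(hc, h = 5)  # ...or cut at a chosen height instead
table(clusters)                  # cluster sizes
```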
Thank you for your quick responses! I have one final question: would you always recommend transforming the data with log2 before clustering? Or only when it significantly improves the clustering? Or only when it makes the data less skewed? What would be your criteria? In my case it may improve the clustering slightly, but as I have negative values I would need to shift the data first, then take the log2, and then scale, which actually makes the data range wider and the color distinction worse.
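To be explicit, this is the pipeline I meant (placeholder data; the shift constant is arbitrary):

```r
# Placeholder data containing negative values
x <- rnorm(200, mean = 0, sd = 2)

shifted <- x - min(x) + 1   # shift so all values are >= 1
logged  <- log2(shifted)    # then take log2
scaled  <- scale(logged)    # then scale

range(x); range(scaled)     # inspect the resulting ranges
```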
You don't say what kind of data you're dealing with. In general, I'd try to avoid data transformations unless they're motivated by external considerations, like a need to change the shape of the distribution to apply some statistical test, or to correct for some known effect.
You may want to post this as a separate question with some details about the data.