Hi all,
So I have a dataset (ELISA data) covering quite a few analytes for patients and healthy controls. Almost all analytes differ highly significantly between patients and controls (non-parametric test). I produced a heatmap and the clustering was only so-so. After a log2 transformation, however, it is very good (I add a very small constant to all values to avoid -Inf, as I have quite a few zeros). If I convert all ELISA data to the same unit before taking the log2, I get an almost perfect clustering, which is what I would have expected, since almost all analytes differ highly significantly between the two groups. But I am a bit worried that what I did is not OK, and I do not understand why the clustering improved so much.
Any advice/input is highly appreciated!
Thanks
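One possible contributor to the unit effect described above, sketched in Python rather than R and with made-up numbers: once a fixed constant is added before the log2, changing the unit shifts every positive value by the same amount on the log scale but leaves the zeros where they are. The pseudocount value, the ng/mL to pg/mL conversion, and the readings below are all assumptions for illustration, not values from the post.

```python
import math

PSEUDOCOUNT = 1e-6  # assumed; the post only says "a very small constant"
NG_TO_PG = 1000.0   # 1 ng/mL = 1000 pg/mL


def log2_shifted(value, pseudocount=PSEUDOCOUNT):
    """log2 after adding a small constant so that zeros stay finite."""
    return math.log2(value + pseudocount)


# Hypothetical readings of a single analyte, reported in ng/mL.
zero_ng, low_ng, high_ng = 0.0, 2.0, 8.0

# Gap between a zero and a positive reading, before and after unit conversion.
zero_gap_ng = log2_shifted(low_ng) - log2_shifted(zero_ng)
zero_gap_pg = log2_shifted(low_ng * NG_TO_PG) - log2_shifted(zero_ng * NG_TO_PG)

# Gap between two positive readings, before and after unit conversion.
pos_gap_ng = log2_shifted(high_ng) - log2_shifted(low_ng)
pos_gap_pg = log2_shifted(high_ng * NG_TO_PG) - log2_shifted(low_ng * NG_TO_PG)

print(f"zero vs positive: {zero_gap_ng:.2f} (ng/mL) -> {zero_gap_pg:.2f} (pg/mL)")
print(f"positive vs positive: {pos_gap_ng:.2f} (ng/mL) -> {pos_gap_pg:.2f} (pg/mL)")
```

In this sketch the gap between two positive readings is essentially unchanged by the conversion, while the gap between a zero and a positive reading grows by about log2(1000). So with many zeros in the data, the choice of unit and of the added constant can change how far apart samples end up, which is worth checking by trying a few different constants.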
What is the nature of your data? What distance/similarity measure and what clustering algorithm are you using? Log transformation is often used for skewed data. Have you looked at the distributions? It is also important to pay attention to the assumptions made by the clustering algorithm.
Thank you for your input. So my data is ELISA data.
I am using the heatmap.2 function in R with the distance measure 'euclidean' and the agglomeration method 'ward.D2'.
Yes, I had a look at the distribution of the data. Each analyte by itself is nowhere near normally distributed (hence the non-parametric test). But the values from all variables pooled together form a nice bell-shaped curve after all the steps I have described, and each step improves it. Without the log2 but with scaling, the distribution is still skewed. With the log2 it is no longer skewed. However, I didn't know that this could affect clustering that much. Could that be the reason why?
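To make the link between the transformation and the Euclidean distance concrete, here is a minimal Python sketch (not the R code used in the thread). It splits the squared Euclidean distance between two samples into per-analyte contributions, on the raw scale and after log2. The analyte names, units, and concentrations are hypothetical, and all values are positive so no added constant is needed here.

```python
import math

# Hypothetical concentrations for two samples; names, units and values are made up.
# analyte_a is reported in pg/mL, analyte_b in ng/mL.
patient = {"analyte_a": 4000.0, "analyte_b": 8.0}
control = {"analyte_a": 2000.0, "analyte_b": 1.0}


def squared_contributions(x, y, transform=lambda v: v):
    """Per-analyte squared differences; their sum is the squared Euclidean distance."""
    return {name: (transform(x[name]) - transform(y[name])) ** 2 for name in x}


def share(contributions, name):
    """Fraction of the squared distance that comes from one analyte."""
    return contributions[name] / sum(contributions.values())


raw = squared_contributions(patient, control)
logged = squared_contributions(patient, control, math.log2)

raw_share_a = share(raw, "analyte_a")
log_share_a = share(logged, "analyte_a")

print(f"analyte_a share of squared distance, raw:  {raw_share_a:.5f}")
print(f"analyte_a share of squared distance, log2: {log_share_a:.5f}")
```

On the raw scale the analyte with the largest numbers accounts for almost the whole distance, even though its fold change (2x) is smaller than that of the other analyte (8x). After log2 the distance reflects fold changes instead, so every analyte that differs between groups gets a say in the clustering.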
Ward's method, like k-means, favours roughly spherical clusters. Heavily skewed variables can produce very elongated clusters that this method does not capture well. Taking the log of a skewed variable reduces the skewness and typically brings the distribution closer to normal. You could try alternative clustering methods that are less sensitive to skewness, such as single or average linkage. If the log-transformed data are close to normally distributed, you could also run your statistical tests on the log-transformed data; parametric tests would then give you more power.
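As a rough illustration of the skewness point, the Python sketch below draws simulated log-normal values (a stand-in for right-skewed concentration data, not real ELISA measurements) and compares the sample skewness before and after log2. The distribution, its parameters, the sample size, and the helper name sample_skewness are all assumptions made for the example.

```python
import math
import random
import statistics


def sample_skewness(values):
    """Moment-based sample skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the run is reproducible

# Simulated right-skewed "concentrations" (log-normal, assumed parameters).
raw_values = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
log_values = [math.log2(v) for v in raw_values]

skew_raw = sample_skewness(raw_values)
skew_log = sample_skewness(log_values)

print(f"skewness before log2: {skew_raw:.2f}")
print(f"skewness after log2:  {skew_log:.2f}")
```

The raw values come out strongly right-skewed, while the log2 values are close to symmetric. Real data will not be exactly log-normal, so it is still worth checking each analyte's distribution after the transformation.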
Thanks for your input. I originally tried both non-parametric and parametric tests, but the difference was negligible.