Clustering data - data transformation (log2) highly improves clustering but why?
7.0 years ago
JJ ▴ 710

Hi all,

I have a dataset (ELISA data) covering quite a few analytes for patients and healthy controls. Almost all analytes differ highly significantly between patients and controls (non-parametric test). I produced a heatmap and the clustering was OK-ish. After a log2 transformation, however, it is very good (I add a very small constant to all values to avoid -Inf, as I have quite a few zeros). If I also convert all ELISA data to the same unit before taking the log2, I get an almost perfect clustering, which is what I would have expected given that almost all analytes differ so strongly between the two groups. But I am a bit worried that what I did is not OK, and I do not understand why the clustering improved so much.

Any advice/input is highly appreciated!

Thanks

clustering Elisa data transformation • 8.3k views

What is the nature of your data? What distance/similarity measure and what clustering algorithm are you using? Log transformation is often used for skewed data. Have you looked at the distributions? It is also important to pay attention to the assumptions made by the clustering algorithm.


Thank you for your input. So my data is ELISA data.

  • First, I convert all values to the same unit (ng/ml)
  • Second, I add a small constant
  • Third, I take the log2
  • Fourth, I scale (scale function in R)

Then I am using the heatmap.2 function in R with distance measure 'euclidean' and agglomeration method 'ward.D2'.
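The steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated log-normal values (not real ELISA measurements), the pseudo-constant `pseudo = 0.01` and the helper `cluster_samples` are hypothetical, and `hclust`/`cutree` stand in for the clustering that heatmap.2 performs internally.

```r
set.seed(1)

# Hypothetical data: 20 controls and 20 patients, 5 analytes, all in ng/ml.
# Log-normal values mimic right-skewed concentration measurements.
n <- 20
group <- rep(c("control", "patient"), each = n)
mu <- ifelse(group == "patient", 3, 1)   # assumed group means on the log scale
elisa <- sapply(1:5, function(j) rlnorm(2 * n, meanlog = mu, sdlog = 1))
colnames(elisa) <- paste0("analyte", 1:5)
elisa[sample(length(elisa), 10)] <- 0    # a few zeros, as in the question

pseudo <- 0.01                           # assumed small constant
log_elisa <- log2(elisa + pseudo)        # avoids log2(0) = -Inf
z <- scale(log_elisa)                    # centre and scale each analyte (column)

# Euclidean distance between samples, Ward agglomeration, cut into 2 clusters
cluster_samples <- function(m) {
  tree <- hclust(dist(m, method = "euclidean"), method = "ward.D2")
  cutree(tree, k = 2)
}

raw_clusters <- cluster_samples(scale(elisa))  # scaling only
log_clusters <- cluster_samples(z)             # log2, then scaling

# Compare how well each clustering recovers the known groups
print(table(group, raw_clusters))
print(table(group, log_clusters))

# With the gplots package installed, the same settings can be passed on:
# gplots::heatmap.2(t(z), scale = "none", trace = "none",
#                   distfun = function(m) dist(m, method = "euclidean"),
#                   hclustfun = function(d) hclust(d, method = "ward.D2"))
```

Note that the value chosen for the small constant determines where the zeros land on the log scale, so it is worth checking that the clustering does not change much when it is varied.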

Yes, I had a look at the distribution of the data. Each analyte by itself is nowhere near normally distributed (hence the non-parametric test). But the values from all variables together form a nice bell-shaped curve after all the steps I have stated, and each step improves it. Without the log2 but with scaling, the distribution is still skewed. With the log2, it is not skewed anymore. However, I didn't know that this could affect clustering that much. Could that be the reason why?
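One way to put a number on "skewed" versus "not skewed" is the sample skewness before and after the transformation. Base R has no built-in skewness function, so the sketch below defines a hypothetical `skewness` helper (the moment-based estimator) and applies it to simulated log-normal values, not to real ELISA data.

```r
# Moment-based sample skewness: about 0 for symmetric data, > 0 for a long right tail.
skewness <- function(x) {
  x <- x[is.finite(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(2)
x <- rlnorm(200, meanlog = 2, sdlog = 1)   # simulated right-skewed concentrations

skew_raw <- skewness(x)
skew_log <- skewness(log2(x))
print(c(raw = skew_raw, log2 = skew_log))

# A histogram per analyte gives the same picture visually:
# hist(x); hist(log2(x))
```

Running the helper on each analyte separately, rather than on all values pooled, shows whether individual variables are still skewed after the log2 step.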


Ward's method, like k-means, favours roughly spherical clusters. Heavily skewed variables can produce very elongated clusters that this method does not capture well. Taking the log of a variable reduces the skewness and typically brings the distribution closer to normal. You could try alternative clustering methods that are less sensitive to skewness, such as single or average linkage. If the log-transformed data are close to normally distributed, you could also run your statistical tests on the log-transformed data; parametric tests would then give you more power.
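To try the alternative linkage methods mentioned above, the same distance matrix can be passed to `hclust` with different `method` values. The sketch below uses simulated skewed data with made-up group sizes and means, so it only shows the mechanics; it makes no claim about which linkage will work best on real ELISA data.

```r
set.seed(3)

# Hypothetical skewed data: 15 controls and 15 patients, 4 analytes.
group <- rep(c("control", "patient"), each = 15)
mu <- ifelse(group == "patient", 2.5, 1)
x <- sapply(1:4, function(j) rlnorm(30, meanlog = mu, sdlog = 1))

d <- dist(scale(x), method = "euclidean")   # untransformed, scaled only

# Cluster with each linkage and cut every tree into two clusters
methods <- c("ward.D2", "average", "single")
clusters <- sapply(methods, function(m) cutree(hclust(d, method = m), k = 2))

# Cross-tabulate each clustering against the known groups
for (m in methods) {
  cat("\nLinkage:", m, "\n")
  print(table(group, cluster = clusters[, m]))
}
```

Repeating the comparison with `dist(scale(log2(x)))` shows how much each linkage depends on the transformation.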


Thanks for your input. I originally tried both non-parametric and parametric tests, but the difference was negligible.
