Question

Is log-transformation needed to calculate Euclidean distance metric from FPKM/TPM?

2

7.7 years ago

John Ma ▴ 310

I'm planning to perform hierarchical clustering on TPM data we already have, but I'm unsure whether to log-transform the data before calculating the Euclidean distance matrix.

We have already log-transformed the data for correlation purposes, but Euclidean distance does not appear to have a linearity requirement. TPM is already normalized (although I'm unsure whether it is scale-normalized), so our team is not sure whether we still need to log-transform the TPMs before calculating the Euclidean distance.

Does anyone have any insight on this?

RNA-Seq clustering distance metric FPKM TPM • 6.0k views

updated 7.7 years ago by Petr Ponomarenko ★ 2.8k • written 7.7 years ago by John Ma ▴ 310

1

I would always use log2-transformed expression values for clustering. Or, maybe even better, compute z-scores of your log2 expression values before hierarchical clustering.
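As a rough illustration of this suggestion (not code from the thread), the sketch below applies log2(TPM + 1), z-scores each gene across samples, and then computes Euclidean distances between samples. The toy matrix, the pseudocount of 1, the use of the population standard deviation, and the helper names `log2_transform`, `zscore`, and `euclidean` are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev

# Hypothetical toy TPM matrix: rows are genes, columns are samples S1..S4.
tpm = {
    "geneA": [5000.0, 5200.0, 9000.0, 9500.0],
    "geneB": [10.0, 12.0, 40.0, 38.0],
    "geneC": [0.0, 1.0, 0.5, 0.0],
}

def log2_transform(values, pseudocount=1.0):
    """Return log2(value + pseudocount); the pseudocount keeps zeros finite."""
    return [math.log2(v + pseudocount) for v in values]

def zscore(values):
    """Centre and scale one gene across samples (population SD assumed)."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # constant gene carries no signal
    return [(v - mu) / sd for v in values]

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Transform each gene, then read the columns back out as sample profiles.
scaled = {gene: zscore(log2_transform(vals)) for gene, vals in tpm.items()}
n_samples = 4
profiles = [[scaled[gene][j] for gene in tpm] for j in range(n_samples)]

# Symmetric sample-by-sample distance matrix.
dist = [
    [euclidean(profiles[i], profiles[j]) for j in range(n_samples)]
    for i in range(n_samples)
]

for row in dist:
    print(["%.2f" % d for d in row])
```

In practice a matrix like `dist` would be passed to a hierarchical clustering routine (for example `scipy.cluster.hierarchy.linkage`, if SciPy is available). Note that z-scoring gives every gene the same weight, including noisy low-expression genes such as the made-up `geneC`, so filtering such genes first may be worth considering.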

7.7 years ago by Benn 8.3k

score 1 · Answer 1 · 2017-03-03

1

7.7 years ago

Petr Ponomarenko ★ 2.8k

One reason to use a log transform is to make the shape of the gene expression distribution closer to normal (Gaussian). When the data are normally distributed, the statistical models used to analyze them work better. In some situations, removing some data to make the distribution normal is a good idea (technically, this removes outliers).
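To make the point about distribution shape concrete, here is a small sketch that compares the skewness of some made-up TPM values before and after a log2(TPM + 1) transform. The values, the pseudocount, and the `skewness` helper (a plain moment-based estimate) are assumptions for illustration only; real expression data will behave less neatly.

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (0 for a symmetric sample)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical TPM values spanning several orders of magnitude.
tpm_values = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0]
log_values = [math.log2(v + 1.0) for v in tpm_values]

raw_skew = skewness(tpm_values)
log_skew = skewness(log_values)

print("skewness of raw TPM:        %.2f" % raw_skew)
print("skewness of log2(TPM + 1):  %.2f" % log_skew)
```

On this toy input the raw values are strongly right-skewed, while the log-transformed values are close to symmetric.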

7.7 years ago by Petr Ponomarenko ★ 2.8k

0

Sure, Gaussian distributions are required for many analyses, and this is why I log-transformed the expression data for Pearson correlation. But I don't recall the calculation of Euclidean distance itself, or any of the major clustering algorithms, requiring this?

7.7 years ago by John Ma ▴ 310

0

If you are planning to use Euclidean distance for clustering, then most clustering algorithms do not by themselves require a normal distribution. However, they work through one approach or another to minimizing variance within each cluster, and for this it is better to have no outliers and, ideally, normally distributed data, so that each cluster resembles random noise added to its centroid mean (since random noise is typically modeled as normal). The requirements also depend on the hypothesis you are testing. Most likely you will want a p-value for whether the measured difference between two groups is due to chance, and having a normal distribution at that point will help you get a lower p-value.
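The effect of outliers on a squared-difference measure can be sketched with a toy example: on the raw TPM scale, a modest relative change in one highly expressed gene can account for nearly all of the squared Euclidean distance between two samples, whereas after log2(TPM + 1) the large fold change in a low-expression gene carries most of the weight. The gene names, values, pseudocount, and helper names below are hypothetical.

```python
import math

# Hypothetical TPM values for three genes in two samples:
# geneHigh changes by 10%, geneMid is unchanged, geneLow changes 8-fold.
sample_1 = {"geneHigh": 10000.0, "geneMid": 100.0, "geneLow": 1.0}
sample_2 = {"geneHigh": 11000.0, "geneMid": 100.0, "geneLow": 8.0}

def squared_diffs(a, b, transform=lambda v: v):
    """Per-gene squared difference after applying an optional transform."""
    return {g: (transform(a[g]) - transform(b[g])) ** 2 for g in a}

def shares(sq):
    """Fraction of the total squared distance contributed by each gene."""
    total = sum(sq.values())
    return {g: v / total for g, v in sq.items()}

raw_share = shares(squared_diffs(sample_1, sample_2))
log_share = shares(
    squared_diffs(sample_1, sample_2, lambda v: math.log2(v + 1.0))
)

for gene in sample_1:
    print("%-8s raw share: %.4f   log2 share: %.4f"
          % (gene, raw_share[gene], log_share[gene]))
```

Whether the raw or the log-scale weighting is the right one depends on the question being asked; the sketch only shows that the choice of scale changes which genes drive the distance.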

7.7 years ago by Petr Ponomarenko ★ 2.8k