Hierarchical Clustering
12.6 years ago
Diana ▴ 930

Hello everyone,

I'm using the pvclust package in R to cluster my gene expression data (hierarchical clustering with bootstrap). When I plot the result, all the branches collapse at the bottom and I can't see the clusters. Is there a way to improve the image? I used the scale function to scale my data before clustering, but the resulting tree is a little different from the one produced with unscaled data. With unscaled data I get 3 outliers, but with scaled data the outliers are embedded in the clusters. I'm worried that scaling is distorting my data. Please help.

Image with unscaled data: http://img715.imageshack.us/img715/411/imagetree.jpg

If I scale the data:

I get this tree: http://img526.imageshack.us/img526/923/scaledeb.jpg

Test data:

Gene     condition1     condition2     condition3
AATF     0.004239637    0.004565341    0.004992545
ADNP2    0.00316361     0.002401833    0.002222395
AP-2     0.029882702    0.016730296    0.020585824
AXIN2    0.001743115    0.002124558    0.003573409

Thank you!

r clustering • 3.8k views

Could you paste your code and the plot?

12.6 years ago
Wen.Huang ★ 1.2k

Do you really have to use Euclidean distance? When you scale your data, the scale and magnitude of the Euclidean distances change. For gene expression data, "correlation" is almost certainly the right way to measure distance.
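To make the suggestion concrete, here is a minimal sketch in base R that clusters the four test genes from the question using correlation distance (1 minus Pearson r). The object names (expr, gene_cor, gene_dist, hc) and the choice of average linkage are my assumptions, not something taken from the original analysis.

```r
# Test data from the question: genes in rows, conditions in columns.
expr <- matrix(
  c(0.004239637, 0.004565341, 0.004992545,
    0.00316361,  0.002401833, 0.002222395,
    0.029882702, 0.016730296, 0.020585824,
    0.001743115, 0.002124558, 0.003573409),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("AATF", "ADNP2", "AP-2", "AXIN2"),
                  c("condition1", "condition2", "condition3"))
)

# cor() correlates columns, so transpose to put the genes in columns.
gene_cor <- cor(t(expr))

# Correlation distance: 0 for perfectly correlated profiles,
# 2 for perfectly anti-correlated ones.
gene_dist <- as.dist(1 - gene_cor)

# Average-linkage hierarchical clustering on that distance
# (the linkage method is an assumption made for this sketch).
hc <- hclust(gene_dist, method = "average")
```

Calling plot(hc) draws the dendrogram. As far as I know, pvclust makes the same choice available through its method.dist argument (method.dist = "correlation"), so the bootstrap version should not need a hand-built distance matrix.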


Thank you, Huang, for your answer. Is there any paper or review you've come across that describes the correlation method as better than others for clustering gene expression data? I used the correlation method and it does give the same tree with scaled or unscaled data. However, the genes cluster somewhat differently than with Euclidean distance, and I'm not sure which method would be best.
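The observation that correlation gives the same tree with or without scaling can be checked directly. The sketch below uses made-up random data (not the poster's) and assumes the items being clustered are the columns of the matrix, which is what scale() standardizes. All object names are hypothetical.

```r
set.seed(1)

# Hypothetical data: 50 rows (observations) x 6 columns (items to cluster),
# with the columns deliberately put on very different scales.
x <- matrix(rnorm(300), nrow = 50) %*% diag(c(1, 1, 5, 5, 20, 20))
colnames(x) <- paste0("s", 1:6)

# scale() centres each column and divides it by its standard deviation.
xs <- scale(x)

# Correlation distance between columns, before and after scaling.
d_cor_raw    <- as.dist(1 - cor(x))
d_cor_scaled <- as.dist(1 - cor(xs))
cor_unchanged <- isTRUE(all.equal(as.vector(d_cor_raw),
                                  as.vector(d_cor_scaled)))

# Euclidean distance between columns, before and after scaling.
d_euc_raw    <- dist(t(x))
d_euc_scaled <- dist(t(xs))
euc_unchanged <- isTRUE(all.equal(as.vector(d_euc_raw),
                                  as.vector(d_euc_scaled)))

cor_unchanged  # TRUE: Pearson correlation ignores per-column shift and scale
euc_unchanged  # FALSE: Euclidean distances depend on the column magnitudes
```

Since the correlation distances are identical, any tree built from them is identical too, while the Euclidean tree can change after scaling. That is consistent with what is reported here, but it only holds when scaling is applied along the same dimension that is being clustered.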


Neither Euclidean nor correlation distance is inherently better or worse. It depends on what you believe is the best distance measure for your data. But you definitely don't want to measure Euclidean distance on scaled gene expression. Perhaps Michael Eisen's 1998 PNAS paper is a good reference if you really want one. I believe he used correlation.
