If I understand correctly, this is a question about how to "cut" a hierarchical clustering tree to extract highly correlated nodes. There are a few options, but they depend on the metric used and require some arbitrary decisions.
From the result of Eisen's CLUSTER program, you might notice that each internal node (NODE1X, ...) in the output has a metric associated with it (the value in the last column in the output). Keep in mind that this value depends on the distance metric (e.g. Euclidean distance or Pearson correlation coefficient) and the linkage method (e.g. single-linkage, complete-linkage) you used when running CLUSTER.
One immediate method is to pick an arbitrary cut-off and select the nodes that meet a minimum quality. Let's say we want to select the nodes that have an average correlation coefficient r > 0.7. The exact cut-off depends on how compact you'd like the clusters to be, so it is quite arbitrary. In statistics texts, people often determine the number of clusters by plotting the cluster number (k, which amounts to gradually loosening the cut-off) against the compactness of the partitions, and then choose a suitable k based on that plot.
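A minimal sketch of the fixed cut-off approach, in Python. It assumes the tree is available as tab-separated rows of node id, two children, and a similarity score (the layout of a CLUSTER .gtr file), and that scores do not increase toward the root, which holds for single, complete and average linkage but not necessarily for centroid linkage. The function names and the example tree are made up for illustration.

```python
import csv
import io

# Hypothetical five-gene tree in GTR-like layout: node, child, child, similarity.
EXAMPLE_GTR = (
    "NODE1X\tGENE1X\tGENE2X\t0.95\n"
    "NODE2X\tGENE3X\tGENE4X\t0.80\n"
    "NODE3X\tNODE1X\tGENE5X\t0.75\n"
    "NODE4X\tNODE3X\tNODE2X\t0.30\n"
)

def read_gtr(text):
    """Parse rows into {node_id: (left_child, right_child, score)}."""
    nodes = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        nodes[row[0]] = (row[1], row[2], float(row[3]))
    return nodes

def leaves(nodes, node_id):
    """Collect leaf ids below a node; any id that is not a node is a leaf."""
    if node_id not in nodes:
        return [node_id]
    left, right, _ = nodes[node_id]
    return leaves(nodes, left) + leaves(nodes, right)

def cut_tree(nodes, cutoff):
    """Return the largest subtrees whose node score is at least `cutoff`."""
    children = {c for left, right, _ in nodes.values() for c in (left, right)}
    stack = [n for n in nodes if n not in children]  # start from the root(s)
    clusters = []
    while stack:
        node_id = stack.pop()
        if node_id not in nodes:
            continue  # a single leaf is not reported as a cluster
        left, right, score = nodes[node_id]
        if score >= cutoff:
            clusters.append(sorted(leaves(nodes, node_id)))
        else:
            stack.extend([left, right])  # too loose, look further down
    return sorted(clusters)

if __name__ == "__main__":
    tree = read_gtr(EXAMPLE_GTR)
    # Loosening the cut-off step by step shows how the partition changes.
    for cutoff in (0.9, 0.7, 0.2):
        print(cutoff, cut_tree(tree, cutoff))
```

Running the sweep over several cut-offs and looking at how the clusters merge is a crude version of the "plot k against compactness" idea; the choice of the final value remains a judgment call.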
Recent research instead focuses on automatic (dynamic) selection of the cut-off, with applications to gene expression data. I'll list a few references, but there are more.
"An improved algorithm for clustering gene expression data"
http://bioinformatics.oxfordjournals.org/cgi/content/full/23/21/2859
"Selection of informative clusters from hierarchical cluster tree with gene classes"
http://www.biomedcentral.com/1471-2105/5/32
"Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R"
http://bioinformatics.oxfordjournals.org/cgi/content/full/24/5/719
In summary, there is no simple answer to your question; everyone seems to do this differently. But it is certainly an active field.
Thank you very much. You understood my concerns perfectly. It was a problem for me because I usually choose Ward linkage (not provided in Eisen's software), so I use R, export the R results to CDT, GTR and ATR files on my own, and then use Java TreeView. So I needed to calculate the correlation at each node myself (hclust only provides the "height" value of each node). In the end, though, I use an arbitrary cutoff. Thanks for the references on cluster selection.
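For readers in the same situation, here is a rough Python sketch of one way to attach a correlation to each node: the mean pairwise Pearson correlation over all leaves below the node. This is only one possible definition, not necessarily the value CLUSTER writes for a given linkage, and the function names, the merge-list layout and the toy profiles are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles.
    Raises ZeroDivisionError if either profile is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def node_correlations(profiles, merges):
    """Mean pairwise correlation among the leaves under each node.

    profiles: {leaf_id: [values]}
    merges:   [(node_id, left, right), ...] with children listed before parents,
              i.e. the order in which the clustering merged them.
    """
    members = {leaf: [leaf] for leaf in profiles}
    result = {}
    for node_id, left, right in merges:
        members[node_id] = members[left] + members[right]
        pairs = list(combinations(members[node_id], 2))
        total = sum(pearson(profiles[a], profiles[b]) for a, b in pairs)
        result[node_id] = total / len(pairs)
    return result

if __name__ == "__main__":
    # Toy data: a and b are perfectly correlated, c is anti-correlated with both.
    profiles = {"a": [1, 2, 3], "b": [2, 4, 6], "c": [3, 2, 1]}
    merges = [("NODE1X", "a", "b"), ("NODE2X", "NODE1X", "c")]
    for node, r in node_correlations(profiles, merges).items():
        print(node, round(r, 3))
```

The resulting values can be written into the last column of the GTR file in place of the height, which gives the viewer and any cut-off step a score on the correlation scale.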