Question

Selecting Nodes Of High Correlation In A Tree

2


14.6 years ago

toni ★ 2.2k

Hi,

I have a microarray experiment on which I first performed hierarchical clustering using R/Bioconductor, so I have a gene tree. This gene tree can be converted to a .gtr file.

Here is a small example of a .gtr file (recording the history of node joining):

NODE1X GENE1X GENE4X 0.98
NODE2X GENE5X GENE2X 0.80
NODE3X NODE1X GENE3X 0.72
NODE4X NODE2X NODE3X 0.60

The last column is the correlation measured at each node.
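For illustration, here is a minimal Python sketch of how such a file can be read into a tree. It assumes four whitespace-separated columns exactly as in the example above (node ID, two children, node value); the names `parse_gtr` and `leaves_under` are hypothetical helpers, not part of any existing tool.

```python
from typing import Dict, List, Tuple

# The example .gtr content from the post, embedded as a string.
GTR_TEXT = """\
NODE1X GENE1X GENE4X 0.98
NODE2X GENE5X GENE2X 0.80
NODE3X NODE1X GENE3X 0.72
NODE4X NODE2X NODE3X 0.60
"""

Node = Tuple[str, str, float]  # (left child, right child, node value)


def parse_gtr(text: str) -> Dict[str, Node]:
    """Map each internal node ID to its two children and its node value."""
    nodes: Dict[str, Node] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        node_id, left, right, value = fields[:4]
        nodes[node_id] = (left, right, float(value))
    return nodes


def leaves_under(node_id: str, nodes: Dict[str, Node]) -> List[str]:
    """Return the genes (leaves) below a node; a gene ID returns itself."""
    if node_id not in nodes:
        return [node_id]
    left, right, _ = nodes[node_id]
    return leaves_under(left, nodes) + leaves_under(right, nodes)


nodes = parse_gtr(GTR_TEXT)
print(leaves_under("NODE3X", nodes))  # ['GENE1X', 'GENE4X', 'GENE3X']
```

Any ID that never appears in the first column is treated as a gene, which is how the example distinguishes GENE* entries from NODE* entries.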

My questions are:

Do you know how this "correlation" at each node is calculated in general (especially in Eisen's software)? Are there several common methods? Is one used much more often for gene expression?

Which node correlation measure would you use to select clusters of highly correlated genes and then submit these to a GO analysis tool? Is it reliable to make the gene selection before the GO analysis?

Regards,

tony

microarray gene clustering • 5.5k views

updated 6.3 years ago by Ram 44k • written 14.6 years ago by toni ★ 2.2k

score 6 · Answer 1 · 2010-05-12

If I understand correctly, this is a question about how one can "cut" the hierarchical clustering to extract highly correlated nodes. There are a few options, but they depend on the metric used and require some arbitrary decisions.

In the output of Eisen's CLUSTER program, you might notice that each internal node (NODE1X, ...) has a value associated with it (the last column of the output). Keep in mind that this value depends on the distance metric (e.g. Euclidean distance or Pearson correlation coefficient) and the linkage method (e.g. single-linkage, complete-linkage) you used when running CLUSTER.

The most direct method is to pick an arbitrary cut-off and select the nodes that exceed a minimum quality. Let's say we want to select the nodes that have an average correlation coefficient r > 0.7. The exact cut-off depends on how compact you'd like the clusters to be, so it is quite arbitrary. In statistics texts, people often determine the number of clusters by plotting the cluster number (k, which amounts to gradually changing the cut-off) against the compactness of the partitions, and then choosing a suitable k based on that plot.
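The fixed cut-off idea can be sketched in a few lines of Python. This is only an illustration: it uses the toy tree from the question, the helper names (`find_root`, `leaves_under`, `cut_tree`) are hypothetical, and it assumes the node value never increases as you move towards the root, so the first node at or above the cut-off on each branch is the largest acceptable cluster.

```python
from typing import Dict, List, Tuple

Node = Tuple[str, str, float]  # (left child, right child, node value)

# Toy tree from the question: node ID -> (child, child, value in last column).
NODES: Dict[str, Node] = {
    "NODE1X": ("GENE1X", "GENE4X", 0.98),
    "NODE2X": ("GENE5X", "GENE2X", 0.80),
    "NODE3X": ("NODE1X", "GENE3X", 0.72),
    "NODE4X": ("NODE2X", "NODE3X", 0.60),
}


def find_root(nodes: Dict[str, Node]) -> str:
    """The root is the only node that is never listed as a child."""
    children = {c for left, right, _ in nodes.values() for c in (left, right)}
    roots = [n for n in nodes if n not in children]
    if len(roots) != 1:
        raise ValueError("expected exactly one root node")
    return roots[0]


def leaves_under(node_id: str, nodes: Dict[str, Node]) -> List[str]:
    """Return the genes below a node; a gene ID returns itself."""
    if node_id not in nodes:
        return [node_id]
    left, right, _ = nodes[node_id]
    return leaves_under(left, nodes) + leaves_under(right, nodes)


def cut_tree(nodes: Dict[str, Node], cutoff: float) -> List[List[str]]:
    """Walk down from the root and keep the first node on each branch
    whose value is >= cutoff. Single genes are not reported as clusters."""
    clusters: List[List[str]] = []
    stack = [find_root(nodes)]
    while stack:
        node_id = stack.pop()
        if node_id not in nodes:
            continue  # reached a single gene without meeting the cut-off
        left, right, value = nodes[node_id]
        if value >= cutoff:
            clusters.append(sorted(leaves_under(node_id, nodes)))
        else:
            stack.extend([left, right])
    return sorted(clusters)


print(cut_tree(NODES, 0.7))
# [['GENE1X', 'GENE3X', 'GENE4X'], ['GENE2X', 'GENE5X']]
```

Calling `cut_tree` over a range of cut-offs and counting the clusters returned gives the raw numbers for the kind of k-versus-compactness plot mentioned above.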

Recent research focuses instead on automatic (dynamic) selection of the cut-off, with applications to gene expression data. I'll list a few references, but there are more.

"An improved algorithm for clustering gene expression data" http://bioinformatics.oxfordjournals.org/cgi/content/full/23/21/2859

"Selection of informative clusters from hierarchical cluster tree with gene classes" http://www.biomedcentral.com/1471-2105/5/32

"Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R" http://bioinformatics.oxfordjournals.org/cgi/content/full/24/5/719

In summary, there is no simple answer to your question; everyone seems to do this differently. But it is certainly an active field.

Ram · Answer 2 · 2010-05-06

4


14.6 years ago

User 59 13k

You might be interested in this paper.

"A common clustering method in the analysis of gene expression data has been hierarchical clustering. Usually the analysis involves selection of clusters by cutting the tree at a suitable level and/or analysis of a sorted gene list that is obtained with the tree. Cutting of the hierarchical tree requires the selection of a suitable level and it results in the loss of information on the other level. Sorted gene lists depend on the sorting method of the joined clusters. Author proposes that the clusters should be selected using the gene classifications."

updated 5.3 years ago by Ram 44k • written 14.6 years ago by User 59 13k

0


Thank you Daniel. Very useful. Indeed, being able to select enriched nodes at different levels is interesting (instead of cutting the tree with a possible loss of information).

14.6 years ago by toni ★ 2.2k

score 1 · Answer 3 · 2010-05-06

1


14.6 years ago

Istvan Albert 102k

Clustering makes use of similarity measures between elements. There are various similarity measures that one may employ: Euclidean, correlation, cosine, etc. The actual numerical values may not be sufficient to identify the method used to produce them.

Joining nodes into clusters is a second stage; here again, several techniques may be used to link similar subgroups into a single one.

There is no right or wrong method. Many people use Pearson correlation as their metric; it has a fairly straightforward interpretation.
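To make the difference between measures concrete, here is a small Python sketch comparing Pearson correlation with Euclidean distance on made-up expression profiles. The profile values and the function names are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two equal-length profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Made-up profiles: gene_b has the same shape as gene_a at 10x the scale,
# gene_c is gene_a reversed.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]
gene_c = [4.0, 3.0, 2.0, 1.0]

print(pearson(gene_a, gene_b))    # 1.0  (same shape)
print(pearson(gene_a, gene_c))    # -1.0 (opposite trend)
print(euclidean(gene_a, gene_b) > euclidean(gene_a, gene_c))  # True
```

Here gene_a and gene_b are perfectly correlated yet far apart in Euclidean terms, while gene_a and gene_c are close in Euclidean terms but anti-correlated, so the choice of measure changes which genes end up clustered together.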

If you have genes of interest you can reduce your dataset to those genes only. This may increase the predictive power of your results because there will be fewer variables in play.

14.6 years ago by Istvan Albert 102k

0


Thank you Istvan. The theoretical background of clustering is fine. Each node value corresponds to the dissimilarity of the joined nodes (depending on the distance/linkage you chose). Actually, this question arose from the fact that when using Cluster/TreeView with Euclidean distance and complete linkage, the so-called correlation values at each node do not correspond to the (joined) dissimilarity but are scaled values in [0,1]. So I was wondering whether a scaling is applied, or maybe a recursive customized function that measures what you want at each node, like an intra-variance or a simple correlation.
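To show the kind of scaling meant here, the Python sketch below maps raw join distances to values in [0,1] by dividing by the largest distance and subtracting from 1. This is purely a guess at one plausible transformation, not confirmed behaviour of Cluster/TreeView; the function name and the distances are made up.

```python
from typing import List, Sequence


def scale_to_similarity(distances: Sequence[float]) -> List[float]:
    """Hypothetical scaling: similarity = 1 - distance / max(distance).

    The tightest join gets the highest value and the loosest join
    (the root) gets 0. Assumes non-negative distances.
    """
    if not distances:
        return []
    if min(distances) < 0:
        raise ValueError("distances must be non-negative")
    largest = max(distances)
    if largest == 0:
        return [1.0 for _ in distances]  # all joins are between identical items
    return [1.0 - d / largest for d in distances]


# Made-up Euclidean join distances, listed from the first join to the root.
join_distances = [2.5, 5.0, 7.5, 10.0]
print(scale_to_similarity(join_distances))  # [0.75, 0.5, 0.25, 0.0]
```

A scaling like this keeps the order of the joins but loses the absolute distances, which would explain why the values no longer match the dissimilarities computed in R.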

14.6 years ago by toni ★ 2.2k