Selecting Nodes Of High Correlation In A Tree
14.6 years ago
toni ★ 2.2k

Hi,

I have a microarray experiment on which I first ran a hierarchical clustering using R/Bioconductor, so I have a gene tree. This gene tree can be converted to a .gtr file.

Here is a small example of a .gtr file (recording the history of node joining):

NODE1X GENE1X GENE4X 0.98
NODE2X GENE5X GENE2X 0.80
NODE3X NODE1X GENE3X 0.72
NODE4X NODE2X NODE3X 0.60

The last column is the correlation measured at each node.
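
To make the format concrete, here is a minimal Python sketch that reads the four example lines into a tree and lists the genes under a node. The names `Node`, `parse_gtr` and `leaves` are made up for illustration, and the sketch assumes whitespace-separated fields with any identifier that never appears in the first column being a gene.

```python
from typing import NamedTuple

class Node(NamedTuple):
    name: str
    left: str
    right: str
    similarity: float  # value from the last column of the .gtr line

GTR_TEXT = """\
NODE1X GENE1X GENE4X 0.98
NODE2X GENE5X GENE2X 0.80
NODE3X NODE1X GENE3X 0.72
NODE4X NODE2X NODE3X 0.60
"""

def parse_gtr(text):
    """Return {node_name: Node} from whitespace-separated .gtr lines."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        name, left, right, sim = fields[:4]
        nodes[name] = Node(name, left, right, float(sim))
    return nodes

def leaves(nodes, name):
    """List the genes under a node; a name that is not a node is a gene."""
    if name not in nodes:
        return [name]
    node = nodes[name]
    return leaves(nodes, node.left) + leaves(nodes, node.right)

nodes = parse_gtr(GTR_TEXT)
print(leaves(nodes, "NODE3X"))  # ['GENE1X', 'GENE4X', 'GENE3X']
```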

My questions are:

Do you know how this "correlation" at each node is calculated in general (especially in Eisen's software)? Are there several common methods? Is one of them used much more often for gene expression?

Which node correlation measure would you use to select clusters of highly correlated genes and then submit them to a GO analysis tool? Is it reliable to select genes this way before GO analysis?

Regards,

tony

microarray gene clustering • 5.4k views
14.6 years ago

If I understand correctly, this is a question about how to "cut" the hierarchical clustering to extract highly correlated nodes. There are a few options, but they depend on the metric used and require some arbitrary decisions.

From the result of Eisen's CLUSTER program, you might notice that each internal node (NODE1X, ..) in the output has a metric associated with it (the value in the last column in the output). Keep in mind that this value depends on the distance metric (e.g. Euclidean distance or Pearson correlation coefficient) and the linkage method (e.g. single-linkage, complete-linkage) you used when running CLUSTER.

One immediate method is to pick an arbitrary cutoff and select the nodes that exceed a minimum quality. Let's say we want to select the nodes that have an average correlation coefficient r > 0.7. The exact cutoff depends on how compact you'd like the clusters to be, so it is quite arbitrary. In statistics texts, people often determine the number of clusters by plotting the number of clusters k (varying k amounts to gradually moving the cutoff) against the compactness of the partitions, and then choose a suitable k from that plot.
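
As a rough sketch of the cutoff idea in Python, using the example tree from the question: walk down from the root and keep the first node on each branch whose value reaches the cutoff. The function name `clusters_above` is hypothetical, and the sketch assumes node values never increase toward the root (no inversions), which is not guaranteed for every linkage method.

```python
# Example tree from the question: (node, left child, right child, node value)
GTR = [
    ("NODE1X", "GENE1X", "GENE4X", 0.98),
    ("NODE2X", "GENE5X", "GENE2X", 0.80),
    ("NODE3X", "NODE1X", "GENE3X", 0.72),
    ("NODE4X", "NODE2X", "NODE3X", 0.60),
]

def clusters_above(gtr, cutoff):
    """Return {node: genes} for the top-most nodes whose value is >= cutoff."""
    nodes = {name: (left, right, sim) for name, left, right, sim in gtr}
    children = {c for left, right, _ in nodes.values() for c in (left, right)}
    root = next(n for n in nodes if n not in children)

    def leaves(name):
        if name not in nodes:
            return [name]  # not an internal node, so it is a gene
        left, right, _ = nodes[name]
        return leaves(left) + leaves(right)

    selected = {}
    stack = [root]
    while stack:
        name = stack.pop()
        if name not in nodes:
            continue  # a single gene is not reported as a cluster
        left, right, sim = nodes[name]
        if sim >= cutoff:
            selected[name] = leaves(name)  # keep the node, do not descend further
        else:
            stack.extend([left, right])    # too loose, try the two subtrees
    return selected

print(clusters_above(GTR, 0.7))
```

With the cutoff at 0.7 this keeps NODE3X (three genes) and NODE2X (two genes); raising it to 0.9 leaves only NODE1X.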

Recent research instead focuses on automatic (dynamic) selection of the cutoff, with applications to gene expression data. I'll list a few references, but there are more.

"An improved algorithm for clustering gene expression data" http://bioinformatics.oxfordjournals.org/cgi/content/full/23/21/2859

"Selection of informative clusters from hierarchical cluster tree with gene classes" http://www.biomedcentral.com/1471-2105/5/32

"Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R" http://bioinformatics.oxfordjournals.org/cgi/content/full/24/5/719

In summary, there is no simple answer to your question; everyone seems to do this differently. But it is certainly an active field.


Thank you very much. You understood my concerns perfectly. This was a problem for me because I often choose Ward linkage (not provided in Eisen's software), so I run the clustering in R, export the results to CDT, GTR and ATR files myself, and then use Java TreeView. So I needed to calculate the correlation at each node myself (hclust only provides a "height" value for each node). In the end, though, I use an arbitrary cutoff. Thanks for the references on cluster selection.
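
For what it's worth, one possible way to put a correlation on a node is the mean pairwise Pearson correlation of the genes under it. The Python sketch below is only that: the expression values are made up, the function names are hypothetical, and this is not necessarily what Cluster or any other tool writes to the .gtr file.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_pairwise_correlation(expr, genes):
    """Average Pearson correlation over all gene pairs under a node."""
    pairs = list(combinations(genes, 2))
    return sum(pearson(expr[a], expr[b]) for a, b in pairs) / len(pairs)

# Made-up expression profiles (one value per array), for illustration only.
expr = {
    "GENE1X": [1.0, 2.0, 3.0, 4.0],
    "GENE4X": [2.0, 4.0, 6.0, 8.0],
    "GENE3X": [4.0, 3.0, 2.0, 1.0],
}

print(mean_pairwise_correlation(expr, ["GENE1X", "GENE4X"]))
print(mean_pairwise_correlation(expr, ["GENE1X", "GENE4X", "GENE3X"]))
```

The first call gives a value near 1 because the two profiles are proportional; adding the anti-correlated GENE3X pulls the node average down to about -1/3.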

14.6 years ago
User 59 13k

You might be interested in this paper.

"A common clustering method in the analysis of gene expression data has been hierarchical clustering. Usually the analysis involves selection of clusters by cutting the tree at a suitable level and/or analysis of a sorted gene list that is obtained with the tree. Cutting of the hierarchical tree requires the selection of a suitable level and it results in the loss of information on the other level. Sorted gene lists depend on the sorting method of the joined clusters. Author proposes that the clusters should be selected using the gene classifications."


Thank you Daniel. Very useful. Indeed, being able to select enriched nodes at different levels is interesting (instead of cutting the tree, with possible loss of information).

14.6 years ago

Clustering makes use of similarity measures between elements. There are various similarity measures one may employ: Euclidean, correlation, cosine, etc. The numerical values alone may not be sufficient to identify the method used to produce them.
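
A small Python illustration of how the three measures named above differ on the same profiles. The vectors are made up and the function names are just for this sketch; Pearson correlation is written here as the cosine similarity of mean-centered vectors.

```python
from math import sqrt

def euclidean_distance(x, y):
    """Straight-line distance; sensitive to the magnitude of the values."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between the two vectors; ignores overall scale."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def pearson_correlation(x, y):
    """Cosine similarity after subtracting each vector's mean."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine_similarity([a - mx for a in x], [b - my for b in y])

g1 = [1.0, 2.0, 3.0]
g2 = [2.0, 4.0, 6.0]  # same shape as g1, twice the magnitude
g3 = [3.0, 2.0, 1.0]  # opposite trend to g1

for name, other in (("g2", g2), ("g3", g3)):
    print(name,
          euclidean_distance(g1, other),
          cosine_similarity(g1, other),
          pearson_correlation(g1, other))
```

Here g1 and g2 are perfectly correlated yet have a non-zero Euclidean distance, while g1 and g3 are perfectly anti-correlated yet still have a positive cosine similarity, so the choice of measure changes which genes look "close".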

Joining nodes into clusters is a second stage; here again, several techniques may be used to link similar subgroups into a single one.

There is no right or wrong method. Many people use Pearson correlation as their metric, since it has a fairly straightforward interpretation.

If you have genes of interest you can reduce your dataset to those genes only. This may increase the predictive power of your results because there will be fewer variables in play.


Thank you Istvan. The theoretical background of clustering is OK for me. Each node value corresponds to the dissimilarity of the joined nodes (depending on the distance and linkage you chose). Actually, this question arose from the fact that when using Cluster/TreeView with Euclidean distance and complete linkage, the so-called correlation value at each node does not correspond to the dissimilarity of the joined nodes but is a scaled value in [0,1]. So I was wondering whether a scaling is applied, or maybe a customized recursive function that measures whatever you want at each node, such as an intra-cluster variance or a simple correlation.
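
To show what I mean by a scaling: one simple mapping that would put node values in [0,1] is to divide each joining distance by the largest one and subtract from 1. The Python sketch below is only a guess at that kind of transformation, with made-up heights and a hypothetical function name; nothing in this thread confirms that Cluster does exactly this.

```python
def scale_heights_to_similarity(heights):
    """Map joining distances to [0, 1]: 1 - h / max(h).

    Assumed scaling for illustration, not a documented Cluster formula.
    """
    top = max(heights)
    if top == 0:
        return [1.0 for _ in heights]  # all nodes join at distance zero
    return [1.0 - h / top for h in heights]

# Made-up node heights, e.g. as returned by a hierarchical clustering.
heights = [0.5, 2.0, 3.0, 4.0]
print(scale_heights_to_similarity(heights))  # [0.875, 0.5, 0.25, 0.0]
```

With this mapping the root always gets 0 and tighter nodes approach 1, so the values depend on the whole tree and cannot be compared across datasets.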
