Hi All,
This paper by Gibbons and Roth (2002) describes a method of validating clusterings by measuring the mutual information between cluster membership and GO-term annotations. A cluster/annotation contingency matrix is produced, in which the element at row r (a cluster) and column c (a GO term) counts the occurrences of that GO term among the genes in that cluster. Mutual information is then calculated. This is best visualized by this graphic from Steuer et al. (2006).
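For concreteness, here is roughly how I build such a table (a quick sketch with toy gene-to-cluster and gene-to-GO mappings of my own, not the paper's data):

```python
import numpy as np

# Toy mappings of my own (not the paper's data): each gene's cluster,
# and the (possibly multiple) GO terms annotated to each gene.
clusters = {"g1": 0, "g2": 0, "g3": 1, "g4": 1, "g5": 1}
go_terms = {"g1": ["GO:a"], "g2": ["GO:a", "GO:b"],
            "g3": ["GO:b"], "g4": ["GO:b"], "g5": ["GO:c"]}

term_ids = sorted({t for ts in go_terms.values() for t in ts})
n_clusters = max(clusters.values()) + 1

# table[r, c] = number of genes in cluster r annotated with GO term c
table = np.zeros((n_clusters, len(term_ids)), dtype=int)
for gene, r in clusters.items():
    for term in go_terms[gene]:
        table[r, term_ids.index(term)] += 1

print(table)  # rows = clusters, columns = GO terms
```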
My question is how to calculate this mutual information value. I have such a contingency matrix and know how to calculate mutual information, but Gibbons et al. (and Steuer et al. too) use an approximation, and I'm unsure of their notation.
The MI is additive across attributes under their (maybe too strong) assumption:
I(C, [A1, A2]) = I(C, A1) + I(C, A2)
where each I(C, Ai) = H(C) + H(Ai) - H(C, Ai).
What I'm confused about: (1) Why is there no subscript on C? How is H(C, Ai) calculated when Ai corresponds to one column but C seems to correspond to all of the clusters? With only one column, how do we get a joint distribution? (2) How is H(C) calculated? Is it across all attributes?
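To make (1) concrete, here is the only reading I've come up with so far, as a sketch (toy numbers of my own, and quite possibly a misreading of their notation): treat Ai as the per-gene binary indicator for GO term i, and C as the per-gene cluster label.

```python
import numpy as np

def entropy(p):
    # Shannon entropy (base 2); zero-probability entries contribute nothing.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi_cluster_term(term_counts, cluster_sizes):
    # My guess at I(C, Ai): C = each gene's cluster label, Ai = a binary
    # has-term-i / lacks-term-i indicator per gene.  term_counts is column i
    # of the contingency table; note that I seem to also need the cluster
    # sizes, which the table alone doesn't give me (part of my confusion).
    term_counts = np.asarray(term_counts, dtype=float)
    cluster_sizes = np.asarray(cluster_sizes, dtype=float)
    n = cluster_sizes.sum()
    joint = np.stack([term_counts, cluster_sizes - term_counts], axis=1) / n
    h_c = entropy(cluster_sizes / n)   # H(C)
    h_a = entropy(joint.sum(axis=0))   # H(Ai)
    h_ca = entropy(joint.ravel())      # H(C, Ai)
    return h_c + h_a - h_ca            # I(C, Ai) = H(C) + H(Ai) - H(C, Ai)

# Made-up numbers: column i of the table, and the three cluster sizes.
print(mi_cluster_term([4, 1, 0], [5, 6, 4]))
```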
If you want to show an example with the contingency table in the graphic, I'd be forever grateful!
And how is H(C, Ai) calculated? This is what I am most confused about. If you have time, could you work an example with the data above?
I apologize; I see the problem now. After running that contingency matrix through NMI I get a score of 0.318098096606, on a [0, 1] scale.
This modified version, [[5, 0, 0], [0, 7, 0], [0, 0, 0], [0, 0, 7]], scores a perfect 1.
This one, [[5, 0, 0], [0, 7, 0], [0, 0, 0], [0, 1, 7]], scores 0.755192177109.
A random matrix of the same size, with each element drawn from [1, 10], hovers around 0.05.
They use that assumption to account for genes having multiple GO entries. If you are comparing cluster indicator results to univariate annotations, I would highly recommend using NMI.
If you have univariate "true labels" that you would like to compare your clustering results to, here are two things to read:
www.csie.ntu.edu.tw/~cjlin/papers/ecml08.pdf
nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
I apologize; I see the problem now. I have tried, unsuccessfully, to reproduce their results on the toy data, taking C to be each of the only three possibilities I could think of and trying multiple logarithm bases. There doesn't appear to be any great reason for the assumption they make. The standard joint-entropy-normalized NMI used in clustering evaluation can be computed directly from that contingency table, which gives 0.318098096606.
[[5, 0, 0], [0, 7, 0], [0, 0, 0], [0, 1, 7]] results in 0.755192177109.
[[5, 0, 0], [0, 7, 0], [0, 0, 0], [0, 0, 7]] results in 1.
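For reference, here's a minimal sketch of that joint-entropy-normalized NMI, i.e. I(R; C) / H(R, C), computed straight from the contingency matrix (the log base cancels in the ratio); it reproduces the numbers above:

```python
import numpy as np

def nmi_from_contingency(table):
    """NMI normalized by the joint entropy: I(R; C) / H(R, C),
    computed from a cluster x class contingency matrix."""
    t = np.asarray(table, dtype=float)
    joint = t / t.sum()                    # joint distribution p(r, c)
    pr = joint.sum(axis=1, keepdims=True)  # row (cluster) marginals
    pc = joint.sum(axis=0, keepdims=True)  # column (class) marginals
    nz = joint > 0                         # skip empty cells (0 log 0 = 0)
    mi = np.sum(joint[nz] * np.log(joint[nz] / (pr @ pc)[nz]))
    h_joint = -np.sum(joint[nz] * np.log(joint[nz]))
    return mi / h_joint

print(nmi_from_contingency([[5, 0, 0], [0, 7, 0], [0, 0, 0], [0, 1, 7]]))
# -> 0.755192177109...
print(nmi_from_contingency([[5, 0, 0], [0, 7, 0], [0, 0, 0], [0, 0, 7]]))
# -> 1.0
```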
For the multiple-category ownership discussed above (genes carrying several GO terms) I would look somewhere other than NMI, but if you're faced with the task of comparing the cluster indicator against a set of univariate categorical "true" labels for your genes, then NMI is a great bet.
A less common clustering evaluation, which results in a standard-deviation-like term, is Hungarian matching (sketched below).
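Here is a minimal sketch of that idea, assuming scipy is available: build the cluster/label contingency table, use the Hungarian algorithm to find the cluster-to-label matching that maximizes agreement, and report the accuracy under that matching (the helper name and toy data are mine):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_accuracy(pred, true):
    """Best-case accuracy after optimally matching cluster ids to labels
    via the Hungarian algorithm."""
    pred = np.asarray(pred)
    true = np.asarray(true)
    # Contingency table: rows = clusters, columns = true labels.
    clusters, p = np.unique(pred, return_inverse=True)
    labels, t = np.unique(true, return_inverse=True)
    table = np.zeros((len(clusters), len(labels)), dtype=int)
    np.add.at(table, (p, t), 1)
    # linear_sum_assignment minimizes cost, so negate to maximize agreement.
    rows, cols = linear_sum_assignment(-table)
    return table[rows, cols].sum() / len(pred)

print(matched_accuracy([0, 0, 1, 1, 2, 2], ["a", "a", "b", "a", "c", "c"]))
# 5 of 6 genes agree under the best matching -> 0.8333...
```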
Here are two great resources:
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
"Parallel Spectral Clustering in Distributed Systems," Chen et al.
Sorry I couldn't be of more help!