Hi All,
This paper by Gibbons and Roth (2002) describes a method of validating clusterings by measuring the mutual information between cluster membership and GO-term annotations. A cluster/annotation contingency matrix is produced, in which the element at row r (a cluster) and column c (a GO term) counts the occurrences of that GO term among the genes in that cluster. Mutual information is then calculated. This is best visualized by this graphic from Steuer et al. (2006).
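For concreteness, here is roughly how I build such a table (a quick sketch with toy gene-to-cluster and gene-to-GO mappings of my own, not the paper's data):

```python
import numpy as np

# Toy mappings of my own (not the paper's data): each gene's cluster,
# and the (possibly multiple) GO terms annotated to each gene.
clusters = {"g1": 0, "g2": 0, "g3": 1, "g4": 1, "g5": 1}
go_terms = {"g1": ["GO:a"], "g2": ["GO:a", "GO:b"],
            "g3": ["GO:b"], "g4": ["GO:b"], "g5": ["GO:c"]}

term_ids = sorted({t for ts in go_terms.values() for t in ts})
n_clusters = max(clusters.values()) + 1

# table[r, c] = number of genes in cluster r annotated with GO term c
table = np.zeros((n_clusters, len(term_ids)), dtype=int)
for gene, r in clusters.items():
    for term in go_terms[gene]:
        table[r, term_ids.index(term)] += 1

print(table)  # rows = clusters, columns = GO terms
```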
My question is how to calculate this mutual information value. I have such a contingency matrix and know how to calculate mutual information, but Gibbons et al. (and Steuer et al. too) use an approximation, and I'm unsure of their notation.
The MI is additive across attributes under their (maybe too strong) assumption:
I(C, [A1, A2]) = I(C, A1) + I(C, A2)
where each I(C, Ai) = H(C) + H(Ai) - H(C, Ai).
What I'm confused about: (1) Why is there no subscript on C? How is H(C, Ai) calculated when Ai corresponds to one column but C seems to correspond to all of the clusters? With only one column, how do we get a joint distribution? (2) How is H(C) calculated? Is it across all attributes?
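To make (1) concrete, here is the only reading I've come up with so far, as a sketch (toy numbers of my own, and quite possibly a misreading of their notation): treat Ai as the per-gene binary indicator for GO term i, and C as the per-gene cluster label.

```python
import numpy as np

def entropy(p):
    # Shannon entropy (base 2); zero-probability entries contribute nothing.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi_cluster_term(term_counts, cluster_sizes):
    # My guess at I(C, Ai): C = each gene's cluster label, Ai = a binary
    # has-term-i / lacks-term-i indicator per gene.  term_counts is column i
    # of the contingency table; note that I seem to also need the cluster
    # sizes, which the table alone doesn't give me (part of my confusion).
    term_counts = np.asarray(term_counts, dtype=float)
    cluster_sizes = np.asarray(cluster_sizes, dtype=float)
    n = cluster_sizes.sum()
    joint = np.stack([term_counts, cluster_sizes - term_counts], axis=1) / n
    h_c = entropy(cluster_sizes / n)   # H(C)
    h_a = entropy(joint.sum(axis=0))   # H(Ai)
    h_ca = entropy(joint.ravel())      # H(C, Ai)
    return h_c + h_a - h_ca            # I(C, Ai) = H(C) + H(Ai) - H(C, Ai)

# Made-up numbers: column i of the table, and the three cluster sizes.
print(mi_cluster_term([4, 1, 0], [5, 6, 4]))
```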
If you want to show an example with the contingency table in the graphic, I'd be forever grateful!
And how is H(C, Ai) calculated? This is what I am most confused about. If you have time, could you work an example with the data above?
I apologize; I see the problem now. After running that contingency matrix through NMI I get a score of 0.318098096606, on a [0, 1] scale.
This modified version, [[5, 0, 0], [0, 7, 0], [0, 0, 0], [0, 0, 7]], scores a perfect 1.
This one, [[5, 0, 0], [0, 7, 0], [0, 0, 0], [0, 1, 7]], scores 0.755192177109.
A random matrix of the same size, with each element drawn from [1, 10], hovers around 0.05.
They use that assumption to account for genes having multiple GO entries. If you are comparing cluster indicator results to univariate annotations, I would highly recommend using NMI.
If you have univariate "true labels" that you would like to compare your clustering results to, here are two things to read:
www.csie.ntu.edu.tw/~cjlin/papers/ecml08.pdf
nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
I apologize; I see the problem now. I have tried, unsuccessfully, to reproduce their results on the toy data, taking C to be each of the only three possibilities I could think of and trying multiple logarithm bases. There doesn't appear to be any great reason for the assumption they make. The standard joint-entropy-normalized NMI used in clustering evaluation can be computed directly from that contingency table, which gives 0.318098096606.
[[5, 0, 0], [0, 7, 0], [0, 0, 0], [0, 1, 7]] results in 0.755192177109.
[[5, 0, 0], [0, 7, 0], [0, 0, 0], [0, 0, 7]] results in 1.
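For reference, here's a minimal sketch of that joint-entropy-normalized NMI, i.e. I(R; C) / H(R, C), computed straight from the contingency matrix (the log base cancels in the ratio); it reproduces the numbers above:

```python
import numpy as np

def nmi_from_contingency(table):
    """NMI normalized by the joint entropy: I(R; C) / H(R, C),
    computed from a cluster x class contingency matrix."""
    t = np.asarray(table, dtype=float)
    joint = t / t.sum()                    # joint distribution p(r, c)
    pr = joint.sum(axis=1, keepdims=True)  # row (cluster) marginals
    pc = joint.sum(axis=0, keepdims=True)  # column (class) marginals
    nz = joint > 0                         # skip empty cells (0 log 0 = 0)
    mi = np.sum(joint[nz] * np.log(joint[nz] / (pr @ pc)[nz]))
    h_joint = -np.sum(joint[nz] * np.log(joint[nz]))
    return mi / h_joint

print(nmi_from_contingency([[5, 0, 0], [0, 7, 0], [0, 0, 0], [0, 1, 7]]))
# -> 0.755192177109...
print(nmi_from_contingency([[5, 0, 0], [0, 7, 0], [0, 0, 0], [0, 0, 7]]))
# -> 1.0
```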
For the multiple-category ownership discussed above (genes carrying several GO terms) I would look somewhere other than NMI, but if you're faced with the task of comparing the cluster indicator against a set of univariate categorical "true" labels for your genes, then NMI is a great bet.
A less common clustering evaluation, which results in a standard-deviation-like term, is Hungarian matching (sketched below).
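Here is a minimal sketch of that idea, assuming scipy is available: build the cluster/label contingency table, use the Hungarian algorithm to find the cluster-to-label matching that maximizes agreement, and report the accuracy under that matching (the helper name and toy data are mine):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_accuracy(pred, true):
    """Best-case accuracy after optimally matching cluster ids to labels
    via the Hungarian algorithm."""
    pred = np.asarray(pred)
    true = np.asarray(true)
    # Contingency table: rows = clusters, columns = true labels.
    clusters, p = np.unique(pred, return_inverse=True)
    labels, t = np.unique(true, return_inverse=True)
    table = np.zeros((len(clusters), len(labels)), dtype=int)
    np.add.at(table, (p, t), 1)
    # linear_sum_assignment minimizes cost, so negate to maximize agreement.
    rows, cols = linear_sum_assignment(-table)
    return table[rows, cols].sum() / len(pred)

print(matched_accuracy([0, 0, 1, 1, 2, 2], ["a", "a", "b", "a", "c", "c"]))
# 5 of 6 genes agree under the best matching -> 0.8333...
```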
Here are two great resources:
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
"Parallel Spectral Clustering in Distributed Systems," Chen et al.
Sorry I couldn't be of more help!