Proceedings ArticleDOI

Comparing clusterings: an axiomatic view

07 Aug 2005 - pp 577-584
TL;DR: This paper views clusterings as elements of a lattice and gives an axiomatic characterization of some criteria for comparing clusterings, including the variation of information and the unadjusted Rand index, and proves an impossibility result: there is no "sensible" criterion for comparing clusterings that is simultaneously aligned with the lattice of partitions, convexly additive, and bounded.
Abstract: This paper views clusterings as elements of a lattice. Distances between clusterings are analyzed in their relationship to the lattice. From this vantage point, we first give an axiomatic characterization of some criteria for comparing clusterings, including the variation of information and the unadjusted Rand index. Then we study other distances between partitions w.r.t. these axioms and prove an impossibility result: there is no "sensible" criterion for comparing clusterings that is simultaneously (1) aligned with the lattice of partitions, (2) convexly additive, and (3) bounded.
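
The abstract's headline quantity is easy to compute. The sketch below (illustrative Python, not code from the paper) evaluates the variation of information d_VI(C, C') = H(C) + H(C') - 2 I(C, C') from the joint distribution of cluster labels:

```python
# Minimal sketch: variation of information between two clusterings of the
# same n points, computed in nats from their joint label distribution.
import numpy as np

def variation_of_information(labels_a, labels_b):
    """d_VI(C, C') = H(C) + H(C') - 2 I(C, C')."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    n = labels_a.size
    # Joint distribution over (cluster in C, cluster in C') label pairs.
    pairs, counts = np.unique(
        np.stack([labels_a, labels_b]), axis=1, return_counts=True)
    p_joint = counts / n
    p_a = np.bincount(labels_a) / n
    p_b = np.bincount(labels_b) / n
    h_a = -np.sum(p_a[p_a > 0] * np.log(p_a[p_a > 0]))
    h_b = -np.sum(p_b[p_b > 0] * np.log(p_b[p_b > 0]))
    mutual_info = np.sum(
        p_joint * np.log(p_joint / (p_a[pairs[0]] * p_b[pairs[1]])))
    return h_a + h_b - 2.0 * mutual_info

c1 = [0, 0, 0, 1, 1, 1]
c2 = [0, 0, 1, 1, 2, 2]
print(variation_of_information(c1, c1))  # 0.0 for identical clusterings
print(variation_of_information(c1, c2))  # > 0, growing as they diverge
```

Note that d_VI is unbounded (it can grow like log n), which is one concrete face of the impossibility result: a criterion aligned with the lattice and convexly additive cannot also be bounded.
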
Citations
Book
24 Aug 2012
TL;DR: This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach, and is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
Abstract: Today's Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package--PMTK (probabilistic modeling toolkit)--that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.

8,059 citations


Cites background from "Comparing clusterings: an axiomatic..."

  • ...Another variant, called variation of information, is described in (Meila 2005)....

Journal ArticleDOI
TL;DR: This paper investigates two fundamental problems in computer vision: contour detection and image segmentation and presents state-of-the-art algorithms for both of these tasks.
Abstract: This paper investigates two fundamental problems in computer vision: contour detection and image segmentation. We present state-of-the-art algorithms for both of these tasks. Our contour detector combines multiple local cues into a globalization framework based on spectral clustering. Our segmentation algorithm consists of generic machinery for transforming the output of any contour detector into a hierarchical region tree. In this manner, we reduce the problem of image segmentation to that of contour detection. Extensive experimental evaluation demonstrates that both our contour detection and segmentation methods significantly outperform competing algorithms. The automatically generated hierarchical segmentations can be interactively refined by user-specified annotations. Computation at multiple image resolutions provides a means of coupling our system to recognition applications.
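
As a rough illustration of the reduction described above (a toy sketch, not the paper's actual gPb/OWT-UCM machinery), thresholding a contour-strength map at increasing levels and labeling the connected components of the remaining non-boundary pixels yields a nested, coarse-to-fine stack of regions:

```python
# Toy sketch: a region "hierarchy" from a contour-strength map, obtained by
# sweeping a threshold and taking connected components of non-boundary pixels.
import numpy as np
from scipy import ndimage

def region_hierarchy(contour_strength, thresholds):
    """One labeled segmentation per threshold (illustration only)."""
    segmentations = []
    for t in thresholds:
        interior = contour_strength < t        # pixels not on a strong contour
        labels, num = ndimage.label(interior)  # 4-connected components
        segmentations.append((t, labels, num))
    return segmentations

# Synthetic map: one weak horizontal edge and one strong vertical edge.
cmap = np.zeros((8, 8))
cmap[4, :] = 0.3  # weak boundary
cmap[:, 4] = 0.9  # strong boundary (overwrites the crossing pixel)
for t, labels, num in region_hierarchy(cmap, [0.2, 0.5, 1.0]):
    print(f"threshold {t}: {num} regions")
# 0.2 keeps both edges (4 regions); 0.5 dissolves the weak edge (2 regions);
# 1.0 merges everything (1 region).
```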

5,068 citations


Cites background or methods from "Comparing clusterings: an axiomatic..."

  • ...We also report the Probabilistic Rand Index and Variation of Information benchmarks....

  • ...Although VI possesses some interesting theoretical properties [6], its perceptual meaning and applicability in the presence of several ground-truth segmentations remain unclear....

  • ...Noteworthy examples include the Probabilistic Rand Index, introduced in this context by [5], the Variation of Information [6], [7], and the Segmentation Covering criterion used in the PASCAL challenge [8]....

  • ...The Variation of Information metric was introduced for the purpose of clustering comparison [6]....

  • ...While the boundary benchmark and segmentation covering criterion clearly separate it from all other segmentation methods, the gap narrows for the Probabilistic Rand Index and the Variation of Information....

Journal ArticleDOI
TL;DR: This paper presents an organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature as well as some newly proposed ones, and advocates the normalized information distance (NID) as a general measure of choice.
Abstract: Information theoretic measures form a fundamental class of measures for comparing clusterings, and have recently received increasing interest. Nevertheless, a number of questions concerning their properties and inter-relationships remain unresolved. In this paper, we perform an organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones. We discuss and prove their important properties, such as the metric property and the normalization property. We then highlight to the clustering community the importance of correcting information theoretic measures for chance, especially when the data size is small compared to the number of clusters present therein. Of the available information theoretic based measures, we advocate the normalized information distance (NID) as a general measure of choice, for it possesses concurrently several important properties, such as being both a metric and a normalized measure, admitting an exact analytical adjusted-for-chance form, and using the nominal [0,1] range better than other normalized variants.
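
For reference, the advocated measure has a one-line definition, NID(U, V) = 1 - I(U, V) / max(H(U), H(V)). A small Python sketch (illustrative, not the authors' code; assumes integer labels and at least one non-trivial clustering):

```python
# Sketch of the normalized information distance (NID) between two clusterings:
# 0 iff they are identical up to relabeling, 1 when they share no information.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def nid(labels_u, labels_v):
    u, v = np.asarray(labels_u), np.asarray(labels_v)
    # Contingency table between the two clusterings.
    table = np.zeros((u.max() + 1, v.max() + 1))
    np.add.at(table, (u, v), 1)
    p_uv = table / u.size
    p_u, p_v = p_uv.sum(axis=1), p_uv.sum(axis=0)
    nz = p_uv > 0
    mutual_info = np.sum(p_uv[nz] * np.log(p_uv[nz] / np.outer(p_u, p_v)[nz]))
    return 1.0 - mutual_info / max(entropy(p_u), entropy(p_v))

print(nid([0, 0, 1, 1], [1, 1, 0, 0]))  # 0.0: same partition, relabeled
print(nid([0, 0, 1, 1], [0, 1, 0, 1]))  # 1.0: independent partitions
```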

1,818 citations


Cites background or methods from "Comparing clusterings: an axiomatic..."

  • ...In this context, the pioneering works of Meilă (2003, 2005, 2007) have shown a number of desirable theoretical properties of one of these measures, the variation of information (VI), such as its metric property and its alignment with the lattice of partitions....

  • ...The fact that D_joint (and hence D_sum) is a true metric is a well known result (Meilă, 2005)....

  • ...For the particular purpose of clustering comparison, this class of measures has been popularized through the works of Strehl and Ghosh (2002) and Meilă (2005), and since then has been employed in various subsequent research (Fern and Brodley, 2003; He et al., 2008; Asur et al., 2007; Tumer and…...

  • ...Related Work Meilă (2005) considered clustering comparison measures with respect to their alignment with the lattice of partitions....

Journal ArticleDOI
TL;DR: This paper proposes an information theoretic criterion for comparing two partitions, or clusterings, of the same data set, called variation of information (VI), and presents it from an axiomatic point of view, showing that it is the only "sensible" criterion for comparing partitions that is both aligned to the lattice and convexly additive.
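
The "aligned to the lattice" property can be checked numerically: the meet of two clusterings (their common refinement) lies on a shortest VI path between them, so the two legs through the meet add up exactly. A sketch reusing the variation_of_information function from the first code block above:

```python
# Check VI's lattice alignment: VI(C, C') = VI(C, C ^ C') + VI(C ^ C', C'),
# where C ^ C' is the meet (common refinement) of the two partitions.
import numpy as np

def meet(labels_a, labels_b):
    """Common refinement: points share a block iff together in both."""
    pairs = list(zip(labels_a, labels_b))
    ids = {p: i for i, p in enumerate(dict.fromkeys(pairs))}
    return [ids[p] for p in pairs]

c1 = [0, 0, 0, 1, 1, 1]
c2 = [0, 0, 1, 1, 2, 2]
m = meet(c1, c2)  # [0, 0, 1, 2, 3, 3]
lhs = variation_of_information(c1, c2)
rhs = variation_of_information(c1, m) + variation_of_information(m, c2)
print(np.isclose(lhs, rhs))  # True
```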

1,527 citations


Additional excerpts

  • ...the second kind S(n,K) [17]....

Journal ArticleDOI
TL;DR: In this article, the authors distinguish between structural and functional definitions of network communities and identify networks with explicitly labeled functional communities, to which they refer as ground-truth communities: networks in which nodes explicitly state their community memberships, and these social groups are used to define a reliable and robust notion of ground-truth communities.
Abstract: Nodes in real-world networks organize into densely linked communities where edges appear with high concentration among the members of the community. Identifying such communities of nodes has proven to be a challenging task due to a plethora of definitions of network communities, intractability of methods for detecting them, and the issues with evaluation which stem from the lack of a reliable gold-standard ground-truth. In this paper, we distinguish between structural and functional definitions of network communities. Structural definitions of communities are based on connectivity patterns, like the density of connections between the community members, while functional definitions are based on (often unobserved) common function or role of the community members in the network. We argue that the goal of network community detection is to extract functional communities based on the connectivity structure of the nodes in the network. We then identify networks with explicitly labeled functional communities to which we refer as ground-truth communities. In particular, we study a set of 230 large real-world social, collaboration, and information networks where nodes explicitly state their community memberships. For example, in social networks, nodes explicitly join various interest-based social groups. We use such social groups to define a reliable and robust notion of ground-truth communities. We then propose a methodology, which allows us to compare and quantitatively evaluate how different structural definitions of communities correspond to ground-truth functional communities. We study 13 commonly used structural definitions of communities and examine their sensitivity, robustness and performance in identifying the ground-truth. We show that the 13 structural definitions are heavily correlated and naturally group into four classes. We find that two of these definitions, Conductance and Triad participation ratio, consistently give the best performance in identifying ground-truth communities. We also investigate a task of detecting communities given a single seed node. We extend the local spectral clustering algorithm into a heuristic parameter-free community detection method that easily scales to networks with more than 100 million nodes. The proposed method achieves 30 % relative improvement over current local clustering methods.
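
Of the two winning criteria, conductance is the simpler to state: for a candidate community S, conductance(S) = cut(S, V\S) / min(vol(S), vol(V\S)), where vol is a sum of degrees; lower is more community-like. A hedged sketch (networkx is used only to build the example graph):

```python
# Sketch: conductance of a node set in an undirected graph.
import networkx as nx

def conductance(graph, community):
    """cut(S, V\\S) / min(vol(S), vol(V\\S))."""
    s = set(community)
    cut = sum(1 for u, v in graph.edges if (u in s) != (v in s))
    vol_s = sum(deg for node, deg in graph.degree if node in s)
    vol_rest = 2 * graph.number_of_edges() - vol_s
    return cut / min(vol_s, vol_rest)

# Two 4-cliques joined by a single bridge edge: each clique is a good
# community, so its conductance is low.
g = nx.barbell_graph(4, 0)
print(conductance(g, {0, 1, 2, 3}))  # 1/13 ≈ 0.077
```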

1,518 citations

References
More filters
Book
01 Jan 1991
TL;DR: The authors examine the roles of entropy, inequalities, and randomness in the design and construction of codes.
Abstract: Preface to the Second Edition. Preface to the First Edition. Acknowledgments for the Second Edition. Acknowledgments for the First Edition. 1. Introduction and Preview. 1.1 Preview of the Book. 2. Entropy, Relative Entropy, and Mutual Information. 2.1 Entropy. 2.2 Joint Entropy and Conditional Entropy. 2.3 Relative Entropy and Mutual Information. 2.4 Relationship Between Entropy and Mutual Information. 2.5 Chain Rules for Entropy, Relative Entropy, and Mutual Information. 2.6 Jensen's Inequality and Its Consequences. 2.7 Log Sum Inequality and Its Applications. 2.8 Data-Processing Inequality. 2.9 Sufficient Statistics. 2.10 Fano's Inequality. Summary. Problems. Historical Notes. 3. Asymptotic Equipartition Property. 3.1 Asymptotic Equipartition Property Theorem. 3.2 Consequences of the AEP: Data Compression. 3.3 High-Probability Sets and the Typical Set. Summary. Problems. Historical Notes. 4. Entropy Rates of a Stochastic Process. 4.1 Markov Chains. 4.2 Entropy Rate. 4.3 Example: Entropy Rate of a Random Walk on a Weighted Graph. 4.4 Second Law of Thermodynamics. 4.5 Functions of Markov Chains. Summary. Problems. Historical Notes. 5. Data Compression. 5.1 Examples of Codes. 5.2 Kraft Inequality. 5.3 Optimal Codes. 5.4 Bounds on the Optimal Code Length. 5.5 Kraft Inequality for Uniquely Decodable Codes. 5.6 Huffman Codes. 5.7 Some Comments on Huffman Codes. 5.8 Optimality of Huffman Codes. 5.9 Shannon-Fano-Elias Coding. 5.10 Competitive Optimality of the Shannon Code. 5.11 Generation of Discrete Distributions from Fair Coins. Summary. Problems. Historical Notes. 6. Gambling and Data Compression. 6.1 The Horse Race. 6.2 Gambling and Side Information. 6.3 Dependent Horse Races and Entropy Rate. 6.4 The Entropy of English. 6.5 Data Compression and Gambling. 6.6 Gambling Estimate of the Entropy of English. Summary. Problems. Historical Notes. 7. Channel Capacity. 7.1 Examples of Channel Capacity. 7.2 Symmetric Channels. 7.3 Properties of Channel Capacity. 7.4 Preview of the Channel Coding Theorem. 7.5 Definitions. 7.6 Jointly Typical Sequences. 7.7 Channel Coding Theorem. 7.8 Zero-Error Codes. 7.9 Fano's Inequality and the Converse to the Coding Theorem. 7.10 Equality in the Converse to the Channel Coding Theorem. 7.11 Hamming Codes. 7.12 Feedback Capacity. 7.13 Source-Channel Separation Theorem. Summary. Problems. Historical Notes. 8. Differential Entropy. 8.1 Definitions. 8.2 AEP for Continuous Random Variables. 8.3 Relation of Differential Entropy to Discrete Entropy. 8.4 Joint and Conditional Differential Entropy. 8.5 Relative Entropy and Mutual Information. 8.6 Properties of Differential Entropy, Relative Entropy, and Mutual Information. Summary. Problems. Historical Notes. 9. Gaussian Channel. 9.1 Gaussian Channel: Definitions. 9.2 Converse to the Coding Theorem for Gaussian Channels. 9.3 Bandlimited Channels. 9.4 Parallel Gaussian Channels. 9.5 Channels with Colored Gaussian Noise. 9.6 Gaussian Channels with Feedback. Summary. Problems. Historical Notes. 10. Rate Distortion Theory. 10.1 Quantization. 10.2 Definitions. 10.3 Calculation of the Rate Distortion Function. 10.4 Converse to the Rate Distortion Theorem. 10.5 Achievability of the Rate Distortion Function. 10.6 Strongly Typical Sequences and Rate Distortion. 10.7 Characterization of the Rate Distortion Function. 10.8 Computation of Channel Capacity and the Rate Distortion Function. Summary. Problems. Historical Notes. 11. Information Theory and Statistics. 11.1 Method of Types. 11.2 Law of Large Numbers. 
11.3 Universal Source Coding. 11.4 Large Deviation Theory. 11.5 Examples of Sanov's Theorem. 11.6 Conditional Limit Theorem. 11.7 Hypothesis Testing. 11.8 Chernoff-Stein Lemma. 11.9 Chernoff Information. 11.10 Fisher Information and the Cramér-Rao Inequality. Summary. Problems. Historical Notes. 12. Maximum Entropy. 12.1 Maximum Entropy Distributions. 12.2 Examples. 12.3 Anomalous Maximum Entropy Problem. 12.4 Spectrum Estimation. 12.5 Entropy Rates of a Gaussian Process. 12.6 Burg's Maximum Entropy Theorem. Summary. Problems. Historical Notes. 13. Universal Source Coding. 13.1 Universal Codes and Channel Capacity. 13.2 Universal Coding for Binary Sequences. 13.3 Arithmetic Coding. 13.4 Lempel-Ziv Coding. 13.5 Optimality of Lempel-Ziv Algorithms. Summary. Problems. Historical Notes. 14. Kolmogorov Complexity. 14.1 Models of Computation. 14.2 Kolmogorov Complexity: Definitions and Examples. 14.3 Kolmogorov Complexity and Entropy. 14.4 Kolmogorov Complexity of Integers. 14.5 Algorithmically Random and Incompressible Sequences. 14.6 Universal Probability. 14.7 Kolmogorov Complexity. 14.9 Universal Gambling. 14.10 Occam's Razor. 14.11 Kolmogorov Complexity and Universal Probability. 14.12 Kolmogorov Sufficient Statistic. 14.13 Minimum Description Length Principle. Summary. Problems. Historical Notes. 15. Network Information Theory. 15.1 Gaussian Multiple-User Channels. 15.2 Jointly Typical Sequences. 15.3 Multiple-Access Channel. 15.4 Encoding of Correlated Sources. 15.5 Duality Between Slepian-Wolf Encoding and Multiple-Access Channels. 15.6 Broadcast Channel. 15.7 Relay Channel. 15.8 Source Coding with Side Information. 15.9 Rate Distortion with Side Information. 15.10 General Multiterminal Networks. Summary. Problems. Historical Notes. 16. Information Theory and Portfolio Theory. 16.1 The Stock Market: Some Definitions. 16.2 Kuhn-Tucker Characterization of the Log-Optimal Portfolio. 16.3 Asymptotic Optimality of the Log-Optimal Portfolio. 16.4 Side Information and the Growth Rate. 16.5 Investment in Stationary Markets. 16.6 Competitive Optimality of the Log-Optimal Portfolio. 16.7 Universal Portfolios. 16.8 Shannon-McMillan-Breiman Theorem (General AEP). Summary. Problems. Historical Notes. 17. Inequalities in Information Theory. 17.1 Basic Inequalities of Information Theory. 17.2 Differential Entropy. 17.3 Bounds on Entropy and Relative Entropy. 17.4 Inequalities for Types. 17.5 Combinatorial Bounds on Entropy. 17.6 Entropy Rates of Subsets. 17.7 Entropy and Fisher Information. 17.8 Entropy Power Inequality and Brunn-Minkowski Inequality. 17.9 Inequalities for Determinants. 17.10 Inequalities for Ratios of Determinants. Summary. Problems. Historical Notes. Bibliography. List of Symbols. Index.

45,034 citations


"Comparing clusterings: an axiomatic..." refers background in this paper

  • ...…entropies (Meilă, 2003): d_VI(C, C′) = H(C|C′) + H(C′|C) (14). The proof then follows from elementary properties of the conditional entropy (see (Cover & Thomas, 1991) for details): first, if C′ is a refinement of C, then H(C|C′) = 0; then, one applies the chain rule for conditional…...

Book
13 Apr 1997

7,046 citations

Journal ArticleDOI
TL;DR: This article proposes several criteria which isolate specific aspects of the performance of a method, such as its retrieval of inherent structure, its sensitivity to resampling and the stability of its results in the light of new data.
Abstract: Many intuitively appealing methods have been suggested for clustering data, however, interpretation of their results has been hindered by the lack of objective criteria. This article proposes several criteria which isolate specific aspects of the performance of a method, such as its retrieval of inherent structure, its sensitivity to resampling and the stability of its results in the light of new data. These criteria depend on a measure of similarity between two different clusterings of the same set of data; the measure essentially considers how each pair of data points is assigned in each clustering.
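
The pair-counting idea in the last sentence is directly implementable: the (unadjusted) Rand index is the fraction of point pairs on which two clusterings agree, i.e. pairs placed together in both or apart in both. A minimal sketch (illustrative, O(n^2) on purpose for clarity):

```python
# Sketch: unadjusted Rand index by explicit pair counting.
from itertools import combinations

def rand_index(labels_a, labels_b):
    agree = total = 0
    for (a1, b1), (a2, b2) in combinations(zip(labels_a, labels_b), 2):
        together_a = a1 == a2   # pair co-clustered in the first clustering?
        together_b = b1 == b2   # ... and in the second?
        agree += together_a == together_b
        total += 1
    return agree / total

print(rand_index([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0: identical clusterings
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # ~0.33: mostly disagreement
```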

6,179 citations


"Comparing clusterings: an axiomatic..." refers background in this paper

  • ...It should also be noted that this definition is not aligned to the majority of clustering comparison criteria, like the Rand (Rand, 1971), Fowlkes-Mallows (Fowlkes & Mallows, 1983) and other indices....

  • ...When can we do this? The aforementioned indices can all be interpreted as probabilities (see the original papers (Rand, 1971; Fowlkes & Mallows, 1983; Hubert & Arabie, 1985; Ben-Hur et al., 2002) for details), but their adjusted versions can not (Hubert & Arabie, 1985)....

  • ...The clustering literature contains quite a number of criteria for comparing clusterings: the Rand index (Rand, 1971), the Jaccard index (Ben-Hur et al., 2002), the Fowlkes-Mallows index (Fowlkes & Mallows, 1983), the Hubert and Arabie indices (Hubert & Arabie, 1985), the Mirkin metric (Mirkin,…...

Book
01 Jan 1980
TL;DR: This new Annals edition continues to convey the message that intersection graph models are a necessary and important tool for solving real-world problems and remains a stepping stone from which the reader may embark on one of many fascinating research trails.
Abstract: Algorithmic Graph Theory and Perfect Graphs, first published in 1980, has become the classic introduction to the field. This new Annals edition continues to convey the message that intersection graph models are a necessary and important tool for solving real-world problems. It remains a stepping stone from which the reader may embark on one of many fascinating research trails. The past twenty years have been an amazingly fruitful period of research in algorithmic graph theory and structured families of graphs. Especially important have been the theory and applications of new intersection graph models such as generalizations of permutation graphs and interval graphs. These have led to new families of perfect graphs and many algorithmic results. These are surveyed in the new Epilogue chapter in this second edition. New edition of the "classic" book on the topic. Wonderful introduction to a rich research area. Leading author in the field of algorithmic graph theory. Beautifully written for the new mathematician or computer scientist. Comprehensive treatment.
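
To make the book's central notion concrete: an intersection graph has one vertex per set and an edge wherever two sets intersect; interval graphs are the special case where the sets are intervals on the real line. A toy sketch:

```python
# Sketch: build the intersection graph of a family of closed intervals.
from itertools import combinations

def interval_graph(intervals):
    """Edges between intervals [a, b] and [c, d] that overlap."""
    edges = []
    for (i, (a, b)), (j, (c, d)) in combinations(enumerate(intervals), 2):
        if max(a, c) <= min(b, d):  # nonempty intersection
            edges.append((i, j))
    return edges

# Intervals 0 and 1 overlap, 1 and 2 overlap, 0 and 2 do not: a path graph.
print(interval_graph([(0, 2), (1, 4), (3, 5)]))  # [(0, 1), (1, 2)]
```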

4,090 citations

Journal ArticleDOI
TL;DR: A measure of similarity between two hierarchical clusterings, Bk, is derived from the matching matrix, [mij], formed by cutting the two hierarchical trees and counting the number of matching entries in the k clusters in each tree.
Abstract: This article concerns the derivation and use of a measure of similarity between two hierarchical clusterings. The measure, Bk, is derived from the matching matrix, [mij], formed by cutting the two hierarchical trees and counting the number of matching entries in the k clusters in each tree. The mean and variance of Bk are determined under the assumption that the margins of [mij] are fixed. Thus, Bk represents a collection of measures for k = 2, …, n - 1. (k, Bk) plots are found to be useful in portraying the similarity of two clusterings. Bk is compared to other measures of similarity proposed respectively by Baker (1974) and Rand (1971). The use of (k, Bk) plots for studying clustering methods is explored by a series of Monte Carlo sampling experiments. An example of the use of (k, Bk) on real data is given.
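
A sketch of Bk as the abstract defines it (formulas from Fowlkes & Mallows, 1983; function name illustrative): from the matching matrix [mij] of two partitions into k clusters, Bk = Tk / sqrt(Pk * Qk), with Tk = sum_ij mij^2 - n, Pk = sum_i mi.^2 - n, and Qk = sum_j m.j^2 - n.

```python
# Sketch: Fowlkes-Mallows B_k from the matching (contingency) matrix.
import numpy as np

def fowlkes_mallows_bk(labels_a, labels_b):
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    n = a.size
    m = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(m, (a, b), 1)                 # matching matrix [m_ij]
    tk = np.sum(m ** 2) - n
    pk = np.sum(m.sum(axis=1) ** 2) - n
    qk = np.sum(m.sum(axis=0) ** 2) - n
    return tk / np.sqrt(pk * qk)

print(fowlkes_mallows_bk([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0: identical cuts
print(fowlkes_mallows_bk([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0: no shared pairs
```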

1,376 citations


"Comparing clusterings: an axiomatic..." refers background or methods in this paper

  • ...It should also be noted that this definition is not aligned to the majority of clustering comparison criteria, like the Rand (Rand, 1971), Fowlkes-Mallows (Fowlkes & Mallows, 1983) and other indices....

  • ...The clustering literature contains quite a number of criteria for comparing clusterings: the Rand index (Rand, 1971), the Jaccard index (Ben-Hur et al., 2002), the Fowlkes-Mallows index (Fowlkes & Mallows, 1983), the Hubert and Arabie indices (Hubert & Arabie, 1985), the Mirkin metric (Mirkin, 1996), the Van Dongen metric (van Dongen, 2000), as well as statistically "adjusted" versions of some of the above (Hubert & Arabie,…...

  • ...When can we do this? The aforementioned indices can all be interpreted as probabilities (see the original papers (Rand, 1971; Fowlkes & Mallows, 1983; Hubert & Arabie, 1985; Ben-Hur et al., 2002) for details), but their adjusted versions can not (Hubert & Arabie, 1985)....
