Proceedings ArticleDOI

Information theoretic measures for clusterings comparison: is a correction for chance necessary?

TL;DR
This paper derives the analytical formula for the expected mutual information value between a pair of clusterings, and proposes the adjusted version for several popular information theoretic based measures.
Abstract
Information theoretic based measures form a fundamental class of similarity measures for comparing clusterings, besides the classes of pair-counting based and set-matching based measures. In this paper, we discuss the necessity of correction for chance for information theoretic based measures for clusterings comparison. We observe that the baseline for such measures, i.e., the average value between random partitions of a data set, does not take on a constant value, and tends to show larger variation when the ratio between the number of data points and the number of clusters is small. A similar effect appears in some non-information theoretic based measures, such as the well-known Rand Index. Assuming a hypergeometric model of randomness, we derive the analytical formula for the expected mutual information value between a pair of clusterings, and then propose adjusted versions of several popular information theoretic based measures. Some examples are given to demonstrate the need for and usefulness of the adjusted measures.
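The correction described in the abstract can be sketched in code. The snippet below is an illustrative Python implementation of adjusted mutual information (AMI): it builds the contingency table of two labelings, computes the observed mutual information, sums the expected mutual information over all feasible cell counts under a hypergeometric model of randomness, and normalizes so that the expected value under random partitioning maps to zero. The function name is hypothetical, and the max-entropy normalization used here is only one of several variants discussed in the literature.

```python
import math
from collections import Counter

def adjusted_mutual_information(labels_u, labels_v):
    """Sketch of adjusted mutual information under a hypergeometric
    model of randomness (illustrative, max-entropy normalization)."""
    n = len(labels_u)
    # Contingency table cells and cluster-size marginals.
    pairs = Counter(zip(labels_u, labels_v))
    a = Counter(labels_u)   # row sums (sizes of clusters in U)
    b = Counter(labels_v)   # column sums (sizes of clusters in V)

    # Observed mutual information (natural log).
    mi = sum((nij / n) * math.log(n * nij / (a[u] * b[v]))
             for (u, v), nij in pairs.items())

    # Entropies of the two clusterings.
    h_u = -sum((ai / n) * math.log(ai / n) for ai in a.values())
    h_v = -sum((bj / n) * math.log(bj / n) for bj in b.values())

    # Expected MI: for each (row, column) marginal pair, sum the MI
    # contribution of every feasible cell count nij, weighted by its
    # hypergeometric probability (computed via log-gamma for stability).
    emi = 0.0
    lognf = math.lgamma(n + 1)
    for ai in a.values():
        for bj in b.values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                term = (nij / n) * math.log(n * nij / (ai * bj))
                logp = (math.lgamma(ai + 1) + math.lgamma(bj + 1)
                        + math.lgamma(n - ai + 1) + math.lgamma(n - bj + 1)
                        - lognf - math.lgamma(nij + 1)
                        - math.lgamma(ai - nij + 1)
                        - math.lgamma(bj - nij + 1)
                        - math.lgamma(n - ai - bj + nij + 1))
                emi += term * math.exp(logp)

    # Adjusted MI: identical clusterings give 1; the expected value
    # between random partitions is re-centered to 0.
    denom = max(h_u, h_v) - emi
    return (mi - emi) / denom if denom else 1.0
```

Identical clusterings (up to label permutation) score 1, while two independent partitions score near or below 0 rather than the positive baseline that unadjusted mutual information would report.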


Citations
Book

Machine Learning : A Probabilistic Perspective

TL;DR: This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach, and is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
Journal ArticleDOI

Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance

TL;DR: An organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones, and advocates the normalized information distance (NID) as a general measure of choice.
Journal ArticleDOI

Algorithms for hierarchical clustering: an overview

TL;DR: A recently developed very efficient (linear time) hierarchical clustering algorithm is described, which can also be viewed as a hierarchical grid‐based algorithm.
Journal ArticleDOI

Quantifiable predictive features define epitope-specific T cell receptor repertoires

TL;DR: The authors develop a distance measure on the space of TCRs that permits clustering and visualization, a robust repertoire diversity metric that accommodates the low number of paired public receptors observed when compared to single-chain analyses, and a distance-based classifier that can assign previously unobserved TCRs to characterized repertoires with robust sensitivity and specificity.
Book ChapterDOI

Evaluating User Privacy in Bitcoin

TL;DR: This research examines the use of pseudonymity in the Bitcoin network, and the role that it plays in the development of trust and confidence in the system.
References
Book

Elements of information theory

TL;DR: The author examines the role of entropy, inequality, and randomness in the design of codes and the construction of codes in the rapidly changing environment.
Journal ArticleDOI

Objective Criteria for the Evaluation of Clustering Methods

TL;DR: This article proposes several criteria which isolate specific aspects of the performance of a method, such as its retrieval of inherent structure, its sensitivity to resampling and the stability of its results in the light of new data.
Journal ArticleDOI

Cluster ensembles --- a knowledge reuse framework for combining multiple partitions

TL;DR: This paper introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings and proposes three effective and efficient techniques for obtaining high-quality combiners (consensus functions).
Journal ArticleDOI

Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data

TL;DR: A new methodology of class discovery and clustering validation tailored to the task of analyzing gene expression data is presented, and in conjunction with resampling techniques, it provides for a method to represent the consensus across multiple runs of a clustering algorithm and to assess the stability of the discovered clusters.
Journal ArticleDOI

Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance

TL;DR: An organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones, and advocates the normalized information distance (NID) as a general measure of choice.