Proceedings ArticleDOI

Information theoretic measures for clusterings comparison: is a correction for chance necessary?

TL;DR
This paper derives the analytical formula for the expected mutual information value between a pair of clusterings, and proposes the adjusted version for several popular information theoretic based measures.
Abstract
Information theoretic based measures form a fundamental class of similarity measures for comparing clusterings, besides the classes of pair-counting based and set-matching based measures. In this paper, we discuss the necessity of correction for chance for information theoretic based measures for clusterings comparison. We observe that the baseline for such measures, i.e., the average value between random partitions of a data set, does not take on a constant value, and tends to show larger variation when the ratio between the number of data points and the number of clusters is small. A similar effect appears in some non-information theoretic based measures, such as the well-known Rand Index. Assuming a hypergeometric model of randomness, we derive the analytical formula for the expected mutual information value between a pair of clusterings, and then propose adjusted versions of several popular information theoretic based measures. Some examples are given to demonstrate the need for and usefulness of the adjusted measures.
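The correction described in the abstract can be sketched in code. The snippet below is an illustrative Python implementation of adjusted mutual information (AMI): it builds the contingency table of two labelings, computes the observed mutual information, sums the expected mutual information over all feasible cell counts under a hypergeometric model of randomness, and normalizes so that the expected value under random partitioning maps to zero. The function name is hypothetical, and the max-entropy normalization used here is only one of several variants discussed in the literature.

```python
import math
from collections import Counter

def adjusted_mutual_information(labels_u, labels_v):
    """Sketch of adjusted mutual information under a hypergeometric
    model of randomness (illustrative, max-entropy normalization)."""
    n = len(labels_u)
    # Contingency table cells and cluster-size marginals.
    pairs = Counter(zip(labels_u, labels_v))
    a = Counter(labels_u)   # row sums (sizes of clusters in U)
    b = Counter(labels_v)   # column sums (sizes of clusters in V)

    # Observed mutual information (natural log).
    mi = sum((nij / n) * math.log(n * nij / (a[u] * b[v]))
             for (u, v), nij in pairs.items())

    # Entropies of the two clusterings.
    h_u = -sum((ai / n) * math.log(ai / n) for ai in a.values())
    h_v = -sum((bj / n) * math.log(bj / n) for bj in b.values())

    # Expected MI: for each (row, column) marginal pair, sum the MI
    # contribution of every feasible cell count nij, weighted by its
    # hypergeometric probability (computed via log-gamma for stability).
    emi = 0.0
    lognf = math.lgamma(n + 1)
    for ai in a.values():
        for bj in b.values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                term = (nij / n) * math.log(n * nij / (ai * bj))
                logp = (math.lgamma(ai + 1) + math.lgamma(bj + 1)
                        + math.lgamma(n - ai + 1) + math.lgamma(n - bj + 1)
                        - lognf - math.lgamma(nij + 1)
                        - math.lgamma(ai - nij + 1)
                        - math.lgamma(bj - nij + 1)
                        - math.lgamma(n - ai - bj + nij + 1))
                emi += term * math.exp(logp)

    # Adjusted MI: identical clusterings give 1; the expected value
    # between random partitions is re-centered to 0.
    denom = max(h_u, h_v) - emi
    return (mi - emi) / denom if denom else 1.0
```

Identical clusterings (up to label permutation) score 1, while two independent partitions score near or below 0 rather than the positive baseline that unadjusted mutual information would report.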


Citations
Book

Machine Learning : A Probabilistic Perspective

TL;DR: This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach, and is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
Journal ArticleDOI

Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance

TL;DR: An organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones, and advocates the normalized information distance (NID) as a general measure of choice.
Journal ArticleDOI

Algorithms for hierarchical clustering: an overview

TL;DR: A recently developed very efficient (linear time) hierarchical clustering algorithm is described, which can also be viewed as a hierarchical grid‐based algorithm.
Journal ArticleDOI

Quantifiable predictive features define epitope-specific T cell receptor repertoires

TL;DR: The authors develop a distance measure on the space of TCRs that permits clustering and visualization, a robust repertoire diversity metric that accommodates the low number of paired public receptors observed when compared to single-chain analyses, and a distance-based classifier that can assign previously unobserved TCRs to characterized repertoires with robust sensitivity and specificity.
Book ChapterDOI

Evaluating User Privacy in Bitcoin

TL;DR: This research examines the use of pseudonymity in the Bitcoin network, and the role that it plays in the development of trust and confidence in the system.
References
Book

Elements of information theory

TL;DR: The author examines the role of entropy, inequality, and randomness in the design of codes and the construction of codes in the rapidly changing environment.
Journal ArticleDOI

Objective Criteria for the Evaluation of Clustering Methods

TL;DR: This article proposes several criteria which isolate specific aspects of the performance of a method, such as its retrieval of inherent structure, its sensitivity to resampling and the stability of its results in the light of new data.
Journal ArticleDOI

Cluster ensembles --- a knowledge reuse framework for combining multiple partitions

TL;DR: This paper introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings and proposes three effective and efficient techniques for obtaining high-quality combiners (consensus functions).
Journal ArticleDOI

Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data

TL;DR: A new methodology of class discovery and clustering validation tailored to the task of analyzing gene expression data is presented, and in conjunction with resampling techniques, it provides for a method to represent the consensus across multiple runs of a clustering algorithm and to assess the stability of the discovered clusters.
Journal ArticleDOI

Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance

TL;DR: An organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones, and advocates the normalized information distance (NID) as a general measure of choice.