scispace - formally typeset
Open AccessJournal ArticleDOI

Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

Reads0
Chats0
TLDR
A generative mixture-model approach to clustering directional data based on the von Mises-Fisher distribution, which arises naturally for data distributed on the unit hypersphere, and derives and analyzes two variants of the Expectation Maximization framework for estimating the mean and concentration parameters of this mixture.
Abstract
Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces.

read more

Content maybe subject to copyright    Report

Citations
More filters
BookDOI

Semi-Supervised Learning

TL;DR: Semi-supervised learning (SSL) as discussed by the authors is the middle ground between supervised learning (in which all training examples are labeled) and unsupervised training (where no label data are given).
Journal ArticleDOI

Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance

TL;DR: An organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones, and advocates the normalized information distance (NID) as a general measure of choice.
Journal ArticleDOI

topicmodels: An R Package for Fitting Topic Models

TL;DR: The R package topicmodels provides basic infrastructure for fitting topic models based on data structures from the text mining package tm to estimate the similarity between documents as well as between a set of specified keywords using an additional layer of latent variables.
Proceedings ArticleDOI

A web-based kernel function for measuring the similarity of short text snippets

TL;DR: This paper defines a similarity kernel function, mathematically analyze some of its properties, and provides examples of its efficacy, and shows the use of this kernel function in a large-scale system for suggesting related queries to search engine users.
Proceedings ArticleDOI

Information theoretic measures for clusterings comparison: is a correction for chance necessary?

TL;DR: This paper derives the analytical formula for the expected mutual information value between a pair of clusterings, and proposes the adjusted version for several popular information theoretic based measures.
References
More filters
Book

Elements of information theory

TL;DR: The author examines the role of entropy, inequality, and randomness in the design of codes and the construction of codes in the rapidly changing environment.
Journal ArticleDOI

Cluster analysis and display of genome-wide expression patterns

TL;DR: A system of cluster analysis for genome-wide expression data from DNA microarray hybridization is described that uses standard statistical algorithms to arrange genes according to similarity in pattern of gene expression, finding in the budding yeast Saccharomyces cerevisiae that clustering gene expression data groups together efficiently genes of known similar function.
Book

A treatise on the theory of Bessel functions

G. N. Watson
TL;DR: The tabulation of Bessel functions can be found in this paper, where the authors present a comprehensive survey of the Bessel coefficients before and after 1826, as well as their extensions.