Data clustering: 50 years beyond K-means

doi:10.1016/J.PATREC.2009.09.011

Journal Article

Vol. 31, Iss. 8, pp. 651-666

TLDR

This paper provides a brief overview of clustering, summarizes well-known clustering methods, discusses the major challenges and key issues in designing clustering algorithms, and points out some emerging and useful research directions.

Abstract:

Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into a system of ranked taxa: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). Clustering aims to find structure in data and is therefore exploratory in nature. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. Although K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty of designing a general-purpose clustering algorithm and to the ill-posed nature of the clustering problem. We provide a brief overview of clustering, summarize well-known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large-scale data clustering.
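
As a rough illustration of the K-means algorithm the abstract refers to, the sketch below implements the commonly taught Lloyd-style iteration: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points, and repeat until the assignments stop changing. The function names `kmeans` and `squared_distance`, the parameters, the random initialization from input points, and the toy data are all assumptions made for illustration; none of them is taken from the paper.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100, seed=0):
    """Minimal Lloyd-style K-means sketch (illustrative, not optimized).

    Returns (labels, centroids). Initialization samples k distinct input
    points, which is one common choice among many (an assumption here).
    """
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Stop once assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical toy data: two well-separated groups in the plane.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels, toy_centroids = kmeans(toy_points, k=2)
```

Note that the result depends on the initial centroids and on the chosen `k`, both of which the user must supply. This sensitivity is one concrete face of the ill-posed nature of clustering mentioned in the abstract.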
