On Cluster Validity and the Information Need of Users

Open Access

TLDR

This analysis covers the classical cluster validity measures of Dunn and Davies-Bouldin as well as the newer graph-based measures Λ (weighted edge connectivity) and ρ (expected edge density). On real-world document clustering data, the classical measures are definitely outperformed by the expected edge density ρ.

Abstract:

In the field of information retrieval, clustering algorithms are used to analyze large collections of documents with the objective of forming groups of similar documents. Clustering a document collection is an ambiguous task: a clustering, i.e., a set of document groups, depends on the chosen clustering algorithm as well as on the algorithm's parameter settings. To find the best among several clusterings, it is common practice to evaluate their internal structures with a cluster validity measure. A clustering is considered useful to a user if particular structural properties are well developed. Nevertheless, the presence of certain structural properties may not guarantee usefulness from an information retrieval standpoint, that is, whether or not the document groups found resemble the classification of a human editor. This paper investigates this point: based on already classified document collections, we generate clusterings and compare their predicted quality to their real quality. Our analysis includes the classical cluster validity measures of Dunn and Davies-Bouldin as well as the new graph-based measures Λ (weighted edge connectivity) and ρ (expected edge density). The experiments show interesting results: the classical measures behave in a consistent manner insofar as mediocre and poor clusterings are identified as such. On real-world document clustering data, however, they are definitely outperformed by the expected edge density ρ. This superiority of the graph-based measures can be explained by their independence of cluster forms and distances.
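To make the two classical measures named in the abstract concrete, the following is a minimal sketch of the Dunn index and the Davies-Bouldin index in their common textbook forms, using NumPy and Euclidean distance. It is not the paper's experimental code: the function names, the toy points, and the choice of distance are assumptions made for illustration, and the graph-based measures Λ and ρ are not covered because the abstract does not give their definitions.

```python
import numpy as np


def _pairwise(a, b):
    """Euclidean distance between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)


def dunn_index(points, labels):
    """Smallest between-cluster distance over largest cluster diameter.

    Higher values indicate compact, well-separated clusters.
    """
    points, labels = np.asarray(points, dtype=float), np.asarray(labels)
    clusters = [points[labels == k] for k in np.unique(labels)]
    max_diameter = max(_pairwise(c, c).max() for c in clusters)
    min_separation = min(
        _pairwise(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return float(min_separation / max_diameter)


def davies_bouldin_index(points, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance.

    Lower values indicate compact, well-separated clusters.
    """
    points, labels = np.asarray(points, dtype=float), np.asarray(labels)
    clusters = [points[labels == k] for k in np.unique(labels)]
    centroids = [c.mean(axis=0) for c in clusters]
    # Scatter: average distance of a cluster's points to its centroid.
    scatter = [
        np.linalg.norm(c - m, axis=1).mean() for c, m in zip(clusters, centroids)
    ]
    k = len(clusters)
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return float(np.mean(worst_ratios))


# Hypothetical toy data: two unit squares far apart.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]])
good = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # matches the two squares
bad = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # mixes points from both squares

for name, labels in [("good", good), ("bad", bad)]:
    print(name, dunn_index(X, labels), davies_bouldin_index(X, labels))
```

On this toy data the clustering that matches the two squares receives the higher Dunn index and the lower Davies-Bouldin index. Both measures rely on distances and implicitly favor compact, roughly spherical clusters, which is the dependence on cluster forms and distances that the abstract contrasts with the graph-based measures.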
