On Cluster Validity and the Information Need of Users

TLDR
The analysis covers the classical cluster validity measures of Dunn and Davies-Bouldin as well as the new graph-based measures Λ (weighted edge connectivity) and ρ (expected edge density). On real-world document clustering data, the classical measures are clearly outperformed by the expected edge density ρ.
Abstract
In the field of information retrieval, clustering algorithms are used to analyze large collections of documents with the objective of forming groups of similar documents. Clustering a document collection is an ambiguous task: a clustering, i.e. a set of document groups, depends on the chosen clustering algorithm as well as on the algorithm's parameter settings. To find the best among several clusterings, it is common practice to evaluate their internal structures with a cluster validity measure. A clustering is considered useful to a user if particular structural properties are well developed. Nevertheless, the presence of certain structural properties may not guarantee usefulness from an information retrieval standpoint, for example, whether the found document groups resemble the classification of a human editor. This paper investigates that point: based on already classified document collections, we generate clusterings and compare their predicted quality to their real quality. Our analysis includes the classical cluster validity measures of Dunn and Davies-Bouldin as well as the new graph-based measures Λ (weighted edge connectivity) and ρ (expected edge density). The experiments show interesting results: the classical measures behave consistently insofar as mediocre and poor clusterings are identified as such. On real-world document clustering data, however, they are clearly outperformed by the expected edge density ρ. This superiority of the graph-based measures can be explained by their independence of cluster forms and distances.
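
To make the two classical measures named in the abstract concrete, the following is a minimal sketch of the Dunn index and the Davies-Bouldin index in their common textbook form. It is not taken from the paper: the function names, the Euclidean distance, the centroid-based scatter, and the four-point toy data are all assumptions chosen for illustration, and the graph-based measures Λ and ρ are not covered.

```python
import math
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def dunn_index(clusters):
    """Smallest between-cluster distance divided by the largest cluster
    diameter. Higher is better. Assumes at least two clusters and at
    least one cluster with two distinct points."""
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    max_diameter = max(
        math.dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return min_separation / max_diameter


def davies_bouldin_index(clusters):
    """Average, over all clusters, of the worst ratio of summed
    within-cluster scatter to centroid distance. Lower is better.
    Assumes at least two clusters with distinct centroids."""
    centroids = [centroid(c) for c in clusters]
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centroids)
    ]
    k = len(clusters)
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst_ratios) / k


# Hypothetical toy data: the same four points grouped in two different ways.
compact = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]    # tight, well-separated groups
stretched = [[(0, 0), (5, 0)], [(0, 1), (5, 1)]]  # elongated, interleaved groups

for name, clustering in [("compact", compact), ("stretched", stretched)]:
    print(name, dunn_index(clustering), davies_bouldin_index(clustering))
```

Both measures depend only on distances and centroids, so in this sketch they favor the compact grouping. That dependence on cluster form and distance is the property the abstract contrasts with the graph-based measures.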

Citations
Journal Article

A survey of Web clustering engines

TL;DR: The issues that must be addressed in the development of a Web clustering engine, including the acquisition and preprocessing of search results and their clustering and visualization, are discussed, and the role played by the quality of the cluster labels is emphasized.
Journal Article

Nonlinear dimension reduction and clustering by Minimum Curvilinearity unfold neuropathic pain and tissue embryological classes

TL;DR: Improvements in tissue embryological classification refine results obtained in an earlier study, and suggest a possible reinterpretation of skin attribution as mesodermal.
Book Chapter

An approach to clustering abstracts

TL;DR: The preliminary experiments show that abstracts cannot be clustered with the same quality as full texts, though the achieved quality is adequate for many applications. Accordingly, the paper discusses Makagonov's proposal that digital libraries should provide document images of the full texts of papers (and not only abstracts) for open access via the Internet, in order to help in search, classification, clustering, selection, and proper referencing of the papers.
Journal Article

An efficient Particle Swarm Optimization approach to cluster short texts

TL;DR: Experimental results with corpora containing scientific abstracts, news, and short legal documents obtained from the Web show that CLUDIPSO* is an effective clustering method for short-text corpora of small and medium size.
References

Some methods for classification and analysis of multivariate observations

TL;DR: This paper describes the k-means procedure, a generalization of the ordinary sample mean, which partitions an N-dimensional population into k sets on the basis of a sample and is shown to give partitions that are reasonably efficient in the sense of within-class variance.
Book

Cluster Analysis

TL;DR: This fourth edition of the highly successful Cluster Analysis represents a thorough revision of the third edition and covers new and developing areas such as classification likelihood and neural networks for clustering.
Book

Finding Groups in Data

Journal Article

An algorithm for suffix stripping

TL;DR: An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL, and performs slightly better than a much more elaborate system with which it has been compared.
Journal Article

A Cluster Separation Measure

TL;DR: A measure is presented that indicates the similarity of clusters, which are assumed to have a data density that is a decreasing function of distance from a vector characteristic of the cluster. The measure can be used to infer the appropriateness of data partitions.