Recent trends in hierarchic document clustering: a critical review

doi:10.1016/0306-4573(88)90027-1

Journal Article

Peter Willett

01 Aug 1988

Information Processing and Management

Vol. 24, Iss. 5, pp. 577-597

TLDR

This review covers algorithms that allow hierarchic agglomerative clustering methods to be implemented for document retrieval. Experimental evidence suggests that nearest neighbor clusters provide a reasonably efficient and effective means of including interdocument similarity information in document retrieval systems.

Abstract:

This article reviews recent research into the use of hierarchic agglomerative clustering methods for document retrieval. After an introduction to the calculation of interdocument similarities and to clustering methods that are appropriate for document clustering, the article discusses algorithms that can be used to allow the implementation of these methods on databases of nontrivial size. The validation of document hierarchies is described using tests based on the theory of random graphs and on empirical characteristics of document collections that are to be clustered. A range of search strategies is available for retrieval from document hierarchies and the results are presented of a series of research projects that have used these strategies to search the clusters resulting from several different types of hierarchic agglomerative clustering method. It is suggested that the complete linkage method is probably the most effective method in terms of retrieval performance; however, it is also difficult to implement in an efficient manner. Other applications of document clustering techniques are discussed briefly; experimental evidence suggests that nearest neighbor clusters, possibly represented as a network model, provide a reasonably efficient and effective means of including interdocument similarity information in document retrieval systems.
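To make the terms in the abstract concrete, here is a minimal sketch of complete linkage agglomerative clustering and nearest neighbor lookup over a handful of toy documents. It is an illustration under stated assumptions, not one of the algorithms reviewed in the article: the documents, the whitespace tokenization, the raw term-frequency vectors, the choice of the cosine coefficient, and the function names `cosine`, `complete_linkage`, and `nearest_neighbors` are all hypothetical. The merge loop is the naive cubic-time approach, which is the kind of cost that motivates the more efficient algorithms the article discusses.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(docs):
    """Pairwise interdocument similarities (assumed: whitespace tokens, raw counts)."""
    vectors = [Counter(d.lower().split()) for d in docs]
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


def complete_linkage(docs):
    """Naive complete linkage clustering; returns the merges in order.

    Each merge is (cluster_a, cluster_b, link), where link is the MINIMUM
    pairwise similarity between members of the two clusters. At every step
    the pair of clusters with the largest such link is merged.
    """
    sim = similarity_matrix(docs)
    clusters = [(i,) for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                link = min(sim[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or link > best[0]:
                    best = (link, x, y)
        link, x, y = best
        merges.append((clusters[x], clusters[y], link))
        merged = tuple(sorted(clusters[x] + clusters[y]))
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges


def nearest_neighbors(docs):
    """For each document, the index of its most similar other document."""
    sim = similarity_matrix(docs)
    n = len(docs)
    return [max((j for j in range(n) if j != i), key=lambda j: sim[i][j])
            for i in range(n)]


# Toy collection (hypothetical, for illustration only).
docs = [
    "hierarchic document clustering",
    "document clustering methods",
    "relevance weighting of search terms",
    "search terms weighting",
]

for left, right, link in complete_linkage(docs):
    print(left, right, round(link, 3))
print(nearest_neighbors(docs))
```

The list returned by `complete_linkage` records the hierarchy bottom-up, so it can be read as a dendrogram. The list returned by `nearest_neighbors` gives the simpler structure mentioned at the end of the abstract, where each document is linked only to its most similar neighbor.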
