Journal ArticleDOI
Recent trends in hierarchic document clustering: a critical review
Reads0
Chats0
TLDR
Algorithms that can be used to allow the implementation of hierarchic agglomerative clustering methods for document retrieval, and experimental evidence suggests that nearest neighbor clusters provide a reasonably efficient and effective means of including interdocument similarity information in document retrieval systems.Abstract:
This article reviews recent research into the use of hierarchic agglomerative clustering methods for document retrieval. After an introduction to the calculation of interdocument similarities and to clustering methods that are appropriate for document clustering, the article discusses algorithms that can be used to allow the implementation of these methods on databases of nontrivial size. The validation of document hierarchies is described using tests based on the theory of random graphs and on empirical characteristics of document collections that are to be clustered. A range of search strategies is available for retrieval from document hierarchies and the results are presented of a series of research projects that have used these strategies to search the clusters resulting from several different types of hierarchic agglomerative clustering method. It is suggested that the complete linkage method is probably the most effective method in terms of retrieval performance; however, it is also difficult to implement in an efficient manner. Other applications of document clustering techniques are discussed briefly; experimental evidence suggests that nearest neighbor clusters, possibly represented as a network model, provide a reasonably efficient and effective means of including interdocument similarity information in document retrieval systems.read more
Citations
More filters
Book
Foundations of Statistical Natural Language Processing
TL;DR: This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear and provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations.
Journal ArticleDOI
Machine learning in automated text categorization
TL;DR: This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Patent
System for generation of user profiles for a system for customized electronic identification of desirable objects
TL;DR: In this article, the authors proposed a system that automatically constructs a target profile for each target object in the electronic media based on the frequency with which each word appears in an article relative to its overall frequency of use in all articles, as well as a "target profile interest summary" for each user.
Journal Article
Data Cleaning: Problems and Current Approaches.
Erhard Rahm,Hong Hai Do +1 more
TL;DR: This work classifies data quality problems that are addressed by data cleaning and provides an overview of the main solution approaches and discusses current tool support for data cleaning.
Journal ArticleDOI
Scatter/Gather: a cluster-based approach to browsing large document collections
TL;DR: A document browsing technique that employs docum-ent clustering as its primary operation is presented and a fast (linear time) clustering algorithm is presented that provides a powerful new access paradigm.
References
More filters
Book
Introduction to Modern Information Retrieval
Gerard Salton,Michael J. McGill +1 more
TL;DR: Reading is a need and a hobby at once and this condition is the on that will make you feel that you must read.
Book
Cluster Analysis
TL;DR: This fourth edition of the highly successful Cluster Analysis represents a thorough revision of the third edition and covers new and developing areas such as classification likelihood and neural networks for clustering.
Book
Relevance weighting of search terms
TL;DR: This paper examines statistical techniques for exploiting relevance information to weight search terms using information about the distribution of index terms in documents in general and shows that specific weighted search methods are implied by a general probabilistic theory of retrieval.
Journal ArticleDOI
Relevance weighting of search terms
TL;DR: In this article, a series of relevance weighting functions is derived and is justified by theoretical considerations, in particular, it is shown that specific weighted search methods are implied by a general probabilistic theory of retrieval.