Topic
Document retrieval
About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have been published within this topic, receiving 214,383 citations.
Papers
••
TL;DR: Traditional information retrieval measures of recall and precision at varying numbers of retrieved documents are calculated and used as the bases for statistical comparisons of retrieval effectiveness among the eight search engines.
Abstract: Search engines are essential for finding information on the World Wide Web. We conducted a study to see how effective eight search engines are. Expert searchers sought information on the Web for users who had legitimate needs for information, and these users assessed the relevance of the information retrieved. We calculated traditional information retrieval measures of recall and precision at varying numbers of retrieved documents and used these as the bases for statistical comparisons of retrieval effectiveness among the eight search engines. We also calculated the likelihood that a document retrieved by one search engine was retrieved by other search engines as well.
382 citations
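The precision and recall measures "at varying numbers of retrieved documents" mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the full set of relevant documents for a query is known, which may not match how the study estimated recall on the Web; the function name, document IDs, and relevance judgments are all hypothetical.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Return (precision@k, recall@k) for a single query.

    ranked_ids   -- document IDs in the order the engine returned them
    relevant_ids -- set of IDs judged relevant (assumed fully known)
    k            -- cutoff: how many retrieved documents to consider
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k  # one common convention: divide by the cutoff itself
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical engine output and user relevance judgments.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d8"}

for k in (1, 3, 5):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

Computing these pairs at the same cutoffs for each engine gives per-query values that could then feed a statistical comparison of the kind the abstract describes.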
••
TL;DR: Using several simplifications of the vector-space model for text retrieval queries, the authors seek the optimal balance between processing efficiency and retrieval effectiveness as expressed in relevant document rankings.
Abstract: Efficient and effective text retrieval techniques are critical in managing the increasing amount of textual information available in electronic form. Yet text retrieval is a daunting task because it is difficult to extract the semantics of natural language texts. Many problems must be resolved before natural language processing techniques can be effectively applied to a large collection of texts. Most existing text retrieval techniques rely on indexing keywords. Unfortunately, keywords or index terms alone cannot adequately capture the document contents, resulting in poor retrieval performance. Yet keyword indexing is widely used in commercial systems because it is still the most viable way by far to process large amounts of text. Using several simplifications of the vector-space model for text retrieval queries, the authors seek the optimal balance between processing efficiency and retrieval effectiveness as expressed in relevant document rankings.
382 citations
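For readers unfamiliar with the vector-space model named in the abstract above, the following is a rough sketch of the textbook version: TF-IDF term weights and cosine-similarity ranking. It is not the authors' simplified variants, which the abstract does not specify; the tokenizer, weighting formula, and sample documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_tfidf(docs):
    """Return (idf, vectors): idf per term and one sparse TF-IDF dict per doc."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(toks).items()}
        for toks in tokenized
    ]
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    idf, vectors = build_tfidf(docs)
    query_tf = Counter(tokenize(query))
    # Query terms that never occur in the collection are ignored.
    q_vec = {t: tf * idf[t] for t, tf in query_tf.items() if t in idf}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical three-document collection.
docs = [
    "keyword indexing for text retrieval",
    "compression of large text collections",
    "vector space ranking of retrieved documents",
]
for score, index in rank("text retrieval ranking", docs):
    print(f"{score:.3f}  {docs[index]}")
```

The efficiency-versus-effectiveness trade-off the abstract refers to arises in steps like these: dropping the length normalization or the IDF factor makes scoring cheaper but can change the resulting document ranking.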
•
01 Jul 1999
TL;DR: In this paper, the authors bridge the gap between applied mathematics and information retrieval and discuss some of the current problems in information retrieval that may not be familiar to applied mathematicians and computer scientists.
Abstract: A discussion of many of the key design issues for building search engines. It emphasizes the important roles that applied mathematics can play in improving information retrieval. The authors discuss not only important data structures, algorithms and software, but also user-centred issues such as interfaces, manual indexing, and document preparation. The authors bridge the gap between applied mathematics and information retrieval. They discuss some of the current problems in information retrieval that may not be familiar to applied mathematicians and computer scientists and present some of the driving computational methods (SVD, SDD) for automated conceptual indexing. This book introduces topics in a non-technical way and provides insights into common problems found in information retrieval. The more mathematical details are provided in sidebars or offset from the regular text.
381 citations
••
07 Jun 2004
TL;DR: Two supervised learning approaches to disambiguating authors in citations are investigated: one uses the naive Bayes probability model, a generative model; the other uses support vector machines (SVMs) and the vector space representation of citations, a discriminative model.
Abstract: Due to name abbreviations, identical names, name misspellings, and pseudonyms in publications or bibliographies (citations), an author may have multiple names and multiple authors may share the same name. Such name ambiguity affects the performance of document retrieval, Web search, database integration, and may cause improper attribution to authors. We investigate two supervised learning approaches to disambiguate authors in the citations. One approach uses the naive Bayes probability model, a generative model; the other uses support vector machines (SVMs) [V. Vapnik (1995)] and the vector space representation of citations, a discriminative model. Both approaches utilize three types of citation attributes: coauthor names, the title of the paper, and the title of the journal or proceeding. We illustrate these two approaches on two types of data, one collected from the Web, mainly publication lists from homepages, the other collected from the DBLP citation databases.
378 citations
••
TL;DR: This paper compares text retrieval methods intended for office systems with methods from database systems and from information retrieval systems, and examines the most interesting representatives of each class.
Abstract: This paper compares text retrieval methods intended for office systems. The operational requirements of the office environment are discussed, and retrieval methods from database systems and from information retrieval systems are examined. We classify these methods and examine the most interesting representatives of each class. Attempts to speed up retrieval with special purpose hardware are also presented, and issues such as approximate string matching and compression are discussed. A qualitative comparison of the examined methods is presented. The signature file method is discussed in more detail.
375 citations
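The signature file method singled out in the abstract above can be sketched roughly as follows. This uses superimposed coding, one common signature scheme, and is not necessarily the variant the paper analyzes; the signature width, the number of bits per word, the hash function, and the sample documents are all assumptions made for illustration.

```python
import hashlib

SIG_BITS = 64       # assumed signature width in bits
BITS_PER_WORD = 3   # assumed number of bit positions set per word


def word_signature(word):
    """Map a word to a bit mask with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    mask = 0
    for i in range(BITS_PER_WORD):
        # Use successive 2-byte slices of the digest as bit positions.
        position = int.from_bytes(digest[2 * i:2 * i + 2], "big") % SIG_BITS
        mask |= 1 << position
    return mask


def document_signature(text):
    """Superimpose (bitwise OR) the signatures of all words in the text."""
    signature = 0
    for word in text.split():
        signature |= word_signature(word)
    return signature


def search(query, docs):
    """Return indices of documents that contain every query word."""
    q_sig = document_signature(query)
    q_words = set(query.lower().split())
    # In a real system the signatures would be precomputed and stored.
    signatures = [document_signature(d) for d in docs]
    results = []
    for i, (doc, sig) in enumerate(zip(docs, signatures)):
        if (sig & q_sig) == q_sig:  # cheap filter: all query bits present
            # Signatures can collide, so verify against the actual text.
            if q_words <= set(doc.lower().split()):
                results.append(i)
    return results


# Hypothetical documents.
docs = [
    "signature files speed up text retrieval",
    "approximate string matching and compression",
    "special purpose hardware for retrieval",
]
print(search("text retrieval", docs))
print(search("retrieval", docs))
```

The bitwise filter never misses a matching document, but it can admit non-matching ones whose bits happen to cover the query's; the final set comparison removes those false matches at the cost of reading the candidate text.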