scispace - formally typeset
Search or ask a question
Topic

Document retrieval

About: Document retrieval is a research topic. Over the lifetime, 6821 publications have been published within this topic receiving 214383 citations.


Papers
More filters
Journal ArticleDOI
TL;DR: Traditional information retrieval measures of recall and precision at varying numbers of retrieved documents are calculated and used as the bases for statistical comparisons of retrieval effectiveness among the eight search engines.
Abstract: Search engines are essential for finding information on the World Wide Web. We conducted a study to see how effective eight search engines are. Expert searchers sought information on the Web for users who had legitimate needs for information, and these users assessed the relevance of the information retrieved. We calculated traditional information retrieval measures of recall and precision at varying numbers of retrieved documents and used these as the bases for statistical comparisons of retrieval effectiveness among the eight search engines. We also calculated the likelihood that a document retrieved by one search engine was retrieved by other search engines as well.

382 citations

Journal ArticleDOI
TL;DR: Using several simplifications of the vector-space model for text retrieval queries, the authors seek the optimal balance between processing efficiency and retrieval effectiveness as expressed in relevant document rankings.
Abstract: Efficient and effective text retrieval techniques are critical in managing the increasing amount of textual information available in electronic form. Yet text retrieval is a daunting task because it is difficult to extract the semantics of natural language texts. Many problems must be resolved before natural language processing techniques can be effectively applied to a large collection of texts. Most existing text retrieval techniques rely on indexing keywords. Unfortunately, keywords or index terms alone cannot adequately capture the document contents, resulting in poor retrieval performance. Yet keyword indexing is widely used in commercial systems because it is still the most viable way by far to process large amounts of text. Using several simplifications of the vector-space model for text retrieval queries, the authors seek the optimal balance between processing efficiency and retrieval effectiveness as expressed in relevant document rankings.

382 citations

Book
01 Jul 1999
TL;DR: In this paper, the authors bridge the gap between applied mathematics and information retrieval and discuss some of the current problems in information retrieval that may not be familiar to applied mathematicians and computer scientists.
Abstract: A discussion of many of the key design issues for building search engines. It emphasizes the important roles that applied mathematics can play in improving information retrieval. The authors discuss not only important data structures, algorithms and software, but also user-centred issues such as interfaces, manual indexing, and document preparation. The authors bridge the gap between applied mathematics and information retrieval. They discuss some of the current problems in information retrieval that may not be familiar to applied mathematicians and computer scientists and present some of the driving computational methods (SVD, SDD) for automated conceptual indexing. This book introduces topics in a non-technical way and provides insights into common problems found in information retrieval. The more mathematical details are provided in sidebars or offset from the regular text.

381 citations

Proceedings ArticleDOI
07 Jun 2004
TL;DR: Two supervised learning approaches to disambiguate authors in the citations are investigated, one uses the naive Bayes probability model, a generative model; the other uses support vector machines (SVMs) and the vector space representation of citations, a discriminative model.
Abstract: Due to name abbreviations, identical names, name misspellings, and pseudonyms in publications or bibliographies (citations), an author may have multiple names and multiple authors may share the same name. Such name ambiguity affects the performance of document retrieval, Web search, database integration, and may cause improper attribution to authors. We investigate two supervised learning approaches to disambiguate authors in the citations. One approach uses the naive Bayes probability model, a generative model; the other uses support vector machines (SVMs) [V. Vapnik (1995)] and the vector space representation of citations, a discriminative model. Both approaches utilize three types of citation attributes: coauthor names, the title of the paper, and the title of the journal or proceeding. We illustrate these two approaches on two types of data, one collected from the Web, mainly publication lists from homepages, the other collected from the DBLP citation databases.

378 citations

Journal ArticleDOI
TL;DR: This paper compares text retrieval methods intended for office systems with methods from database systems and from information retrieval systems, and examines the most interesting representatives of each class.
Abstract: This paper compares text retrieval methods intended for office systems. The operational requirements of the office environment are discussed, and retrieval methods from database systems and from information retrieval systems are examined. We classify these methods and examine the most interesting representatives of each class. Attempts to speed up retrieval with special purpose hardware are also presented, and issues such as approximate string matching and compression are discussed. A qualitative comparison of the examined methods is presented. The signature file method is discussed in more detail.

375 citations


Network Information
Related Topics (5)
Web page
50.3K papers, 975.1K citations
81% related
Metadata
43.9K papers, 642.7K citations
79% related
Recommender system
27.2K papers, 598K citations
79% related
Ontology (information science)
57K papers, 869.1K citations
78% related
Natural language
31.1K papers, 806.8K citations
77% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
20239
202239
2021107
2020130
2019144
2018111