scispace - formally typeset
Search or ask a question
Topic

Document retrieval

About: Document retrieval is a research topic. Over the lifetime, 6821 publications have been published within this topic receiving 214383 citations.


Papers
More filters
Proceedings ArticleDOI
01 Aug 1994
TL;DR: The increasing lengths of documents in full-text collections encourages renewed interest in the ranking and retrieval of document passages, but questions about how passages are defined, how they can be ranked efficiently, and what is their proper role in long, structured documents are raised.
Abstract: The increasing lengths of documents in full-text collections encourages renewed interest in the ranking and retrieval of document passages. Past research showed that evidence from passages can improve retrieval results, but it also raised questions about how passages are defined, how they can be ranked efficiently, and what is their proper role in long, structured documents.

544 citations

Journal ArticleDOI
TL;DR: It is shown that cluster-based retrieval strategies can be devised which are as effective as linear associative retrieval strategies and much more efficient.

535 citations

19 Jan 2001
TL;DR: Experimental results on three standard tasks show that the language model-based algorithms work as well as, or better than, today's top-performing retrieval algorithms.
Abstract: Because of the world wide web, information retrieval systems are now used by millions of untrained users all over the world. The search engines that perform the information retrieval tasks, often retrieve thousands of potentially interesting documents to a query. The documents should be ranked in decreasing order of relevance in order to be useful to the user. This book describes a mathematical model of information retrieval based on the use of statistical language models. The approach uses simple document-based unigram models to compute for each document the probability that it generates the query. This probability is used to rank the documents. The study makes the following research contributions. * The development of a model that integrates term weighting, relevance feedback and structured queries. * The development of a model that supports multiple representations of a request or information need by integrating a statistical translation model. * The development of a model that supports multiple representations of a document, for instance by allowing proximity searches or searches for terms from a particular record field (e.g. a search for terms from the title). * A mathematical interpretation of stop word removal and stemming. * A mathematical interpretation of operators for mandatory terms, wildcards and synonyms. * A practical comparison of a language model-based retrieval system with similar systems that are based on well-established models and term weighting algorithms in a controlled experiment. * The application of the model to cross-language information retrieval and adaptive information filtering, and the evaluation of two prototype systems in a controlled experiment. Experimental results on three standard tasks show that the language model-based algorithms work as well as, or better than, today's top-performing retrieval algorithms. The standard tasks investigated are ad-hoc retrieval (when there are no previously retrieved documents to guide the search), retrospective relevance weighting (find the optimum model for a given set of relevant documents), and ad-hoc retrieval using manually formulated Boolean queries. The application to cross-language retrieval and adaptive filtering shows the practical use of respectively structured queries, and relevance feedback.

528 citations

Book
23 Jul 2010
TL;DR: Information Retrieval offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation, and is a valuable reference for professionals in computer science, computer engineering, and software engineering.
Abstract: Information retrieval is the foundation for modern search engines. This text offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation. The emphasis is on implementation and experimentation; each chapter includes exercises and suggestions for student projects. Wumpusa multiuser open-source information retrieval system developed by one of the authors and available onlineprovides model implementations and a basis for student work. The modular structure of the book allows instructors to use it in a variety of graduate-level courses, including courses taught from a database systems perspective, traditional information retrieval courses with a focus on IR theory, and courses covering the basics of Web retrieval. After an introduction to the basics of information retrieval, the text covers three major topic areasindexing, retrieval, and evaluationin self-contained parts. The final part of the book draws on and extends the general material in the earlier parts, treating such specific applications as parallel search engines, Web search, and XML retrieval. End-of-chapter references point to further reading; exercises range from pencil and paper problems to substantial programming projects. In addition to its classroom use, Information Retrieval will be a valuable reference for professionals in computer science, computer engineering, and software engineering.

523 citations

Proceedings Article
01 Dec 2004
TL;DR: An alternative two-layer model based on exponential family distributions and the semantics of undirected models is proposed, which performs well on document retrieval tasks and provides an elegant solution to searching with keywords.
Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these "exponential family harmoniums" is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.

520 citations


Network Information
Related Topics (5)
Web page
50.3K papers, 975.1K citations
81% related
Metadata
43.9K papers, 642.7K citations
79% related
Recommender system
27.2K papers, 598K citations
79% related
Ontology (information science)
57K papers, 869.1K citations
78% related
Natural language
31.1K papers, 806.8K citations
77% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
20239
202239
2021107
2020130
2019144
2018111