Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6821 publications have appeared within this topic, receiving 214383 citations.


Papers
Book
01 Oct 1996
TL;DR: The 1994 NIST Text Retrieval Conference was co-sponsored by the National Inst. of Standards & Technology (NIST) and the Advanced Research Projects Agency (ARPA).
Abstract: From the Publisher: Held in Gaithersburg, MD, November 2-4, 1994. The conference was co-sponsored by the National Inst. of Standards & Technology (NIST) & the Advanced Research Projects Agency (ARPA) & was attended by 150 people involved in the 32 participating groups. Evaluates new technologies in text retrieval. Includes 34 papers: indexing structures, fragmentation schemes, probabilistic retrieval, latent semantic indexing, interactive document retrieval, & much more. Numerous graphs, tables & charts.

76 citations

Proceedings ArticleDOI
01 Jul 2000
TL;DR: In this article, the effects of out-of-vocabulary (OOV) items in spoken document retrieval (SDR) were investigated and the use of a parallel corpus for query and document expansion was found to be especially beneficial.
Abstract: The effects of out-of-vocabulary (OOV) items in spoken document retrieval (SDR) are investigated. Several sets of transcriptions were created for the TREC-8 SDR task using a speech recognition system with varying vocabulary sizes and OOV rates, and the relative retrieval performance was measured. The effects of OOV terms on a simple baseline IR system and on more sophisticated retrieval systems are described. The use of a parallel corpus for query and document expansion is found to be especially beneficial, and with this data set, good retrieval performance can be achieved even for fairly high OOV rates.

76 citations

Journal ArticleDOI
TL;DR: This survey paper presents a critical study of different document layout analysis techniques and discusses comprehensively the different phases of the DLA algorithms based on a general framework that is formed as an outcome of reviewing the research in the field.
Abstract: Document layout analysis (DLA) is a preprocessing step of document understanding systems. It is responsible for detecting and annotating the physical structure of documents. DLA has several important applications such as document retrieval, content categorization, text recognition, and the like. The objective of DLA is to ease the subsequent analysis/recognition phases by identifying the homogeneous blocks of a document and by determining their relationships. The DLA pipeline consists of several phases that can vary among DLA methods, depending on the documents’ layouts and final analysis objectives. In this regard, a universal DLA algorithm that fits all types of document layouts or that satisfies all analysis objectives has not yet been developed. In this survey paper, we present a critical study of different document layout analysis techniques. The study highlights the motivational reasons for pursuing DLA and discusses comprehensively the different phases of the DLA algorithms based on a general framework that is formed as an outcome of reviewing the research in the field. The DLA framework consists of preprocessing, layout analysis strategies, post-processing, and performance evaluation phases. Overall, the article delivers an essential baseline for pursuing further research in document layout analysis.

76 citations

Book
01 Jan 1974

76 citations

DissertationDOI
27 Jun 2008
TL;DR: This thesis presents a number of task-specific search solutions and tries to set them into more generic frameworks, taking a look at the three areas (1) context adaptivity of search, (2) efficient XML retrieval, and (3) entity ranking.
Abstract: Text retrieval has been an active area of research for decades. Several issues have been studied over the entire period, like the development of statistical models for the estimation of relevance, or the challenge to keep retrieval tasks efficient with ever growing text collections. Especially in the last decade, we have also seen a diversification of retrieval tasks. Passage or XML retrieval systems allow a more focused search. Question answering or expert search systems do not even return a ranked list of text units, but for instance persons with expertise on a given topic. The sketched situation forms the starting point of this thesis, which presents a number of task-specific search solutions and tries to set them into more generic frameworks. In particular, we take a look at the three areas (1) context adaptivity of search, (2) efficient XML retrieval, and (3) entity ranking. In the first case, we show how different types of context information can be incorporated in the retrieval of documents. When users are searching for information, the search task is typically part of a wider working process. This search context, however, is often not reflected by the few search keywords stated to the retrieval system, though it can contain valuable information for query refinement. We address with this work two research questions related to the aim of developing context-aware retrieval systems. First, we show how already available information about the user’s context can be employed effectively to gain highly precise search results. Second, we investigate how such meta-data about the search context can be gathered. The proposed “query profiles” have a central role in the query refinement process. They automatically detect necessary context information and help the user to explicitly express context-dependent search constraints. The effectiveness of the approach is tested with retrieval experiments on newspaper data. When documents are not regarded as a simple sequence of words, but their content is structured in a machine readable form, it is attractive to try to develop retrieval systems that make use of the additional structure information. Structured retrieval first asks for the design of a suitable language that enables the user to express queries on content and structure. We investigate here existing query languages, whether and how they support the basic needs of structured querying. However, our main focus lies on the efficiency of structured retrieval systems. Conventional inverted indices for document retrieval systems are not suitable for maintaining structure indices. We identify base operations involved in the execution of structured queries and show how they can be supported by new indices and algorithms on a database system. Efficient query processing has to be concerned with the optimization of query plans as well. We investigate low-level query plans of physical database operators for the execution of simple query patterns. Furthermore, it is demonstrated how complex queries benefit from higher level query optimization. New search tasks and interfaces for the presentation of search results, like faceted search applications, question answering, expert search, and automatic timeline construction, come with the need to rank entities instead of documents. By entities we mean unique (named) existences, such as persons, organizations or dates. Modern language processing tools are able to automatically detect and categorize named entities in large text collections. In order to estimate their relevance to a given search topic, we develop retrieval models for entities which are based on the relevance of texts that mention the entity. A graph-based relevance propagation framework is introduced for this purpose that makes it possible to derive the relevance of entities. Several options for the modeling of entity containment graphs and different relevance propagation approaches are tested, demonstrating the usefulness of the graph-based ranking framework.

76 citations


Network Information
Related Topics (5)
Web page: 50.3K papers, 975.1K citations, 81% related
Metadata: 43.9K papers, 642.7K citations, 79% related
Recommender system: 27.2K papers, 598K citations, 79% related
Ontology (information science): 57K papers, 869.1K citations, 78% related
Natural language: 31.1K papers, 806.8K citations, 77% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111