Topic

Document retrieval

About: Document retrieval is a research topic. Over the lifetime, 6821 publications have been published within this topic receiving 214383 citations.


Papers
01 Jan 1995
TL;DR: There has been a paradigm shift from the system perspective to the user perspective, with a resulting need to design and redesign systems that focus on user needs and that requires analyses of users, their needs, and their habits as discussed by the authors.
Abstract: The author in his chapter refers us back to a previous ARIST chapter by Brenda Dervin and Michael Nilan, in which the dichotomy between the system perspective and the user perspective is shown. He points out that the user perspective shows up in retrieval studies and questions how these studies have affected information retrieval research methods. There has been a paradigm shift from the system perspective to the user perspective, with a resulting need to design and redesign systems that focus on user needs and that requires analyses of users, their needs, and their habits. Two approaches that advocate the user-centered perspective are: (1) the cognitive approach, and (2) the holistic approach. Systems designed from the user-centered perspective would not only serve the intended audience but would further the user-centered perspective of the entire information retrieval discipline.

58 citations

Book ChapterDOI
13 Feb 2006
TL;DR: A novel DTW-based partial matching scheme is employed to take care of morphologically variant words to achieve effective search and retrieval from a large collection of printed document images by matching image features at word-level.
Abstract: This paper presents a system for retrieval of relevant documents from large document image collections. We achieve effective search and retrieval from a large collection of printed document images by matching image features at word-level. For representations of the words, profile-based and shape-based features are employed. A novel DTW-based partial matching scheme is employed to take care of morphologically variant words. This is useful for grouping together similar words during the indexing process. The system supports cross-lingual search using OM-Trans transliteration and a dictionary-based approach. System-level issues for retrieval (e.g., scalability, effective delivery, etc.) are addressed in this paper.

58 citations
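
The word-level matching described in the abstract above relies on dynamic time warping (DTW). As a rough illustration only, the sketch below implements textbook DTW over two 1-D feature sequences; it is not the paper's partial matching scheme, and the function name `dtw_distance` and the use of absolute difference as the local cost are assumptions made for this example.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW cost between two 1-D feature sequences (illustrative only)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


# A stretched copy of a profile aligns at zero cost, which is why DTW
# tolerates words rendered at slightly different widths.
same_shape = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
shifted = dtw_distance([0.0, 0.0], [1.0, 1.0])
```

In a word-image setting, each sequence would be a column-wise profile of a word image; how those profiles are computed is outside this sketch.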

Journal ArticleDOI
Valerie Cross1
01 Feb 1994
TL;DR: A general description of the main components of fuzzy information retrieval is given: document representation, query representation, computer-aided query formulation, document retrieval status, and performance measures.
Abstract: Over the past decade, information retrieval has emerged as an active research area in the application of fuzzy set theory. Fuzzy information retrieval utilizes fuzzy sets to represent documents, membership degrees for query term relevance, fuzzy logical operators to define queries, and fuzzy compatibility measures to assess the retrieval status value of a document. This paper presents an overview of fuzzy relational databases and fuzzy information retrieval. A general description of the main components of fuzzy information retrieval is given: document representation, query representation, computer-aided query formulation, document retrieval status, and performance measures. Examples of areas currently being researched are provided. The relation between fuzzy information retrieval and fuzzy relational databases is examined.

58 citations
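
The components listed in the abstract above (fuzzy document representation, fuzzy logical operators, retrieval status value) can be sketched in a few lines. This is a minimal illustration assuming the common min/max/complement operators for AND/OR/NOT; the paper surveys several alternatives. The documents, membership degrees, and helper names (`mu`, `f_and`, `f_or`, `f_not`, `rsv`) are all hypothetical.

```python
from typing import Dict

# A document is modelled as a fuzzy set: term -> membership degree in [0, 1].
FuzzyDoc = Dict[str, float]


def mu(doc: FuzzyDoc, term: str) -> float:
    """Membership degree of a term in a document (0.0 if absent)."""
    return doc.get(term, 0.0)


def f_and(*values: float) -> float:
    return min(values)  # assumed fuzzy conjunction


def f_or(*values: float) -> float:
    return max(values)  # assumed fuzzy disjunction


def f_not(value: float) -> float:
    return 1.0 - value  # assumed fuzzy complement


# Hypothetical collection with made-up membership degrees.
docs: Dict[str, FuzzyDoc] = {
    "d1": {"fuzzy": 0.9, "retrieval": 0.7, "database": 0.2},
    "d2": {"fuzzy": 0.4, "database": 0.8},
}


def rsv(doc: FuzzyDoc) -> float:
    """Retrieval status value for the query: fuzzy AND (retrieval OR database)."""
    return f_and(mu(doc, "fuzzy"), f_or(mu(doc, "retrieval"), mu(doc, "database")))


# Rank documents by descending retrieval status value.
ranking = sorted(docs, key=lambda name: rsv(docs[name]), reverse=True)
```

Unlike Boolean retrieval, every document receives a graded score, so the result is a ranking instead of a yes/no match set.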

Proceedings ArticleDOI
07 Jun 2005
TL;DR: It turns out that the use of formatting information can lead to quite accurate extraction from general documents, and one can significantly improve search ranking results in document retrieval by using the extracted titles.
Abstract: In this paper, we propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previously, methods have been proposed mainly for title extraction from research papers. It has not been clear whether it could be possible to conduct automatic title extraction from general documents. As a case study, we consider extraction from Office including Word and PowerPoint. In our approach, we annotate titles in sample documents (for Word and PowerPoint respectively) and take them as training data, train machine learning models, and perform title extraction using the trained models. Our method is unique in that we mainly utilize formatting information such as font size as features in the models. It turns out that the use of formatting information can lead to quite accurate extraction from general documents. Precision and recall for title extraction from Word are 0.810 and 0.837 respectively, and precision and recall for title extraction from PowerPoint are 0.875 and 0.895 respectively in an experiment on intranet data. Other important new findings in this work include that we can train models in one domain and apply them to another domain, and more surprisingly we can even train models in one language and apply them to another language. Moreover, we can significantly improve search ranking results in document retrieval by using the extracted titles.

58 citations

Proceedings ArticleDOI
01 Jun 1992
TL;DR: This paper describes an approach to complex object retrieval using a probabilistic inference net model and an implementation of this approach using a loose coupling of an object-oriented database system (IRIS) and a text retrieval system based on inference nets (INQUERY).
Abstract: Document management systems are needed for many business applications. This type of system would combine the functionality of a database system (for describing, storing and maintaining documents with complex structure and relationships) with a text retrieval system (for effective retrieval based on full text). The retrieval model for a document management system is complicated by the variety and complexity of the objects that are represented. In this paper, we describe an approach to complex object retrieval using a probabilistic inference net model, and an implementation of this approach using a loose coupling of an object-oriented database system (IRIS) and a text retrieval system based on inference nets (INQUERY). The resulting system is used to store long, structured documents and can retrieve document components (sections, figures, etc.) based on their contents or the contents of related components. The lessons learnt from the implementation are discussed.

58 citations


Network Information
Related Topics (5)
Web page: 50.3K papers, 975.1K citations, 81% related
Metadata: 43.9K papers, 642.7K citations, 79% related
Recommender system: 27.2K papers, 598K citations, 79% related
Ontology (information science): 57K papers, 869.1K citations, 78% related
Natural language: 31.1K papers, 806.8K citations, 77% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111