Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have appeared within this topic, receiving 214,383 citations.


Papers
Proceedings Article
01 Jan 1992
TL;DR: In this paper, an adaptive method that uses genetic algorithms to modify user queries based on relevance judgments was adapted for the Text Retrieval Conference (TREC) and shown to be applicable to large text collections, with the genetically modified queries presenting more relevant documents to users.
Abstract: We have been developing an adaptive method using genetic algorithms to modify user queries, based on relevance judgments. This algorithm was adapted for the Text Retrieval Conference (TREC). The method is shown to be applicable to large text collections, where more relevant documents are presented to users in the genetic modification. The algorithm also shows some interesting phenomena, such as parallel searching. Further studies are planned to adjust the system parameters to improve its effectiveness.

59 citations
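
The query-modification idea in the abstract above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' system: the toy collection, the representation of a query as one weight per term, the fitness function (mean score of judged-relevant documents minus mean score of the others), and the one-point crossover with single-weight mutation.

```python
import random

# Toy collection: each document is a set of terms (hypothetical data).
DOCS = {
    "d1": {"genetic", "algorithm", "query"},
    "d2": {"query", "retrieval", "relevance"},
    "d3": {"cooking", "recipe"},
    "d4": {"genetic", "retrieval", "relevance"},
}
TERMS = sorted(set().union(*DOCS.values()))
RELEVANT = {"d2", "d4"}  # user relevance judgments (assumed)

def score(weights, doc_terms):
    """Sum of query weights for the terms present in the document."""
    return sum(w for t, w in zip(TERMS, weights) if t in doc_terms)

def fitness(weights):
    """Mean score of relevant documents minus mean score of the rest."""
    rel = [score(weights, DOCS[d]) for d in DOCS if d in RELEVANT]
    non = [score(weights, DOCS[d]) for d in DOCS if d not in RELEVANT]
    return sum(rel) / len(rel) - sum(non) / len(non)

def evolve(generations=30, pop_size=20, seed=0):
    """Evolve a population of query weight vectors and return the fittest."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in TERMS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(TERMS))    # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(len(TERMS))         # mutate a single weight
            child[i] = min(1.0, max(0.0, child[i] + rng.uniform(-0.2, 0.2)))
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
ranking = sorted(DOCS, key=lambda d: score(best, DOCS[d]), reverse=True)
```

Because the fitter half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next in this sketch.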

Journal Article
TL;DR: A data model and an execution model are presented that allow for efficient storage and retrieval of XML documents in a relational database and that provide clear and intuitive semantics, which facilitates the definition of a declarative query algebra.
Abstract: In this paper, we present a data and an execution model that allow for efficient storage and retrieval of XML documents in a relational database. The data model is strictly based on the notion of binary associations: by decomposing XML documents into small, flexible and semantically homogeneous units we are able to exploit the performance potential of vertical fragmentation. Moreover, our approach provides clear and intuitive semantics, which facilitates the definition of a declarative query algebra. Our experimental results with large collections of XML documents demonstrate the effectiveness of the techniques proposed.

59 citations
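
As a rough illustration of decomposing XML into binary associations, the sketch below groups pairs into one relation per element path. The path-based relation names, the integer object identifiers, and the `decompose` helper are assumptions made for this example and are not taken from the paper; text that follows a child element (tail text) is ignored for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import count

def decompose(xml_text):
    """Split an XML document into binary relations keyed by element path."""
    relations = defaultdict(list)   # path -> list of (left, right) pairs
    oids = count(1)                 # fresh object identifiers

    def visit(elem, path, oid):
        # Attribute values become (owner oid, value) pairs.
        for name, value in elem.attrib.items():
            relations[f"{path}/@{name}"].append((oid, value))
        # Leading character data becomes an (owner oid, text) pair.
        text = (elem.text or "").strip()
        if text:
            relations[f"{path}/text()"].append((oid, text))
        # Each parent-child edge becomes a (parent oid, child oid) pair.
        for child in elem:
            child_oid = next(oids)
            child_path = f"{path}/{child.tag}"
            relations[child_path].append((oid, child_oid))
            visit(child, child_path, child_oid)

    root = ET.fromstring(xml_text)
    visit(root, root.tag, next(oids))
    return dict(relations)

doc = (
    "<bib>"
    "<article key='a1'><title>XML storage</title></article>"
    "<article key='a2'><title>Query algebra</title></article>"
    "</bib>"
)
rels = decompose(doc)
titles = [text for _, text in rels["bib/article/title/text()"]]
```

Each relation holds pairs of a single kind, so a query that touches only titles reads only the `bib/article/title/text()` relation, which is the intuition behind vertical fragmentation.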

Journal Article
TL;DR: The usefulness of the features derived from interval coding is demonstrated in a hidden Markov model based page layout classification system that is trainable and extendible.
Abstract: This paper describes features and methods for document image comparison and classification at the spatial layout level. The methods are useful for visual similarity based document retrieval as well as fast algorithms for initial document type classification without OCR. A novel feature set called interval encoding is introduced to capture elements of spatial layout. This feature set encodes region layout information in fixed-length vectors by capturing structural characteristics of the image. These fixed-length vectors are then compared to each other through a Manhattan distance computation for fast page layout comparison. The paper describes experiments and results to rank-order a set of document pages in terms of their layout similarity to a test document. We also demonstrate the usefulness of the features derived from interval coding in a hidden Markov model based page layout classification system that is trainable and extendible. The methods described in the paper can be used in various document retrieval tasks including visual similarity based retrieval, categorization and information extraction.

59 citations
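
The comparison step described in the abstract above, ranking pages by Manhattan distance between fixed-length vectors, can be sketched as follows. The vectors are made-up placeholders; how interval encoding actually produces them is not reproduced here.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))

def rank_by_layout(query_vec, pages):
    """Return page names ordered from most to least similar to the query."""
    return sorted(pages, key=lambda name: manhattan(query_vec, pages[name]))

# Hypothetical fixed-length layout vectors for three pages.
pages = {
    "letter": [0, 2, 4, 4, 1],
    "form": [3, 3, 0, 1, 5],
    "memo": [0, 2, 3, 4, 2],
}
query = [0, 2, 4, 3, 1]
ranking = rank_by_layout(query, pages)  # ['letter', 'memo', 'form']
```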

Proceedings Article
15 Aug 2005
TL;DR: A novel approach to recognizing and retrieving handwritten manuscripts, based upon word image classification as a key step, is proposed; its classifier is trained on a corpus of word images that have been resized and sampled at a pyramid of resolutions.
Abstract: Recognition and retrieval of historical handwritten material is an unsolved problem. We propose a novel approach to recognizing and retrieving handwritten manuscripts, based upon word image classification as a key step. Decision trees with normalized pixels as features form the basis of a highly accurate AdaBoost classifier, trained on a corpus of word images that have been resized and sampled at a pyramid of resolutions. To stem problems from the highly skewed distribution of class frequencies, word classes with very few training samples are augmented with stochastically altered versions of the originals. This increases recognition performance substantially. On a standard corpus of 20 pages of handwritten material from the George Washington collection, recognition performance shows a substantial improvement over previously published results (75% vs 65%). Following word recognition, retrieval is done using a language model over the recognized words. Retrieval performance also shows substantially improved results over previously published results on this database. Recognition/retrieval results on a more challenging database of 100 pages from the George Washington collection are also presented.

59 citations
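
The augmentation step mentioned in the abstract above (padding rare word classes with stochastically altered copies) might look roughly like the sketch below. The uniform pixel noise, the `min_count` threshold, and the sample data are assumptions for illustration; the abstract does not specify how the originals are altered.

```python
import random
from collections import Counter

def jitter(pixels, rng, noise=0.05):
    """Return a copy of a normalized pixel vector with small random noise."""
    return [min(1.0, max(0.0, p + rng.uniform(-noise, noise))) for p in pixels]

def augment_rare_classes(samples, min_count, seed=0):
    """Pad every class with fewer than min_count samples using jittered copies.

    samples is a list of (pixels, label) pairs; the originals are kept as-is.
    """
    rng = random.Random(seed)
    by_label = {}
    for pixels, label in samples:
        by_label.setdefault(label, []).append(pixels)
    augmented = list(samples)
    for label, originals in by_label.items():
        for _ in range(max(0, min_count - len(originals))):
            augmented.append((jitter(rng.choice(originals), rng), label))
    return augmented

# Hypothetical data: a frequent word class and a rare one.
samples = [([0.1, 0.9, 0.5], "the")] * 4 + [([0.8, 0.2, 0.3], "Alexandria")]
balanced = augment_rare_classes(samples, min_count=3)
counts = Counter(label for _, label in balanced)
```

Classes that already meet the threshold are left untouched, so only the skewed tail of the class distribution grows.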

Proceedings Article
01 Jan 2002
TL;DR: This work proposes representing documents using phrases within a vector space model and shows that phrase-based VSM yields a 16% increase in retrieval accuracy compared to the stem-based model.
Abstract: Many information retrieval systems are based on the vector space model (VSM), which represents a document as a vector of index terms. Concepts have been proposed to replace word stems as the index terms to improve retrieval accuracy. However, past research revealed that such systems did not outperform the traditional stem-based systems. Incorporating conceptual similarity derived from knowledge sources should have the potential to improve retrieval accuracy. Yet the incompleteness of the knowledge source precludes significant improvement. To remedy this problem, we propose to represent documents using phrases. A phrase consists of multiple concepts and word stems. The similarity between two phrases is jointly determined by their conceptual similarity and their common word stems. The document similarity can in turn be derived from phrase similarities. Using OHSUMED as a test collection and UMLS as the knowledge source, our experimental results reveal that phrase-based VSM yields a 16% increase in retrieval accuracy compared to the stem-based model.

59 citations
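
One way to picture a phrase similarity that depends on both conceptual similarity and shared word stems is the sketch below. The combination rule (a weighted sum of two Jaccard overlaps), the `alpha` weight, the best-match averaging at document level, and the concept identifiers are all assumptions made for illustration; the abstract does not give the paper's actual formulas.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def phrase_similarity(p, q, alpha=0.5):
    """Blend concept overlap and stem overlap (alpha is an assumed weight)."""
    concept_sim = jaccard(p["concepts"], q["concepts"])
    stem_sim = jaccard(p["stems"], q["stems"])
    return alpha * concept_sim + (1 - alpha) * stem_sim

def document_similarity(doc_a, doc_b):
    """Average, over phrases in doc_a, of the best-matching phrase in doc_b."""
    if not doc_a or not doc_b:
        return 0.0
    best = [max(phrase_similarity(p, q) for q in doc_b) for p in doc_a]
    return sum(best) / len(best)

# Hypothetical phrases; "C001" and "C002" are placeholder concept IDs.
p1 = {"concepts": {"C001"}, "stems": {"heart", "attack"}}
q1 = {"concepts": {"C001"}, "stems": {"myocardi", "infarct"}}
q2 = {"concepts": {"C002"}, "stems": {"heart", "rate"}}

same_concept = phrase_similarity(p1, q1)   # shared concept, no shared stems
shared_stem = phrase_similarity(p1, q2)    # one shared stem, no shared concept
doc_sim = document_similarity([p1], [q1, q2])
```

In this toy setup, two phrases with no words in common can still match through a shared concept, while shared stems provide a fallback when the knowledge source has no concept for a term.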


Network Information
Related Topics (5)
Web page: 50.3K papers, 975.1K citations, 81% related
Metadata: 43.9K papers, 642.7K citations, 79% related
Recommender system: 27.2K papers, 598K citations, 79% related
Ontology (information science): 57K papers, 869.1K citations, 78% related
Natural language: 31.1K papers, 806.8K citations, 77% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111