Showing papers on "Document retrieval" published in 2002


MonographDOI
20 Jun 2002
TL;DR: This text covers the emerging technologies of document retrieval, information extraction, and text categorization in a way which highlights commonalities in terms of both general principles and practical issues.
Abstract: This text covers the emerging technologies of document retrieval, information extraction, and text categorization in a way which highlights commonalities in terms of both general principles and practical issues. It seeks to satisfy a need on the part of technology practitioners in the Internet space, faced with having to make difficult decisions as to what research has been done and what the best practices are. It is not intended as a vendor guide or as a recipe for building applications. But it does identify the key technologies, the issues involved, and the strengths and weaknesses of the various approaches. There is also a strong emphasis on evaluation in every chapter, both in terms of methodology (how to evaluate) and what controlled experimentation and industrial experience have to tell us.

321 citations


Proceedings ArticleDOI
S. Muthukrishnan
06 Jan 2002
TL;DR: This paper considers document retrieval problems that are motivated by online query processing in databases, Information Retrieval systems and Computational Biology, and provides the first known optimal algorithm for the document listing problem.
Abstract: We are given a collection D of text documents d1,…,dk, with ∑i |di| = n, which may be preprocessed. In the document listing problem, we are given an online query comprising a pattern string p of length m and our goal is to return the set of all documents that contain one or more copies of p. In the closely related occurrence listing problem, we output the set of all positions within the documents where pattern p occurs. In 1973, Weiner [24] presented an algorithm with O(n) time and space preprocessing following which the occurrence listing problem can be solved in time O(m + output) where output is the number of positions where p occurs; this algorithm is clearly optimal. In contrast, no optimal algorithm is known for the closely related document listing problem, which is perhaps more natural and certainly well-motivated. We provide the first known optimal algorithm for the document listing problem. More generally, we initiate the study of pattern matching problems that require retrieving documents matched by the patterns; this contrasts with pattern matching problems that have been studied more frequently, namely, those that involve retrieving all occurrences of patterns. We consider document retrieval problems that are motivated by online query processing in databases, Information Retrieval systems and Computational Biology. We present very efficient (optimal) algorithms for our document retrieval problems. Our approach for solving such problems involves performing "local" encodings whereby they are reduced to range query problems on geometric objects (points and lines) that have color. We present improved algorithms for these colored range query problems that arise in our reductions using the structural properties of strings. This approach is quite general and yields simple, efficient, implementable algorithms for all the document retrieval problems in this paper.
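
To make the document listing problem concrete, here is a minimal Python sketch that reports each matching document exactly once. It assumes the commonly described combination of a suffix array, a document array and a "previous occurrence of the same document" array queried by recursive range minima. The naive suffix sorting and the linear-scan minimum are simplifications chosen for brevity, so this sketch does not reach the optimal bounds claimed in the abstract. The names `build_index` and `list_documents` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_index(docs):
    """Naive generalized suffix array plus the arrays used for listing."""
    # One entry per suffix, identified by (document id, start offset).
    suffixes = sorted(
        ((d, i) for d, text in enumerate(docs) for i in range(len(text))),
        key=lambda s: docs[s[0]][s[1]:],
    )
    doc_of = [d for d, _ in suffixes]  # document array
    # prev[j] = closest earlier position holding the same document, or -1.
    prev, last_seen = [], {}
    for j, d in enumerate(doc_of):
        prev.append(last_seen.get(d, -1))
        last_seen[d] = j
    return suffixes, doc_of, prev


def list_documents(docs, index, pattern):
    """Return the sorted ids of all documents containing `pattern`."""
    suffixes, doc_of, prev = index
    m = len(pattern)
    # Suffixes that start with the pattern form one contiguous range.
    # Rebuilding `keys` per query is a simplification of this sketch.
    keys = [docs[d][i:i + m] for d, i in suffixes]
    lo, hi = bisect_left(keys, pattern), bisect_right(keys, pattern) - 1
    found = []

    def report(left, right):
        if left > right:
            return
        # Linear scan; a range-minimum structure would answer this in O(1).
        j = min(range(left, right + 1), key=prev.__getitem__)
        if prev[j] >= lo:
            return  # every document here already occurs earlier in [lo, hi]
        found.append(doc_of[j])  # first occurrence of this document
        report(left, j - 1)
        report(j + 1, right)

    report(lo, hi)
    return sorted(found)
```

A position j in the matching range is reported only when `prev[j]` falls before the start of the range, which is what limits the output to one entry per document rather than one per occurrence.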

267 citations


Proceedings ArticleDOI
04 Nov 2002
TL;DR: The experiments show that passage retrieval is feasible in the language modeling context, and more importantly, it can provide more reliable performance than retrieval based on full documents.
Abstract: Previous research has shown that passage-level evidence can bring added benefits to document retrieval when documents are long or span different subject areas. Recent developments in the language modeling approach to IR have provided a new, effective alternative to traditional retrieval models. These two streams of research motivate us to examine the use of passages in a language model framework. This paper reports on experiments using passages in a simple language model and a relevance model, and compares the results with document-based retrieval. Results from the INQUERY search engine, which is not based on a language modeling approach, are also given for comparison. Test data include two heterogeneous and one homogeneous document collections. Our experiments show that passage retrieval is feasible in the language modeling context, and more importantly, it can provide more reliable performance than retrieval based on full documents.
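
One way to picture passage-level scoring in a language model framework is the sketch below. The abstract does not fix the details, so the following are all assumptions made here for illustration: half-overlapping fixed-size windows, Dirichlet smoothing with parameter `mu`, add-one smoothing of the collection model, and ranking a document by its best passage. The function names are hypothetical.

```python
import math
from collections import Counter


def collection_stats(docs):
    """Term counts and total length over a list of tokenized documents."""
    tf = Counter(t for doc in docs for t in doc)
    return tf, sum(tf.values())


def log_likelihood(query, tokens, coll_tf, coll_len, mu=100.0):
    """log P(query | text) under a Dirichlet-smoothed unigram model."""
    tf = Counter(tokens)
    vocab = len(coll_tf) + 1  # +1 reserves mass for unseen terms (assumption)
    score = 0.0
    for term in query:
        p_coll = (coll_tf.get(term, 0) + 1) / (coll_len + vocab)
        score += math.log((tf[term] + mu * p_coll) / (len(tokens) + mu))
    return score


def passages(tokens, size=4):
    """Half-overlapping fixed-size windows; trailing windows may be shorter."""
    step = max(size // 2, 1)
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]


def best_passage_score(query, tokens, coll_tf, coll_len, size=4, mu=100.0):
    """Score a document by its highest-scoring passage."""
    return max(
        (log_likelihood(query, p, coll_tf, coll_len, mu)
         for p in passages(tokens, size)),
        default=float("-inf"),  # empty document: no passage to score
    )
```

Calling `log_likelihood` on the full token list gives the document-based score that the passage-based score can be compared against.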

228 citations


Proceedings ArticleDOI
04 Nov 2002
TL;DR: An approach to retrieval of documents that contain both free text and semantically enriched markup, in which both documents and queries can be marked up with statements in the DAML+OIL semantic web language, is described.
Abstract: We describe an approach to retrieval of documents that contain both free text and semantically enriched markup. In particular, we present the design and implementation prototype of a framework in which both documents and queries can be marked up with statements in the DAML+OIL semantic web language. These statements provide both structured and semi-structured information about the documents and their content. We claim that indexing text and semantic markup together will significantly improve retrieval performance. Our approach allows inferencing to be done over this information at several points: when a document is indexed, when a query is processed and when query results are evaluated.

227 citations


Journal ArticleDOI
Hugo Zaragoza

211 citations


Proceedings ArticleDOI
Dragomir R. Radev, Weiguo Fan, Hong Qi, Harris Wu, Amardeep Grewal
07 May 2002
TL;DR: The architecture that augments existing search engines so that they support natural language question answering, called NSIR, is developed and some probabilistic approaches to the last three of these stages are described.
Abstract: Web-based search engines such as Google and NorthernLight return documents that are relevant to a user query, not answers to user questions. We have developed an architecture that augments existing search engines so that they support natural language question answering. The process entails five steps: query modulation, document retrieval, passage extraction, phrase extraction, and answer ranking. In this paper we describe some probabilistic approaches to the last three of these stages. We show how our techniques apply to a number of existing search engines and we also present results contrasting three different methods for question answering. Our algorithm, probabilistic phrase reranking (PPR) using proximity and question type features, achieves a total reciprocal document rank of .20 on the TREC 8 corpus. Our techniques have been implemented as a Web-accessible system, called NSIR.

182 citations


Book ChapterDOI
07 Sep 2002
TL;DR: In this article, the authors re-examine the capability of ant-based meta-heuristics to simultaneously perform a combination of clustering and multi-dimensional scaling, and they show how to improve on this by some modifications of the algorithm and a hybridization with a simple pre-processing phase.
Abstract: Sorting and clustering methods inspired by the behavior of real ants are among the earliest methods in ant-based meta-heuristics. We revisit these methods in the context of a concrete application and introduce some modifications that yield significant improvements in terms of both quality and efficiency. Firstly, we re-examine their capability to simultaneously perform a combination of clustering and multi-dimensional scaling. In contrast to the assumptions made in earlier literature, our results suggest that these algorithms perform scaling only to a very limited degree. We show how to improve on this by some modifications of the algorithm and a hybridization with a simple pre-processing phase. Secondly, we discuss how the time-complexity of these algorithms can be improved. The improved algorithms are used as the core mechanism in a visual document retrieval system for world-wide web searches.

181 citations


Patent
29 Apr 2002
TL;DR: A retrieval assisting interface places a search results display area and a topic word display area side by side so that users can browse title information and topic information together; a mark title button emphasizes documents containing designated topic words and a mark topic word button emphasizes topic words contained in a designated document, so users can readily analyze search results from various standpoints.
Abstract: Efficient analysis of search results, which is required for the examination of search queries, is achieved by listing both the title information of a retrieved document group and the whole information. By arranging a search results display area and a topic word display area adjacently on a retrieval assisting interface, the title information and topic information can be browsed by users; by providing search results analysis means such as a mark title button for emphasizing documents containing designated topic words, along with a mark topic word button for emphasizing topic words contained in a designated document, users can analyze search results readily from various standpoints.

136 citations


Proceedings ArticleDOI
24 Aug 2002
TL;DR: Topical relations expressed as lexical chains on extended WordNet improve the performance of a question answering system by increasing the document retrieval recall and by providing the much needed axioms that link question keywords with answers.
Abstract: The paper presents a method for finding topically related words on an extended WordNet. By exploiting the information in the WordNet glosses, the connectivity between the synsets is dramatically increased. Topical relations expressed as lexical chains on extended WordNet improve the performance of a question answering system by increasing the document retrieval recall and by providing the much needed axioms that link question keywords with answers.

133 citations


Book ChapterDOI
25 Mar 2002
TL;DR: The research focuses on the degree to which implicit evidence of document relevance can be substituted for explicit evidence in terms of both user opinion and search effectiveness.
Abstract: In this paper we report on the application of two contrasting types of relevance feedback for web retrieval. We compare two systems; one using explicit relevance feedback (where searchers explicitly have to mark documents relevant) and one using implicit relevance feedback (where the system endeavours to estimate relevance by mining the searcher's interaction). The feedback is used to update the display according to the user's interaction. Our research focuses on the degree to which implicit evidence of document relevance can be substituted for explicit evidence. We examine the two variations in terms of both user opinion and search effectiveness.

131 citations


Journal ArticleDOI
TL;DR: InfoSky is a system enabling users to explore large, hierarchically structured document collections using a planar graphical representation with variable magnification, and can map metadata such as document size or age to attributes of the visualisation such as colour and luminance.
Abstract: InfoSky is a system enabling users to explore large, hierarchically structured document collections. Similar to a real-world telescope, InfoSky employs a planar graphical representation with variable magnification. Documents of similar content are placed close to each other and are visualised as stars, forming clusters with distinct shapes. For greater performance, the hierarchical structure is exploited and force-directed placement is applied recursively at each level on far fewer objects, rather than on the whole corpus. Collections of documents at a particular level in the hierarchy are visualised with bounding polygons using a modified weighted Voronoi diagram. Their area is related to the number of documents contained. Textual labels are displayed dynamically during navigation, adjusting to the visualisation content. Navigation is animated and provides a seamless zooming transition between summary and detail view. Users can map metadata such as document size or age to attributes of the visualisation such as colour and luminance. Queries can be made and matching documents or collections are highlighted. Formative usability testing is ongoing; a small baseline experiment comparing the telescope browser to a tree browser is discussed.

Proceedings ArticleDOI
11 Aug 2002
TL;DR: Language-model retrieval ranks documents by the probability that each document's language model generates the query; smoothed unigram models have shown better performance than vector space or probabilistic retrieval models, and Song and Croft's general language model, which combines bigram language models with Good-Turing estimation and corpus-based smoothing, improved performance further.
Abstract: Statistical Language Models (LM) have been used in many natural language processing tasks including speech recognition and machine translation [5, 2]. Recently language models have been explored as a framework for information retrieval [9, 4, 7, 1, 6]. The basic idea is to view each document as having its own language model and to model querying as a generative process. Documents are ranked based on the probability of their language model generating the given query. Since documents are fixed entities in information retrieval, language models for documents suffer from the sparse data problem. Smoothed unigram models have been used to demonstrate better performance of language models against vector space or probabilistic retrieval models for document retrieval. Song and Croft [10] proposed a general language model that combined bigram language models with Good-Turing estimate and corpus-based smoothing of unigram probabilities. Improved performance was observed with combined bigram language models. The language models explored for information retrieval mimic those used for speech recognition. Specifically, in the bigram model a document d represented as word sequence w1, w2, · · · , wn is modeled as
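
The abstract breaks off before its formula. For orientation only, the standard bigram factorization of a word sequence, as used in speech recognition language models, is given below. Whether the paper writes it in exactly this form is an assumption.

```latex
P(d) = P(w_1, w_2, \ldots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})
```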

Patent
11 Jan 2002
TL;DR: A new data structure and algorithms are proposed that offer at least equal performance in common sparse matrix tasks, and improved performance in many, and are applied to a word-document index to produce fast build and query times for document retrieval.
Abstract: A new data structure and algorithms which offer at least equal performance in common sparse matrix tasks, and improved performance in many. This is applied to a word-document index to produce fast build and query times for document retrieval.

Proceedings ArticleDOI
11 Aug 2002
TL;DR: The new language modeling approach is shown to explain a number of practical facts of today's information retrieval systems that are not very well explained by the current state of information retrieval theory, including stop words, mandatory terms, coordination level ranking and retrieval using phrases.
Abstract: This paper follows a formal approach to information retrieval based on statistical language models. By introducing some simple reformulations of the basic language modeling approach we introduce the notion of importance of a query term. The importance of a query term is an unknown parameter that explicitly models which of the query terms are generated from the relevant documents (the important terms), and which are not (the unimportant terms). The new language modeling approach is shown to explain a number of practical facts of today's information retrieval systems that are not very well explained by the current state of information retrieval theory, including stop words, mandatory terms, coordination level ranking and retrieval using phrases.
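
The role of a per-term importance parameter can be sketched as follows. This assumes one common formulation, in which each query term's probability is a mixture of a document model and a collection model weighted by that term's importance `lam`. The abstract does not give the formula, so the mixture form, the maximum-likelihood estimates and the function name `importance_score` are assumptions of this sketch.

```python
import math
from collections import Counter


def importance_score(term_importance, doc_tokens, coll_tf, coll_len):
    """log P(query | doc) with a per-term importance weight in [0, 1].

    lam = 0: the term is scored from the collection model only, so it adds
             the same constant to every document (behaves like a stop word).
    lam = 1: the term is scored from the document only, so a document that
             lacks it gets probability zero (behaves like a mandatory term).
    """
    tf = Counter(doc_tokens)
    doc_len = max(len(doc_tokens), 1)  # guard against empty documents
    total = 0.0
    for term, lam in term_importance.items():
        p_doc = tf[term] / doc_len
        p_coll = coll_tf.get(term, 0) / coll_len
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total
```

Under these assumptions, stop words and mandatory terms fall out as the two extreme settings of the same parameter rather than as separate mechanisms.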

Journal ArticleDOI
TL;DR: A text on the mathematical foundations of information retrieval, with chapters on information retrieval models, the mathematical theory of information retrieval, and relevance effectiveness.
Abstract: Acknowledgments. Preface. 1. Introduction. 2. Mathematics Handbook. 3. Information Retrieval Models. 4. Mathematical Theory of Information Retrieval. 5. Relevance Effectiveness in Information Retrieval. 6. Further Topics in Information Retrieval. Appendices. References. Index.

Patent
Inaba Mitsuaki1, Yuji Kanno1
26 Mar 2002
TL;DR: A distributed document retrieval method in which plural retrieval servers each search their own document database, and an integrating retrieval server compiles the statistical information they report into global statistical information so that every server calculates scores on the same basis.
Abstract: A distributed document retrieval method for performing document retrieval by plural retrieval servers that each perform document retrieval for a database storing plural documents, and an integrating retrieval server that is connected to the plural retrieval servers over communication and issues retrieval orders to the retrieval servers, wherein each retrieval server delivers statistical information created based on intermediate results obtained by retrieval operation to the integrating retrieval server, the integrating retrieval server compiles the statistical information to create global statistical information and delivers it to each retrieval server, and each retrieval server calculates scores based on the global statistical information and sends retrieval results matching retrieval conditions back to the integrating retrieval server. By the above described operation, efficient and correct ranking among retrieval documents is achieved with improved document retrieval quality.
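
The two-phase exchange described above can be sketched in a few lines of Python. The choice of statistics (document counts and document frequencies) and the tf-idf style score are assumptions made for illustration, since the abstract only speaks of "statistical information". The class and function names are hypothetical.

```python
import math
from collections import Counter


class RetrievalServer:
    """Holds one partition of the collection: {doc_id: list of tokens}."""

    def __init__(self, docs):
        self.docs = docs

    def local_stats(self, query):
        """Phase 1: local document count and per-term document frequencies."""
        df = Counter(
            t for tokens in self.docs.values() for t in set(tokens) if t in query
        )
        return len(self.docs), df

    def search(self, query, n_docs, df):
        """Phase 2: score local documents using the global statistics."""
        hits = []
        for doc_id, tokens in self.docs.items():
            tf = Counter(tokens)
            score = sum(
                tf[t] * math.log(1 + n_docs / df[t]) for t in query if df[t]
            )
            if score > 0:
                hits.append((score, doc_id))
        return hits


def integrated_search(servers, query, k=10):
    """Integrating server: gather statistics, broadcast them, merge results."""
    stats = [s.local_stats(query) for s in servers]
    n_docs = sum(n for n, _ in stats)
    df = sum((d for _, d in stats), Counter())  # global document frequencies
    merged = [hit for s in servers for hit in s.search(query, n_docs, df)]
    merged.sort(key=lambda h: (-h[0], h[1]))  # best score first, ties by id
    return [doc_id for _, doc_id in merged[:k]]
```

Because every server scores with the same global counts, the merged ranking matches what a single server holding all documents would produce, which is the consistency property the abstract attributes to the method.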

Book
01 Jan 2002
TL;DR: In this paper, the technologies of document retrieval, information extraction, and text categorization are discussed in a way which highlights commonalities in terms of both general principles and practical concerns.
Abstract: This text covers the technologies of document retrieval, information extraction, and text categorization in a way which highlights commonalities in terms of both general principles and practical concerns. It assumes some mathematical background on the part of the reader, but the chapters typically begin with a non-mathematical account of the key issues. Current research topics are covered only to the extent that they are informing current applications; detailed coverage of longer term research and more theoretical treatments should be sought elsewhere. There are many pointers at the ends of the chapters that the reader can follow to explore the literature. However, the book does maintain a strong emphasis on evaluation in every chapter both in terms of methodology and the results of controlled experimentation.

Journal ArticleDOI
TL;DR: This paper deals with the measurement of bias in search engines on the World Wide Web by measuring the deviation from the ideal of the distribution produced by a particular search engine.
Abstract: This paper deals with the measurement of bias in search engines on the World Wide Web. Bias is taken to mean the balance and representativeness of items in a collection retrieved from a database for a set of queries. This calls for assessing the degree to which the distribution of items in a collection deviates from the ideal. Ascertaining this ideal poses problems similar to those associated with determining relevance in the measurement of recall and precision. Instead of enlisting subject experts or users to determine such an ideal, a family of comparable search engines is used to approximate it for a set of queries. The distribution is obtained by computing the frequencies of occurrence of the uniform resource locators (URLs) in the collection retrieved by several search engines for the given queries. Bias is assessed by measuring the deviation from the ideal of the distribution produced by a particular search engine.
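
The measurement procedure can be illustrated with the short sketch below. It pools the URLs returned by a family of engines to approximate the ideal distribution, then compares one engine's URL distribution against it. Using one minus the cosine similarity as the deviation measure is an assumption made here, since the abstract does not name a specific metric. The function names are hypothetical.

```python
import math
from collections import Counter


def url_distribution(result_lists):
    """Count how often each URL occurs across a set of result lists."""
    return Counter(url for results in result_lists for url in results)


def bias(engine_results, pool_results):
    """Deviation of one engine's URL distribution from the pooled 'ideal'.

    engine_results: list of result lists (one per query) for the engine.
    pool_results:   dict mapping engine name -> list of result lists.
    Returns a value in [0, 1]; 0 means identical direction to the pool.
    """
    ideal = url_distribution(
        results for lists in pool_results.values() for results in lists
    )
    observed = url_distribution(engine_results)
    dot = sum(count * ideal[url] for url, count in observed.items())
    norm = math.sqrt(sum(v * v for v in observed.values())) * math.sqrt(
        sum(v * v for v in ideal.values())
    )
    return 1.0 - dot / norm if norm else 1.0
```

An engine whose results overlap heavily with what the other engines return gets a value near 0, while an engine returning URLs that no other engine returns moves toward 1.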

Journal ArticleDOI
TL;DR: A statistical correlation model for image retrieval captures the semantic relationships among images in a database from simple statistics of user-provided relevance feedback information and can be efficiently integrated into an image retrieval system to help improve the retrieval performance.

Proceedings ArticleDOI
11 Aug 2002
TL;DR: The view presented in this paper is that the fundamental vocabulary of the system is the images in the database and that relevance feedback is a document whose words are the images which expresses the semantic intent of the user over that query.
Abstract: This paper proposes a novel view of the information generated by relevance feedback. The latent semantic analysis is adapted to this view to extract useful inter-query information. The view presented in this paper is that the fundamental vocabulary of the system is the images in the database and that relevance feedback is a document whose words are the images. A relevance feedback document contains the intra-query information which expresses the semantic intent of the user over that query. The inter-query information then takes the form of a collection of documents which can be subjected to latent semantic analysis. An algorithm to query the latent semantic index is presented and evaluated against real data sets.

Proceedings Article
01 Jan 2002
TL;DR: This work proposes a phrase-based vector space model, in which documents are represented using phrases made up of concepts and word stems, and shows that it yields a 16% increase of retrieval accuracy compared to the stem-based model.
Abstract: Many information retrieval systems are based on the vector space model (VSM), which represents a document as a vector of index terms. Concepts have been proposed to replace word stems as the index terms to improve retrieval accuracy. However, past research revealed that such systems did not outperform the traditional stem-based systems. Incorporating conceptual similarity derived from knowledge sources should have the potential to improve retrieval accuracy. Yet the incompleteness of the knowledge source precludes significant improvement. To remedy this problem, we propose to represent documents using phrases. A phrase consists of multiple concepts and word stems. The similarity between two phrases is jointly determined by their conceptual similarity and their common word stems. The document similarity can in turn be derived from phrase similarities. Using OHSUMED as a test collection and UMLS as the knowledge source, our experimental results reveal that phrase-based VSM yields a 16% increase of retrieval accuracy compared to the stem-based model.

Book ChapterDOI
07 Mar 2002
TL;DR: The key idea is to separate the role of document storers from the machines visible to the users, which makes each individual part of the system less prone to attacks, and therefore to censorship.
Abstract: In this paper we propose a new Peer-to-Peer architecture for a censorship resistant system with user, server and active-server document anonymity as well as efficient document retrieval. The retrieval service is layered on top of an existing Peer-to-Peer infrastructure, which should facilitate its implementation. The key idea is to separate the role of document storers from the machines visible to the users, which makes each individual part of the system less prone to attacks, and therefore to censorship.

Proceedings ArticleDOI
24 Aug 2002
TL;DR: This work describes a mechanism for the generation of lexical paraphrases of queries posed to an Internet resource using WordNet and part-of-speech information to propose synonyms for the content words in the queries and evaluates its mechanism using 404 queries.
Abstract: We describe a mechanism for the generation of lexical paraphrases of queries posed to an Internet resource. These paraphrases are generated using WordNet and part-of-speech information to propose synonyms for the content words in the queries. Statistical information, obtained from a corpus, is then used to rank the paraphrases. We evaluated our mechanism using 404 queries whose answers reside in the LA Times subset of the TREC-9 corpus. There was a 14% improvement in performance when paraphrases were used for document retrieval.

Patent
03 Jun 2002
TL;DR: In this article, a document search and retrieval system and program product therefor is described, where search requests are provided to the system through a user interface, and a document decomposer decomposes documents into individual document components.
Abstract: A document search and retrieval system and program product therefor. Search requests are provided to the system through a user interface. A document decomposer decomposes documents into individual document components. Document components and corresponding searchable indices for each are stored in a Component Library. A search unit searches stored document components responsive to search queries. A results validator compares document hitlists with a document type identified in a search query to select valid hitlists entries for a final hitlist. A document view assembly module collects identified document components and assembles them into a document for view at the user interface.

Patent
22 Jul 2002
TL;DR: A language model (114) for speech recognition is created from a text database (122) by offline modeling processing (130); when a user speaks a search request, an acoustic model (112) and the language model are used to recognize the speech and produce a transcription, which is then used for text search.
Abstract: A language model (114) for speech recognition is created from a text database (122) by offline modeling processing (130) (solid line arrows). In online processing, when a user speaks a search request, an acoustic model (112) and the language model (114) are used to perform speech recognition processing (110) and a transcription is created. Next, using the transcribed search request, text search processing (120) is performed and the search results are output in descending order of correlation.

Proceedings ArticleDOI
11 Mar 2002
TL;DR: New methods based on a context-diary and caching aimed at improving both the precision of relevant retrieved information and the speed/availability of retrieval are suggested.
Abstract: Information retrieval systems are usually unaware of the context in which they are being used. We believe that exploiting context information to augment existing retrieval methods can lead to increased retrieval precision. This approach is particularly important with the development of wireless mobile information appliances, such as PDAs. Many of these devices are aware of the user's physical context, and this has led to the evolution of context-aware applications. Such applications can automatically utilise the user's current context, e.g. location or ambient temperature. Context-Aware Retrieval is related to traditional Information Retrieval and Information Filtering, but is potentially more challenging due to the often continuous changes in user context. To meet these challenges we suggest a potential advantage of Context-Aware Retrieval: this is that the current context is often changing gradually and semi-predictably. In this paper we suggest new methods based on a context-diary and caching aimed at improving both the precision of relevant retrieved information and the speed/availability of retrieval. The methods can be used, in principle, on top of existing retrieval systems.

Journal ArticleDOI
TL;DR: The ERIn (Evaluation-Recommendation-Information) model, a decision-theoretic framework for understanding information-related activity, highlights the centrality of recommending in the document retrieval process, and may be used to clarify the respects in which indexing, rating, and citation may be considered analogous.
Abstract: The core of any document retrieval system is a mechanism that ranks the documents in a large collection in order of the likelihood with which they match the preferences of any person who interacts with the system. Given a broader interpretation of recommending than is commonly accepted, such a preference ordering may be viewed as a recommendation, made by the system to the information seeker, that is itself typically derived through synthesis of multiple preference orderings expressed as recommendations by indexers, information seekers, and document authors. The ERIn (Evaluation-Recommendation-Information) model, a decision-theoretic framework for understanding information-related activity, highlights the centrality of recommending in the document retrieval process, and may be used to clarify the respects in which indexing, rating, and citation may be considered analogous, as well as to make explicit the points at which content-based, collaboration-based, and context-based flavors of document retrieval systems vary.

Journal ArticleDOI
TL;DR: This article describes how a document map that is automatically organized for browsing and visualization can be successfully utilized also in speeding up document retrieval and shows significantly improved performance compared to Salton's vector space model.
Abstract: A map of text documents arranged using the Self-Organizing Map (SOM) algorithm (1) is organized in a meaningful manner so that items with similar content appear at nearby locations of the 2-dimensional map display, and (2) clusters the data, resulting in an approximate model of the data distribution in the high-dimensional document space. This article describes how a document map that is automatically organized for browsing and visualization can be successfully utilized also in speeding up document retrieval. Furthermore, experiments on the well-known CISI collection [3] show significantly improved performance compared to Salton's vector space model, measured by average precision (AP) when retrieving a small, fixed number of best documents. Regarding comparison with Latent Semantic Indexing the results are inconclusive.

Book ChapterDOI
19 Aug 2002
TL;DR: It is shown how a rigorous evaluation of Document Transformation can be carried out using the referer logs kept by web servers, and a new strategy for Document Transformation is described that is suitable for long-term incremental learning.
Abstract: This paper considers how web search engines can learn from the successful searches recorded in their user logs. Document Transformation is a feasible approach that uses these logs to improve document representations. Existing test collections do not allow an adequate investigation of Document Transformation, but we show how a rigorous evaluation of this method can be carried out using the referer logs kept by web servers. We also describe a new strategy for Document Transformation that is suitable for long-term incremental learning. Our experiments show that Document Transformation improves retrieval performance over a medium sized collection of webpages. Commercial search engines may be able to achieve similar improvements by incorporating this approach.

Book ChapterDOI
25 Mar 2002
TL;DR: This paper investigates a tf-idf-acc approach to structured document retrieval, where tf and idf are the classical term frequency and inverse document frequency and acc is a new parameter, called accessibility, that captures the structure of documents.
Abstract: Structured document retrieval aims at retrieving the document components that best satisfy a query, instead of merely retrieving pre-defined document units. This paper reports on an investigation of a tf-idf-acc approach, where tf and idf are the classical term frequency and inverse document frequency, and acc is a new parameter, called accessibility, that captures the structure of documents. The tf-idf-acc approach is defined using a probabilistic relational algebra. To investigate the retrieval quality and estimate the acc values, we developed a method that automatically constructs diverse test collections of structured documents from a standard test collection, with which experiments were carried out. The analysis of the experiments provides estimates of the acc values.