Showing papers on "Document retrieval" published in 2002

Monograph•DOI•

Natural language processing for online applications : text retrieval, extraction and categorization

20 Jun 2002

TL;DR: This text covers the emerging technologies of document retrieval, information extraction, and text categorization in a way which highlights commonalities in terms of both general principles and practical issues.

Abstract: This text covers the emerging technologies of document retrieval, information extraction, and text categorization in a way which highlights commonalities in terms of both general principles and practical issues. It seeks to satisfy a need on the part of technology practitioners in the Internet space, faced with having to make difficult decisions as to what research has been done and what the best practices are. It is not intended as a vendor guide or as a recipe for building applications. But it does identify the key technologies, the issues involved, and the strengths and weaknesses of the various approaches. There is also a strong emphasis on evaluation in every chapter, both in terms of methodology (how to evaluate) and what controlled experimentation and industrial experience have to tell us.

321 citations

Proceedings Article•DOI•

Efficient algorithms for document retrieval problems

S. Muthukrishnan¹•Institutions (1)

AT&T Labs¹

06 Jan 2002

TL;DR: This paper considers document retrieval problems that are motivated by online query processing in databases, Information Retrieval systems and Computational Biology, and provides the first known optimal algorithm for the document listing problem.

Abstract: We are given a collection D of text documents d1,…,dk, with ∑i |di| = n, which may be preprocessed. In the document listing problem, we are given an online query comprising a pattern string p of length m and our goal is to return the set of all documents that contain one or more copies of p. In the closely related occurrence listing problem, we output the set of all positions within the documents where pattern p occurs. In 1973, Weiner [24] presented an algorithm with O(n) time and space preprocessing following which the occurrence listing problem can be solved in time O(m + output) where output is the number of positions where p occurs; this algorithm is clearly optimal. In contrast, no optimal algorithm is known for the closely related document listing problem, which is perhaps more natural and certainly well-motivated. We provide the first known optimal algorithm for the document listing problem. More generally, we initiate the study of pattern matching problems that require retrieving documents matched by the patterns; this contrasts with pattern matching problems that have been studied more frequently, namely, those that involve retrieving all occurrences of patterns. We consider document retrieval problems that are motivated by online query processing in databases, Information Retrieval systems and Computational Biology. We present very efficient (optimal) algorithms for our document retrieval problems. Our approach for solving such problems involves performing "local" encodings whereby they are reduced to range query problems on geometric objects (points and lines) that have color. We present improved algorithms for these colored range query problems that arise in our reductions using the structural properties of strings. This approach is quite general and yields simple, efficient, implementable algorithms for all the document retrieval problems in this paper.
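
The difference between occurrence listing and document listing can be illustrated with a small sketch. It is not the paper's optimal algorithm: the suffix array is built naively, and the matching range is scanned linearly where an optimal solution would use a constant-time range-minimum structure. The names `Index`, `build_index`, `list_occurrences`, and `list_documents` are hypothetical, and the separator character is assumed never to appear in documents or in (non-empty) patterns.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

SEP = "\x00"  # assumption: never occurs in documents or patterns


@dataclass
class Index:
    text: str          # all documents joined, each followed by SEP
    sa: List[int]      # suffix array of text
    doc_of: List[int]  # doc_of[pos] = document containing text position pos
    prev: List[int]    # prev[i] = largest j < i with the same document as sa[i], else -1


def build_index(docs: List[str]) -> Index:
    text = SEP.join(docs) + SEP
    doc_of: List[int] = []
    for d, doc in enumerate(docs):
        doc_of.extend([d] * (len(doc) + 1))  # +1 covers the separator
    # Naive suffix sorting; fine for a demo, far from linear time.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    last_seen: Dict[int, int] = {}
    prev: List[int] = []
    for i, pos in enumerate(sa):
        d = doc_of[pos]
        prev.append(last_seen.get(d, -1))
        last_seen[d] = i
    return Index(text, sa, doc_of, prev)


def suffix_range(index: Index, p: str) -> Tuple[int, int]:
    """Half-open range of suffix-array slots whose suffix starts with p."""
    m = len(p)
    lo, hi = 0, len(index.sa)
    while lo < hi:  # first slot whose length-m prefix is >= p
        mid = (lo + hi) // 2
        if index.text[index.sa[mid]:index.sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(index.sa)
    while lo < hi:  # first slot whose length-m prefix is > p
        mid = (lo + hi) // 2
        if index.text[index.sa[mid]:index.sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def list_occurrences(index: Index, p: str) -> List[int]:
    """Occurrence listing: every position in the joined text where p starts."""
    start, end = suffix_range(index, p)
    return sorted(index.sa[i] for i in range(start, end))


def list_documents(index: Index, p: str) -> List[int]:
    """Document listing: each matching document is reported exactly once."""
    start, end = suffix_range(index, p)
    # A slot is the first occurrence of its document inside the range
    # exactly when the previous slot of that document lies before the range.
    return sorted(index.doc_of[index.sa[i]]
                  for i in range(start, end) if index.prev[i] < start)
```

For example, with `build_index(["banana", "bandana", "cabana"])` the pattern `"ana"` has four occurrences but only three matching documents, so document listing can return far less output than occurrence listing.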

267 citations

Proceedings Article•DOI•

Passage retrieval based on language models

Xiaoyong Liu¹, W. Bruce Croft¹•Institutions (1)

University of Massachusetts Amherst¹

04 Nov 2002

TL;DR: The experiments show that passage retrieval is feasible in the language modeling context, and more importantly, it can provide more reliable performance than retrieval based on full documents.

Abstract: Previous research has shown that passage-level evidence can bring added benefits to document retrieval when documents are long or span different subject areas. Recent developments in the language modeling approach to IR have provided a new, effective alternative to traditional retrieval models. These two streams of research motivate us to examine the use of passages in a language model framework. This paper reports on experiments using passages in a simple language model and a relevance model, and compares the results with document-based retrieval. Results from the INQUERY search engine, which is not based on a language modeling approach, are also given for comparison. Test data include two heterogeneous and one homogeneous document collections. Our experiments show that passage retrieval is feasible in the language modeling context, and more importantly, it can provide more reliable performance than retrieval based on full documents.
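
A minimal sketch of passage-level scoring in a query-likelihood language model follows. The abstract does not give the paper's passage definition or smoothing method, so everything here is an assumption for illustration: fixed-size overlapping word windows, linear interpolation with a collection model (weight `lam`), and scoring a document by its best passage. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def split_passages(tokens: List[str], size: int, step: int) -> List[List[str]]:
    """Cut a token list into overlapping fixed-size windows (assumed passage type)."""
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += step
    return windows


def query_log_likelihood(query: List[str], text: List[str],
                         coll_tf: Counter, coll_len: int,
                         lam: float = 0.5) -> float:
    """log P(query | text) under a unigram model smoothed with the collection model."""
    tf = Counter(text)
    score = 0.0
    for term in query:
        p_coll = coll_tf.get(term, 0) / coll_len
        if p_coll == 0.0:
            continue  # term unseen in the whole collection: skipped for every text
        p_text = tf[term] / len(text) if text else 0.0
        score += math.log((1.0 - lam) * p_text + lam * p_coll)
    return score


def rank_by_best_passage(docs: Dict[str, List[str]], query: List[str],
                         size: int = 20, step: int = 10) -> List[Tuple[str, float]]:
    """Rank documents by the score of their highest-scoring passage."""
    coll_tf = Counter(t for tokens in docs.values() for t in tokens)
    coll_len = max(1, sum(coll_tf.values()))
    ranking = []
    for doc_id, tokens in docs.items():
        best = max(query_log_likelihood(query, passage, coll_tf, coll_len)
                   for passage in split_passages(tokens, size, step))
        ranking.append((doc_id, best))
    ranking.sort(key=lambda r: (-r[1], r[0]))
    return ranking
```

Scoring by the best passage lets a long document with one short relevant section compete with short, focused documents, which is the situation the abstract describes for long or multi-topic documents.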

228 citations

Proceedings Article•DOI•

Information retrieval on the semantic web

Urvi Shah, Tim Finin¹, Anupam Joshi¹, R. Scott Cost¹, James Matfield²•Institutions (2)

University of Maryland, Baltimore County¹, Johns Hopkins University Applied Physics Laboratory²

04 Nov 2002

TL;DR: An approach to retrieval of documents that contain both free text and semantically enriched markup is described, in which both documents and queries can be marked up with statements in the DAML+OIL semantic web language.

Abstract: We describe an approach to retrieval of documents that contain both free text and semantically enriched markup. In particular, we present the design and implementation prototype of a framework in which both documents and queries can be marked up with statements in the DAML+OIL semantic web language. These statements provide both structured and semi-structured information about the documents and their content. We claim that indexing text and semantic markup together will significantly improve retrieval performance. Our approach allows inferencing to be done over this information at several points: when a document is indexed, when a query is processed and when query results are evaluated.

227 citations

Journal Article•DOI•

Information Retrieval: Algorithms and Heuristics

Hugo Zaragoza¹•Institutions (1)

Microsoft¹

01 Apr 2002-Information Retrieval

211 citations

Proceedings Article•DOI•

Probabilistic question answering on the web

Dragomir R. Radev¹, Weiguo Fan¹, Hong Qi¹, Harris Wu¹, Amardeep Grewal¹•Institutions (1)

University of Michigan¹

07 May 2002

TL;DR: An architecture called NSIR is developed that augments existing search engines so that they support natural language question answering, and probabilistic approaches to its last three stages (passage extraction, phrase extraction, and answer ranking) are described.

Abstract: Web-based search engines such as Google and NorthernLight return documents that are relevant to a user query, not answers to user questions. We have developed an architecture that augments existing search engines so that they support natural language question answering. The process entails five steps: query modulation, document retrieval, passage extraction, phrase extraction, and answer ranking. In this paper we describe some probabilistic approaches to the last three of these stages. We show how our techniques apply to a number of existing search engines and we also present results contrasting three different methods for question answering. Our algorithm, probabilistic phrase reranking (PPR) using proximity and question type features, achieves a total reciprocal document rank of .20 on the TREC 8 corpus. Our techniques have been implemented as a Web-accessible system, called NSIR.
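
The abstract reports a total reciprocal document rank but does not define it. The sketch below assumes the usual reading of the name: for one question, sum the reciprocals of the ranks of all retrieved documents that contain a correct answer. That definition, the function name, and the argument names are assumptions made for illustration, not details taken from the abstract.

```python
from typing import Iterable, Set


def total_reciprocal_document_rank(ranked_doc_ids: Iterable[str],
                                   docs_with_answer: Set[str]) -> float:
    """Sum of 1/rank over retrieved documents that contain a correct answer.

    Ranks start at 1. This follows an assumed definition of the metric.
    """
    return sum(1.0 / rank
               for rank, doc_id in enumerate(ranked_doc_ids, start=1)
               if doc_id in docs_with_answer)
```

Under this reading, a ranking whose second and fourth documents contain the answer scores 1/2 + 1/4 = 0.75 for that question.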

182 citations

Book Chapter•DOI•

Improved Ant-Based Clustering and Sorting in a Document Retrieval Interface

Julia Handl¹, Bernd Meyer²•Institutions (2)

University of Erlangen-Nuremberg¹, Monash University²

07 Sep 2002

TL;DR: In this article, the authors re-examine the capability of ant-based meta-heuristics to simultaneously perform a combination of clustering and multi-dimensional scaling, and they show how to improve on this by some modifications of the algorithm and a hybridization with a simple pre-processing phase.

Abstract: Sorting and clustering methods inspired by the behavior of real ants are among the earliest methods in ant-based meta-heuristics. We revisit these methods in the context of a concrete application and introduce some modifications that yield significant improvements in terms of both quality and efficiency. Firstly, we re-examine their capability to simultaneously perform a combination of clustering and multi-dimensional scaling. In contrast to the assumptions made in earlier literature, our results suggest that these algorithms perform scaling only to a very limited degree. We show how to improve on this by some modifications of the algorithm and a hybridization with a simple pre-processing phase. Secondly, we discuss how the time-complexity of these algorithms can be improved. The improved algorithms are used as the core mechanism in a visual document retrieval system for world-wide web searches.

181 citations

Patent•

Document retrieval assisting method and system for the same and document retrieval service using the same

Shingo Nishioka¹, Makoto Iwayama¹, Kazuhiro Ono¹, Akihiko Takano¹, Yoshiki Niwa¹, Atsuko Yamaguchi¹•Institutions (1)

Hitachi¹

29 Apr 2002

TL;DR: By arranging the search results display area and topic word display area adjacently on a retrieval assisting interface, the title information and topic information can be browsed by users; by arranging search results analysis means such as a mark title button for emphasizing documents containing designated topic words, along with a mark topic word button for emphasizing topic words contained in a designated document, users can analyze search results readily from various standpoints.

Abstract: Efficient analysis of search results, which is required for the examination of search queries, is achieved by listing both the title information of a retrieved document group and the whole information. By arranging the search results display area and topic word display area adjacently on a retrieval assisting interface, the title information and topic information can be browsed by users; by arranging search results analysis means such as a mark title button for emphasizing documents containing designated topic words, along with a mark topic word button for emphasizing topic words contained in a designated document, users can analyze search results readily from various standpoints.

136 citations

Proceedings Article•DOI•

Lexical chains for question answering

Dan Moldovan, Adrian Novischi

24 Aug 2002

TL;DR: Topical relations expressed as lexical chains on extended WordNet improve the performance of a question answering system by increasing the document retrieval recall and by providing the much needed axioms that link question keywords with answers.

Abstract: The paper presents a method for finding topically related words on an extended WordNet. By exploiting the information in the WordNet glosses, the connectivity between the synsets is dramatically increased. Topical relations expressed as lexical chains on extended WordNet improve the performance of a question answering system by increasing the document retrieval recall and by providing the much needed axioms that link question keywords with answers.

133 citations

Book Chapter•DOI•

The Use of Implicit Evidence for Relevance Feedback in Web Retrieval

Ryen W. White¹, Ian Ruthven², Joemon M. Jose¹•Institutions (2)

University of Glasgow¹, University of Strathclyde²

25 Mar 2002

TL;DR: The research focuses on the degree to which implicit evidence of document relevance can be substituted for explicit evidence in terms of both user opinion and search effectiveness.

Abstract: In this paper we report on the application of two contrasting types of relevance feedback for web retrieval. We compare two systems: one using explicit relevance feedback (where searchers explicitly have to mark documents relevant) and one using implicit relevance feedback (where the system endeavours to estimate relevance by mining the searcher's interaction). The feedback is used to update the display according to the user's interaction. Our research focuses on the degree to which implicit evidence of document relevance can be substituted for explicit evidence. We examine the two variations in terms of both user opinion and search effectiveness.

131 citations

Journal Article•DOI•

The InfoSky visual explorer: exploiting hierarchical structure and document similarities

Keith Andrews¹, Wolfgang Kienreich, Vedran Sabol, Jutta Becker, Georg Droschl, Frank Kappe, Michael Granitzer, Peter Auer¹, Klaus Tochtermann•Institutions (1)

Graz University of Technology¹

01 Dec 2002-Information Visualization

TL;DR: InfoSky is a system enabling users to explore large, hierarchically structured document collections using a planar graphical representation with variable magnification; users can map metadata such as document size or age to attributes of the visualisation such as colour and luminance.

Abstract: InfoSky is a system enabling users to explore large, hierarchically structured document collections. Similar to a real-world telescope, InfoSky employs a planar graphical representation with variable magnification. Documents of similar content are placed close to each other and are visualised as stars, forming clusters with distinct shapes. For greater performance, the hierarchical structure is exploited and force-directed placement is applied recursively at each level on much fewer objects, rather than on the whole corpus. Collections of documents at a particular level in the hierarchy are visualised with bounding polygons using a modified weighted Voronoi diagram. Their area is related to the number of documents contained. Textual labels are displayed dynamically during navigation, adjusting to the visualisation content. Navigation is animated and provides a seamless zooming transition between summary and detail view. Users can map metadata such as document size or age to attributes of the visualisation such as colour and luminance. Queries can be made and matching documents or collections are highlighted. Formative usability testing is ongoing; a small baseline experiment comparing the telescope browser to a tree browser is discussed.

Proceedings Article•DOI•

Biterm language models for document retrieval

Munirathnam Srikanth¹, Rohini K. Srihari¹•Institutions (1)

University at Buffalo¹

11 Aug 2002

TL;DR: Language models for retrieval rank documents by the probability that each document's language model generates the query; smoothed unigram models have outperformed vector space and probabilistic retrieval models, and Song and Croft's general language model, which combined bigram language models with Good-Turing estimates and corpus-based smoothing of unigram probabilities, improved performance further.

Abstract: Statistical Language Models (LM) have been used in many natural language processing tasks including speech recognition and machine translation [5, 2]. Recently language models have been explored as a framework for information retrieval [9, 4, 7, 1, 6]. The basic idea is to view each document as having its own language model and to model querying as a generative process. Documents are ranked based on the probability of their language model generating the given query. Since documents are fixed entities in information retrieval, language models for documents suffer from the sparse data problem. Smoothed unigram models have been used to demonstrate better performance of language models against vector space or probabilistic retrieval models for document retrieval. Song and Croft [10] proposed a general language model that combined bigram language models with Good-Turing estimate and corpus-based smoothing of unigram probabilities. Improved performance was observed with combined bigram language models. The language models explored for information retrieval mimic those used for speech recognition. Specifically, in the bigram model a document d represented as word sequence w1, w2, · · · , wn is modeled as

Patent•

Process and system for sparse vector and matrix representation of document indexing and retrieval

Aric Coady

11 Jan 2002

TL;DR: A new data structure and algorithms that offer at least equal performance in common sparse matrix tasks, and improved performance in many, are applied to a word-document index to produce fast build and query times for document retrieval.

Abstract: A new data structure and algorithms which offer at least equal performance in common sparse matrix tasks, and improved performance in many. This is applied to a word-document index to produce fast build and query times for document retrieval.

Proceedings Article•DOI•

Term-specific smoothing for the language modeling approach to information retrieval: the importance of a query term

Djoerd Hiemstra¹•Institutions (1)

University of Twente¹

11 Aug 2002

TL;DR: The new language modeling approach is shown to explain a number of practical facts of today's information retrieval systems that are not very well explained by the current state of information retrieval theory, including stop words, mandatory terms, coordination level ranking and retrieval using phrases.

Abstract: This paper follows a formal approach to information retrieval based on statistical language models. By introducing some simple reformulations of the basic language modeling approach we introduce the notion of importance of a query term. The importance of a query term is an unknown parameter that explicitly models which of the query terms are generated from the relevant documents (the important terms), and which are not (the unimportant terms). The new language modeling approach is shown to explain a number of practical facts of today's information retrieval systems that are not very well explained by the current state of information retrieval theory, including stop words, mandatory terms, coordination level ranking and retrieval using phrases.
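
One way to write a language model with a per-term importance parameter is sketched below. The notation is an assumption made for illustration and is not quoted from the paper: t_1, …, t_n are the query terms, d is a document, P(t_i) is a collection-wide term probability, P(t_i | d) is the document's term probability, and λ_i in [0, 1] is the importance of term t_i.

```latex
P(t_1, \dots, t_n \mid d) \;=\; \prod_{i=1}^{n} \Bigl( (1 - \lambda_i)\, P(t_i) \;+\; \lambda_i\, P(t_i \mid d) \Bigr)
```

In this form, λ_i = 0 makes the factor for t_i the same for every document, so the term behaves like a stop word and does not affect the ranking, while λ_i = 1 gives zero probability to any document that lacks t_i, so the term behaves like a mandatory term.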

Journal Article•DOI•

Briefly noted - Mathematical foundations of information retrieval

Sandor Dominich¹•Institutions (1)

University of Pannonia¹

01 Mar 2002-Computational Linguistics

TL;DR: A brief notice of the book Mathematical Foundations of Information Retrieval, whose chapters cover a mathematics handbook, information retrieval models, the mathematical theory of information retrieval, and relevance effectiveness in information retrieval.

Abstract: Acknowledgments. Preface. 1. Introduction. 2. Mathematics Handbook. 3. Information Retrieval Models. 4. Mathematical Theory of Information Retrieval. 5. Relevance Effectiveness in Information Retrieval. 6. Further Topics in Information Retrieval. Appendices. References. Index.

Patent•

Distributed document retrieval method and device, and distributed document retrieval program and recording medium recording the program

Inaba Mitsuaki¹, Yuji Kanno¹•Institutions (1)

Panasonic¹

26 Mar 2002

TL;DR: A distributed document retrieval method in which plural retrieval servers each perform document retrieval for a database storing plural documents, and an integrating retrieval server, connected to the plural retrieval servers over a communication link, issues retrieval orders to the retrieval servers.

Abstract: A distributed document retrieval method for performing document retrieval by plural retrieval servers that each perform document retrieval for a database storing plural documents, and an integrating retrieval server that is connected to the plural retrieval servers over communication and issues retrieval orders to the retrieval servers, wherein each retrieval server delivers statistical information created based on intermediate results obtained by retrieval operation to the integrating retrieval server, the integrating retrieval server compiles the statistical information to create global statistical information and delivers it to each retrieval server, and each retrieval server calculates scores based on the global statistical information and sends retrieval results matching retrieval conditions back to the integrating retrieval server. By the above described operation, efficient and correct ranking among retrieval documents is achieved with improved document retrieval quality.
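
The exchange of statistics described above can be sketched as follows. This is an illustration under stated assumptions, not the patented method: the statistics are taken to be document counts and per-term document frequencies, the score is a simple tf-idf sum, and the "retrieval condition" is taken to be matching at least one query term. The class names `RetrievalServer` and `IntegratingServer` and their methods are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


class RetrievalServer:
    """Holds one shard of the collection (doc_id -> tokens)."""

    def __init__(self, docs: Dict[str, List[str]]) -> None:
        self.docs = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

    def local_stats(self, query: List[str]) -> Tuple[int, Dict[str, int]]:
        """Step 1: report the local document count and document frequencies."""
        df = {t: sum(1 for tf in self.docs.values() if t in tf) for t in set(query)}
        return len(self.docs), df

    def search(self, query: List[str], global_n: int,
               global_df: Dict[str, int]) -> List[Tuple[str, float]]:
        """Step 3: score local documents using the global statistics."""
        results = []
        for doc_id, tf in self.docs.items():
            score = sum(tf[t] * math.log(1.0 + global_n / global_df[t])
                        for t in set(query) if tf[t] > 0)
            if score > 0:
                results.append((doc_id, score))
        return results


class IntegratingServer:
    """Compiles global statistics and merges the servers' results."""

    def __init__(self, servers: List[RetrievalServer]) -> None:
        self.servers = servers

    def search(self, query: List[str], top_k: int = 10) -> List[Tuple[str, float]]:
        global_n = 0
        global_df: Counter = Counter()
        for server in self.servers:            # step 1: gather local statistics
            n, df = server.local_stats(query)
            global_n += n
            global_df.update(df)               # step 2: compile global statistics
        merged: List[Tuple[str, float]] = []
        for server in self.servers:            # step 3: score with global statistics
            merged.extend(server.search(query, global_n, global_df))
        merged.sort(key=lambda r: (-r[1], r[0]))
        return merged[:top_k]
```

Because every server scores with the same global statistics, scores from different servers are directly comparable, and the merged ranking matches what a single server holding all documents would produce under this scoring.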

Book•

Natural language processing for online applications

Peter Jackson¹, Isabelle Moulinier¹•Institutions (1)

Thomson Corporation¹

01 Jan 2002

TL;DR: This book discusses the technologies of document retrieval, information extraction, and text categorization in a way which highlights commonalities in terms of both general principles and practical concerns.

Abstract: This text covers the technologies of document retrieval, information extraction, and text categorization in a way which highlights commonalities in terms of both general principles and practical concerns. It assumes some mathematical background on the part of the reader, but the chapters typically begin with a non-mathematical account of the key issues. Current research topics are covered only to the extent that they are informing current applications; detailed coverage of longer term research and more theoretical treatments should be sought elsewhere. There are many pointers at the ends of the chapters that the reader can follow to explore the literature. However, the book does maintain a strong emphasis on evaluation in every chapter both in terms of methodology and the results of controlled experimentation.

Journal Article•DOI•

Assessing bias in search engines

Abbe Mowshowitz¹, Akira Kawaguchi¹•Institutions (1)

City College of New York¹

01 Jan 2002-Information Processing and Management

TL;DR: This paper deals with the measurement of bias in search engines on the World Wide Web by measuring the deviation from the ideal of the distribution produced by a particular search engine.

Abstract: This paper deals with the measurement of bias in search engines on the World Wide Web. Bias is taken to mean the balance and representativeness of items in a collection retrieved from a database for a set of queries. This calls for assessing the degree to which the distribution of items in a collection deviates from the ideal. Ascertaining this ideal poses problems similar to those associated with determining relevance in the measurement of recall and precision. Instead of enlisting subject experts or users to determine such an ideal, a family of comparable search engines is used to approximate it for a set of queries. The distribution is obtained by computing the frequencies of occurrence of the uniform resource locators (URLs) in the collection retrieved by several search engines for the given queries. Bias is assessed by measuring the deviation from the ideal of the distribution produced by a particular search engine.
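
The measurement described above can be sketched in a few lines. The abstract does not say how the deviation is computed, so the use of one minus cosine similarity between URL-frequency vectors is an assumption made for illustration, as are the function names and the input layout (engine name mapped to query mapped to returned URLs).

```python
import math
from collections import Counter
from typing import Dict, List

Results = Dict[str, List[str]]  # query -> URLs returned for that query


def url_frequencies(results: Results) -> Counter:
    """Count how often each URL appears across all queries."""
    return Counter(url for urls in results.values() for url in urls)


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[u] * b[u] for u in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def bias(engine: str, engines: Dict[str, Results]) -> float:
    """Deviation of one engine's URL distribution from the pooled distribution.

    The pooled URL frequencies over all engines stand in for the ideal.
    Returns 0 for identical distributions and approaches 1 for disjoint ones.
    """
    pooled: Counter = Counter()
    for results in engines.values():
        pooled.update(url_frequencies(results))
    return 1.0 - cosine(url_frequencies(engines[engine]), pooled)
```

For instance, if two engines return the same two URLs for a query and a third returns two different URLs, the third engine's distribution lies further from the pool and receives the larger bias value.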

Journal Article•DOI•

Statistical correlation analysis in image retrieval

Mingjing Li¹, Zheng Chen¹, Hong-Jiang Zhang¹•Institutions (1)

Microsoft¹

01 Dec 2002-Pattern Recognition

TL;DR: A statistical correlation model for image retrieval captures the semantic relationships among images in a database from simple statistics of user-provided relevance feedback information and can be efficiently integrated into an image retrieval system to help improve the retrieval performance.

Proceedings Article•DOI•

Building a latent semantic index of an image database from patterns of relevance feedback

Douglas R. Heisterkamp¹•Institutions (1)

Oklahoma State University–Stillwater¹

11 Aug 2002

TL;DR: The view presented in this paper is that the fundamental vocabulary of the system is the images in the database and that relevance feedback is a document whose words are the images; such a document expresses the semantic intent of the user over that query.

Abstract: This paper proposes a novel view of the information generated by relevance feedback. Latent semantic analysis is adapted to this view to extract useful inter-query information. The view presented in this paper is that the fundamental vocabulary of the system is the images in the database and that relevance feedback is a document whose words are the images. A relevance feedback document contains the intra-query information which expresses the semantic intent of the user over that query. The inter-query information then takes the form of a collection of documents which can be subjected to latent semantic analysis. An algorithm to query the latent semantic index is presented and evaluated against real data sets.

Proceedings Article•

Free-text medical document retrieval via phrase-based vector space model.

Wenlei Mao¹, Wesley W. Chu•Institutions (1)

University of California, Los Angeles¹

01 Jan 2002

TL;DR: This work proposes representing documents using phrases in a vector space model, and shows that the phrase-based VSM yields a 16% increase in retrieval accuracy compared to the stem-based model.

Abstract: Many information retrieval systems are based on the vector space model (VSM), which represents a document as a vector of index terms. Concepts have been proposed to replace word stems as the index terms to improve retrieval accuracy. However, past research revealed that such systems did not outperform the traditional stem-based systems. Incorporating conceptual similarity derived from knowledge sources should have the potential to improve retrieval accuracy. Yet the incompleteness of the knowledge source precludes significant improvement. To remedy this problem, we propose to represent documents using phrases. A phrase consists of multiple concepts and word stems. The similarity between two phrases is jointly determined by their conceptual similarity and their common word stems. The document similarity can in turn be derived from phrase similarities. Using OHSUMED as a test collection and UMLS as the knowledge source, our experimental results reveal that the phrase-based VSM yields a 16% increase in retrieval accuracy compared to the stem-based model.

Book Chapter•DOI•

Anonymizing Censorship Resistant Systems

Andrei Serjantov¹•Institutions (1)

University of Cambridge¹

07 Mar 2002

TL;DR: The key idea is to separate the role of document storers from the machines visible to the users, which makes each individual part of the system less prone to attacks, and therefore to censorship.

Abstract: In this paper we propose a new Peer-to-Peer architecture for a censorship resistant system with user, server and active-server document anonymity as well as efficient document retrieval. The retrieval service is layered on top of an existing Peer-to-Peer infrastructure, which should facilitate its implementation. The key idea is to separate the role of document storers from the machines visible to the users, which makes each individual part of the system less prone to attacks, and therefore to censorship.

Proceedings Article•DOI•

Lexical query paraphrasing for document retrieval

Ingrid Zukerman¹, Bhavani Raskutti²•Institutions (2)

Monash University, Clayton campus¹, Telstra²

24 Aug 2002

TL;DR: This work describes a mechanism for the generation of lexical paraphrases of queries posed to an Internet resource using WordNet and part-of-speech information to propose synonyms for the content words in the queries and evaluates its mechanism using 404 queries.

Abstract: We describe a mechanism for the generation of lexical paraphrases of queries posed to an Internet resource. These paraphrases are generated using WordNet and part-of-speech information to propose synonyms for the content words in the queries. Statistical information, obtained from a corpus, is then used to rank the paraphrases. We evaluated our mechanism using 404 queries whose answers reside in the LA Times subset of the TREC-9 corpus. There was a 14% improvement in performance when paraphrases were used for document retrieval.

Patent•

System and method for generating and retrieving different document layouts from a given content

Gregory T. Brown¹, Thomas Anthony Cofino¹, Yurdaer N. Doganata¹, Youssef Drissi¹, Tong-haing Fin¹, Moon J. Kim¹, Lev Kozakov¹, John W. Miller¹•Institutions (1)

IBM¹

03 Jun 2002

TL;DR: A document search and retrieval system, and program product therefor, is described in which search requests are provided to the system through a user interface and a document decomposer decomposes documents into individual document components.

Abstract: A document search and retrieval system and program product therefor. Search requests are provided to the system through a user interface. A document decomposer decomposes documents into individual document components. Document components and corresponding searchable indices for each are stored in a Component Library. A search unit searches stored document components responsive to search queries. A results validator compares document hitlists with a document type identified in a search query to select valid hitlist entries for a final hitlist. A document view assembly module collects identified document components and assembles them into a document for viewing at the user interface.

Patent•

Speech input search system

Atsushi Fujii, Ito Katsunobu, Tetsuya Ishikawa, Tomoyoshi Akiba

22 Jul 2002

TL;DR: A language model (114) for speech recognition is created from a text database (122) by offline modeling processing (130). When a user speaks a search request, an acoustic model and the language model are used to perform speech recognition processing and a transcript is created.

Abstract: A language model (114) is created for speech recognition from a text database (122) by offline modeling processing (130) (solid line arrows). In online processing, when a user speaks a search request, an acoustic model (112) and the language model (114) are used to perform speech recognition processing (110), and a transcript is created. Next, using the transcribed search request, text search processing (120) is performed and the search results are output in descending order of correlation.

Proceedings Article•DOI•

Exploiting contextual change in context-aware retrieval

Peter J. Brown¹, Gareth J. F. Jones¹•Institutions (1)

University of Exeter¹

11 Mar 2002

TL;DR: New methods based on a context-diary and caching aimed at improving both the precision of relevant retrieved information and the speed/availability of retrieval are suggested.

Abstract: Information retrieval systems are usually unaware of the context in which they are being used. We believe that exploiting context information to augment existing retrieval methods can lead to increased retrieval precision. This approach is particularly important with the development of wireless mobile information appliances, such as PDAs. Many of these devices are aware of the user's physical context, and this has led to the evolution of context-aware applications. Such applications can automatically utilise the user's current context, e.g. location or ambient temperature. Context-Aware Retrieval is related to traditional Information Retrieval and Information Filtering, but is potentially more challenging due to the often continuous changes in user context. To meet these challenges we suggest a potential advantage of Context-Aware Retrieval: the current context often changes gradually and semi-predictably. In this paper we suggest new methods based on a context-diary and caching aimed at improving both the precision of relevant retrieved information and the speed/availability of retrieval. The methods can be used, in principle, on top of existing retrieval systems.

Journal Article•DOI•

On recommending

[...]

Jonathan Furner¹•Institutions (1)

University of California, Los Angeles¹

02 Aug 2002-Journal of the Association for Information Science and Technology

TL;DR: The ERIn (Evaluation-Recommendation-Information) model, a decision-theoretic framework for understanding information-related activity, highlights the centrality of recommending in the document retrieval process, and may be used to clarify the respects in which indexing, rating, and citation may be considered analogous.

Abstract: The core of any document retrieval system is a mechanism that ranks the documents in a large collection in order of the likelihood with which they match the preferences of any person who interacts with the system. Given a broader interpretation of recommending than is commonly accepted, such a preference ordering may be viewed as a recommendation, made by the system to the information seeker, that is itself typically derived through synthesis of multiple preference orderings expressed as recommendations by indexers, information seekers, and document authors. The ERIn (Evaluation-Recommendation-Information) model, a decision-theoretic framework for understanding information-related activity, highlights the centrality of recommending in the document retrieval process, and may be used to clarify the respects in which indexing, rating, and citation may be considered analogous, as well as to make explicit the points at which content-based, collaboration-based, and context-based flavors of document retrieval systems vary.

Journal Article•DOI•

Text Retrieval Using Self-Organized Document Maps

[...]

Krista Lagus¹•Institutions (1)

Helsinki University of Technology¹

01 Feb 2002-Neural Processing Letters

TL;DR: This article describes how a document map that is automatically organized for browsing and visualization can be successfully utilized also in speeding up document retrieval and shows significantly improved performance compared to Salton's vector space model.

Abstract: A map of text documents arranged using the Self-Organizing Map (SOM) algorithm (1) is organized in a meaningful manner so that items with similar content appear at nearby locations of the 2-dimensional map display, and (2) clusters the data, resulting in an approximate model of the data distribution in the high-dimensional document space. This article describes how a document map that is automatically organized for browsing and visualization can be successfully utilized also in speeding up document retrieval. Furthermore, experiments on the well-known CISI collection show significantly improved performance compared to Salton's vector space model, measured by average precision (AP) when retrieving a small, fixed number of best documents. Regarding comparison with Latent Semantic Indexing the results are inconclusive.
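
The speed-up idea in this abstract can be illustrated with a minimal sketch. It assumes a map has already been trained, so that each map unit has a prototype vector and a list of assigned documents. The function names, the parameters `n_units` and `n_docs`, the use of cosine similarity, and the toy data are all illustrative assumptions, not details taken from the article.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_via_map(query, unit_prototypes, unit_docs, doc_vectors,
                     n_units=2, n_docs=3):
    """Rank only the documents assigned to the map units closest to the query.

    unit_prototypes: list of prototype vectors, one per map unit (assumed given)
    unit_docs: list of document-id lists, parallel to unit_prototypes
    doc_vectors: dict mapping document id to its vector
    """
    # Step 1: compare the query with the (few) unit prototypes, not all documents.
    best_units = sorted(
        range(len(unit_prototypes)),
        key=lambda u: cosine(query, unit_prototypes[u]),
        reverse=True,
    )[:n_units]
    # Step 2: collect candidates from the selected units only.
    candidates = [d for u in best_units for d in unit_docs[u]]
    # Step 3: rank the candidates by similarity to the query.
    ranked = sorted(candidates, key=lambda d: cosine(query, doc_vectors[d]),
                    reverse=True)
    return ranked[:n_docs]

# Toy data (hypothetical): three map units over a 3-term vocabulary.
unit_prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
unit_docs = [["d1", "d2"], ["d3"], ["d4"]]
doc_vectors = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.7, 0.3, 0.0],
    "d3": [0.1, 0.9, 0.0],
    "d4": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]
top = retrieve_via_map(query, unit_prototypes, unit_docs, doc_vectors)
```

The saving comes from Step 1: the query is compared against the unit prototypes first, and full document comparisons are limited to the units that survive. Whether this matches the article's exact procedure is not stated in the abstract.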

Book Chapter•DOI•

Long-Term Learning for Web Search Engines

[...]

Charles Kemp¹, Kotagiri Ramamohanarao¹•Institutions (1)

University of Melbourne¹

19 Aug 2002

TL;DR: It is shown how a rigorous evaluation of Document Transformation can be carried out using the referer logs kept by web servers, and a new strategy for Document Transformation is described that is suitable for long-term incremental learning.

Abstract: This paper considers how web search engines can learn from the successful searches recorded in their user logs. Document Transformation is a feasible approach that uses these logs to improve document representations. Existing test collections do not allow an adequate investigation of Document Transformation, but we show how a rigorous evaluation of this method can be carried out using the referer logs kept by web servers. We also describe a new strategy for Document Transformation that is suitable for long-term incremental learning. Our experiments show that Document Transformation improves retrieval performance over a medium sized collection of webpages. Commercial search engines may be able to achieve similar improvements by incorporating this approach.
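
As a rough illustration of the general idea of improving a document representation from logged successful searches, the sketch below adds weight to a document's terms for each query that led a user to it. The additive update, the rate `beta`, and the function names are assumptions made for illustration; the abstract does not specify the chapter's actual strategy.

```python
from collections import Counter

def transform_document(doc_weights, query_terms, beta=0.5):
    """Return a new term-weight dict nudged toward a successful query.

    doc_weights: dict mapping term to weight for one document
    query_terms: terms of a logged query that led to this document
    beta: hypothetical learning rate (an assumption, not from the source)
    """
    updated = Counter(doc_weights)  # copy, so the original is not mutated
    for term in query_terms:
        updated[term] += beta
    return dict(updated)

def score(query_terms, doc_weights):
    """Simple dot-product score with binary query weights."""
    return sum(doc_weights.get(term, 0.0) for term in query_terms)

# Toy example: the page never mentions "engine", but a logged search did.
page = {"search": 1.0}
learned = transform_document(page, ["search", "engine"])
before = score(["engine"], page)
after = score(["engine"], learned)
```

After the update, the page can be matched by a term that appears only in the logged query, which is the kind of effect such logs are meant to provide.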

Book Chapter•DOI•

The Accessibility Dimension for Structured Document Retrieval

[...]

Thomas Rölleke¹, Mounia Lalmas¹, Gabriella Kazai¹, Ian Ruthven², Stefan Quicker•Institutions (2)

Queen Mary University of London¹, University of Strathclyde²

25 Mar 2002

TL;DR: This paper reports on an investigation of a tf-idf-acc approach, where tf and idf are the classical term frequency and inverse document frequency, and acc is a new parameter, called accessibility, that captures the structure of documents.

Abstract: Structured document retrieval aims at retrieving the document components that best satisfy a query, instead of merely retrieving pre-defined document units. This paper reports on an investigation of a tf-idf-acc approach, where tf and idf are the classical term frequency and inverse document frequency, and acc is a new parameter, called accessibility, that captures the structure of documents. The tf-idf-acc approach is defined using a probabilistic relational algebra. To investigate the retrieval quality and estimate the acc values, we developed a method that automatically constructs diverse test collections of structured documents from a standard test collection, with which experiments were carried out. The analysis of the experiments provides estimates of the acc values.
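
One possible reading of how a structural parameter could enter a tf-idf weight is sketched below. It assumes that a component's term frequency is augmented by its sub-components' frequencies, each discounted by a single `acc` value. This is a simplified illustration only: the paper defines its approach in a probabilistic relational algebra, and the tree layout, the function names, and the additive form here are assumptions.

```python
def augmented_tf(node, term, acc):
    """Term frequency of a component plus acc-discounted child frequencies.

    node: dict with "terms" (list of str) and optional "children" (list of nodes)
    acc: hypothetical accessibility weight in [0, 1]
    """
    own = node["terms"].count(term)
    inherited = sum(acc * augmented_tf(child, term, acc)
                    for child in node.get("children", []))
    return own + inherited

def weight(node, term, idf, acc):
    """tf-idf style weight using the augmented term frequency.

    idf: dict mapping term to a precomputed inverse document frequency
    """
    return augmented_tf(node, term, acc) * idf.get(term, 0.0)

# Toy structured document: a section containing one paragraph.
paragraph = {"terms": ["retrieval", "retrieval"]}
section = {"terms": ["retrieval"], "children": [paragraph]}
idf = {"retrieval": 2.0}  # hypothetical value

w_open = weight(section, "retrieval", idf, acc=0.5)
w_closed = weight(section, "retrieval", idf, acc=0.0)
```

With `acc=0.0` the section is scored on its own text alone; larger values let the content of nested components count toward their parent, which is one way document structure can influence ranking.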
