
Showing papers on "Human–computer information retrieval published in 2003"


Journal ArticleDOI
28 Jul 2003
TL;DR: The design and evaluation of a system, called Stuff I've Seen (SIS), that facilitates information re-use and provides a unified index of information that a person has seen, whether it was seen as email, web page, document, appointment, etc.
Abstract: Most information retrieval technologies are designed to facilitate information discovery. However, much knowledge work involves finding and re-using previously seen information. We describe the design and evaluation of a system, called Stuff I've Seen (SIS), that facilitates information re-use. This is accomplished in two ways. First, the system provides a unified index of information that a person has seen, whether it was seen as email, web page, document, appointment, etc. Second, because the information has been seen before, rich contextual cues can be used in the search interface. The system has been used internally by more than 230 employees. We report on both qualitative and quantitative aspects of system use. Initial findings show that time and people are important retrieval cues. Users find information more easily using SIS, and use other search tools less frequently after installation.

887 citations



Journal ArticleDOI
TL;DR: The results show that individual computer experience, quality of search systems, motivation, and perceptions of technology acceptance are all key factors that affect individuals' willingness to use search engines as an information retrieval tool.

342 citations


Proceedings ArticleDOI
08 Sep 2003
TL;DR: Initial results suggest that the algorithms presented can retrieve a significantly higher percentage of the links than analysts can, even when the analysts use existing tools, and do so in much less time while achieving comparable signal-to-noise levels.
Abstract: We present an approach for improving requirements tracing based on framing it as an information retrieval (IR) problem. Specifically, we focus on improving recall and precision in order to reduce the number of missed traceability links as well as to reduce the number of irrelevant potential links that an analyst has to examine when performing requirements tracing. Several IR algorithms were adapted and implemented to address this problem. We evaluated our algorithms by comparing their results and performance to those of a senior analyst who traced manually as well as with an existing requirements tracing tool. Initial results suggest that we can retrieve a significantly higher percentage of the links than analysts can, even when the analysts use existing tools, and do so in much less time while achieving comparable signal-to-noise levels.
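The recall/precision framing described above can be made concrete with a minimal sketch. This is not the authors' system; the scoring function, data, and threshold are all invented for illustration:

```python
# Toy sketch: score requirement-to-design links by term overlap, then
# measure recall and precision against a known answer set (all data invented).

def link_score(req_terms, elem_terms):
    """Jaccard overlap between a requirement's terms and an element's terms."""
    req, elem = set(req_terms), set(elem_terms)
    return len(req & elem) / len(req | elem) if req | elem else 0.0

def trace(requirements, elements, threshold=0.2):
    """Return candidate (req_id, elem_id) links whose score clears the threshold."""
    return {(r, e)
            for r, rt in requirements.items()
            for e, et in elements.items()
            if link_score(rt, et) >= threshold}

requirements = {"R1": ["login", "password", "auth"],
                "R2": ["export", "report", "pdf"]}
elements = {"D1": ["auth", "login", "session"],
            "D2": ["pdf", "report", "layout"],
            "D3": ["cache", "memory"]}

candidates = trace(requirements, elements)
true_links = {("R1", "D1"), ("R2", "D2")}  # hypothetical ground truth
recall = len(candidates & true_links) / len(true_links)
precision = len(candidates & true_links) / len(candidates) if candidates else 0.0
```

Raising the threshold trades recall for precision; the paper's contribution is tuning that trade-off so the analyst sees fewer irrelevant candidates while missing fewer true links.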

296 citations


Journal ArticleDOI
01 Apr 2003
TL;DR: This report summarizes a discussion of IR research challenges that took place at a recent workshop; contextual retrieval and global information access were identified as particularly important long-term challenges.
Abstract: Information retrieval (IR) research has reached a point where it is appropriate to assess progress and to define a research agenda for the next five to ten years. This report summarizes a discussion of IR research challenges that took place at a recent workshop. The attendees of the workshop considered information retrieval research in a range of areas chosen to give broad coverage of topic areas that engage information retrieval researchers. Those areas are retrieval models, cross-lingual retrieval, Web search, user modeling, filtering, topic detection and tracking, classification, summarization, question answering, metasearch, distributed retrieval, multimedia retrieval, information extraction, as well as testbed requirements for future work. The potential use of language modeling techniques in these areas was also discussed. The workshop identified major challenges within each of those areas. The following are recurring themes that ran throughout:
• User and context sensitive retrieval
• Multi-lingual and multi-media issues
• Better target tasks
• Improved objective evaluations
• Substantially more labeled data
• Greater variety of data sources
• Improved formal models
Contextual retrieval and global information access were identified as particularly important long-term challenges.

240 citations


Book ChapterDOI
24 Jul 2003
TL;DR: The results are encouraging, indicating that pseudo-relevance feedback shows great promise for multimedia retrieval with very varied and errorful data.
Abstract: We present an algorithm for video retrieval that fuses the decisions of multiple retrieval agents in both text and image modalities. While the normalization and combination of evidence is novel, this paper emphasizes the successful use of negative pseudorelevance feedback to improve image retrieval performance. Although we have not solved all problems in video information retrieval, the results are encouraging, indicating that pseudo-relevance feedback shows great promise for multimedia retrieval with very varied and errorful data.

229 citations


Proceedings ArticleDOI
28 Jul 2003
TL;DR: This study explores the development and subsequent evaluation of a statistical word sense disambiguation system which demonstrates increased precision from a sense-based vector space retrieval model over traditional TF*IDF techniques.
Abstract: Word sense ambiguity is recognized as having a detrimental effect on the precision of information retrieval systems in general and web search systems in particular, due to the sparse nature of the queries involved. Despite continued research into the application of automated word sense disambiguation, the question remains as to whether less than 90% accurate automated word sense disambiguation can lead to improvements in retrieval effectiveness. In this study we explore the development and subsequent evaluation of a statistical word sense disambiguation system which demonstrates increased precision from a sense-based vector space retrieval model over traditional TF*IDF techniques.
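The TF*IDF baseline the paper compares against can be sketched in a few lines. The collection and query below are invented (note "bank" is the classic ambiguous term); this is an illustration of the baseline, not the authors' sense-based model:

```python
# Minimal TF*IDF ranking sketch (illustrative toy collection).
import math
from collections import Counter

docs = {"d1": "bank river water bank",
        "d2": "bank loan interest money",
        "d3": "river fishing water"}

tokenized = {d: t.split() for d, t in docs.items()}
N = len(tokenized)

def idf(term):
    """Inverse document frequency: log(N / document frequency)."""
    df = sum(1 for toks in tokenized.values() if term in toks)
    return math.log(N / df) if df else 0.0

def tfidf_score(query, doc_id):
    """Sum of tf * idf over the query terms."""
    tf = Counter(tokenized[doc_id])
    return sum(tf[t] * idf(t) for t in query.split())

ranked = sorted(tokenized, key=lambda d: tfidf_score("bank money", d), reverse=True)
```

Because TF*IDF treats "bank" (river) and "bank" (finance) as the same token, a sense-tagged index can in principle separate them, which is exactly the gain the study measures.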

211 citations


Patent
30 Jul 2003
TL;DR: In this article, an information retrieval system for automatically retrieving information related to the context of an active task being manipulated by a user is presented, where the system observes the operation of the active task and user interactions and utilizes predetermined criteria to generate a context representation.
Abstract: An information retrieval system for automatically retrieving information related to the context of an active task being manipulated by a user. The system observes the operation of the active task and user interactions, and applies predetermined criteria to generate a representation of those aspects of the active task that are relevant to its context. The information retrieval system then processes the context representation to generate queries or search terms for conducting an information search. The information retrieval system reorders the terms in a query so that they occur in a meaningful order, as they naturally occur in a document or active task being manipulated by the user. Furthermore, the information retrieval system may access a user profile to retrieve information related to the user, and then select information sources or transform search terms based on attributes related to the user, such as the user's occupation, position in a company, major in school, etc.

206 citations



Proceedings Article
01 Jan 2003
TL;DR: NLP needs to be optimized for IR in order to be effective and document retrieval is not an ideal application for NLP, at least given the current state-of-the-art in NLP.
Abstract: Many Natural Language Processing (NLP) techniques have been used in Information Retrieval. The results are not encouraging. Simple methods (stopwording, porter-style stemming, etc.) usually yield significant improvements, while higher-level processing (chunking, parsing, word sense disambiguation, etc.) only yield very small improvements or even a decrease in accuracy. At the same time, higher-level methods increase the processing and storage cost dramatically. This makes them hard to use on large collections. We review NLP techniques and come to the conclusion that (a) NLP needs to be optimized for IR in order to be effective and (b) document retrieval is not an ideal application for NLP, at least given the current state-of-the-art in NLP. Other IR-related tasks, e.g., question answering and information extraction, seem to be better suited.

156 citations


Proceedings ArticleDOI
03 Nov 2003
TL;DR: This work proposes a new method of obtaining expansion terms, based on selecting terms from past user queries that are associated with documents in the collection, that is effective for query expansion for web retrieval.
Abstract: Hundreds of millions of users each day use web search engines to meet their information needs. Advances in web search effectiveness are therefore perhaps the most significant public outcomes of IR research. Query expansion is one such method for improving the effectiveness of ranked retrieval by adding additional terms to a query. In previous approaches to query expansion, the additional terms are selected from highly ranked documents returned from an initial retrieval run. We propose a new method of obtaining expansion terms, based on selecting terms from past user queries that are associated with documents in the collection. Our scheme is effective for query expansion for web retrieval: our results show relative improvements over unexpanded full text retrieval of 26%--29%, and 18%--20% over an optimised, conventional expansion approach.
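The core idea — expand a query with terms from past queries that led to the same documents — can be sketched as follows. The query log, document associations, and selection rule are all invented for illustration, not taken from the paper:

```python
# Sketch: expand a new query using terms from past queries associated with
# the same documents (hypothetical log data).
from collections import Counter

# Past query log: query text -> documents users reached from it (illustrative).
query_log = {"cheap flights europe": {"d1", "d2"},
             "budget airline tickets": {"d1"},
             "europe rail passes": {"d3"}}

def expand(query, k=2):
    """Add the k most frequent new terms from past queries sharing documents."""
    q_terms = set(query.split())
    # Documents associated with past queries that share a term with the new query.
    docs = set()
    for past, assoc in query_log.items():
        if q_terms & set(past.split()):
            docs |= assoc
    # Count terms from all past queries touching those documents.
    counts = Counter(t for past, assoc in query_log.items() if assoc & docs
                     for t in past.split() if t not in q_terms)
    return query.split() + [t for t, _ in counts.most_common(k)]
```

Unlike conventional pseudo-relevance feedback, the expansion terms here come from other users' query vocabulary rather than from document text, which is the paper's key departure from prior work.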

Proceedings Article
01 Jun 2003
TL;DR: ODISSEA as discussed by the authors is a P2P-based search engine for massive document collections that uses a two-tier search engine architecture and a global index structure distributed over the nodes of the system.
Abstract: We consider the problem of building a P2P-based search engine for massive document collections. We describe a prototype system called ODISSEA (Open DIStributed Search Engine Architecture) that is currently under development in our group. ODISSEA provides a highly distributed global indexing and query execution service that can be used for content residing inside or outside of a P2P network. ODISSEA is different from many other approaches to P2P search in that it assumes a two-tier search engine architecture and a global index structure distributed over the nodes of the system. We give an overview of the proposed system and discuss the basic design choices. Our main focus is on efficient query execution, and we discuss how recent work on top-k queries in the database community can be applied in a highly distributed environment. We also give some preliminary simulation results on a real search engine log and a terabyte-size web page collection that indicate good scalability for our approach. Project homepage: http://cis.poly.edu/westlab/odissea/. A preliminary version of this paper appeared at the International Workshop on the Web and Databases, June 2003. Contact author. Email: suel@poly.edu

Proceedings ArticleDOI
09 Nov 2003
TL;DR: Field studies of information gathering in two design teams that had very different products, disciplinary backgrounds, and tools found striking similarities in the kinds of information they sought and the methods used to get it.
Abstract: Information retrieval is generally considered an individual activity, and information retrieval research and tools reflect this view. As digitally mediated communication and information sharing increase, collaborative information retrieval merits greater attention and support. We describe field studies of information gathering in two design teams that had very different products, disciplinary backgrounds, and tools. We found striking similarities in the kinds of information they sought and the methods used to get it. For example, each team sought information about design constraints from external sources. A common strategy was to propose ideas and request feedback, rather than to ask directly for recommendations. Some differences in information seeking and sharing reflected differences in work contexts. Our findings suggest some ways that existing team collaboration tools could support collaborative information retrieval more effectively.

Proceedings ArticleDOI
02 Nov 2003
TL;DR: This work presents a novel approach that uses pseudo-relevance feedback from retrieved items that are NOT similar to the query items, without requiring further user feedback, and suggests a score combination scheme via posterior probability estimation.
Abstract: Video information retrieval requires a system to find information relevant to a query which may be represented simultaneously in different ways through a text description, audio, still images and/or video sequences. We present a novel approach that uses pseudo-relevance feedback from retrieved items that are NOT similar to the query items, without requiring further user feedback. We provide insight into this approach using a statistical model and suggest a score combination scheme via posterior probability estimation. An evaluation on the 2002 TREC Video Track queries shows that this technique can improve video retrieval performance on a real collection. We believe that negative pseudo-relevance feedback shows great promise for very difficult multimedia retrieval tasks, especially when combined with other different retrieval algorithms.
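The mechanics of negative pseudo-relevance feedback can be sketched simply: treat the lowest-ranked results as confirmed non-relevant, then penalize anything similar to them. The scoring scheme, feature vectors, and penalty weight below are illustrative, not the authors' exact model:

```python
# Sketch of negative pseudo-relevance feedback (all numbers invented):
# penalize items similar to the lowest-ranked ("assumed non-relevant") results.

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def negative_prf(initial, vectors, n_neg=1, alpha=0.5):
    """Re-score each item: original score minus alpha * similarity to bottom-n items."""
    ranked = sorted(initial, key=initial.get, reverse=True)
    negatives = ranked[-n_neg:]  # pseudo non-relevant set
    return {item: score - alpha * max(cosine(vectors[item], vectors[n])
                                      for n in negatives)
            for item, score in initial.items()}

initial = {"a": 0.9, "b": 0.5, "c": 0.2}          # first-pass retrieval scores
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
rescored = negative_prf(initial, vectors)
```

Because no relevant examples are assumed, only the bottom of the ranking is trusted; this is what lets the method run without asking the user for feedback.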

Proceedings Article
01 Aug 2003
TL;DR: It is suggested that retrieval can be tightly bound to inference, which makes today's Web search engines useful to Semantic Web inference engines, and causes improvements in either retrieval or inference to lead directly to improvements in the other.
Abstract: The vision of the Semantic Web is that it will be much like the Web we know today, except that documents will be enriched by annotations in machine understandable markup. These annotations will provide metadata about the documents as well as machine interpretable statements capturing some of the meaning of document content. We discuss how the information retrieval paradigm might be recast in such an environment. We suggest that retrieval can be tightly bound to inference. Doing so makes today's Web search engines useful to Semantic Web inference engines, and causes improvements in either retrieval or inference to lead directly to improvements in the other.

Patent
Ching-Yung Lin1, Apostol Natsev, Milind Naphade1, John R. Smith1, Belle L. Tseng1 
30 Jun 2003
TL;DR: This article concerns the use of search fusion methods for querying multimedia databases, and more specifically a method and system for constructing a multi-modal query of a multimedia repository by forming multiple uni-modal searches and explicitly selecting fusion methods.
Abstract: The present invention relates to the use of search fusion methods for querying multimedia databases and more specifically to a method and system for constructing a multi-modal query of a multimedia repository by forming multiple uni-modal searches and explicitly selecting fusion methods for combining their results. The present invention also relates to the integration of search methods for content-based retrieval, model-based retrieval, text-based retrieval, and metadata search, and the use of graphical user interfaces allowing the user to form queries fusing these search methods.

01 Jan 2003
TL;DR: The approach shows the usefulness of using formal information retrieval models for the task of image annotation and retrieval by assuming that regions in an image can be described using a small vocabulary of blobs.
Abstract: Libraries have traditionally used manual image annotation for indexing and then later retrieving their image collections. However, manual image annotation is an expensive and labor intensive procedure and hence there has been great interest in coming up with automatic ways to retrieve images based on content. Here, we propose an automatic approach to annotating and retrieving images based on a training set of images. We assume that regions in an image can be described using a small vocabulary of blobs. Blobs are generated from image features using clustering. Given a training set of images with annotations, we show that probabilistic models allow us to predict the probability of generating a word given the blobs in an image. This may be used to automatically annotate and retrieve images given a word as a query. We show that relevance models allow us to derive these probabilities in a natural way. Experiments show that the annotation performance of this cross-media relevance model is almost six times as good (in terms of mean precision) as a model based on word-blob co-occurrence, and twice as good as a state of the art model derived from machine translation. Our approach shows the usefulness of using formal information retrieval models for the task of image annotation and retrieval.
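The central quantity — the probability of a word given an image's blobs — can be sketched with a toy estimator. The training pairs and weighting scheme below are invented to illustrate the idea, not the paper's exact relevance-model formulation:

```python
# Toy sketch of cross-media annotation: estimate P(word | blobs) from
# training images pairing blob tokens with annotation words (data invented).

# Training set: each image is (blob tokens, annotation words).
train = [({"b1", "b2"}, {"tiger", "grass"}),
         ({"b2", "b3"}, {"grass", "sky"}),
         ({"b4"}, {"car"})]

def p_word_given_blobs(word, blobs):
    """Average P(word | training image), weighted by how well that image's
    blobs match the query image's blobs."""
    num = den = 0.0
    for img_blobs, words in train:
        weight = len(blobs & img_blobs) / len(blobs) if blobs else 0.0
        num += weight * (1.0 if word in words else 0.0)
        den += weight
    return num / den if den else 0.0
```

Annotation then amounts to ranking the vocabulary by this probability for a new image's blobs, and retrieval ranks images by the probability of the query word.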

Proceedings Article
01 Jan 2003
TL;DR: A query interface exploiting the intuitiveness of natural language is presented for the largest Austrian web-based tourism platform, Tiscover, together with an analysis showing how users formulate queries when their imagination is not limited by conventional search interfaces with structured forms consisting of check boxes, radio buttons and special-purpose text fields.
Abstract: With the increasing amount of information available on the Internet one of the most challenging tasks is to provide search interfaces that are easy to use without having to learn a specific syntax. Hence, we present a query interface exploiting the intuitiveness of natural language for the largest Austrian web-based tourism platform Tiscover. Furthermore, we will describe the results and our insights from analyzing the natural language queries collected during a field trial in which the interface was promoted via the Tiscover homepage. This analysis shows how users formulate queries when their imagination is not limited by conventional search interfaces with structured forms consisting of check boxes, radio buttons and special-purpose text fields. The results of this field test are thus valuable indicators into which direction the web-based tourism information system should be extended to better serve the customers.


Proceedings ArticleDOI
13 Oct 2003
TL;DR: A re-ranking method to improve Web image retrieval by reordering the images retrieved from an image search engine based on a relevance model, which is a probabilistic model that evaluates the relevance of the HTML document linking to the image, and assigns a probability of relevance.
Abstract: Web image retrieval is a challenging task that requires efforts from image processing, link structure analysis, and Web text retrieval. Since content-based image retrieval is still considered very difficult, most current large-scale Web image search engines exploit text and link structure to "understand" the content of the Web images. However, local text information, such as captions, filenames and adjacent text, is not always reliable and informative. Therefore, global information should be taken into account when a Web image retrieval system makes relevance judgments. We propose a re-ranking method to improve Web image retrieval by reordering the images retrieved from an image search engine. The re-ranking process is based on a relevance model, which is a probabilistic model that evaluates the relevance of the HTML document linking to the image, and assigns a probability of relevance. The experimental results show that the re-ranked image retrieval achieves better performance than the original Web image retrieval, suggesting the effectiveness of the re-ranking method. The relevance model is learned from the Internet without preparing any training data and is independent of the underlying algorithm of the image search engines. The re-ranking process should therefore be applicable to any image search engine with little effort.
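The re-ranking step described above reduces to blending the engine's original ordering with a per-image relevance probability computed from the linking HTML page. The combination rule and weights below are a hypothetical sketch, not the paper's model:

```python
# Sketch: re-rank images by combining original rank with a relevance
# probability derived from each image's linking HTML page (weights invented).

def rerank(images, page_relevance, beta=0.7):
    """images: ids in original rank order; page_relevance: id -> P(relevant | page).
    Convert rank position to a score, then blend with the page score."""
    n = len(images)
    rank_score = {img: (n - i) / n for i, img in enumerate(images)}
    combined = {img: (1 - beta) * rank_score[img]
                     + beta * page_relevance.get(img, 0.0)
                for img in images}
    return sorted(images, key=lambda img: combined[img], reverse=True)

reranked = rerank(["i1", "i2", "i3"], {"i1": 0.1, "i2": 0.9, "i3": 0.4})
```

Because the blend only consumes the engine's output ordering, the scheme is agnostic to the engine's internal algorithm, which is what makes it portable across search engines.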

Book ChapterDOI
01 Jan 2003
TL;DR: A simple statistical model for capturing the notion of topical relevance in information retrieval, called a relevance model, is developed and extensive evaluations of the relevance model approach are described on the TREC ad-hoc retrieval and cross-language tasks.
Abstract: We develop a simple statistical model, called a relevance model, for capturing the notion of topical relevance in information retrieval. Estimating probabilities of relevance has been an important part of many previous retrieval models, but we show how this estimation can be done in a more principled way based on a generative or language model approach. In particular, we focus on estimating relevance models when training examples (examples of relevant documents) are not available. We describe extensive evaluations of the relevance model approach on the TREC ad-hoc retrieval and cross-language tasks. In both cases, rankings based on relevance models significantly outperform strong baseline approaches.
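The estimation step described above — building a relevance model when no relevant examples are available — can be sketched by averaging document language models weighted by query likelihood. The toy collection and smoothing constants are invented for the example:

```python
# Sketch of a relevance-model estimate without relevant training examples:
# P(w|R) is approximated by averaging P(w|D) over documents, weighted by
# the query likelihood P(Q|D) (toy collection, crude smoothing).
from collections import Counter

docs = {"d1": "apple fruit pie apple",
        "d2": "apple computer mac",
        "d3": "fruit banana smoothie"}
models = {d: Counter(t.split()) for d, t in docs.items()}

def p_w_given_d(w, d, mu=0.01):
    """Smoothed document language model (crude additive smoothing)."""
    c = models[d]
    return (c[w] + mu) / (sum(c.values()) + mu * 100)

def p_q_given_d(query, d):
    """Query likelihood under document d's language model."""
    p = 1.0
    for t in query.split():
        p *= p_w_given_d(t, d)
    return p

def relevance_model(query, w):
    """P(w|R) ~ sum over documents of P(Q|D)-normalized weight times P(w|D)."""
    weights = {d: p_q_given_d(query, d) for d in docs}
    z = sum(weights.values())
    return sum(weights[d] / z * p_w_given_d(w, d) for d in docs) if z else 0.0
```

Documents that explain the query well dominate the mixture, so words that co-occur with the query terms (here "pie" with "apple fruit") receive high probability even though no relevant document was ever labeled.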

Book ChapterDOI
16 Feb 2003
TL;DR: The time is ripe for the two to meet: NLP has grown out of prototypes and IR is having a hard time trying to improve precision; two examples of possible approaches are considered.
Abstract: It seems the time is ripe for the two to meet: NLP has grown out of prototypes and IR is having a hard time trying to improve precision. Two examples of possible approaches are considered below. Lexware is a lexicon-based system for text analysis of Swedish applied in an information retrieval task. NLIR is an information retrieval system using intensive natural language processing to provide index terms on a higher level of abstraction than stems.

Journal ArticleDOI
TL;DR: This work improves the current Web information retrieval approach by raising the efficiency of information retrieval, enhancing the precision and mobility of information services, and enabling intelligent information services.

01 Jan 2003
TL;DR: The results show that concept-based query enhancement in ARCH leads to significantly higher precision for ambiguous queries without sacrificing recall, as well as comparing enhanced and non-enhanced queries over a range of topics.
Abstract: The effectiveness of Internet search engines is often hampered by the ambiguity of user queries and the reluctance or inability of users to build less ambiguous multi-word queries. Our system, ARCH, is a client-side Web agent, which incorporates domain-specific concept hierarchies together with interactive query formulation in order to automatically produce a richer and therefore less ambiguous query. Unlike traditional relevance feedback methods, ARCH assists users in query modification prior to the search task. ARCH uses the domain knowledge inherent in Web-based classification hierarchies such as Yahoo, combined with a user’s profile information, to add just those terms likely to improve the match with the user’s intent. The goal of the system is therefore to meet the user’s information needs by closing the gap between the user’s stated query and the actual intent of the search. We present a detailed evaluation of the query enhancement in ARCH, comparing enhanced and non-enhanced queries over a range of topics. Our results show that concept-based query enhancement in ARCH leads to significantly higher precision for ambiguous queries without sacrificing recall.

Journal Article
TL;DR: The authors provide a comprehensive description of the specific problems arising in cross-language information retrieval, the solutions proposed in this area, as well as the remaining problems, and a look into the future that draws a strong parallel between query expansion in monolingual IR and query translation in CLIR.
Abstract: Search for information is no longer exclusively limited to the native language of the user, but is more and more extended to other languages. This gives rise to the problem of cross-language information retrieval (CLIR), whose goal is to find relevant information written in a language different from that of the query. In addition to the problems of monolingual information retrieval (IR), translation is the key problem in CLIR: one should translate either the query or the documents from one language to another. However, this translation problem is not identical to full-text machine translation (MT): the goal is not to produce a human-readable translation, but a translation suitable for finding relevant documents. Specific translation methods are thus required. The goal of this book is to provide a comprehensive description of the specific problems arising in CLIR, the solutions proposed in this area, as well as the remaining problems. The book starts with a general description of the monolingual IR and CLIR problems. Different classes of approaches to translation are then presented: approaches using an MT system, dictionary-based translation and approaches based on parallel and comparable corpora. In addition, the typical retrieval effectiveness using different approaches is compared. It will be shown that translation approaches specifically designed for CLIR can rival and outperform high-quality MT systems. Finally, the book offers a look into the future that draws a strong parallel between query expansion in monolingual IR and query translation in CLIR, suggesting that many approaches developed in monolingual IR can be adapted to CLIR. The book can be used as an introduction to CLIR. Advanced readers can also find more technical details and discussions about the remaining research challenges. It is suitable for new researchers who intend to carry out research on CLIR.

BookDOI
01 Jan 2003
TL;DR: Intelligent Web Agents that Learn to Retrieve and Extract Information and a Neural Net Approach to Data Mining: Classification of Users to Aid Information Management are presented.
Abstract: Table of contents:
- Creation and Representation of Web Resources
- Structure Analysis and Generation for Internet Documents
- A Fuzzy System for the Web Page Representation
- Flexible Representation and Retrieval of Web Documents
- Information Retrieval
- Intelligent Information Retrieval on the Web
- Internet as a Challenge to Fuzzy Querying
- Internet Search Based on Text Intuitionistic Fuzzy Similarity
- Content-Based Fuzzy Search in a Multimedia Web Database
- Self-Organizing Maps for Interactive Search in Document Databases
- Methods for Exploratory Cluster Analysis
- Textual Information Retrieval with User Profiles Using Fuzzy Clustering and Inferencing
- Intelligent Clustering as Source of Knowledge for Web Dialogue Manager in an Information Retrieval System
- Document Clustering Using Tolerance Rough Set Model and Its Application to Information Retrieval
- Improving Web Search by the Identification of Contextual Information
- Intelligent Internet-Based Multiagent Systems
- Neural Agent for Text Database Discovery
- Intelligent Web Agents that Learn to Retrieve and Extract Information
- Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach
- Web Browsing Using Machine Learning on Text Data
- Retrieval of Semistructured Web Data
- Intelligent Retrieval of Hypermedia Documents
- Bootstrapping an Ontology-Based Information Extraction System
- Web Data Mining and Use
- Intelligent Web Mining
- A Neural Net Approach to Data Mining: Classification of Users to Aid Information Management
- Web-Based Expert Systems: Information Clients versus Knowledge Servers

Proceedings ArticleDOI
28 Jul 2003
TL;DR: Experiments using the TREC data show that incorporating user query history, as context information, consistently improves the retrieval performance in both average precision and precision at 20 documents.
Abstract: In this poster, we incorporate user query history, as context information, to improve the retrieval performance in interactive retrieval. Experiments using the TREC data show that incorporating such context information indeed consistently improves the retrieval performance in both average precision and precision at 20 documents.
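One common way to use query history as context is to interpolate the current query's term distribution with the session's past queries. The mixing rule, weight, and session data below are an illustrative sketch, not the poster's exact method:

```python
# Sketch: build a context-sensitive query model by mixing the current query
# with terms from the session's query history (weight and data invented).
from collections import Counter

def contextual_query_model(current, history, lam=0.8):
    """Term distribution: lam * current-query model + (1 - lam) * history model."""
    cur = Counter(current.split())
    hist = Counter(t for q in history for t in q.split())
    vocab = set(cur) | set(hist)
    cn, hn = sum(cur.values()), sum(hist.values())
    return {t: lam * cur[t] / cn + (1 - lam) * (hist[t] / hn if hn else 0.0)
            for t in vocab}

model = contextual_query_model("java", ["python tutorial", "programming language"])
```

Here the ambiguous query "java" inherits a small amount of probability mass from programming-related history terms, which is the disambiguating effect that improves precision in interactive retrieval.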

Journal ArticleDOI
TL;DR: This paper explores the challenges of expanding information retrieval (IR) on the Web, in particular handling other types of data, Web mining, and issues related to crawling.

Journal ArticleDOI
01 Apr 2003
TL;DR: A query-categorization approach for categorizing Web query terms from the logs of on-line search services into a predefined subject taxonomy based on their supposed popular search interests is introduced.
Abstract: In this paper, we propose a query-categorization approach to facilitating the engineering process of constructing Web taxonomies. One primary step in taxonomy construction is to acquire the domain-specific terminology terms and the mapping between the subjects and these terms. We introduce a technique for categorizing Web query terms from the logs of on-line search services into a predefined subject taxonomy based on their supposed popular search interests. The obtained experimental results show our technique's effectiveness in reducing the workload of human indexers in constructing Web taxonomies and also show its usefulness in various Web information retrieval applications.

01 Jan 2003
TL;DR: Relevance of document titles to the processing task can be predicted with reasonable accuracy from only a few features, whereas prediction of relevance of specific words will require new features and methods.
Abstract: We investigate whether it is possible to infer from implicit feedback what is relevant for a user in an information retrieval task. Eye movement signals are measured; they are very noisy but potentially contain rich hints about the current state and focus of attention of the user. In the experimental setting relevance is controlled by giving the user a specific search task, and the modeling goal is to predict from eye movements which of the given titles are relevant. We extract a set of standard features from the signal, and explore the data with statistical information visualization methods including standard self-organizing maps (SOMs) and SOMs that learn metrics. Relevance of document titles to the processing task can be predicted with reasonable accuracy from only a few features, whereas prediction of relevance of specific words will require new features and methods.