
Showing papers in "ACM Transactions on Information Systems in 2002"


Journal ArticleDOI
TL;DR: This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position, and test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences.
Abstract: Modern large retrieval environments tend to overwhelm their users by their large output. Since all documents are not of equal relevance to their users, highly relevant documents should be identified and ranked first for presentation. In order to develop IR techniques in this direction, it is necessary to develop evaluation approaches and methods that credit IR methods for their ability to retrieve highly relevant documents. This can be done by extending traditional evaluation methods, that is, recall and precision based on binary relevance judgments, to graded relevance judgments. Alternatively, novel measures based on graded relevance judgments may be developed. This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position. The first one accumulates the relevance scores of retrieved documents along the ranked result list. The second one is similar but applies a discount factor to the relevance scores in order to devaluate late-retrieved documents. The third one computes the relative-to-the-ideal performance of IR techniques, based on the cumulative gain they are able to yield. These novel measures are defined and discussed and their use is demonstrated in a case study using TREC data: sample system run results for 20 queries in TREC-7. As a relevance base we used novel graded relevance judgments on a four-point scale. The test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences. The graphs based on the measures also provide insight into the performance of IR techniques and allow interpretation, for example, from the user point of view.

4,337 citations
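
As a rough illustration of the three measures described above (cumulated gain, discounted cumulated gain, and the relative-to-the-ideal variant), here is a minimal Python sketch. The logarithm base and the sample graded judgments are illustrative assumptions, not values taken from the article.

```python
from math import log2

def cumulative_gain(gains):
    """Cumulated gain: running sum of the graded relevance scores down the ranked list."""
    out, total = [], 0.0
    for g in gains:
        total += g
        out.append(total)
    return out

def discounted_cumulative_gain(gains, base=2):
    """Divide the gain at rank i by log_base(i) for ranks i >= base, so late hits count less."""
    out, total = [], 0.0
    for i, g in enumerate(gains, start=1):
        total += g if i < base else g / (log2(i) / log2(base))
        out.append(total)
    return out

def normalized_dcg(gains, all_judged_gains, base=2):
    """Relative-to-ideal performance: DCG of the run divided, rank by rank, by the DCG
    of the ideal ranking built from all judged documents sorted by relevance."""
    dcg = discounted_cumulative_gain(gains, base)
    ideal = discounted_cumulative_gain(sorted(all_judged_gains, reverse=True), base)
    return [d / i if i else 0.0 for d, i in zip(dcg, ideal)]

# Relevance scores on a 0-3 scale for the top five retrieved documents of one query.
print(normalized_dcg([3, 1, 0, 2, 3], [3, 3, 2, 2, 1, 1, 0, 0]))
```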


Journal Article
TL;DR: A new conceptual paradigm for performing search in context is presented that largely automates the search process, providing even non-professional users with highly relevant results.
Abstract: Keyword-based search engines are in widespread use today as a popular means for Web-based information retrieval. Although such systems seem deceptively simple, a considerable amount of skill is required in order to satisfy non-trivial information needs. This paper presents a new conceptual paradigm for performing search in context that largely automates the search process, providing even non-professional users with highly relevant results. This paradigm is implemented in practice in the IntelliZap system, where search is initiated from a text query marked by the user in a document she views, and is guided by the text surrounding the marked query in that document (“the context”). The context-driven information retrieval process involves semantic keyword extraction and clustering to automatically generate new, augmented queries. The latter are submitted to a host of general and domain-specific search engines. Search results are then semantically reranked, using context. Experimental results testify that using context to guide search effectively offers even inexperienced users an advanced search tool on the Web.

1,615 citations
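
The abstract above describes a pipeline of context keyword extraction, query augmentation, and context-based re-ranking. The sketch below only approximates that pipeline shape with plain bag-of-words term frequencies; IntelliZap's actual extraction, clustering, and re-ranking are semantic and considerably more sophisticated, and the stopword list and example text are made up.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "that"}

def context_terms(context, k=5):
    """Pick the k most frequent non-stopword terms from the text surrounding the marked query."""
    words = [w for w in re.findall(r"[a-z]+", context.lower()) if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(k)]

def augment_query(marked_query, context, k=5):
    """Build an augmented query from the user-marked text plus context keywords."""
    extra = [t for t in context_terms(context, k) if t not in marked_query.lower()]
    return marked_query + " " + " ".join(extra)

def rerank(results, context):
    """Re-rank result snippets by their term overlap with the context."""
    ctx = set(context_terms(context, k=20))
    def score(snippet):
        return len(ctx & set(re.findall(r"[a-z]+", snippet.lower())))
    return sorted(results, key=score, reverse=True)

context = "The jaguar is a large cat native to the Americas, often confused with the leopard."
print(augment_query("jaguar", context))
print(rerank(["Leopard print fashion sale", "Jaguar cats in the Americas rainforest"], context))
```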


Journal ArticleDOI
TL;DR: A framework for deriving probabilistic models of Information Retrieval is introduced, in which nonparametric term-weighting models, obtained in the language model approach, are derived by measuring the divergence of the actual term distribution from that obtained under a random process.
Abstract: We introduce and create a framework for deriving probabilistic models of Information Retrieval. The models are nonparametric models of IR obtained in the language model approach. We derive term-weighting models by measuring the divergence of the actual term distribution from that obtained under a random process. Among the random processes we study the binomial distribution and Bose--Einstein statistics. We define two types of term frequency normalization for tuning term weights in the document--query matching process. The first normalization assumes that documents have the same length and measures the information gain with the observed term once it has been accepted as a good descriptor of the observed document. The second normalization is related to the document length and to other statistics. These two normalization methods are applied to the basic models in succession to obtain weighting formulae. Results show that our framework produces different nonparametric models forming baseline alternatives to the standard tf-idf model.

913 citations
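
A minimal sketch of the kind of term weight the framework above produces, assuming a binomial randomness model and a Laplace-style first normalization; the article defines several randomness models and also a document-length (second) normalization that is omitted here, and the function and parameter names are mine.

```python
from math import lgamma, log

def log2_binom(tf, coll_freq, n_docs):
    """log2 probability of seeing the term tf times in one document if its coll_freq
    occurrences were scattered over the n_docs documents purely at random (binomial, p = 1/N)."""
    p = 1.0 / n_docs
    log_coeff = lgamma(coll_freq + 1) - lgamma(tf + 1) - lgamma(coll_freq - tf + 1)
    return (log_coeff + tf * log(p) + (coll_freq - tf) * log(1.0 - p)) / log(2.0)

def dfr_weight(tf, coll_freq, n_docs):
    """Divergence-from-randomness shaped weight: the informative content -log2 P1 of the
    observed frequency, scaled by a Laplace-style first normalization 1 / (tf + 1)."""
    inf1 = -log2_binom(tf, coll_freq, n_docs)
    inf2 = 1.0 / (tf + 1.0)
    return inf2 * inf1

# A term seen 5 times in a document but only 20 times in a 10,000-document collection
# diverges sharply from the random model and is weighted highly.
print(dfr_weight(5, 20, 10_000))
```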


Journal Article
TL;DR: A new query clustering method is described that makes use of user logs, which allow us to identify the documents users have selected for a query; experiments show that a combination of keywords and user logs is better than using either source alone.
Abstract: Query clustering is a process used to discover frequently asked questions or most popular topics on a search engine. This process is crucial for search engines based on question-answering. Because of the short lengths of queries, approaches based on keywords are not suitable for query clustering. This paper describes a new query clustering method that makes use of user logs which allow us to identify the documents the users have selected for a query. The similarity between two queries may be deduced from the common documents the users selected for them. Our experiments show that a combination of both keywords and user logs is better than using either method alone.

486 citations
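
A small sketch of the central idea: combine keyword similarity with similarity over the documents users clicked for each query. The Jaccard measure, the weighting parameter alpha, and the example queries are illustrative assumptions; the article's own similarity functions may differ.

```python
def jaccard(a, b):
    """Set overlap measure used here for both keywords and clicked documents."""
    return len(a & b) / len(a | b) if a | b else 0.0

def query_similarity(q1_terms, q2_terms, q1_clicked, q2_clicked, alpha=0.5):
    """Combine keyword similarity with click-through (user log) similarity.
    q*_terms: sets of query keywords; q*_clicked: sets of document ids the users
    selected for that query; alpha weights the two evidence sources."""
    return alpha * jaccard(q1_terms, q2_terms) + (1 - alpha) * jaccard(q1_clicked, q2_clicked)

# "atomic bomb" and "manhattan project" share no keywords, but the clicked documents overlap.
print(query_similarity({"atomic", "bomb"}, {"manhattan", "project"},
                       {"d1", "d2", "d3"}, {"d2", "d3", "d4"}))
```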


Journal ArticleDOI
TL;DR: A new algorithm for duplicate document detection that uses collection statistics is presented; experiments show that the approach, called I-Match, scales in the number of documents and works well for documents of all sizes.
Abstract: We present a new algorithm for duplicate document detection that uses collection statistics. We compare our approach with the state-of-the-art approach using multiple collections. These collections include a 30 MB, 18,577-document web collection developed by Excite@Home and three NIST collections. The first NIST collection consists of 100 MB of 18,232 LA-Times documents, which is roughly similar in the number of documents to the Excite@Home collection. The other two collections are both 2 GB: a 247,491-document web collection and the 528,023-document collection from TREC disks 4 and 5. We show that our approach, called I-Match, scales in terms of the number of documents and works well for documents of all sizes. We compared our solution to the state of the art and found that in addition to improved accuracy of detection, our approach executed in roughly one-fifth the time.

294 citations
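
A sketch in the spirit of the collection-statistics approach described above: terms whose idf falls outside a mid-range band are discarded, the survivors are sorted, and a single hash of the result serves as the document fingerprint, so documents sharing a fingerprint are flagged as duplicates. The idf thresholds, the SHA-1 choice, and the toy collection are illustrative assumptions.

```python
import hashlib
from math import log

def idf_table(collection):
    """Inverse document frequency for every term in a tokenized collection."""
    n = len(collection)
    df = {}
    for doc in collection:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    return {t: log(n / d) for t, d in df.items()}

def imatch_fingerprint(doc_terms, idf, lo=0.3, hi=10.0):
    """Keep only terms whose idf lies inside a mid-range band (thresholds are illustrative),
    sort the survivors, and hash them into a single fingerprint."""
    kept = sorted({t for t in doc_terms if lo <= idf.get(t, 0.0) <= hi})
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()

docs = [["press", "release", "acme", "merger", "announced"],
        ["press", "release", "acme", "merger", "announced"],
        ["completely", "different", "news", "story"]]
idf = idf_table(docs)
prints = [imatch_fingerprint(d, idf) for d in docs]
print(prints[0] == prints[1], prints[0] == prints[2])
```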


Journal ArticleDOI
TL;DR: These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
Abstract: Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it uses about the same memory as a binary search tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.

189 citations
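
A toy Python sketch of the bursting idea: leaves are small containers of string suffixes that are promoted to trie nodes once they exceed a capacity threshold. The paper uses binary search trees as containers and studies the burst heuristics in detail; here a dictionary-backed bucket and a fixed threshold keep the sketch short.

```python
class Container:
    """Leaf bucket holding (suffix -> count) records for strings that share a prefix."""
    def __init__(self):
        self.records = {}

class BurstTrie:
    LIMIT = 4   # burst a container once it holds more than LIMIT distinct suffixes

    def __init__(self):
        self.children = {}     # first character -> Container or BurstTrie
        self.empty_count = 0   # occurrences of the string that ends exactly at this node

    def insert(self, word):
        if not word:
            self.empty_count += 1
            return
        head, tail = word[0], word[1:]
        child = self.children.get(head)
        if child is None:
            child = self.children[head] = Container()
        if isinstance(child, BurstTrie):
            child.insert(tail)
            return
        child.records[tail] = child.records.get(tail, 0) + 1
        if len(child.records) > self.LIMIT:   # burst: promote the container to a trie node
            node = BurstTrie()
            for suffix, count in child.records.items():
                for _ in range(count):
                    node.insert(suffix)
            self.children[head] = node

    def items(self, prefix=""):
        """Yield (string, count) pairs in sorted order -- the property a hash table lacks."""
        if self.empty_count:
            yield prefix, self.empty_count
        for ch in sorted(self.children):
            child = self.children[ch]
            if isinstance(child, BurstTrie):
                yield from child.items(prefix + ch)
            else:
                for suffix in sorted(child.records):
                    yield prefix + ch + suffix, child.records[suffix]

trie = BurstTrie()
for w in "burst tries store sets of distinct strings sorted small and searchable so scans stay fast".split():
    trie.insert(w)
print(list(trie.items()))
```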


Journal Article
TL;DR: It is found that the combination of keywords and single-sentence summaries provides significant improvements in access times and number of required pen actions, as compared to other schemes.
Abstract: We present a design and implementation for displaying and manipulating HTML pages on small handheld devices such as personal digital assistants (PDAs) or cellular phones. We introduce methods for summarizing parts of Web pages and HTML forms. Each Web page is broken into text units that can each be hidden, partially displayed, made fully visible, or summarized. A variety of methods are introduced that summarize the text units. In addition, HTML forms are also summarized by displaying just the text labels that prompt the user for input. We tested the relative performance of the summarization methods by asking human subjects to accomplish single-page information search tasks. We found that the combination of keywords and single-sentence summaries provides significant improvements in access times and number of required pen actions, as compared to other schemes. Our experiments also show that our algorithms can identify the appropriate labels for forms in 95% of the cases, allowing effective form support for small screens.

170 citations
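
A minimal sketch of the summarization scheme the experiments favoured, namely showing a few keywords plus the first sentence of each text unit. Keyword selection here is plain term frequency over a made-up stopword list; the paper's methods (and its handling of forms) are more elaborate.

```python
import re
from collections import Counter

STOP = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "with", "on", "that", "this"}

def summarize_unit(text, n_keywords=3):
    """Summary of one text unit for a small screen: a handful of keywords plus the first
    sentence; the full unit can still be expanded to partially or fully visible on demand."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]
    keywords = [w for w, _ in Counter(words).most_common(n_keywords)]
    return {"keywords": keywords, "first_sentence": sentences[0] if sentences else ""}

unit = ("Battery life on handheld devices is limited. Screens drain the battery quickly, "
        "so aggressive backlight management extends battery life considerably.")
print(summarize_unit(unit))
```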


Journal ArticleDOI
TL;DR: The mapping between the two design levels is proposed to transform the XML semantic model into the XML Schema, based on which XML documents can be systematically created, managed, and validated.
Abstract: The eXtensible Markup Language (XML) is fast emerging as the dominant standard for describing and interchanging data among various systems and databases on the Internet. It offers the Document Type Definition (DTD) as a formalism for defining the syntax and structure of XML documents. The XML Schema definition language, as a replacement for the DTD, provides richer facilities for defining and constraining the content of XML documents. However, it does not concentrate on the semantics that underlies these documents, representing a logical data model rather than a conceptual model. To enable efficient business application development in large-scale electronic commerce environments, it is necessary to describe and model real-world data semantics and their complex interrelationships. In this article, we describe a design methodology for XML documents. The aim is to enforce XML conceptual modeling power and bridge the gap between software development and XML document structures. The proposed methodology comprises two design levels: the semantic level and the schema level. The first level is based on a semantic network, which provides semantic modeling of XML through four major components: a set of atomic and complex nodes, representing real-world objects; a set of directed edges, representing semantic relationships between the objects; a set of labels denoting different types of semantic relationships, including aggregation, generalization, association, and of-property relationships; and finally a set of constraints defined over nodes and edges to constrain semantic relationships and object domains. The other level of the proposed methodology is concerned with detailed XML schema design, including element/attribute declarations and simple/complex type definitions. The mapping between the two design levels is proposed to transform the XML semantic model into the XML Schema, based on which XML documents can be systematically created, managed, and validated.

107 citations
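
To make the two-level mapping concrete, here is a hedged sketch of how one complex node of the semantic level, with its aggregation and of-property edges, could be rendered as an XML Schema complexType; the node, element, and attribute names are invented for illustration and are not taken from the article.

```python
def complex_node_to_xsd(name, parts, attributes=()):
    """Map one complex node: aggregation edges become child elements in a sequence,
    of-property edges become attributes of the generated complexType."""
    lines = [f'<xs:element name="{name}">', "  <xs:complexType>", "    <xs:sequence>"]
    for child, xsd_type in parts:
        lines.append(f'      <xs:element name="{child}" type="{xsd_type}"/>')
    lines.append("    </xs:sequence>")
    for attr, xsd_type in attributes:
        lines.append(f'    <xs:attribute name="{attr}" type="{xsd_type}"/>')
    lines += ["  </xs:complexType>", "</xs:element>"]
    return "\n".join(lines)

print(complex_node_to_xsd("Book",
                          parts=[("Title", "xs:string"), ("Author", "xs:string")],
                          attributes=[("isbn", "xs:string")]))
```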


Journal ArticleDOI
TL;DR: A keyblock-based approach to content-based image retrieval where each image is encoded as a set of one-dimensional index codes linked to the keyblocks in the codebook, analogous to considering a text document as a linear list of keywords.
Abstract: The success of text-based retrieval motivates us to investigate analogous techniques which can support the querying and browsing of image data. However, images differ significantly from text both syntactically and semantically in their mode of representing and expressing information. Thus, the generalization of information retrieval from the text domain to the image domain is non-trivial. This paper presents a framework for information retrieval in the image domain which supports content-based querying and browsing of images. A critical first step to establishing such a framework is to construct a codebook of "keywords" for images which is analogous to the dictionary for text documents. We refer to such "keywords" in the image domain as "keyblocks." In this paper, we first present various approaches to generating a codebook containing keyblocks at different resolutions. Then we present a keyblock-based approach to content-based image retrieval. In this approach, each image is encoded as a set of one-dimensional index codes linked to the keyblocks in the codebook, analogous to considering a text document as a linear list of keywords. Generalizing upon text-based information retrieval methods, we then offer various techniques for image-based information retrieval. By comparing the performance of this approach with conventional techniques using color and texture features, we demonstrate the effectiveness of the keyblock-based approach to content-based image retrieval.

97 citations
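
A small numpy sketch of the keyblock encoding step and the text-style matching it enables: each image tile is replaced by the index of its nearest codebook entry, and images are compared through normalized code histograms. The block size, codebook, and random test image are illustrative assumptions; the paper also studies richer models over the codes and codebooks at several resolutions.

```python
import numpy as np

def encode_image(image, codebook, block=4):
    """Cut the image into block x block tiles and replace each tile by the index of its
    nearest codebook entry (its "keyblock"), giving a 1-D list of codes like a list of keywords."""
    h, w = image.shape[:2]
    codes = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            tile = image[y:y + block, x:x + block].reshape(-1)
            codes.append(int(np.argmin(((codebook - tile) ** 2).sum(axis=1))))
    return codes

def keyblock_histogram(codes, codebook_size):
    """Term-frequency style vector over keyblock codes; cosine similarity between these
    vectors then plays the role of text retrieval matching."""
    hist = np.bincount(codes, minlength=codebook_size).astype(float)
    return hist / (np.linalg.norm(hist) or 1.0)

# Illustration with a random grayscale image and a random 32-entry codebook of 4x4 blocks.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64))
codebook = rng.integers(0, 256, size=(32, 16))
codes = encode_image(image, codebook)
print(keyblock_histogram(codes, 32)[:8])
```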


Journal ArticleDOI
TL;DR: In this article, the authors discuss how a network of autonomous archiving sites can trade data to achieve the most reliable replication, showing that in peer-to-peer archiving systems trading equal-sized blocks of space (deeds) gives the best reliability.
Abstract: Data archiving systems rely on replication to preserve information. This paper discusses how a network of autonomous archiving sites can trade data to achieve the most reliable replication. A series of binary trades among sites produces a peer-to-peer archiving network. Two trading algorithms are examined, one based on trading collections (even if they are of different sizes) and another based on trading equal-sized blocks of space (which can then store collections). The concept of deeds is introduced; deeds track the blocks of space owned by one site at another. Policies for tuning these algorithms to provide the highest reliability, for example by changing the order in which sites are contacted and offered trades, are discussed. Finally, simulation results are presented that reveal which policies are best. The experiments indicate that a digital archive can achieve the best reliability by trading blocks of space (deeds), and that following certain policies will allow a site to maximize its reliability.

76 citations
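
A hedged sketch of the block-trading idea: a binary trade grants each site a deed for space at the other, which can later be redeemed to place replicas. The data structures, the greedy redemption policy, and the numbers are invented for illustration; the article's simulations evaluate much more refined contact-order and acceptance policies.

```python
def trade_blocks(site_a, site_b, blocks):
    """One binary trade: each site grants the other a deed for `blocks` blocks of its free space."""
    if site_a["free"] < blocks or site_b["free"] < blocks:
        return False
    site_a["free"] -= blocks
    site_b["free"] -= blocks
    site_a["deeds"][site_b["name"]] = site_a["deeds"].get(site_b["name"], 0) + blocks
    site_b["deeds"][site_a["name"]] = site_b["deeds"].get(site_a["name"], 0) + blocks
    return True

def place_replicas(site, collection_blocks):
    """Greedy policy sketch: redeem deeds at every remote site that granted enough space,
    since copies held at independent sites fail independently."""
    return [remote for remote, blocks in site["deeds"].items() if blocks >= collection_blocks]

a = {"name": "A", "free": 100, "deeds": {}}
b = {"name": "B", "free": 100, "deeds": {}}
c = {"name": "C", "free": 100, "deeds": {}}
trade_blocks(a, b, 20)
trade_blocks(a, c, 20)
print(a["deeds"], place_replicas(a, collection_blocks=15))   # A can replicate at both B and C
```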


Journal ArticleDOI
TL;DR: It is argued that the results of multiple functions can be merged, by analogy with ensembles of classifiers, and a simple combination technique based on the rank values of the suggested terms is presented; it is effective not only with respect to unexpanded queries but also with respect to any individual method, with notable improvements in the system's precision.
Abstract: In this article we consider methods for automatic query expansion from top retrieved documents (i.e., retrieval feedback) that make use of various functions for scoring expansion terms within Rocchio's classical reweighting scheme. An analytical comparison shows that the retrieval performance of methods based on distinct term-scoring functions is comparable on the whole query set but differs considerably on single queries, consistent with the fact that the ordered sets of expansion terms suggested for each query by the different functions are largely uncorrelated. Motivated by these findings, we argue that the results of multiple functions can be merged, by analogy with ensembling classifiers, and present a simple combination technique based on the rank values of the suggested terms. The combined retrieval feedback method is effective not only with respect to unexpanded queries but also to any individual method, with notable improvements on the system's precision. Furthermore, the combined method is robust with respect to variation of experimental parameters and it is beneficial even when the same information needs are expressed with shorter queries.
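
A minimal sketch of merging the expansion terms proposed by several scoring functions using only their rank positions. The reciprocal-rank sum shown is just one simple instance of such a rank-based combination, and the function names and example term lists are made up; the article's exact combination scheme may differ.

```python
def combine_by_rank(rankings, top_k=10):
    """Merge ranked expansion-term lists from several scoring functions using only rank
    positions (a Borda-like reciprocal-rank sum); terms ranked high by many functions win."""
    scores = {}
    for ranking in rankings:
        for pos, term in enumerate(ranking):
            scores[term] = scores.get(term, 0.0) + 1.0 / (pos + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical outputs of three different term-scoring functions for one query.
rocchio_terms = ["hubble", "telescope", "nasa", "orbit", "mirror"]
kld_terms = ["telescope", "mirror", "hubble", "optics", "repair"]
chi2_terms = ["nasa", "hubble", "shuttle", "telescope", "orbit"]
print(combine_by_rank([rocchio_terms, kld_terms, chi2_terms], top_k=5))
```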

Journal ArticleDOI
TL;DR: This paper proposes a novel ranking scheme for popular topics that places the most authoritative pages on the query topic at the top of the ranking, and presents a prototype search engine that implements the scheme and discusses its performance.
Abstract: In response to a query, a search engine returns a ranked list of documents. If the query is about a popular topic (i.e., it matches many documents), then the returned list is usually too long to view fully. Studies show that users usually look at only the top 10 to 20 results. However, we can exploit the fact that the best targets for popular topics are usually linked to by enthusiasts in the same domain. In this paper, we propose a novel ranking scheme for popular topics that places the most authoritative pages on the query topic at the top of the ranking. Our algorithm operates on a special index of "expert documents." These are a subset of the pages on the WWW identified as directories of links to non-affiliated sources on specific topics. Results are ranked based on the match between the query and relevant descriptive text for hyperlinks on expert pages pointing to a given result page. We present a prototype search engine that implements our ranking scheme and discuss its performance. With a relatively small (2.5 million page) expert index, our algorithm was able to perform comparably on popular queries with the best of the mainstream search engines.
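
A toy sketch of the ranking idea: target pages are scored by how well the query matches the descriptive text attached to links on expert pages that point to them. The matching here is simple word overlap and the expert data is invented; the actual algorithm imposes non-affiliation constraints and richer key-phrase matching.

```python
def rank_results(query_terms, expert_pages):
    """Score each target page by summing, over expert pages, the match between the query
    and the descriptive text (anchor text, nearby headings) of links to that target."""
    query = set(query_terms)
    scores = {}
    for expert in expert_pages:
        for target, key_phrases in expert["links"]:
            match = len(query & set(" ".join(key_phrases).lower().split()))
            if match:
                scores[target] = scores.get(target, 0) + match
    return sorted(scores.items(), key=lambda kv: -kv[1])

experts = [
    {"url": "dir.example.org/compilers", "links": [
        ("gcc.gnu.org", ["GNU compiler collection"]),
        ("llvm.org", ["LLVM compiler infrastructure"])]},
    {"url": "links.example.edu/tools", "links": [
        ("llvm.org", ["modern compiler toolchain"])]},
]
print(rank_results(["compiler", "toolchain"], experts))
```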


Journal ArticleDOI
TL;DR: This work proposes a semistatic phrase-based approach called xray that builds a model offline using sample training data extracted from a collection and then compresses the entire collection online in a single pass; experiments show that xray is effective on both text and general-purpose collections.
Abstract: Compression of large collections can lead to improvements in retrieval times by offsetting the CPU decompression costs with the cost of seeking and retrieving data from disk. We propose a semistatic phrase-based approach called xray that builds a model offline using sample training data extracted from a collection, and then compresses the entire collection online in a single pass. The particular benefits of xray are that it can be used in applications where individual records or documents must be decompressed, and that decompression is fast. The xray scheme also allows new data to be added to a collection without modifying the semistatic model. Moreover, xray can be used to compress general-purpose data such as genomic, scientific, image, and geographic collections without prior knowledge of the structure of the data. We show that xray is effective on both text and general-purpose collections. In general, xray is more effective than the popular gzip and compress schemes, while being marginally less effective than bzip2. We also show that xray is efficient: of the popular schemes we tested, it is typically only slower than gzip in decompression. Moreover, the query evaluation costs of retrieving documents from a large collection with our search engine are improved by more than 30% when xray is incorporated, compared to an uncompressed approach. We use simple techniques for obtaining the training data from the collection to be compressed and show that with just over 4% of the data the entire collection can be effectively compressed. We also propose four schemes for phrase-match selection during the single-pass compression of the collection. We conclude that with these novel approaches xray is a fast and effective scheme for compression and decompression of large general-purpose collections.
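
A deliberately small sketch of the semistatic, two-phase shape of the scheme: a phrase model is trained offline on a sample and then applied unchanged in a single online pass, so individual records stay independently decompressible. Real xray builds a much richer phrase model and emits compact codes; the pair-based dictionary and the example text here are illustrative assumptions only.

```python
from collections import Counter

def train_phrase_model(sample_text, max_phrases=256):
    """Offline step: take the most frequent word pairs in a small training sample as a
    fixed (semistatic) phrase dictionary."""
    words = sample_text.split()
    pairs = Counter(zip(words, words[1:]))
    return {pair: code for code, (pair, _) in enumerate(pairs.most_common(max_phrases))}

def compress(text, model):
    """Online single pass: emit a phrase code where a dictionary pair matches, else the word.
    The model is never updated, so any record can later be decompressed on its own."""
    words, out, i = text.split(), [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if len(pair) == 2 and pair in model:
            out.append(("P", model[pair]))
            i += 2
        else:
            out.append(("W", words[i]))
            i += 1
    return out

sample = "to be or not to be that is the question"
model = train_phrase_model(sample)
print(compress("to be or not to be", model))
```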

Journal ArticleDOI
TL;DR: This paper proposes a representation for topic-specific information based on "knowledge-agent bases" that comprise all the information necessary to access information about a topic and assist in the full search process from query formulation assistance to result scanning on the device itself.
Abstract: Mobile knowledge seekers often need access to information on the Web during a meeting or on the road, while away from their desktop. A common practice today is to use pervasive devices such as Personal Digital Assistants or mobile phones. However, these devices have inherent constraints (e.g., slow communication, form factor) which often make information discovery tasks impractical. In this paper, we present a new focused-search approach specifically oriented for the mode of work and the constraints dictated by pervasive devices. It combines focused search within specific topics with encapsulation of topic-specific information in a persistent repository. One key characteristic of these persistent repositories is that their footprint is small enough to fit on local devices, and yet they are rich enough to support many information discovery tasks in disconnected mode. More specifically, we suggest a representation for topic-specific information based on "knowledge-agent bases" that comprise all the information necessary to access information about a topic (in the form of key concepts and key Web pages) and assist in the full search process, from query formulation assistance to result scanning on the device itself. The key contribution of our work is the coupling of focused search with encapsulated knowledge representation, making information discovery from pervasive devices practical as well as efficient. We describe our model in detail and demonstrate its aspects through sample scenarios.

Journal ArticleDOI
TL;DR: A methodology for handling and resolving each type of imperfection in a concept-based natural language query system is developed; the proposed approaches are automated as much as possible, with the user mainly serving an assistive function.
Abstract: Missing information, imprecision, inconsistency, vagueness, uncertainty, and ignorance abound in information systems. Such imperfection is a fact of life in database systems. Although these problems are widely studied in relational database systems, this is not the case in conceptual query systems. And yet, concept-based query languages have been proposed and some are already commercial products. It is therefore imperative to study these problems in concept-based query languages, with a view to prescribing formal approaches to dealing with the problems. In this article, we have done just that for a concept-based natural language query system that we developed. A methodology for handling and resolving each type of imperfection is developed. The proposed approaches are automated as much as possible, with the user mainly serving an assistive function.

Journal Article
TL;DR: In this paper, the authors describe PicASHOW, a fully automated WWW image retrieval system that is based on several link-structure analyzing algorithms, which can be implemented by standard WWW search engines with reasonable overhead, in terms of both computations and storage.
Abstract: We describe PicASHOW, a fully automated WWW image retrieval system that is based on several link-structure analyzing algorithms. Our basic premise is that a page p displays (or links to) an image when the author of p considers the image to be of value to the viewers of the page. We thus extend some well known link-based WWW page retrieval schemes to the context of image retrieval. PicASHOW’s analysis of the link structure enables it to retrieve relevant images even when those are stored in files with meaningless names. The same analysis also allows it to identify image containers and image hubs. We define these as Web pages that are rich in relevant images, or from which many images are readily accessible. PicASHOW requires no image analysis whatsoever and no creation of taxonomies for preclassification of the Web’s images. It can be implemented by standard WWW search engines with reasonable overhead, in terms of both computations and storage, and with no change to user query formats. It can thus be used to easily add image retrieving capabilities to standard search engines. Our results demonstrate that PicASHOW, while relying almost exclusively on link analysis, compares well with dedicated WWW image retrieval systems. We conclude that link analysis, a proven effective technique for Web page search, can improve the performance of Web image retrieval, as well as extend its definition to include the retrieval of image hubs and containers.
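
As an illustration of retrieving images purely from link structure, here is a HITS-style sketch over the page-image graph: images gain authority from good image hubs, and pages gain hub weight from displaying authoritative images. PicASHOW actually runs several co-citation-based schemes over a query-focused subgraph; the iteration count, normalization, and toy graph below are assumptions.

```python
def image_scores(page_to_images, iterations=20):
    """Alternately update image authority and page hub scores on the page-image graph,
    normalizing each vector so the iteration converges to stable relative scores."""
    pages = list(page_to_images)
    images = {img for imgs in page_to_images.values() for img in imgs}
    hub = {p: 1.0 for p in pages}
    auth = {i: 1.0 for i in images}
    for _ in range(iterations):
        auth = {i: sum(hub[p] for p in pages if i in page_to_images[p]) for i in images}
        hub = {p: sum(auth[i] for i in page_to_images[p]) for p in pages}
        norm_a = sum(auth.values()) or 1.0
        norm_h = sum(hub.values()) or 1.0
        auth = {i: v / norm_a for i, v in auth.items()}
        hub = {p: v / norm_h for p, v in hub.items()}
    return auth, hub

graph = {"p1": {"cat.jpg", "dog.jpg"}, "p2": {"cat.jpg"}, "p3": {"cat.jpg", "logo.gif"}}
auth, hub = image_scores(graph)
print(max(auth, key=auth.get))   # cat.jpg: displayed by the most good image hubs
```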