
Showing papers in "ACM Transactions on Information Systems in 2004"


Journal ArticleDOI
TL;DR: The key decisions in evaluating collaborative filtering recommender systems are reviewed: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole.
Abstract: Recommender systems have been evaluated in many, often incomparable, ways. In this article, we review the key decisions in evaluating collaborative filtering recommender systems: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole. In addition to reviewing the evaluation strategies used by prior researchers, we present empirical results from the analysis of various accuracy metrics on one content domain where all the tested metrics collapsed roughly into three equivalence classes. Metrics within each equivalency class were strongly correlated, while metrics from different equivalency classes were uncorrelated.
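
As a hedged illustration of the metric comparison described above (the systems, ratings, and values below are invented), this sketch computes two common accuracy metrics over the same held-out ratings for several hypothetical recommenders and then correlates the metric values across systems; metrics that correlate strongly in this way are the kind the article groups into one equivalence class.

```python
# Illustrative sketch: compare MAE and RMSE across hypothetical recommenders.
import math

def mae(pred, actual):
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)

def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-system predictions against the same held-out ratings.
actual = [4, 3, 5, 2, 4]
systems = {"A": [4.2, 2.9, 4.5, 2.3, 3.8],
           "B": [3.5, 3.4, 4.9, 1.8, 4.4],
           "C": [4.0, 3.0, 4.0, 3.0, 4.0]}
maes = [mae(p, actual) for p in systems.values()]
rmses = [rmse(p, actual) for p in systems.values()]
print("MAE vs RMSE correlation across systems:", round(pearson(maes, rmses), 3))
```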

5,686 citations


Journal ArticleDOI
TL;DR: This article presents one class of model-based recommendation algorithms that first determines the similarities between the various items and then uses them to identify the set of items to be recommended, and shows that these item-based algorithms are up to two orders of magnitude faster than the traditional user-neighborhood based recommender systems and provide recommendations with comparable or better quality.
Abstract: The explosive growth of the world-wide-web and the emergence of e-commerce have led to the development of recommender systems---a personalized information filtering technology used to identify a set of items that will be of interest to a certain user. User-based collaborative filtering is the most successful technology for building recommender systems to date and is extensively used in many commercial recommender systems. Unfortunately, the computational complexity of these methods grows linearly with the number of customers, which in typical commercial applications can be several million. To address these scalability concerns, model-based recommendation techniques have been developed. These techniques analyze the user--item matrix to discover relations between the different items and use these relations to compute the list of recommendations. In this article, we present one such class of model-based recommendation algorithms that first determines the similarities between the various items and then uses them to identify the set of items to be recommended. The key steps in this class of algorithms are (i) the method used to compute the similarity between the items, and (ii) the method used to combine these similarities in order to compute the similarity between a basket of items and a candidate recommender item. Our experimental evaluation on eight real datasets shows that these item-based algorithms are up to two orders of magnitude faster than the traditional user-neighborhood based recommender systems and provide recommendations with comparable or better quality.
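
The following is a minimal sketch, not the authors' implementation, of the item-based scheme just described: item-item cosine similarities are computed from a toy user-item matrix, and a candidate item is scored for a basket by combining its similarities to the basket items.

```python
# Illustrative item-based recommendation sketch on invented data.
import math
from collections import defaultdict

ratings = {  # user -> {item: rating}; toy data, purely for illustration
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 4, "b": 4, "d": 2},
    "u3": {"b": 5, "c": 5, "d": 1},
}

def item_vectors(ratings):
    vecs = defaultdict(dict)
    for user, items in ratings.items():
        for item, r in items.items():
            vecs[item][user] = r
    return vecs

def cosine(v, w):
    common = set(v) & set(w)
    num = sum(v[u] * w[u] for u in common)
    den = math.sqrt(sum(x * x for x in v.values())) * math.sqrt(sum(x * x for x in w.values()))
    return num / den if den else 0.0

def recommend(basket, ratings, top_n=2):
    vecs = item_vectors(ratings)
    scores = {}
    for cand in vecs:
        if cand in basket:
            continue
        # combine similarities between the candidate and every basket item
        scores[cand] = sum(cosine(vecs[cand], vecs[b]) for b in basket if b in vecs)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend({"a", "b"}, ratings))
```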

2,265 citations


Journal ArticleDOI
Thomas Hofmann1
TL;DR: A new family of model-based algorithms designed for collaborative filtering relies on a statistical modelling technique that introduces latent class variables in a mixture model setting to discover user communities and prototypical interest profiles.
Abstract: Collaborative filtering aims at learning predictive models of user preferences, interests or behavior from community data, that is, a database of available user preferences. In this article, we describe a new family of model-based algorithms designed for this task. These algorithms rely on a statistical modelling technique that introduces latent class variables in a mixture model setting to discover user communities and prototypical interest profiles. We investigate several variations to deal with discrete and continuous response variables as well as with different objective functions. The main advantages of this technique over standard memory-based methods are higher accuracy, constant time prediction, and an explicit and compact model representation. The latter can also be used to mine for user communities. The experimental evaluation shows that substantial improvements in accuracy over existing methods and published results can be obtained.
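
A compact sketch of the latent class idea, under the simplifying assumption of implicit (purchase-style) data and a plain aspect model p(item | user) = sum_z p(z | user) p(item | z) fit by EM; the article's variants for explicit ratings and alternative objective functions are more elaborate, and all data below is invented.

```python
# Hedged aspect-model sketch fit by EM on toy implicit feedback.
import random

users, items, K = ["u1", "u2", "u3"], ["a", "b", "c", "d"], 2
obs = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "c"), ("u3", "c"), ("u3", "d")]

random.seed(0)
def normalized(values):
    s = sum(values)
    return [v / s for v in values]

p_z_u = {u: normalized([random.random() for _ in range(K)]) for u in users}
p_i_z = [dict(zip(items, normalized([random.random() for _ in items]))) for _ in range(K)]

for _ in range(50):
    # E-step: responsibilities p(z | u, i) for each observed (user, item) pair
    resp = {}
    for (u, i) in obs:
        w = [p_z_u[u][z] * p_i_z[z][i] for z in range(K)]
        total = sum(w)
        resp[(u, i)] = [x / total for x in w]
    # M-step: re-estimate p(z | u) and p(i | z) from the responsibilities
    for u in users:
        p_z_u[u] = normalized([sum(r[z] for (uu, _), r in resp.items() if uu == u)
                               for z in range(K)])
    for z in range(K):
        counts = {i: sum(r[z] for (_, ii), r in resp.items() if ii == i) for i in items}
        p_i_z[z] = dict(zip(counts, normalized(list(counts.values()))))

# Rank items for u1 by the learned p(item | u1)
scores = {i: sum(p_z_u["u1"][z] * p_i_z[z][i] for z in range(K)) for i in items}
print(sorted(scores, key=scores.get, reverse=True))
```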

1,497 citations


Journal ArticleDOI
TL;DR: Evaluation on five different databases and four types of queries indicates that the two-stage smoothing method with the proposed parameter estimation methods consistently gives retrieval performance that is close to or better than the best results achieved using a single smoothing method and exhaustive parameter search on the test data.
Abstract: Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and to then rank documents by the likelihood of the query according to the estimated language model. A central issue in language model estimation is smoothing, the problem of adjusting the maximum likelihood estimator to compensate for data sparseness. In this article, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collections. Experimental results show that not only is the retrieval performance generally sensitive to the smoothing parameters, but also the sensitivity pattern is affected by the query type, with performance being more sensitive to smoothing for verbose queries than for keyword queries. Verbose queries also generally require more aggressive smoothing to achieve optimal performance. This suggests that smoothing plays two different roles---to make the estimated document language model more accurate and to "explain" the noninformative words in the query. In order to decouple these two distinct roles of smoothing, we propose a two-stage smoothing strategy, which yields better sensitivity patterns and facilitates the setting of smoothing parameters automatically. We further propose methods for estimating the smoothing parameters automatically. Evaluation on five different databases and four types of queries indicates that the two-stage smoothing method with the proposed parameter estimation methods consistently gives retrieval performance that is close to---or better than---the best results achieved using a single smoothing method and exhaustive parameter search on the test data.
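
A hedged sketch of the two-stage smoothing idea: Dirichlet-prior smoothing plays the estimation role and a Jelinek-Mercer interpolation with a background model plays the query-noise role, with documents ranked by query log-likelihood. The tiny collection, the parameter values for mu and lambda, and the use of the collection model as the background model are all illustrative assumptions, not the paper's settings.

```python
# Illustrative two-stage smoothing: Dirichlet prior, then Jelinek-Mercer mixing.
import math
from collections import Counter

docs = {
    "d1": "language model smoothing for retrieval".split(),
    "d2": "speech recognition uses language models".split(),
}
collection = Counter(w for d in docs.values() for w in d)
c_total = sum(collection.values())

def two_stage_score(query, doc, mu=2000.0, lam=0.5):
    tf = Counter(doc)
    dlen = len(doc)
    score = 0.0
    for w in query:
        p_c = collection[w] / c_total if collection[w] else 1.0 / c_total
        p_dir = (tf[w] + mu * p_c) / (dlen + mu)   # stage 1: Dirichlet prior
        p = (1 - lam) * p_dir + lam * p_c          # stage 2: Jelinek-Mercer with background
        score += math.log(p)
    return score

query = "language model smoothing".split()
print(sorted(docs, key=lambda d: two_stage_score(query, docs[d]), reverse=True))
```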

1,334 citations


Journal ArticleDOI
TL;DR: Ontological inference is shown to improve user profiling, external ontological knowledge is used to successfully bootstrap a recommender system, and profile visualization is employed to improve profiling accuracy.
Abstract: We explore a novel ontological approach to user profiling within recommender systems, working on the problem of recommending on-line academic research papers. Our two experimental systems, Quickstep and Foxtrot, create user profiles from unobtrusively monitored behaviour and relevance feedback, representing the profiles in terms of a research paper topic ontology. A novel profile visualization approach is taken to acquire profile feedback. Research papers are classified using ontological classes and collaborative recommendation algorithms are used to recommend papers seen by similar people on their current topics of interest. Two small-scale experiments, with 24 subjects over 3 months, and a large-scale experiment, with 260 subjects over an academic year, are conducted to evaluate different aspects of our approach. Ontological inference is shown to improve user profiling, external ontological knowledge is used to successfully bootstrap a recommender system, and profile visualization is employed to improve profiling accuracy. The overall performance of our ontological recommender systems is also presented and favourably compared to other systems in the literature.
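
One plausible reading of how ontological inference can strengthen a sparse profile, sketched here with an invented topic hierarchy and decay factor rather than the systems' actual ontology: interest observed for a topic is partially propagated to its super-topics, so related topics accumulate evidence even when they were never directly browsed.

```python
# Hedged sketch: propagate observed topic interest up a made-up topic hierarchy.
parent = {"collaborative_filtering": "recommender_systems",
          "recommender_systems": "information_systems"}

def propagate(observed, decay=0.5):
    profile = dict(observed)
    for topic, score in observed.items():
        t, s = topic, score
        while t in parent:
            t = parent[t]
            s *= decay
            profile[t] = profile.get(t, 0.0) + s
    return profile

print(propagate({"collaborative_filtering": 1.0}))
# interest also accrues to recommender_systems (0.5) and information_systems (0.25)
```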

785 citations


Journal ArticleDOI
TL;DR: This article proposes to deal with the sparsity problem in collaborative filtering by applying an associative retrieval framework and related spreading activation algorithms to explore transitive associations among consumers through their past transactions and feedback.
Abstract: Recommender systems are being widely applied in many application settings to suggest products, services, and information items to potential consumers. Collaborative filtering, the most successful recommendation approach, makes recommendations based on past transactions and feedback from consumers sharing similar interests. A major problem limiting the usefulness of collaborative filtering is the sparsity problem, which refers to a situation in which transactional or feedback data is sparse and insufficient to identify similarities in consumer interests. In this article, we propose to deal with this sparsity problem by applying an associative retrieval framework and related spreading activation algorithms to explore transitive associations among consumers through their past transactions and feedback. Such transitive associations are a valuable source of information to help infer consumer interests and can be explored to deal with the sparsity problem. To evaluate the effectiveness of our approach, we have conducted an experimental study using a data set from an online bookstore. We experimented with three spreading activation algorithms including a constrained Leaky Capacitor algorithm, a branch-and-bound serial symbolic search algorithm, and a Hopfield net parallel relaxation search algorithm. These algorithms were compared with several collaborative filtering approaches that do not consider the transitive associations: a simple graph search approach, two variations of the user-based approach, and an item-based approach. Our experimental results indicate that spreading activation-based approaches significantly outperformed the other collaborative filtering methods as measured by recommendation precision, recall, the F-measure, and the rank score. We also observed the over-activation effect of the spreading activation approach, that is, incorporating transitive associations with past transactional data that is not sparse may "dilute" the data used to infer user preferences and lead to degradation in recommendation performance.
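A generic sketch of spreading activation over the consumer-item graph, offered only as an illustration of the transitive-association idea and not as any of the three algorithms evaluated (Leaky Capacitor, branch-and-bound, Hopfield net): activation starts at the target consumer's items, flows to consumers who share those items and on to their other items, and the most activated unseen items are recommended. Data, decay factor, and iteration count are invented.

```python
# Hedged spreading-activation sketch over a toy consumer-item graph.
from collections import defaultdict

purchases = {"u1": {"a", "b"}, "u2": {"b", "c"}, "u3": {"c", "d"}}

def spread(target, purchases, iters=3, decay=0.5):
    item_users = defaultdict(set)
    for u, items in purchases.items():
        for i in items:
            item_users[i].add(u)
    act_items = {i: 1.0 for i in purchases[target]}   # seed activation
    act_users = defaultdict(float)
    for _ in range(iters):
        for i, a in list(act_items.items()):          # items activate their buyers
            for u in item_users[i]:
                act_users[u] += decay * a
        for u, a in list(act_users.items()):          # buyers activate their other items
            for i in purchases[u]:
                act_items[i] = act_items.get(i, 0.0) + decay * a
    seen = purchases[target]
    return sorted((i for i in act_items if i not in seen),
                  key=act_items.get, reverse=True)

print(spread("u1", purchases))   # 'c' (reached via u2) ranks above 'd' (via u2 -> u3)
```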

678 citations


Journal ArticleDOI
TL;DR: This paper presents an efficient method for mining both positive and negative association rules in databases, and extends traditional associations to include association rules of forms A ⇒ ¬B, ¬A ⇒ B, and ¬A ⇒ ¬B, which indicate negative associations between itemsets.
Abstract: This paper presents an efficient method for mining both positive and negative association rules in databases. The method extends traditional associations to include association rules of forms A ⇒ ¬B, ¬A ⇒ B, and ¬A ⇒ ¬B, which indicate negative associations between itemsets. With a pruning strategy and an interestingness measure, our method scales to large databases. The method has been evaluated using both synthetic and real-world databases, and our experimental results demonstrate its effectiveness and efficiency.
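
A minimal sketch of how a negative rule such as A ⇒ ¬B can be assessed with support and confidence alone (the paper's pruning strategy and interestingness measure are omitted): supp(A ⇒ ¬B) = supp(A) − supp(A ∪ B), and confidence divides that by supp(A). The transactions and thresholds below are invented.

```python
# Illustrative support/confidence check for a negative association rule.
transactions = [{"milk", "bread"}, {"milk"}, {"bread", "butter"}, {"milk", "butter"}]

def supp(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def negative_rule(a, b, min_supp=0.25, min_conf=0.5):
    s_a, s_ab = supp({a}), supp({a, b})
    s_a_not_b = s_a - s_ab                       # support of A with B absent
    conf = s_a_not_b / s_a if s_a else 0.0
    holds = s_a_not_b >= min_supp and conf >= min_conf
    return holds, round(s_a_not_b, 3), round(conf, 3)

print(negative_rule("milk", "bread"))  # does "milk => not bread" hold on this toy data?
```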

470 citations


Journal ArticleDOI
TL;DR: The new PocketLens collaborative filtering algorithm, along with five peer-to-peer architectures for finding neighbors, is presented and evaluated in a series of offline experiments, showing that PocketLens can run on connected servers, on usually connected workstations, or on occasionally connected portable devices, and produce recommendations that are as good as the best published algorithms to date.
Abstract: Recommender systems using collaborative filtering are a popular technique for reducing information overload and finding products to purchase. One limitation of current recommenders is that they are not portable. They can only run on large computers connected to the Internet. A second limitation is that they require the user to trust the owner of the recommender with personal preference data. Personal recommenders hold the promise of delivering high quality recommendations on palmtop computers, even when disconnected from the Internet. Further, they can protect the user's privacy by storing personal information locally, or by sharing it in encrypted form. In this article we present the new PocketLens collaborative filtering algorithm along with five peer-to-peer architectures for finding neighbors. We evaluate the architectures and algorithms in a series of offline experiments. These experiments show that PocketLens can run on connected servers, on usually connected workstations, or on occasionally connected portable devices, and produce recommendations that are as good as the best published algorithms to date.
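
A hedged sketch, not the published algorithm, of the sort of incrementally built item-item model a portable recommender like PocketLens can maintain: each neighbor's ratings vector updates running co-rating sums and norms, so item similarities are available without ever materializing the full user-item matrix and peers can arrive one at a time.

```python
# Illustrative incremental item-item model built from one peer's ratings at a time.
import math
from collections import defaultdict

class ItemModel:
    def __init__(self):
        self.dot = defaultdict(float)    # (i, j) with i < j -> sum of co-ratings
        self.norm = defaultdict(float)   # i -> sum of squared ratings

    def add_neighbor(self, ratings):     # ratings: {item: rating} from one peer
        items = list(ratings)
        for i in items:
            self.norm[i] += ratings[i] ** 2
            for j in items:
                if i < j:
                    self.dot[(i, j)] += ratings[i] * ratings[j]

    def similarity(self, i, j):
        key = (i, j) if i < j else (j, i)
        den = math.sqrt(self.norm[i]) * math.sqrt(self.norm[j])
        return self.dot[key] / den if den else 0.0

model = ItemModel()
for peer in ({"a": 5, "b": 4}, {"a": 4, "c": 2}, {"b": 5, "c": 1}):
    model.add_neighbor(peer)             # peers can be folded in incrementally
print(round(model.similarity("a", "b"), 3))
```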

370 citations


Journal ArticleDOI
TL;DR: The fundamental abstractions of Streams, Structures, Spaces, Scenarios, and Societies (5S), which allow us to define digital libraries rigorously and usefully, are proposed.
Abstract: Digital libraries (DLs) are complex information systems and therefore demand formal foundations lest development efforts diverge and interoperability suffers. In this article, we propose the fundamental abstractions of Streams, Structures, Spaces, Scenarios, and Societies (5S), which allow us to define digital libraries rigorously and usefully. Streams are sequences of arbitrary items used to describe both static and dynamic (e.g., video) content. Structures can be viewed as labeled directed graphs, which impose organization. Spaces are sets with operations on those sets that obey certain constraints. Scenarios consist of sequences of events or actions that modify states of a computation in order to accomplish a functional requirement. Societies are sets of entities and activities and the relationships among them. Together these abstractions provide a formal foundation to define, relate, and unify concepts---among others, of digital objects, metadata, collections, and services---required to formalize and elucidate "digital libraries". The applicability, versatility, and unifying power of the 5S model are demonstrated through its use in three distinct applications: building and interpretation of a DL taxonomy, informal and formal analysis of case studies of digital libraries (NDLTD and OAI), and utilization as a formal basis for a DL description language.

328 citations


Journal ArticleDOI
TL;DR: A comprehensive framework for managing various semantic conflicts is proposed that provides a unified view of the underlying representational and reasoning formalism for the semantic mediation process and suggests that correct identification and construction of both schema and ontology-schema mapping knowledge play very important roles in achieving interoperability at both the data and schema levels.
Abstract: Interoperability is the most critical issue facing businesses that need to access information from multiple information systems. Our objective in this research is to develop a comprehensive framework and methodology to facilitate semantic interoperability among distributed and heterogeneous information systems. A comprehensive framework for managing various semantic conflicts is proposed. Our proposed framework provides a unified view of the underlying representational and reasoning formalism for the semantic mediation process. This framework is then used as a basis for automating the detection and resolution of semantic conflicts among heterogeneous information sources. We define several types of semantic mediators to achieve semantic interoperability. A domain-independent ontology is used to capture various semantic conflicts. A mediation-based query processing technique is developed to provide uniform and integrated access to the multiple heterogeneous databases. A usable prototype is implemented as a proof-of-concept for this work. Finally, the usefulness of our approach is evaluated using three cases in different application domains. Various heterogeneous datasets are used during the evaluation phase. The results of the evaluation suggest that correct identification and construction of both schema and ontology-schema mapping knowledge play very important roles in achieving interoperability at both the data and schema levels.

270 citations


Journal ArticleDOI
TL;DR: Substantial commercial interest focused attention on a variety of practical questions, including the speed with which recommendations could be generated, the scale of problems that could be addressed, and the assessment of the value of recommendations to the business itself or to the customers.
Abstract: Recommender systems use the opinions of members of a community to help individuals in that community identify the information or products most likely to be interesting to them or relevant to their needs. These systems, originally referred to as collaborative filtering systems, were developed to address two challenges that could not be addressed by existing keyword-based information filtering systems. First, they addressed the problem of overwhelming numbers of on-topic documents---ones which would all be selected by a keyword filter---by filtering based on human judgement about the quality of those documents. Second, they addressed the problem of filtering non-text documents based on human taste. For example, the Ringo system [Shardanand and Maes, 1995] applied collaborative filtering to recommend music to individuals and later research and commercial systems applied the same techniques to other art forms. Early research in this area focused largely on the ability of these systems to generate recommendations that were valued by the users of the system. And, indeed, these systems generated substantial enthusiasm and support from their users. In 1996, at the first of a series of workshops on collaborative filtering, it first became clear that some fairly simple algorithms (namely weighted k-nearest-neighbor algorithms applied to a sparse matrix of the ratings that users assigned to particular items or documents) worked well for several different research groups and application areas. This workshop also started using the term “Recommender Systems” and led to the publication of a special issue of Communications of the ACM on the topic (March 1997). At this point, the Recommender Systems research field diverged. Substantial commercial interest focused attention on a variety of practical questions, including the speed with which recommendations could be generated, the scale of problems that could be addressed, and the assessment of the value of recommendations to the business itself or to the customers. At the same time, a broad range of machine learning researchers (broadly defined) started applying a wide variety of techniques to recommendation problems, exploring issues of improving accuracy of algorithms, better exploiting knowledge about the
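
For concreteness, a small illustration of the weighted k-nearest-neighbor scheme mentioned above; the similarity measure, toy ratings, and value of k are assumptions, not anything prescribed by the text. A user's rating for an item is predicted as the similarity-weighted average of the ratings of the most similar users who rated that item.

```python
# Illustrative user-based weighted k-nearest-neighbor prediction on a sparse ratings matrix.
import math

ratings = {"u1": {"a": 5, "b": 3}, "u2": {"a": 4, "b": 2, "c": 5}, "u3": {"b": 4, "c": 1}}

def cosine(r1, r2):
    common = set(r1) & set(r2)
    num = sum(r1[i] * r2[i] for i in common)
    den = math.sqrt(sum(v * v for v in r1.values())) * math.sqrt(sum(v * v for v in r2.values()))
    return num / den if den else 0.0

def predict(user, item, k=2):
    neighbors = [(cosine(ratings[user], r), r[item])
                 for u, r in ratings.items() if u != user and item in r]
    neighbors.sort(reverse=True)
    top = neighbors[:k]
    wsum = sum(w for w, _ in top)
    return sum(w * v for w, v in top) / wsum if wsum else None

print(round(predict("u1", "c"), 2))   # weighted by similarity to u2 and u3
```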

Journal ArticleDOI
TL;DR: Combined use of a partial nextword, partial phrase, and conventional inverted index allows evaluation of phrase queries in a quarter the time required to evaluate such queries with an inverted file alone; the additional space overhead is only 26% of the size of the inverted file.
Abstract: Search engines need to evaluate queries extremely fast, a challenging task given the quantities of data being indexed. A significant proportion of the queries posed to search engines involve phrases. In this article we consider how phrase queries can be efficiently supported with low disk overheads. Our previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. Alternatively, special-purpose phrase indexes can be used, but it is not feasible to index all phrases. We propose combinations of nextword indexes and phrase indexes with inverted files as a solution to this problem. Our experiments show that combined use of a partial nextword, partial phrase, and conventional inverted index allows evaluation of phrase queries in a quarter the time required to evaluate such queries with an inverted file alone; the additional space overhead is only 26% of the size of the inverted file.
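
An illustrative sketch of the nextword-index structure (not the authors' implementation): each word maps to the words that immediately follow it, together with (document, position) postings, so a phrase query of two or more words can be evaluated by chaining word-pair lookups rather than intersecting full positional postings.

```python
# Toy nextword index: word -> next word -> [(doc id, position of the pair)].
from collections import defaultdict

docs = {1: "the quick brown fox", 2: "the brown dog", 3: "quick brown cats"}

nextword = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    words = text.split()
    for pos, (w, nxt) in enumerate(zip(words, words[1:])):
        nextword[w][nxt].append((doc_id, pos))

def phrase_query(phrase):
    words = phrase.split()                     # assumes a phrase of at least two words
    candidates = set(nextword[words[0]][words[1]])
    for offset in range(1, len(words) - 1):    # extend matches one word pair at a time
        next_pairs = set(nextword[words[offset]][words[offset + 1]])
        candidates = {(d, p) for (d, p) in candidates if (d, p + offset) in next_pairs}
    return {d for d, _ in candidates}

print(phrase_query("quick brown"))     # {1, 3}
print(phrase_query("the brown dog"))   # {2}
```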

Journal ArticleDOI
TL;DR: XIRQL ("circle") is an XML query language that incorporates imprecision and vagueness for both structural and content-oriented query conditions, and is processed by the HyREX retrieval engine.
Abstract: XIRQL ("circle") is an XML query language that incorporates imprecision and vagueness for both structural and content-oriented query conditions. The corresponding uncertainty is handled by a consistent probabilistic model. The core features of XIRQL are (1) document ranking based on index term weighting, (2) specificity-oriented search for retrieving the most relevant parts of documents, (3) datatypes with vague predicates for dealing with specific types of content and (4) structural vagueness for vague interpretation of structural query conditions. A XIRQL database may contain several classes of documents, where all documents in a class conform to the same DTD; links between documents also are supported. XIRQL queries are translated into a path algebra, which can be processed by our HyREX retrieval engine.

Journal ArticleDOI
TL;DR: The proposed approach is effective in extracting translations of unknown queries, is easy to combine with the probabilistic retrieval model to improve the cross-language retrieval performance, and is very useful when the considered language pairs lack a sufficient number of anchor texts.
Abstract: To discover translation knowledge in diverse data resources on the Web, this article proposes an effective approach to finding translation equivalents of query terms and constructing multilingual lexicons through the mining of Web anchor texts and link structures. Although Web anchor texts are wide-scoped hypertext resources, not every particular pair of languages contains sufficient anchor texts for effective extraction of translations for Web queries. For more generalized applications, the approach is designed based on a transitive translation model. The translation equivalents of a query term can be extracted via its translation in an intermediate language. To reduce interference from translation errors, the approach further integrates a competitive linking algorithm into the process of determining the most probable translation. A series of experiments has been conducted, including performance tests on term translation extraction, cross-language information retrieval, and translation suggestions for practical Web search services. The obtained experimental results have shown that the proposed approach is effective in extracting translations of unknown queries, is easy to combine with the probabilistic retrieval model to improve the cross-language retrieval performance, and is very useful when the considered language pairs lack a sufficient number of anchor texts. Based on the approach, an experimental system called LiveTrans has been developed for English--Chinese cross-language Web search.
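
A hedged sketch of the transitive translation model combined with competitive linking, with all scores and terms invented: source-to-intermediate and intermediate-to-target translation scores (as might be mined from anchor texts) are chained, and competitive linking then greedily keeps the highest-scoring one-to-one pairs to suppress interference from translation errors.

```python
# Illustrative transitive translation scoring followed by greedy competitive linking.
def transitive_scores(src_terms, s2m, m2t):
    scores = {}
    for s in src_terms:
        for m, p_sm in s2m.get(s, {}).items():
            for t, p_mt in m2t.get(m, {}).items():
                scores[(s, t)] = max(scores.get((s, t), 0.0), p_sm * p_mt)
    return scores

def competitive_linking(scores):
    links, used_s, used_t = {}, set(), set()
    for (s, t), p in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s not in used_s and t not in used_t:   # keep only one-to-one links
            links[s] = (t, round(p, 3))
            used_s.add(s)
            used_t.add(t)
    return links

s2m = {"ordinateur": {"computer": 0.9, "calculator": 0.2}}        # source -> intermediate
m2t = {"computer": {"電腦": 0.8}, "calculator": {"計算機": 0.7}}    # intermediate -> target
print(competitive_linking(transitive_scores(["ordinateur"], s2m, m2t)))
```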

Journal ArticleDOI
TL;DR: The design and implementation of DISCOVIR: DIStributed COntent-based Visual Information Retrieval system using the Peer-to-Peer (P2P) Network is presented and a Firework Query Model for distributed information retrieval is proposed.
Abstract: With the recent advances of distributed computing, the limitation of information retrieval from a centralized image collection can be removed by allowing distributed image data sources to interact with each other for data storage sharing and information retrieval. In this article, we present our design and implementation of DISCOVIR: DIStributed COntent-based Visual Information Retrieval system using the Peer-to-Peer (P2P) Network. We describe the system architecture and detail the interactions among various system modules. Specifically, we propose a Firework Query Model for distributed information retrieval, which aims to reduce the network traffic of query passing in the network. We carry out experiments to evaluate the distributed image retrieval system and the Firework information retrieval algorithm. The results show that the algorithm reduces network traffic while increasing search performance.

Journal ArticleDOI
TL;DR: This article proposes, in addition to the classification capacity of clustering techniques, the possibility of offering an indicative extract about the contents of several sources by means of multidocument summarization techniques.
Abstract: An increasingly common problem in effective information access is the presence in the same corpus of multiple documents that contain similar information. Generally, users may be interested in locating, for a topic addressed by a group of similar documents, one or several particular aspects. This kind of task, called instance or aspectual retrieval, has been explored in several TREC Interactive Tracks. In this article, we propose, in addition to the classification capacity of clustering techniques, the possibility of offering an indicative extract about the contents of several sources by means of multidocument summarization techniques. Two kinds of summaries are provided. The first one covers the similarities of each cluster of documents retrieved. The second one shows the particularities of each document with respect to the common topic in the cluster. The document multitopic structure has been used in order to determine similarities and differences of topics in the cluster of documents. The system is independent of document domain and genre. An evaluation of the proposed system with users shows significant improvements in effectiveness. The results of previous experiments that have compared clustering algorithms are also reported.

Journal ArticleDOI
TL;DR: An architecture and design is proposed that accomplish encapsulation of digital object content with metadata describing its origins, cryptographic sealing, webs of trust for public keys rooted in a forest of respected institutions, and a certain way of managing information identifiers that will satisfy emerging needs in civilian and military record management.
Abstract: In ancient times, wax seals impressed with signet rings were affixed to documents as evidence of their authenticity. A digital counterpart is a message authentication code fixed firmly to each important document. If a digital object is sealed together with its own audit trail, each user can examine this evidence to decide whether to trust the content---no matter how distant this user is in time, space, and social affiliation from the document's source. We propose an architecture and design that accomplish this: encapsulation of digital object content with metadata describing its origins, cryptographic sealing, webs of trust for public keys rooted in a forest of respected institutions, and a certain way of managing information identifiers. These means will satisfy emerging needs in civilian and military record management, including medical patient records, regulatory records for aircraft and pharmaceuticals, business records for financial audit, legislative and legal briefs, and scholarly works. This is true for any kind of digital object, independent of its purposes and of most data type and representation details, and provides every kind of user---information authors and editors, librarians and collection managers, and information consumers---with autonomy for implied tasks. Our prototype will conform to applicable standards, will be interoperable over most computing bases, and will be compatible with existing digital library software. The proposed architecture integrates software that is mostly available and widely accepted.
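
A minimal sketch of the sealing idea under a deliberately simplified assumption: a shared-secret message authentication code over the object content plus its provenance metadata stands in for the public-key signatures and institutional webs of trust that the architecture actually calls for. Keys, metadata fields, and content are invented.

```python
# Illustrative sealing and verification of a digital object with its metadata.
import hashlib
import hmac
import json

def seal(content: bytes, metadata: dict, key: bytes) -> dict:
    payload = content + json.dumps(metadata, sort_keys=True).encode()
    return {"metadata": metadata,
            "mac": hmac.new(key, payload, hashlib.sha256).hexdigest()}

def verify(content: bytes, sealed: dict, key: bytes) -> bool:
    payload = content + json.dumps(sealed["metadata"], sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sealed["mac"])

key = b"shared-secret-for-illustration-only"
doc = b"patient record ..."
sealed = seal(doc, {"author": "Dr. X", "created": "2004-01-15"}, key)
print(verify(doc, sealed, key))                 # True: content and audit trail intact
print(verify(doc + b"tampered", sealed, key))   # False: any alteration breaks the seal
```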

Journal ArticleDOI
TL;DR: A dynamic LS generator called Test & Select (TS) is proposed to mitigate LS conflict, which outperforms all eight static methods in terms of both extracting the desired document and finding relevant information, over three different search engines.
Abstract: A lexical signature (LS) consisting of several key words from a Web document is often sufficient information for finding the document later, even if its URL has changed. We conduct a large-scale empirical study of nine methods for generating lexical signatures, including Phelps and Wilensky's original proposal (PW), seven of our own static variations, and one new dynamic method. We examine their performance on the Web over a 10-month period, and on a TREC data set, evaluating their ability to both (1) uniquely identify the original (possibly modified) document, and (2) locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the Web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. The term-frequency inverse-document-frequency- (TFIDF-) based method and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates among static methods for generating effective lexical signatures. We propose a dynamic LS generator called Test & Select (TS) to mitigate LS conflict. TS outperforms all eight static methods in terms of both extracting the desired document and finding relevant information, over three different search engines. All LS methods show significant performance degradation as documents in the corpus are edited.
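
A small sketch of three of the static signature choices compared above, on an invented three-document corpus with k = 3: the DF method picks the rarest terms, TF the terms most frequent in the document, and TFIDF balances the two.

```python
# Illustrative DF, TF, and TFIDF lexical signatures over a toy corpus.
import math
from collections import Counter

corpus = {
    "doc": "web archiving and preservation of web documents using lexical signatures",
    "other1": "web search engines index web documents",
    "other2": "digital signatures secure electronic documents",
}
tokenized = {d: t.split() for d, t in corpus.items()}
N = len(tokenized)
df = Counter(w for toks in tokenized.values() for w in set(toks))

def signature(doc_id, method, k=3):
    tf = Counter(tokenized[doc_id])
    if method == "DF":
        key = lambda w: (df[w], w)                       # rarest terms first
    elif method == "TF":
        key = lambda w: (-tf[w], w)                      # most frequent in the document
    else:  # TFIDF
        key = lambda w: (-tf[w] * math.log(N / df[w]), w)
    return sorted(tf, key=key)[:k]

for method in ("DF", "TF", "TFIDF"):
    print(method, signature("doc", method))
```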

Journal ArticleDOI
TL;DR: This work presents a new approach for adaptive presentation of structured information, based on preference-based constrained optimization techniques rooted in qualitative decision-theory, and implemented prototype systems for Web pages and for general media-rich document presentation.
Abstract: We present a new approach for adaptive presentation of structured information, based on preference-based constrained optimization techniques rooted in qualitative decision-theory. In this approach, document presentation is viewed as a configuration problem whose goal is to determine the optimal presentation of a document, while taking into account the preferences of the content provider, viewer interaction with the browser, and, possibly, some layout constraints. The preferences of the content provider are represented by a CP-net, a graphical, qualitative preference model developed in Boutilier et al. [1999]. The layout constraints are represented as geometric constraints, integrated within the optimization process. We discuss the theoretical basis of our approach, as well as implemented prototype systems for Web pages and for general media-rich document presentation.

Journal ArticleDOI
TL;DR: The key new idea of this paper is to model that a relevance judgment is also generated stochastically, and that its data generating function is also governed by those same document and query parameters.
Abstract: A central idea of Language Models is that documents (and perhaps queries) are random variables, generated by data-generating functions that are characterized by document (query) parameters. The key new idea of this paper is to model that a relevance judgment is also generated stochastically, and that its data generating function is also governed by those same document and query parameters. The result of this addition is that any available relevance judgments are easily incorporated as additional evidence about the true document and query model parameters. An additional aspect of this approach is that it also resolves the long-standing problem of document-oriented versus query-oriented probabilities. The general approach can be used with a wide variety of hypothesized distributions for documents, queries, and relevance. We test the approach on Reuters Corpus Volume 1, using one set of possible distributions. Experimental results show that the approach does succeed in incorporating relevance data to improve estimates of both document and query parameters, but on this data and for the specific distributions we hypothesized, performance was no better than two separate one-sided models. We conclude that the model's theoretical contribution is its integration of relevance models, document models, and query models, and that the potential for additional performance improvement over one-sided methods requires refinements.