scispace - formally typeset
Search or ask a question

Showing papers in "ACM Transactions on Information Systems in 2011"


Journal ArticleDOI
TL;DR: This article introduces a new concept-based retrieval approach based on Explicit Semantic Analysis (ESA), a recently proposed method that augments keyword-based text representation with concept- based features, automatically extracted from massive human knowledge repositories such as Wikipedia.
Abstract: Information retrieval systems traditionally rely on textual keywords to index and retrieve documents. Keyword-based retrieval may return inaccurate and incomplete results when different keywords are used to describe the same concept in the documents and in the queries. Furthermore, the relationship between these related keywords may be semantic rather than syntactic, and capturing it thus requires access to comprehensive human world knowledge. Concept-based retrieval methods have attempted to tackle these difficulties by using manually built thesauri, by relying on term cooccurrence data, or by extracting latent word relationships and concepts from a corpus. In this article we introduce a new concept-based retrieval approach based on Explicit Semantic Analysis (ESA), a recently proposed method that augments keyword-based text representation with concept-based features, automatically extracted from massive human knowledge repositories such as Wikipedia. Our approach generates new text features automatically, and we have found that high-quality feature selection becomes crucial in this setting to make the retrieval more focused. However, due to the lack of labeled data, traditional feature selection methods cannot be used, hence we propose new methods that use self-generated labeled training data. The resulting system is evaluated on several TREC datasets, showing superior performance over previous state-of-the-art results.

294 citations


Journal ArticleDOI
TL;DR: This article proposes a factor analysis approach based on probabilistic matrix factorization to alleviate the data sparsity and poor prediction accuracy problems by incorporating social contextual information, such as social networks and social tags.
Abstract: Due to their potential commercial value and the associated great research challenges, recommender systems have been extensively studied by both academia and industry recently. However, the data sparsity problem of the involved user-item matrix seriously affects the recommendation quality. Many existing approaches to recommender systems cannot easily deal with users who have made very few ratings. In view of the exponential growth of information generated by online users, social contextual information analysis is becoming important for many Web applications. In this article, we propose a factor analysis approach based on probabilistic matrix factorization to alleviate the data sparsity and poor prediction accuracy problems by incorporating social contextual information, such as social networks and social tags. The complexity analysis indicates that our approach can be applied to very large datasets since it scales linearly with the number of observations. Moreover, the experimental results show that our method performs much better than the state-of-the-art approaches, especially in the circumstance that users have made few ratings.

222 citations


Journal ArticleDOI
TL;DR: This work focuses on course recommendations: the courses taken by a student must satisfy requirements in order for the student to graduate, and develops increasingly expressive models for course requirements, and presents a variety of schemes for both checking if the requirements are satisfied, and for making recommendations that take into account the requirements.
Abstract: We study the problem of making recommendations when the objects to be recommended must also satisfy constraints or requirements. In particular, we focus on course recommendations: the courses taken by a student must satisfy requirements (e.g., take two out of a set of five math courses) in order for the student to graduate. Our work is done in the context of the CourseRank system, used by students to plan their academic program at Stanford University. Our goal is to recommend to these students courses that not only help satisfy constraints, but that are also desirable (e.g., popular or taken by similar students). We develop increasingly expressive models for course requirements, and present a variety of schemes for both checking if the requirements are satisfied, and for making recommendations that take into account the requirements. We show that some types of requirements are inherently expensive to check, and we present exact, as well as heuristic techniques, for those cases. Although our work is specific to course requirements, it provides insights into the design of recommendation systems in the presence of complex constraints found in other applications.

146 citations


Journal ArticleDOI
TL;DR: VideoReach represents one of the first attempts toward contextual recommendation driven by video content and user click-through, without assuming a sufficient collection of user profiles available, and is designed to seamlessly integrate both multimodal relevance and user feedback by relevance feedback and attention fusion.
Abstract: With Internet delivery of video content surging to an unprecedented level, video recommendation, which suggests relevant videos to targeted users according to their historical and current viewings or preferences, has become one of most pervasive online video services. This article presents a novel contextual video recommendation system, called VideoReach, based on multimodal content relevance and user feedback. We consider an online video usually consists of different modalities (i.e., visual and audio track, as well as associated texts such as query, keywords, and surrounding text). Therefore, the recommended videos should be relevant to current viewing in terms of multimodal relevance. We also consider that different parts of videos are with different degrees of interest to a user, as well as different features and modalities have different contributions to the overall relevance. As a result, the recommended videos should also be relevant to current users in terms of user feedback (i.e., user click-through). We then design a unified framework for VideoReach which can seamlessly integrate both multimodal relevance and user feedback by relevance feedback and attention fusion. VideoReach represents one of the first attempts toward contextual recommendation driven by video content and user click-through, without assuming a sufficient collection of user profiles available. We conducted experiments over a large-scale real-world video data and reported the effectiveness of VideoReach.

107 citations


Journal ArticleDOI
TL;DR: A general methodology to analytically and experimentally diagnose the weaknesses of a retrieval function, which provides guidance on how to further improve its performance and proposes two strategies to check how well a retrieved function implements the desired retrieval heuristics.
Abstract: Developing effective retrieval models is a long-standing central challenge in information retrieval research. In order to develop more effective models, it is necessary to understand the deficiencies of the current retrieval models and the relative strengths of each of them. In this article, we propose a general methodology to analytically and experimentally diagnose the weaknesses of a retrieval function, which provides guidance on how to further improve its performance. Our methodology is motivated by the empirical observation that good retrieval performance is closely related to the use of various retrieval heuristics. We connect the weaknesses and strengths of a retrieval function with its implementations of these retrieval heuristics, and propose two strategies to check how well a retrieval function implements the desired retrieval heuristics. The first strategy is to formalize heuristics as constraints, and use constraint analysis to analytically check the implementation of retrieval heuristics. The second strategy is to define a set of relevance-preserving perturbations and perform diagnostic tests to empirically evaluate how well a retrieval function implements retrieval heuristics. Experiments show that both strategies are effective to identify the potential problems in implementations of the retrieval heuristics. The performance of retrieval functions can be improved after we fix these problems.

100 citations


Journal ArticleDOI
TL;DR: This work proposes a general probabilistic framework for entity search to evaluate and provide insights in the many ways of using these types of input for query modeling, and focuses on the use of category information and shows the advantage of a category-based representation over a term- based representation.
Abstract: Users often search for entities instead of documents, and in this setting, are willing to provide extra input, in addition to a series of query terms, such as category information and example entities. We propose a general probabilistic framework for entity search to evaluate and provide insights in the many ways of using these types of input for query modeling. We focus on the use of category information and show the advantage of a category-based representation over a term-based representation, and also demonstrate the effectiveness of category-based expansion using example entities. Our best performing model shows very competitive performance on the INEX-XER entity ranking and list completion tasks.

85 citations


Journal ArticleDOI
TL;DR: Significant performance improvement over plain word-based retrieval, three other language-independent morphological normalizers, as well as rule-based stemmers is demonstrated.
Abstract: A novel graph-based language-independent stemming algorithm suitable for information retrieval is proposed in this article. The main features of the algorithm are retrieval effectiveness, generality, and computational efficiency. We test our approach on seven languages (using collections from the TREC, CLEF, and FIRE evaluation platforms) of varying morphological complexity. Significant performance improvement over plain word-based retrieval, three other language-independent morphological normalizers, as well as rule-based stemmers is demonstrated.

53 citations


Journal ArticleDOI
TL;DR: The results of the benchmark experiments confirm that the proposed semantic granularity based IR model performs significantly better than the similarity-based baseline in both a bio-medical and an agricultural domain and that the perceived relevance of the documents delivered by the granularity-based IR system is significantly higher than that produced by a popular search engine for a number of domain-specific search tasks.
Abstract: Both similarity-based and popularity-based document ranking functions have been successfully applied to information retrieval (IR) in general. However, the dimension of semantic granularity also should be considered for effective retrieval. In this article, we propose a semantic granularity-based IR model that takes into account the three dimensions, namely similarity, popularity, and semantic granularity, to improve domain-specific search. In particular, a concept-based computational model is developed to estimate the semantic granularity of documents with reference to a domain ontology. Semantic granularity refers to the levels of semantic detail carried by an information item. The results of our benchmark experiments confirm that the proposed semantic granularity based IR model performs significantly better than the similarity-based baseline in both a bio-medical and an agricultural domain. In addition, a series of user-oriented studies reveal that the proposed document ranking functions resemble the implicit ranking functions exercised by humans. The perceived relevance of the documents delivered by the granularity-based IR system is significantly higher than that produced by a popular search engine for a number of domain-specific search tasks. To the best of our knowledge, this is the first study regarding the application of semantic granularity to enhance domain-specific IR.

53 citations


Journal ArticleDOI
TL;DR: An in-depth study of duplication and content overlap in YouTube is presented, and various dependencies between content overlap and meta data such as video titles, views, video ratings, and tags are analyzed.
Abstract: The emergence of large-scale social Web communities has enabled users to share online vast amounts of multimedia content. An analysis of YouTube reveals a high amount of redundancy, in the form of videos with overlapping or duplicated content. We use robust content-based video analysis techniques to detect overlapping sequences between videos. Based on the output of these techniques, we present an in-depth study of duplication and content overlap in YouTube, and analyze various dependencies between content overlap and meta data such as video titles, views, video ratings, and tags. As an application, we show that content-based links provide useful information for generating new tag assignments. We propose different tag propagation methods for automatically obtaining richer video annotations. Experiments on video clustering and classification as well as a user evaluation demonstrate the viability of our approach.

42 citations


Journal ArticleDOI
TL;DR: It is proved that the proposed upper-bound approximation does not impact the retrieval effectiveness of several modern weighting models from various different families, and the applicability of the approximation for the Markov Random Field proximity model is shown.
Abstract: Dynamic pruning strategies for information retrieval systems can increase querying efficiency without decreasing effectiveness by using upper bounds to safely omit scoring documents that are unlikely to make the final retrieved set. Often, such upper bounds are pre-calculated at indexing time for a given weighting model. However, this precludes changing, adapting or training the weighting model without recalculating the upper bounds. Instead, upper bounds should be approximated at querying time from various statistics of each term to allow on-the-fly adaptation of the applied retrieval strategy. This article, by using uniform notation, formulates the problem of determining a term upper-bound given a weighting model and discusses the limitations of existing approximations. Moreover, we propose an upper-bound approximation using a constrained nonlinear maximization problem. We prove that our proposed upper-bound approximation does not impact the retrieval effectiveness of several modern weighting models from various different families. We also show the applicability of the approximation for the Markov Random Field proximity model. Finally, we empirically examine how the accuracy of the upper-bound approximation impacts the number of postings scored and the resulting efficiency in the context of several large Web test collections.

40 citations


Journal ArticleDOI
TL;DR: By exhaustively analyzing the potential of text-based features derived from artist-related Web pages, this article constitutes an important contribution to context-based music information research.
Abstract: This article comprehensively addresses the problem of similarity measurement between music artists via text-based features extracted from Web pages. To this end, we present a thorough evaluation of different term-weighting strategies, normalization methods, aggregation functions, and similarity measurement techniques. In large-scale genre classification experiments carried out on real-world artist collections, we analyze several thousand combinations of settings/parameters that influence the similarity calculation process, and investigate in which way they impact the quality of the similarity estimates. Accurate similarity measures for music are vital for many applications, such as automated playlist generation, music recommender systems, music information systems, or intelligent user interfaces to access music collections by means beyond text-based browsing. Therefore, by exhaustively analyzing the potential of text-based features derived from artist-related Web pages, this article constitutes an important contribution to context-based music information research.

Journal ArticleDOI
TL;DR: The intuition is that near-duplicate videos should preserve strong information correlation in spite of intensive content changes, and more effective retrieval with stronger tolerance is achieved by replacing video-content similarity measures with information correlation analysis.
Abstract: The unprecedented and ever-growing number of Web videos nowadays leads to the massive existence of near-duplicate videos. Very often, some near-duplicate videos exhibit great content changes, while the user perceives little information change, for example, color features change significantly when transforming a color video with a blue filter. These feature changes contribute to low-level video similarity computations, making conventional similarity-based near-duplicate video retrieval techniques incapable of accurately capturing the implicit relationship between two near-duplicate videos with fairly large content modifications. In this paper, we introduce a new dimension for near-duplicate video retrieval. Different from existing near-duplicate video retrieval approaches which are based on video-content similarity, we explore the correlation between two videos. The intuition is that near-duplicate videos should preserve strong information correlation in spite of intensive content changes. More effective retrieval with stronger tolerance is achieved by replacing video-content similarity measures with information correlation analysis. Theoretical justification and experimental results prove the effectiveness of correlation-based near-duplicate retrieval.

Journal ArticleDOI
TL;DR: This work proposes algorithms for chemical formula and chemical name tagging using Conditional Random Fields (CRFs) and Support Vector Machines (SVMs) that achieve higher accuracy than existing (published) methods.
Abstract: End-users utilize chemical search engines to search for chemical formulae and chemical names. Chemical search engines identify and index chemical formulae and chemical names appearing in text documents to support efficient search and retrieval in the future. Identifying chemical formulae and chemical names in text automatically has been a hard problem that has met with varying degrees of success in the past. We propose algorithms for chemical formula and chemical name tagging using Conditional Random Fields (CRFs) and Support Vector Machines (SVMs) that achieve higher accuracy than existing (published) methods. After chemical entities have been identified in text documents, they must be indexed. In order to support user-provided search queries that require a partial match between the chemical name segment used as a keyword or a partial chemical formula, all possible (or a significant number of) subformulae of formulae that appear in any document and all possible subterms (e.g., “methyl”) of chemical names (e.g., “methylethyl ketone”) must be indexed. Indexing all possible subformulae and subterms results in an exponential increase in the storage and memory requirements as well as the time taken to process the indices. We propose techniques to prune the indices significantly without reducing the quality of the returned results significantly. Finally, we propose multiple query semantics to allow users to pose different types of partial search queries for chemical entities. We demonstrate empirically that our search engines improve the relevance of the returned results for search queries involving chemical entities.

Journal ArticleDOI
TL;DR: A graph-based model for all types of implicit and explicit feedback, in which the relevant usage information is represented, is proposed, designed to capture the complex interactions of a user with an interactive video retrieval system.
Abstract: We present a model for exploiting community-based usage information for video retrieval, where implicit usage information from past users is exploited in order to provide enhanced assistance in video retrieval tasks, and alleviate the effects of the semantic gap problem. We propose a graph-based model for all types of implicit and explicit feedback, in which the relevant usage information is represented. Our model is designed to capture the complex interactions of a user with an interactive video retrieval system, including the representation of sequences of user-system interaction during a search session. Building upon this model, four recommendation strategies are defined and evaluated. An evaluation strategy is proposed based on simulated user actions, which enables the evaluation of our recommendation strategies over a usage information pool obtained from 24 users performing four different TRECVid tasks. Furthermore, the proposed simulation approach is used to simulate usage information pools with different characteristics, with which the recommendation approaches are further evaluated on a larger set of tasks, and their performance is studied with respect to the scalability and quality of the available implicit information.

Journal ArticleDOI
TL;DR: A new approach called tree-based ranking function adaptation (Trada) is proposed to effectively utilize data sources for training cross-domain ranking functions and is extended to utilize the pairwise preference data from the target domain to further improve the effectiveness of adaptation.
Abstract: Machine-learned ranking functions have shown successes in Web search engines. With the increasing demands on developing effective ranking functions for different search domains, we have seen a big bottleneck, that is, the problem of insufficient labeled training data, which has significantly slowed the development and deployment of machine-learned ranking functions for different domains. There are two possible approaches to address this problem: (1) combining labeled training data from similar domains with the small target-domain labeled data for training or (2) using pairwise preference data extracted from user clickthrough log for the target domain for training. In this article, we propose a new approach called tree-based ranking function adaptation (Trada) to effectively utilize these data sources for training cross-domain ranking functions. Tree adaptation assumes that ranking functions are trained with the Stochastic Gradient Boosting Trees method—a gradient boosting method on regression trees. It takes such a ranking function from one domain and tunes its tree-based structure with a small amount of training data from the target domain. The unique features include (1) automatic identification of the part of the model that needs adjustment for the new domain and (2) appropriate weighing of training examples considering both local and global distributions. Based on a novel pairwise loss function that we developed for pairwise learning, the basic tree adaptation algorithm is also extended (Pairwise Trada) to utilize the pairwise preference data from the target domain to further improve the effectiveness of adaptation. Experiments are performed on real datasets to show that tree adaptation can provide better-quality ranking functions for a new domain than other methods.

Journal ArticleDOI
TL;DR: This work shows that HYB can be constructed at least as fast as INV, and often up to twice as fast, because HYB, by its nature, requires only a half-inversion of the data and allows an efficient in-place instead of the traditional merge-based index construction.
Abstract: As shown in a series of recent works, the HYB index is an alternative to the inverted index (INV) that enables very fast prefix searches, which in turn is the basis for fast processing of many other types of advanced queries, including autocompletion, faceted search, error-tolerant search, database-style select and join, and semantic search. In this work we show that HYB can be constructed at least as fast as INV, and often up to twice as fast. This is because HYB, by its nature, requires only a half-inversion of the data and allows an efficient in-place instead of the traditional merge-based index construction. We also pay particular attention to the cache efficiency of the in-memory posting accumulation, an issue that has not been addressed in previous work, and show that our simple multilevel posting accumulation scheme yields much fewer cache misses compared to related approaches. Finally, we show that HYB supports fast dynamic index updates more easily than INV.