Bio: Wensi Xi is an academic researcher from Virginia Tech. The author has contributed to research in topics: Web page & Ranking (information retrieval). The author has an hindex of 13, co-authored 22 publications receiving 1870 citations.
15 Aug 2005
TL;DR: In this paper, clusters generated from the training data provide the basis for data smoothing and neighborhood selection and show that the new proposed approach consistently outperforms other state-of-art collaborative filtering algorithms.
Abstract: Memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. In the past, the memory-based approach has been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. Alternatively, the model-based approach has been proposed to alleviate these problems, but this approach tends to limit the range of users. In this paper, we present a novel approach that combines the advantages of these two approaches by introducing a smoothing-based method. In our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. As a result, we provide higher accuracy as well as increased efficiency in recommendations. Empirical studies on two datasets (EachMovie and MovieLens) show that our new proposed approach consistently outperforms other state-of-art collaborative filtering algorithms.
••13 Nov 2004
TL;DR: A novel iterative reinforced algorithm to utilize the user click-through data to improve search performance and effectively finds "virtual queries" for web pages and overcomes the challenges discussed above.
Abstract: The performance of web search engines may often deteriorate due to the diversity and noisy information contained within web pages. User click-through data can be used to introduce more accurate description (metadata) for web pages, and to improve the search performance. However, noise and incompleteness, sparseness, and the volatility of web pages and queries are three major challenges for research work on user click-through log mining. In this paper, we propose a novel iterative reinforced algorithm to utilize the user click-through data to improve search performance. The algorithm fully explores the interrelations between queries and web pages, and effectively finds "virtual queries" for web pages and overcomes the challenges discussed above. Experiment results on a large set of MSN click-through log data show a significant improvement on search performance over the naive query log mining algorithm as well as the baseline search engine.
15 Aug 2005
TL;DR: A novel ranking scheme named Affinity Ranking (AR) is proposed to re-rank search results by optimizing two metrics: diversity -- which indicates the variance of topics in a group of documents; and information richness -- which measures the coverage of a single document to its topic.
Abstract: In this paper, we propose a novel ranking scheme named Affinity Ranking (AR) to re-rank search results by optimizing two metrics: (1) diversity -- which indicates the variance of topics in a group of documents; (2) information richness -- which measures the coverage of a single document to its topic. Both of the two metrics are calculated from a directed link graph named Affinity Graph (AG). AG models the structure of a group of documents based on the asymmetric content similarities between each pair of documents. Experimental results in Yahoo! Directory, ODP Data, and Newsgroup data demonstrate that our proposed ranking algorithm significantly improves the search performance. Specifically, the algorithm achieves 31% improvement in diversity and 12% improvement in information richness relatively within the top 10 search results.
TL;DR: An overview of the popularly used feature extraction and selection algorithms under a unified framework is given and two novel dimensionality reduction algorithms based on the orthogonal centroid algorithm (OC) are proposed.
Abstract: Dimensionality reduction is an essential data preprocessing technique for large-scale and streaming data classification tasks. It can be used to improve both the efficiency and the effectiveness of classifiers. Traditional dimensionality reduction approaches fall into two categories: feature extraction and feature selection. Techniques in the feature extraction category are typically more effective than those in feature selection category. However, they may break down when processing large-scale data sets or data streams due to their high computational complexities. Similarly, the solutions provided by the feature selection approaches are mostly solved by greedy strategies and, hence, are not ensured to be optimal according to optimized criteria. In this paper, we give an overview of the popularly used feature extraction and selection algorithms under a unified framework. Moreover, we propose two novel dimensionality reduction algorithms based on the orthogonal centroid algorithm (OC). The first is an incremental OC (IOC) algorithm for feature extraction. The second algorithm is an orthogonal centroid feature selection (OCFS) method which can provide optimal solutions according to the OC criterion. Both are designed under the same optimization criterion. Experiments on Reuters Corpus Volume-1 data set and some public large-scale text data sets indicate that the two algorithms are favorable in terms of their effectiveness and efficiency when compared with other state-of-the-art algorithms.
••15 Aug 2005
TL;DR: It is claimed that iterative computations over the URM can help overcome the data sparseness problem and detect latent relationships among heterogeneous data objects, thus, can improve the quality of information applications that require com- bination of information from heterogeneous sources.
Abstract: In this paper we use a Unified Relationship Matrix (URM) to represent a set of heterogeneous data objects (e.g., web pages, queries) and their interrelationships (e.g., hyperlinks, user click-through sequences). We claim that iterative computations over the URM can help overcome the data sparseness problem and detect latent relationships among heterogeneous data objects, thus, can improve the quality of information applications that require com- bination of information from heterogeneous sources. To support our claim, we present a unified similarity-calculating algorithm, SimFusion. By iteratively computing over the URM, SimFusion can effectively integrate relationships from heterogeneous sources when measuring the similarity of two data objects. Experiments based on a web search engine query log and a web page collection demonstrate that SimFusion can improve similarity measurement of web objects over both traditional content based algorithms and the cutting edge SimRank algorithm.
TL;DR: From basic techniques to the state-of-the-art, this paper attempts to present a comprehensive survey for CF techniques, which can be served as a roadmap for research and practice in this area.
Abstract: As one of the most successful approaches to building recommender systems, collaborative filtering (CF) uses the known preferences of a group of users to make recommendations or predictions of the unknown preferences for other users. In this paper, we first introduce CF tasks and their main challenges, such as data sparsity, scalability, synonymy, gray sheep, shilling attacks, privacy protection, etc., and their possible solutions. We then present three main categories of CF techniques: memory-based, modelbased, and hybrid CF algorithms (that combine CF with other recommendation techniques), with examples for representative algorithms of each category, and analysis of their predictive performance and their ability to address the challenges. From basic techniques to the state-of-the-art, we attempt to present a comprehensive survey for CF techniques, which can be served as a roadmap for research and practice in this area.
•27 Jun 2009
TL;DR: Three major approaches to learning to rank are introduced, i.e., the pointwise, pairwise, and listwise approaches, the relationship between the loss functions used in these approaches and the widely-used IR evaluation measures are analyzed, and the performance of these approaches on the LETOR benchmark datasets is evaluated.
Abstract: This tutorial is concerned with a comprehensive introduction to the research area of learning to rank for information retrieval. In the first part of the tutorial, we will introduce three major approaches to learning to rank, i.e., the pointwise, pairwise, and listwise approaches, analyze the relationship between the loss functions used in these approaches and the widely-used IR evaluation measures, evaluate the performance of these approaches on the LETOR benchmark datasets, and demonstrate how to use these approaches to solve real ranking applications. In the second part of the tutorial, we will discuss some advanced topics regarding learning to rank, such as relational ranking, diverse ranking, semi-supervised ranking, transfer ranking, query-dependent ranking, and training data preprocessing. In the third part, we will briefly mention the recent advances on statistical learning theory for ranking, which explain the generalization ability and statistical consistency of different ranking methods. In the last part, we will conclude the tutorial and show several future research directions.
••20 Jun 2007
TL;DR: It is proposed that learning to rank should adopt the listwise approach in which lists of objects are used as 'instances' in learning, and introduces two probability models, respectively referred to as permutation probability and top k probability, to define a listwise loss function for learning.
Abstract: The paper is concerned with learning to rank, which is to construct a model or a function for ranking objects. Learning to rank is useful for document retrieval, collaborative filtering, and many other applications. Several methods for learning to rank have been proposed, which take object pairs as 'instances' in learning. We refer to them as the pairwise approach in this paper. Although the pairwise approach offers advantages, it ignores the fact that ranking is a prediction task on list of objects. The paper postulates that learning to rank should adopt the listwise approach in which lists of objects are used as 'instances' in learning. The paper proposes a new probabilistic method for the approach. Specifically it introduces two probability models, respectively referred to as permutation probability and top k probability, to define a listwise loss function for learning. Neural Network and Gradient Descent are then employed as model and algorithm in the learning method. Experimental results on information retrieval show that the proposed listwise approach performs better than the pairwise approach.
••09 Feb 2011
TL;DR: This paper proposes a matrix factorization framework with social regularization, which can be easily extended to incorporate other contextual information, like social tags, etc, and demonstrates that the approaches outperform other state-of-the-art methods.
Abstract: Although Recommender Systems have been comprehensively analyzed in the past decade, the study of social-based recommender systems just started. In this paper, aiming at providing a general method for improving recommender systems by incorporating social network information, we propose a matrix factorization framework with social regularization. The contributions of this paper are four-fold: (1) We elaborate how social network information can benefit recommender systems; (2) We interpret the differences between social-based recommender systems and trust-aware recommender systems; (3) We coin the term Social Regularization to represent the social constraints on recommender systems, and we systematically illustrate how to design a matrix factorization objective function with social regularization; and (4) The proposed method is quite general, which can be easily extended to incorporate other contextual information, like social tags, etc. The empirical analysis on two large datasets demonstrates that our approaches outperform other state-of-the-art methods.
03 Jul 2006
TL;DR: Any business seriously interested in improving its rankings in the major search engines can benefit from the clear examples, sample code, and list of resources provided.
Abstract: Why doesn't your home page appear on the first page of search results, even when you query your own name? How do other web pages always appear at the top? What creates these powerful rankings? And how? The first book ever about the science of web page rankings, Google's PageRank and Beyond supplies the answers to these and other questions and more. The book serves two very different audiences: the curious science reader and the technical computational reader. The chapters build in mathematical sophistication, so that the first five are accessible to the general academic reader. While other chapters are much more mathematical in nature, each one contains something for both audiences. For example, the authors include entertaining asides such as how search engines make money and how the Great Firewall of China influences research. The book includes an extensive background chapter designed to help readers learn more about the mathematics of search engines, and it contains several MATLAB codes and links to sample web data sets. The philosophy throughout is to encourage readers to experiment with the ideas and algorithms in the text. Any business seriously interested in improving its rankings in the major search engines can benefit from the clear examples, sample code, and list of resources provided. Many illustrative examples and entertaining asides MATLAB code Accessible and informal style Complete and self-contained section for mathematics review