Other affiliations: Microsoft
Bio: Gui-Rong Xue is an academic researcher from Shanghai Jiao Tong University. The author has contributed to research in topics: Web page & Web query classification. The author has an hindex of 34, co-authored 83 publications receiving 6904 citations. Previous affiliations of Gui-Rong Xue include Microsoft.
••20 Jun 2007
TL;DR: In this paper, the authors proposed a transfer learning framework called TrAdaBoost, which allows users to utilize a small amount of newly labeled data to leverage the old data to construct a high-quality classification model for the new data.
Abstract: Traditional machine learning makes a basic assumption: the training and test data should be under the same distribution. However, in many cases, this identical-distribution assumption does not hold. The assumption might be violated when a task from one new domain comes, while there are only labeled data from a similar old domain. Labeling the new data can be costly and it would also be a waste to throw away all the old data. In this paper, we present a novel transfer learning framework called TrAdaBoost, which extends boosting-based learning algorithms (Freund & Schapire, 1997). TrAdaBoost allows users to utilize a small amount of newly labeled data to leverage the old data to construct a high-quality classification model for the new data. We show that this method can allow us to learn an accurate model using only a tiny amount of new data and a large amount of old data, even when the new data are not sufficient to train a model alone. We show that TrAdaBoost allows knowledge to be effectively transferred from the old data to the new. The effectiveness of our algorithm is analyzed theoretically and empirically to show that our iterative algorithm can converge well to an accurate model.
15 Aug 2005
TL;DR: In this paper, clusters generated from the training data provide the basis for data smoothing and neighborhood selection and show that the new proposed approach consistently outperforms other state-of-art collaborative filtering algorithms.
Abstract: Memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. In the past, the memory-based approach has been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. Alternatively, the model-based approach has been proposed to alleviate these problems, but this approach tends to limit the range of users. In this paper, we present a novel approach that combines the advantages of these two approaches by introducing a smoothing-based method. In our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. As a result, we provide higher accuracy as well as increased efficiency in recommendations. Empirical studies on two datasets (EachMovie and MovieLens) show that our new proposed approach consistently outperforms other state-of-art collaborative filtering algorithms.
••08 May 2007
TL;DR: Preliminary experimental results show that SSR can find the latent semantic association between queries and annotations, while SPR successfully measures the quality of a webpage from the web users' perspective.
Abstract: This paper explores the use of social annotations to improve websearch. Nowadays, many services, e.g. del.icio.us, have been developed for web users to organize and share their favorite webpages on line by using social annotations. We observe that the social annotations can benefit web search in two aspects: 1) the annotations are usually good summaries of corresponding webpages; 2) the count of annotations indicates the popularity of webpages. Two novel algorithms are proposed to incorporate the above information into page ranking: 1) SocialSimRank (SSR)calculates the similarity between social annotations and webqueries; 2) SocialPageRank (SPR) captures the popularity of webpages. Preliminary experimental results show that SSR can find the latent semantic association between queries and annotations, while SPR successfully measures the quality (popularity) of a webpage from the web users' perspective. We further evaluate the proposed methods empirically with 50 manually constructed queries and 3000 auto-generated queries on a dataset crawledfrom delicious. Experiments show that both SSR and SPRbenefit web search significantly.
•22 Jul 2007
TL;DR: This paper proposes a novel transfer-learning algorithm for text classification based on an EM-based Naive Bayes classifiers and shows that the algorithm outperforms the traditional supervised and semi-supervised learning algorithms when the distributions of the training and test sets are increasingly different.
Abstract: A basic assumption in traditional machine learning is that the training and test data distributions should be identical. This assumption may not hold in many situations in practice, but we may be forced to rely on a different-distribution data to learn a prediction model. For example, this may be the case when it is expensive to label the data in a domain of interest, although in a related but different domain there may be plenty of labeled data available. In this paper, we propose a novel transfer-learning algorithm for text classification based on an EM-based Naive Bayes classifiers. Our solution is to first estimate the initial probabilities under a distribution Dl of one labeled data set, and then use an EM algorithm to revise the model for a different distribution Du of the test data which are unlabeled. We show that our algorithm is very effective in several different pairs of domains, where the distances between the different distributions are measured using the Kullback-Leibler (KL) divergence. Moreover, KL-divergence is used to decide the trade-off parameters in our algorithm. In the experiment, our algorithm outperforms the traditional supervised and semi-supervised learning algorithms when the distributions of the training and test sets are increasingly different.
••12 Aug 2007
TL;DR: This paper proposes a co-clustering based classification (CoCC) algorithm, used as a bridge to propagate the class structure and knowledge from the in-domain to the out-of-domain, and shows that the algorithm greatly improves the classification performance over the traditional learning algorithms.
Abstract: In many real world applications, labeled data are in short supply. It often happens that obtaining labeled data in a new domain is expensive and time consuming, while there may be plenty of labeled data from a related but different domain. Traditional machine learning is not able to cope well with learning across different domains. In this paper, we address this problem for a text-mining task, where the labeled data are under one distribution in one domain known as in-domain data, while the unlabeled data are under a related but different domain known as out-of-domain data. Our general goal is to learn from the in-domain and apply the learned knowledge to out-of-domain. We propose a co-clustering based classification (CoCC) algorithm to tackle this problem. Co-clustering is used as a bridge to propagate the class structure and knowledge from the in-domain to the out-of-domain. We present theoretical and empirical analysis to show that our algorithm is able to produce high quality classification results, even when the distributions between the two data are different. The experimental results show that our algorithm greatly improves the classification performance over the traditional learning algorithms.
TL;DR: The relationship between transfer learning and other related machine learning techniques such as domain adaptation, multitask learning and sample selection bias, as well as covariate shift are discussed.
Abstract: A major assumption in many machine learning and data mining algorithms is that the training and future data must be in the same feature space and have the same distribution. However, in many real-world applications, this assumption may not hold. For example, we sometimes have a classification task in one domain of interest, but we only have sufficient training data in another domain of interest, where the latter data may be in a different feature space or follow a different data distribution. In such cases, knowledge transfer, if done successfully, would greatly improve the performance of learning by avoiding much expensive data-labeling efforts. In recent years, transfer learning has emerged as a new learning framework to address this problem. This survey focuses on categorizing and reviewing the current progress on transfer learning for classification, regression, and clustering problems. In this survey, we discuss the relationship between transfer learning and other related machine learning techniques such as domain adaptation, multitask learning and sample selection bias, as well as covariate shift. We also explore some potential future issues in transfer learning research.
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).
•08 Jul 2008
TL;DR: This survey covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems and focuses on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis.
Abstract: An important part of our information-gathering behavior has always been to find out what other people think. With the growing availability and popularity of opinion-rich resources such as online review sites and personal blogs, new opportunities and challenges arise as people now can, and do, actively use information technologies to seek out and understand the opinions of others. The sudden eruption of activity in the area of opinion mining and sentiment analysis, which deals with the computational treatment of opinion, sentiment, and subjectivity in text, has thus occurred at least in part as a direct response to the surge of interest in new systems that deal directly with opinions as a first-class object. This survey covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems. Our focus is on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis. We include material on summarization of evaluative text and on broader issues regarding privacy, manipulation, and economic impact that the development of opinion-oriented information-access services gives rise to. To facilitate future work, a discussion of available resources, benchmark datasets, and evaluation campaigns is also provided.
TL;DR: Almost 300 key theoretical and empirical contributions in the current decade related to image retrieval and automatic image annotation are surveyed, and the spawning of related subfields are discussed, to discuss the adaptation of existing image retrieval techniques to build systems that can be useful in the real world.
Abstract: We have witnessed great interest and a wealth of promise in content-based image retrieval as an emerging technology. While the last decade laid foundation to such promise, it also paved the way for a large number of new techniques and systems, got many new people involved, and triggered stronger association of weakly related fields. In this article, we survey almost 300 key theoretical and empirical contributions in the current decade related to image retrieval and automatic image annotation, and in the process discuss the spawning of related subfields. We also discuss significant challenges involved in the adaptation of existing image retrieval techniques to build systems that can be useful in the real world. In retrospect of what has been achieved so far, we also conjecture what the future may hold for image retrieval research.
TL;DR: From basic techniques to the state-of-the-art, this paper attempts to present a comprehensive survey for CF techniques, which can be served as a roadmap for research and practice in this area.
Abstract: As one of the most successful approaches to building recommender systems, collaborative filtering (CF) uses the known preferences of a group of users to make recommendations or predictions of the unknown preferences for other users. In this paper, we first introduce CF tasks and their main challenges, such as data sparsity, scalability, synonymy, gray sheep, shilling attacks, privacy protection, etc., and their possible solutions. We then present three main categories of CF techniques: memory-based, modelbased, and hybrid CF algorithms (that combine CF with other recommendation techniques), with examples for representative algorithms of each category, and analysis of their predictive performance and their ability to address the challenges. From basic techniques to the state-of-the-art, we attempt to present a comprehensive survey for CF techniques, which can be served as a roadmap for research and practice in this area.