
Showing papers by "Tianbao Yang published in 2012"


Proceedings Article
03 Dec 2012
TL;DR: It is shown that when there is a large gap in the eigen-spectrum of the kernel matrix, approaches based on the Nystrom method can yield an impressively better generalization error bound than approaches based on random Fourier features.
Abstract: Both random Fourier features and the Nystrom method have been successfully applied to efficient kernel learning. In this work, we investigate the fundamental difference between these two approaches, and how the difference could affect their generalization performances. Unlike approaches based on random Fourier features where the basis functions (i.e., cosine and sine functions) are sampled from a distribution independent from the training data, basis functions used by the Nystrom method are randomly sampled from the training examples and are therefore data dependent. By exploring this difference, we show that when there is a large gap in the eigen-spectrum of the kernel matrix, approaches based on the Nystrom method can yield an impressively better generalization error bound than the random Fourier features based approach. We empirically verify our theoretical findings on a wide range of large data sets.
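To make the contrast concrete, the sketch below (an illustration, not the authors' code) builds both approximations for an RBF kernel: Nystrom features computed from columns sampled from the data itself, versus random Fourier features whose frequencies are drawn independently of the data. Function names, the kernel choice, and parameters such as m and gamma are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * d2)

def nystrom_features(X, m=100, gamma=1.0, seed=0):
    # Basis functions are sampled from the training examples (data dependent).
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    C = rbf_kernel(X, X[idx], gamma)           # n x m
    W = rbf_kernel(X[idx], X[idx], gamma)      # m x m
    U, s, _ = np.linalg.svd(W)
    W_inv_sqrt = U @ np.diag(1.0 / np.sqrt(np.maximum(s, 1e-12))) @ U.T
    return C @ W_inv_sqrt                      # Z with Z @ Z.T approximating K

def random_fourier_features(X, m=100, gamma=1.0, seed=0):
    # Basis functions (random cosines) are drawn independently of the data.
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], m))
    b = rng.uniform(0, 2 * np.pi, size=m)
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

# Compare the approximation error of the two feature maps on toy data.
X = np.random.default_rng(1).normal(size=(500, 10))
K = rbf_kernel(X, X)
for Z in (nystrom_features(X), random_fourier_features(X)):
    print(np.linalg.norm(K - Z @ Z.T) / np.linalg.norm(K))
```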

328 citations


Proceedings Article
01 Jan 2012
TL;DR: It is shown that for linear and general smooth convex loss functions, an online algorithm modified from the gradient descent algorithm can achieve a regret that scales only as the square root of the deviation, and that for strictly convex loss functions a regret that is only logarithmic in the deviation is achievable, which as an application also yields such a logarithmic regret for the portfolio management problem.
Abstract: We study the online convex optimization problem, in which an online algorithm has to make repeated decisions with convex loss functions and hopes to achieve a small regret. We consider a natural restriction of this problem in which the loss functions have a small deviation, measured by the sum of the distances between every two consecutive loss functions, according to some distance metrics. We show that for the linear and general smooth convex loss functions, an online algorithm modified from the gradient descent algorithm can achieve a regret which only scales as the square root of the deviation. For the closely related problem of prediction with expert advice, we show that an online algorithm modified from the multiplicative update algorithm can also achieve a similar regret bound for a different measure of deviation. Finally, for loss functions which are strictly convex, we show that an online algorithm modified from the online Newton step algorithm can achieve a regret which is only logarithmic in terms of the deviation, and as an application, we can also have such a logarithmic regret for the portfolio management problem.
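For reference, here is a minimal sketch of the unmodified online gradient descent baseline that the paper's deviation-adaptive variant builds on; the adaptive step sizes and the deviation-dependent analysis from the paper are not reproduced, and all names and parameters below are illustrative.

```python
import numpy as np

def online_gradient_descent(grad_oracles, project, x0, eta=1.0):
    """grad_oracles: one gradient oracle per round; project: Euclidean
    projection onto the decision set."""
    x = x0.copy()
    decisions = []
    for t, grad in enumerate(grad_oracles, start=1):
        decisions.append(x.copy())
        x = project(x - (eta / np.sqrt(t)) * grad(x))   # 1/sqrt(t) step size
    return decisions

# Toy usage: linear losses f_t(x) = <z_t, x> over the unit Euclidean ball.
rng = np.random.default_rng(0)
zs = [rng.normal(size=5) for _ in range(100)]
grad_oracles = [(lambda x, z=z: z) for z in zs]
project = lambda x: x / max(1.0, float(np.linalg.norm(x)))
decisions = online_gradient_descent(grad_oracles, project, np.zeros(5))
```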

229 citations


Journal Article
TL;DR: This paper proposes an efficient algorithm which achieves an O(√T) regret bound and an O(T^{3/4}) bound on the violation of constraints, and proposes a multipoint bandit feedback algorithm with the same bounds in expectation as the first algorithm.
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K, be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, we propose an efficient algorithm which achieves an O(√T) regret bound and an O(T^{3/4}) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting an O(T^{3/4}) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities, which achieves an O(T^{2/3}) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm.
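The structural idea of the convex-concave reformulation can be sketched as a primal descent / dual ascent loop; the step sizes and the exact Lagrangian regularization that give the stated bounds are in the paper, so the version below is only a hedged illustration with made-up parameter values.

```python
import numpy as np

def primal_dual_oco(grad_fs, g, grad_g, x0, T, eta=0.05):
    """grad_fs[t]: gradient oracle of the round-t loss; g(x) <= 0 defines K."""
    x, lam = x0.copy(), 0.0
    decisions = []
    for t in range(T):
        decisions.append(x.copy())
        violation = g(x)
        # primal descent on f_t(x) + lam * g(x), no projection onto K
        x = x - eta * (grad_fs[t](x) + lam * grad_g(x))
        # dual ascent on the constraint violation
        lam = max(0.0, lam + eta * violation)
    return decisions

# Toy usage: losses f_t(x) = <z_t, x>, constraint g(x) = ||x||_1 - 1 <= 0.
rng = np.random.default_rng(0)
zs = [rng.normal(size=5) for _ in range(200)]
grad_fs = [(lambda x, z=z: z) for z in zs]
g = lambda x: float(np.sum(np.abs(x)) - 1.0)
grad_g = lambda x: np.sign(x)          # a subgradient of the L1 constraint
decisions = primal_dual_oco(grad_fs, g, grad_g, np.zeros(5), T=200)
```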

127 citations


Proceedings Article
03 Dec 2012
TL;DR: A new approach for clustering is proposed, called semi-crowdsourced clustering, that effectively combines the low-level features of objects with the manual annotations of a subset of the objects obtained via crowdsourcing, and it outperforms state-of-the-art distance metric learning algorithms in both clustering accuracy and computational efficiency.
Abstract: One of the main challenges in data clustering is to define an appropriate similarity measure between two objects. Crowdclustering addresses this challenge by defining the pairwise similarity based on the manual annotations obtained through crowdsourcing. Despite its encouraging results, a key limitation of crowdclustering is that it can only cluster objects when their manual annotations are available. To address this limitation, we propose a new approach for clustering, called semi-crowdsourced clustering, that effectively combines the low-level features of objects with the manual annotations of a subset of the objects obtained via crowdsourcing. The key idea is to learn an appropriate similarity measure, based on the low-level features of objects and the manual annotations of only a small portion of the data to be clustered. One difficulty in learning the pairwise similarity measure is that there is a significant amount of noise and inter-worker variation in the manual annotations obtained via crowdsourcing. We address this difficulty by developing a metric learning algorithm based on the matrix completion method. Our empirical study with two real-world image data sets shows that the proposed algorithm outperforms state-of-the-art distance metric learning algorithms in both clustering accuracy and computational efficiency.

82 citations


Proceedings ArticleDOI
10 Dec 2012
TL;DR: The proposed algorithm constructs a partially observed similarity matrix based on the data pairs whose cluster memberships are agreed upon by most of the clustering algorithms in the ensemble, and deploys the matrix completion algorithm to complete the similarity matrix.
Abstract: Data clustering is an important task and has found applications in numerous real-world problems. Since no single clustering algorithm is able to identify all different types of cluster shapes and structures, ensemble clustering was proposed to combine different partitions of the same data generated by multiple clustering algorithms. The key idea of most ensemble clustering algorithms is to find a partition that is consistent with most of the available partitions of the input data. One problem with these algorithms is their inability to handle uncertain data pairs, i.e. data pairs for which about half of the partitions put them into the same cluster and the other half do the opposite. When the number of uncertain data pairs is large, they can mislead the ensemble clustering algorithm in generating the final partition. To overcome this limitation, we propose an ensemble clustering approach based on the technique of matrix completion. The proposed algorithm constructs a partially observed similarity matrix based on the data pairs whose cluster memberships are agreed upon by most of the clustering algorithms in the ensemble. It then deploys the matrix completion algorithm to complete the similarity matrix. The final data partition is computed by applying an efficient spectral clustering algorithm to the completed matrix. Our empirical studies with multiple real-world datasets show that the proposed algorithm performs significantly better than the state-of-the-art algorithms for ensemble clustering.
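A hedged sketch of that pipeline: build a partially observed similarity matrix from pairs on which most base partitions agree, fill it in with a simple low-rank iteration (a stand-in for the matrix completion solver actually used), and cluster the completed matrix spectrally. The agreement threshold, the rank, the completion routine, and the use of scikit-learn's SpectralClustering are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def ensemble_cluster(partitions, n_clusters, agree=0.8, rank=None, iters=50):
    partitions = np.asarray(partitions)            # (n_partitions, n_points)
    same = np.mean(partitions[:, :, None] == partitions[:, None, :], axis=0)
    observed = (same >= agree) | (same <= 1 - agree)   # confident pairs only
    target = (same >= agree).astype(float)             # entries fixed to 1 or 0
    rank = rank or n_clusters
    S = np.where(observed, target, 0.5)
    for _ in range(iters):
        # project onto a low-rank approximation, then restore observed entries
        U, s, Vt = np.linalg.svd(S, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        S = np.where(observed, target, np.clip(low_rank, 0.0, 1.0))
    S = (S + S.T) / 2                               # symmetric affinity matrix
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(S)
```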

80 citations


Proceedings Article
03 Dec 2012
TL;DR: This work develops novel stochastic optimization algorithms that do not need intermediate projections, requiring only one projection at the last iteration to obtain a feasible solution in the given domain, and shows that they achieve an O(1/√T) convergence rate for general convex optimization and an O(ln T/T) rate for strongly convex optimization under mild conditions on the domain and the objective function.
Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, the proposed algorithms achieve an O(1/√T) convergence rate for general convex optimization, and an O(ln T/T) rate for strongly convex optimization under mild conditions about the domain and the objective function.
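A hedged sketch of the structure: stochastic gradient steps that penalize leaving the domain instead of projecting each iterate, followed by a single projection of the averaged solution at the end. The penalty form, step sizes, and constants that yield the stated rates are the paper's; everything below is illustrative.

```python
import numpy as np

def sgd_one_projection(stoch_grad, g, grad_g, project, x0, T, eta=0.01, lam=1.0):
    """g(x) <= 0 defines the feasible domain; `project` is called exactly once."""
    x = x0.copy()
    avg = np.zeros_like(x0)
    for _ in range(T):
        # stochastic gradient of the loss plus a penalty pulling back toward the domain
        d = stoch_grad(x) + lam * max(0.0, g(x)) * grad_g(x)
        x = x - eta * d                  # no projection inside the loop
        avg += x
    return project(avg / T)              # single projection of the averaged iterate
```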

61 citations


Posted Content
TL;DR: The theoretical analysis shows that with a high probability, the proposed algorithm is able to accurately recover the optimal solution to the original problem, provided that the data matrix is of low rank or can be well approximated by a low rank matrix.
Abstract: Random projection has been widely used in data classification. It maps high-dimensional data into a low-dimensional subspace in order to reduce the computational cost in solving the related optimization problem. While previous studies are focused on analyzing the classification performance of using random projection, in this work, we consider the recovery problem, i.e., how to accurately recover the optimal solution to the original optimization problem in the high-dimensional space based on the solution learned from the subspace spanned by random projections. We present a simple algorithm, termed Dual Random Projection, that uses the dual solution of the low-dimensional optimization problem to recover the optimal solution to the original problem. Our theoretical analysis shows that with a high probability, the proposed algorithm is able to accurately recover the optimal solution to the original problem, provided that the data matrix is of low rank or can be well approximated by a low rank matrix.
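A hedged sketch for an L2-regularized squared loss: solve the problem in the randomly projected space, read dual variables off that low-dimensional solution, and recover the high-dimensional primal solution through the data matrix. The recovery step w = -(1/lam) X^T alpha follows from the optimality conditions of the regularized problem; the idea is the one described in the abstract, but the concrete loss, names, and parameters below are illustrative.

```python
import numpy as np

def dual_random_projection(X, y, lam=1.0, m=50, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    R = rng.normal(scale=1.0 / np.sqrt(m), size=(d, m))
    Xr = X @ R                                    # n x m projected data
    # Low-dimensional ridge problem: min_z lam/2 ||z||^2 + 1/2 ||Xr z - y||^2
    z = np.linalg.solve(lam * np.eye(m) + Xr.T @ Xr, Xr.T @ y)
    # Dual variables from the low-dimensional solution: alpha_i = l'(z^T xr_i, y_i)
    alpha = Xr @ z - y
    # Recover the high-dimensional primal solution from the dual variables
    return -X.T @ alpha / lam

# Compare with the exact high-dimensional ridge solution on low-rank data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 1000))   # rank ~ 5
y = rng.normal(size=200)
w_exact = np.linalg.solve(np.eye(1000) + X.T @ X, X.T @ y)   # lam = 1
w_rec = dual_random_projection(X, y, lam=1.0, m=50)
print(np.linalg.norm(w_rec - w_exact) / np.linalg.norm(w_exact))
```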

59 citations


Posted Content
TL;DR: In this paper, a primal dual prox method was proposed to solve non-smooth optimization problems in machine learning, where the dual form of the loss function is bilinear in primal and dual variables.
Abstract: We study non-smooth optimization problems in machine learning, where both the loss function and the regularizer are non-smooth functions. Previous studies on efficient empirical loss minimization assume either a smooth loss function or a strongly convex regularizer, making them unsuitable for non-smooth optimization. We develop a simple yet efficient method for a family of non-smooth optimization problems where the dual form of the loss function is bilinear in the primal and dual variables. We cast a non-smooth optimization problem into a minimax optimization problem, and develop a primal dual prox method that solves the minimax optimization problem at a rate of $O(1/T)$, assuming that the proximal step can be efficiently solved, significantly faster than a standard subgradient descent method that has an $O(1/\sqrt{T})$ convergence rate. Our empirical study verifies the efficiency of the proposed method for various non-smooth optimization problems that arise ubiquitously in machine learning by comparing it to the state-of-the-art first order methods.
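A hedged sketch on one instance of that family, hinge loss with an L1 regularizer, where the loss is bilinear in the primal weights and the dual variables. The paper's method uses specific extrapolation and step sizes to obtain the O(1/T) rate; the loop below only shows the alternating prox structure, with illustrative parameters.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def primal_dual_prox(X, y, rho=0.1, T=500, eta=0.1):
    """min_w max_{0<=a<=1} (1/n) sum_i a_i (1 - y_i w^T x_i) + rho * ||w||_1"""
    n, d = X.shape
    w, a = np.zeros(d), np.full(n, 0.5)
    w_avg = np.zeros(d)
    for _ in range(T):
        # dual ascent step on a (gradient of the bilinear term), then clip to [0, 1]
        a = np.clip(a + eta * (1.0 - y * (X @ w)) / n, 0.0, 1.0)
        # primal descent step on w, then prox of the L1 regularizer
        grad_w = -(X.T @ (a * y)) / n
        w = soft_threshold(w - eta * grad_w, eta * rho)
        w_avg += w
    return w_avg / T   # averaged primal solution
```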

27 citations


Posted Content
TL;DR: In this paper, a simple algorithm for semi-supervised regression is proposed, which uses the top eigenfunctions of the integral operator derived from both labeled and unlabeled examples as the basis functions and learns the prediction function by a simple linear regression.
Abstract: In this work, we develop a simple algorithm for semi-supervised regression. The key idea is to use the top eigenfunctions of the integral operator derived from both labeled and unlabeled examples as the basis functions and learn the prediction function by a simple linear regression. We show that under appropriate assumptions about the integral operator, this approach achieves a regression error bound better than existing bounds for supervised learning. We also verify the effectiveness of the proposed algorithm by an empirical study.
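A hedged sketch of the recipe: form a kernel matrix over labeled and unlabeled examples together, take its top eigenvectors as empirical estimates of the eigenfunctions of the integral operator, and fit ordinary least squares on the labeled rows. The kernel, its bandwidth, and the number k of eigenfunctions are illustrative assumptions, not the paper's choices.

```python
import numpy as np

def semi_supervised_regression(X_lab, y_lab, X_unlab, k=10, gamma=1.0):
    X = np.vstack([X_lab, X_unlab])
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    K = np.exp(-gamma * d2)                        # kernel over all examples
    vals, vecs = np.linalg.eigh(K)
    basis = vecs[:, -k:]                           # top-k eigenvectors as basis
    n_lab = len(X_lab)
    # linear regression of the labels on the basis, using labeled rows only
    coef, *_ = np.linalg.lstsq(basis[:n_lab], y_lab, rcond=None)
    preds_all = basis @ coef                       # predictions for every point
    return preds_all[n_lab:]                       # predictions on unlabeled data
```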

24 citations


Proceedings Article
26 Jun 2012
TL;DR: This work casts multiple kernel learning from noisy labels into a stochastic programming problem, presents a minimax formulation, and develops an efficient algorithm for solving the related convex-concave optimization problem with a fast convergence rate of O(1/T), where T is the number of iterations.
Abstract: We study the problem of multiple kernel learning from noisy labels. This is in contrast to most of the previous studies on multiple kernel learning that mainly focus on developing efficient algorithms and assume perfectly labeled training examples. Directly applying the existing multiple kernel learning algorithms to noisily labeled examples often leads to suboptimal performance due to the incorrect class assignments. We address this challenge by casting multiple kernel learning from noisy labels into a stochastic programming problem, and presenting a minimax formulation. We develop an efficient algorithm for solving the related convex-concave optimization problem with a fast convergence rate of O(1/T) where T is the number of iterations. Empirical studies on UCI data sets verify both the effectiveness and the efficiency of the proposed algorithm.

23 citations


Proceedings Article
22 Jul 2012
TL;DR: This paper proposes an efficient online kernel selection algorithm that incrementally learns a weight for each kernel classifier and has a theoretically guaranteed performance compared to the best kernel predictor.
Abstract: Kernel methods have been successfully applied to many machine learning problems. Nevertheless, since the performance of kernel methods depends heavily on the type of kernels being used, identifying good kernels among a set of given kernels is important to the success of kernel methods. A straightforward approach to address this problem is cross-validation, training a separate classifier for each kernel and choosing the best of them. Another approach is Multiple Kernel Learning (MKL), which aims to learn a single kernel classifier from an optimal combination of multiple kernels. However, both approaches suffer from a high computational cost in computing the full kernel matrices and in training, especially when the number of kernels or the number of training examples is very large. In this paper, we tackle this problem by proposing an efficient online kernel selection algorithm. It incrementally learns a weight for each kernel classifier, and these weights help select a good kernel among the given kernels. The proposed approach is efficient in that (i) it is an online approach and therefore avoids computing all the full kernel matrices before training; (ii) it only updates a single kernel classifier each time by a sampling technique and therefore saves time otherwise spent updating kernel classifiers with poor performance; (iii) it has a theoretically guaranteed performance relative to the best kernel predictor. Empirical studies on image classification tasks demonstrate the effectiveness of the proposed approach for selecting a good kernel among a set of kernels.
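A hedged sketch of an algorithm of this flavor: one multiplicative weight per kernel, a single sampled kernel classifier updated perceptron-style per round, with an importance-weighted loss feeding the weight update. The exact sampling smoothing, weight update, and guarantees are the paper's; the code below is only illustrative.

```python
import numpy as np

def online_kernel_selection(stream, kernels, eta=0.5, delta=0.1, seed=0):
    """stream: iterable of (x, y) with y in {-1, +1}; kernels: list of k(x1, x2)."""
    rng = np.random.default_rng(seed)
    m = len(kernels)
    weights = np.ones(m)
    supports = [[] for _ in range(m)]   # (x, y) pairs defining each kernel classifier
    for x, y in stream:
        p = (1 - delta) * weights / weights.sum() + delta / m   # smoothed sampling
        i = rng.choice(m, p=p)                                   # sample one kernel
        score = sum(yi * kernels[i](xi, x) for xi, yi in supports[i])
        mistake = float(np.sign(score) != y)
        if mistake:
            supports[i].append((x, y))                # perceptron-style update
        weights[i] *= np.exp(-eta * mistake / p[i])   # importance-weighted loss
    best = int(np.argmax(weights))
    return best, supports[best]
```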

Proceedings Article
26 Jun 2012
TL;DR: A simple algorithm is developed that uses the top eigenfunctions of the integral operator derived from both labeled and unlabeled examples as the basis functions and learns the prediction function by a simple linear regression, achieving a regression error bound better than existing bounds for supervised learning.
Abstract: In this work, we develop a simple algorithm for semi-supervised regression. The key idea is to use the top eigenfunctions of the integral operator derived from both labeled and unlabeled examples as the basis functions and learn the prediction function by a simple linear regression. We show that under appropriate assumptions about the integral operator, this approach achieves a regression error bound better than existing bounds for supervised learning. We also verify the effectiveness of the proposed algorithm by an empirical study.

Posted Content
Tianbao Yang, Mehrdad Mahdavi, Rong Jin, Lijun Zhang, Yang Zhou
TL;DR: In this paper, the problem of multiple kernel learning from noisy labels is studied, and a convex-concave optimization algorithm is proposed to solve it with a fast convergence rate of O(1/T), where T is the number of iterations.
Abstract: We study the problem of multiple kernel learning from noisy labels. This is in contrast to most of the previous studies on multiple kernel learning that mainly focus on developing efficient algorithms and assume perfectly labeled training examples. Directly applying the existing multiple kernel learning algorithms to noisily labeled examples often leads to suboptimal performance due to the incorrect class assignments. We address this challenge by casting multiple kernel learning from noisy labels into a stochastic programming problem, and presenting a minimax formulation. We develop an efficient algorithm for solving the related convex-concave optimization problem with a fast convergence rate of $O(1/T)$ where $T$ is the number of iterations. Empirical studies on UCI data sets verify both the effectiveness of the proposed framework and the efficiency of the proposed optimization algorithm.

Posted Content
TL;DR: A novel primal-dual stochastic approximation algorithm which attains the optimal convergence rate of $O(T^{-1/2})$ for general Lipschitz continuous objectives is devised.
Abstract: In this paper we propose a general framework to characterize and solve stochastic optimization problems with multiple objectives underlying many real-world learning applications. We first propose a projection-based algorithm which attains an $O(T^{-1/3})$ convergence rate. Then, by leveraging the Lagrangian method in constrained optimization, we devise a novel primal-dual stochastic approximation algorithm which attains the optimal convergence rate of $O(T^{-1/2})$ for general Lipschitz continuous objectives.

Posted Content
TL;DR: The Lagrangian exponentially weighted average (LEWA) algorithm is proposed, a primal-dual variant of the well-known exponentially weighted average algorithm, to efficiently solve constrained online decision making problems.
Abstract: Online learning constitutes a mathematical and compelling framework to analyze sequential decision making problems in adversarial environments. The learner repeatedly chooses an action, the environment responds with an outcome, and then the learner receives a reward for the played action. The goal of the learner is to maximize the total reward. However, there are situations in which, in addition to maximizing the cumulative reward, there are some additional constraints on the sequence of decisions that must be satisfied on average by the learner. In this paper we study an extension of online learning where the learner aims to maximize the total reward given that some additional constraints need to be satisfied. By leveraging the Lagrangian method in constrained optimization, we propose the Lagrangian exponentially weighted average (LEWA) algorithm, a primal-dual variant of the well-known exponentially weighted average algorithm, to efficiently solve constrained online decision making problems. Using novel theoretical analysis, we establish regret bounds and constraint violation bounds in the full information and bandit feedback models.
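A hedged sketch of a primal-dual exponentially weighted average of this kind: expert weights are updated on reward minus a Lagrangian penalty, and the multiplier grows when the played distribution's expected cost violates the constraint. The precise updates, step sizes, and regret/violation guarantees are those of the paper; the names and parameters below are illustrative only.

```python
import numpy as np

def lewa_sketch(rewards, costs, eta=0.1, mu=0.1):
    """rewards, costs: arrays of shape (T, n_experts); constraint: average cost <= 0."""
    T, n = rewards.shape
    logw = np.zeros(n)
    lam = 0.0
    plays = []
    for t in range(T):
        p = np.exp(logw - logw.max())
        p /= p.sum()
        plays.append(p)
        # primal: multiplicative update on reward minus the Lagrangian penalty
        logw += eta * (rewards[t] - lam * costs[t])
        # dual: increase the multiplier when the expected cost is positive
        lam = max(0.0, lam + mu * float(p @ costs[t]))
    return plays, lam
```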

Posted Content
TL;DR: An improved bound for the approximation error of the Nyström method is developed under the assumption that there is a large eigengap in the spectrum of the kernel matrix.
Abstract: We develop an improved bound for the approximation error of the Nystrom method under the assumption that there is a large eigengap in the spectrum of kernel matrix. This is based on the empirical observation that the eigengap has a significant impact on the approximation error of the Nystrom method. Our approach is based on the concentration inequality of integral operator and the theory of matrix perturbation. Our analysis shows that when there is a large eigengap, we can improve the approximation error of the Nystrom method from $O(N/m^{1/4})$ to $O(N/m^{1/2})$ when measured in Frobenius norm, where $N$ is the size of the kernel matrix, and $m$ is the number of sampled columns.

Posted Content
TL;DR: This study conducts an extensive investigation of influence among bloggers at a Japanese blog web site, BIGLOBE, and proposes a principled framework to detect influence among its members with a high confidence level.
Abstract: In this paper we analyze influence in the blogosphere. Recently, influence analysis has become an increasingly important research topic, as online communities, such as social networks and e-commerce sites, play a more and more significant role in our daily life. However, so far few studies have succeeded in extracting influence from online communities in a satisfactory way. One of the challenges that limited previous research is that it is difficult to capture user behaviors. Consequently, the influence among users could only be inferred in an indirect and heuristic way, which is inaccurate and noise-prone. In this study, we conduct an extensive investigation of influence among bloggers at a Japanese blog web site, BIGLOBE. By processing the log files of the web servers, we are able to accurately extract the activities of BIGLOBE members in terms of writing their own blog posts and reading other members' posts. Based on these activities, we propose a principled framework to detect influence among the members with a high confidence level. From the extracted influence, we conduct in-depth analysis on how influence varies over different topics and over different members. We also show the potential of leveraging the extracted influence to make personalized recommendations in BIGLOBE. To the best of our knowledge, this is one of the first studies to capture and analyze influence in the blogosphere at such a large scale.

01 Jan 2012
TL;DR: This dissertation develops Bayesian approaches that explicitly model noisy pairwise links by introducing additional hidden variables, besides community memberships, to explain potential inconsistencies between the pairwise connections and the pairwise class relationships, and it proposes a discriminative model for the first time.
Abstract: Machine learning is a discipline of developing computational algorithms for learning predictive models from data. Traditional learning methods treat the data as independent and identically distributed (i.i.d.) samples from unknown distributions. However, this assumption is often violated in many real-world applications, which makes learning predictive models challenging. For example, on an electronic commerce website, customers may purchase a product on the recommendation of their friends, so the purchase records of customers are not i.i.d. samples but are correlated. Nowadays, data become correlated due to collaborations, interactions, communications, and many other types of connections. Effective learning from these connected data not only provides a better understanding of the data but also brings significant economic benefits. Learning from connected data also poses unique challenges to both supervised and unsupervised learning algorithms, because these algorithms are designed for i.i.d. data and are often sensitive to noise in the connections. In this dissertation, I focus on developing theory and algorithms for learning from connected data. In particular, I consider two types of connections: the first type is naturally formed in real-world networks, while the second type is manually created to facilitate the learning process, namely must-links and cannot-links. In the first part of this dissertation, I develop efficient algorithms for detecting communities in the first type of connected data. In the second part, I develop clustering algorithms that effectively utilize both must-links and cannot-links for the second type of connected data. A common approach to learning from connected data is to assume that if two data points are connected, they are likely to be assigned to the same class/cluster. This assumption is often violated in real-world applications, leading to noisy connections. One key challenge of learning from connected data is how to model the noisy pairwise connections that indicate the pairwise class relationship between two data points. For the problem of detecting communities in networked data, I develop Bayesian approaches that explicitly model the noisy pairwise links by introducing additional hidden variables, besides community memberships, to explain potential inconsistencies between the pairwise connections and the pairwise class relationships. For clustering must-and-cannot-linked data, I model how noise is introduced into the pairwise connections during the manual generation process. The main contributions of this dissertation are: (i) it introduces, for the first time, popularity and productivity in addition to community memberships to model the generation of noisy links in real networks, and demonstrates the effectiveness of these factors through the task of community detection; (ii) it proposes, for the first time, a discriminative model that combines content and link analysis for detecting communities, to alleviate the impact of noisy connections in community detection; (iii) it presents a general approach for learning from noisily labeled data, proves theoretical convergence results for the first time, and applies the approach to clustering noisy must-and-cannot-linked data.

Posted Content
TL;DR: In this paper, a new link model is presented that introduces a random variable to capture the node popularity when deciding the link between two nodes; a discriminative model is used to determine the community membership of a node by its content.
Abstract: This paper addresses the problem of community detection in networked data by combining link and content analysis. Most existing work combines link and content information through a generative model. There are two major shortcomings with the existing approaches. First, they assume that the probability of creating a link between two nodes is determined only by the community memberships of the nodes; however, other factors (e.g. popularity) could also affect the link pattern. Second, they use generative models to model the content of individual nodes, whereas these generative models are vulnerable to content attributes that are irrelevant to communities. We propose a Bayesian framework for combining link and content information for community detection that explicitly addresses these shortcomings. A new link model is presented that introduces a random variable to capture the node popularity when deciding the link between two nodes; a discriminative model is used to determine the community membership of a node by its content. An approximate inference algorithm is presented for efficient Bayesian inference. Our empirical study shows that the proposed framework outperforms several state-of-the-art approaches in combining link and content information for community detection.

Proceedings ArticleDOI
04 Oct 2012
TL;DR: A probabilistic approach for learning the combination of multiple kernels is proposed, and it is shown that under appropriate assumptions, the combination weights learned by the proposed approach from the noisy pairwise constraints converge to the optimal weights learned from perfectly labeled pairwise constraints.
Abstract: We consider the problem of learning the combination of multiple kernels given noisy pairwise constraints, which is in contrast to most of the existing studies that assume perfect pairwise constraints. This problem is particularly important when the pairwise constraints are derived from side information such as hyperlinks and paper citations. We propose a probabilistic approach for learning the combination of multiple kernels and show that under appropriate assumptions, the combination weights learned by the proposed approach from the noisy pairwise constraints converge to the optimal weights learned from perfectly labeled pairwise constraints. Empirical studies on data clustering using the learned combined kernel verify the effectiveness of the proposed approach.