
Showing papers presented at the International Conference on Artificial Intelligence and Statistics (AISTATS) in 2010


Proceedings Article
31 Mar 2010
TL;DR: The objective is to understand why standard gradient descent from random initialization does so poorly with deep neural networks, in order to better understand the recent relative successes of new training schemes and help design better algorithms in the future.
Abstract: Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, which explains the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.

1 Deep Neural Networks. Deep learning methods aim at learning feature hierarchies, with features from higher levels of the hierarchy formed by the composition of lower-level features. Much attention has recently been devoted to them (see (Bengio, 2009) for a review), because of their theoretical appeal, inspiration from biology and human cognition, and because of empirical success in vision (Ranzato et al., 2007; Larochelle et al., 2007; Vincent et al., 2008) and natural language processing (NLP) (Collobert & Weston, 2008; Mnih & Hinton, 2009). Theoretical results reviewed and discussed by Bengio (2009) suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one may need deep architectures. Most of the recent experimental results with deep architectures are obtained with models that can be turned into deep supervised neural networks, but with initialization or training schemes different from the classical feedforward neural networks (Rumelhart et al., 1986). Why are these new algorithms working so much better than the standard random initialization and gradient-based optimization of a supervised training criterion? Part of the answer may be found in recent analyses of the effect of unsupervised pre-training (Erhan et al., 2009), showing that it acts as a regularizer that initializes the parameters in a “better” basin of attraction of the optimization procedure, corresponding to an apparent local minimum associated with better generalization. But earlier work (Bengio et al., 2007) had shown that even a purely supervised but greedy layer-wise procedure would give better results.
So here, instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old (but deep) multilayer neural networks. Our analysis is driven by investigative experiments to monitor activations (watching for saturation of hidden units) and gradients, across layers and across training iterations. We also evaluate how these are affected by the choice of activation function (with the idea that it might affect saturation) and of initialization procedure (since unsupervised pre-training is a particular form of initialization, and it has a drastic impact).
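The initialization scheme proposed here is what is now commonly called Xavier (or Glorot) initialization. A minimal sketch of the uniform variant, with illustrative function and variable names of our own choosing:

```python
import numpy as np

def normalized_init(fan_in, fan_out, rng=None):
    """Uniform 'normalized initialization':
    W ~ U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))],
    chosen to keep activation and gradient variances roughly constant across layers."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Example: weights for a 784 -> 500 tanh layer.
W = normalized_init(784, 500)
```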

9,500 citations


Proceedings Article
31 Mar 2010
TL;DR: A new estimation principle is presented to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise, using the model log-density function in the regression nonlinearity, which leads to a consistent (convergent) estimator of the parameters.
Abstract: We present a new estimation principle for parameterized statistical models. The idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise, using the model log-density function in the regression nonlinearity. We show that this leads to a consistent (convergent) estimator of the parameters, and analyze the asymptotic variance. In particular, the method is shown to directly work for unnormalized models, i.e. models where the density function does not integrate to one. The normalization constant can be estimated just like any other parameter. For a tractable ICA model, we compare the method with other estimation methods that can be used to learn unnormalized models, including score matching, contrastive divergence, and maximum-likelihood where the normalization constant is estimated with importance sampling. Simulations show that noise-contrastive estimation offers the best trade-off between computational and statistical efficiency. The method is then applied to the modeling of natural images: We show that the method can successfully estimate a large-scale two-layer model and a Markov random field.
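A minimal sketch of the noise-contrastive estimation objective for the balanced case (one noise sample per data point); `log_p_model` and `log_p_noise` are placeholder callables, and for an unnormalized model `log_p_model` would include the log normalization constant as an extra free parameter:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_objective(log_p_model, log_p_noise, data, noise):
    """Logistic regression that discriminates data from noise using the
    log-density ratio G(u) = log p_model(u) - log p_noise(u) as the regression
    score; maximizing this over the model parameters yields the estimator."""
    g_data = log_p_model(data) - log_p_noise(data)
    g_noise = log_p_model(noise) - log_p_noise(noise)
    return np.mean(np.log(sigmoid(g_data))) + np.mean(np.log(1.0 - sigmoid(g_noise)))
```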

1,736 citations


Proceedings Article
31 Mar 2010
TL;DR: This work proposes two alternative algorithms for imitation learning where training occurs over several episodes of interaction and shows that this leads to stronger performance guarantees and improved performance on two challenging problems: training a learner to play a 3D racing game and Mario Bros.
Abstract: Imitation Learning, while applied successfully on many large real-world problems, is typically addressed as a standard supervised learning problem, where it is assumed the training and testing data are i.i.d.. This is not true in imitation learning as the learned policy influences the future test inputs (states) upon which it will be tested. We show that this leads to compounding errors and a regret bound that grows quadratically in the time horizon of the task. We propose two alternative algorithms for imitation learning where training occurs over several episodes of interaction. These two approaches share in common that the learner’s policy is slowly modified from executing the expert’s policy to the learned policy. We show that this leads to stronger performance guarantees and demonstrate the improved performance on two challenging problems: training a learner to play 1) a 3D racing game (Super Tux Kart) and 2) Mario Bros.; given input images from the games and corresponding actions taken by a human expert and near-optimal planner respectively.
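A schematic sketch of the shared idea (gradually handing control from the expert to the learner while labeling visited states with the expert's actions); the environment and expert interfaces and the mixing schedule below are illustrative assumptions, not the paper's exact algorithms:

```python
import random

def train_by_interaction(env, expert, fit_classifier, n_iters=10, alpha=0.1, horizon=100):
    """Interactive imitation learning sketch: at iteration n the executed policy
    follows the expert with probability (1 - alpha)**n and the learned policy
    otherwise; every visited state is labeled with the expert's action and the
    classifier is refit on all data collected so far."""
    dataset = []      # (state, expert_action) pairs
    learned = None    # learned policy, None until the first fit
    for n in range(n_iters):
        p_expert = (1.0 - alpha) ** n
        state = env.reset()
        for _ in range(horizon):
            expert_action = expert(state)
            dataset.append((state, expert_action))
            if learned is None or random.random() < p_expert:
                action = expert_action
            else:
                action = learned(state)
            state, done = env.step(action)   # assumed interface: returns (next_state, done)
            if done:
                break
        learned = fit_classifier(dataset)
    return learned
```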

634 citations



Proceedings Article
31 Mar 2010
TL;DR: A new approximate inference algorithm for Deep Boltzmann Machines (DBM’s), a generative model with many layers of hidden variables, that learns a separate “recognition” model that is used to quickly initialize, in a single bottom-up pass, the values of the latent variables in all hidden layers.
Abstract: We present a new approximate inference algorithm for Deep Boltzmann Machines (DBM’s), a generative model with many layers of hidden variables. The algorithm learns a separate “recognition” model that is used to quickly initialize, in a single bottom-up pass, the values of the latent variables in all hidden layers. We show that using such a recognition model, followed by a combined top-down and bottom-up pass, it is possible to efficiently learn a good generative model of high-dimensional, highly-structured sensory input. We show that the additional computations required by incorporating top-down feedback play a critical role in the performance of a DBM, both as a generative and a discriminative model. Moreover, inference is at most three times slower than approximate inference in a Deep Belief Network (DBN), making large-scale learning of DBM’s practical. Finally, we demonstrate that DBM’s trained using the proposed approximate inference algorithm perform well compared to DBN’s and SVM’s on the MNIST handwritten digit, OCR English letters, and NORB visual object recognition tasks.
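A schematic sketch of this style of inference for a two-hidden-layer binary DBM (biases omitted; variable names and the two-layer setup are our illustrative assumptions): a separate set of recognition weights gives a single bottom-up initialization, followed by a few mean-field updates that combine bottom-up and top-down signals.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def approximate_inference(v, W1, W2, R1, R2, n_updates=3):
    """W1, W2: generative weights of a two-hidden-layer DBM; R1, R2: recognition
    weights. Returns mean-field estimates (mu1, mu2) of the hidden layers given v."""
    mu1 = sigmoid(v @ R1)          # fast bottom-up recognition pass
    mu2 = sigmoid(mu1 @ R2)
    for _ in range(n_updates):     # combined top-down / bottom-up refinement
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)
        mu2 = sigmoid(mu1 @ W2)
    return mu1, mu2
```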

374 citations


Proceedings Article
31 Mar 2010
TL;DR: In this article, a variational inference framework for training the Gaussian process latent variable model and thus performing Bayesian nonlinear dimensionality reduction is introduced, which can automatically select the dimensionality of the nonlinear latent space.
Abstract: We introduce a variational inference framework for training the Gaussian process latent variable model and thus performing Bayesian nonlinear dimensionality reduction. This method allows us to variationally integrate out the input variables of the Gaussian process and compute a lower bound on the exact marginal likelihood of the nonlinear latent variable model. The maximization of the variational lower bound provides a Bayesian training procedure that is robust to overfitting and can automatically select the dimensionality of the nonlinear latent space. We demonstrate our method on real world datasets. The focus in this paper is on dimensionality reduction problems, but the methodology is more general. For example, our algorithm is immediately applicable for training Gaussian process models in the presence of missing or uncertain inputs.
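As a rough sketch of the construction (our notation, not the paper's full derivation): with a Gaussian prior $p(X)$ over the latent inputs and a variational distribution $q(X)$, the quantity being maximized has the generic form

$$\log p(Y) \;\ge\; \mathbb{E}_{q(X)}\big[\log p(Y \mid X)\big] \;-\; \mathrm{KL}\big(q(X)\,\|\,p(X)\big),$$

where the GP mapping has already been integrated out of $p(Y \mid X)$; the technical contribution is making the expectation tractable for the GP likelihood so that the bound, and with it the latent dimensionality, can be optimized.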

338 citations


Proceedings Article
31 Mar 2010
TL;DR: In this paper, the necessary and sufficient conditions for a kernel to be universal were studied, and a relation between universal and characteristic kernels was established via the embedding of Borel measures into a reproducing kernel Hilbert space.
Abstract: Universal kernels have been shown to play an important role in the achievability of the Bayes risk by many kernel-based algorithms that include binary classification, regression, etc. In this paper, we propose a notion of universality that generalizes the notions introduced by Steinwart and Micchelli et al. and study the necessary and sufficient conditions for a kernel to be universal. We show that all these notions of universality are closely linked to the injective embedding of a certain class of Borel measures into a reproducing kernel Hilbert space (RKHS). By exploiting this relation between universality and the embedding of Borel measures into an RKHS, we establish the relation between universal and characteristic kernels. The latter have been proposed in the context of the RKHS embedding of probability measures, used in statistical applications like homogeneity testing, independence testing, etc.
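As a brief reminder of the embedding in question (standard notation, not specific to this paper), a Borel measure $\mu$ is mapped into the RKHS $\mathcal{H}_k$ via

$$\mu \;\mapsto\; \int_{\mathcal{X}} k(\cdot, x)\, d\mu(x) \;\in\; \mathcal{H}_k,$$

and a kernel is called characteristic when this map is injective on probability measures; roughly, the universality notions studied here correspond to injectivity of the same map over suitable classes of Borel measures.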

263 citations


Proceedings Article
31 Mar 2010
TL;DR: A factored 3-way RBM is proposed that uses the states of its hidden units to represent abnormalities in the local covariance structure of an image to provide a probabilistic framework for the widely used simple/complex cell architecture.
Abstract: Deep belief nets have been successful in modeling handwritten characters, but it has proved more difficult to apply them to real images. The problem lies in the restricted Boltzmann machine (RBM) which is used as a module for learning deep belief nets one layer at a time. The Gaussian-Binary RBMs that have been used to model real-valued data are not a good way to model the covariance structure of natural images. We propose a factored 3-way RBM that uses the states of its hidden units to represent abnormalities in the local covariance structure of an image. This provides a probabilistic framework for the widely used simple/complex cell architecture. Our model learns binary features that work very well for object recognition on the “tiny images” data set. Even better features are obtained by then using standard binary RBM’s to learn a deeper model.

249 citations


Proceedings Article
31 Mar 2010
TL;DR: This paper develops a probabilistic approach to this problem when annotators may be unreliable, but also their expertise varies depending on the data they observe, which provides clear advantages over previously introduced multi-annotator methods.
Abstract: Supervised learning from multiple labeling sources is an increasingly important problem in machine learning and data mining. This paper develops a probabilistic approach to this problem when annotators may be unreliable (labels are noisy), but also their expertise varies depending on the data they observe (annotators may have knowledge about different parts of the input space). That is, an annotator may not be consistently accurate (or inaccurate) across the task domain. The presented approach produces classification and annotator models that allow us to provide estimates of the true labels and of each annotator's varying expertise. We provide an analysis of the proposed model under various scenarios and show experimentally that annotator expertise can indeed vary in real tasks and that the presented approach provides clear advantages over previously introduced multi-annotator methods, which only consider general annotator characteristics.

220 citations


Proceedings Article
31 Mar 2010
TL;DR: Experiments with document categorization show that the proposed exclusive lasso regularizer outperforms state-of-the-art algorithms for multi-task feature selection, and an efficient algorithm is derived to solve the related optimization problem.
Abstract: We propose a novel group regularization which we call exclusive lasso. Unlike the group lasso regularizer that assumes covarying variables in groups, the proposed exclusive lasso regularizer models the scenario when variables in the same group compete with each other. Analysis is presented to illustrate the properties of the proposed regularizer. We present a framework for kernel-based multi-task feature selection based on the proposed exclusive lasso regularizer. An efficient algorithm is derived to solve the related optimization problem. Experiments with document categorization show that our approach outperforms state-of-the-art algorithms for multi-task feature selection.
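For reference, the exclusive lasso penalty on a weight vector $w$ with groups $\mathcal{G}$ can be written (up to constants) as

$$\Omega(w) \;=\; \sum_{g \in \mathcal{G}} \Big( \sum_{j \in g} |w_j| \Big)^{2},$$

i.e., an $\ell_1$ norm within each group, squared and summed across groups: the inner $\ell_1$ norm induces competition (sparsity) among variables in the same group, while the outer sum tends to spread weight across groups.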

209 citations


Proceedings Article
31 Mar 2010
TL;DR: A lower bound on the regret of any algorithm is proved in terms of the packing dimensions of the query space and the ad space; for finite spaces or convex bounded subsets of Euclidean spaces, this gives almost matching upper and lower bounds.
Abstract: We study contextual multi-armed bandit problems where the context comes from a metric space and the payoff satisfies a Lipschitz condition with respect to the metric. Abstractly, a contextual multi-armed bandit problem models a situation where, in a sequence of independent trials, an online algorithm chooses, based on a given context (side information), an action from a set of possible actions so as to maximize the total payoff of the chosen actions. The payoff depends on both the action chosen and the context. In contrast, context-free multi-armed bandit problems, a focus of much previous research, model situations where no side information is available and the payoff depends only on the action chosen. Our problem is motivated by sponsored web search, where the task is to display ads to a user of an Internet search engine based on her search query so as to maximize the click-through rate (CTR) of the ads displayed. We cast this problem as a contextual multi-armed bandit problem where queries and ads form metric spaces and the payoff function is Lipschitz with respect to both metrics. For any $\epsilon > 0$ we present an algorithm with regret $O\big(T^{\frac{a+b+1}{a+b+2}+\epsilon}\big)$, where $a$ and $b$ are the covering dimensions of the query space and the ad space respectively. We prove a lower bound of $\Omega\big(T^{\frac{\tilde a+\tilde b+1}{\tilde a+\tilde b+2}}\big)$ for the regret of any algorithm, where $\tilde a$ and $\tilde b$ are the packing dimensions of the query space and the ad space respectively. For finite spaces or convex bounded subsets of Euclidean spaces, this gives almost matching upper and lower bounds.

Proceedings Article
31 Mar 2010
TL;DR: In this article, the authors propose to solve the combinatorial problem of finding the highest scoring Bayesian network structure from data, which is viewed as an inference problem where the variables specify the choice of parents for each node in the graph.
Abstract: We propose to solve the combinatorial problem of finding the highest scoring Bayesian network structure from data. This structure learning problem can be viewed as an inference problem where the variables specify the choice of parents for each node in the graph. The key combinatorial difficulty arises from the global constraint that the graph structure has to be acyclic. We cast the structure learning problem as a linear program over the polytope defined by valid acyclic structures. In relaxing this problem, we maintain an outer bound approximation to the polytope and iteratively tighten it by searching over a new class of valid constraints. If an integral solution is found, it is guaranteed to be the optimal Bayesian network. When the relaxation is not tight, the fast dual algorithms we develop remain useful in combination with a branch and bound method. Empirical results suggest that the method is competitive or faster than alternative exact methods based on dynamic programming.

Proceedings Article
31 Mar 2010
TL;DR: This paper studies learning methods for binary restricted Boltzmann machines based on ratio matching and generalized score matching and compares them to a range of existing learning methods including stochastic maximum likelihood, contrastive divergence, and pseudo-likelihood.
Abstract: Recent research has seen the proposal of several new inductive principles designed specifically to avoid the problems associated with maximum likelihood learning in models with intractable partition functions. In this paper, we study learning methods for binary restricted Boltzmann machines (RBMs) based on ratio matching and generalized score matching. We compare these new RBM learning methods to a range of existing learning methods including stochastic maximum likelihood, contrastive divergence, and pseudo-likelihood. We perform an extensive empirical evaluation across multiple tasks and data sets.

Proceedings Article
31 Mar 2010
TL;DR: In this paper, the authors analyze the assumptions in an agnostic PAC-style learning model for a setting in which the learner can access a labeled training data sample and an unlabeled sample generated by the test data distribution, and show that without either assumption (i) or (ii), the combination of the remaining assumptions is not sufficient to guarantee successful learning.
Abstract: The domain adaptation problem in machine learning occurs when the test data generating distribution differs from the one that generates the training data. It is clear that the success of learning under such circumstances depends on similarities between the two data distributions. We study assumptions about the relationship between the two distributions that are needed for domain adaptation learning to succeed. We analyze the assumptions in an agnostic PAC-style learning model for the setting in which the learner can access a labeled training data sample and an unlabeled sample generated by the test data distribution. We focus on three assumptions: (i) similarity between the unlabeled distributions, (ii) existence of a classifier in the hypothesis class with low error on both training and testing distributions, and (iii) the covariate shift assumption, i.e., the assumption that the conditional label distribution (for each data point) is the same for both the training and test distributions. We show that without either assumption (i) or (ii), the combination of the remaining assumptions is not sufficient to guarantee successful learning. Our negative results hold with respect to any domain adaptation learning algorithm, as long as it does not have access to target labeled examples. In particular, we provide formal proofs that the popular covariate shift assumption is rather weak and does not relieve the necessity of the other assumptions.
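For concreteness, assumption (iii) can be stated as follows (with $P_S$ the training/source distribution and $P_T$ the test/target distribution):

$$P_S(y \mid x) \;=\; P_T(y \mid x) \quad \text{for all } x, \qquad \text{while } P_S(x) \text{ and } P_T(x) \text{ may differ}.$$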

Proceedings Article
31 Mar 2010
TL;DR: This work introduces two new classes of high order potentials, including composite HOPs that allow tractable HOPs to be flexibly combined using simple logical switching rules; it presents efficient message update algorithms for the new HOPs and improves the efficiency of message updates for a general class of existing HOPs.
Abstract: There is a growing interest in building probabilistic models with high order potentials (HOPs), or interactions, among discrete variables. Message passing inference in such models generally takes time exponential in the size of the interaction, but in some cases maximum a posteriori (MAP) inference can be carried out efficiently. We build upon such results, introducing two new classes, including composite HOPs that allow us to flexibly combine tractable HOPs using simple logical switching rules. We present efficient message update algorithms for the new HOPs, and we improve upon the efficiency of message updates for a general class of existing HOPs. Importantly, we present both new and existing HOPs in a common representation; performing inference with any combination of these HOPs requires no change of representations or new derivations.

Proceedings Article
31 Mar 2010
TL;DR: A non-linear graphical model for structured prediction that combines the power of deep neural networks to extract high level features with the graphical framework of Markov networks, yielding a powerful and scalable probabilistic model that is applied to signal labeling tasks.
Abstract: We propose a non-linear graphical model for structured prediction. It combines the power of deep neural networks to extract high level features with the graphical framework of Markov networks, yielding a powerful and scalable probabilistic model that we apply to signal labeling tasks.

Proceedings Article
31 Mar 2010
TL;DR: This work analyzes the performance of a particular method, online centroid anomaly detection, in the presence of adversarial noise, addressing three key security-related issues: derivation of an optimal attack, analysis of its efficiency and constraints, and tightness of the theoretical bounds.
Abstract: Security analysis of learning algorithms is gaining increasing importance, especially since they have become a target of deliberate obstruction in certain applications. Some security-hardened algorithms have been previously proposed for supervised learning; however, very little is known about the behavior of anomaly detection methods in such scenarios. In this contribution, we analyze the performance of a particular method, online centroid anomaly detection, in the presence of adversarial noise. Our analysis addresses three key security-related issues: derivation of an optimal attack, analysis of its efficiency and constraints. Experimental evaluation carried out on real HTTP and exploit traces confirms the tightness of our theoretical bounds.
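A minimal sketch of the detector family being analyzed; the parameter names and the "update only on accepted points" policy below are illustrative assumptions, not the paper's exact protocol:

```python
import numpy as np

def online_centroid_detector(stream, threshold, lr=0.05):
    """Flags a point as anomalous if its distance to the current centroid exceeds
    a threshold; points accepted as normal pull the centroid towards them."""
    centroid, flags = None, []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if centroid is None:
            centroid = x.copy()
            flags.append(False)
            continue
        is_anomaly = np.linalg.norm(x - centroid) > threshold
        flags.append(bool(is_anomaly))
        if not is_anomaly:
            centroid += lr * (x - centroid)   # online centroid update
    return flags
```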

Proceedings Article
31 Mar 2010
TL;DR: This paper introduces the concept of variational inducing functions to handle potential non-smooth functions involved in the kernel CP construction and considers an alternative approach to approximate inference based on variational methods.
Abstract: Interest in multioutput kernel methods is increasing, whether under the guise of multitask learning, multisensor networks or structured output data. From the Gaussian process perspective a multioutput Mercer kernel is a covariance function over correlated output functions. One way of constructing such kernels is based on convolution processes (CP). A key problem for this approach is efficient inference. Alvarez and Lawrence recently presented a sparse approximation for CPs that enabled efficient inference. In this paper, we extend this work in two directions: we introduce the concept of variational inducing functions to handle potential non-smooth functions involved in the kernel CP construction and we consider an alternative approach to approximate inference based on variational methods, extending the work by Titsias (2009) to the multiple output case. We demonstrate our approaches on prediction of school marks, compiler performance and financial time series.

Proceedings Article
31 Mar 2010
TL;DR: This work explores the use of tempered Markov Chain Monte-Carlo for sampling in RBMs and finds both through visualization of samples and measures of likelihood that it helps both sampling and learning.
Abstract: Alternating Gibbs sampling is the most common scheme used for sampling from Restricted Boltzmann Machines (RBM), a crucial component in deep architectures such as Deep Belief Networks. However, we find that it often does a very poor job of rendering the diversity of modes captured by the trained model. We suspect that this hinders the advantage that could in principle be brought by training algorithms relying on Gibbs sampling for uncovering spurious modes, such as the Persistent Contrastive Divergence algorithm. To alleviate this problem, we explore the use of tempered Markov Chain Monte-Carlo for sampling in RBMs. We find both through visualization of samples and measures of likelihood that it helps both sampling and learning.
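A minimal sketch of the replica-swap step used in tempered MCMC; the per-chain Gibbs updates and the RBM energy computation are assumed to happen elsewhere, and the function name is ours:

```python
import numpy as np

def tempering_swaps(states, energies, betas, rng=None):
    """Propose swaps between adjacent chains at inverse temperatures betas[k];
    accept with the Metropolis probability min(1, exp((b_k - b_{k+1})(E_k - E_{k+1}))),
    which lets fast-mixing hot chains feed fresh modes to the cold chain."""
    rng = np.random.default_rng() if rng is None else rng
    for k in range(len(betas) - 1):
        log_accept = (betas[k] - betas[k + 1]) * (energies[k] - energies[k + 1])
        if np.log(rng.random()) < log_accept:
            states[k], states[k + 1] = states[k + 1], states[k]
            energies[k], energies[k + 1] = energies[k + 1], energies[k]
    return states, energies
```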

Proceedings Article
31 Mar 2010
TL;DR: The authors analyze the effect of kernel-matrix approximations on the hypothesis generated by several widely used learning algorithms, including SVMs, KRR, and graph Laplacian-based regularization algorithms.
Abstract: Kernel approximation is commonly used to scale kernel-based algorithms to applications containing as many as several million instances. This paper analyzes the effect of such approximations in the kernel matrix on the hypothesis generated by several widely used learning algorithms. We give stability bounds based on the norm of the kernel approximation for these algorithms, including SVMs, KRR, and graph Laplacian-based regularization algorithms. These bounds help determine the degree of approximation that can be tolerated in the estimation of the kernel matrix. Our analysis is general and applies to arbitrary approximations of the kernel matrix. However, we also give a specific analysis of the Nyström low-rank approximation in this context and report the results of experiments evaluating this approximation.
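For reference, a minimal sketch of the Nyström low-rank approximation (uniform column sampling; function name ours):

```python
import numpy as np

def nystrom_approx(K, m, rng=None):
    """Approximate an n x n PSD kernel matrix by sampling m columns:
    K ~= C W^+ C^T, with C the sampled columns and W the m x m block
    of K restricted to the sampled indices."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(K.shape[0], size=m, replace=False)
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T
```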

Proceedings Article
31 Mar 2010
TL;DR: This paper proposes a robust approach to factorizing the latent space into shared and private spaces by introducing orthogonality constraints, which penalize redundant latent representations.
Abstract: Existing approaches to multi-view learning are particularly effective when the views are either independent (i.e., multi-kernel approaches) or fully dependent (i.e., shared latent spaces). However, in real scenarios these assumptions are almost never truly satisfied. Recently, two methods have attempted to tackle this problem by factorizing the information and learning separate latent spaces for modeling the shared (i.e., correlated) and private (i.e., independent) parts of the data. However, these approaches are very sensitive to parameter settings or initialization. In this paper we propose a robust approach to factorizing the latent space into shared and private spaces by introducing orthogonality constraints, which penalize redundant latent representations. Furthermore, unlike previous approaches, we simultaneously learn the structure and dimensionality of the latent spaces by relying on a regularizer that encourages the latent space of each data stream to be low dimensional. To demonstrate the benefits of our approach, we apply it to two existing shared latent space models that assume full dependence of the views, the sGPLVM and the sKIE, and show that our constraints improve the performance of these models on the task of pose estimation from monocular images.

Proceedings Article
31 Mar 2010
TL;DR: In this article, the authors relax their assumptions and prove a tighter finite-sample error bound for the case of Reduced-Rank HMMs, i.e., HMMs with low-rank transition matrices.
Abstract: Hsu et al. (2009) recently proposed an efficient, accurate spectral learning algorithm for Hidden Markov Models (HMMs). In this paper we relax their assumptions and prove a tighter finite-sample error bound for the case of Reduced-Rank HMMs, i.e., HMMs with low-rank transition matrices. Since rank-k RR-HMMs are a larger class of models than k-state HMMs while being equally efficient to work with, this relaxation greatly increases the learning algorithm’s scope. In addition, we generalize the algorithm and bounds to models where multiple observations are needed to disambiguate state, and to models that emit multivariate real-valued observations. Finally we prove consistency for learning Predictive State Representations, an even larger class of models. Experiments on synthetic data and a toy video, as well as on difficult robot vision data, yield accurate models that compare favorably with alternatives in simulation quality and prediction accuracy.

Proceedings Article
31 Mar 2010
TL;DR: This paper analyzes the CD1 update rule for Restricted Boltzmann Machines with binary variables, and shows that the regularized CD update has a fixed point for a large class of regularization functions using Brouwer’s fixed point theorem.
Abstract: Contrastive Divergence (CD) is a popular method for estimating the parameters of Markov Random Fields (MRFs) by rapidly approximating an intractable term in the gradient of the log probability. Despite CD’s empirical success, little is known about its theoretical convergence properties. In this paper, we analyze the CD1 update rule for Restricted Boltzmann Machines (RBMs) with binary variables. We show that this update is not the gradient of any function, and construct a counterintuitive “regularization function” that causes CD learning to cycle indefinitely. Nonetheless, we show that the regularized CD update has a fixed point for a large class of regularization functions, using Brouwer’s fixed point theorem.
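For context, a minimal sketch of the CD1 update under analysis, for a binary RBM with weight matrix W (biases omitted; names ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(W, v0, lr=0.01, rng=None):
    """One CD-1 step: positive statistics from the data batch v0, negative
    statistics after a single Gibbs chain v0 -> h0 -> v1 -> h1."""
    rng = np.random.default_rng() if rng is None else rng
    p_h0 = sigmoid(v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)   # sample hidden units
    p_v1 = sigmoid(h0 @ W.T)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)   # sample reconstruction
    p_h1 = sigmoid(v1 @ W)
    grad = (v0.T @ p_h0 - v1.T @ p_h1) / v0.shape[0]      # <vh>_data - <vh>_1-step
    return W + lr * grad
```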

Proceedings Article
31 Mar 2010
TL;DR: The main result is to bound the regret experienced by algorithms relative to the a posteriori optimal strategy of playing the best arm throughout, based on benign assumptions about the covariance function defining the Gaussian process.
Abstract: Bandit algorithms are concerned with trading exploration with exploitation where a number of options are available but we can only learn their quality by experimenting with them. We consider the scenario in which the reward distribution for arms is modelled by a Gaussian process and there is no noise in the observed reward. Our main result is to bound the regret experienced by algorithms relative to the a posteriori optimal strategy of playing the best arm throughout, based on benign assumptions about the covariance function defining the Gaussian process. We further complement these upper bounds with corresponding lower bounds for particular covariance functions, demonstrating that in general there is at most a logarithmic looseness in our upper bounds.

Proceedings Article
31 Mar 2010
TL;DR: It is shown that the learned cascades are capable of reducing the complexity of inference by up to five orders of magnitude, enabling the use of models which incorporate higher order features and yield higher accuracy.
Abstract: Structured prediction tasks pose a fundamental trade-off between the need for model complexity to increase predictive power and the limited computational resources for inference in the exponentially-sized output spaces such models require. We formulate and develop structured prediction cascades: a sequence of increasingly complex models that progressively filter the space of possible outputs. We represent an exponentially large set of filtered outputs using max marginals and propose a novel convex loss function that balances filtering error with filtering efficiency. We provide generalization bounds for these loss functions and evaluate our approach on handwriting recognition and part-of-speech tagging. We find that the learned cascades are capable of reducing the complexity of inference by up to five orders of magnitude, enabling the use of models which incorporate higher order features and yield higher accuracy.

Proceedings Article
01 Dec 2010
TL;DR: In this paper, the authors apply the expectation maximization algorithm to iterate between inference in the latent state-space and learning the parameters of the underlying GP dynamics model, and propose a new general methodology for inference and learning in nonlinear statespace models that are described probabilistically by non-parametric GP models.
Abstract: State-space inference and learning with Gaussian processes (GPs) is an unsolved problem. We propose a new, general methodology for inference and learning in nonlinear state-space models that are described probabilistically by non-parametric GP models. We apply the expectation maximization algorithm to iterate between inference in the latent state-space and learning the parameters of the underlying GP dynamics model.

Proceedings Article
31 Mar 2010
TL;DR: A kernel-based online learning algorithm, which has both constant space and update time, is proposed, based on the popular online Passive-Aggressive algorithm, and it is shown to be superior to existing budgeted online algorithms.
Abstract: In this paper a kernel-based online learning algorithm, which has both constant space and update time, is proposed. The approach is based on the popular online Passive-Aggressive (PA) algorithm. When used in conjunction with a kernel function, the number of support vectors in PA grows without bound when learning from noisy data streams. This implies unlimited memory and ever increasing model update and prediction time. To address this issue, the proposed budgeted PA algorithm maintains only a fixed number of support vectors. By introducing an additional constraint to the original PA optimization problem, a closed-form solution was derived for the support vector removal and model update. Using the hinge loss we developed several budgeted PA algorithms that can trade between accuracy and update cost. We also developed ramp loss versions of both the original and budgeted PA and showed that the resulting algorithms can be interpreted as a combination of active learning and hinge loss PA. All proposed algorithms were comprehensively tested on 7 benchmark data sets. The experiments showed that they are superior to existing budgeted online algorithms. Even with modest budgets, the budgeted PA achieved accuracies very competitive with the non-budgeted PA and kernel perceptron algorithms.
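A schematic sketch of a budgeted PA-style kernel update; note that the paper derives a closed-form optimal removal and update, whereas this illustration simply drops the oldest support vector once the budget is exceeded:

```python
def budgeted_pa_step(x, y, support, alphas, kernel, budget, C=1.0):
    """support/alphas: current support vectors and their coefficients.
    Performs a PA-I style update on (x, y in {-1, +1}) and enforces the budget."""
    score = sum(a * kernel(s, x) for s, a in zip(support, alphas))
    loss = max(0.0, 1.0 - y * score)            # hinge loss
    if loss > 0.0:
        tau = min(C, loss / kernel(x, x))       # PA-I step size in kernel form
        support.append(x)
        alphas.append(tau * y)
        if len(support) > budget:               # budget maintenance (simplified)
            support.pop(0)
            alphas.pop(0)
    return support, alphas
```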

Proceedings Article
31 Mar 2010
TL;DR: In this article, the authors utilize the framework of multitask learning to construct a BCI that can be used without any subject-specific calibration process, and demonstrate that satisfactory classification results can be achieved with zero training data and that combining prior recordings with subject-specific calibration data substantially outperforms using subject-specific data only.
Abstract: Brain-computer interfaces (BCIs) are limited in their applicability in everyday settings by the current necessity to record subject-specific calibration data prior to actual use of the BCI for communication. In this paper, we utilize the framework of multitask learning to construct a BCI that can be used without any subject-specific calibration process. We discuss how this out-of-the-box BCI can be further improved in a computationally efficient manner as subject-specific data becomes available. The feasibility of the approach is demonstrated on two sets of experimental EEG data recorded during a standard two-class motor imagery paradigm from a total of 19 healthy subjects. Specifically, we show that satisfactory classification results can be achieved with zero training data, and that combining prior recordings with subject-specific calibration data substantially outperforms using subject-specific data only. Our results further show that transfer between recordings under slightly different experimental setups is feasible.

Proceedings Article
31 Mar 2010
TL;DR: This work uses a spectral projected gradient method as a subroutine for solving the overlapping group ℓ1-regularization problem, and makes use of a sparse version of Dykstra’s algorithm to compute the projection.
Abstract: Previous work has examined structure learning in log-linear models with ℓ1-regularization, largely focusing on the case of pairwise potentials. In this work we consider the case of models with potentials of arbitrary order, but that satisfy a hierarchical constraint. We enforce the hierarchical constraint using group ℓ1-regularization with overlapping groups. An active set method that enforces hierarchical inclusion allows us to tractably consider the exponential number of higher-order potentials. We use a spectral projected gradient method as a subroutine for solving the overlapping group ℓ1-regularization problem, and make use of a sparse version of Dykstra’s algorithm to compute the projection. Our experiments indicate that this model gives equal or better test set likelihood compared to previous models.

Proceedings Article
31 Mar 2010
TL;DR: An augmented model which can make use of (labeled, and additionally unlabeled if available) inputs to assist learning this subspace, leading to further improvements in the performance, and an extension of the proposed framework where a nonparametric mixture of linear subspaces can be used to learn a nonlinear manifold over the task parameters.
Abstract: Given several related learning tasks, we propose a nonparametric Bayesian model that captures task relatedness by assuming that the task parameters (i.e., predictors) share a latent subspace. More specifically, the intrinsic dimensionality of the task subspace is not assumed to be known a priori. We use an infinite latent feature model to automatically infer this number (depending on and limited by only the number of tasks). Furthermore, our approach is applicable when the underlying task parameter subspace is inherently sparse, drawing parallels with ‘1 regularization and LASSO-style models. We also propose an augmented model which can make use of (labeled, and additionally unlabeled if available) inputs to assist learning this subspace, leading to further improvements in the performance. Experimental results demonstrate the efficacy of both the proposed approaches, especially when the number of examples per task is small. Finally, we discuss an extension of the proposed framework where a nonparametric mixture of linear subspaces can be used to learn a nonlinear manifold over the task parameters, and also deal with the issue of negative transfer from unrelated tasks.