
Showing papers presented at the International Conference on Artificial Intelligence and Statistics (AISTATS) in 2010


Proceedings Article
31 Mar 2010
TL;DR: The objective is to understand why standard gradient descent from random initialization does so poorly with deep neural networks, in order to better understand the recent relative successes of new training schemes and help design better algorithms in the future.
Abstract: Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, which explains the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.

1 Deep Neural Networks. Deep learning methods aim at learning feature hierarchies, with features from higher levels of the hierarchy formed by the composition of lower-level features. Much attention has recently been devoted to them (see (Bengio, 2009) for a review), because of their theoretical appeal, inspiration from biology and human cognition, and because of empirical success in vision (Ranzato et al., 2007; Larochelle et al., 2007; Vincent et al., 2008) and natural language processing (NLP) (Collobert & Weston, 2008; Mnih & Hinton, 2009). Theoretical results reviewed and discussed by Bengio (2009) suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one may need deep architectures. Most of the recent experimental results with deep architectures are obtained with models that can be turned into deep supervised neural networks, but with initialization or training schemes different from the classical feedforward neural networks (Rumelhart et al., 1986). Why are these new algorithms working so much better than the standard random initialization and gradient-based optimization of a supervised training criterion? Part of the answer may be found in recent analyses of the effect of unsupervised pre-training (Erhan et al., 2009), showing that it acts as a regularizer that initializes the parameters in a “better” basin of attraction of the optimization procedure, corresponding to an apparent local minimum associated with better generalization. But earlier work (Bengio et al., 2007) had shown that even a purely supervised but greedy layer-wise procedure would give better results.
So here, instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old (but deep) multilayer neural networks. Our analysis is driven by investigative experiments to monitor activations (watching for saturation of hidden units) and gradients, across layers and across training iterations. We also evaluate how these are affected by the choice of activation function (with the idea that it might affect saturation) and of initialization procedure (since unsupervised pre-training is a particular form of initialization, and it has a drastic impact).
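The initialization scheme proposed here is what is now commonly called Xavier (or Glorot) initialization. A minimal sketch of the uniform variant, with illustrative function and variable names of our own choosing:

```python
import numpy as np

def normalized_init(fan_in, fan_out, rng=None):
    """Uniform 'normalized initialization':
    W ~ U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))],
    chosen to keep activation and gradient variances roughly constant across layers."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Example: weights for a 784 -> 500 tanh layer.
W = normalized_init(784, 500)
```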

9,500 citations


Proceedings Article
31 Mar 2010
TL;DR: A new estimation principle is presented to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise, using the model log-density function in the regression nonlinearity, which leads to a consistent (convergent) estimator of the parameters.
Abstract: We present a new estimation principle for parameterized statistical models. The idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise, using the model log-density function in the regression nonlinearity. We show that this leads to a consistent (convergent) estimator of the parameters, and analyze the asymptotic variance. In particular, the method is shown to directly work for unnormalized models, i.e. models where the density function does not integrate to one. The normalization constant can be estimated just like any other parameter. For a tractable ICA model, we compare the method with other estimation methods that can be used to learn unnormalized models, including score matching, contrastive divergence, and maximum-likelihood where the normalization constant is estimated with importance sampling. Simulations show that noise-contrastive estimation offers the best trade-off between computational and statistical efficiency. The method is then applied to the modeling of natural images: We show that the method can successfully estimate a large-scale two-layer model and a Markov random field.
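A minimal sketch of the noise-contrastive estimation objective for the balanced case (one noise sample per data point); `log_p_model` and `log_p_noise` are placeholder callables, and for an unnormalized model `log_p_model` would include the log normalization constant as an extra free parameter:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_objective(log_p_model, log_p_noise, data, noise):
    """Logistic regression that discriminates data from noise using the
    log-density ratio G(u) = log p_model(u) - log p_noise(u) as the regression
    score; maximizing this over the model parameters yields the estimator."""
    g_data = log_p_model(data) - log_p_noise(data)
    g_noise = log_p_model(noise) - log_p_noise(noise)
    return np.mean(np.log(sigmoid(g_data))) + np.mean(np.log(1.0 - sigmoid(g_noise)))
```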

1,736 citations


Proceedings Article
31 Mar 2010
TL;DR: This work proposes two alternative algorithms for imitation learning where training occurs over several episodes of interaction and shows that this leads to stronger performance guarantees and improved performance on two challenging problems: training a learner to play a 3D racing game and Mario Bros.
Abstract: Imitation Learning, while applied successfully on many large real-world problems, is typically addressed as a standard supervised learning problem, where it is assumed the training and testing data are i.i.d.. This is not true in imitation learning as the learned policy influences the future test inputs (states) upon which it will be tested. We show that this leads to compounding errors and a regret bound that grows quadratically in the time horizon of the task. We propose two alternative algorithms for imitation learning where training occurs over several episodes of interaction. These two approaches share in common that the learner’s policy is slowly modified from executing the expert’s policy to the learned policy. We show that this leads to stronger performance guarantees and demonstrate the improved performance on two challenging problems: training a learner to play 1) a 3D racing game (Super Tux Kart) and 2) Mario Bros.; given input images from the games and corresponding actions taken by a human expert and near-optimal planner respectively.
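A schematic sketch of the shared idea (gradually handing control from the expert to the learner while labeling visited states with the expert's actions); the environment and expert interfaces and the mixing schedule below are illustrative assumptions, not the paper's exact algorithms:

```python
import random

def train_by_interaction(env, expert, fit_classifier, n_iters=10, alpha=0.1, horizon=100):
    """Interactive imitation learning sketch: at iteration n the executed policy
    follows the expert with probability (1 - alpha)**n and the learned policy
    otherwise; every visited state is labeled with the expert's action and the
    classifier is refit on all data collected so far."""
    dataset = []      # (state, expert_action) pairs
    learned = None    # learned policy, None until the first fit
    for n in range(n_iters):
        p_expert = (1.0 - alpha) ** n
        state = env.reset()
        for _ in range(horizon):
            expert_action = expert(state)
            dataset.append((state, expert_action))
            if learned is None or random.random() < p_expert:
                action = expert_action
            else:
                action = learned(state)
            state, done = env.step(action)   # assumed interface: returns (next_state, done)
            if done:
                break
        learned = fit_classifier(dataset)
    return learned
```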

634 citations



Proceedings Article
31 Mar 2010
TL;DR: A new approximate inference algorithm for Deep Boltzmann Machines (DBM’s), a generative model with many layers of hidden variables, that learns a separate “recognition” model that is used to quickly initialize, in a single bottom-up pass, the values of the latent variables in all hidden layers.
Abstract: We present a new approximate inference algorithm for Deep Boltzmann Machines (DBM’s), a generative model with many layers of hidden variables. The algorithm learns a separate “recognition” model that is used to quickly initialize, in a single bottom-up pass, the values of the latent variables in all hidden layers. We show that using such a recognition model, followed by a combined top-down and bottom-up pass, it is possible to efficiently learn a good generative model of high-dimensional, highly-structured sensory input. We show that the additional computations required by incorporating top-down feedback play a critical role in the performance of a DBM, both as a generative and a discriminative model. Moreover, inference is at most three times slower than approximate inference in a Deep Belief Network (DBN), making large-scale learning of DBM’s practical. Finally, we demonstrate that DBM’s trained using the proposed approximate inference algorithm perform well compared to DBN’s and SVM’s on the MNIST handwritten digit, OCR English letters, and NORB visual object recognition tasks.
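A schematic sketch of this style of inference for a two-hidden-layer binary DBM (biases omitted; variable names and the two-layer setup are our illustrative assumptions): a separate set of recognition weights gives a single bottom-up initialization, followed by a few mean-field updates that combine bottom-up and top-down signals.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def approximate_inference(v, W1, W2, R1, R2, n_updates=3):
    """W1, W2: generative weights of a two-hidden-layer DBM; R1, R2: recognition
    weights. Returns mean-field estimates (mu1, mu2) of the hidden layers given v."""
    mu1 = sigmoid(v @ R1)          # fast bottom-up recognition pass
    mu2 = sigmoid(mu1 @ R2)
    for _ in range(n_updates):     # combined top-down / bottom-up refinement
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)
        mu2 = sigmoid(mu1 @ W2)
    return mu1, mu2
```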

374 citations


Proceedings Article
31 Mar 2010
TL;DR: In this article, a variational inference framework for training the Gaussian process latent variable model and thus performing Bayesian nonlinear dimensionality reduction is introduced, which can automatically select the dimensionality of the nonlinear latent space.
Abstract: We introduce a variational inference framework for training the Gaussian process latent variable model and thus performing Bayesian nonlinear dimensionality reduction. This method allows us to variationally integrate out the input variables of the Gaussian process and compute a lower bound on the exact marginal likelihood of the nonlinear latent variable model. The maximization of the variational lower bound provides a Bayesian training procedure that is robust to overfitting and can automatically select the dimensionality of the nonlinear latent space. We demonstrate our method on real world datasets. The focus in this paper is on dimensionality reduction problems, but the methodology is more general. For example, our algorithm is immediately applicable for training Gaussian process models in the presence of missing or uncertain inputs.
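As a rough sketch of the construction (our notation, not the paper's full derivation): with a Gaussian prior $p(X)$ over the latent inputs and a variational distribution $q(X)$, the quantity being maximized has the generic form

$$\log p(Y) \;\ge\; \mathbb{E}_{q(X)}\big[\log p(Y \mid X)\big] \;-\; \mathrm{KL}\big(q(X)\,\|\,p(X)\big),$$

where the GP mapping has already been integrated out of $p(Y \mid X)$; the technical contribution is making the expectation tractable for the GP likelihood so that the bound, and with it the latent dimensionality, can be optimized.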

338 citations


Proceedings Article
31 Mar 2010
TL;DR: In this paper, the necessary and sufficient conditions for a kernel to be universal were studied, and a relation between universal and characteristic kernels was established via the embedding of Borel measures into a reproducing kernel Hilbert space.
Abstract: Universal kernels have been shown to play an important role in the achievability of the Bayes risk by many kernel-based algorithms that include binary classification, regression, etc. In this paper, we propose a notion of universality that generalizes the notions introduced by Steinwart and Micchelli et al. and study the necessary and sufficient conditions for a kernel to be universal. We show that all these notions of universality are closely linked to the injective embedding of a certain class of Borel measures into a reproducing kernel Hilbert space (RKHS). By exploiting this relation between universality and the embedding of Borel measures into an RKHS, we establish the relation between universal and characteristic kernels. The latter have been proposed in the context of the RKHS embedding of probability measures, used in statistical applications like homogeneity testing, independence testing, etc.
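As a brief reminder of the embedding in question (standard notation, not specific to this paper), a Borel measure $\mu$ is mapped into the RKHS $\mathcal{H}_k$ via

$$\mu \;\mapsto\; \int_{\mathcal{X}} k(\cdot, x)\, d\mu(x) \;\in\; \mathcal{H}_k,$$

and a kernel is called characteristic when this map is injective on probability measures; roughly, the universality notions studied here correspond to injectivity of the same map over suitable classes of Borel measures.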

263 citations


Proceedings Article
31 Mar 2010
TL;DR: A factored 3-way RBM is proposed that uses the states of its hidden units to represent abnormalities in the local covariance structure of an image to provide a probabilistic framework for the widely used simple/complex cell architecture.
Abstract: Deep belief nets have been successful in modeling handwritten characters, but it has proved more difficult to apply them to real images. The problem lies in the restricted Boltzmann machine (RBM) which is used as a module for learning deep belief nets one layer at a time. The Gaussian-Binary RBMs that have been used to model real-valued data are not a good way to model the covariance structure of natural images. We propose a factored 3-way RBM that uses the states of its hidden units to represent abnormalities in the local covariance structure of an image. This provides a probabilistic framework for the widely used simple/complex cell architecture. Our model learns binary features that work very well for object recognition on the “tiny images” data set. Even better features are obtained by then using standard binary RBM’s to learn a deeper model.

249 citations


Proceedings Article
31 Mar 2010
TL;DR: This paper develops a probabilistic approach to this problem when annotators may be unreliable, but also their expertise varies depending on the data they observe, which provides clear advantages over previously introduced multi-annotator methods.
Abstract: Supervised learning from multiple labeling sources is an increasingly important problem in machine learning and data mining. This paper develops a probabilistic approach to this problem when annotators may be unreliable (labels are noisy), but also their expertise varies depending on the data they observe (annotators may have knowledge about different parts of the input space). That is, an annotator may not be consistently accurate (or inaccurate) across the task domain. The presented approach produces classification and annotator models that allow us to provide estimates of the true labels and of each annotator's varying expertise. We provide an analysis of the proposed model under various scenarios and show experimentally that annotator expertise can indeed vary in real tasks and that the presented approach provides clear advantages over previously introduced multi-annotator methods, which only consider general annotator characteristics.

220 citations


Proceedings Article
31 Mar 2010
TL;DR: Experiments with document categorization show that the proposed exclusive lasso regularizer outperforms state-of-the-art algorithms for multi-task feature selection, and an efficient algorithm is derived to solve the related optimization problem.
Abstract: We propose a novel group regularization which we call exclusive lasso. Unlike the group lasso regularizer that assumes covarying variables in groups, the proposed exclusive lasso regularizer models the scenario when variables in the same group compete with each other. Analysis is presented to illustrate the properties of the proposed regularizer. We present a framework for kernel-based multi-task feature selection based on the proposed exclusive lasso regularizer. An efficient algorithm is derived to solve the related optimization problem. Experiments with document categorization show that our approach outperforms state-of-the-art algorithms for multi-task feature selection.
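For reference, the exclusive lasso penalty on a weight vector $w$ with groups $\mathcal{G}$ can be written (up to constants) as

$$\Omega(w) \;=\; \sum_{g \in \mathcal{G}} \Big( \sum_{j \in g} |w_j| \Big)^{2},$$

i.e., an $\ell_1$ norm within each group, squared and summed across groups: the inner $\ell_1$ norm induces competition (sparsity) among variables in the same group, while the outer sum tends to spread weight across groups.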

209 citations


Proceedings Article
31 Mar 2010
TL;DR: A lower bound on the regret of any algorithm is proved in terms of the packing dimensions of the query space and the ad space; for finite spaces or convex bounded subsets of Euclidean spaces, this gives almost matching upper and lower bounds.
Abstract: We study contextual multi-armed bandit problems where the context comes from a metric space and the payoff satisfies a Lipschitz condition with respect to the metric. Abstractly, a contextual multi-armed bandit problem models a situation where, in a sequence of independent trials, an online algorithm chooses, based on a given context (side information), an action from a set of possible actions so as to maximize the total payoff of the chosen actions. The payoff depends on both the action chosen and the context. In contrast, context-free multi-armed bandit problems, a focus of much previous research, model situations where no side information is available and the payoff depends only on the action chosen. Our problem is motivated by sponsored web search, where the task is to display ads to a user of an Internet search engine based on her search query so as to maximize the click-through rate (CTR) of the ads displayed. We cast this problem as a contextual multi-armed bandit problem where queries and ads form metric spaces and the payoff function is Lipschitz with respect to both metrics. For any $\epsilon > 0$ we present an algorithm with regret $O\big(T^{\frac{a+b+1}{a+b+2}+\epsilon}\big)$, where $a$ and $b$ are the covering dimensions of the query space and the ad space respectively. We prove a lower bound of $\Omega\big(T^{\frac{\tilde a+\tilde b+1}{\tilde a+\tilde b+2}}\big)$ for the regret of any algorithm, where $\tilde a$ and $\tilde b$ are the packing dimensions of the query space and the ad space respectively. For finite spaces or convex bounded subsets of Euclidean spaces, this gives almost matching upper and lower bounds.

Proceedings Article
31 Mar 2010
TL;DR: In this article, the authors propose to solve the combinatorial problem of finding the highest scoring Bayesian network structure from data, which is viewed as an inference problem where the variables specify the choice of parents for each node in the graph.
Abstract: We propose to solve the combinatorial problem of finding the highest scoring Bayesian network structure from data. This structure learning problem can be viewed as an inference problem where the variables specify the choice of parents for each node in the graph. The key combinatorial difficulty arises from the global constraint that the graph structure has to be acyclic. We cast the structure learning problem as a linear program over the polytope defined by valid acyclic structures. In relaxing this problem, we maintain an outer bound approximation to the polytope and iteratively tighten it by searching over a new class of valid constraints. If an integral solution is found, it is guaranteed to be the optimal Bayesian network. When the relaxation is not tight, the fast dual algorithms we develop remain useful in combination with a branch and bound method. Empirical results suggest that the method is competitive or faster than alternative exact methods based on dynamic programming.

Proceedings Article
31 Mar 2010
TL;DR: This paper studies learning methods for binary restricted Boltzmann machines based on ratio matching and generalized score matching and compares them to a range of existing learning methods including stochastic maximum likelihood, contrastive divergence, and pseudo-likelihood.
Abstract: Recent research has seen the proposal of several new inductive principles designed specifically to avoid the problems associated with maximum likelihood learning in models with intractable partition functions. In this paper, we study learning methods for binary restricted Boltzmann machines (RBMs) based on ratio matching and generalized score matching. We compare these new RBM learning methods to a range of existing learning methods including stochastic maximum likelihood, contrastive divergence, and pseudo-likelihood. We perform an extensive empirical evaluation across multiple tasks and data sets.

Proceedings Article
31 Mar 2010
TL;DR: In this paper, the authors analyze the assumptions in an agnostic PAC-style learning model for a setting in which the learner can access a labeled training data sample and an unlabeled sample generated by the test data distribution, and show that without either assumption (i) or (ii), the combination of the remaining assumptions is not sufficient to guarantee successful learning.
Abstract: The domain adaptation problem in machine learning occurs when the test data generating distribution differs from the one that generates the training data. It is clear that the success of learning under such circumstances depends on similarities between the two data distributions. We study assumptions about the relationship between the two distributions that are needed for domain adaptation learning to succeed. We analyze the assumptions in an agnostic PAC-style learning model for the setting in which the learner can access a labeled training data sample and an unlabeled sample generated by the test data distribution. We focus on three assumptions: (i) similarity between the unlabeled distributions, (ii) existence of a classifier in the hypothesis class with low error on both training and testing distributions, and (iii) the covariate shift assumption, i.e., the assumption that the conditional label distribution (for each data point) is the same for both the training and test distributions. We show that without either assumption (i) or (ii), the combination of the remaining assumptions is not sufficient to guarantee successful learning. Our negative results hold with respect to any domain adaptation learning algorithm, as long as it does not have access to target labeled examples. In particular, we provide formal proofs that the popular covariate shift assumption is rather weak and does not relieve the necessity of the other assumptions.
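For concreteness, assumption (iii) can be stated as follows (with $P_S$ the training/source distribution and $P_T$ the test/target distribution):

$$P_S(y \mid x) \;=\; P_T(y \mid x) \quad \text{for all } x, \qquad \text{while } P_S(x) \text{ and } P_T(x) \text{ may differ}.$$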

Proceedings Article
31 Mar 2010
TL;DR: This work introduces two new classes of high order potentials, including composite HOPs that allow tractable HOPs to be flexibly combined using simple logical switching rules; it presents efficient message update algorithms for the new HOPs and improves the efficiency of message updates for a general class of existing HOPs.
Abstract: There is a growing interest in building probabilistic models with high order potentials (HOPs), or interactions, among discrete variables. Message passing inference in such models generally takes time exponential in the size of the interaction, but in some cases maximum a posteriori (MAP) inference can be carried out efficiently. We build upon such results, introducing two new classes, including composite HOPs that allow us to flexibly combine tractable HOPs using simple logical switching rules. We present efficient message update algorithms for the new HOPs, and we improve upon the efficiency of message updates for a general class of existing HOPs. Importantly, we present both new and existing HOPs in a common representation; performing inference with any combination of these HOPs requires no change of representations or new derivations.

Proceedings Article
31 Mar 2010
TL;DR: A non-linear graphical model for structured prediction that combines the power of deep neural networks to extract high level features with the graphical framework of Markov networks, yielding a powerful and scalable probabilistic model that is applied to signal labeling tasks.
Abstract: We propose a non-linear graphical model for structured prediction. It combines the power of deep neural networks to extract high level features with the graphical framework of Markov networks, yielding a powerful and scalable probabilistic model that we apply to signal labeling tasks.

Proceedings Article
31 Mar 2010
TL;DR: This work analyzes the performance of a particular method, online centroid anomaly detection, in the presence of adversarial noise, addressing three key security-related issues: derivation of an optimal attack, analysis of its efficiency and constraints, and tightness of the theoretical bounds.
Abstract: Security analysis of learning algorithms is gaining increasing importance, especially since they have become a target of deliberate obstruction in certain applications. Some security-hardened algorithms have been previously proposed for supervised learning; however, very little is known about the behavior of anomaly detection methods in such scenarios. In this contribution, we analyze the performance of a particular method, online centroid anomaly detection, in the presence of adversarial noise. Our analysis addresses three key security-related issues: derivation of an optimal attack, analysis of its efficiency and constraints. Experimental evaluation carried out on real HTTP and exploit traces confirms the tightness of our theoretical bounds.
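A minimal sketch of the detector family being analyzed; the parameter names and the "update only on accepted points" policy below are illustrative assumptions, not the paper's exact protocol:

```python
import numpy as np

def online_centroid_detector(stream, threshold, lr=0.05):
    """Flags a point as anomalous if its distance to the current centroid exceeds
    a threshold; points accepted as normal pull the centroid towards them."""
    centroid, flags = None, []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if centroid is None:
            centroid = x.copy()
            flags.append(False)
            continue
        is_anomaly = np.linalg.norm(x - centroid) > threshold
        flags.append(bool(is_anomaly))
        if not is_anomaly:
            centroid += lr * (x - centroid)   # online centroid update
    return flags
```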

Proceedings Article
31 Mar 2010
TL;DR: This paper introduces the concept of variational inducing functions to handle potential non-smooth functions involved in the kernel CP construction and considers an alternative approach to approximate inference based on variational methods.
Abstract: Interest in multioutput kernel methods is increasing, whether under the guise of multitask learning, multisensor networks or structured output data. From the Gaussian process perspective a multioutput Mercer kernel is a covariance function over correlated output functions. One way of constructing such kernels is based on convolution processes (CP). A key problem for this approach is efficient inference. Alvarez and Lawrence recently presented a sparse approximation for CPs that enabled efficient inference. In this paper, we extend this work in two directions: we introduce the concept of variational inducing functions to handle potential non-smooth functions involved in the kernel CP construction and we consider an alternative approach to approximate inference based on variational methods, extending the work by Titsias (2009) to the multiple output case. We demonstrate our approaches on prediction of school marks, compiler performance and financial time series.

Proceedings Article
31 Mar 2010
TL;DR: This work explores the use of tempered Markov Chain Monte-Carlo for sampling in RBMs and finds both through visualization of samples and measures of likelihood that it helps both sampling and learning.
Abstract: Alternating Gibbs sampling is the most common scheme used for sampling from Restricted Boltzmann Machines (RBM), a crucial component in deep architectures such as Deep Belief Networks. However, we find that it often does a very poor job of rendering the diversity of modes captured by the trained model. We suspect that this hinders the advantage that could in principle be brought by training algorithms relying on Gibbs sampling for uncovering spurious modes, such as the Persistent Contrastive Divergence algorithm. To alleviate this problem, we explore the use of tempered Markov Chain Monte-Carlo for sampling in RBMs. We find both through visualization of samples and measures of likelihood that it helps both sampling and learning.
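A minimal sketch of the replica-swap step used in tempered MCMC; the per-chain Gibbs updates and the RBM energy computation are assumed to happen elsewhere, and the function name is ours:

```python
import numpy as np

def tempering_swaps(states, energies, betas, rng=None):
    """Propose swaps between adjacent chains at inverse temperatures betas[k];
    accept with the Metropolis probability min(1, exp((b_k - b_{k+1})(E_k - E_{k+1}))),
    which lets fast-mixing hot chains feed fresh modes to the cold chain."""
    rng = np.random.default_rng() if rng is None else rng
    for k in range(len(betas) - 1):
        log_accept = (betas[k] - betas[k + 1]) * (energies[k] - energies[k + 1])
        if np.log(rng.random()) < log_accept:
            states[k], states[k + 1] = states[k + 1], states[k]
            energies[k], energies[k + 1] = energies[k + 1], energies[k]
    return states, energies
```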

Proceedings Article
31 Mar 2010
TL;DR: The authors analyze the effect of kernel-matrix approximations on the hypothesis generated by several widely used learning algorithms, including SVMs, KRR, and graph Laplacian-based regularization algorithms.
Abstract: Kernel approximation is commonly used to scale kernel-based algorithms to applications containing as many as several million instances. This paper analyzes the effect of such approximations in the kernel matrix on the hypothesis generated by several widely used learning algorithms. We give stability bounds based on the norm of the kernel approximation for these algorithms, including SVMs, KRR, and graph Laplacian-based regularization algorithms. These bounds help determine the degree of approximation that can be tolerated in the estimation of the kernel matrix. Our analysis is general and applies to arbitrary approximations of the kernel matrix. However, we also give a specific analysis of the Nyström low-rank approximation in this context and report the results of experiments evaluating this approximation.
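For reference, a minimal sketch of the Nyström low-rank approximation (uniform column sampling; function name ours):

```python
import numpy as np

def nystrom_approx(K, m, rng=None):
    """Approximate an n x n PSD kernel matrix by sampling m columns:
    K ~= C W^+ C^T, with C the sampled columns and W the m x m block
    of K restricted to the sampled indices."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(K.shape[0], size=m, replace=False)
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T
```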

Proceedings Article
31 Mar 2010
TL;DR: This paper proposes a robust approach to factorizing the latent space into shared and private spaces by introducing orthogonality constraints, which penalize redundant latent representations.
Abstract: Existing approaches to multi-view learning are particularly effective when the views are either independent (i.e., multi-kernel approaches) or fully dependent (i.e., shared latent spaces). However, in real scenarios these assumptions are almost never truly satisfied. Recently, two methods have attempted to tackle this problem by factorizing the information and learning separate latent spaces for modeling the shared (i.e., correlated) and private (i.e., independent) parts of the data. However, these approaches are very sensitive to parameter settings or initialization. In this paper we propose a robust approach to factorizing the latent space into shared and private spaces by introducing orthogonality constraints, which penalize redundant latent representations. Furthermore, unlike previous approaches, we simultaneously learn the structure and dimensionality of the latent spaces by relying on a regularizer that encourages the latent space of each data stream to be low dimensional. To demonstrate the benefits of our approach, we apply it to two existing shared latent space models that assume full dependence of the views, the sGPLVM and the sKIE, and show that our constraints improve the performance of these models on the task of pose estimation from monocular images.

Proceedings Article
31 Mar 2010
TL;DR: In this article, the authors relax their assumptions and prove a tighter finite-sample error bound for the case of Reduced-Rank HMMs, i.e., HMMs with low-rank transition matrices.
Abstract: Hsu et al. (2009) recently proposed an efficient, accurate spectral learning algorithm for Hidden Markov Models (HMMs). In this paper we relax their assumptions and prove a tighter finite-sample error bound for the case of Reduced-Rank HMMs, i.e., HMMs with low-rank transition matrices. Since rank-k RR-HMMs are a larger class of models than k-state HMMs while being equally efficient to work with, this relaxation greatly increases the learning algorithm’s scope. In addition, we generalize the algorithm and bounds to models where multiple observations are needed to disambiguate state, and to models that emit multivariate real-valued observations. Finally we prove consistency for learning Predictive State Representations, an even larger class of models. Experiments on synthetic data and a toy video, as well as on difficult robot vision data, yield accurate models that compare favorably with alternatives in simulation quality and prediction accuracy.

Proceedings Article
31 Mar 2010
TL;DR: This paper analyzes the CD1 update rule for Restricted Boltzmann Machines with binary variables, and shows that the regularized CD update has a fixed point for a large class of regularization functions using Brouwer’s fixed point theorem.
Abstract: Contrastive Divergence (CD) is a popular method for estimating the parameters of Markov Random Fields (MRFs) by rapidly approximating an intractable term in the gradient of the log probability. Despite CD’s empirical success, little is known about its theoretical convergence properties. In this paper, we analyze the CD1 update rule for Restricted Boltzmann Machines (RBMs) with binary variables. We show that this update is not the gradient of any function, and construct a counterintuitive “regularization function” that causes CD learning to cycle indefinitely. Nonetheless, we show that the regularized CD update has a fixed point for a large class of regularization functions, using Brouwer’s fixed point theorem.
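For context, a minimal sketch of the CD1 update under analysis, for a binary RBM with weight matrix W (biases omitted; names ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(W, v0, lr=0.01, rng=None):
    """One CD-1 step: positive statistics from the data batch v0, negative
    statistics after a single Gibbs chain v0 -> h0 -> v1 -> h1."""
    rng = np.random.default_rng() if rng is None else rng
    p_h0 = sigmoid(v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)   # sample hidden units
    p_v1 = sigmoid(h0 @ W.T)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)   # sample reconstruction
    p_h1 = sigmoid(v1 @ W)
    grad = (v0.T @ p_h0 - v1.T @ p_h1) / v0.shape[0]      # <vh>_data - <vh>_1-step
    return W + lr * grad
```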

Proceedings Article
31 Mar 2010
TL;DR: The main result is to bound the regret experienced by algorithms relative to the a posteriori optimal strategy of playing the best arm throughout, based on benign assumptions about the covariance function defining the Gaussian process.
Abstract: Bandit algorithms are concerned with trading exploration with exploitation where a number of options are available but we can only learn their quality by experimenting with them. We consider the scenario in which the reward distribution for arms is modelled by a Gaussian process and there is no noise in the observed reward. Our main result is to bound the regret experienced by algorithms relative to the a posteriori optimal strategy of playing the best arm throughout, based on benign assumptions about the covariance function defining the Gaussian process. We further complement these upper bounds with corresponding lower bounds for particular covariance functions, demonstrating that in general there is at most a logarithmic looseness in our upper bounds.

Proceedings Article
31 Mar 2010
TL;DR: It is shown that the learned cascades are capable of reducing the complexity of inference by up to five orders of magnitude, enabling the use of models which incorporate higher order features and yield higher accuracy.
Abstract: Structured prediction tasks pose a fundamental trade-off between the need for model complexity to increase predictive power and the limited computational resources for inference in the exponentially-sized output spaces such models require. We formulate and develop structured prediction cascades: a sequence of increasingly complex models that progressively filter the space of possible outputs. We represent an exponentially large set of filtered outputs using max marginals and propose a novel convex loss function that balances filtering error with filtering efficiency. We provide generalization bounds for these loss functions and evaluate our approach on handwriting recognition and part-of-speech tagging. We find that the learned cascades are capable of reducing the complexity of inference by up to five orders of magnitude, enabling the use of models which incorporate higher order features and yield higher accuracy.

Proceedings Article
01 Dec 2010
TL;DR: In this paper, the authors apply the expectation maximization algorithm to iterate between inference in the latent state-space and learning the parameters of the underlying GP dynamics model, and propose a new general methodology for inference and learning in nonlinear statespace models that are described probabilistically by non-parametric GP models.
Abstract: State-space inference and learning with Gaussian processes (GPs) is an unsolved problem. We propose a new, general methodology for inference and learning in nonlinear state-space models that are described probabilistically by non-parametric GP models. We apply the expectation maximization algorithm to iterate between inference in the latent state-space and learning the parameters of the underlying GP dynamics model.

Proceedings Article
31 Mar 2010
TL;DR: A kernel-based online learning algorithm, which has both constant space and update time, is proposed, based on the popular online Passive-Aggressive algorithm, and it is shown to be superior to existing budgeted online algorithms.
Abstract: In this paper a kernel-based online learning algorithm, which has both constant space and update time, is proposed. The approach is based on the popular online Passive-Aggressive (PA) algorithm. When used in conjunction with a kernel function, the number of support vectors in PA grows without bound when learning from noisy data streams. This implies unlimited memory and ever increasing model update and prediction time. To address this issue, the proposed budgeted PA algorithm maintains only a fixed number of support vectors. By introducing an additional constraint to the original PA optimization problem, a closed-form solution was derived for the support vector removal and model update. Using the hinge loss we developed several budgeted PA algorithms that can trade between accuracy and update cost. We also developed ramp loss versions of both the original and budgeted PA and showed that the resulting algorithms can be interpreted as a combination of active learning and hinge loss PA. All proposed algorithms were comprehensively tested on 7 benchmark data sets. The experiments showed that they are superior to existing budgeted online algorithms. Even with modest budgets, the budgeted PA achieved accuracies very competitive with the non-budgeted PA and kernel perceptron algorithms.
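A schematic sketch of a budgeted PA-style kernel update; note that the paper derives a closed-form optimal removal and update, whereas this illustration simply drops the oldest support vector once the budget is exceeded:

```python
def budgeted_pa_step(x, y, support, alphas, kernel, budget, C=1.0):
    """support/alphas: current support vectors and their coefficients.
    Performs a PA-I style update on (x, y in {-1, +1}) and enforces the budget."""
    score = sum(a * kernel(s, x) for s, a in zip(support, alphas))
    loss = max(0.0, 1.0 - y * score)            # hinge loss
    if loss > 0.0:
        tau = min(C, loss / kernel(x, x))       # PA-I step size in kernel form
        support.append(x)
        alphas.append(tau * y)
        if len(support) > budget:               # budget maintenance (simplified)
            support.pop(0)
            alphas.pop(0)
    return support, alphas
```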

Proceedings Article
31 Mar 2010
TL;DR: In this article, the authors utilize the framework of multitask learning to construct a BCI that can be used without any subject-specific calibration process, and demonstrate that satisfactory classification results can be achieved with zero training data and that combining prior recordings with subject-specific calibration data substantially outperforms using subject-specific data only.
Abstract: Brain-computer interfaces (BCIs) are limited in their applicability in everyday settings by the current necessity to record subject-specific calibration data prior to actual use of the BCI for communication. In this paper, we utilize the framework of multitask learning to construct a BCI that can be used without any subject-specific calibration process. We discuss how this out-of-the-box BCI can be further improved in a computationally efficient manner as subject-specific data becomes available. The feasibility of the approach is demonstrated on two sets of experimental EEG data recorded during a standard two-class motor imagery paradigm from a total of 19 healthy subjects. Specifically, we show that satisfactory classification results can be achieved with zero training data, and that combining prior recordings with subject-specific calibration data substantially outperforms using subject-specific data only. Our results further show that transfer between recordings under slightly different experimental setups is feasible.

Proceedings Article
31 Mar 2010
TL;DR: This work uses a spectral projected gradient method as a subroutine for solving the overlapping group ℓ1-regularization problem, and makes use of a sparse version of Dykstra’s algorithm to compute the projection.
Abstract: Previous work has examined structure learning in log-linear models with ℓ1-regularization, largely focusing on the case of pairwise potentials. In this work we consider the case of models with potentials of arbitrary order, but that satisfy a hierarchical constraint. We enforce the hierarchical constraint using group ℓ1-regularization with overlapping groups. An active set method that enforces hierarchical inclusion allows us to tractably consider the exponential number of higher-order potentials. We use a spectral projected gradient method as a subroutine for solving the overlapping group ℓ1-regularization problem, and make use of a sparse version of Dykstra’s algorithm to compute the projection. Our experiments indicate that this model gives equal or better test set likelihood compared to previous models.

Proceedings Article
31 Mar 2010
TL;DR: An augmented model which can make use of (labeled, and additionally unlabeled if available) inputs to assist learning this subspace, leading to further improvements in the performance, and an extension of the proposed framework where a nonparametric mixture of linear subspaces can be used to learn a nonlinear manifold over the task parameters.
Abstract: Given several related learning tasks, we propose a nonparametric Bayesian model that captures task relatedness by assuming that the task parameters (i.e., predictors) share a latent subspace. More specifically, the intrinsic dimensionality of the task subspace is not assumed to be known a priori. We use an infinite latent feature model to automatically infer this number (depending on and limited by only the number of tasks). Furthermore, our approach is applicable when the underlying task parameter subspace is inherently sparse, drawing parallels with ‘1 regularization and LASSO-style models. We also propose an augmented model which can make use of (labeled, and additionally unlabeled if available) inputs to assist learning this subspace, leading to further improvements in the performance. Experimental results demonstrate the efficacy of both the proposed approaches, especially when the number of examples per task is small. Finally, we discuss an extension of the proposed framework where a nonparametric mixture of linear subspaces can be used to learn a nonlinear manifold over the task parameters, and also deal with the issue of negative transfer from unrelated tasks.