Showing papers in "arXiv: Machine Learning in 2013"

Posted Content•

Diederik P. Kingma¹, Max Welling¹•Institutions (1)

20 Dec 2013-arXiv: Machine Learning

TL;DR: This paper proposes a stochastic variational inference and learning algorithm for directed probabilistic models with continuous latent variables and intractable posterior distributions; the algorithm scales to large datasets.

Abstract: How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contribution is two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.
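
Example: the reparameterization idea in the abstract can be illustrated on a toy expectation. The sketch below is not the paper's lower bound estimator; it assumes a Gaussian z ~ N(mu, sigma^2), a stand-in function f(z) = z^2, and the hypothetical helper name `reparam_gradients`. It only shows how writing z = mu + sigma * eps lets ordinary Monte Carlo averages estimate gradients with respect to the distribution's parameters.

```python
import random

def reparam_gradients(mu, sigma, n_samples=20000, seed=0):
    """Monte Carlo estimate of the gradient of E[z^2], z ~ N(mu, sigma^2),
    with respect to (mu, sigma), using z = mu + sigma * eps, eps ~ N(0, 1)."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise that does not depend on (mu, sigma)
        z = mu + sigma * eps        # deterministic, differentiable transform
        df_dz = 2.0 * z             # derivative of the toy function f(z) = z^2
        grad_mu += df_dz            # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples

# Analytically E[z^2] = mu^2 + sigma^2, so the exact gradients are (2*mu, 2*sigma).
g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.5)
```

Because the randomness sits in eps rather than in the parameters, the same averages could be fed to any standard stochastic gradient method, which is the property the abstract relies on.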

4,883 citations

Posted Content•

An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks

Ian Goodfellow¹, Mehdi Mirza¹, Da Xiao², Aaron Courville¹, Yoshua Bengio¹•Institutions (2)

Université de Montréal¹, Beijing University of Posts and Telecommunications²

21 Dec 2013-arXiv: Machine Learning

TL;DR: In this article, the authors investigate the extent to which the catastrophic forgetting problem occurs for modern neural networks, comparing both established and recent gradient-based training algorithms and activation functions, and find that the dropout algorithm is consistently best at adapting to the new task and remembering the old task, and has the best tradeoff curve between these two extremes.

Abstract: Catastrophic forgetting is a problem faced by many machine learning models and algorithms. When trained on one task, then trained on a second task, many machine learning models "forget" how to perform the first task. This is widely believed to be a serious problem for neural networks. Here, we investigate the extent to which the catastrophic forgetting problem occurs for modern neural networks, comparing both established and recent gradient-based training algorithms and activation functions. We also examine the effect of the relationship between the first task and the second task on catastrophic forgetting. We find that it is always best to train using the dropout algorithm--the dropout algorithm is consistently best at adapting to the new task, remembering the old task, and has the best tradeoff curve between these two extremes. We find that different tasks and relationships between tasks result in very different rankings of activation function performance. This suggests the choice of activation function should always be cross-validated.

755 citations

Journal Article•DOI•

Challenges of Big Data Analysis

Jianqing Fan¹, Fang Han², Han Liu¹•Institutions (2)

Princeton University¹, Johns Hopkins University²

07 Aug 2013-arXiv: Machine Learning

TL;DR: In this article, the authors give an overview of the salient features of Big Data and of how these features drive a paradigm change in statistical and computational methods as well as in computing architectures, and offer new perspectives on Big Data analysis and computation.

Abstract: Big Data bring new opportunities to modern society and challenges to data scientists. On one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This article gives an overview of the salient features of Big Data and of how these features drive a paradigm change in statistical and computational methods as well as in computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogenous assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity. They can lead to wrong statistical inferences and consequently wrong scientific conclusions.

733 citations

Posted Content•

Maxout Networks

Ian Goodfellow¹, David Warde-Farley¹, Mehdi Mirza¹, Aaron Courville¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

18 Feb 2013-arXiv: Machine Learning

TL;DR: In this article, a simple new model called maxout, a natural companion to dropout, is proposed to both facilitate optimization by dropout and improve the accuracy of dropout's fast approximate model averaging technique.

Abstract: We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to both facilitate optimization by dropout and improve the accuracy of dropout's fast approximate model averaging technique. We empirically verify that the model successfully accomplishes both of these tasks. We use maxout and dropout to demonstrate state-of-the-art classification performance on four benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN.
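
Example: a minimal sketch of a single maxout unit, whose output is "the max of a set of inputs". It assumes that the inputs to the max are k affine functions of the unit's input vector; the function name `maxout` and its argument layout are illustrative choices, not an API from the paper or from any library.

```python
def maxout(x, weights, biases):
    """Return max_j (w_j . x + b_j) over the k affine pieces of one unit."""
    pieces = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return max(pieces)

# Two pieces, x and -x, make the unit compute the absolute value.
abs_weights = [[1.0], [-1.0]]
abs_biases = [0.0, 0.0]
print(maxout([-3.0], abs_weights, abs_biases))  # 3.0
```

The absolute-value case hints at why such a unit is flexible: with enough pieces, the max of affine functions can represent any convex piecewise-linear activation.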

672 citations

Posted Content•

Domain Generalization via Invariant Feature Representation

Krikamol Muandet, David Balduzzi, Bernhard Schölkopf

10 Jan 2013-arXiv: Machine Learning

TL;DR: Domain-Invariant Component Analysis (DICA) is proposed, a kernel-based optimization algorithm that learns an invariant transformation by minimizing the dissimilarity across domains whilst preserving the functional relationship between input and output variables.

Abstract: This paper investigates domain generalization: How to take knowledge acquired from an arbitrary number of related domains and apply it to previously unseen domains? We propose Domain-Invariant Component Analysis (DICA), a kernel-based optimization algorithm that learns an invariant transformation by minimizing the dissimilarity across domains, whilst preserving the functional relationship between input and output variables. A learning-theoretic analysis shows that reducing dissimilarity improves the expected generalization ability of classifiers on new domains, motivating the proposed algorithm. Experimental results on synthetic and real-world datasets demonstrate that DICA successfully learns invariant features and improves classifier performance in practice.

651 citations

Posted Content•

Black Box Variational Inference

Rajesh Ranganath¹, Sean Gerrish¹, David M. Blei¹•Institutions (1)

Princeton University¹

31 Dec 2013-arXiv: Machine Learning

TL;DR: The authors propose a black box variational inference algorithm based on stochastic optimization of the variational objective, in which the noisy gradient is computed from Monte Carlo samples from the variational distribution; the algorithm can be applied to many models with little additional derivation.

Abstract: Variational inference has become a widely used method to approximate posteriors in complex latent variable models. However, deriving a variational inference algorithm generally requires significant model-specific analysis, and these efforts can hinder and deter us from quickly developing and exploring a variety of models for a problem at hand. In this paper, we present a "black box" variational inference algorithm, one that can be quickly applied to many models with little additional derivation. Our method is based on a stochastic optimization of the variational objective where the noisy gradient is computed from Monte Carlo samples from the variational distribution. We develop a number of methods to reduce the variance of the gradient, always maintaining the criterion that we want to avoid difficult model-based derivations. We evaluate our method against the corresponding black box sampling-based methods. We find that our method reaches better predictive likelihoods much faster than sampling methods. Finally, we demonstrate that Black Box Variational Inference lets us easily explore a wide space of models by quickly constructing and evaluating several models of longitudinal healthcare data.
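
Example: a sketch of a noisy gradient computed from Monte Carlo samples of the variational distribution. It assumes the score-function form of the gradient, E_q[d/dm log q(z) * (log p(x, z) - log q(z))], and a toy model chosen here for illustration (z ~ N(0, 1), x | z ~ N(z, 1), q(z) = N(m, s^2)). All function names are hypothetical, and none of the variance reduction methods mentioned in the abstract are included.

```python
import math
import random

def log_joint(x, z):
    """log p(x, z) for the toy model z ~ N(0, 1), x | z ~ N(z, 1)."""
    return -0.5 * z * z - 0.5 * (x - z) ** 2 - math.log(2.0 * math.pi)

def log_q(z, m, s):
    """Log density of the variational distribution q(z) = N(m, s^2)."""
    return -0.5 * ((z - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)

def score_gradient(x, m, s, n_samples=50000, seed=0):
    """Noisy gradient of the variational objective with respect to m."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(m, s)            # sample from the variational distribution
        score = (z - m) / (s * s)      # d/dm log q(z | m, s)
        total += score * (log_joint(x, z) - log_q(z, m, s))
    return total / n_samples

# For this toy model the exact gradient with respect to m is x - 2*m.
estimate = score_gradient(x=2.0, m=0.0, s=math.sqrt(0.5))
```

The model enters only through evaluations of `log_joint`, never through its derivatives, which is what makes the approach "black box" in the sense of the abstract.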

582 citations

Posted Content•

Challenges in Representation Learning: A report on three machine learning contests

Ian Goodfellow¹, Dumitru Erhan², Pierre Luc Carrier¹, Aaron Courville¹, Mehdi Mirza¹, Ben Hamner¹, William Cukierski¹, Yichuan Tang¹, David Thaler¹, Dong-Hyun Lee¹, Yingbo Zhou¹, Chetan Ramaiah¹, Fangxiang Feng¹, Ruifan Li¹, Xiaojie Wang¹, Dimitris Athanasakis¹, John Shawe-Taylor¹, Maxim Milakov¹, John Park¹, Radu Ionescu¹, Marius Popescu¹, Cristian Grozea¹, James Bergstra¹, Jingjing Xie¹, Lukasz Romaszko¹, Bing Xu¹, Zhang Chuang¹, Yoshua Bengio¹•Institutions (2)

Université de Montréal¹, Google²

01 Jul 2013-arXiv: Machine Learning

TL;DR: The ICML 2013 Workshop on Challenges in Representation Learning focused on three challenges: the black box learning challenge, the facial expression recognition challenge, and the multimodal learning challenge.

Abstract: The ICML 2013 Workshop on Challenges in Representation Learning focused on three challenges: the black box learning challenge, the facial expression recognition challenge, and the multimodal learning challenge. We describe the datasets created for these challenges and summarize the results of the competitions. We provide suggestions for organizers of future challenges and some comments on what kind of knowledge can be gained from machine learning competitions.

510 citations

Posted Content•

Structure Discovery in Nonparametric Regression through Compositional Kernel Search

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, Zoubin Ghahramani

20 Feb 2013-arXiv: Machine Learning

TL;DR: This work defines a space of kernel structures which are built compositionally by adding and multiplying a small number of base kernels, and presents a method for searching over this space of structures which mirrors the scientific discovery process.

Abstract: Despite its importance, choosing the structural form of the kernel in nonparametric regression remains a black art. We define a space of kernel structures which are built compositionally by adding and multiplying a small number of base kernels. We present a method for searching over this space of structures which mirrors the scientific discovery process. The learned structures can often decompose functions into interpretable components and enable long-range extrapolation on time-series datasets. Our structure search method outperforms many widely used kernels and kernel combination methods on a variety of prediction tasks.
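
Example: a sketch of how kernel structures can be built compositionally by adding and multiplying base kernels. The choice of base kernels here (squared exponential, periodic, linear), their parameterizations, and every function name are assumptions made for illustration; the search procedure over structures is not shown.

```python
import math

def se(lengthscale=1.0):
    """Squared exponential base kernel on scalar inputs."""
    return lambda x, y: math.exp(-0.5 * ((x - y) / lengthscale) ** 2)

def periodic(period=1.0, lengthscale=1.0):
    """Periodic base kernel on scalar inputs."""
    return lambda x, y: math.exp(
        -2.0 * math.sin(math.pi * abs(x - y) / period) ** 2 / lengthscale ** 2
    )

def linear():
    """Linear base kernel on scalar inputs."""
    return lambda x, y: x * y

def add(k1, k2):
    """Sum of two kernels, itself a valid kernel."""
    return lambda x, y: k1(x, y) + k2(x, y)

def mul(k1, k2):
    """Product of two kernels, itself a valid kernel."""
    return lambda x, y: k1(x, y) * k2(x, y)

# One candidate structure: locally periodic variation plus a linear trend.
kernel = add(mul(se(2.0), periodic(1.0)), linear())
```

Since sums and products of kernels are again kernels, each expression of this kind is a candidate covariance that a structure search could score and extend.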

396 citations

Posted Content•

Gaussian Process Kernels for Pattern Discovery and Extrapolation

Andrew Gordon Wilson¹, Ryan P. Adams²•Institutions (2)

University of Cambridge¹, Harvard University²

18 Feb 2013-arXiv: Machine Learning

TL;DR: In this paper, simple closed-form kernels for Gaussian processes are derived by modelling a spectral density with a Gaussian mixture; the kernels are demonstrated by discovering patterns and performing long-range extrapolation on synthetic examples, as well as on atmospheric CO2 trends and airline passenger data.

Abstract: Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to smoothing and interpolation. We introduce simple closed form kernels that can be used with Gaussian processes to discover patterns and enable extrapolation. These kernels are derived by modelling a spectral density -- the Fourier transform of a kernel -- with a Gaussian mixture. The proposed kernels support a broad class of stationary covariances, but Gaussian process inference remains simple and analytic. We demonstrate the proposed kernels by discovering patterns and performing long range extrapolation on synthetic examples, as well as atmospheric CO2 trends and airline passenger data. We also show that we can reconstruct standard covariances within our framework.
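
Example: a sketch of the closed form that results from modelling a spectral density with a Gaussian mixture, for one-dimensional inputs. It assumes the form k(tau) = sum_q w_q * exp(-2 * pi^2 * tau^2 * v_q) * cos(2 * pi * tau * mu_q), where w_q, mu_q and v_q are the weight, mean and variance of the q-th mixture component; the function and argument names are illustrative.

```python
import math

def spectral_mixture_kernel(tau, weights, means, variances):
    """Stationary kernel value at lag tau for a Gaussian-mixture spectral density."""
    return sum(
        w * math.exp(-2.0 * math.pi ** 2 * tau ** 2 * v)
        * math.cos(2.0 * math.pi * tau * mu)
        for w, mu, v in zip(weights, means, variances)
    )

# Two components: a slowly varying trend and an oscillation with frequency 1.
k_at_lag = spectral_mixture_kernel(0.3, [1.0, 0.5], [0.0, 1.0], [0.05, 0.01])
```

With a single component centred at frequency zero, the expression reduces to a squared exponential in tau, which is one way a standard covariance can be recovered within this family.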

356 citations

Posted Content•

(More) Efficient Reinforcement Learning via Posterior Sampling

Ian Osband¹, Daniel Russo¹, Benjamin Van Roy¹•Institutions (1)

Stanford University¹

04 Jun 2013-arXiv: Machine Learning

TL;DR: An $\tilde{O}(\tau S \sqrt{AT})$ bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.

Abstract: Most provably-efficient learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration, posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an $\tilde{O}(\tau S \sqrt{AT})$ bound on the expected regret, where $T$ is time, $\tau$ is the episode length and $S$ and $A$ are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.

316 citations

Journal Article•DOI•

Frequency Recognition in SSVEP-based BCI using Multiset Canonical Correlation Analysis

Yu Zhang¹, Guoxu Zhou², Jing Jin¹, Xingyu Wang¹, Andrzej Cichocki²,³•Institutions (3)

East China University of Science and Technology¹, RIKEN Brain Science Institute², Systems Research Institute³

26 Aug 2013-arXiv: Machine Learning

TL;DR: An experimental study with EEG data from 10 healthy subjects demonstrates that the proposed MsetCCA method improves the recognition accuracy of SSVEP frequency in comparison with the CCA method and two other competing methods (multiway CCA (MwayCCA) and phase constrained CCA (PCCA)), especially for a small number of channels and a short time window length.

Abstract: Canonical correlation analysis (CCA) has been one of the most popular methods for frequency recognition in steady-state visual evoked potential (SSVEP)-based brain-computer interfaces (BCIs). Despite its efficiency, a potential problem is that using pre-constructed sine-cosine waves as the required reference signals in the CCA method often does not result in the optimal recognition accuracy due to their lack of features from the real EEG data. To address this problem, this study proposes a novel method based on multiset canonical correlation analysis (MsetCCA) to optimize the reference signals used in the CCA method for SSVEP frequency recognition. The MsetCCA method learns multiple linear transforms that implement joint spatial filtering to maximize the overall correlation among canonical variates, and hence extracts SSVEP common features from multiple sets of EEG data recorded at the same stimulus frequency. The optimized reference signals are formed by combining the common features and are based entirely on training data. An experimental study with EEG data from ten healthy subjects demonstrates that the MsetCCA method improves the recognition accuracy of SSVEP frequency in comparison with the CCA method and two other competing methods (multiway CCA (MwayCCA) and phase constrained CCA (PCCA)), especially for a small number of channels and a short time window length. This superiority indicates that the proposed MsetCCA method is a promising new candidate for frequency recognition in SSVEP-based BCIs.

Posted Content•

Streaming Variational Bayes

Tamara Broderick¹, Nicholas Boyd¹, Andre Wibisono¹, Ashia C. Wilson¹, Michael I. Jordan¹•Institutions (1)

University of California, Berkeley¹

25 Jul 2013-arXiv: Machine Learning

TL;DR: SDA-Bayes is a framework for streaming, distributed, asynchronous computation of a Bayesian posterior; it makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive.

Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data---a case where SVI may be applied---and in the streaming setting, where SVI does not apply.
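
Example: a sketch of the streaming part only, in which the posterior after one batch serves as the prior for the next and the update itself is delegated to a user-supplied batch primitive. For simplicity the primitive below is an exact conjugate Beta-Bernoulli update rather than variational Bayes, and the distributed and asynchronous aspects are not modelled; all names are hypothetical.

```python
def beta_bernoulli_primitive(prior, batch):
    """Batch primitive: map (prior, data batch) to a posterior.
    Exact conjugate update for a Beta(a, b) prior on a coin's bias."""
    a, b = prior
    heads = sum(batch)
    return (a + heads, b + len(batch) - heads)

def streaming_posterior(prior, batches, primitive):
    """Fold the primitive over the stream of batches."""
    posterior = prior
    for batch in batches:
        posterior = primitive(posterior, batch)  # old posterior acts as the new prior
    return posterior

batches = [[1, 0, 1], [1, 1], [0, 0, 1, 0]]
print(streaming_posterior((1, 1), batches, beta_bernoulli_primitive))  # (6, 5)
```

With an exact primitive the streamed result matches a single update on all the data at once; with an approximate primitive such as VB the streamed posterior is itself an approximation.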

Posted Content•

Causal Discovery with Continuous Additive Noise Models

Jonas Peters¹, Joris M. Mooij², Dominik Janzing³, Bernhard Schölkopf³•Institutions (3)

ETH Zurich¹, Radboud University Nijmegen², Max Planck Society³

26 Sep 2013-arXiv: Machine Learning

TL;DR: In this article, the authors consider the problem of learning causal directed acyclic graphs from an observational joint distribution, and they show that if the observational distribution follows a structural equation model with an additive noise structure, the graph becomes identifiable from the distribution under mild conditions.

Abstract: We consider the problem of learning causal directed acyclic graphs from an observational joint distribution. One can use these graphs to predict the outcome of interventional experiments, from which data are often not available. We show that if the observational distribution follows a structural equation model with an additive noise structure, the directed acyclic graph becomes identifiable from the distribution under mild conditions. This constitutes an interesting alternative to traditional methods that assume faithfulness and identify only the Markov equivalence class of the graph, thus leaving some edges undirected. We provide practical algorithms for finitely many samples, RESIT (Regression with Subsequent Independence Test) and two methods based on an independence score. We prove that RESIT is correct in the population setting and provide an empirical evaluation.

Posted Content•

Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel

Tai Qin¹, Karl Rohe¹•Institutions (1)

University of Wisconsin-Madison¹

16 Sep 2013-arXiv: Machine Learning

TL;DR: The paper characterizes and justifies several of the variations of the spectral clustering algorithm in terms of the Degree-Corrected Stochastic Blockmodel and the Extended Planted Partition model, two statistical models that allow for highly heterogeneous degrees.

Abstract: Spectral clustering is a fast and popular algorithm for finding clusters in networks. Recently, Chaudhuri et al. (2012) and Amini et al. (2012) proposed inspired variations on the algorithm that artificially inflate the node degrees for improved statistical performance. The current paper extends the previous statistical estimation results to the more canonical spectral clustering algorithm in a way that removes any assumption on the minimum degree and provides guidance on the choice of the tuning parameter. Moreover, our results show how the "star shape" in the eigenvectors--a common feature of empirical networks--can be explained by the Degree-Corrected Stochastic Blockmodel and the Extended Planted Partition model, two statistical models that allow for highly heterogeneous degrees. Throughout, the paper characterizes and justifies several of the variations of the spectral clustering algorithm in terms of these models.

Posted Content•

Pylearn2: a machine learning research library

Ian Goodfellow, David Warde-Farley, Pascal Lamblin, Vincent Dumoulin, Mehdi Mirza, Razvan Pascanu, James Bergstra, Frédéric Bastien, Yoshua Bengio

20 Aug 2013-arXiv: Machine Learning

TL;DR: The paper gives a brief history of the library, an overview of its basic philosophy, a summary of the library's architecture, and a description of how the Pylearn2 community functions socially.

Abstract: Pylearn2 is a machine learning research library. This does not just mean that it is a collection of machine learning algorithms that share a common API; it means that it has been designed for flexibility and extensibility in order to facilitate research projects that involve new or unusual use cases. In this paper we give a brief history of the library, an overview of its basic philosophy, a summary of the library's architecture, and a description of how the Pylearn2 community functions socially.

Posted Content•

Petuum: A New Platform for Distributed Machine Learning on Big Data

Eric P. Xing¹, Qirong Ho², Wei Dai¹, Jin-Kyu Kim¹, Jinliang Wei¹, Seunghak Lee¹, Xun Zheng¹, Pengtao Xie¹, Abhimanu Kumar¹, Yaoliang Yu¹•Institutions (2)

Carnegie Mellon University¹, Agency for Science, Technology and Research²

30 Dec 2013-arXiv: Machine Learning

TL;DR: In this article, the authors propose a general-purpose framework that systematically addresses data- and model-parallel challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions.

Abstract: What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized graph-based execution that relies on graph representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of ML programs at scale. We propose a general-purpose framework that systematically addresses data- and model-parallel challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions. This presents unique opportunities for an integrative system design, such as bounded-error network synchronization and dynamic scheduling based on ML program structure. We demonstrate the efficacy of these system designs versus well-known implementations of modern ML algorithms, allowing ML programs to run in much less time and at considerably larger model sizes, even on modestly-sized compute clusters.

Posted Content•

Early stopping and non-parametric regression: An optimal data-dependent stopping rule

Garvesh Raskutti¹, Martin J. Wainwright², Bin Yu²•Institutions (2)

University of Wisconsin-Madison¹, University of California, Berkeley²

15 Jun 2013-arXiv: Machine Learning

TL;DR: This paper derives upper bounds on both the $L^2(P_n)$ and $L^2(P)$ error for arbitrary RKHSs, and provides an explicit and easily computable data-dependent stopping rule that depends only on the sum of step-sizes and the eigenvalues of the empirical kernel matrix for the RKHS.

Abstract: The strategy of early stopping is a regularization technique based on choosing a stopping time for an iterative algorithm. Focusing on non-parametric regression in a reproducing kernel Hilbert space, we analyze the early stopping strategy for a form of gradient-descent applied to the least-squares loss function. We propose a data-dependent stopping rule that does not involve hold-out or cross-validation data, and we prove upper bounds on the squared error of the resulting function estimate, measured in either the $L^2(P)$ or the $L^2(P_n)$ norm. These upper bounds lead to minimax-optimal rates for various kernel classes, including Sobolev smoothness classes and other forms of reproducing kernel Hilbert spaces. We show through simulation that our stopping rule compares favorably to two other stopping rules, one based on hold-out data and the other based on Stein's unbiased risk estimate. We also establish a tight connection between our early stopping strategy and the solution path of a kernel ridge regression estimator.

Posted Content•

bartMachine: Machine Learning with Bayesian Additive Regression Trees

Adam Kapelner, Justin Bleich

08 Dec 2013-arXiv: Machine Learning

TL;DR: The package introduces many new features for data analysis using BART such as variable selection, interaction detection, model diagnostic plots, incorporation of missing data and the ability to save trees for future prediction.

Abstract: We present a new package in R implementing Bayesian additive regression trees (BART). The package introduces many new features for data analysis using BART such as variable selection, interaction detection, model diagnostic plots, incorporation of missing data and the ability to save trees for future prediction. It is significantly faster than the current R implementation, parallelized, and capable of handling both large sample sizes and high-dimensional data.

Posted Content•

Phase Retrieval using Alternating Minimization

Praneeth Netrapalli¹, Prateek Jain¹, Sujay Sanghavi²•Institutions (2)

Microsoft¹, University of Texas at Austin²

02 Jun 2013-arXiv: Machine Learning

TL;DR: In this article, the authors show that a resampling variant of the alternating minimization algorithm converges geometrically to the solution of a nonconvex phase retrieval problem.

Abstract: Phase retrieval problems involve solving linear equations, but with missing sign (or phase, for complex numbers) information. More than four decades after it was first proposed, the seminal error reduction algorithm of Gerchberg and Saxton (1972) and Fienup (1982) is still the popular choice for solving many variants of this problem. The algorithm is based on alternating minimization; i.e., it alternates between estimating the missing phase information and the candidate solution. Despite its wide usage in practice, no global convergence guarantees for this algorithm are known. In this paper, we show that a (resampling) variant of this approach converges geometrically to the solution of one such problem -- finding a vector $\mathbf{x}$ from $\mathbf{y},\mathbf{A}$, where $\mathbf{y} = \left|\mathbf{A}^{\top}\mathbf{x}\right|$ and $|\mathbf{z}|$ denotes a vector of element-wise magnitudes of $\mathbf{z}$ -- under the assumption that $\mathbf{A}$ is Gaussian. Empirically, we demonstrate that alternating minimization performs similarly to recently proposed convex techniques for this problem (which are based on "lifting" to a convex matrix problem) in sample complexity and robustness to noise. However, it is much more efficient and can scale to large problems. Analytically, for a resampling version of alternating minimization, we show geometric convergence to the solution, and sample complexity that is off by log factors from obvious lower bounds. We also establish close to optimal scaling for the case when the unknown vector is sparse. Our work represents the first theoretical guarantee for alternating minimization (albeit with resampling) for any variant of phase retrieval problems in the non-convex setting.
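
Example: a sketch of plain alternating minimization for the real-valued case, where the missing information is a sign per measurement. It alternates between estimating the signs and re-fitting the candidate solution by least squares. This is a simplification: it uses a random starting point and no resampling, so it is not the variant analysed in the paper and carries none of its guarantees. Each entry of `vectors` plays the role of one column of $\mathbf{A}$, and all helper names are hypothetical.

```python
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def solve_linear(M, r):
    """Solve M w = r by Gaussian elimination with partial pivoting."""
    n = len(r)
    aug = [row[:] + [r[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(col + 1, n):
            factor = aug[i][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[i][j] -= factor * aug[col][j]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (aug[i][n] - tail) / aug[i][i]
    return w

def least_squares(vectors, targets):
    """Minimize sum_i (a_i . x - targets[i])^2 through the normal equations."""
    d = len(vectors[0])
    gram = [[sum(a[j] * a[k] for a in vectors) for k in range(d)] for j in range(d)]
    rhs = [sum(a[j] * t for a, t in zip(vectors, targets)) for j in range(d)]
    return solve_linear(gram, rhs)

def misfit(vectors, y, x):
    """Sum of squared differences between |a_i . x| and the measured magnitudes."""
    return sum((abs(dot(a, x)) - yi) ** 2 for a, yi in zip(vectors, y))

def alternating_minimization(vectors, y, x0, n_iters=20):
    x = list(x0)
    history = [misfit(vectors, y, x)]
    for _ in range(n_iters):
        # Step 1: estimate the missing signs from the current candidate.
        signs = [1.0 if dot(a, x) >= 0 else -1.0 for a in vectors]
        # Step 2: re-fit the candidate to the signed measurements.
        x = least_squares(vectors, [s * yi for s, yi in zip(signs, y)])
        history.append(misfit(vectors, y, x))
    return x, history

rng = random.Random(0)
d, m = 4, 60
x_true = [rng.gauss(0, 1) for _ in range(d)]
vectors = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]  # Gaussian measurements
y = [abs(dot(a, x_true)) for a in vectors]                         # signs are lost
x0 = [rng.gauss(0, 1) for _ in range(d)]
x_hat, history = alternating_minimization(vectors, y, x0)
```

Both steps minimize the same objective over one block of unknowns, so the misfit recorded in `history` cannot increase from one iteration to the next (up to rounding); whether it reaches zero depends on the starting point, and since x and -x produce the same magnitudes the solution can at best be recovered up to sign.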

Posted Content•

Fast Computation of Wasserstein Barycenters

Marco Cuturi¹, Arnaud Doucet²•Institutions (2)

Kyoto University¹, University of Oxford²

16 Oct 2013-arXiv: Machine Learning

TL;DR: The authors propose to smooth the Wasserstein distance with an entropic regularizer, recovering a strictly convex objective whose gradients can be computed at a considerably lower computational cost using matrix scaling algorithms.

Abstract: We present new algorithms to compute the mean of a set of empirical probability measures under the optimal transport metric. This mean, known as the Wasserstein barycenter, is the measure that minimizes the sum of its Wasserstein distances to each element in that set. We propose two original algorithms to compute Wasserstein barycenters that build upon the subgradient method. A direct implementation of these algorithms is, however, too costly because it would require the repeated resolution of large primal and dual optimal transport problems to compute subgradients. Extending the work of Cuturi (2013), we propose to smooth the Wasserstein distance used in the definition of Wasserstein barycenters with an entropic regularizer and recover in doing so a strictly convex objective whose gradients can be computed for a considerably cheaper computational cost using matrix scaling algorithms. We use these algorithms to visualize a large family of images and to solve a constrained clustering problem.
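
Example: a sketch of the matrix scaling iteration behind an entropy-smoothed transport cost between two histograms. It covers only the smoothed distance between a pair of histograms, not the barycenter algorithms themselves; the function name `sinkhorn` and the parameters `reg` and `n_iters` are illustrative choices rather than the paper's notation.

```python
import math

def sinkhorn(a, b, cost, reg=0.5, n_iters=500):
    """Return an entropy-regularized transport plan between histograms a and b,
    together with the transport cost of that plan."""
    n, m = len(a), len(b)
    # Elementwise kernel of the cost matrix; smaller reg means less smoothing.
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    transport_cost = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, transport_cost

a = [0.5, 0.3, 0.2]
b = [0.2, 0.2, 0.6]
cost = [[abs(i - j) for j in range(3)] for i in range(3)]
plan, transport_cost = sinkhorn(a, b, cost)
```

Each iteration costs only a few matrix-vector products, which is the source of the lower computational cost compared with solving an exact optimal transport problem at every step.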

Posted Content•

RNADE: The real-valued neural autoregressive density-estimator

Benigno Uria¹, Iain Murray¹, Hugo Larochelle²•Institutions (2)

University of Edinburgh¹, Université de Sherbrooke²

02 Jun 2013-arXiv: Machine Learning

TL;DR: RNADE calculates the density of a datapoint as the product of one-dimensional conditionals modeled using mixture density networks with shared parameters, and it outperforms mixture models in all but one case.

Abstract: We introduce RNADE, a new model for joint density estimation of real-valued vectors. Our model calculates the density of a datapoint as the product of one-dimensional conditionals modeled using mixture density networks with shared parameters. RNADE learns a distributed representation of the data, while having a tractable expression for the calculation of densities. A tractable likelihood allows direct comparison with other methods and training by standard gradient-based optimizers. We compare the performance of RNADE on several datasets of heterogeneous and perceptual data, finding it outperforms mixture models in all but one case.
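
Worked form: the factorization described in the abstract can be written out as follows. The symbols $D$ (data dimension), $K$ (number of mixture components), $\alpha_{d,k}$, $\mu_{d,k}$ and $\sigma_{d,k}$ are notation introduced here, and the Gaussian form of the components is an assumption made for illustration; in a mixture density network these quantities are outputs of a network that takes the preceding dimensions $\mathbf{x}_{<d}$ as input.

$$p(\mathbf{x}) = \prod_{d=1}^{D} p(x_d \mid \mathbf{x}_{<d}), \qquad p(x_d \mid \mathbf{x}_{<d}) = \sum_{k=1}^{K} \alpha_{d,k}\, \mathcal{N}\!\left(x_d;\ \mu_{d,k},\ \sigma_{d,k}^{2}\right)$$

Because every factor is a one-dimensional density that can be evaluated exactly, the product gives a tractable likelihood for the whole vector.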


Posted Content•

Optimization with First-Order Surrogate Functions


Julien Mairal¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

14 May 2013-arXiv: Machine Learning

TL;DR: In this paper, the authors provide a unified viewpoint for several first-order optimization techniques such as accelerated proximal gradient, block coordinate descent, or Frank-Wolfe algorithms, and introduce a new incremental scheme that experimentally matches or outperforms state-of-the-art solvers for large-scale optimization problems typically arising in machine learning.


Abstract: In this paper, we study optimization methods consisting of iteratively minimizing surrogates of an objective function. By proposing several algorithmic variants and simple convergence analyses, we make two main contributions. First, we provide a unified viewpoint for several first-order optimization techniques such as accelerated proximal gradient, block coordinate descent, or Frank-Wolfe algorithms. Second, we introduce a new incremental scheme that experimentally matches or outperforms state-of-the-art solvers for large-scale optimization problems typically arising in machine learning.


Posted Content•

Semi-Stochastic Gradient Descent Methods


Jakub Konečný¹, Peter Richtárik¹•Institutions (1)

University of Edinburgh¹

05 Dec 2013-arXiv: Machine Learning

TL;DR: In this paper, the authors propose a semi-stochastic gradient descent (S2GD) method, which runs for one or several epochs; in each epoch a single full gradient and a random number of stochastic gradients are computed, with that number following a geometric law.


Abstract: In this paper we study the problem of minimizing the average of a large number ($n$) of smooth convex loss functions. We propose a new method, S2GD (Semi-Stochastic Gradient Descent), which runs for one or several epochs; in each epoch a single full gradient and a random number of stochastic gradients are computed, with that number following a geometric law. The total work needed for the method to output an $\varepsilon$-accurate solution in expectation, measured in the number of passes over data, or equivalently, in units equivalent to the computation of a single gradient of the loss, is $O((\kappa/n)\log(1/\varepsilon))$, where $\kappa$ is the condition number. This is achieved by running the method for $O(\log(1/\varepsilon))$ epochs, with a single full gradient evaluation and $O(\kappa)$ stochastic gradient evaluations in each. The SVRG method of Johnson and Zhang arises as a special case. If our method is limited to a single epoch only, it needs to evaluate at most $O((\kappa/\varepsilon)\log(1/\varepsilon))$ stochastic gradients. In contrast, SVRG requires $O(\kappa/\varepsilon^2)$ stochastic gradients. To illustrate our theoretical results, S2GD only needs the workload equivalent to about 2.1 full gradient evaluations to find a $10^{-6}$-accurate solution for a problem with $n=10^9$ and $\kappa=10^3$.
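
The epoch structure described in this abstract can be illustrated with a minimal sketch on a toy one-dimensional least-squares problem. Everything below is an assumption made for illustration and is not taken from the paper: the function name `s2gd`, the parameters `h` (step size), `m` (maximum inner steps) and `nu`, and in particular the exact form of the weights used to draw the random number of inner steps.

```python
import random


def s2gd(a, b, x0=0.0, h=0.05, m=50, nu=0.0, epochs=30, seed=0):
    """Toy semi-stochastic gradient descent for f(x) = mean_i 0.5*(a_i*x - b_i)^2."""
    rng = random.Random(seed)
    n = len(a)

    def grad_i(i, x):
        # Gradient of the i-th loss 0.5*(a_i*x - b_i)^2
        return a[i] * (a[i] * x - b[i])

    # Assumed geometric-type law over the number of inner steps t = 1..m
    # (uniform when nu == 0); the exact law in the paper may differ.
    steps = list(range(1, m + 1))
    weights = [(1.0 - nu * h) ** (m - t) for t in steps]

    y = x0
    for _ in range(epochs):
        # One full gradient per epoch, evaluated at the reference point y
        g = sum(grad_i(i, y) for i in range(n)) / n
        t_j = rng.choices(steps, weights=weights)[0]
        x = y
        for _ in range(t_j):
            i = rng.randrange(n)
            # Stochastic gradient corrected by the full gradient at y
            x -= h * (g + grad_i(i, x) - grad_i(i, y))
        y = x
    return y


a = [1.0, 2.0, 3.0, 1.5]
b = [1.0, 3.0, 7.0, 2.0]
x_hat = s2gd(a, b)
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
print(x_hat, x_star)
```

The corrected step `g + grad_i(i, x) - grad_i(i, y)` is an unbiased estimate of the full gradient at `x`, and its variance shrinks as `x` and `y` approach the minimizer, which is why a constant step size can be used here.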


Posted Content•

Memory Limited, Streaming PCA


Ioannis Mitliagkas¹, Constantine Caramanis¹, Prateek Jain²•Institutions (2)

University of Texas at Austin¹, Microsoft²

28 Jun 2013-arXiv: Machine Learning

TL;DR: An algorithm is presented that uses O(kp) memory and is able to compute the k-dimensional spike with O(p log p) sample-complexity - the first algorithm of its kind.


Abstract: We consider streaming, one-pass principal component analysis (PCA), in the high-dimensional regime, with limited memory. Here, $p$-dimensional samples are presented sequentially, and the goal is to produce the $k$-dimensional subspace that best approximates these points. Standard algorithms require $O(p^2)$ memory; meanwhile no algorithm can do better than $O(kp)$ memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where $p$-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, $n$, scales proportionally with the dimension, $p$. Yet all algorithms that provably achieve this have memory complexity $O(p^2)$. Meanwhile, algorithms with memory-complexity $O(kp)$ do not have provable bounds on sample complexity comparable to $p$. We present an algorithm that achieves both: it uses $O(kp)$ memory (meaning storage of any kind) and is able to compute the $k$-dimensional spike with $O(p \log p)$ sample-complexity -- the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data.
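
To make the memory constraint concrete, here is a minimal sketch of one way to estimate a single spike ($k = 1$) in one pass while storing only $O(p)$ numbers: a block-wise power iteration that never forms the $p \times p$ covariance matrix. The abstract does not spell out the algorithm, so this is an illustrative stand-in that may differ from the authors' method; the names `spiked_stream`, `streaming_top_component`, `strength` and `block_size` are hypothetical.

```python
import math
import random


def spiked_stream(u, n, strength, rng):
    """Yield n samples x = strength * z * u + w with z ~ N(0, 1), w ~ N(0, I)."""
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        yield [strength * z * ui + rng.gauss(0.0, 1.0) for ui in u]


def streaming_top_component(stream, p, block_size, rng):
    """One-pass block power iteration for the top component (k = 1).

    Only the current estimate q and an accumulator s are stored: O(p) memory.
    """
    q = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(v * v for v in q))
    q = [v / norm for v in q]
    s = [0.0] * p
    count = 0
    for x in stream:
        # Accumulate x * (x . q), i.e. the sample covariance applied to q
        proj = sum(xi * qi for xi, qi in zip(x, q))
        s = [si + proj * xi for si, xi in zip(s, x)]
        count += 1
        if count == block_size:
            # End of block: normalize to obtain the next iterate
            norm = math.sqrt(sum(v * v for v in s))
            q = [v / norm for v in s]
            s = [0.0] * p
            count = 0
    return q


rng = random.Random(0)
p = 10
u = [1.0] + [0.0] * (p - 1)  # true spike direction (first coordinate axis)
q = streaming_top_component(spiked_stream(u, 2000, 3.0, rng), p, 200, rng)
print(abs(q[0]))  # alignment with the true spike, between 0 and 1
```

For $k > 1$ the same idea would keep a $p \times k$ matrix and replace the normalization with an orthonormalization step such as QR, which keeps storage at $O(kp)$.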


Posted Content•

The Randomized Dependence Coefficient


David Lopez-Paz¹, Philipp Hennig¹, Bernhard Schölkopf¹•Institutions (1)

Max Planck Society¹

29 Apr 2013-arXiv: Machine Learning

TL;DR: The Randomized Dependence Coefficient is introduced, a measure of nonlinear dependence between random variables of arbitrary dimension based on the Hirschfeld-Gebelein-Renyi Maximum Correlation Coefficient, which has low computational cost and is easy to implement.


Abstract: We introduce the Randomized Dependence Coefficient (RDC), a measure of non-linear dependence between random variables of arbitrary dimension based on the Hirschfeld-Gebelein-Renyi Maximum Correlation Coefficient. RDC is defined in terms of correlation of random non-linear copula projections; it is invariant with respect to marginal distribution transformations, has low computational cost and is easy to implement: just five lines of R code, included at the end of the paper.


Posted Content•

Estimating Time-varying Brain Connectivity Networks from Functional MRI Time Series


Ricardo Pio Monti¹, Peter J. Hellyer¹, David J. Sharp¹, Robert Leech¹, Christoforos Anagnostopoulos¹, Giovanni Montana¹,²•Institutions (2)

Imperial College London¹, St Thomas' Hospital²

14 Oct 2013-arXiv: Machine Learning

TL;DR: In this article, the authors propose the Smooth Incremental Graphical Lasso Estimation (SINGLE) algorithm, which estimates dynamic brain networks from fMRI data, and use it to show that the Right Inferior Frontal Gyrus, frequently reported as playing an important role in cognitive control, dynamically changes with the task.


Abstract: Understanding the functional architecture of the brain in terms of networks is becoming increasingly common. In most fMRI applications functional networks are assumed to be stationary, resulting in a single network estimated for the entire time course. However, recent results suggest that the connectivity between brain regions is highly non-stationary even at rest. As a result, there is a need for new brain imaging methodologies that comprehensively account for the dynamic (i.e., non-stationary) nature of the fMRI data. In this work we propose the Smooth Incremental Graphical Lasso Estimation (SINGLE) algorithm, which estimates dynamic brain networks from fMRI data. We apply the SINGLE algorithm to functional MRI data from 24 healthy patients performing a choice-response task to demonstrate the dynamic changes in network structure that accompany a simple but attentionally demanding cognitive task. Using graph theoretic measures we show that the Right Inferior Frontal Gyrus, frequently reported as playing an important role in cognitive control, dynamically changes with the task. Our results suggest that the Right Inferior Frontal Gyrus plays a fundamental role in attention and executive function during cognitively demanding tasks and may play a key role in regulating the balance between other brain regions.


Posted Content•

Asymptotically Exact, Embarrassingly Parallel MCMC


Willie Neiswanger¹, Chong Wang, Eric P. Xing¹•Institutions (1)

Carnegie Mellon University¹

19 Nov 2013-arXiv: Machine Learning

TL;DR: In this article, the authors present a parallel Markov chain Monte Carlo (MCMC) algorithm in which subsets of data are processed independently, with very little communication, and prove that their algorithm generates asymptotically exact samples and empirically demonstrate its ability to parallelize burn-in and sampling.


Abstract: Communication costs, resulting from synchronization requirements during learning, can greatly slow down many parallel machine learning algorithms. In this paper, we present a parallel Markov chain Monte Carlo (MCMC) algorithm in which subsets of data are processed independently, with very little communication. First, we arbitrarily partition data onto multiple machines. Then, on each machine, any classical MCMC method (e.g., Gibbs sampling) may be used to draw samples from a posterior distribution given the data subset. Finally, the samples from each machine are combined to form samples from the full posterior. This embarrassingly parallel algorithm allows each machine to act independently on a subset of the data (without communication) until the final combination stage. We prove that our algorithm generates asymptotically exact samples and empirically demonstrate its ability to parallelize burn-in and sampling in several models.
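
The three stages described above (partition, sample independently, combine) can be sketched on a toy problem. The combination rule below multiplies Gaussian approximations of one-dimensional subposteriors; this is only exact when each subposterior is itself Gaussian and is not claimed to be the asymptotically exact procedure the abstract refers to. The model (a Gaussian mean with unit variance and a flat prior), the direct sampling that stands in for per-machine MCMC, and the name `combine_gaussian_subposteriors` are all assumptions made for illustration.

```python
import random
import statistics


def combine_gaussian_subposteriors(sample_sets):
    """Combine 1-D subposterior samples via a product of Gaussian fits.

    Precisions add, and the combined mean is the precision-weighted
    average of the subposterior means.
    """
    precisions = [1.0 / statistics.variance(s) for s in sample_sets]
    means = [statistics.mean(s) for s in sample_sets]
    total_precision = sum(precisions)
    mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    return mean, 1.0 / total_precision


rng = random.Random(0)
data = [rng.gauss(3.0, 1.0) for _ in range(200)]

# Stage 1: arbitrary partition of the data onto 4 "machines"
shards = [data[i::4] for i in range(4)]

# Stage 2: each machine samples its own subposterior without communication.
# For this toy model the subposterior is N(mean(shard), 1/len(shard)),
# so we draw from it directly instead of running MCMC.
sample_sets = [
    [rng.gauss(statistics.mean(shard), (1.0 / len(shard)) ** 0.5) for _ in range(5000)]
    for shard in shards
]

# Stage 3: combine once, at the end
mean, var = combine_gaussian_subposteriors(sample_sets)
print(mean, var)  # compare with statistics.mean(data) and 1 / len(data)
```

In this toy setting the full posterior is $N(\bar{x}, 1/n)$, so the combined mean and variance can be checked directly against the full-data answer.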


Posted Content•

Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC


Roger Frigola¹, Fredrik Lindsten², Thomas B. Schön², Carl Edward Rasmussen¹•Institutions (2)

University of Cambridge¹, Linköping University²

12 Jun 2013-arXiv: Machine Learning

TL;DR: This work presents a fully Bayesian approach to inference and learning in nonlinear nonparametric state-space models and places a Gaussian process prior over the state transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena.


Abstract: State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning (i.e. state estimation and system identification) in nonlinear nonparametric state-space models. We place a Gaussian process prior over the state transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. To enable efficient inference, we marginalize over the transition dynamics function and infer directly the joint smoothing distribution using specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. Our approach preserves the full nonparametric expressivity of the model and can make use of sparse Gaussian processes to greatly reduce computational complexity.


Posted Content•

Automated Variational Inference in Probabilistic Programming


David Wingate, Theophane Weber

07 Jan 2013-arXiv: Machine Learning

TL;DR: A new algorithm for approximate inference in probabilistic programs, based on a stochastic gradient for variational programs, is presented; it is efficient without restrictions on the probabilistic program and improves inference efficiency over other algorithms.


Abstract: We present a new algorithm for approximate inference in probabilistic programs, based on a stochastic gradient for variational programs. This method is efficient without restrictions on the probabilistic program; it is particularly practical for distributions which are not analytically tractable, including highly structured distributions that arise in probabilistic programs. We show how to automatically derive mean-field probabilistic programs and optimize them, and demonstrate that our perspective improves inference efficiency over other algorithms.


Posted Content•

Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances


Marco Cuturi

04 Jun 2013-arXiv: Machine Learning

TL;DR: This work smooths the classical optimal transportation problem with an entropic regularization term, and shows that the resulting optimum is also a distance which can be computed through Sinkhorn-Knopp's matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transportation solvers.


Abstract: Optimal transportation distances are a fundamental family of parameterized distances for histograms. Despite their appealing theoretical properties, excellent performance in retrieval tasks and intuitive formulation, their computation involves the resolution of a linear program whose cost is prohibitive whenever the histograms' dimension exceeds a few hundred. We propose in this work a new family of optimal transportation distances that look at transportation problems from a maximum-entropy perspective. We smooth the classical optimal transportation problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn-Knopp's matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transportation solvers. We also report improved performance over classical optimal transportation distances on the MNIST benchmark problem.
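
The matrix scaling idea mentioned in this abstract can be sketched in a few lines: build a kernel from the cost matrix, then alternately rescale rows and columns until the transport plan matches both histograms. The function name `sinkhorn_distance`, the parameter names `lam` and `n_iter`, and the fixed iteration count are assumptions for illustration; a practical implementation would use a convergence test and take care of numerical underflow for large `lam`.

```python
import math


def sinkhorn_distance(r, c, M, lam=10.0, n_iter=500):
    """Entropy-regularized transport cost between histograms r and c.

    r, c : lists of nonnegative weights, each summing to 1
    M    : cost matrix, M[i][j] = cost of moving mass from bin i to bin j
    lam  : regularization strength (larger means closer to unregularized cost)
    """
    n, m = len(r), len(c)
    K = [[math.exp(-lam * M[i][j]) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale so that P = diag(u) K diag(v) has row sums r
        # and column sums c
        u = [r[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [c[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    P = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    cost = sum(P[i][j] * M[i][j] for i in range(n) for j in range(m))
    return cost, P


M = [[0.0, 1.0], [1.0, 0.0]]
same, _ = sinkhorn_distance([0.5, 0.5], [0.5, 0.5], M)
moved, _ = sinkhorn_distance([1.0, 0.0], [0.0, 1.0], M)
print(same, moved)  # near 0 for identical histograms, near 1 when all mass moves
```

Each iteration costs only matrix-vector products with `K`, which is what makes this approach cheap compared with solving the transportation linear program directly.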

