Showing papers in "arXiv: Learning in 2014"

Posted Content•

Adam: A Method for Stochastic Optimization

Diederik P. Kingma¹, Jimmy Ba²•Institutions (2)

University of Amsterdam¹, University of Toronto²

22 Dec 2014-arXiv: Learning

TL;DR: Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments.

Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
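
The abstract does not spell out the update rule. As a minimal sketch, assuming the commonly cited form of the Adam update (exponential moving averages of the gradient and of the squared gradient, each with bias correction), a single-parameter version might look like the following. The function name `adam_minimize`, the toy quadratic objective, and the step count are illustrative assumptions; the default hyper-parameter values are the ones usually quoted for Adam.

```python
import math

def adam_minimize(grad, theta, steps=2000, lr=0.001,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a one-parameter objective given its gradient function."""
    m = 0.0  # running estimate of the first moment (mean of gradients)
    v = 0.0  # running estimate of the second raw moment (mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction compensates for initializing m and v at zero.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Hypothetical toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta_star = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0,
                           steps=5000, lr=0.01)
```

Because the step is the ratio `m_hat / sqrt(v_hat)`, multiplying every gradient by a positive constant leaves the update essentially unchanged, which is one way to read the abstract's claim of invariance to diagonal rescaling.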

23,486 citations

Posted Content•

Conditional Generative Adversarial Nets

Mehdi Mirza, Simon Osindero

06 Nov 2014-arXiv: Learning

TL;DR: This work introduces the conditional version of generative adversarial nets, constructed by feeding the data to condition on, y, to both the generator and the discriminator, and shows that the model can generate MNIST digits conditioned on class labels.

Abstract: Generative Adversarial Nets [8] were recently introduced as a novel way to train generative models. In this work we introduce the conditional version of generative adversarial nets, which can be constructed by simply feeding the data, y, we wish to condition on to both the generator and discriminator. We show that this model can generate MNIST digits conditioned on class labels. We also illustrate how this model could be used to learn a multi-modal model, and provide preliminary examples of an application to image tagging in which we demonstrate how this approach can generate descriptive tags which are not part of training labels.
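
One simple way to picture "feeding y to both the generator and discriminator" is to append a one-hot encoding of the label to each network's input. This is only a sketch under that assumption; the abstract does not say how the conditioning is wired in, and the helper names, the 100-dimensional noise vector, and the 784-pixel sample size are hypothetical choices for an MNIST-like setting.

```python
import random

NUM_CLASSES = 10  # assumed: ten digit classes, as in MNIST

def one_hot(label, num_classes=NUM_CLASSES):
    """Encode an integer class label as a one-hot list."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def generator_input(noise, label):
    """Noise vector z with the condition y appended."""
    return list(noise) + one_hot(label)

def discriminator_input(sample, label):
    """Real or generated sample x with the same condition y appended."""
    return list(sample) + one_hot(label)

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(100)]
g_in = generator_input(z, label=7)
d_in = discriminator_input([0.0] * 784, label=7)
```

Both networks then see the label alongside their usual input, so the discriminator can penalize samples that look realistic but do not match the requested class.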

7,987 citations

Posted Content•

How transferable are features in deep neural networks

Jason Yosinski¹, Jeff Clune², Yoshua Bengio³, Hod Lipson¹•Institutions (3)

Cornell University¹, University of Wyoming², Université de Montréal³

06 Nov 2014-arXiv: Learning

TL;DR: This paper quantifies the generality versus specificity of neurons in each layer of a deep convolutional neural network and reports a few surprising results, including that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.

Abstract: Many deep neural networks trained on natural images exhibit a curious phenomenon in common: on the first layer they learn features similar to Gabor filters and color blobs. Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks. Features must eventually transition from general to specific by the last layer of the network, but this transition has not been studied extensively. In this paper we experimentally quantify the generality versus specificity of neurons in each layer of a deep convolutional neural network and report a few surprising results. Transferability is negatively affected by two distinct issues: (1) the specialization of higher layer neurons to their original task at the expense of performance on the target task, which was expected, and (2) optimization difficulties related to splitting networks between co-adapted neurons, which was not expected. In an example network trained on ImageNet, we demonstrate that either of these two issues may dominate, depending on whether features are transferred from the bottom, middle, or top of the network. We also document that the transferability of features decreases as the distance between the base task and target task increases, but that transferring features even from distant tasks can be better than using random features. A final surprising result is that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.

4,663 citations

Posted Content•

A Tutorial on Principal Component Analysis.

Jonathon Shlens¹•Institutions (1)

Salk Institute for Biological Studies¹

03 Apr 2014-arXiv: Learning

TL;DR: This manuscript focuses on building a solid intuition for how and why principal component analysis works, and crystallizes this knowledge by deriving from simple intuitions, the mathematics behind PCA.

Abstract: Principal component analysis (PCA) is a mainstay of modern data analysis - a black box that is widely used but (sometimes) poorly understood. The goal of this paper is to dispel the magic behind this black box. This manuscript focuses on building a solid intuition for how and why principal component analysis works. This manuscript crystallizes this knowledge by deriving from simple intuitions, the mathematics behind PCA. This tutorial does not shy away from explaining the ideas informally, nor does it shy away from the mathematics. The hope is that by addressing both aspects, readers of all levels will be able to gain a better understanding of PCA as well as the when, the how and the why of applying this technique.
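
To make the "how" concrete, here is a small sketch that finds the leading principal component of a data set as the dominant eigenvector of its sample covariance matrix. It uses power iteration rather than a full eigendecomposition or SVD, purely to stay dependency-free; that choice, the function name, and the toy data are assumptions for illustration and are not taken from the tutorial.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, variance) of the leading principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the mean-centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # starting vector for power iteration
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance captured along v is the quadratic form v^T C v.
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                   for i in range(d))
    return v, variance

# Hypothetical 2-D points lying close to the line y = x.
points = [(-2.0, -2.1), (-1.0, -0.9), (0.0, 0.1), (1.0, 1.0), (2.0, 1.9)]
direction, variance = first_principal_component(points)
```

For these points the recovered direction is close to the diagonal, which is the axis along which the data varies most.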

2,281 citations

Posted Content•

Semi-Supervised Learning with Deep Generative Models

Diederik P. Kingma, Danilo Jimenez Rezende, Shakir Mohamed, Max Welling

20 Jun 2014-arXiv: Learning

TL;DR: It is shown that deep generative models and approximate Bayesian inference exploiting recent advances in variational methods can be used to provide significant improvements, making generative approaches highly competitive for semi-supervised learning.

Abstract: The ever-increasing size of modern data sets combined with the difficulty of obtaining label information has made semi-supervised learning one of the problems of significant practical importance in modern data analysis. We revisit the approach to semi-supervised learning with generative models and develop new models that allow for effective generalisation from small labelled data sets to large unlabelled ones. Generative approaches have thus far been either inflexible, inefficient or non-scalable. We show that deep generative models and approximate Bayesian inference exploiting recent advances in variational methods can be used to provide significant improvements, making generative approaches highly competitive for semi-supervised learning.

2,194 citations

Posted Content•

Recurrent Models of Visual Attention

Volodymyr Mnih¹, Nicolas Heess¹, Alex Graves¹, Koray Kavukcuoglu¹•Institutions (1)

Google¹

24 Jun 2014-arXiv: Learning

TL;DR: In this article, a recurrent neural network (RNN) model is proposed to extract information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution.

Abstract: Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Like convolutional neural networks, the proposed model has a degree of translation invariance built-in, but the amount of computation it performs can be controlled independently of the input image size. While the model is non-differentiable, it can be trained using reinforcement learning methods to learn task-specific policies. We evaluate our model on several image classification tasks, where it significantly outperforms a convolutional neural network baseline on cluttered images, and on a dynamic visual control problem, where it learns to track a simple object without an explicit training signal for doing so.

2,107 citations

Posted Content•

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

Ryan Kiros, Ruslan Salakhutdinov, Richard S. Zemel¹•Institutions (1)

University of Toronto¹

10 Nov 2014-arXiv: Learning

TL;DR: This work introduces the structure-content neural language model that disentangles the structure of a sentence to its content, conditioned on representations produced by the encoder, and shows that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic.

Abstract: Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a): a multimodal joint embedding space with images and text and (b): a novel language model for decoding distributed representations from our space. Our pipeline effectively unifies joint image-text embedding models with multimodal neural language models. We introduce the structure-content neural language model that disentangles the structure of a sentence to its content, conditioned on representations produced by the encoder. The encoder allows one to rank images and sentences while the decoder can generate novel descriptions from scratch. Using LSTM to encode sentences, we match the state-of-the-art performance on Flickr8K and Flickr30K without using object detections. We also set new best results when using the 19-layer Oxford convolutional network. Furthermore we show that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic e.g. *image of a blue car* - "blue" + "red" is near images of red cars. Sample captions generated for 800 images are made available for comparison.

1,412 citations

Posted Content•

Striving for Simplicity: The All Convolutional Net

Jost Tobias Springenberg¹, Alexey Dosovitskiy¹, Thomas Brox¹, Martin Riedmiller¹•Institutions (1)

University of Freiburg¹

21 Dec 2014-arXiv: Learning

TL;DR: In this paper, the authors re-evaluate the state-of-the-art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline.

Abstract: Most modern convolutional neural networks (CNNs) used for object recognition are built using the same principles: Alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks. Following this finding -- and building on other recent work for finding simple network structures -- we propose a new architecture that consists solely of convolutional layers and yields competitive or state of the art performance on several object recognition datasets (CIFAR-10, CIFAR-100, ImageNet). To analyze the network we introduce a new variant of the "deconvolution approach" for visualizing features learned by CNNs, which can be applied to a broader range of network structures than existing approaches.

1,008 citations

Posted Content•

Multiple Object Recognition with Visual Attention

Jimmy Ba¹, Volodymyr Mnih², Koray Kavukcuoglu²•Institutions (2)

University of Toronto¹, Google²

24 Dec 2014-arXiv: Learning

TL;DR: The model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image and it is shown that the model learns to both localize and recognize multiple objects despite being given only class labels during training.

Abstract: We present an attention-based model for recognizing multiple objects in images. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image. We show that the model learns to both localize and recognize multiple objects despite being given only class labels during training. We evaluate the model on the challenging task of transcribing house number sequences from Google Street View images and show that it is both more accurate than the state-of-the-art convolutional networks and uses fewer parameters and less computation.

940 citations

Posted Content•

Deep metric learning using Triplet network

Elad Hoffer¹, Nir Ailon¹•Institutions (1)

Technion – Israel Institute of Technology¹

20 Dec 2014-arXiv: Learning

TL;DR: In this paper, the authors propose the triplet network model, which aims to learn useful representations by distance comparisons, and demonstrate on various datasets that it learns a better representation than its immediate competitor, the Siamese network.

Abstract: Deep learning has proven itself as a successful set of models for learning useful semantic representations of data. These, however, are mostly implicitly learned as part of a classification task. In this paper we propose the triplet network model, which aims to learn useful representations by distance comparisons. A similar model was defined by Wang et al. (2014), tailor made for learning a ranking for image information retrieval. Here we demonstrate using various datasets that our model learns a better representation than that of its immediate competitor, the Siamese network. We also discuss future possible usage as a framework for unsupervised learning.
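
Learning "by distance comparisons" means the training signal comes from whether an anchor sits closer to a similar example than to a dissimilar one. The sketch below uses a margin-based hinge on Euclidean distances, which is a widely used triplet objective but is a substitute chosen here for brevity; the abstract does not state the paper's exact loss, and the function names and margin value are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero when the positive is closer than the negative by at least `margin`."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Well-separated triplet: the comparison is already satisfied.
loss_easy = triplet_margin_loss([0.0, 0.0], [0.1, 0.0], [3.0, 4.0])
# Swapped roles: the "positive" is far away, so the loss is large.
loss_hard = triplet_margin_loss([0.0, 0.0], [3.0, 4.0], [0.1, 0.0])
```

In a triplet network the three inputs would pass through the same embedding network with shared weights before the distances are compared; here the vectors stand in for those embeddings.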

824 citations

Posted Content•

NICE: Non-linear Independent Components Estimation

Laurent Dinh¹, David Krueger¹, Yoshua Bengio²•Institutions (2)

Université de Montréal¹, Alcatel-Lucent²

30 Oct 2014-arXiv: Learning

TL;DR: Non-linear Independent Component Estimation (NICE) is a deep learning framework for modeling complex high-dimensional densities, based on the idea that a good representation is one in which the data has a distribution that is easy to model.

Abstract: We propose a deep learning framework for modeling complex high-dimensional densities called Non-linear Independent Component Estimation (NICE). It is based on the idea that a good representation is one in which the data has a distribution that is easy to model. For this purpose, a non-linear deterministic transformation of the data is learned that maps it to a latent space so as to make the transformed data conform to a factorized distribution, i.e., resulting in independent latent variables. We parametrize this transformation so that computing the Jacobian determinant and inverse transform is trivial, yet we maintain the ability to learn complex non-linear transformations, via a composition of simple building blocks, each based on a deep neural network. The training criterion is simply the exact log-likelihood, which is tractable. Unbiased ancestral sampling is also easy. We show that this approach yields good generative models on four image datasets and can be used for inpainting.
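
A transformation whose Jacobian determinant and inverse are both trivial can be built from an additive coupling step: keep one half of the input unchanged and shift the other half by a function of the first. The sketch below illustrates that idea under stated assumptions. `coupling_shift` is a hypothetical stand-in for a deep network, and the input is assumed to have even length.

```python
import math

def coupling_shift(x1):
    """Stand-in for a learned network m(x1); any function of x1 would do."""
    return [math.tanh(2.0 * a) + 0.5 * a * a for a in x1]

def forward(x):
    """y1 = x1, y2 = x2 + m(x1)."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    shift = coupling_shift(x1)
    return x1 + [b + s for b, s in zip(x2, shift)]

def inverse(y):
    """x1 = y1, x2 = y2 - m(y1); no need to invert m itself."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    shift = coupling_shift(y1)
    return y1 + [b - s for b, s in zip(y2, shift)]

x = [0.3, -1.2, 0.7, 2.0]
y = forward(x)
x_recovered = inverse(y)
```

The Jacobian of `forward` is triangular with ones on the diagonal, so its determinant is 1 regardless of how complicated `coupling_shift` is. Stacking several such steps, alternating which half is held fixed, lets every coordinate be transformed.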

Posted Content•

Towards Deep Neural Network Architectures Robust to Adversarial Examples

Shixiang Gu¹, Luca Rigazio²•Institutions (2)

Max Planck Society¹, Panasonic²

11 Dec 2014-arXiv: Learning

TL;DR: This paper proposes the Deep Contractive Network, a model with a new end-to-end training procedure that includes a smoothness penalty inspired by the contractive autoencoder (CAE), which increases network robustness to adversarial examples without a significant performance penalty.

Abstract: Recent work has shown deep neural networks (DNNs) to be highly susceptible to well-designed, small perturbations at the input layer, or so-called adversarial examples. Taking images as an example, such distortions are often imperceptible, but can result in 100% mis-classification for a state of the art DNN. We study the structure of adversarial examples and explore network topology, pre-processing and training strategies to improve the robustness of DNNs. We perform various experiments to assess the removability of adversarial examples by corrupting with additional noise and pre-processing with denoising autoencoders (DAEs). We find that DAEs can remove substantial amounts of the adversarial noise. However, when stacking the DAE with the original DNN, the resulting network can again be attacked by new adversarial examples with even smaller distortion. As a solution, we propose Deep Contractive Network, a model with a new end-to-end training procedure that includes a smoothness penalty inspired by the contractive autoencoder (CAE). This increases the network robustness to adversarial examples, without a significant performance penalty.

Posted Content•

SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives

Aaron Defazio¹, Francis Bach², Simon Lacoste-Julien²•Institutions (2)

Australian National University¹, French Institute for Research in Computer Science and Automation²

01 Jul 2014-arXiv: Learning

TL;DR: This work introduces a new optimisation method called SAGA, which improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser.

Abstract: In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA, SAGA supports non-strongly convex problems directly, and is adaptive to any inherent strong convexity of the problem. We give experimental results showing the effectiveness of our method.
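
As a rough illustration of an incremental gradient method of this kind, the sketch below applies the usual form of the SAGA step to a one-dimensional least-squares problem: keep a table of the last gradient seen for each data point, and correct each fresh stochastic gradient with the stored value and the table average. The problem data, the function name, and the step size of 1/(3L) are assumptions made for the example, and the proximal step for composite objectives is omitted.

```python
import random

def saga_least_squares(a, b, steps=5000, seed=0):
    """Minimize the average of 0.5 * (a_i * x - b_i)^2 over i with SAGA-style steps."""
    n = len(a)
    rng = random.Random(seed)
    lipschitz = max(ai * ai for ai in a)  # smoothness constant of each term
    gamma = 1.0 / (3.0 * lipschitz)       # assumed step size
    x = 0.0
    # Table of the most recently evaluated gradient of each term.
    table = [ai * (ai * x - bi) for ai, bi in zip(a, b)]
    table_mean = sum(table) / n
    for _ in range(steps):
        j = rng.randrange(n)
        g_new = a[j] * (a[j] * x - b[j])
        # Variance-reduced step: new gradient, minus stored one, plus table average.
        x -= gamma * (g_new - table[j] + table_mean)
        # Update the running average and the stored gradient for term j.
        table_mean += (g_new - table[j]) / n
        table[j] = g_new
    return x

a = [1.0, 2.0, 0.5, 1.5]
b = [2.0, 3.9, 1.1, 3.0]
x_saga = saga_least_squares(a, b)
x_exact = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
```

With a regulariser present, the composite variant would apply that regulariser's proximal operator to `x` after each step.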

Posted Content•

FitNets: Hints for Thin Deep Nets

Adriana Romero¹, Nicolas Ballas², Samira Ebrahimi Kahou³, Antoine Chassang², Carlo Gatta, Yoshua Bengio²•Institutions (3)

University of Barcelona¹, Canadian Institute for Advanced Research², École Polytechnique³

19 Dec 2014-arXiv: Learning

TL;DR: In this article, the authors extend the knowledge distillation approach to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance.

Abstract: While depth tends to improve network performances, it also makes gradient-based training more difficult since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network could imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Because the student intermediate hidden layer will generally be smaller than the teacher's intermediate hidden layer, additional parameters are introduced to map the student hidden layer to the prediction of the teacher hidden layer. This allows one to train deeper students that can generalize better or run faster, a trade-off that is controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times less parameters outperforms a larger, state-of-the-art teacher network.
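
The role of the "additional parameters" can be sketched as a small regressor that maps the thinner student layer up to the teacher layer's size, with a squared-error penalty on the mismatch. The linear regressor, the function name, and the toy vectors below are assumptions for illustration; the abstract does not specify the form of the mapping or of the loss.

```python
def hint_loss(student_hidden, teacher_hidden, weights):
    """Squared error between the regressed student layer and the teacher layer."""
    # Each row of `weights` produces one predicted teacher unit.
    predicted = [sum(w * s for w, s in zip(row, student_hidden))
                 for row in weights]
    return 0.5 * sum((p - t) ** 2 for p, t in zip(predicted, teacher_hidden))

student_hidden = [1.0, 2.0]                        # 2 student units
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # maps 2 units to 3
loss_match = hint_loss(student_hidden, [1.0, 2.0, 3.0], weights)
loss_mismatch = hint_loss(student_hidden, [0.0, 2.0, 3.0], weights)
```

Minimizing such a loss over both the student parameters and the regressor weights would pull the student's intermediate representation toward something from which the teacher's can be predicted.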

Posted Content•

Neural Variational Inference and Learning in Belief Networks

Andriy Mnih¹, Karol Gregor¹•Institutions (1)

Google¹

31 Jan 2014-arXiv: Learning

TL;DR: This work proposes a fast non-iterative approximate inference method that uses a feedforward network to implement efficient exact sampling from the variational posterior, and reports state-of-the-art results on the Reuters RCV1 document dataset.

Abstract: Highly expressive directed latent variable models, such as sigmoid belief networks, are difficult to train on large datasets because exact inference in them is intractable and none of the approximate inference methods that have been applied to them scale well. We propose a fast non-iterative approximate inference method that uses a feedforward network to implement efficient exact sampling from the variational posterior. The model and this inference network are trained jointly by maximizing a variational lower bound on the log-likelihood. Although the naive estimator of the inference model gradient is too high-variance to be useful, we make it practical by applying several straightforward model-independent variance reduction techniques. Applying our approach to training sigmoid belief networks and deep autoregressive networks, we show that it outperforms the wake-sleep algorithm on MNIST and achieves state-of-the-art results on the Reuters RCV1 document dataset.

Posted Content•

Training deep neural networks with low precision multiplications

Matthieu Courbariaux, Yoshua Bengio, Jean-Pierre David

22 Dec 2014-arXiv: Learning

TL;DR: It is found that very low precision is sufficient not just for running trained networks but also for training them, and it is possible to train Maxout networks with 10 bits multiplications.

Abstract: Multipliers are the most space and power-hungry arithmetic operators of the digital implementation of deep neural networks. We train a set of state-of-the-art neural networks (Maxout networks) on three benchmark datasets: MNIST, CIFAR-10 and SVHN. They are trained with three distinct formats: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those formats, we assess the impact of the precision of the multiplications on the final error after training. We find that very low precision is sufficient not just for running trained networks but also for training them. For example, it is possible to train Maxout networks with 10 bits multiplications.
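
To give a feel for what a 10-bit fixed-point multiplication involves, the sketch below rounds each operand to a signed fixed-point grid before multiplying. The split into 5 fractional bits, the saturation behaviour, and the function names are assumptions for illustration; the abstract does not describe the exact formats or which quantities are kept at higher precision.

```python
def to_fixed_point(x, total_bits=10, frac_bits=5):
    """Round x to a signed fixed-point value with saturation."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    # Round to the nearest representable step, then clamp to the signed range.
    q = max(lowest, min(highest, round(x * scale)))
    return q / scale

def low_precision_multiply(a, b, total_bits=10, frac_bits=5):
    """Multiply two values after quantizing both operands."""
    return (to_fixed_point(a, total_bits, frac_bits)
            * to_fixed_point(b, total_bits, frac_bits))

small = to_fixed_point(0.3)        # snaps to the nearest multiple of 1/32
saturated = to_fixed_point(100.0)  # clamps at the top of the range
product = low_precision_multiply(0.3, 0.7)
```

A dynamic fixed-point scheme would additionally adjust `frac_bits` during training, for example per layer, instead of fixing it globally.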

Posted Content•

Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

Yann N. Dauphin¹, Razvan Pascanu¹, Caglar Gulcehre¹, Kyunghyun Cho¹, Surya Ganguli², Yoshua Bengio¹•Institutions (2)

Université de Montréal¹, Stanford University²

10 Jun 2014-arXiv: Learning

TL;DR: This paper proposes a new approach to second-order optimization, the saddle-free Newton method, that can rapidly escape high dimensional saddle points, unlike gradient descent and quasi-Newton methods, and applies this algorithm to deep or recurrent neural network training, and provides numerical evidence for its superior optimization performance.

Abstract: A central challenge to many fields of science and engineering involves minimizing non-convex error functions over continuous, high dimensional spaces. Gradient descent or quasi-Newton methods are almost ubiquitously used to perform such minimizations, and it is often thought that a main source of difficulty for these local methods to find the global minimum is the proliferation of local minima with much higher error than the global minimum. Here we argue, based on results from statistical physics, random matrix theory, neural network theory, and empirical evidence, that a deeper and more profound difficulty originates from the proliferation of saddle points, not local minima, especially in high dimensional problems of practical interest. Such saddle points are surrounded by high error plateaus that can dramatically slow down learning, and give the illusory impression of the existence of a local minimum. Motivated by these arguments, we propose a new approach to second-order optimization, the saddle-free Newton method, that can rapidly escape high dimensional saddle points, unlike gradient descent and quasi-Newton methods. We apply this algorithm to deep or recurrent neural network training, and provide numerical evidence for its superior optimization performance.
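
The difference between a plain Newton step and a saddle-free one can be seen on the toy function f(x, y) = x² − y², which has a saddle at the origin. The sketch assumes the saddle-free step rescales the gradient by the absolute values of the Hessian eigenvalues; the Hessian here is diagonal, so its eigenvalues are simply its diagonal entries. The function names and starting point are hypothetical.

```python
def newton_step(point, grad, hess_diag):
    """Plain Newton step for a diagonal Hessian: p - g / h."""
    return [p - g / h for p, g, h in zip(point, grad, hess_diag)]

def saddle_free_newton_step(point, grad, hess_diag):
    """Same step, but dividing by |h| so negative curvature is descended."""
    return [p - g / abs(h) for p, g, h in zip(point, grad, hess_diag)]

# f(x, y) = x^2 - y^2: gradient (2x, -2y), Hessian diag(2, -2).
point = [1.0, 0.5]
grad = [2.0 * point[0], -2.0 * point[1]]
hess_diag = [2.0, -2.0]

plain = newton_step(point, grad, hess_diag)
saddle_free = saddle_free_newton_step(point, grad, hess_diag)
```

The plain Newton step jumps straight to the saddle at (0, 0), while the saddle-free step still minimizes along x but moves away from the saddle along y, the direction of negative curvature.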

Posted Content•

Video (language) modeling: a baseline for generative models of natural videos.

Marc'Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, Sumit Chopra

20 Dec 2014-arXiv: Learning

TL;DR: For the first time, it is shown that a strong baseline model for unsupervised feature learning using video data can predict non-trivial motions over short video sequences.

Abstract: We propose a strong baseline model for unsupervised feature learning using video data. By learning to predict missing frames or extrapolate future frames from an input video sequence, the model discovers both spatial and temporal correlations which are useful to represent complex deformations and motion patterns. The models we propose are largely borrowed from the language modeling literature, and adapted to the vision domain by quantizing the space of image patches into a large dictionary. We demonstrate the approach on both a filling and a generation task. For the first time, we show that, after training on natural videos, such a model can predict non-trivial motions over short video sequences.

Posted Content•

Deep Unfolding: Model-Based Inspiration of Novel Deep Architectures

John R. Hershey, Jonathan Le Roux, Felix Weninger

09 Sep 2014-arXiv: Learning

TL;DR: This work starts with a model-based approach and an associated inference algorithm, and unfolds the inference iterations as layers in a deep network. It shows how this framework allows conventional networks to be interpreted as mean-field inference in Markov random fields, and how new architectures can be obtained by instead using belief propagation as the inference algorithm.

Abstract: Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and *unfold* the inference iterations as layers in a deep network. Rather than optimizing the original model, we *untie* the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.

Journal Article•DOI•

Guaranteed Matrix Completion via Non-convex Factorization

Ruoyu Sun¹, Zhi-Quan Luo¹•Institutions (1)

University of Minnesota¹

28 Nov 2014-arXiv: Learning

TL;DR: This paper establishes a theoretical guarantee that the factorization-based formulation correctly recovers the underlying low-rank matrix, and is the first to provide an exact recovery guarantee for many standard algorithms such as gradient descent, SGD and block coordinate gradient descent.

Abstract: Matrix factorization is a popular approach for large-scale matrix completion. The optimization formulation based on matrix factorization can be solved very efficiently by standard algorithms in practice. However, due to the non-convexity caused by the factorization model, there is a limited theoretical understanding of this formulation. In this paper, we establish a theoretical guarantee for the factorization formulation to correctly recover the underlying low-rank matrix. In particular, we show that under similar conditions to those in previous works, many standard optimization algorithms converge to the global optima of a factorization formulation, and recover the true low-rank matrix. We study the local geometry of a properly regularized factorization formulation and prove that any stationary point in a certain local region is globally optimal. A major difference of our work from the existing results is that we do not need resampling in either the algorithm or its analysis. Compared to other works on nonconvex optimization, one extra difficulty lies in analyzing nonconvex constrained optimization when the constraint (or the corresponding regularizer) is not "consistent" with the gradient direction. One technical contribution is the perturbation analysis for non-symmetric matrix factorization.

Journal Article•DOI•

OpenML: networked science in machine learning

Joaquin Vanschoren¹, Jan N. van Rijn², Bernd Bischl³, Luís Torgo⁴•Institutions (4)

Eindhoven University of Technology¹, Leiden University², Technical University of Dortmund³, University of Porto⁴

29 Jul 2014-arXiv: Learning

TL;DR: This paper introduces OpenML, a place for machine learning researchers to share and organize data in fine detail, so that they can work more effectively, be more visible, and collaborate with others to tackle harder problems.

Abstract: Many sciences have made significant breakthroughs by adopting online tools that help organize, structure and mine information that is too detailed to be printed in journals. In this paper, we introduce OpenML, a place for machine learning researchers to share and organize data in fine detail, so that they can work more effectively, be more visible, and collaborate with others to tackle harder problems. We discuss how OpenML relates to other examples of networked science and what benefits it brings for machine learning research, individual scientists, as well as students and practitioners.

Posted Content•

Differentially Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds

Raef Bassily, Adam Smith, Abhradeep Thakurta

27 May 2014-arXiv: Learning

TL;DR: This work provides new algorithms and matching lower bounds for differentially private convex empirical risk minimization assuming only that each data point's contribution to the loss function is Lipschitz and that the domain of optimization is bounded.

Abstract: In this paper, we initiate a systematic investigation of differentially private algorithms for convex empirical risk minimization. Various instantiations of this problem have been studied before. We provide new algorithms and matching lower bounds for private ERM assuming only that each data point's contribution to the loss function is Lipschitz bounded and that the domain of optimization is bounded. We provide a separate set of algorithms and matching lower bounds for the setting in which the loss functions are known to also be strongly convex. Our algorithms run in polynomial time, and in some cases even match the optimal non-private running time (as measured by oracle complexity). We give separate algorithms (and lower bounds) for $(\epsilon,0)$- and $(\epsilon,\delta)$-differential privacy; perhaps surprisingly, the techniques used for designing optimal algorithms in the two cases are completely different. Our lower bounds apply even to very simple, smooth function families, such as linear and quadratic functions. This implies that algorithms from previous work can be used to obtain optimal error rates, under the additional assumption that the contribution of each data point to the loss function is smooth. We show that simple approaches to smoothing arbitrary loss functions (in order to apply previous techniques) do not yield optimal error rates. In particular, optimal algorithms were not previously known for problems such as training support vector machines and the high-dimensional median.

Posted Content•

New insights and perspectives on the natural gradient method

James Martens

03 Dec 2014-arXiv: Learning

TL;DR: This paper critically analyzes the natural gradient method and its properties, and shows how it can be viewed as a type of approximate 2nd-order optimization method, where the Fisher information matrix can be viewed as an approximation of the Hessian.

Abstract: Natural gradient descent is an optimization method traditionally motivated from the perspective of information geometry, and works well for many applications as an alternative to stochastic gradient descent. In this paper we critically analyze this method and its properties, and show how it can be viewed as a type of 2nd-order optimization method, with the Fisher information matrix acting as a substitute for the Hessian. In many important cases, the Fisher information matrix is shown to be equivalent to the Generalized Gauss-Newton matrix, which not only approximates the Hessian but also has certain properties that favor its use over the Hessian. This perspective turns out to have significant implications for the design of a practical and robust natural gradient optimizer, as it motivates the use of techniques like trust regions and Tikhonov regularization. Additionally, we make a series of contributions to the understanding of natural gradient and 2nd-order methods, including: a thorough analysis of the convergence speed of stochastic natural gradient descent (and more general stochastic 2nd-order methods) as applied to convex quadratics, a critical examination of the oft-used "empirical" approximation of the Fisher matrix, and an analysis of the (approximate) parameterization invariance property possessed by natural gradient methods (which we show also holds for certain other curvature matrices, but notably not the Hessian).
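
For orientation, the update the abstract refers to can be written in its standard textbook form. The symbols below ($\theta$ for parameters, $\eta$ for step size, $L$ for the objective, $p_\theta$ for the model's predictive distribution, $\lambda$ for a damping constant) are notation assumed here for illustration and are not taken from the listing:

$$
\theta_{t+1} = \theta_t - \eta \, F(\theta_t)^{-1} \nabla_\theta L(\theta_t),
\qquad
F(\theta) = \mathbb{E}_{x,\; y \sim p_\theta(y \mid x)}\!\left[ \nabla_\theta \log p_\theta(y \mid x) \, \nabla_\theta \log p_\theta(y \mid x)^{\top} \right]
$$

Replacing $F(\theta_t)$ with the Hessian of $L$ gives a Newton step, which is the sense in which the Fisher matrix acts as a substitute for the Hessian. Tikhonov regularization, as mentioned in the abstract, is commonly realized by using $(F(\theta_t) + \lambda I)^{-1}$ in place of $F(\theta_t)^{-1}$.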

Posted Content•

Preserving Statistical Validity in Adaptive Data Analysis

Cynthia Dwork¹, Vitaly Feldman², Moritz Hardt², Toniann Pitassi³, Omer Reingold⁴, Aaron Roth⁵•Institutions (5)

Microsoft¹, IBM², University of Toronto³, Samsung⁴, University of Pennsylvania⁵

10 Nov 2014-arXiv: Learning

TL;DR: It is shown that, surprisingly, there is a way to estimate an exponential in n number of expectations accurately even if the functions are chosen adaptively, and this gives an exponential improvement over standard empirical estimators that are limited to a linear number of estimates.

Abstract: A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques, to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data is shared and reused with hypotheses and new analyses being generated on the basis of data exploration and the outcomes of previous analyses. In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of $m$ adaptively chosen functions on an unknown distribution given $n$ random samples. We show that, surprisingly, there is a way to estimate an exponential in $n$ number of expectations accurately even if the functions are chosen adaptively. This gives an exponential improvement over standard empirical estimators that are limited to a linear number of estimates. Our result follows from a general technique that counter-intuitively involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give additional applications of this technique to our question.

Posted Content•

An Information-Theoretic Analysis of Thompson Sampling

Daniel Russo¹, Benjamin Van Roy¹•Institutions (1)

Stanford University¹

21 Mar 2014-arXiv: Learning

TL;DR: This analysis inherits the simplicity and elegance of information theory and leads to regret bounds that scale with the entropy of the optimal-action distribution, which strengthens preexisting results and yields new insight into how information improves performance.

Abstract: We provide an information-theoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decision-maker must learn from partial feedback. This analysis inherits the simplicity and elegance of information theory and leads to regret bounds that scale with the entropy of the optimal-action distribution. This strengthens preexisting results and yields new insight into how information improves performance.

Journal Article•DOI•

Highly comparative feature-based time-series classification

Ben D. Fulcher¹, Nick S. Jones²•Institutions (2)

University of Oxford¹, Imperial College London²

15 Jan 2014-arXiv: Learning

TL;DR: In this paper, a highly comparative feature-based approach to time series classification is introduced that uses an extensive database of algorithms to extract thousands of interpretable features from time series and selects those features that are most informative of the class structure using greedy forward feature selection with a linear classifier.

Abstract: A highly comparative, feature-based approach to time series classification is introduced that uses an extensive database of algorithms to extract thousands of interpretable features from time series. These features are derived from across the scientific time-series analysis literature, and include summaries of time series in terms of their correlation structure, distribution, entropy, stationarity, scaling properties, and fits to a range of time-series models. After computing thousands of features for each time series in a training set, those that are most informative of the class structure are selected using greedy forward feature selection with a linear classifier. The resulting feature-based classifiers automatically learn the differences between classes using a reduced number of time-series properties, and circumvent the need to calculate distances between time series. Representing time series in this way results in orders of magnitude of dimensionality reduction, allowing the method to perform well on very large datasets containing long time series or time series of different lengths. For many of the datasets studied, classification performance exceeded that of conventional instance-based classifiers, including one-nearest-neighbor classifiers using Euclidean distances and dynamic time warping. Most importantly, the features selected provide an understanding of the properties of the dataset, insight that can guide further scientific investigation.
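
The selection step named in the abstract, greedy forward feature selection with a linear classifier, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the least-squares classifier, the use of training accuracy as the selection score, the synthetic data, and the helper names `linear_accuracy` and `greedy_forward_selection` are all assumptions made here.

```python
import numpy as np

def linear_accuracy(X, y):
    """Training accuracy of a least-squares linear classifier; labels are -1 or +1."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean(np.sign(A @ w) == y))

def greedy_forward_selection(X, y, n_select):
    """At each round, add the single feature that gives the best accuracy so far."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_select):
        scores = [linear_accuracy(X[:, selected + [j]], y) for j in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic stand-in for a feature matrix (rows: time series, columns: features).
# Only columns 3 and 7 carry class information in this toy setup.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = np.where(X[:, 3] + 0.5 * X[:, 7] > 0, 1.0, -1.0)
chosen = greedy_forward_selection(X, y, n_select=2)
```

In practice a held-out or cross-validated score would usually replace training accuracy, since selecting on training accuracy alone tends to overfit when thousands of candidate features are available.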

Posted Content•

Reinforcement and Imitation Learning via Interactive No-Regret Learning

Stephane Ross, J. Andrew Bagnell

23 Jun 2014-arXiv: Learning

TL;DR: This work develops an interactive imitation learning approach that leverages cost information and extends the technique to address reinforcement learning, suggesting a broad new family of algorithms and providing a unifying view of existing techniques for imitation and reinforcement learning.

Abstract: Recent work has demonstrated that problems (particularly imitation learning and structured prediction) where a learner's predictions influence the input distribution it is tested on can be naturally addressed by an interactive approach and analyzed using no-regret online learning. These approaches to imitation learning, however, neither require nor benefit from information about the cost of actions. We extend existing results in two directions: first, we develop an interactive imitation learning approach that leverages cost information; second, we extend the technique to address reinforcement learning. The results provide theoretical support to the commonly observed successes of online approximate policy iteration. Our approach suggests a broad new family of algorithms and provides a unifying view of existing techniques for imitation and reinforcement learning.

Posted Content•

Fast Convolutional Nets With fbfft: A GPU Performance Evaluation

Nicolas Vasilache¹, Jeff Johnson¹, Michael Mathieu¹, Soumith Chintala¹, Serkan Piantino¹, Yann LeCun¹•Institutions (1)

Facebook¹

24 Dec 2014-arXiv: Learning

TL;DR: In this article, two new Fast Fourier Transform (FFT) convolution implementations are introduced: one based on NVIDIA's cuFFT library, and another based on a Facebook-authored FFT implementation, fbfft.

Abstract: We examine the performance profile of Convolutional Neural Network training on the current generation of NVIDIA Graphics Processing Units. We introduce two new Fast Fourier Transform convolution implementations: one based on NVIDIA's cuFFT library, and another based on a Facebook authored FFT implementation, fbfft, that provides significant speedups over cuFFT (over 1.5x) for whole CNNs. Both of these convolution implementations are available in open source, and are faster than NVIDIA's cuDNN implementation for many common convolutional layers (up to 23.5x for some synthetic kernel configurations). We discuss different performance regimes of convolutions, comparing areas where straightforward time domain convolutions outperform Fourier frequency domain convolutions. Details on algorithmic applications of NVIDIA GPU hardware specifics in the implementation of fbfft are also provided.
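
The contrast between time-domain and Fourier-domain convolution can be illustrated with a small one-dimensional CPU sketch. This is not fbfft or cuFFT code; it only demonstrates the convolution theorem that such implementations rely on, using NumPy's FFT routines, and the function names `conv_direct` and `conv_fft` are hypothetical.

```python
import numpy as np

def conv_direct(x, k):
    """Full linear convolution computed directly in the time domain."""
    n, m = len(x), len(k)
    out = np.zeros(n + m - 1)
    for i in range(n):
        out[i:i + m] += x[i] * k    # each input sample adds a scaled copy of the kernel
    return out

def conv_fft(x, k):
    """The same convolution via a pointwise product in the frequency domain."""
    size = len(x) + len(k) - 1      # zero-pad so circular wrap-around cannot occur
    X = np.fft.rfft(x, n=size)
    K = np.fft.rfft(k, n=size)
    return np.fft.irfft(X * K, n=size)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)         # toy input signal
k = rng.standard_normal(9)          # toy kernel
max_abs_diff = np.max(np.abs(conv_direct(x, k) - conv_fft(x, k)))
```

The two routines agree up to floating-point error. The direct version costs on the order of `len(x) * len(k)` multiplications, while the FFT version costs on the order of `size * log(size)`, which is one reason small kernels can favor the time domain and large kernels the frequency domain.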

Posted Content•

Deep Fried Convnets

Zichao Yang¹, Marcin Moczulski², Misha Denil², Nando de Freitas³, Alexander J. Smola³, Le Song³, Ziyu Wang²•Institutions (3)

Carnegie Mellon University¹, University of Oxford², Georgia Institute of Technology³

22 Dec 2014-arXiv: Learning

TL;DR: In this article, a Fastfood layer is proposed to replace all fully connected layers in a deep convolutional neural network, which substantially reduces the memory footprint of CNNs trained on MNIST and ImageNet.

Abstract: The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.

Posted Content•

Differential Privacy and Machine Learning: a Survey and Review.

Zhanglong Ji, Zachary C. Lipton, Charles Elkan

24 Dec 2014-arXiv: Learning

TL;DR: This paper explores the interplay between machine learning and differential privacy, namely privacy-preserving machine learning algorithms and learning-based data release mechanisms, and describes some theoretical results that address what can be learned differentially privately and upper bounds of loss functions for differentially private algorithms.

Abstract: The objective of machine learning is to extract useful information from data, while privacy is preserved by concealing information. Thus it seems hard to reconcile these competing interests. However, they frequently must be balanced when mining sensitive data. For example, medical research represents an important application where it is necessary both to extract useful information and protect patient privacy. One way to resolve the conflict is to extract general characteristics of whole populations without disclosing the private information of individuals. In this paper, we consider differential privacy, one of the most popular and powerful definitions of privacy. We explore the interplay between machine learning and differential privacy, namely privacy-preserving machine learning algorithms and learning-based data release mechanisms. We also describe some theoretical results that address what can be learned differentially privately and upper bounds of loss functions for differentially private algorithms. Finally, we present some open questions, including how to incorporate public data, how to deal with missing data in private datasets, and whether, as the number of observed samples grows arbitrarily large, differentially private machine learning algorithms can be achieved at no cost to utility as compared to corresponding non-differentially private algorithms.
