Top 45 papers published by Yee Whye Teh from University of Oxford in 2019

Proceedings Article•

Augmented Neural ODEs

Emilien Dupont¹, Arnaud Doucet¹, Yee Whye Teh¹•Institutions (1)

University of Oxford¹

02 Apr 2019

TL;DR: Augmented Neural ODEs are introduced which, in addition to being more expressive models, are empirically more stable, generalize better and have a lower computational cost than Neural ODEs.

Abstract: We show that Neural Ordinary Differential Equations (ODEs) learn representations that preserve the topology of the input space and prove that this implies the existence of functions Neural ODEs cannot represent. To address these limitations, we introduce Augmented Neural ODEs which, in addition to being more expressive models, are empirically more stable, generalize better and have a lower computational cost than Neural ODEs.
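
The expressivity limitation can be illustrated with a toy problem. In one dimension, trajectories of an ODE cannot cross, so no flow can send -1 to 1 and 1 to -1 at the same time. The sketch below appends one zero coordinate to each input and integrates a vector field in the enlarged space, where the two trajectories can pass around each other. It is a minimal illustration, not the authors' implementation: the rotation field is hand-picked rather than learned, and the helper names `rk4_flow`, `augment` and `rotation_field` are hypothetical.

```python
import numpy as np

def rk4_flow(f, h0, t1=1.0, steps=200):
    """Integrate dh/dt = f(h) from t=0 to t=t1 with the classical RK4 scheme."""
    h = np.asarray(h0, dtype=float)
    dt = t1 / steps
    for _ in range(steps):
        k1 = f(h)
        k2 = f(h + 0.5 * dt * k1)
        k3 = f(h + 0.5 * dt * k2)
        k4 = f(h + dt * k3)
        h = h + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return h

def augment(x, extra_dims=1):
    """Append `extra_dims` zero coordinates to every input row."""
    x = np.atleast_2d(x)
    return np.concatenate([x, np.zeros((x.shape[0], extra_dims))], axis=1)

# Assumption: a fixed rotation at angular speed pi stands in for a learned network.
ROT = np.pi * np.array([[0.0, -1.0], [1.0, 0.0]])

def rotation_field(h):
    """Vector field dh/dt = ROT h, applied to a batch of row vectors."""
    return h @ ROT.T

x = np.array([[-1.0], [1.0]])            # two scalar inputs
h1 = rk4_flow(rotation_field, augment(x))
y = h1[:, 0]                             # read the output from the original coordinate
print(y)                                 # approximately [ 1. -1.]
```

After one unit of time the rotation has turned each augmented point by half a turn, so the first coordinate of each input has changed sign while the extra coordinate has returned to zero.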

359 citations

Proceedings Article•

Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks

Juho Lee¹, Yoonho Lee, Jungtaek Kim², Adam R. Kosiorek¹, Seungjin Choi², Yee Whye Teh¹•Institutions (2)

University of Oxford¹, Pohang University of Science and Technology²

24 May 2019

TL;DR: The Set Transformer is an attention-based neural network module designed to model interactions among elements of the input set; it consists of an encoder and a decoder, both of which rely on attention mechanisms.

Abstract: Many machine learning tasks such as multiple instance learning, 3D shape recognition, and few-shot image classification are defined on sets of instances. Since solutions to such problems do not depend on the order of elements of the set, models used to address them should be permutation invariant. We present an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set. The model consists of an encoder and a decoder, both of which rely on attention mechanisms. In an effort to reduce computational complexity, we introduce an attention scheme inspired by inducing point methods from sparse Gaussian process literature. It reduces the computation time of self-attention from quadratic to linear in the number of elements in the set. We show that our model is theoretically attractive and we evaluate it on a range of tasks, demonstrating the state-of-the-art performance compared to recent methods for set-structured data.
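
The two ideas in this abstract, attention routed through a small number of inducing points and a permutation-invariant output, can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the published architecture: it uses single-head attention without learned projections, residual connections or feed-forward layers, the inducing points and the pooling seed are random arrays rather than learned parameters, and the function names are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention; cost grows as len(q) * len(k)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def induced_attention(x, inducing):
    """Route attention through m inducing points: O(n * m) instead of O(n^2)."""
    h = attention(inducing, x, x)   # (m, d): inducing points summarise the set
    return attention(x, h, h)       # (n, d): every element reads the summary

def encode_set(x, inducing, seed):
    """Permutation-equivariant block followed by attention pooling with one seed vector."""
    y = induced_attention(x, inducing)
    return attention(seed, y, y)    # (1, d): one vector for the whole set

rng = np.random.default_rng(0)
n, m, d = 50, 4, 8
x = rng.normal(size=(n, d))
inducing = rng.normal(size=(m, d))  # assumption: learned parameters in a real model
seed = rng.normal(size=(1, d))      # assumption: likewise learned

out = encode_set(x, inducing, seed)
out_shuffled = encode_set(x[rng.permutation(n)], inducing, seed)
print(np.allclose(out, out_shuffled))
```

Shuffling the rows of `x` permutes the attention weights and the values consistently, so each weighted sum, and hence the pooled output, is unchanged up to floating-point rounding.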

250 citations

Proceedings Article•

Stacked Capsule Autoencoders

Adam R. Kosiorek¹, Sara Sabour², Yee Whye Teh¹, Geoffrey E. Hinton²•Institutions (2)

University of Oxford¹, Google²

17 Jun 2019

TL;DR: This work introduces an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects, and finds that object capsule presences are highly informative of the object class, which leads to state-of-the-art results for unsupervised classification on SVHN and MNIST.

Abstract: Objects are composed of a set of geometrically organized parts. We introduce an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects. Since these relationships do not depend on the viewpoint, our model is robust to viewpoint changes. SCAE consists of two stages. In the first stage, the model predicts presences and poses of part templates directly from the image and tries to reconstruct the image by appropriately arranging the templates. In the second stage, the SCAE predicts parameters of a few object capsules, which are then used to reconstruct part poses. Inference in this model is amortized and performed by off-the-shelf neural encoders, unlike in previous capsule networks. We find that object capsule presences are highly informative of the object class, which leads to state-of-the-art results for unsupervised classification on SVHN (55%) and MNIST (98.7%).

180 citations

Proceedings Article•

Continual Unsupervised Representation Learning

Dushyant Rao¹, Francesco Visin², Andrei Rusu³, Razvan Pascanu², Yee Whye Teh⁴, Raia Hadsell²•Institutions (4)

Carnegie Mellon University¹, Google², West University of Timișoara³, University of Oxford⁴

01 Jan 2019

TL;DR: The proposed approach (CURL) performs task inference directly within the model, is able to dynamically expand to capture new concepts over its lifetime, and incorporates additional rehearsal-based techniques to deal with catastrophic forgetting.

Abstract: Continual learning aims to improve the ability of modern learning systems to deal with non-stationary distributions, typically by attempting to learn a series of tasks sequentially. Prior art in the field has largely considered supervised or reinforcement learning tasks, and often assumes full knowledge of task labels and boundaries. In this work, we propose an approach (CURL) to tackle a more general problem that we will refer to as unsupervised continual learning. The focus is on learning representations without any knowledge about task identity, and we explore scenarios when there are abrupt changes between tasks, smooth transitions from one task to another, or even when the data is shuffled. The proposed approach performs task inference directly within the model, is able to dynamically expand to capture new concepts over its lifetime, and incorporates additional rehearsal-based techniques to deal with catastrophic forgetting. We demonstrate the efficacy of CURL in an unsupervised learning setting with MNIST and Omniglot, where the lack of labels ensures no information is leaked about the task. Further, we demonstrate strong performance compared to prior art in an i.i.d setting, or when adapting the technique to supervised tasks such as incremental class learning.

170 citations

Proceedings Article•

Attentive Neural Processes

Hyunjik Kim¹, Andriy Mnih¹, Jonathan Schwarz¹, Marta Garnelo¹, S. M. Ali Eslami¹, Dan Rosenbaum², Oriol Vinyals¹, Yee Whye Teh³•Institutions (3)

Google¹, Hebrew University of Jerusalem², University of Oxford³

01 Jan 2019

TL;DR: This paper incorporates attention into Neural Processes (NPs), allowing each input location to attend to the relevant context points for the prediction, which greatly improves the accuracy of predictions, results in noticeably faster training, and expands the range of functions that can be modelled.

Abstract: Neural Processes (NPs) (Garnelo et al 2018a;b) approach regression by learning to map a context set of observed input-output pairs to a distribution over regression functions. Each function models the distribution of the output given an input, conditioned on the context. NPs have the benefit of fitting observed data efficiently with linear complexity in the number of context input-output pairs, and can learn a wide family of conditional distributions; they learn predictive distributions conditioned on context sets of arbitrary size. Nonetheless, we show that NPs suffer a fundamental drawback of underfitting, giving inaccurate predictions at the inputs of the observed data they condition on. We address this issue by incorporating attention into NPs, allowing each input location to attend to the relevant context points for the prediction. We show that this greatly improves the accuracy of predictions, results in noticeably faster training, and expands the range of functions that can be modelled.

125 citations

Posted Content•

Attentive Neural Processes

Hyunjik Kim¹, Andriy Mnih¹, Jonathan Schwarz¹, Marta Garnelo¹, S. M. Ali Eslami¹, Dan Rosenbaum², Oriol Vinyals¹, Yee Whye Teh³•Institutions (3)

Google¹, Hebrew University of Jerusalem², University of Oxford³

17 Jan 2019-arXiv: Learning

TL;DR: Attention is incorporated into NPs, allowing each input location to attend to the relevant context points for the prediction, which greatly improves the accuracy of predictions, results in noticeably faster training, and expands the range of functions that can be modelled.

Abstract: Neural Processes (NPs) (Garnelo et al 2018a;b) approach regression by learning to map a context set of observed input-output pairs to a distribution over regression functions. Each function models the distribution of the output given an input, conditioned on the context. NPs have the benefit of fitting observed data efficiently with linear complexity in the number of context input-output pairs, and can learn a wide family of conditional distributions; they learn predictive distributions conditioned on context sets of arbitrary size. Nonetheless, we show that NPs suffer a fundamental drawback of underfitting, giving inaccurate predictions at the inputs of the observed data they condition on. We address this issue by incorporating attention into NPs, allowing each input location to attend to the relevant context points for the prediction. We show that this greatly improves the accuracy of predictions, results in noticeably faster training, and expands the range of functions that can be modelled.

117 citations

Proceedings Article•

Disentangling Disentanglement in Variational Autoencoders

Emile Mathieu¹, Tom Rainforth¹, N. Siddharth¹, Yee Whye Teh¹•Institutions (1)

University of Oxford¹

24 May 2019

TL;DR: This paper generalises disentanglement in VAEs to decomposition of the latent representation, characterised by two factors: the latent encodings of the data having an appropriate level of overlap, and the aggregate encoding of the data conforming to a desired structure, represented through the prior.

Abstract: We develop a generalisation of disentanglement in VAEs (decomposition of the latent representation), characterising it as the fulfilment of two factors: a) the latent encodings of the data having an appropriate level of overlap, and b) the aggregate encoding of the data conforming to a desired structure, represented through the prior. Decomposition permits disentanglement, i.e. explicit independence between latents, as a special case, but also allows for a much richer class of properties to be imposed on the learnt representation, such as sparsity, clustering, independent subspaces, or even intricate hierarchical dependency relationships. We show that the $\beta$-VAE varies from the standard VAE predominantly in its control of latent overlap and that for the standard choice of an isotropic Gaussian prior, its objective is invariant to rotations of the latent representation. Viewed from the decomposition perspective, breaking this invariance with simple manipulations of the prior can yield better disentanglement with little or no detriment to reconstructions. We further demonstrate how other choices of prior can assist in producing different decompositions and introduce an alternative training objective that allows the control of both decomposition factors in a principled manner.

109 citations

Posted Content•

Meta reinforcement learning as task inference

Jan Humplik, Alexandre Galashov, Leonard Hasenclever, Pedro A. Ortega, Yee Whye Teh, Nicolas Heess

15 May 2019-arXiv: Learning

TL;DR: This work proposes a method that separately learns the policy and the task belief by taking advantage of various kinds of privileged information, which can be very effective at solving standard meta-RL environments, as well as a complex continuous control environment with sparse rewards and requiring long-term memory.

Abstract: Humans achieve efficient learning by relying on prior knowledge about the structure of naturally occurring tasks. There is considerable interest in designing reinforcement learning (RL) algorithms with similar properties. This includes proposals to learn the learning algorithm itself, an idea also known as meta learning. One formal interpretation of this idea is as a partially observable multi-task RL problem in which task information is hidden from the agent. Such unknown task problems can be reduced to Markov decision processes (MDPs) by augmenting an agent's observations with an estimate of the belief about the task based on past experience. However estimating the belief state is intractable in most partially-observed MDPs. We propose a method that separately learns the policy and the task belief by taking advantage of various kinds of privileged information. Our approach can be very effective at solving standard meta-RL environments, as well as a complex continuous control environment with sparse rewards and requiring long-term memory.

86 citations

Posted Content•

Task Agnostic Continual Learning via Meta Learning

Xu He, Jakub Sygnowski, Alexandre Galashov, Andrei Rusu, Yee Whye Teh, Razvan Pascanu¹•Institutions (1)

Google¹

12 Jun 2019-arXiv: Machine Learning

TL;DR: This work proposes a framework for the scenario where no information about task boundaries or task identity is given; it relies on a separation of concerns into what task is being solved and how the task should be solved, and its focus on faster remembering opens the door to combining meta-learning and continual learning techniques, leveraging their complementary advantages.

Abstract: While neural networks are powerful function approximators, they suffer from catastrophic forgetting when the data distribution is not stationary. One particular formalism that studies learning under non-stationary distribution is provided by continual learning, where the non-stationarity is imposed by a sequence of distinct tasks. Most methods in this space assume, however, the knowledge of task boundaries, and focus on alleviating catastrophic forgetting. In this work, we depart from this view and move the focus towards faster remembering -- i.e measuring how quickly the network recovers performance rather than measuring the network's performance without any adaptation. We argue that in many settings this can be more effective and that it opens the door to combining meta-learning and continual learning techniques, leveraging their complementary advantages. We propose a framework specific for the scenario where no information about task boundaries or task identity is given. It relies on a separation of concerns into what task is being solved and how the task should be solved. This framework is implemented by differentiating task specific parameters from task agnostic parameters, where the latter are optimized in a continual meta learning fashion, without access to multiple tasks at the same time. We showcase this framework in a supervised learning scenario and discuss the implication of the proposed formalism.

82 citations

Posted Content•

Detecting Out-of-Distribution Inputs to Deep Generative Models Using a Test for Typicality

Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Balaji Lakshminarayanan

07 Jun 2019

TL;DR: This work proposes a statistically principled, easy-to-implement test using the empirical distribution of model likelihoods to determine whether or not inputs reside in the typical set.

Abstract: Recent work has shown that deep generative models can assign higher likelihood to out-of-distribution data sets than to their training data. We posit that this phenomenon is caused by a mismatch between the model's typical set and its areas of high probability density. In-distribution inputs should reside in the former but not necessarily in the latter, as previous work has presumed. To determine whether or not inputs reside in the typical set, we propose a statistically principled, easy-to-implement test using the empirical distribution of model likelihoods. The test is model agnostic and widely applicable, only requiring that the likelihood can be computed or closely approximated. We report experiments showing that our procedure can successfully detect the out-of-distribution sets in several of the challenging cases reported by Nalisnick et al. (2019).
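
The mismatch between high density and the typical set is easy to see for a high-dimensional Gaussian, where the mode has the largest density but samples almost never land near it. The sketch below is one plausible reading of a typicality test, written under explicit assumptions rather than taken from the paper: an exact standard-normal log-density stands in for a deep generative model, the entropy is estimated from training likelihoods, and the rejection threshold is a bootstrap quantile computed on held-out data. All function names, the batch size and the 0.99 quantile are illustrative choices.

```python
import numpy as np

def log_prob_std_normal(x):
    """Exact log-density of a d-dimensional standard Gaussian (stand-in for a deep model)."""
    d = x.shape[-1]
    return -0.5 * np.sum(x ** 2, axis=-1) - 0.5 * d * np.log(2.0 * np.pi)

def typicality_gap(batch, entropy_estimate, log_prob):
    """|average negative log-likelihood of the batch - estimated entropy|."""
    return abs(-np.mean(log_prob(batch)) - entropy_estimate)

def bootstrap_threshold(held_out, entropy_estimate, log_prob, batch_size, rng,
                        n_boot=1000, quantile=0.99):
    """Quantile of the gap over batches resampled from held-out in-distribution data."""
    gaps = [
        typicality_gap(held_out[rng.integers(0, len(held_out), batch_size)],
                       entropy_estimate, log_prob)
        for _ in range(n_boot)
    ]
    return float(np.quantile(gaps, quantile))

rng = np.random.default_rng(0)
d, batch_size = 100, 50
train = rng.normal(size=(5000, d))
held_out = rng.normal(size=(2000, d))

# Entropy estimate: average negative log-likelihood of the training data.
entropy_estimate = -np.mean(log_prob_std_normal(train))
threshold = bootstrap_threshold(held_out, entropy_estimate, log_prob_std_normal,
                                batch_size, rng)

# A batch sitting at the mode: highest possible density, yet far from the typical set.
mode_batch = np.zeros((batch_size, d))
mode_gap = typicality_gap(mode_batch, entropy_estimate, log_prob_std_normal)
is_ood = mode_gap > threshold
print(is_ood)
```

A likelihood threshold alone would accept `mode_batch`, since every point in it is more likely than any training sample. The gap between its average negative log-likelihood and the entropy estimate is what marks it as atypical.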

82 citations

Posted Content•

Continuous Hierarchical Representations with Poincaré Variational Auto-Encoders

Emile Mathieu¹, Charline Le Lan¹, Chris J. Maddison¹, Ryota Tomioka², Yee Whye Teh¹•Institutions (2)

University of Oxford¹, Microsoft²

17 Jan 2019-arXiv: Machine Learning

TL;DR: The authors endow VAEs with a Poincaré ball model of hyperbolic geometry as a latent space and rigorously derive the necessary methods to work with two main Gaussian generalisations on that space.

Abstract: The variational auto-encoder (VAE) is a popular method for learning a generative model and embeddings of the data. Many real datasets are hierarchically structured. However, traditional VAEs map data in a Euclidean latent space which cannot efficiently embed tree-like structures. Hyperbolic spaces with negative curvature can. We therefore endow VAEs with a Poincaré ball model of hyperbolic geometry as a latent space and rigorously derive the necessary methods to work with two main Gaussian generalisations on that space. We empirically show better generalisation to unseen data than the Euclidean counterpart, and can qualitatively and quantitatively better recover hierarchical structures.

Proceedings Article•

Information asymmetry in KL-regularized RL

Alexandre Galashov¹, Siddhant M. Jayakumar, Leonard Hasenclever², Dhruva Tirumala¹, Jonathan Schwarz¹, Guillaume Desjardins¹, Wojciech Marian Czarnecki³, Yee Whye Teh², Razvan Pascanu¹, Nicolas Heess¹•Institutions (3)

Google¹, University of Oxford², Jagiellonian University³

01 Jan 2019

TL;DR: Starting from the KL-regularized expected reward objective, which introduces a default policy, this work learns the default policy from data but crucially restricts the amount of information it receives, forcing it to learn reusable behaviors that help the policy learn faster.

Abstract: Many real world tasks exhibit rich structure that is repeated across different parts of the state space or in time. In this work we study the possibility of leveraging such repeated structure to speed up and regularize learning. We start from the KL regularized expected reward objective which introduces an additional component, a default policy. Instead of relying on a fixed default policy, we learn it from data. But crucially, we restrict the amount of information the default policy receives, forcing it to learn reusable behaviors that help the policy learn faster. We formalize this strategy and discuss connections to information bottleneck approaches and to the variational EM algorithm. We present empirical results in both discrete and continuous action domains and demonstrate that, for certain tasks, learning a default policy alongside the policy can significantly speed up and improve learning.
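
As a rough illustration of the objective the abstract starts from, the sketch below evaluates a discounted return in which each step is penalised by the KL divergence between the agent's action distribution and that of a default policy. The exact weighting used in the paper is not given in the abstract, so the form `sum_t gamma^t * (r_t - alpha * KL)` and the parameters `alpha` and `gamma` are assumptions, and both policies are fixed probability tables here rather than learned networks with different inputs.

```python
import numpy as np

def kl_categorical(p, q):
    """KL(p || q) for categorical distributions given as probability vectors (q > 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

def kl_regularized_return(rewards, policy_probs, default_probs, alpha=0.1, gamma=0.99):
    """Sum over t of gamma^t * (r_t - alpha * KL(pi(.|x_t) || pi_0(.|x_t)))."""
    total = 0.0
    for t, (r, p, q) in enumerate(zip(rewards, policy_probs, default_probs)):
        total += gamma ** t * (r - alpha * kl_categorical(p, q))
    return total

rewards = [0.0, 0.0, 1.0]
policy = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5]]   # agent's action distribution per step
uniform_default = [[0.5, 0.5]] * 3              # an uninformed default policy

plain_return = kl_regularized_return(rewards, policy, policy)        # KL terms vanish
penalised = kl_regularized_return(rewards, policy, uniform_default)  # pays for deviating
print(plain_return, penalised)
```

When the default policy matches the agent exactly the penalty is zero, so a default policy that captures behaviour shared across tasks makes that behaviour cheap, while one that sees less information can only match what is predictable from its restricted input.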

Posted Content•

Detecting Out-of-Distribution Inputs to Deep Generative Models Using Typicality

Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Balaji Lakshminarayanan¹•Institutions (1)

Google¹

25 Sep 2019-arXiv: Machine Learning

TL;DR: This work proposes a statistically principled, easy-to-implement test that uses the empirical distribution of model likelihoods to determine whether or not inputs reside in the typical set; it only requires that the likelihood can be computed or closely approximated.

Abstract: Recent work has shown that deep generative models can assign higher likelihood to out-of-distribution data sets than to their training data (Nalisnick et al., 2019; Choi et al., 2019). We posit that this phenomenon is caused by a mismatch between the model's typical set and its areas of high probability density. In-distribution inputs should reside in the former but not necessarily in the latter, as previous work has presumed. To determine whether or not inputs reside in the typical set, we propose a statistically principled, easy-to-implement test using the empirical distribution of model likelihoods. The test is model agnostic and widely applicable, only requiring that the likelihood can be computed or closely approximated. We report experiments showing that our procedure can successfully detect the out-of-distribution sets in several of the challenging cases reported by Nalisnick et al. (2019).

Posted Content•

Meta-learning of Sequential Strategies

08 May 2019-arXiv: Learning

TL;DR: This report recasts memory-based meta-learning within a Bayesian framework, showing that the meta-learned strategies are near-optimal because they amortize Bayes-filtered data, where the adaptation is implemented in the memory dynamics as a state-machine of sufficient statistics.

Abstract: In this report we review memory-based meta-learning as a tool for building sample-efficient strategies that learn from past experience to adapt to any task within a target class. Our goal is to equip the reader with the conceptual foundations of this tool for building new, scalable agents that operate on broad domains. To do so, we present basic algorithmic templates for building near-optimal predictors and reinforcement learners which behave as if they had a probabilistic model that allowed them to efficiently exploit task structure. Furthermore, we recast memory-based meta-learning within a Bayesian framework, showing that the meta-learned strategies are near-optimal because they amortize Bayes-filtered data, where the adaptation is implemented in the memory dynamics as a state-machine of sufficient statistics. Essentially, memory-based meta-learning translates the hard problem of probabilistic sequential inference into a regression problem.

Posted Content•

Augmented Neural ODEs

Emilien Dupont¹, Arnaud Doucet¹, Yee Whye Teh¹•Institutions (1)

University of Oxford¹

02 Apr 2019-arXiv: Machine Learning

TL;DR: This work shows that Neural ODEs learn representations that preserve the topology of the input space, which implies the existence of functions they cannot represent, and introduces Augmented Neural ODEs, which are more expressive, empirically more stable, generalize better and have a lower computational cost than Neural ODEs.

Abstract: We show that Neural Ordinary Differential Equations (ODEs) learn representations that preserve the topology of the input space and prove that this implies the existence of functions Neural ODEs cannot represent. To address these limitations, we introduce Augmented Neural ODEs which, in addition to being more expressive models, are empirically more stable, generalize better and have a lower computational cost than Neural ODEs.

Proceedings Article•

Continuous Hierarchical Representations with Poincaré Variational Auto-Encoders

Emile Mathieu¹, Charline Le Lan¹, Chris J. Maddison¹, Ryota Tomioka², Yee Whye Teh¹•Institutions (2)

University of Oxford¹, Microsoft²

01 Jan 2019

TL;DR: This work endows VAEs with a Poincaré ball model of hyperbolic geometry as a latent space and rigorously derives the necessary methods to work with two main Gaussian generalisations on that space.

Abstract: The Variational Auto-Encoder (VAE) is a popular method for learning a generative model and embeddings of the data. Many real datasets are hierarchically structured. However, traditional VAEs map data in a Euclidean latent space which cannot efficiently embed tree-like structures. Hyperbolic spaces with negative curvature can. We therefore endow VAEs with a Poincaré ball model of hyperbolic geometry as a latent space and rigorously derive the necessary methods to work with two main Gaussian generalisations on that space. We empirically show better generalisation to unseen data than the Euclidean counterpart, and can qualitatively and quantitatively better recover hierarchical structures.

Posted Content•

Variational Bayesian Optimal Experimental Design

Adam Foster¹, Martin Jankowiak², Eli Bingham², Paul Horsfall³, Yee Whye Teh¹, Tom Rainforth¹, Noah D. Goodman³•Institutions (3)

University of Oxford¹, Uber², Stanford University³

13 Mar 2019-arXiv: Machine Learning

TL;DR: This work introduces several classes of fast EIG estimators by building on ideas from amortized variational inference, and shows theoretically and empirically that these estimators can provide significant gains in speed and accuracy over previous approaches.

Abstract: Bayesian optimal experimental design (BOED) is a principled framework for making efficient use of limited experimental resources. Unfortunately, its applicability is hampered by the difficulty of obtaining accurate estimates of the expected information gain (EIG) of an experiment. To address this, we introduce several classes of fast EIG estimators by building on ideas from amortized variational inference. We show theoretically and empirically that these estimators can provide significant gains in speed and accuracy over previous approaches. We further demonstrate the practicality of our approach on a number of end-to-end experiments.
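
To make the quantity concrete, the sketch below estimates the EIG of a toy linear-Gaussian experiment with an estimator in the variational spirit described above: the intractable marginal `p(y | d)` is replaced by a fitted density `q(y)`, and the expectation of `log p(y | theta, d) - log q(y)` is averaged over simulated data. In expectation this quantity upper-bounds the EIG for any `q` and is tight when `q` equals the true marginal. The model, the Gaussian family for `q`, and the function names are assumptions chosen so that a closed form is available for comparison; this is not claimed to be any specific estimator from the paper.

```python
import numpy as np

def gaussian_log_pdf(y, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - mean) ** 2 / var

def eig_variational_marginal(design, noise_std=1.0, n_samples=20000, seed=0):
    """Estimate EIG for y = design * theta + noise, with theta ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=n_samples)
    y = design * theta + noise_std * rng.normal(size=n_samples)
    # Maximum-likelihood Gaussian fit of q(y); this maximises E[log q(y)] in that family.
    q_mean, q_var = y.mean(), y.var()
    log_lik = gaussian_log_pdf(y, design * theta, noise_std ** 2)
    log_marg = gaussian_log_pdf(y, q_mean, q_var)
    return float(np.mean(log_lik - log_marg))

def eig_exact(design, noise_std=1.0):
    """Closed form for this toy model: 0.5 * log(1 + design^2 / noise_std^2)."""
    return 0.5 * np.log(1.0 + design ** 2 / noise_std ** 2)

designs = [0.5, 1.0, 2.0]
estimates = [eig_variational_marginal(d) for d in designs]
best_design = designs[int(np.argmax(estimates))]
print(best_design)
```

Because the true marginal is Gaussian in this toy model, the fitted `q` can match it and the estimate lands close to the closed form. With a misspecified `q` the same estimator would be biased upwards.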

Proceedings Article•

Hybrid Models with Deep and Invertible Features

Eric Nalisnick¹, Akihiro Matsukawa², Yee Whye Teh³, Dilan Gorur⁴, Balaji Lakshminarayanan⁴•Institutions (4)

University of Cambridge¹, University of California, Berkeley², University of Oxford³, Google⁴

24 May 2019

TL;DR: The hybrid model, despite the invertibility constraints, achieves similar accuracy to purely predictive models, and the generative component remains a good model of the input features despite the hybrid optimization objective.

Abstract: We propose a neural hybrid model consisting of a linear model defined on a set of features computed by a deep, invertible transformation (i.e. a normalizing flow). An attractive property of our model is that both p(features), the density of the features, and p(targets | features), the predictive distribution, can be computed exactly in a single feed-forward pass. We show that our hybrid model, despite the invertibility constraints, achieves similar accuracy to purely predictive models. Moreover the generative component remains a good model of the input features despite the hybrid optimization objective. This offers additional capabilities such as detection of out-of-distribution inputs and enabling semi-supervised learning. The availability of the exact joint density p(targets, features) also allows us to compute many quantities readily, making our hybrid model a useful building block for downstream applications of probabilistic deep learning.

Posted Content•

Hybrid Models with Deep and Invertible Features

Eric Nalisnick¹, Akihiro Matsukawa², Yee Whye Teh³, Dilan Gorur⁴, Balaji Lakshminarayanan⁴•Institutions (4)

University of Cambridge¹, University of California, Berkeley², University of Oxford³, Google⁴

07 Feb 2019-arXiv: Learning

TL;DR: This article proposes a neural hybrid model consisting of a linear model defined on a set of features computed by a deep, invertible transformation (i.e., a normalizing flow).

Abstract: We propose a neural hybrid model consisting of a linear model defined on a set of features computed by a deep, invertible transformation (i.e. a normalizing flow). An attractive property of our model is that both p(features), the density of the features, and p(targets | features), the predictive distribution, can be computed exactly in a single feed-forward pass. We show that our hybrid model, despite the invertibility constraints, achieves similar accuracy to purely predictive models. Moreover the generative component remains a good model of the input features despite the hybrid optimization objective. This offers additional capabilities such as detection of out-of-distribution inputs and enabling semi-supervised learning. The availability of the exact joint density p(targets, features) also allows us to compute many quantities readily, making our hybrid model a useful building block for downstream applications of probabilistic deep learning.

Posted Content•

Stacked Capsule Autoencoders

Adam R. Kosiorek¹, Sara Sabour², Yee Whye Teh¹, Geoffrey E. Hinton²•Institutions (2)

University of Oxford¹, Google²

17 Jun 2019-arXiv: Machine Learning

TL;DR: This work introduces an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects; because these relationships do not depend on the viewpoint, the model is robust to viewpoint changes.

Abstract: Objects are composed of a set of geometrically organized parts. We introduce an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects. Since these relationships do not depend on the viewpoint, our model is robust to viewpoint changes. SCAE consists of two stages. In the first stage, the model predicts presences and poses of part templates directly from the image and tries to reconstruct the image by appropriately arranging the templates. In the second stage, SCAE predicts parameters of a few object capsules, which are then used to reconstruct part poses. Inference in this model is amortized and performed by off-the-shelf neural encoders, unlike in previous capsule networks. We find that object capsule presences are highly informative of the object class, which leads to state-of-the-art results for unsupervised classification on SVHN (55%) and MNIST (98.7%). The code is available at this https URL

Posted Content•

Probabilistic symmetries and invariant neural networks

Benjamin Bloem-Reddy¹, Yee Whye Teh²•Institutions (2)

University of British Columbia¹, University of Oxford²

18 Jan 2019-arXiv: Machine Learning

TL;DR: In this paper, the authors consider group invariance from the perspective of probabilistic symmetry, and obtain generative functional representations of probability distributions that are invariant or equivariant under the action of a compact group.

Abstract: Treating neural network inputs and outputs as random variables, we characterize the structure of neural networks that can be used to model data that are invariant or equivariant under the action of a compact group. Much recent research has been devoted to encoding invariance under symmetry transformations into neural network architectures, in an effort to improve the performance of deep neural networks in data-scarce, non-i.i.d., or unsupervised settings. By considering group invariance from the perspective of probabilistic symmetry, we establish a link between functional and probabilistic symmetry, and obtain generative functional representations of probability distributions that are invariant or equivariant under the action of a compact group. Our representations completely characterize the structure of neural networks that can be used to model such distributions and yield a general program for constructing invariant stochastic or deterministic neural networks. We demonstrate that examples from the recent literature are special cases, and develop the details of the general program for exchangeable sequences and arrays.

Posted Content•

Functional Regularisation for Continual Learning with Gaussian Processes

Michalis K. Titsias¹, Jonathan Schwarz¹, Alexander G. de G. Matthews¹, Razvan Pascanu¹, Yee Whye Teh¹•Institutions (1)

Google¹

31 Jan 2019-arXiv: Machine Learning

TL;DR: This work introduces functional regularisation for Continual Learning (CL), which avoids forgetting a previous task by constructing and memorising an approximate posterior belief over the underlying task-specific function, using a Gaussian process obtained by treating the last-layer weights of a neural network as random and Gaussian distributed.

Abstract: We introduce a framework for Continual Learning (CL) based on Bayesian inference over the function space rather than the parameters of a deep neural network. This method, referred to as functional regularisation for Continual Learning, avoids forgetting a previous task by constructing and memorising an approximate posterior belief over the underlying task-specific function. To achieve this we rely on a Gaussian process obtained by treating the weights of the last layer of a neural network as random and Gaussian distributed. Then, the training algorithm sequentially encounters tasks and constructs posterior beliefs over the task-specific functions by using inducing point sparse Gaussian process methods. At each step a new task is first learnt and then a summary is constructed consisting of (i) inducing inputs -- a fixed-size subset of the task inputs selected such that it optimally represents the task -- and (ii) a posterior distribution over the function values at these inputs. This summary then regularises learning of future tasks, through Kullback-Leibler regularisation terms. Our method thus unites approaches focused on (pseudo-)rehearsal with those derived from a sequential Bayesian inference perspective in a principled way, leading to strong results on accepted benchmarks.

Posted Content•

Exploiting Hierarchy for Learning and Transfer in KL-regularized RL

Dhruva Tirumala, Hyeonwoo Noh, Alexandre Galashov, Leonard Hasenclever, Arun Ahuja, Greg Wayne, Razvan Pascanu, Yee Whye Teh, Nicolas Heess

18 Mar 2019-arXiv: Learning

TL;DR: This work considers the implications of the KL-regularized expected reward objective framework in cases where both the policy and default behavior are augmented with latent variables and discusses how the resulting hierarchical structures can be used to implement different inductive biases and how their modularity can benefit transfer.

Abstract: As reinforcement learning agents are tasked with solving more challenging and diverse tasks, the ability to incorporate prior knowledge into the learning system and to exploit reusable structure in solution space is likely to become increasingly important. The KL-regularized expected reward objective constitutes one possible tool to this end. It introduces an additional component, a default or prior behavior, which can be learned alongside the policy and as such partially transforms the reinforcement learning problem into one of behavior modelling. In this work we consider the implications of this framework in cases where both the policy and default behavior are augmented with latent variables. We discuss how the resulting hierarchical structures can be used to implement different inductive biases and how their modularity can benefit transfer. Empirically we find that they can lead to faster learning and transfer on a range of continuous control tasks.
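
For orientation, one common way of writing a KL-regularized expected reward objective is sketched below. The notation is an assumption for illustration and is not taken from this listing: $\pi$ is the policy, $\pi_0$ the default (prior) behavior, $\gamma$ a discount factor, $\alpha$ a regularisation weight, and $x_t$ whatever information is available to the agent at time $t$.

```latex
\mathcal{L}(\pi, \pi_0)
  = \mathbb{E}_{\tau \sim \pi}\left[
      \sum_{t \ge 0} \gamma^{t} \Big(
        r(s_t, a_t)
        - \alpha \, \mathrm{KL}\big( \pi(\cdot \mid x_t) \,\|\, \pi_0(\cdot \mid x_t) \big)
      \Big)
    \right]
```

In this form the KL term penalises the policy for deviating from the default behavior, which is why learning $\pi_0$ alongside $\pi$ resembles a behavior-modelling problem.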

Proceedings Article•

Variational Bayesian Optimal Experimental Design

Adam Foster¹, Martin Jankowiak², Eli Bingham², Paul Horsfall³, Yee Whye Teh¹, Tom Rainforth¹, Noah D. Goodman³•Institutions (3)

University of Oxford¹, Uber², Stanford University³

01 Jan 2019

TL;DR: In this paper, the authors introduce several classes of fast EIG estimators by building on ideas from amortized variational inference, which can provide significant gains in speed and accuracy over previous approaches.

Abstract: Bayesian optimal experimental design (BOED) is a principled framework for making efficient use of limited experimental resources. Unfortunately, its applicability is hampered by the difficulty of obtaining accurate estimates of the expected information gain (EIG) of an experiment. To address this, we introduce several classes of fast EIG estimators by building on ideas from amortized variational inference. We show theoretically and empirically that these estimators can provide significant gains in speed and accuracy over previous approaches. We further demonstrate the practicality of our approach on a number of end-to-end experiments.
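
As a reminder of the quantity being estimated, the expected information gain of a design is usually defined as the expected reduction in uncertainty about the parameters. The symbols below are assumed for illustration ($d$ for the design, $\theta$ for the latent parameters, $y$ for the experimental outcome) and do not come from this listing.

```latex
\mathrm{EIG}(d)
  = \mathbb{E}_{p(\theta)\, p(y \mid \theta, d)}
    \big[ \log p(y \mid \theta, d) - \log p(y \mid d) \big],
\qquad
p(y \mid d) = \mathbb{E}_{p(\theta)} \big[ p(y \mid \theta, d) \big]
```

The marginal $p(y \mid d)$ is typically intractable, so the second term requires an inner estimate for every sampled outcome; this nesting is the usual source of the estimation difficulty the abstract refers to.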

Proceedings Article•

Revisiting Reweighted Wake-Sleep for Models with Stochastic Control Flow.

Tuan Anh Le¹, Adam R. Kosiorek², N. Siddharth², Yee Whye Teh², Frank Wood³•Institutions (3)

Massachusetts Institute of Technology¹, University of Oxford², University of British Columbia³

01 Jan 2019

TL;DR: In this article, the authors revisit the reweighted wake-sleep (RWS) algorithm of Bornschein and Bengio (2015) and, through extensive evaluations, show that it outperforms current state-of-the-art methods in learning stochastic control-flow models.

Abstract: Stochastic control-flow models (SCFMs) are a class of generative models that involve branching on choices from discrete random variables. Amortized gradient-based learning of SCFMs is challenging as most approaches targeting discrete variables rely on their continuous relaxations, which can be intractable in SCFMs, as branching on relaxations requires evaluating all (exponentially many) branching paths. Tractable alternatives mainly combine REINFORCE with complex control-variate schemes to reduce the variance of naive estimators. Here, we revisit the reweighted wake-sleep (RWS) algorithm (Bornschein and Bengio, 2015), and through extensive evaluations, show that it outperforms current state-of-the-art methods in learning SCFMs. Further, in contrast to the importance weighted autoencoder, we observe that RWS learns better models and inference networks with increasing numbers of particles. Our results suggest that RWS is a competitive, often preferable, alternative for learning SCFMs.

Posted Content•

A Unified Stochastic Gradient Approach to Designing Bayesian-Optimal Experiments

Adam Foster¹, Martin Jankowiak², Matthew J. O’Meara³, Yee Whye Teh¹, Tom Rainforth¹•Institutions (3)

University of Oxford¹, Broad Institute², University of Michigan³

01 Nov 2019-arXiv: Machine Learning

TL;DR: This work introduces a fully stochastic gradient based approach to Bayesian optimal experimental design (BOED) that utilizes variational lower bounds on the expected information gain of an experiment that can be simultaneously optimized with respect to both the variational and design parameters.

Abstract: We introduce a fully stochastic gradient based approach to Bayesian optimal experimental design (BOED). Our approach utilizes variational lower bounds on the expected information gain (EIG) of an experiment that can be simultaneously optimized with respect to both the variational and design parameters. This allows the design process to be carried out through a single unified stochastic gradient ascent procedure, in contrast to existing approaches that typically construct a pointwise EIG estimator, before passing this estimator to a separate optimizer. We provide a number of different variational objectives including the novel adaptive contrastive estimation (ACE) bound. Finally, we show that our gradient-based approaches are able to provide effective design optimization in substantially higher dimensional settings than existing approaches.

Hierarchical Representations with Poincaré Variational Auto-Encoders.

Emile Mathieu, Charline Le Lan, Chris J. Maddison, Ryota Tomioka, Yee Whye Teh

01 Jan 2019

TL;DR: The authors endow VAE with a Poincaré ball model of hyperbolic geometry and derive the necessary methods to work with two main Gaussian generalisations on that space.

Abstract: The Variational Auto-Encoder (VAE) model is a popular method to learn at once a generative model and embeddings for data living in a high-dimensional space. In the real world, many datasets may be assumed to be hierarchically structured. Traditionally, VAE uses a Euclidean latent space, but tree-like structures cannot be efficiently embedded in such spaces as opposed to hyperbolic spaces with negative curvature. We therefore endow VAE with a Poincaré ball model of hyperbolic geometry and derive the necessary methods to work with two main Gaussian generalisations on that space. We empirically show better generalisation to unseen data than the Euclidean counterpart, and can qualitatively and quantitatively better recover hierarchical structures.
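
To give a feel for why the Poincaré ball suits tree-like data, the sketch below computes the standard geodesic distance on the unit ball. It assumes curvature -1 and plain Python lists as points; the function name `poincare_distance` is hypothetical and this is not the authors' code. Distances grow rapidly as points approach the boundary, which leaves room for exponentially branching hierarchies.

```python
import math


def poincare_distance(x, y):
    """Geodesic distance between two points of the unit Poincaré ball.

    Assumes curvature -1; x and y are equal-length sequences of floats
    with Euclidean norm strictly below 1.
    """
    def sq_norm(v):
        return sum(c * c for c in v)

    nx, ny = sq_norm(x), sq_norm(y)
    if nx >= 1.0 or ny >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")

    diff = [a - b for a, b in zip(x, y)]
    # d(x, y) = arcosh(1 + 2 * |x - y|^2 / ((1 - |x|^2) * (1 - |y|^2)))
    arg = 1.0 + 2.0 * sq_norm(diff) / ((1.0 - nx) * (1.0 - ny))
    return math.acosh(arg)


if __name__ == "__main__":
    origin = [0.0, 0.0]
    # From the origin the distance is 2 * artanh(r), so r = 0.5 gives log(3).
    print(poincare_distance(origin, [0.5, 0.0]))
    # The same Euclidean step costs far more near the boundary.
    print(poincare_distance([0.90, 0.0], [0.99, 0.0]))
```

A learned curvature parameter, as used in hyperbolic VAEs, would rescale this formula; the fixed-curvature version is kept here for brevity.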

Posted Content•

Information asymmetry in KL-regularized RL

Alexandre Galashov¹, Siddhant M. Jayakumar, Leonard Hasenclever², Dhruva Tirumala¹, Jonathan Schwarz¹, Guillaume Desjardins¹, Wojciech Marian Czarnecki³, Yee Whye Teh², Razvan Pascanu¹, Nicolas Heess¹•Institutions (3)

Google¹, University of Oxford², Jagiellonian University³

03 May 2019-arXiv: Learning

TL;DR: Starting from the KL-regularized expected reward objective, which introduces a default policy, this paper learns the default policy from data while restricting the information it receives, forcing it to learn reusable behaviors that help the policy learn faster.

Abstract: Many real world tasks exhibit rich structure that is repeated across different parts of the state space or in time. In this work we study the possibility of leveraging such repeated structure to speed up and regularize learning. We start from the KL regularized expected reward objective which introduces an additional component, a default policy. Instead of relying on a fixed default policy, we learn it from data. But crucially, we restrict the amount of information the default policy receives, forcing it to learn reusable behaviors that help the policy learn faster. We formalize this strategy and discuss connections to information bottleneck approaches and to the variational EM algorithm. We present empirical results in both discrete and continuous action domains and demonstrate that, for certain tasks, learning a default policy alongside the policy can significantly speed up and improve learning.

Posted Content•

Deep amortized clustering

Juho Lee, Yoonho Lee, Yee Whye Teh¹•Institutions (1)

University of Oxford¹

25 Sep 2019-arXiv: Learning

TL;DR: It is empirically shown, on both synthetic and image data, that DAC can efficiently and accurately cluster new datasets coming from the same distribution used to generate training datasets.

Abstract: We propose deep amortized clustering (DAC), a neural architecture which learns to cluster datasets efficiently using a few forward passes. DAC implicitly learns what makes a cluster, how to group data points into clusters, and how to count the number of clusters in datasets. DAC is meta-learned using labelled datasets for training, a process distinct from traditional clustering algorithms which usually require hand-specified prior knowledge about cluster shapes/structures. We empirically show, on both synthetic and image data, that DAC can efficiently and accurately cluster new datasets coming from the same distribution used to generate training datasets.

Posted Content•

Meta-Learning surrogate models for sequential decision making

Alexandre Galashov, Jonathan Schwarz¹, Hyunjik Kim, Marta Garnelo, David Saxton, Pushmeet Kohli, S. M. Ali Eslami, Yee Whye Teh•Institutions (1)

Google¹

28 Mar 2019-arXiv: Machine Learning

TL;DR: This work introduces a unified probabilistic framework for solving sequential decision making problems ranging from Bayesian optimisation to contextual bandits and reinforcement learning, and explores the use of Neural processes due to statistical and computational desiderata.

Abstract: We introduce a unified probabilistic framework for solving sequential decision making problems ranging from Bayesian optimisation to contextual bandits and reinforcement learning. This is accomplished by a probabilistic model-based approach that explains observed data while capturing predictive uncertainty during the decision making process. Crucially, this probabilistic model is chosen to be a Meta-Learning system that allows learning from a distribution of related problems, allowing data efficient adaptation to a target task. As a suitable instantiation of this framework, we explore the use of Neural processes due to statistical and computational desiderata. We apply our framework to a broad range of problem domains, such as control problems, recommender systems and adversarial attacks on RL agents, demonstrating an efficient and general black-box learning approach.

Showing papers by "Yee Whye Teh" published in 2019