scispace - formally typeset
Search or ask a question

Showing papers by "Yee Whye Teh published in 2019"


Proceedings Article
02 Apr 2019
TL;DR: Augmented Neural ODEs are introduced which, in addition to being more expressive models, are empirically more stable, generalize better and have a lower computational cost than Neural Odes.
Abstract: We show that Neural Ordinary Differential Equations (ODEs) learn representations that preserve the topology of the input space and prove that this implies the existence of functions Neural ODEs cannot represent. To address these limitations, we introduce Augmented Neural ODEs which, in addition to being more expressive models, are empirically more stable, generalize better and have a lower computational cost than Neural ODEs.

359 citations


Proceedings Article
24 May 2019
TL;DR: The Set Transformer as discussed by the authors is an attention-based neural network module, specifically designed to model interactions among elements in the input set, consisting of an encoder and a decoder, both of which rely on attention mechanisms.
Abstract: Many machine learning tasks such as multiple instance learning, 3D shape recognition, and few-shot image classification are defined on sets of instances. Since solutions to such problems do not depend on the order of elements of the set, models used to address them should be permutation invariant. We present an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set. The model consists of an encoder and a decoder, both of which rely on attention mechanisms. In an effort to reduce computational complexity, we introduce an attention scheme inspired by inducing point methods from sparse Gaussian process literature. It reduces the computation time of self-attention from quadratic to linear in the number of elements in the set. We show that our model is theoretically attractive and we evaluate it on a range of tasks, demonstrating the state-of-the-art performance compared to recent methods for set-structured data.

250 citations


Proceedings Article
17 Jun 2019
TL;DR: This work introduces an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects, and finds that object capsule presences are highly informative of the object class, which leads to state-of-the-art results for un supervised classification on SVHN and MNIST.
Abstract: Objects are composed of a set of geometrically organized parts. We introduce an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects. Since these relationships do not depend on the viewpoint, our model is robust to viewpoint changes. SCAE consists of two stages. In the first stage, the model predicts presences and poses of part templates directly from the image and tries to reconstruct the image by appropriately arranging the templates. In the second stage, the SCAE predicts parameters of a few object capsules, which are then used to reconstruct part poses. Inference in this model is amortized and performed by off-the-shelf neural encoders, unlike in previous capsule networks. We find that object capsule presences are highly informative of the object class, which leads to state-of-the-art results for unsupervised classification on SVHN (55%) and MNIST (98.7%).

180 citations


Proceedings Article
01 Jan 2019
TL;DR: The proposed approach (CURL) performs task inference directly within the model, is able to dynamically expand to capture new concepts over its lifetime, and incorporates additional rehearsal-based techniques to deal with catastrophic forgetting.
Abstract: Continual learning aims to improve the ability of modern learning systems to deal with non-stationary distributions, typically by attempting to learn a series of tasks sequentially. Prior art in the field has largely considered supervised or reinforcement learning tasks, and often assumes full knowledge of task labels and boundaries. In this work, we propose an approach (CURL) to tackle a more general problem that we will refer to as unsupervised continual learning. The focus is on learning representations without any knowledge about task identity, and we explore scenarios when there are abrupt changes between tasks, smooth transitions from one task to another, or even when the data is shuffled. The proposed approach performs task inference directly within the model, is able to dynamically expand to capture new concepts over its lifetime, and incorporates additional rehearsal-based techniques to deal with catastrophic forgetting. We demonstrate the efficacy of CURL in an unsupervised learning setting with MNIST and Omniglot, where the lack of labels ensures no information is leaked about the task. Further, we demonstrate strong performance compared to prior art in an i.i.d setting, or when adapting the technique to supervised tasks such as incremental class learning.

170 citations


Proceedings Article
01 Jan 2019
TL;DR: This paper proposed to incorporate attention into NPs, allowing each input location to attend to relevant context points for the prediction, which greatly improves the accuracy of predictions, results in noticeably faster training, and expands the range of functions that can be modelled.
Abstract: Neural Processes (NPs) (Garnelo et al 2018a;b) approach regression by learning to map a context set of observed input-output pairs to a distribution over regression functions. Each function models the distribution of the output given an input, conditioned on the context. NPs have the benefit of fitting observed data efficiently with linear complexity in the number of context input-output pairs, and can learn a wide family of conditional distributions; they learn predictive distributions conditioned on context sets of arbitrary size. Nonetheless, we show that NPs suffer a fundamental drawback of underfitting, giving inaccurate predictions at the inputs of the observed data they condition on. We address this issue by incorporating attention into NPs, allowing each input location to attend to the relevant context points for the prediction. We show that this greatly improves the accuracy of predictions, results in noticeably faster training, and expands the range of functions that can be modelled.

125 citations


Posted Content
TL;DR: Attention is incorporated into NPs, allowing each input location to attend to the relevant context points for the prediction, which greatly improves the accuracy of predictions, results in noticeably faster training, and expands the range of functions that can be modelled.
Abstract: Neural Processes (NPs) (Garnelo et al 2018a;b) approach regression by learning to map a context set of observed input-output pairs to a distribution over regression functions. Each function models the distribution of the output given an input, conditioned on the context. NPs have the benefit of fitting observed data efficiently with linear complexity in the number of context input-output pairs, and can learn a wide family of conditional distributions; they learn predictive distributions conditioned on context sets of arbitrary size. Nonetheless, we show that NPs suffer a fundamental drawback of underfitting, giving inaccurate predictions at the inputs of the observed data they condition on. We address this issue by incorporating attention into NPs, allowing each input location to attend to the relevant context points for the prediction. We show that this greatly improves the accuracy of predictions, results in noticeably faster training, and expands the range of functions that can be modelled.

117 citations


Proceedings Article
24 May 2019
TL;DR: In this paper, a generalization of disentanglement in VAEs is proposed, which is based on decomposition of the latent representation, i.e., the latent encodings of the data having an appropriate level of overlap, represented through the prior.
Abstract: We develop a generalisation of disentanglement in VAEs---decomposition of the latent representation---characterising it as the fulfilment of two factors: a) the latent encodings of the data having an appropriate level of overlap, and b) the aggregate encoding of the data conforming to a desired structure, represented through the prior. Decomposition permits disentanglement, i.e. explicit independence between latents, as a special case, but also allows for a much richer class of properties to be imposed on the learnt representation, such as sparsity, clustering, independent subspaces, or even intricate hierarchical dependency relationships. We show that the $\beta$-VAE varies from the standard VAE predominantly in its control of latent overlap and that for the standard choice of an isotropic Gaussian prior, its objective is invariant to rotations of the latent representation. Viewed from the decomposition perspective, breaking this invariance with simple manipulations of the prior can yield better disentanglement with little or no detriment to reconstructions. We further demonstrate how other choices of prior can assist in producing different decompositions and introduce an alternative training objective that allows the control of both decomposition factors in a principled manner.

109 citations


Posted Content
TL;DR: This work proposes a method that separately learns the policy and the task belief by taking advantage of various kinds of privileged information, which can be very effective at solving standard meta-RL environments, as well as a complex continuous control environment with sparse rewards and requiring long-term memory.
Abstract: Humans achieve efficient learning by relying on prior knowledge about the structure of naturally occurring tasks. There is considerable interest in designing reinforcement learning (RL) algorithms with similar properties. This includes proposals to learn the learning algorithm itself, an idea also known as meta learning. One formal interpretation of this idea is as a partially observable multi-task RL problem in which task information is hidden from the agent. Such unknown task problems can be reduced to Markov decision processes (MDPs) by augmenting an agent's observations with an estimate of the belief about the task based on past experience. However estimating the belief state is intractable in most partially-observed MDPs. We propose a method that separately learns the policy and the task belief by taking advantage of various kinds of privileged information. Our approach can be very effective at solving standard meta-RL environments, as well as a complex continuous control environment with sparse rewards and requiring long-term memory.

86 citations


Posted Content
TL;DR: This work proposes a framework specific for the scenario where no information about task boundaries or task identity is given, and proposes a separation of concerns into what task is being solved and how the task should be solved, which opens the door to combining meta-learning and continual learning techniques, leveraging their complementary advantages.
Abstract: While neural networks are powerful function approximators, they suffer from catastrophic forgetting when the data distribution is not stationary. One particular formalism that studies learning under non-stationary distribution is provided by continual learning, where the non-stationarity is imposed by a sequence of distinct tasks. Most methods in this space assume, however, the knowledge of task boundaries, and focus on alleviating catastrophic forgetting. In this work, we depart from this view and move the focus towards faster remembering -- i.e measuring how quickly the network recovers performance rather than measuring the network's performance without any adaptation. We argue that in many settings this can be more effective and that it opens the door to combining meta-learning and continual learning techniques, leveraging their complementary advantages. We propose a framework specific for the scenario where no information about task boundaries or task identity is given. It relies on a separation of concerns into what task is being solved and how the task should be solved. This framework is implemented by differentiating task specific parameters from task agnostic parameters, where the latter are optimized in a continual meta learning fashion, without access to multiple tasks at the same time. We showcase this framework in a supervised learning scenario and discuss the implication of the proposed formalism.

82 citations


Posted Content
07 Jun 2019
TL;DR: This work proposes a statistically principled, easy-to-implement test using the empirical distribution of model likelihoods to determine whether or not inputs reside in the typical set.
Abstract: Recent work has shown that deep generative models can assign higher likelihood to out-of-distribution data sets than to their training data. We posit that this phenomenon is caused by a mismatch between the model's typical set and its areas of high probability density. In-distribution inputs should reside in the former but not necessarily in the latter, as previous work has presumed. To determine whether or not inputs reside in the typical set, we propose a statistically principled, easy-to-implement test using the empirical distribution of model likelihoods. The test is model agnostic and widely applicable, only requiring that the likelihood can be computed or closely approximated. We report experiments showing that our procedure can successfully detect the out-of-distribution sets in several of the challenging cases reported by Nalisnick et al. (2019).

82 citations


Posted Content
TL;DR: The authors endow VAEs with a Poincare ball model of hyperbolic geometry as a latent space and rigorously derive the necessary methods to work with two main Gaussian generalisations on that space.
Abstract: The variational auto-encoder (VAE) is a popular method for learning a generative model and embeddings of the data. Many real datasets are hierarchically structured. However, traditional VAEs map data in a Euclidean latent space which cannot efficiently embed tree-like structures. Hyperbolic spaces with negative curvature can. We therefore endow VAEs with a Poincare ball model of hyperbolic geometry as a latent space and rigorously derive the necessary methods to work with two main Gaussian generalisations on that space. We empirically show better generalisation to unseen data than the Euclidean counterpart, and can qualitatively and quantitatively better recover hierarchical structures.

Proceedings Article
01 Jan 2019
TL;DR: This work starts from the KL regularized expected reward objective and introduces an additional component, a default policy, but crucially restricts the amount of information the default policy receives, forcing it to learn reusable behaviors that help the policy learn faster.
Abstract: Many real world tasks exhibit rich structure that is repeated across different parts of the state space or in time. In this work we study the possibility of leveraging such repeated structure to speed up and regularize learning. We start from the KL regularized expected reward objective which introduces an additional component, a default policy. Instead of relying on a fixed default policy, we learn it from data. But crucially, we restrict the amount of information the default policy receives, forcing it to learn reusable behaviors that help the policy learn faster. We formalize this strategy and discuss connections to information bottleneck approaches and to the variational EM algorithm. We present empirical results in both discrete and continuous action domains and demonstrate that, for certain tasks, learning a default policy alongside the policy can significantly speed up and improve learning.

Posted Content
TL;DR: A statistically principled, easy-to-implement test using the empirical distribution of model likelihoods to determine whether or not inputs reside in the typical set, only requiring that the likelihood can be computed or closely approximated.
Abstract: Recent work has shown that deep generative models can assign higher likelihood to out-of-distribution data sets than to their training data (Nalisnick et al., 2019; Choi et al., 2019). We posit that this phenomenon is caused by a mismatch between the model's typical set and its areas of high probability density. In-distribution inputs should reside in the former but not necessarily in the latter, as previous work has presumed. To determine whether or not inputs reside in the typical set, we propose a statistically principled, easy-to-implement test using the empirical distribution of model likelihoods. The test is model agnostic and widely applicable, only requiring that the likelihood can be computed or closely approximated. We report experiments showing that our procedure can successfully detect the out-of-distribution sets in several of the challenging cases reported by Nalisnick et al. (2019).

Posted Content
TL;DR: This report recast memory-based meta-learning within a Bayesian framework, showing that the meta-learned strategies are near-optimal because they amortize Bayes-filtered data, where the adaptation is implemented in the memory dynamics as a state-machine of sufficient statistics.
Abstract: In this report we review memory-based meta-learning as a tool for building sample-efficient strategies that learn from past experience to adapt to any task within a target class. Our goal is to equip the reader with the conceptual foundations of this tool for building new, scalable agents that operate on broad domains. To do so, we present basic algorithmic templates for building near-optimal predictors and reinforcement learners which behave as if they had a probabilistic model that allowed them to efficiently exploit task structure. Furthermore, we recast memory-based meta-learning within a Bayesian framework, showing that the meta-learned strategies are near-optimal because they amortize Bayes-filtered data, where the adaptation is implemented in the memory dynamics as a state-machine of sufficient statistics. Essentially, memory-based meta-learning translates the hard problem of probabilistic sequential inference into a regression problem.

Posted Content
TL;DR: In this article, it was shown that augmented neural ODEs are empirically more stable, generalize better and have a lower computational cost than Neural ODE, which implies the existence of functions that Neural Ordinary Differential Equations cannot represent.
Abstract: We show that Neural Ordinary Differential Equations (ODEs) learn representations that preserve the topology of the input space and prove that this implies the existence of functions Neural ODEs cannot represent. To address these limitations, we introduce Augmented Neural ODEs which, in addition to being more expressive models, are empirically more stable, generalize better and have a lower computational cost than Neural ODEs.

Proceedings Article
01 Jan 2019
TL;DR: This work endow VAEs with a Poincare ball model of hyperbolic geometry as a latent space and rigorously derive the necessary methods to work with two main Gaussian generalisations on that space.
Abstract: The Variational Auto-Encoder (VAE) is a popular method for learning a generative model and embeddings of the data. Many real datasets are hierarchically structured. However, traditional VAEs map data in a Euclidean latent space which cannot efficiently embed tree-like structures. Hyperbolic spaces with negative curvature can. We therefore endow VAEs with a Poincare ball model of hyperbolic geometry as a latent space and rigorously derive the necessary methods to work with two main Gaussian generalisations on that space. We empirically show better generalisation to unseen data than the Euclidean counterpart, and can qualitatively and quantitatively better recover hierarchical structures.

Posted Content
TL;DR: This work introduces several classes of fast EIG estimators by building on ideas from amortized variational inference, and shows theoretically and empirically that these estimators can provide significant gains in speed and accuracy over previous approaches.
Abstract: Bayesian optimal experimental design (BOED) is a principled framework for making efficient use of limited experimental resources. Unfortunately, its applicability is hampered by the difficulty of obtaining accurate estimates of the expected information gain (EIG) of an experiment. To address this, we introduce several classes of fast EIG estimators by building on ideas from amortized variational inference. We show theoretically and empirically that these estimators can provide significant gains in speed and accuracy over previous approaches. We further demonstrate the practicality of our approach on a number of end-to-end experiments.

Proceedings Article
24 May 2019
TL;DR: The hybrid model, despite the invertibility constraints, achieves similar accuracy to purely predictive models, and the generative component remains a good model of the input features despite the hybrid optimization objective.
Abstract: We propose a neural hybrid model consisting of a linear model defined on a set of features computed by a deep, invertible transformation (i.e. a normalizing flow). An attractive property of our model is that both p(features), the density of the features, and p(targets | features), the predictive distribution, can be computed exactly in a single feed-forward pass. We show that our hybrid model, despite the invertibility constraints, achieves similar accuracy to purely predictive models. Moreover the generative component remains a good model of the input features despite the hybrid optimization objective. This offers additional capabilities such as detection of out-of-distribution inputs and enabling semi-supervised learning. The availability of the exact joint density p(targets, features) also allows us to compute many quantities readily, making our hybrid model a useful building block for downstream applications of probabilistic deep learning.

Posted Content
TL;DR: This article proposed a neural hybrid model consisting of a linear model defined on a set of features computed by a deep, invertible transformation (i.e., a normalizing flow).
Abstract: We propose a neural hybrid model consisting of a linear model defined on a set of features computed by a deep, invertible transformation (i.e. a normalizing flow). An attractive property of our model is that both p(features), the density of the features, and p(targets | features), the predictive distribution, can be computed exactly in a single feed-forward pass. We show that our hybrid model, despite the invertibility constraints, achieves similar accuracy to purely predictive models. Moreover the generative component remains a good model of the input features despite the hybrid optimization objective. This offers additional capabilities such as detection of out-of-distribution inputs and enabling semi-supervised learning. The availability of the exact joint density p(targets, features) also allows us to compute many quantities readily, making our hybrid model a useful building block for downstream applications of probabilistic deep learning.

Posted Content
TL;DR: Zhang et al. as discussed by the authors introduced an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects. But their model is robust to viewpoint changes.
Abstract: Objects are composed of a set of geometrically organized parts. We introduce an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects. Since these relationships do not depend on the viewpoint, our model is robust to viewpoint changes. SCAE consists of two stages. In the first stage, the model predicts presences and poses of part templates directly from the image and tries to reconstruct the image by appropriately arranging the templates. In the second stage, SCAE predicts parameters of a few object capsules, which are then used to reconstruct part poses. Inference in this model is amortized and performed by off-the-shelf neural encoders, unlike in previous capsule networks. We find that object capsule presences are highly informative of the object class, which leads to state-of-the-art results for unsupervised classification on SVHN (55%) and MNIST (98.7%). The code is available at this https URL

Posted Content
TL;DR: In this paper, the authors consider group invariance from the perspective of probabilistic symmetry, and obtain generative functional representations of probability distributions that are invariant or equivariant under the action of a compact group.
Abstract: Treating neural network inputs and outputs as random variables, we characterize the structure of neural networks that can be used to model data that are invariant or equivariant under the action of a compact group. Much recent research has been devoted to encoding invariance under symmetry transformations into neural network architectures, in an effort to improve the performance of deep neural networks in data-scarce, non-i.i.d., or unsupervised settings. By considering group invariance from the perspective of probabilistic symmetry, we establish a link between functional and probabilistic symmetry, and obtain generative functional representations of probability distributions that are invariant or equivariant under the action of a compact group. Our representations completely characterize the structure of neural networks that can be used to model such distributions and yield a general program for constructing invariant stochastic or deterministic neural networks. We demonstrate that examples from the recent literature are special cases, and develop the details of the general program for exchangeable sequences and arrays.

Posted Content
TL;DR: The functional regularization for Continual Learning (CL) framework as discussed by the authors uses a Gaussian process obtained by treating the weights of the last layer of a neural network as random and Gaussian distributed to avoid forgetting a previous task by constructing an approximate posterior belief over the underlying task-specific function.
Abstract: We introduce a framework for Continual Learning (CL) based on Bayesian inference over the function space rather than the parameters of a deep neural network. This method, referred to as functional regularisation for Continual Learning, avoids forgetting a previous task by constructing and memorising an approximate posterior belief over the underlying task-specific function. To achieve this we rely on a Gaussian process obtained by treating the weights of the last layer of a neural network as random and Gaussian distributed. Then, the training algorithm sequentially encounters tasks and constructs posterior beliefs over the task-specific functions by using inducing point sparse Gaussian process methods. At each step a new task is first learnt and then a summary is constructed consisting of (i) inducing inputs -- a fixed-size subset of the task inputs selected such that it optimally represents the task -- and (ii) a posterior distribution over the function values at these inputs. This summary then regularises learning of future tasks, through Kullback-Leibler regularisation terms. Our method thus unites approaches focused on (pseudo-)rehearsal with those derived from a sequential Bayesian inference perspective in a principled way, leading to strong results on accepted benchmarks.

Posted Content
TL;DR: This work considers the implications of the KL-regularized expected reward objective framework in cases where both the policy and default behavior are augmented with latent variables and discusses how the resulting hierarchical structures can be used to implement different inductive biases and how their modularity can benefit transfer.
Abstract: As reinforcement learning agents are tasked with solving more challenging and diverse tasks, the ability to incorporate prior knowledge into the learning system and to exploit reusable structure in solution space is likely to become increasingly important. The KL-regularized expected reward objective constitutes one possible tool to this end. It introduces an additional component, a default or prior behavior, which can be learned alongside the policy and as such partially transforms the reinforcement learning problem into one of behavior modelling. In this work we consider the implications of this framework in cases where both the policy and default behavior are augmented with latent variables. We discuss how the resulting hierarchical structures can be used to implement different inductive biases and how their modularity can benefit transfer. Empirically we find that they can lead to faster learning and transfer on a range of continuous control tasks.

Proceedings Article
01 Jan 2019
TL;DR: In this paper, the authors introduce several classes of fast EIG estimators by building on ideas from amortized variational inference, which can provide significant gains in speed and accuracy over previous approaches.
Abstract: Bayesian optimal experimental design (BOED) is a principled framework for making efficient use of limited experimental resources. Unfortunately, its applicability is hampered by the difficulty of obtaining accurate estimates of the expected information gain (EIG) of an experiment. To address this, we introduce several classes of fast EIG estimators by building on ideas from amortized variational inference. We show theoretically and empirically that these estimators can provide significant gains in speed and accuracy over previous approaches. We further demonstrate the practicality of our approach on a number of end-to-end experiments.

Proceedings Article
01 Jan 2019
TL;DR: In this article, Bornschein and Bengio revisited the reweighted wake-sleep (RWS) algorithm, and through extensive evaluations, show that it outperforms current state-of-the-art methods in learning stochastic control-flow models.
Abstract: Stochastic control-flow models (SCFMs) are a class of generative models that involve branching on choices from discrete random variables. Amortized gradient-based learning of SCFMs is challenging as most approaches targeting discrete variables rely on their continuous relaxations---which can be intractable in SCFMs, as branching on relaxations requires evaluating all (exponentially many) branching paths. Tractable alternatives mainly combine REINFORCE with complex control-variate schemes to improve the variance of naive estimators. Here, we revisit the reweighted wake-sleep (RWS) (Bornschein and Bengio, 2015) algorithm, and through extensive evaluations, show that it outperforms current state-of-the-art methods in learning SCFMs. Further, in contrast to the importance weighted autoencoder, we observe that RWS learns better models and inference networks with increasing numbers of particles. Our results suggest that RWS is a competitive, often preferable, alternative for learning SCFMs.

Posted Content
TL;DR: This work introduces a fully stochastic gradient based approach to Bayesian optimal experimental design (BOED) that utilizes variational lower bounds on the expected information gain of an experiment that can be simultaneously optimized with respect to both the variational and design parameters.
Abstract: We introduce a fully stochastic gradient based approach to Bayesian optimal experimental design (BOED). Our approach utilizes variational lower bounds on the expected information gain (EIG) of an experiment that can be simultaneously optimized with respect to both the variational and design parameters. This allows the design process to be carried out through a single unified stochastic gradient ascent procedure, in contrast to existing approaches that typically construct a pointwise EIG estimator, before passing this estimator to a separate optimizer. We provide a number of different variational objectives including the novel adaptive contrastive estimation (ACE) bound. Finally, we show that our gradient-based approaches are able to provide effective design optimization in substantially higher dimensional settings than existing approaches.

01 Jan 2019
TL;DR: The authors endow VAE with a Poincar\'e ball model of hyperbolic geometry and derive the necessary methods to work with two main Gaussian generalisations on that space.
Abstract: The Variational Auto-Encoder (VAE) model is a popular method to learn at once a generative model and embeddings for data living in a high-dimensional space. In the real world, many datasets may be assumed to be hierarchically structured. Traditionally, VAE uses a Euclidean latent space, but tree-like structures cannot be efficiently embedded in such spaces as opposed to hyperbolic spaces with negative curvature. We therefore endow VAE with a Poincar\'e ball model of hyperbolic geometry and derive the necessary methods to work with two main Gaussian generalisations on that space. We empirically show better generalisation to unseen data than the Euclidean counterpart, and can qualitatively and quantitatively better recover hierarchical structures.

Posted Content
TL;DR: In this paper, the KL regularized expected reward objective is replaced with a default policy, and the default policy is trained from data to learn reusable behaviors that help the policy learn faster.
Abstract: Many real world tasks exhibit rich structure that is repeated across different parts of the state space or in time. In this work we study the possibility of leveraging such repeated structure to speed up and regularize learning. We start from the KL regularized expected reward objective which introduces an additional component, a default policy. Instead of relying on a fixed default policy, we learn it from data. But crucially, we restrict the amount of information the default policy receives, forcing it to learn reusable behaviors that help the policy learn faster. We formalize this strategy and discuss connections to information bottleneck approaches and to the variational EM algorithm. We present empirical results in both discrete and continuous action domains and demonstrate that, for certain tasks, learning a default policy alongside the policy can significantly speed up and improve learning.

Posted Content
TL;DR: It is empirically show, on both synthetic and image data, that DAC can efficiently and accurately cluster new datasets coming from the same distribution used to generate training datasets.
Abstract: We propose a deep amortized clustering (DAC), a neural architecture which learns to cluster datasets efficiently using a few forward passes. DAC implicitly learns what makes a cluster, how to group data points into clusters, and how to count the number of clusters in datasets. DAC is meta-learned using labelled datasets for training, a process distinct from traditional clustering algorithms which usually require hand-specified prior knowledge about cluster shapes/structures. We empirically show, on both synthetic and image data, that DAC can efficiently and accurately cluster new datasets coming from the same distribution used to generate training datasets.

Posted Content
TL;DR: This work introduces a unified probabilistic framework for solving sequential decision making problems ranging from Bayesian optimisation to contextual bandits and reinforcement learning, and explores the use of Neural processes due to statistical and computational desiderata.
Abstract: We introduce a unified probabilistic framework for solving sequential decision making problems ranging from Bayesian optimisation to contextual bandits and reinforcement learning. This is accomplished by a probabilistic model-based approach that explains observed data while capturing predictive uncertainty during the decision making process. Crucially, this probabilistic model is chosen to be a Meta-Learning system that allows learning from a distribution of related problems, allowing data efficient adaptation to a target task. As a suitable instantiation of this framework, we explore the use of Neural processes due to statistical and computational desiderata. We apply our framework to a broad range of problem domains, such as control problems, recommender systems and adversarial attacks on RL agents, demonstrating an efficient and general black-box learning approach.