Open Access Proceedings Article

Concrete Dropout

Yarin Gal, +2 more
Vol. 30, pp. 3581–3590
TLDR
In this paper, a continuous relaxation of dropout's discrete masks is proposed, which allows for automatic tuning of the dropout probability in large models, and as a result faster experimentation cycles.
Abstract
Dropout is used as a practical tool to obtain uncertainty estimates in large vision models and reinforcement learning (RL) tasks. But to obtain well-calibrated uncertainty estimates, a grid-search over the dropout probabilities is necessary—a prohibitive operation with large models, and an impossible one with RL. We propose a new dropout variant which gives improved performance and better calibrated uncertainties. Relying on recent developments in Bayesian deep learning, we use a continuous relaxation of dropout’s discrete masks. Together with a principled optimisation objective, this allows for automatic tuning of the dropout probability in large models, and as a result faster experimentation cycles. In RL this allows the agent to adapt its uncertainty dynamically as more data is observed. We analyse the proposed variant extensively on a range of tasks, and give insights into common practice in the field where larger dropout probabilities are often used in deeper model layers.
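The continuous relaxation the abstract describes can be sketched in a few lines: instead of sampling a hard Bernoulli dropout mask, one samples from a Concrete (relaxed Bernoulli) distribution, which produces mask values in (0, 1) that are differentiable with respect to the dropout probability, so that probability can be tuned by gradient descent. The following is a minimal NumPy sketch of that idea, not the paper's released implementation; the function name, temperature value, and rescaling convention are illustrative assumptions.

```python
import numpy as np

def concrete_dropout_mask(shape, p, temperature=0.1, rng=None):
    """Sample a relaxed (continuous) dropout mask.

    A hard mask would be Bernoulli(1 - p); here we draw a Concrete
    relaxation instead, so the sample stays in (0, 1) and is
    differentiable w.r.t. the dropout probability p.
    Illustrative sketch only, not the authors' code.
    """
    rng = rng or np.random.default_rng()
    eps = 1e-7
    u = rng.uniform(eps, 1.0 - eps, size=shape)  # uniform noise
    # Logit of a relaxed Bernoulli sample with drop probability p
    logit = (np.log(p + eps) - np.log(1.0 - p + eps)
             + np.log(u + eps) - np.log(1.0 - u + eps))
    drop = 1.0 / (1.0 + np.exp(-logit / temperature))  # sigmoid
    return 1.0 - drop  # "keep" mask: near 1 keeps a unit, near 0 drops it

# Apply the relaxed mask with the usual dropout rescaling
p = 0.5
x = np.ones((4, 3))
mask = concrete_dropout_mask(x.shape, p)
y = x * mask / (1.0 - p)
```

As the temperature approaches zero the mask values concentrate near 0 and 1, recovering standard binary dropout; a small positive temperature keeps the sample smooth enough for gradients to flow into `p`.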



Citations
Posted Content

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

TL;DR: In this paper, the lottery ticket hypothesis is proposed to find subnetworks that can reach test accuracy comparable to the original network in a similar number of iterations; the winning tickets have won the initialization lottery: their connections have initial weights that make training particularly effective.
Proceedings Article

A Simple Baseline for Bayesian Uncertainty in Deep Learning

TL;DR: In this article, the authors proposed the SWA-Gaussian (SWAG) approach for uncertainty representation and calibration in deep learning, in which the first moment of the stochastic gradient descent (SGD) iterates is computed using a modified learning rate schedule.
Proceedings ArticleDOI

Deep and Confident Prediction for Time Series at Uber

TL;DR: A novel end-to-end Bayesian deep model is proposed that provides time series prediction along with uncertainty estimation at Uber and is successfully applied to large-scale time series anomaly detection at Uber.
Posted Content

Understanding Measures of Uncertainty for Adversarial Example Detection

TL;DR: In this article, failure modes for MC dropout, a widely used approach for estimating uncertainty in deep models, are highlighted, and a proposal to improve the quality of uncertainty estimates using probabilistic model ensembles is made.
References
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Proceedings ArticleDOI

Going deeper with convolutions

TL;DR: Inception, as described in this paper, is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Proceedings Article

Auto-Encoding Variational Bayes

TL;DR: A stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case is introduced.
Journal ArticleDOI

Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

TL;DR: This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates.
Posted Content

Improving neural networks by preventing co-adaptation of feature detectors

TL;DR: The authors randomly omit half of the feature detectors on each training case to prevent complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors.