Adam: A Method for Stochastic Optimization

Proceedings Article•

Adam: A Method for Stochastic Optimization

Diederik P. Kingma¹, Jimmy Ba²•Institutions (2)

University of Amsterdam¹, University of Toronto²

01 Jan 2015-

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
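
The update described in the abstract, built on running estimates of the first and second moments of the gradient, can be sketched as follows. This is a minimal illustration in plain Python, not the authors' reference implementation: the function names `adam_step` and `minimize_quadratic`, the toy quadratic objective, and the specific hyper-parameter values are assumptions chosen for the example.

```python
import math


def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a list of floats.

    m and v are the running first- and second-moment estimates; t is the
    1-based step count used for bias correction. Returns (params, m, v).
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** t)             # correct the bias toward zero
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


def minimize_quadratic(steps=2000, lr=0.05):
    """Hypothetical toy problem: f(x) = (x0 - 3)^2 + 10 * (x1 + 1)^2."""
    x, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        grads = [2.0 * (x[0] - 3.0), 20.0 * (x[1] + 1.0)]
        x, m, v = adam_step(x, grads, m, v, t, lr=lr)
    return x


if __name__ == "__main__":
    print(minimize_quadratic())
```

On the first step the bias-corrected ratio reduces to roughly the sign of the gradient, so each coordinate moves by about `lr` regardless of gradient scale, which is one way to see the invariance to diagonal rescaling mentioned in the abstract.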

Citations

Posted Content•

NeRF-VAE: A Geometry Aware 3D Scene Generative Model

Adam R. Kosiorek, Heiko Strathmann, Daniel Zoran, Pol Moreno¹, Rosalia Schneider, Sona Mokra, Danilo Jimenez Rezende•Institutions (1)

Google¹

01 Apr 2021-arXiv: Machine Learning

TL;DR: This model is a VAE that learns a distribution over radiance fields by conditioning them on a latent scene representation and is able to infer and render geometrically-consistent scenes from previously unseen 3D environments using very few input images.

Abstract: We propose NeRF-VAE, a 3D scene generative model that incorporates geometric structure via NeRF and differentiable volume rendering. In contrast to NeRF, our model takes into account shared structure across scenes, and is able to infer the structure of a novel scene -- without the need to re-train -- using amortized inference. NeRF-VAE's explicit 3D rendering process further contrasts previous generative models with convolution-based rendering which lacks geometric structure. Our model is a VAE that learns a distribution over radiance fields by conditioning them on a latent scene representation. We show that, once trained, NeRF-VAE is able to infer and render geometrically-consistent scenes from previously unseen 3D environments using very few input images. We further demonstrate that NeRF-VAE generalizes well to out-of-distribution cameras, while convolutional models do not. Finally, we introduce and study an attention-based conditioning mechanism of NeRF-VAE's decoder, which improves model performance.

58 citations

Cites methods from "Adam: A Method for Stochastic Optim..."

...We use the Adam optimizer (Kingma & Ba, 2014) with learning rate = 3 × 10⁻⁴ for Jaytracer and GQN datasets and = 5 × 10⁻⁴ for the CLEVR dataset....
[...]
...NeRF is trained with a batch size of 256 rays, using Adam with learning rate 1−3, for 56 iterations....
[...]
...We use Adam (Kingma & Ba, 2014) and β-annealing of the KL term in Eq....
[...]
...We use a 128 dimensional latent variable, Adam with a learning rate of 5−4 for 16 iterations, and β = 11−6, which is annealed to 1−4 from iteration 40k to 140k....
[...]

Proceedings Article•DOI•

A Deep CNN-Based Framework For Enhanced Aerial Imagery Registration with Applications to UAV Geolocalization

Ahmed Samy Nassar¹, Karim Amer¹, Reda ElHakim¹, Mohamed ElHelw¹•Institutions (1)

Nile University¹

01 Jun 2018

TL;DR: A novel framework for geolocalizing Unmanned Aerial Vehicles using only their onboard camera and the utilization of visual information can offer a promising approach for unconstrained UAV navigation and enable the aerial platform to be self-aware of its surroundings thus opening up new application domains or enhancing existing ones.

Abstract: In this paper we present a novel framework for geolocalizing Unmanned Aerial Vehicles (UAVs) using only their onboard camera. The framework exploits the abundance of satellite imagery, along with established computer vision and deep learning methods, to locate the UAV in a satellite imagery map. It utilizes the contextual information extracted from the scene to attain increased geolocalization accuracy and enable navigation without the use of a Global Positioning System (GPS), which is advantageous in GPS-denied environments and provides additional enhancement to existing GPS-based systems. The framework inputs two images at a time, one captured using a UAV-mounted downlooking camera, and the other synthetically generated from the satellite map based on the UAV location within the map. Local features are extracted and used to register both images, a process that is performed recurrently to relate UAV motion to its actual map position, hence performing preliminary localization. A semantic shape matching algorithm is subsequently applied to extract and match meaningful shape information from both images, and use this information to improve localization accuracy. The framework is evaluated on two different datasets representing different geographical regions. Obtained results demonstrate the viability of proposed method and that the utilization of visual information can offer a promising approach for unconstrained UAV navigation and enable the aerial platform to be self-aware of its surroundings thus opening up new application domains or enhancing existing ones.

58 citations

Cites methods from "Adam: A Method for Stochastic Optim..."

...As for the optimization algorithm, we chose the Adaptive Moment Estimator (ADAM) [25]....
[...]
...In our early experiments, ADAM converged much faster than stochastic gradient descent and NADAM [11]....
[...]

Journal Article•DOI•

Applying deep reinforcement learning to active flow control in turbulent conditions

Feng Ren, Jean Rabault, Hui Tang¹•Institutions (1)

Hong Kong Polytechnic University¹

18 Jun 2020-arXiv: Fluid Dynamics

TL;DR: In this paper, the authors use deep reinforcement learning (DRL) to perform active flow control over a circular cylinder at Re = 1000, in weakly turbulent conditions, and achieve a drag reduction of around 30%, accompanied by elongation of the recirculation bubble and reduction of turbulent fluctuations in the cylinder wake.

Abstract: Machine learning has recently become a promising technique in fluid mechanics, especially for active flow control (AFC) applications. A recent work [J. Fluid Mech. (2019), vol. 865, pp. 281-302] has demonstrated the feasibility and effectiveness of deep reinforcement learning (DRL) in performing AFC over a circular cylinder at $Re = 100$, i.e., in the laminar flow regime. As a follow-up study, we investigate the same AFC problem at an intermediate Reynolds number, i.e., $Re = 1000$, where the turbulence in the flow poses great challenges to the control. The results show that the DRL agent can still find effective control strategies, but requires much more episodes in the learning. A remarkable drag reduction of around $30\%$ is achieved, which is accompanied by elongation of the recirculation bubble and reduction of turbulent fluctuations in the cylinder wake. To our best knowledge, this study is the first successful application of DRL to AFC in weak turbulent conditions. It therefore sets a new milestone in progressing towards AFC in strong turbulent flows.

58 citations

Proceedings Article•DOI•

Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions

Zinan Lin¹, Alankar Jain¹, Chen Wang², Giulia Fanti¹, Vyas Sekar¹•Institutions (2)

Carnegie Mellon University¹, IBM²

30 Sep 2019-arXiv: Learning

TL;DR: This work explores if and how generative adversarial networks can be used to incentivize data sharing by enabling a generic framework for sharing synthetic datasets with minimal expert knowledge and designs a custom workflow called DoppelGANger, which achieves up to 43% better fidelity than baseline models.

Abstract: Limited data access is a longstanding barrier to data-driven research and development in the networked systems community. In this work, we explore if and how generative adversarial networks (GANs) can be used to incentivize data sharing by enabling a generic framework for sharing synthetic datasets with minimal expert knowledge. As a specific target, our focus in this paper is on time series datasets with metadata (e.g., packet loss rate measurements with corresponding ISPs). We identify key challenges of existing GAN approaches for such workloads with respect to fidelity (e.g., long-term dependencies, complex multidimensional relationships, mode collapse) and privacy (i.e., existing guarantees are poorly understood and can sacrifice fidelity). To improve fidelity, we design a custom workflow called DoppelGANger (DG) and demonstrate that across diverse real-world datasets (e.g., bandwidth measurements, cluster requests, web sessions) and use cases (e.g., structural characterization, predictive modeling, algorithm comparison), DG achieves up to 43% better fidelity than baseline models. Although we do not resolve the privacy problem in this work, we identify fundamental challenges with both classical notions of privacy and recent advances to improve the privacy properties of GANs, and suggest a potential roadmap for addressing these challenges. By shedding light on the promise and challenges, we hope our work can rekindle the conversation on workflows for data sharing.

58 citations

Proceedings Article•DOI•

Diversifying Sample Generation for Accurate Data-Free Quantization

Xiangguo Zhang¹, Haotong Qin¹, Yifu Ding¹, Ruihao Gong², Qinghua Yan¹, Renshuai Tao¹, Yuhang Li³, Fengwei Yu², Xianglong Liu¹•Institutions (3)

Beihang University¹, SenseTime², Yale University³

01 Jun 2021

TL;DR: In this paper, the authors propose to slack the alignment of feature statistics in the BN layer to relax the constraint at the distribution level and design a layerwise enhancement to reinforce specific layers for different data samples.

Abstract: Quantization has emerged as one of the most prevalent approaches to compress and accelerate neural networks. Recently, data-free quantization has been widely studied as a practical and promising solution. It synthesizes data for calibrating the quantized model according to the batch normalization (BN) statistics of FP32 ones and significantly relieves the heavy dependency on real training data in traditional quantization methods. Unfortunately, we find that in practice, the synthetic data identically constrained by BN statistics suffers serious homogenization at both distribution level and sample level and further causes a significant performance drop of the quantized model. We propose Diverse Sample Generation (DSG) scheme to mitigate the adverse effects caused by homogenization. Specifically, we slack the alignment of feature statistics in the BN layer to relax the constraint at the distribution level and design a layerwise enhancement to reinforce specific layers for different data samples. Our DSG scheme is versatile and even able to be applied to the state-of-the-art post-training quantization method like AdaRound. We evaluate the DSG scheme on the large-scale image classification task and consistently obtain significant improvements over various network architectures and quantization methods, especially when quantized to lower bits (e.g., up to 22% improvement on W4A4). Moreover, benefiting from the enhanced diversity, models calibrated with synthetic data perform close to those calibrated with real data and even outperform them on W4A4.

58 citations

References

Proceedings Article•

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky¹, Ilya Sutskever¹, Geoffrey E. Hinton¹•Institutions (1)

University of Toronto¹

03 Dec 2012

TL;DR: A large, deep convolutional neural network, consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art results on ImageNet classification.

Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Proceedings Article•

Auto-Encoding Variational Bayes

Diederik P. Kingma¹, Max Welling¹•Institutions (1)

University of Amsterdam¹

01 Jan 2014

TL;DR: A stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case is introduced.

Abstract: How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contribution is two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.
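
The reparameterization mentioned in the abstract can be illustrated on a toy problem. The sketch below is not the paper's lower bound estimator; it only shows the underlying idea of writing a Gaussian sample as z = mu + sigma * eps with eps drawn from a standard normal, so that gradients with respect to mu and sigma can be estimated by ordinary sampling. The function name `pathwise_gradients` and the test function f(z) = z² are assumptions made for this example.

```python
import random


def pathwise_gradients(mu, sigma, n_samples=20000, seed=0):
    """Estimate d/dmu and d/dsigma of E[f(z)] for z ~ N(mu, sigma^2), f(z) = z^2.

    The sample is written as z = mu + sigma * eps with eps ~ N(0, 1), so the
    randomness no longer depends on mu or sigma and the chain rule applies.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    d_mu = 0.0
    d_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        df_dz = 2.0 * z            # derivative of f(z) = z^2
        d_mu += df_dz              # dz/dmu = 1
        d_sigma += df_dz * eps     # dz/dsigma = eps
    return d_mu / n_samples, d_sigma / n_samples


if __name__ == "__main__":
    # Analytically, E[z^2] = mu^2 + sigma^2, so the gradients are (2*mu, 2*sigma).
    print(pathwise_gradients(1.0, 0.5))
```

Because the analytic gradients of this toy objective are known, the sampled estimates can be checked directly against 2·mu and 2·sigma.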

20,769 citations

Journal Article•DOI•

Reducing the Dimensionality of Data with Neural Networks

Geoffrey E. Hinton¹, Ruslan Salakhutdinov¹•Institutions (1)

University of Toronto¹

28 Jul 2006-Science

TL;DR: In this article, an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data is described.

Abstract: High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such "autoencoder" networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

16,717 citations

Journal Article•DOI•

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

Geoffrey E. Hinton¹, Li Deng², Dong Yu², George E. Dahl¹, Abdelrahman Mohamed¹, Navdeep Jaitly¹, Andrew W. Senior³, Vincent Vanhoucke³, Patrick Nguyen³, Tara N. Sainath⁴, Brian Kingsbury⁴•Institutions (4)

University of Toronto¹, Microsoft², Google³, IBM⁴

18 Oct 2012-IEEE Signal Processing Magazine

TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

9,091 citations

"Adam: A Method for Stochastic Optim..." refers background or methods in this paper

...Objectives may also have other sources of noise than data subsampling, such as dropout (Hinton et al., 2012b) regularization....
[...]
...SGD proved itself as an efficient and effective optimization method that was central in many machine learning success stories, such as recent advances in deep learning (Deng et al., 2013; Krizhevsky et al., 2012; Hinton & Salakhutdinov, 2006; Hinton et al., 2012a; Graves et al., 2013)....
[...]
...…the advantages of two recently popular methods: AdaGrad (Duchi et al., 2011), which works well with sparse gradients, and RMSProp (Tieleman & Hinton, 2012), which works well in on-line and non-stationary settings; important connections to these and other stochastic optimization methods are…...
[...]

Proceedings Article•

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.

John C. Duchi¹, Elad Hazan², Yoram Singer³•Institutions (3)

University of California, Berkeley¹, IBM², Google³

01 Jan 2010

TL;DR: Adaptive subgradient methods dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning; the adaptation helps find needles in haystacks in the form of very predictive but rarely seen features.

Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms.
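
The "rarely seen features" intuition can be made concrete with the widely used diagonal form of the update, in which each coordinate's step is divided by the square root of its accumulated squared gradients. This is a simplified sketch, not the proximal-function formulation analyzed in the paper; the names `adagrad_step` and `demo_effective_rates` and the synthetic gradient pattern are assumptions made for illustration.

```python
import math


def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One diagonal AdaGrad-style update on lists of floats.

    accum holds the per-coordinate sum of squared gradients seen so far.
    Returns the updated (params, accum).
    """
    new_params, new_accum = [], []
    for p, g, a in zip(params, grads, accum):
        a = a + g * g
        new_params.append(p - lr * g / (math.sqrt(a) + eps))
        new_accum.append(a)
    return new_params, new_accum


def demo_effective_rates(steps=100, lr=0.1):
    """Coordinate 0 gets a gradient every step, coordinate 1 every 10th step.

    Returns the per-coordinate effective step sizes lr / sqrt(accum).
    """
    params, accum = [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        grads = [1.0, 1.0 if t % 10 == 0 else 0.0]
        params, accum = adagrad_step(params, grads, accum, lr=lr)
    return [lr / (math.sqrt(a) + 1e-8) for a in accum]


if __name__ == "__main__":
    print(demo_effective_rates())
```

In this synthetic run the rarely active coordinate ends with a larger effective step size than the frequently active one, which is the behavior the abstract describes metaphorically.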

7,244 citations