scispace - formally typeset
Search or ask a question
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
Citations
More filters
Posted Content
TL;DR: This model is a VAE that learns a distribution over radiance fields by conditioning them on a latent scene representation and is able to infer and render geometrically-consistent scenes from previously unseen 3D environments using very few input images.
Abstract: We propose NeRF-VAE, a 3D scene generative model that incorporates geometric structure via NeRF and differentiable volume rendering. In contrast to NeRF, our model takes into account shared structure across scenes, and is able to infer the structure of a novel scene -- without the need to re-train -- using amortized inference. NeRF-VAE's explicit 3D rendering process further contrasts previous generative models with convolution-based rendering which lacks geometric structure. Our model is a VAE that learns a distribution over radiance fields by conditioning them on a latent scene representation. We show that, once trained, NeRF-VAE is able to infer and render geometrically-consistent scenes from previously unseen 3D environments using very few input images. We further demonstrate that NeRF-VAE generalizes well to out-of-distribution cameras, while convolutional models do not. Finally, we introduce and study an attention-based conditioning mechanism of NeRF-VAE's decoder, which improves model performance.

58 citations


Cites methods from "Adam: A Method for Stochastic Optim..."

  • ...We use the Adam optimizer (Kingma & Ba, 2014) with learning rate = 3 × 10−4 for Jaytracer and GQN datasets and = 5× 10−4 for the CLEVR dataset....

    [...]

  • ...NeRF is trained with a batch size of 256 rays, using Adam with learning rate 1−3, for 56 iterations....

    [...]

  • ...We use Adam (Kingma & Ba, 2014) and β-annealing of the KL term in Eq....

    [...]

  • ...We use a 128 dimensional latent variable, Adam with a learning rate of 5−4 for 16 iterations, and β = 11−6, which is annealed to 1−4 from iteration 40k to 140k....

    [...]

Proceedings ArticleDOI
01 Jun 2018
TL;DR: A novel framework for geolocalizing Unmanned Aerial Vehicles using only their onboard camera and the utilization of visual information can offer a promising approach for unconstrained UAV navigation and enable the aerial platform to be self-aware of its surroundings thus opening up new application domains or enhancing existing ones.
Abstract: In this paper we present a novel framework for geolocalizing Unmanned Aerial Vehicles (UAVs) using only their onboard camera. The framework exploits the abundance of satellite imagery, along with established computer vision and deep learning methods, to locate the UAV in a satellite imagery map. It utilizes the contextual information extracted from the scene to attain increased geolocalization accuracy and enable navigation without the use of a Global Positioning System (GPS), which is advantageous in GPS-denied environments and provides additional enhancement to existing GPS-based systems. The framework inputs two images at a time, one captured using a UAV-mounted downlooking camera, and the other synthetically generated from the satellite map based on the UAV location within the map. Local features are extracted and used to register both images, a process that is performed recurrently to relate UAV motion to its actual map position, hence performing preliminary localization. A semantic shape matching algorithm is subsequently applied to extract and match meaningful shape information from both images, and use this information to improve localization accuracy. The framework is evaluated on two different datasets representing different geographical regions. Obtained results demonstrate the viability of proposed method and that the utilization of visual information can offer a promising approach for unconstrained UAV navigation and enable the aerial platform to be self-aware of its surroundings thus opening up new application domains or enhancing existing ones.

58 citations


Cites methods from "Adam: A Method for Stochastic Optim..."

  • ...As for the the optimization algorithm, we chose the Adaptive Moment Estimator (ADAM) [25]....

    [...]

  • ...In our early experiments, ADAM converged much faster than stochastic gradient descent and NADAM [11]....

    [...]

Journal ArticleDOI
TL;DR: In this paper, the authors used reinforcement learning (DRLSTM) to perform active flow control in strong turbulent flows, and achieved a remarkable drag reduction of around 30% in a circular cylinder, accompanied by elongation of the recirculation bubble and reduction of turbulent fluctuations in the cylinder wake.
Abstract: Machine learning has recently become a promising technique in fluid mechanics, especially for active flow control (AFC) applications. A recent work [J. Fluid Mech. (2019), vol. 865, pp. 281-302] has demonstrated the feasibility and effectiveness of deep reinforcement learning (DRL) in performing AFC over a circular cylinder at $Re = 100$, i.e., in the laminar flow regime. As a follow-up study, we investigate the same AFC problem at an intermediate Reynolds number, i.e., $Re = 1000$, where the turbulence in the flow poses great challenges to the control. The results show that the DRL agent can still find effective control strategies, but requires much more episodes in the learning. A remarkable drag reduction of around $30\%$ is achieved, which is accompanied by elongation of the recirculation bubble and reduction of turbulent fluctuations in the cylinder wake. To our best knowledge, this study is the first successful application of DRL to AFC in weak turbulent conditions. It therefore sets a new milestone in progressing towards AFC in strong turbulent flows.

58 citations

Proceedings ArticleDOI
Zinan Lin1, Alankar Jain1, Chen Wang2, Giulia Fanti1, Vyas Sekar1 
TL;DR: This work explores if and how generative adversarial networks can be used to incentivize data sharing by enabling a generic framework for sharing synthetic datasets with minimal expert knowledge and designs a custom workflow called DoppelGANger, which achieves up to 43% better fidelity than baseline models.
Abstract: Limited data access is a longstanding barrier to data-driven research and development in the networked systems community. In this work, we explore if and how generative adversarial networks (GANs) can be used to incentivize data sharing by enabling a generic framework for sharing synthetic datasets with minimal expert knowledge. As a specific target, our focus in this paper is on time series datasets with metadata (e.g., packet loss rate measurements with corresponding ISPs). We identify key challenges of existing GAN approaches for such workloads with respect to fidelity (e.g., long-term dependencies, complex multidimensional relationships, mode collapse) and privacy (i.e., existing guarantees are poorly understood and can sacrifice fidelity). To improve fidelity, we design a custom workflow called DoppelGANger (DG) and demonstrate that across diverse real-world datasets (e.g., bandwidth measurements, cluster requests, web sessions) and use cases (e.g., structural characterization, predictive modeling, algorithm comparison), DG achieves up to 43% better fidelity than baseline models. Although we do not resolve the privacy problem in this work, we identify fundamental challenges with both classical notions of privacy and recent advances to improve the privacy properties of GANs, and suggest a potential roadmap for addressing these challenges. By shedding light on the promise and challenges, we hope our work can rekindle the conversation on workflows for data sharing.

58 citations

Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, the authors propose to slack the alignment of feature statistics in the BN layer to relax the constraint at the distribution level and design a layerwise enhancement to reinforce specific layers for different data samples.
Abstract: Quantization has emerged as one of the most prevalent approaches to compress and accelerate neural networks. Recently, data-free quantization has been widely studied as a practical and promising solution. It synthesizes data for calibrating the quantized model according to the batch normalization (BN) statistics of FP32 ones and significantly relieves the heavy dependency on real training data in traditional quantization methods. Unfortunately, we find that in practice, the synthetic data identically constrained by BN statistics suffers serious homogenization at both distribution level and sample level and further causes a significant performance drop of the quantized model. We propose Diverse Sample Generation (DSG) scheme to mitigate the adverse effects caused by homogenization. Specifically, we slack the alignment of feature statistics in the BN layer to relax the constraint at the distribution level and design a layerwise enhancement to reinforce specific layers for different data samples. Our DSG scheme is versatile and even able to be applied to the state-of-the-art post-training quantization method like AdaRound. We evaluate the DSG scheme on the large-scale image classification task and consistently obtain significant improvements over various network architectures and quantization methods, especially when quantized to lower bits (e.g., up to 22% improvement on W4A4). Moreover, benefiting from the enhanced diversity, models calibrated with synthetic data perform close to those calibrated with real data and even outperform them on W4A4.

58 citations

References
More filters
Proceedings Article
03 Dec 2012
TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overriding in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Proceedings Article
01 Jan 2014
TL;DR: A stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case is introduced.
Abstract: How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions is two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.

20,769 citations

Journal ArticleDOI
28 Jul 2006-Science
TL;DR: In this article, an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data is described.
Abstract: High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such "autoencoder" networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

16,717 citations

Journal ArticleDOI
TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

9,091 citations


"Adam: A Method for Stochastic Optim..." refers background or methods in this paper

  • ...Objectives may also have other sources of noise than data subsampling, such as dropout (Hinton et al., 2012b) regularization....

    [...]

  • ...SGD proved itself as an efficient and effective optimization method that was central in many machine learning success stories, such as recent advances in deep learning (Deng et al., 2013; Krizhevsky et al., 2012; Hinton & Salakhutdinov, 2006; Hinton et al., 2012a; Graves et al., 2013)....

    [...]

  • ...…the advantages of two recently popular methods: AdaGrad (Duchi et al., 2011), which works well with sparse gradients, and RMSProp (Tieleman & Hinton, 2012), which works well in on-line and non-stationary settings; important connections to these and other stochastic optimization methods are…...

    [...]

Proceedings Article
01 Jan 2010
TL;DR: Adaptive subgradient methods as discussed by the authors dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning, which allows us to find needles in haystacks in the form of very predictive but rarely seen features.
Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms.

7,244 citations

Trending Questions (1)
What is the Adam spectral sequence?

The Adam spectral sequence is not mentioned in the given text.