Adam: A Method for Stochastic Optimization
Citations
[...]
13,994 citations
Cites methods from "Adam: A Method for Stochastic Optim..."
...We additionally found training to be very sensitive to the Adam epsilon term, and in some cases we obtained better performance or improved stability after tuning it....
[...]
...BERT is optimized with Adam (Kingma and Ba, 2015) using the following parameters: β1 = 0.9, β2 = 0.999, ε = 1e-6 and L2 weight decay of 0.01....
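The update the BERT excerpt describes can be sketched as a single Adam step with those quoted hyperparameters (β1 = 0.9, β2 = 0.999, ε = 1e-6, L2 weight decay 0.01). This is a minimal illustrative sketch of the Adam rule from Kingma & Ba (2015), not BERT's actual training code; the function name and the learning rate are assumptions.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-4,
              beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.01):
    """One Adam update on parameters `theta` (illustrative sketch).

    t is the 1-based step count; m and v carry the moment estimates
    across calls. L2 weight decay is folded into the gradient, as in
    the excerpt above.
    """
    grad = grad + weight_decay * theta           # L2 penalty on the weights
    m = beta1 * m + (1 - beta1) * grad           # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Note the role of ε in the denominator: it bounds the effective step when `v_hat` is tiny, which is why the first excerpt above reports training being sensitive to its value.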
[...]
12,690 citations
Cites methods from "Adam: A Method for Stochastic Optim..."
...This justifies the choice of Adam as the optimizer used to pre-train ResNets on JFT. Note that the absolute numbers are lower than those reported by Kolesnikov et al. (2020), since we pre-train only for 7 epochs, not 30....
[...]
...ResNets are typically trained with SGD and our use of Adam as optimizer is quite unconventional....
[...]
...We train all models, including ResNets, using Adam (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.999, a batch size of 4096 and apply a high weight decay of 0.1, which we found to be useful for transfer of all models (Appendix D.1 shows that, in contrast to common practices, Adam works slightly better…...
[...]
...Namely, we compare the fine-tuning performance of two ResNets – 50x1 and 152x2 – pre-trained on JFT with SGD and Adam....
[...]
...We use Adam, with a base learning rate of 2·10⁻⁴, warmup of 10k steps and cosine learning rate decay....
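The schedule in this excerpt (linear warmup to a base rate of 2·10⁻⁴ over 10k steps, then cosine decay) can be sketched as a pure function of the step count. The total step count is an assumed parameter for illustration; the paper's actual value is not given in the excerpt.

```python
import math

def lr_at(step, base_lr=2e-4, warmup_steps=10_000, total_steps=100_000):
    """Linear warmup followed by cosine decay to zero (illustrative sketch)."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr over the warmup period.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```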
[...]
11,958 citations
Cites methods from "Adam: A Method for Stochastic Optim..."
...We use minibatch SGD and apply the Adam solver [29]....
[...]
References
6,899 citations
"Adam: A Method for Stochastic Optim..." refers background or methods in this paper
...Objectives may also have other sources of noise than data subsampling, such as dropout (Hinton et al., 2012b) regularization....
[...]
...SGD proved itself as an efficient and effective optimization method that was central in many machine learning success stories, such as recent advances in deep learning (Deng et al., 2013; Krizhevsky et al., 2012; Hinton & Salakhutdinov, 2006; Hinton et al., 2012a; Graves et al., 2013)....
[...]
6,189 citations
"Adam: A Method for Stochastic Optim..." refers methods in this paper
...Other stochastic optimization methods include vSGD (Schaul et al., 2012), AdaDelta (Zeiler, 2012) and the natural Newton method from Roux & Fitzgibbon (2010), all setting stepsizes by estimating curvature from first-order information....
[...]
4,157 citations
"Adam: A Method for Stochastic Optim..." refers background in this paper
...
• Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
• Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. arXiv, pages 1–14, 2014.
• Timothy Dozat. Incorporating Nesterov Momentum into Adam. ICLR Workshop, (1):2013–2016, 2016.
• Diederik P. Kingma and Jimmy Lei Ba. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, pages 1–13, 2015.
• Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR (translated as Soviet Math. Dokl.), 269:543–547, 1983.
• Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks: the official journal of the International Neural Network Society, 12(1):145–151, 1999.
• Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701, 2012.
• Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization....
[...]
4,121 citations
"Adam: A Method for Stochastic Optim..." refers background or result in this paper
...Decaying β1,t towards zero is important in our theoretical analysis and also matches previous empirical findings; e.g., Sutskever et al. (2013) suggest that reducing the momentum coefficient at the end of training can improve convergence....
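A concrete form of the decaying momentum coefficient this excerpt refers to is a geometric schedule, β1,t = β1·λ^(t−1), as used in some convergence analyses of adaptive methods (e.g., Reddi et al., 2018). The sketch below is purely illustrative; the constants are assumptions, not values from the citing paper.

```python
def beta1_t(t, beta1=0.9, lam=0.999):
    """Geometrically decaying momentum coefficient (illustrative sketch).

    Starts at beta1 for t = 1 and decays towards zero, so momentum
    contributes less late in training, matching the empirical finding
    quoted above.
    """
    return beta1 * lam ** (t - 1)
```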
[...]