Adam: A Method for Stochastic Optimization
Citations
[...]
13,994 citations
Cites methods from "Adam: A Method for Stochastic Optimization"
...We additionally found training to be very sensitive to the Adam epsilon term, and in some cases we obtained better performance or improved stability after tuning it....
[...]
...BERT is optimized with Adam (Kingma and Ba, 2015) using the following parameters: β1 = 0.9, β2 = 0.999, ε = 1e-6 and L2 weight decay of 0.01....
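A minimal sketch of this optimizer configuration, assuming PyTorch. torch.optim.Adam folds weight_decay into the gradient as a classic L2 penalty, matching the "L2 weight decay" described in the excerpt; the learning rate and the placeholder module are illustrative assumptions, since the excerpt does not specify them.

```python
import torch

model = torch.nn.Linear(768, 768)  # illustrative placeholder standing in for BERT

# Adam as described in the excerpt: beta1=0.9, beta2=0.999, eps=1e-6,
# and L2 weight decay of 0.01. torch.optim.Adam applies weight_decay as
# an L2 penalty on the gradient (unlike AdamW's decoupled decay).
# lr=1e-4 is an illustrative placeholder; the excerpt gives no value.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    eps=1e-6,           # the epsilon term the first excerpt reports tuning
    weight_decay=0.01,
)
```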
[...]
12,690 citations
Cites methods from "Adam: A Method for Stochastic Optimization"
...This justifies the choice of Adam as the optimizer used to pre-train ResNets on JFT. Note that the absolute numbers are lower than those reported by Kolesnikov et al. (2020), since we pre-train only for 7 epochs, not 30....
[...]
...ResNets are typically trained with SGD and our use of Adam as optimizer is quite unconventional....
[...]
...We train all models, including ResNets, using Adam (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.999, a batch size of 4096 and apply a high weight decay of 0.1, which we found to be useful for transfer of all models (Appendix D.1 shows that, in contrast to common practices, Adam works slightly better...
[...]
...Namely, we compare the fine-tuning performance of two ResNets – 50x1 and 152x2 – pre-trained on JFT with SGD and Adam....
[...]
...We use Adam, with a base learning rate of 2·10⁻⁴, warmup of 10k steps and cosine learning rate decay....
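A minimal sketch of this learning-rate schedule, assuming PyTorch. The base learning rate and 10k-step warmup follow the excerpt, and the high weight decay of 0.1 from the earlier excerpt is applied via torch.optim.Adam's L2-style weight_decay; the total step count, betas, and placeholder module are illustrative assumptions.

```python
import math
import torch

model = torch.nn.Linear(768, 1000)  # illustrative placeholder module
base_lr = 2e-4                      # base learning rate from the excerpt
warmup_steps = 10_000               # "warmup of 10k steps"
total_steps = 100_000               # illustrative; the excerpt does not say

# Adam with the high weight decay of 0.1 described above
# (applied as an L2 penalty, as torch.optim.Adam does).
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr,
                             betas=(0.9, 0.999), weight_decay=0.1)

def warmup_cosine(step: int) -> float:
    """Scale factor: linear warmup to 1.0, then cosine decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

# In the training loop, call optimizer.step() then scheduler.step().
```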
[...]
11,958 citations
Cites methods from "Adam: A Method for Stochastic Optimization"
...We use minibatch SGD and apply the Adam solver [29]....
[...]
References
"Adam: A Method for Stochastic Optim..." refers background or methods in this paper
...Objectives may also have other sources of noise than data subsampling, such as dropout (Hinton et al., 2012b) regularization....
[...]
...SGD proved itself as an efficient and effective optimization method that was central in many machine learning success stories, such as recent advances in deep learning (Deng et al., 2013; Krizhevsky et al., 2012; Hinton & Salakhutdinov, 2006; Hinton et al., 2012a; Graves et al., 2013)....
[...]
...…the advantages of two recently popular methods: AdaGrad (Duchi et al., 2011), which works well with sparse gradients, and RMSProp (Tieleman & Hinton, 2012), which works well in on-line and non-stationary settings; important connections to these and other stochastic optimization methods are…...
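For reference, the update rule of Adam itself, which combines an RMSProp-style exponentially decaying average of squared gradients with a momentum-like first-moment estimate and bias correction (here g_t is the stochastic gradient at step t, and ε is the epsilon term the first citation excerpt above reports tuning):

```latex
\begin{align*}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t
    && \text{(biased first-moment estimate)}\\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2
    && \text{(biased second-moment estimate, as in RMSProp)}\\
\hat m_t &= m_t/(1-\beta_1^t), \qquad \hat v_t = v_t/(1-\beta_2^t)
    && \text{(bias correction)}\\
\theta_t &= \theta_{t-1} - \alpha\,\hat m_t/\bigl(\sqrt{\hat v_t}+\epsilon\bigr)
    && \text{(parameter update)}
\end{align*}
```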
[...]