Adam: A Method for Stochastic Optimization
Citations
13,994 citations
Cites methods from "Adam: A Method for Stochastic Optim..."
...We additionally found training to be very sensitive to the Adam epsilon term, and in some cases we obtained better performance or improved stability after tuning it....
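The sensitivity to epsilon reported in this excerpt follows from where ε sits in the Adam update. A minimal single-parameter sketch of the update rule from the paper (the function name and default learning rate are illustrative, not from the excerpt):

```python
import math

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    eps is the term the excerpt above reports training to be sensitive to:
    it caps the effective step lr * m_hat / (sqrt(v_hat) + eps) when the
    second-moment estimate v_hat is close to zero.
    """
    m = beta1 * m + (1 - beta1) * g        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)           # bias correction; t counts from 1
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v
```

With the defaults, the first step moves θ by roughly lr regardless of the gradient's magnitude; raising eps shrinks steps for parameters whose v̂ is small, which is why tuning it can change stability.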
...BERT is optimized with Adam (Kingma and Ba, 2015) using the following parameters: β1 = 0.9, β2 = 0.999, ε = 1e-6 and L2 weight decay of 0.01....
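Although the excerpt calls it "L2 weight decay", BERT's reference optimizer applies the decay decoupled from the adaptive step (AdamW-style), not through the gradient. A sketch of one such update with the quoted hyperparameters; the learning rate is an assumed placeholder the excerpt does not state:

```python
import math

def adam_wd_step(theta, g, m, v, t, lr=1e-4,
                 beta1=0.9, beta2=0.999, eps=1e-6, wd=0.01):
    """One Adam update with decoupled weight decay (AdamW-style sketch).

    beta1, beta2, eps and wd are the values the BERT excerpt reports;
    lr=1e-4 is only a placeholder.
    """
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The decay term is added to the update itself, not to the gradient,
    # so it is not rescaled by the adaptive 1/sqrt(v_hat) factor.
    return theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta), m, v
```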
12,690 citations
Cites methods from "Adam: A Method for Stochastic Optim..."
...This justifies the choice of Adam as the optimizer used to pre-train ResNets on JFT. Note that the absolute numbers are lower than those reported by Kolesnikov et al. (2020), since we pre-train only for 7 epochs, not 30....
...ResNets are typically trained with SGD and our use of Adam as optimizer is quite unconventional....
...We train all models, including ResNets, using Adam (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.999, a batch size of 4096 and apply a high weight decay of 0.1, which we found to be useful for transfer of all models (Appendix D.1 shows that, in contrast to common practices, Adam works slightly better…...
...Namely, we compare the fine-tuning performance of two ResNets – 50x1 and 152x2 – pre-trained on JFT with SGD and Adam....
...We use Adam, with a base learning rate of 2 ·10−4, warmup of 10k steps and cosine learning rate decay....
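The schedule in this excerpt (base learning rate 2·10⁻⁴, 10k warmup steps, cosine decay) can be sketched as a pure function of the step index; the total step count is an assumed placeholder the excerpt does not give:

```python
import math

def lr_at(step, base_lr=2e-4, warmup_steps=10_000, total_steps=100_000):
    """Linear warmup to base_lr, then cosine decay to zero.

    base_lr and warmup_steps match the excerpt; total_steps=100k is assumed.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps               # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```

Warmup at the start keeps Adam's early steps small while its second-moment estimates are still unreliable, which is the usual motivation for pairing the two.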
11,958 citations
Cites methods from "Adam: A Method for Stochastic Optim..."
...We use minibatch SGD and apply the Adam solver [29]....
References
3,794 citations
"Adam: A Method for Stochastic Optim..." refers methods in this paper
...We examine the sparse feature problem using IMDB movie review dataset from (Maas et al., 2011)....
2,273 citations
"Adam: A Method for Stochastic Optim..." refers background in this paper
...implies that when the data features are sparse and gradients are bounded, the summation term can be much smaller than its upper bound, ∑_{i=1}^d ‖g_{1:T,i}‖₂ ≤ dG∞√T and ∑_{i=1}^d √(T v̂_{T,i}) ≤ dG∞√T, in particular if the class of function and data features are in the form of section 1.2 in (Duchi et al., 2011)....