ADADELTA: An Adaptive Learning Rate Method
Citations
...Other stochastic optimization methods include vSGD (Schaul et al., 2012), AdaDelta (Zeiler, 2012) and the natural Newton method from Roux & Fitzgibbon (2010), all setting stepsizes by estimating curvature from first-order information....
...We use a minibatch stochastic gradient descent (SGD) algorithm together with Adadelta (Zeiler, 2012) to train each model....
...Adadelta (Zeiler, 2012) was used to automatically adapt the learning rate of each parameter (ε = 10⁻⁶ and ρ = 0.95)....
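
The update these excerpts refer to is ADADELTA's per-dimension rule: a decaying average of squared gradients is kept for each parameter, and the step is rescaled by a matching average of squared updates, so no global learning rate is needed. A minimal NumPy sketch with the quoted hyperparameters (ρ = 0.95, ε = 10⁻⁶); the function name and calling convention are illustrative, not from the paper:

    import numpy as np

    def adadelta_step(x, grad, Eg2, Edx2, rho=0.95, eps=1e-6):
        # Accumulate gradient: E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * g_t^2
        Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
        # Per-dimension step: dx = -(RMS[dx]_{t-1} / RMS[g]_t) * g_t,
        # where RMS[z] = sqrt(E[z^2] + eps)
        dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
        # Accumulate updates: E[dx^2]_t = rho * E[dx^2]_{t-1} + (1 - rho) * dx^2
        Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
        return x + dx, Eg2, Edx2

A minibatch training loop, as in the first excerpt above, would call this once per minibatch gradient, carrying Eg2 and Edx2 across steps.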
...RmsProp (Schaul, Zhang, & LeCun, 2013; Tieleman & Hinton, 2012) can speed up first order gradient descent methods (Sections 5.5, 5.6.2); compare vario-η (Neuneier & Zimmermann, 1996), Adagrad (Duchi, Hazan, & Singer, 2011) and Adadelta (Zeiler, 2012)....
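
For comparison with the methods listed in this excerpt: RmsProp (Tieleman & Hinton, 2012) keeps only the squared-gradient average and, unlike Adadelta, still requires a global step size. A minimal sketch under the same conventions as the Adadelta code above (names and default values are illustrative):

    import numpy as np

    def rmsprop_step(x, grad, Eg2, lr=1e-3, rho=0.9, eps=1e-8):
        # E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * g_t^2
        Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
        # Divide a global step size by the running RMS of the gradient
        x = x - lr * grad / np.sqrt(Eg2 + eps)
        return x, Eg2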
References
...One method of speeding up training per-dimension is the momentum method [2]....
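
The momentum method accumulates a per-dimension velocity, so dimensions whose gradients keep the same sign build up speed while oscillating dimensions are damped. A minimal sketch (the step size and momentum coefficient are illustrative):

    def momentum_step(x, grad, v, lr=0.01, mu=0.9):
        # v_t = mu * v_{t-1} - lr * g_t
        v = mu * v - lr * grad
        # x_{t+1} = x_t + v_t
        return x + v, v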
...A recent first order method called ADAGRAD [3] has shown remarkably good results on large scale learning tasks in a distributed environment [4]....
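
ADAGRAD divides a global step size, per dimension, by the square root of the sum of all past squared gradients, so frequently updated dimensions take smaller steps. A minimal sketch (the eps guard against division by zero is a common implementation detail, not part of the original formulation):

    import numpy as np

    def adagrad_step(x, grad, G, lr=0.01, eps=1e-8):
        # G_t = G_{t-1} + g_t^2, summed over the whole history
        G = G + grad ** 2
        # dx = -(lr / sqrt(G_t)) * g_t
        x = x - lr * grad / (np.sqrt(G) + eps)
        return x, G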
...The network was trained using the distributed system of [4] in which a centralized parameter server accumulates the gradient information reported back from several replicas of the neural network....
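
A toy sketch of the centralized pattern this excerpt describes: replicas push gradients, the server folds them into one central copy of the parameters, and replicas pull fresh parameters back. All names here are hypothetical, and the plain asynchronous-SGD update stands in for whatever optimizer the system of [4] actually runs:

    import numpy as np

    class ParameterServer:
        def __init__(self, x0, lr=0.01):
            self.x = np.array(x0, dtype=float)  # central parameter copy
            self.lr = lr

        def push_gradient(self, grad):
            # A replica reports its gradient; the server applies it to
            # the central parameters as it arrives (asynchronous updates).
            self.x -= self.lr * np.asarray(grad)

        def pull_parameters(self):
            # Replicas fetch the latest parameters before computing
            # their next gradient.
            return self.x.copy()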