Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Citations
30,843 citations
Cites methods from "Adaptive Subgradient Methods for On..."
...A model employing Batch Normalization can be trained using batch gradient descent, or Stochastic Gradient Descent with a mini-batch size m > 1, or with any of its variants such as Adagrad (Duchi et al., 2011)....
[...]
...Stochastic gradient descent (SGD) has proved to be an effective way of training deep networks, and SGD variants such as momentum (Sutskever et al., 2013) and Adagrad (Duchi et al., 2011) have been used to achieve state of the art performance....
[...]
30,558 citations
Cites methods from "Adaptive Subgradient Methods for On..."
...For all our experiments, we set xmax = 100, α = 3/4, and train the model using AdaGrad (Duchi et al., 2011), stochastically sampling nonzero elements from X , with initial learning rate of 0.05....
[...]
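The excerpt above (from the GloVe paper) trains with AdaGrad at an initial learning rate of 0.05. As a minimal sketch of the per-coordinate AdaGrad update from Duchi et al. (2011) — the function name, the toy quadratic objective, and the small epsilon stabilizer are illustrative assumptions, not taken from the cited papers:

```python
import numpy as np

def adagrad_update(w, grad, G, lr=0.05, eps=1e-8):
    """One per-coordinate AdaGrad step (sketch of Duchi et al., 2011).

    w    -- parameter vector
    grad -- (sub)gradient at w
    G    -- running sum of squared gradients, updated in place
    lr   -- initial learning rate (0.05, as in the excerpt above)
    eps  -- small constant to avoid division by zero (an assumption,
            playing the role of the paper's delta term)
    """
    G += grad ** 2                        # accumulate squared gradients
    w -= lr * grad / (np.sqrt(G) + eps)   # per-coordinate scaled step
    return w, G

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
G = np.zeros_like(w)
for _ in range(500):
    w, G = adagrad_update(w, 2 * w, G)
```

The point of the per-coordinate scaling is that frequently updated (dense) coordinates get their effective step size shrunk quickly, while rare (sparse) coordinates keep larger steps — which is why AdaGrad suits the sparse co-occurrence statistics sampled from X in the excerpt.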
19,543 citations
Cites methods from "Adaptive Subgradient Methods for On..."
...4% top-1 accuracy on the ILSVRC CLS-LOC val dataset, and fine-tune it for SSD using batch-normalization [11] and Adagrad [1] with initial learning rate 0....
[...]
References
49,639 citations
"Adaptive Subgradient Methods for On..." refers background or methods in this paper
...2 Image Ranking ImageNet (Deng et al., 2009) consists of images organized according to the nouns in the WordNet hierarchy, where each noun is associated on average with more than 500 images collected from the web. We selected 15,000 important nouns from the hierarchy and conducted a large scale image ranking task for each noun. This approach is identical to the task tackled by Grangier and Bengio (2008) using the Passive-Aggressive algorithm....
[...]
...Experiments We performed experiments with several real world datasets with different characteristics: the ImageNet image database (Deng et al., 2009), the Reuters RCV1 text classification dataset (Lewis et al....
[...]
12,671 citations
Additional excerpts
...(3) implies that for all x ∈ X and φ′(x_{t+1}) ∈ ∂φ(x_{t+1}) (Bertsekas, 1999): ⟨x − x_{t+1}, η f′(x_t) + ∇ψ_t(x_{t+1}) − ∇ψ_t(x_t) + η φ′(x_{t+1})⟩ ≥ 0....
[...]