Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
Citations
Cites methods from "Accelerating Stochastic Gradient De..."
...Since our approach applies directly to gradient computations, it can be adapted to many other classical and more recent first-order optimization methods, such as NAG [45], Momentum [50], AdaGrad [17], or SVRG [33]....
Cites background from "Accelerating Stochastic Gradient De..."
...However, if high accuracy is needed, GD or its faster variants would prevail....
References
"Accelerating Stochastic Gradient De..." refers background or methods in this paper
...Two recent papers Le Roux et al. [2012], Shalev-Shwartz and Zhang [2012] proposed methods that achieve such a variance reduction effect for SGD, which leads to a linear convergence rate when ψi(w) is smooth and strongly convex. The method in Le Roux et al. [2012] was referred to as SAG (stochastic average gradient), and the method in Shalev-Shwartz and Zhang [2012] was referred to as SDCA....
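The variance-reduction effect this excerpt refers to is the core of the cited paper's SVRG method. A minimal sketch of an SVRG-style update, assuming a least-squares objective (the problem data, step size, and epoch count below are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))  # illustrative data
b = rng.normal(size=n)

def grad_i(w, i):
    # Gradient of the i-th term psi_i(w) = 0.5 * (a_i . w - b_i)^2
    return (A[i] @ w - b[i]) * A[i]

def full_grad(w):
    # Gradient of the averaged objective (1/n) * sum_i psi_i(w)
    return A.T @ (A @ w - b) / n

w = np.zeros(d)
eta = 0.01  # a constant step size, usable because the variance is reduced
for epoch in range(30):
    w_snap = w.copy()       # snapshot of the current iterate
    mu = full_grad(w_snap)  # one full gradient per epoch
    for _ in range(n):
        i = rng.integers(n)
        # Unbiased gradient estimate whose variance vanishes as both
        # w and w_snap approach the optimum:
        g = grad_i(w, i) - grad_i(w_snap, i) + mu
        w -= eta * g

print(np.linalg.norm(full_grad(w)))  # gradient norm after training
```

The estimate `g` is unbiased because the correction term `grad_i(w_snap, i) - mu` has mean zero over the random index, which is what permits the constant step size and linear convergence mentioned in the excerpt.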
...These methods are suitable for training convex linear prediction problems such as logistic regression or least squares regression, and in fact, SDCA is the method implemented in the popular lib-SVM package Hsieh et al. [2008]. However, both proposals require storage of all gradients (or dual variables)....
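The storage cost mentioned at the end of this excerpt can be made concrete with a SAG-style update, which must keep the most recent gradient of every example in an n-by-d table. This is an illustrative sketch on least squares (data and step size assumed), not the cited implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 4
A = rng.normal(size=(n, d))  # illustrative data
b = rng.normal(size=n)

def grad_i(w, i):
    # Gradient of the i-th least-squares term
    return (A[i] @ w - b[i]) * A[i]

# SAG's extra memory: the last gradient computed for each example,
# i.e. O(n * d) storage on top of the iterate itself.
grad_table = np.zeros((n, d))
grad_sum = grad_table.sum(axis=0)

w = np.zeros(d)
eta = 0.02  # illustrative step size
for _ in range(50 * n):
    i = rng.integers(n)
    g_new = grad_i(w, i)
    # Replace example i's stored gradient, keeping the running sum in sync.
    grad_sum += g_new - grad_table[i]
    grad_table[i] = g_new
    # Step along the average of all stored gradients.
    w -= eta * grad_sum / n
```

SVRG avoids this table entirely: instead of storing n gradients, it recomputes the two per-example gradients at each step from a stored snapshot, trading memory for computation.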
...As long as we pick ηt as a constant η < 1/L, we have linear convergence of O((1 − γ/L)^t) Nesterov [2004]. However, for SGD, due to the variance of random sampling, we generally need to choose ηt = O(1/t) and obtain a slower sub-linear convergence rate of O(1/t). This means that we have a trade-off of fast computation per iteration and slow convergence for SGD versus slow computation per iteration and fast convergence for gradient descent. Although the fast computation means it can reach an approximate solution relatively quickly, and thus has been proposed by various researchers for large scale problems Zhang [2004], Shalev-Shwartz et al. [2007] (also see Leon Bottou’s Webpage http://leon....
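The trade-off this excerpt describes — a constant step with linear convergence for full gradient descent versus an O(1/t) step with sublinear convergence for SGD — can be checked numerically on a small least-squares problem (data, step sizes, and iteration counts here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
A = rng.normal(size=(n, d))  # illustrative data
b = rng.normal(size=n)

def f(w):
    return 0.5 * np.mean((A @ w - b) ** 2)

w_star = np.linalg.lstsq(A, b, rcond=None)[0]  # exact minimizer

# Full gradient descent: constant step, linear convergence,
# but every iteration touches all n examples.
w_gd = np.zeros(d)
for t in range(200):
    w_gd -= 0.2 * (A.T @ (A @ w_gd - b) / n)

# SGD: one example per step (cheap), but the sampling variance forces a
# decaying step eta_t = O(1/t), giving only sublinear convergence.
w_sgd = np.zeros(d)
for t in range(200 * n):  # same budget of per-example gradient evaluations
    i = rng.integers(n)
    eta_t = 1.0 / (t + 20)
    w_sgd -= eta_t * (A[i] @ w_sgd - b[i]) * A[i]

print(f(w_gd) - f(w_star), f(w_sgd) - f(w_star))
```

With a matched budget of per-example gradient evaluations, the full-gradient iterate typically ends up far closer to the optimum, while SGD reaches a decent approximate solution early but then improves only at the O(1/t) rate.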