Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
Citations
Cites methods from "Accelerating Stochastic Gradient De..."
...Since our approach applies directly to gradient computations, it can be adapted to many other classical and more recent first-order optimization methods, such as NAG [45], Momentum [50], AdaGrad [17], or SVRG [33]....
Cites background from "Accelerating Stochastic Gradient De..."
...However, if high accuracy is needed, GD or its faster variants would prevail....
References
"Accelerating Stochastic Gradient De..." refers background or methods in this paper
...Two recent papers Le Roux et al. [2012], Shalev-Shwartz and Zhang [2012] proposed methods that achieve such a variance reduction effect for SGD, which leads to a linear convergence rate when ψi(w) is smooth and strongly convex. The method in Le Roux et al. [2012] was referred to as SAG (stochastic average gradient), and the method in Shalev-Shwartz and Zhang [2012] was referred to as SDCA....
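The variance-reduction effect this excerpt refers to is the core of the cited paper's SVRG method. A minimal sketch of an SVRG-style update, assuming a least-squares objective (the problem data, step size, and epoch count below are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))  # illustrative data
b = rng.normal(size=n)

def grad_i(w, i):
    # Gradient of the i-th term psi_i(w) = 0.5 * (a_i . w - b_i)^2
    return (A[i] @ w - b[i]) * A[i]

def full_grad(w):
    # Gradient of the averaged objective (1/n) * sum_i psi_i(w)
    return A.T @ (A @ w - b) / n

w = np.zeros(d)
eta = 0.01  # a constant step size, usable because the variance is reduced
for epoch in range(30):
    w_snap = w.copy()       # snapshot of the current iterate
    mu = full_grad(w_snap)  # one full gradient per epoch
    for _ in range(n):
        i = rng.integers(n)
        # Unbiased gradient estimate whose variance vanishes as both
        # w and w_snap approach the optimum:
        g = grad_i(w, i) - grad_i(w_snap, i) + mu
        w -= eta * g

print(np.linalg.norm(full_grad(w)))  # gradient norm after training
```

The estimate `g` is unbiased because the correction term `grad_i(w_snap, i) - mu` has mean zero over the random index, which is what permits the constant step size and linear convergence mentioned in the excerpt.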
...These methods are suitable for training convex linear prediction problems such as logistic regression or least squares regression, and in fact, SDCA is the method implemented in the popular lib-SVM package Hsieh et al. [2008]. However, both proposals require storage of all gradients (or dual variables)....
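The storage cost mentioned at the end of this excerpt can be made concrete with a SAG-style update, which must keep the most recent gradient of every example in an n-by-d table. This is an illustrative sketch on least squares (data and step size assumed), not the cited implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 4
A = rng.normal(size=(n, d))  # illustrative data
b = rng.normal(size=n)

def grad_i(w, i):
    # Gradient of the i-th least-squares term
    return (A[i] @ w - b[i]) * A[i]

# SAG's extra memory: the last gradient computed for each example,
# i.e. O(n * d) storage on top of the iterate itself.
grad_table = np.zeros((n, d))
grad_sum = grad_table.sum(axis=0)

w = np.zeros(d)
eta = 0.02  # illustrative step size
for _ in range(50 * n):
    i = rng.integers(n)
    g_new = grad_i(w, i)
    # Replace example i's stored gradient, keeping the running sum in sync.
    grad_sum += g_new - grad_table[i]
    grad_table[i] = g_new
    # Step along the average of all stored gradients.
    w -= eta * grad_sum / n
```

SVRG avoids this table entirely: instead of storing n gradients, it recomputes the two per-example gradients at each step from a stored snapshot, trading memory for computation.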
...As long as we pick ηt as a constant η < 1/L, we have linear convergence of O((1 − γ/L)^t) Nesterov [2004]. However, for SGD, due to the variance of random sampling, we generally need to choose ηt = O(1/t) and obtain a slower sub-linear convergence rate of O(1/t). This means that we have a trade-off of fast computation per iteration and slow convergence for SGD versus slow computation per iteration and fast convergence for gradient descent. Although the fast computation means it can reach an approximate solution relatively quickly, and thus has been proposed by various researchers for large scale problems Zhang [2004], Shalev-Shwartz et al. [2007] (also see Leon Bottou’s Webpage http://leon....
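The trade-off this excerpt describes — a constant step with linear convergence for full gradient descent versus an O(1/t) step with sublinear convergence for SGD — can be checked numerically on a small least-squares problem (data, step sizes, and iteration counts here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
A = rng.normal(size=(n, d))  # illustrative data
b = rng.normal(size=n)

def f(w):
    return 0.5 * np.mean((A @ w - b) ** 2)

w_star = np.linalg.lstsq(A, b, rcond=None)[0]  # exact minimizer

# Full gradient descent: constant step, linear convergence,
# but every iteration touches all n examples.
w_gd = np.zeros(d)
for t in range(200):
    w_gd -= 0.2 * (A.T @ (A @ w_gd - b) / n)

# SGD: one example per step (cheap), but the sampling variance forces a
# decaying step eta_t = O(1/t), giving only sublinear convergence.
w_sgd = np.zeros(d)
for t in range(200 * n):  # same budget of per-example gradient evaluations
    i = rng.integers(n)
    eta_t = 1.0 / (t + 20)
    w_sgd -= eta_t * (A[i] @ w_sgd - b[i]) * A[i]

print(f(w_gd) - f(w_star), f(w_sgd) - f(w_star))
```

With a matched budget of per-example gradient evaluations, the full-gradient iterate typically ends up far closer to the optimum, while SGD reaches a decent approximate solution early but then improves only at the O(1/t) rate.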