Mathematical Programming manuscript No.
(will be inserted by the editor)
Shai Shalev-Shwartz · Yoram Singer ·
Nathan Srebro · Andrew Cotter
Pegasos: Primal Estimated sub-GrAdient
SOlver for SVM
Abstract We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy ε is Õ(1/ε), where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require Ω(1/ε²) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is Õ(d/(λε)), where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an order-of-magnitude speedup over previous SVM learning methods.
Keywords SVM · Stochastic Gradient Descent
Shai Shalev-Shwartz
School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
E-mail: shais@cs.huji.ac.il
Yoram Singer
Google
E-mail: singer@google.com
Nathan Srebro
Toyota Technological Institute at Chicago
E-mail: nati@uchicago.edu
Andrew Cotter
Toyota Technological Institute at Chicago
E-mail: cotter@tti-c.org

1 Introduction
Support Vector Machines (SVMs) are an effective and popular classification learning tool [36,12]. The task of learning a support vector machine is typically cast as
a constrained quadratic programming problem. However, in its native form, it is
in fact an unconstrained empirical loss minimization with a penalty term for the
norm of the classifier that is being learned. Formally, given a training set S = {(x_i, y_i)}_{i=1}^m, where x_i ∈ ℝ^n and y_i ∈ {+1, −1}, we would like to find the minimizer of the problem

min_w  (λ/2) ‖w‖² + (1/m) ∑_{(x,y)∈S} ℓ(w; (x, y)) ,    (1)
where

ℓ(w; (x, y)) = max{0, 1 − y ⟨w, x⟩} ,    (2)
and ⟨u, v⟩ denotes the standard inner product between the vectors u and v. We denote the objective function of Eq. (1) by f(w). We say that an optimization method finds an ε-accurate solution ŵ if f(ŵ) ≤ min_w f(w) + ε. The standard SVM problem also includes an unregularized bias term. We omit the bias throughout the coming sections and revisit the incorporation of a bias term in Sec. 6.
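As a concrete illustration of Eqs. (1) and (2), the following sketch evaluates the hinge loss and the regularized objective f(w) in NumPy (the function names are our own, not part of the paper):

```python
import numpy as np

def hinge_loss(w, x, y):
    """Hinge loss of Eq. (2): max{0, 1 - y <w, x>}."""
    return max(0.0, 1.0 - y * np.dot(w, x))

def svm_objective(w, X, Y, lam):
    """SVM objective f(w) of Eq. (1): (lam/2)||w||^2 + average hinge loss."""
    m = len(Y)
    reg = 0.5 * lam * np.dot(w, w)
    avg_loss = sum(hinge_loss(w, X[i], Y[i]) for i in range(m)) / m
    return reg + avg_loss
```

For example, at w = 0 the regularization term vanishes and every example incurs a hinge loss of exactly 1, so f(0) = 1 regardless of the data.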
We describe and analyze in this paper a simple stochastic sub-gradient descent algorithm, which we call Pegasos, for solving Eq. (1). At each iteration, a single training example is chosen at random and used to estimate a sub-gradient of the objective, and a step with pre-determined step-size is taken in the opposite direction. We show that with high probability over the choice of the random examples, our algorithm finds an ε-accurate solution using only Õ(1/(λε)) iterations, while each iteration involves a single inner product between w and x. Put differently, the overall runtime required to obtain an ε-accurate solution is Õ(n/(λε)), where n is the dimensionality of w and x. Moreover, this runtime can be reduced to Õ(d/(λε)), where d is the number of non-zero features in each example x. Pegasos can also be used with non-linear kernels, as we describe in Sec. 4. We would like to emphasize that a solution is found in probability solely due to the randomization steps employed by the algorithm and not due to the data set. The data set is not assumed to be random, and the analysis holds for any data set S. Furthermore, the runtime does not depend on the number of training examples and thus our algorithm is especially suited for large datasets.
Before delving into the detailed description and analysis of Pegasos, we would like to draw connections to, and put our work in the context of, some of the more recent work on SVMs. For a more comprehensive and up-to-date overview of relevant work, see the references in the papers cited below, as well as the web site dedicated to kernel methods at http://www.kernel-machines.org . Due to the centrality of the SVM optimization problem, quite a few methods have been devised and analyzed. The different approaches can be roughly divided into the following categories.

Interior Point (IP) methods: IP methods (see for instance [7] and the references
therein) cast the SVM learning task as a quadratic optimization problem subject to
linear constraints. The constraints are replaced with a barrier function. The result
is a sequence of unconstrained problems which can be optimized very efficiently
using Newton or Quasi-Newton methods. The advantage of IP methods is that the
dependence on the accuracy is double logarithmic, namely, log(log(1/ε)). Alas, IP methods typically require run time which is cubic in the number of examples m. Moreover, the memory requirements of IP methods are O(m²), which renders a direct use of IP methods very difficult when the training set consists of many examples. It should be noted that there have been several attempts to reduce the complexity based on additional assumptions (see e.g. [15]). However, the dependence on m remains super-linear. In addition, while the focus of the paper is the
optimization problem cast by SVM, one needs to bear in mind that the optimiza-
tion problem is a proxy method for obtaining good classification error on unseen
examples. Achieving a very high accuracy in the optimization process is usually
unnecessary and does not translate to a significant increase in the generalization
accuracy. The time spent by IP methods for finding a single accurate solution may,
for instance, be better utilized for trying different regularization values.
Decomposition methods: To overcome the quadratic memory requirement of IP
methods, decomposition methods such as SMO [29] and SVM-Light [20] tackle
the dual representation of the SVM optimization problem, and employ an active
set of constraints thus working on a subset of dual variables. In the extreme case,
called row-action methods [8], the active set consists of a single constraint. While
algorithms in this family are fairly simple to implement and entertain general
asymptotic convergence properties [8], the time complexity of most of the algo-
rithms in this family is typically super linear in the training set size m. Moreover,
since decomposition methods find a feasible dual solution and their goal is to max-
imize the dual objective function, they often result in a rather slow convergence
rate to the optimum of the primal objective function. (See also the discussion
in [19].)
Primal optimization: Most existing approaches, including the methods discussed
above, focus on the dual of Eq. (1), especially when used in conjunction with
non-linear kernels. However, even when non-linear kernels are used, the Representer theorem [23] allows us to re-parametrize w as w = ∑_i α_i y_i x_i and cast the primal objective Eq. (1) as an unconstrained optimization problem with the variables α_1, . . . , α_m (see Sec. 4). Tackling the primal objective directly was studied,
for example, by Chapelle [10], who considered using smooth loss functions in-
stead of the hinge loss, in which case the optimization problem becomes a smooth
unconstrained optimization problem. Chapelle then suggested using various op-
timization approaches such as conjugate gradient descent and Newton’s method.
We take a similar approach here; however, we cope with the non-differentiability of the hinge loss directly by using sub-gradients instead of gradients. Another important distinction is that Chapelle views the optimization problem as a function of the variables α_i. In contrast, though Pegasos maintains the same set of variables, the optimization process is performed with respect to w; see Sec. 4 for details.
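To make this distinction concrete, here is a minimal sketch (our own illustration, with hypothetical function names) of the Representer-theorem re-parametrization: the weight vector is the combination w = ∑_i α_i y_i x_i, and a prediction ⟨w, x⟩ can be computed either from w directly or from the α_i using only inner products between examples:

```python
import numpy as np

def weight_from_alpha(alpha, X, Y):
    """Representer-theorem form: w = sum_i alpha_i * y_i * x_i."""
    return (alpha * Y) @ X

def predict_from_alpha(alpha, X, Y, x):
    """<w, x> computed from alpha using only inner products <x_i, x>."""
    return float(np.sum(alpha * Y * (X @ x)))
```

Both routes give the same value of ⟨w, x⟩; in a kernelized setting the inner products X @ x would simply be replaced by kernel evaluations, which is what the kernelized variant of Sec. 4 exploits.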
Stochastic gradient descent: The Pegasos algorithm is an application of a stochas-
tic sub-gradient method (see for example [25,34]). In the context of machine
learning problems, the efficiency of the stochastic gradient approach has been

studied in [26,1, 3,27,6, 5]. In particular, it has been claimed and experimentally
observed that, “Stochastic algorithms yield the best generalization performance
despite being the worst optimization algorithms”. This claim has recently received
formal treatment in [4,32].
Two concrete algorithms that are closely related to the Pegasos algorithm
and are also variants of stochastic sub-gradient methods are the NORMA algo-
rithm [24] and a stochastic gradient algorithm due to Zhang [37]. The main dif-
ference between Pegasos and these variants is in the procedure for setting the step
size. We elaborate on this issue in Sec. 7. The convergence rate given in [24]
implies that the number of iterations required to achieve an ε-accurate solution is O(1/(λε²)). This bound is inferior to the corresponding bound of Pegasos. The
analysis in [37] for the case of regularized loss shows that the squared Euclidean
distance to the optimal solution converges to zero but the rate of convergence de-
pends on the step size parameter. As we show in Sec. 7, tuning this parameter is
crucial to the success of the method. In contrast, Pegasos is virtually parameter
free. Another related recent work is Nesterov’s general primal-dual subgradient
method for the minimization of non-smooth functions [28]. Intuitively, the ideas
presented in [28] can be combined with the stochastic regime of Pegasos. We leave
this direction and other potential extensions of Pegasos for future research.
Online methods: Online learning methods are very closely related to stochas-
tic gradient methods, as they operate on only a single example at each iteration.
Moreover, many online learning rules, including the Perceptron rule, can be seen
as implementing a stochastic gradient step. Many such methods, including the
Perceptron and the Passive Aggressive method [11] also have strong connections
to the “margin” or norm of the predictor, though they do not directly minimize the
SVM objective. Nevertheless, online learning algorithms were proposed as fast
alternatives to SVMs (e.g. [16]). Such algorithms can be used to obtain a predic-
tor with low generalization error using an online-to-batch conversion scheme [9].
However, the conversion schemes do not necessarily yield an ε-accurate solution
to the original SVM problem and their performance is typically inferior to di-
rect batch optimizers. As noted above, Pegasos shares the simplicity and speed of
online learning algorithms, yet it is guaranteed to converge to the optimal SVM
solution.
Cutting Planes Approach: Recently, Joachims [21] proposed SVM-Perf, which
uses a cutting planes method to find a solution with accuracy ε in time O(md/(λε²)). This bound was later improved by Smola et al. [33] to O(md/(λε)). The complexity guarantee for Pegasos avoids the dependence on the data set size m. In addition, while SVM-Perf yields very significant improvements over decomposition methods for large data sets, our experiments (see Sec. 7) indicate that Pegasos is substantially faster than SVM-Perf.
2 The Pegasos Algorithm
As mentioned above, Pegasos performs stochastic gradient descent on the primal
objective Eq. (1) with a carefully chosen stepsize. We describe in this section the
core of the Pegasos procedure in detail and provide pseudo-code. We also present
a few variants of the basic algorithm and discuss a few implementation issues.

INPUT: S, λ, T
INITIALIZE: Set w_1 = 0
FOR t = 1, 2, . . . , T
  Choose i_t ∈ {1, . . . , |S|} uniformly at random.
  Set η_t = 1/(λt)
  If y_{i_t} ⟨w_t, x_{i_t}⟩ < 1, then:
    Set w_{t+1} ← (1 − η_t λ) w_t + η_t y_{i_t} x_{i_t}
  Else (if y_{i_t} ⟨w_t, x_{i_t}⟩ ≥ 1):
    Set w_{t+1} ← (1 − η_t λ) w_t
  [ Optional: w_{t+1} ← min{1, (1/√λ) / ‖w_{t+1}‖} w_{t+1} ]
OUTPUT: w_{T+1}
Fig. 1 The Pegasos Algorithm.
2.1 The Basic Pegasos Algorithm
On each iteration Pegasos operates as follows. Initially, we set w_1 to the zero vector. On iteration t of the algorithm, we first choose a random training example (x_{i_t}, y_{i_t}) by picking an index i_t ∈ {1, . . . , m} uniformly at random. We then replace the objective in Eq. (1) with an approximation based on the training example (x_{i_t}, y_{i_t}), yielding:

f(w; i_t) = (λ/2) ‖w‖² + ℓ(w; (x_{i_t}, y_{i_t})) .    (3)
We consider the sub-gradient of the above approximate objective, given by:

∇_t = λ w_t − 1[y_{i_t} ⟨w_t, x_{i_t}⟩ < 1] y_{i_t} x_{i_t} ,    (4)
where 1[y ⟨w, x⟩ < 1] is the indicator function which takes a value of one if its argument is true (w yields non-zero loss on the example (x, y)), and zero otherwise. We then update w_{t+1} ← w_t − η_t ∇_t using a step size of η_t = 1/(λt). Note that this update can be written as:

w_{t+1} ← (1 − 1/t) w_t + η_t 1[y_{i_t} ⟨w_t, x_{i_t}⟩ < 1] y_{i_t} x_{i_t} .    (5)

After a predetermined number T of iterations, we output the last iterate w_{T+1}. The pseudo-code of Pegasos is given in Fig. 1.
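The update rule of Eq. (5) translates almost line-for-line into code. The following is a sketch of the basic loop of Fig. 1 in NumPy, without the optional projection step (the function name and signature are ours):

```python
import numpy as np

def pegasos(X, Y, lam, T, seed=0):
    """Basic Pegasos: T stochastic sub-gradient steps with eta_t = 1/(lam*t)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for t in range(1, T + 1):
        i = int(rng.integers(m))           # pick i_t uniformly at random
        eta = 1.0 / (lam * t)              # step size eta_t = 1/(lambda t)
        if Y[i] * np.dot(w, X[i]) < 1:     # margin violated: loss term is active
            w = (1.0 - eta * lam) * w + eta * Y[i] * X[i]
        else:                              # zero loss: only shrink by the regularizer
            w = (1.0 - eta * lam) * w
    return w
```

Note that with η_t = 1/(λt) the shrinkage factor (1 − η_t λ) equals (1 − 1/t), exactly as in Eq. (5); in particular, the first iteration discards the initial weight vector entirely.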
2.2 Incorporating a Projection Step
The above description of Pegasos is a verbatim application of the stochastic gradient-descent method. A potential variation is the gradient-projection approach, where we limit the set of admissible solutions to the ball of radius 1/√λ. To enforce this property, we project w_t after each iteration onto this sphere by performing the update:

w_{t+1} ← min{1, (1/√λ) / ‖w_{t+1}‖} w_{t+1} .    (6)
