Mathematical Programming manuscript No.
(will be inserted by the editor)
Shai Shalev-Shwartz · Yoram Singer ·
Nathan Srebro · Andrew Cotter
Pegasos: Primal Estimated sub-GrAdient
SOlver for SVM
Abstract We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy ε is Õ(1/ε), where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require Ω(1/ε²) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is Õ(d/(λε)), where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an order-of-magnitude speedup over previous SVM learning methods.
Keywords SVM · Stochastic Gradient Descent
Shai Shalev-Shwartz
School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
E-mail: shais@cs.huji.ac.il
Yoram Singer
Google
E-mail: singer@google.com
Nathan Srebro
Toyota Technological Institute at Chicago
E-mail: nati@uchicago.edu
Andrew Cotter
Toyota Technological Institute at Chicago
E-mail: cotter@tti-c.org

1 Introduction
Support Vector Machines (SVMs) are an effective and popular classification learning tool [36,12]. The task of learning a support vector machine is typically cast as
a constrained quadratic programming problem. However, in its native form, it is
in fact an unconstrained empirical loss minimization with a penalty term for the
norm of the classifier that is being learned. Formally, given a training set S = {(x_i, y_i)}_{i=1}^m, where x_i ∈ ℝ^n and y_i ∈ {+1, −1}, we would like to find the minimizer of the problem

min_w  (λ/2) ‖w‖² + (1/m) ∑_{(x,y)∈S} ℓ(w; (x, y)) ,    (1)
where

ℓ(w; (x, y)) = max{0, 1 − y ⟨w, x⟩} ,    (2)
and ⟨u, v⟩ denotes the standard inner product between the vectors u and v. We denote the objective function of Eq. (1) by f(w). We say that an optimization method finds an ε-accurate solution ŵ if f(ŵ) ≤ min_w f(w) + ε. The standard SVM problem also includes an unregularized bias term. We omit the bias throughout the coming sections and revisit the incorporation of a bias term in Sec. 6.
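As a concrete illustration of Eqs. (1) and (2), the following sketch evaluates the hinge loss and the regularized objective f(w) in NumPy (the function names are our own, not part of the paper):

```python
import numpy as np

def hinge_loss(w, x, y):
    """Hinge loss of Eq. (2): max{0, 1 - y <w, x>}."""
    return max(0.0, 1.0 - y * np.dot(w, x))

def svm_objective(w, X, Y, lam):
    """SVM objective f(w) of Eq. (1): (lam/2)||w||^2 + average hinge loss."""
    m = len(Y)
    reg = 0.5 * lam * np.dot(w, w)
    avg_loss = sum(hinge_loss(w, X[i], Y[i]) for i in range(m)) / m
    return reg + avg_loss
```

For example, at w = 0 the regularization term vanishes and every example incurs a hinge loss of exactly 1, so f(0) = 1 regardless of the data.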
We describe and analyze in this paper a simple stochastic sub-gradient descent algorithm, which we call Pegasos, for solving Eq. (1). At each iteration, a single training example is chosen at random and used to estimate a sub-gradient of the objective, and a step with pre-determined step-size is taken in the opposite direction. We show that with high probability over the choice of the random examples, our algorithm finds an ε-accurate solution using only Õ(1/(λε)) iterations, while each iteration involves a single inner product between w and x. Put differently, the overall runtime required to obtain an ε-accurate solution is Õ(n/(λε)), where n is the dimensionality of w and x. Moreover, this runtime can be reduced to Õ(d/(λε)), where d is the number of non-zero features in each example x. Pegasos can also be used with non-linear kernels, as we describe in Sec. 4. We would like to emphasize that a solution is found in probability solely due to the randomization steps employed by the algorithm and not due to the data set. The data set is not assumed to be random, and the analysis holds for any data set S. Furthermore, the runtime does not depend on the number of training examples and thus our algorithm is especially suited for large datasets.
Before delving into the detailed description and analysis of Pegasos, we would like to draw connections to, and put our work in the context of, some of the more recent work on SVMs. For a more comprehensive and up-to-date overview of relevant work, see the references in the papers cited below, as well as the web site dedicated to kernel methods at http://www.kernel-machines.org . Due to the centrality of the SVM optimization problem, quite a few methods have been devised and analyzed. The different approaches can be roughly divided into the following categories.

Interior Point (IP) methods: IP methods (see for instance [7] and the references
therein) cast the SVM learning task as a quadratic optimization problem subject to
linear constraints. The constraints are replaced with a barrier function. The result
is a sequence of unconstrained problems which can be optimized very efficiently
using Newton or Quasi-Newton methods. The advantage of IP methods is that the
dependence on the accuracy is double logarithmic, namely, log(log(1/ε)). Alas, IP methods typically require run time which is cubic in the number of examples m. Moreover, the memory requirements of IP methods are O(m²), which renders a direct use of IP methods very difficult when the training set consists of many examples. It should be noted that there have been several attempts to reduce the complexity based on additional assumptions (see e.g. [15]). However, the dependence on m remains super-linear. In addition, while the focus of the paper is the
optimization problem cast by SVM, one needs to bear in mind that the optimiza-
tion problem is a proxy method for obtaining good classification error on unseen
examples. Achieving a very high accuracy in the optimization process is usually
unnecessary and does not translate to a significant increase in the generalization
accuracy. The time spent by IP methods for finding a single accurate solution may,
for instance, be better utilized for trying different regularization values.
Decomposition methods: To overcome the quadratic memory requirement of IP
methods, decomposition methods such as SMO [29] and SVM-Light [20] tackle
the dual representation of the SVM optimization problem, and employ an active
set of constraints thus working on a subset of dual variables. In the extreme case,
called row-action methods [8], the active set consists of a single constraint. While
algorithms in this family are fairly simple to implement and entertain general
asymptotic convergence properties [8], the time complexity of most of the algo-
rithms in this family is typically super linear in the training set size m. Moreover,
since decomposition methods find a feasible dual solution and their goal is to max-
imize the dual objective function, they often result in a rather slow convergence
rate to the optimum of the primal objective function. (See also the discussion
in [19].)
Primal optimization: Most existing approaches, including the methods discussed
above, focus on the dual of Eq. (1), especially when used in conjunction with
non-linear kernels. However, even when non-linear kernels are used, the Representer theorem [23] allows us to re-parametrize w as w = ∑_i α_i y_i x_i and cast the primal objective Eq. (1) as an unconstrained optimization problem with the variables α_1, . . . , α_m (see Sec. 4). Tackling the primal objective directly was studied,
for example, by Chapelle [10], who considered using smooth loss functions in-
stead of the hinge loss, in which case the optimization problem becomes a smooth
unconstrained optimization problem. Chapelle then suggested using various op-
timization approaches such as conjugate gradient descent and Newton’s method.
We take a similar approach here; however, we cope with the non-differentiability of the hinge loss directly by using sub-gradients instead of gradients. Another important distinction is that Chapelle views the optimization problem as a function of the variables α_i. In contrast, though Pegasos maintains the same set of variables, the optimization process is performed with respect to w; see Sec. 4 for details.
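To make this distinction concrete, here is a minimal sketch (our own illustration, with hypothetical function names) of the Representer-theorem re-parametrization: the weight vector is the combination w = ∑_i α_i y_i x_i, and a prediction ⟨w, x⟩ can be computed either from w directly or from the α_i using only inner products between examples:

```python
import numpy as np

def weight_from_alpha(alpha, X, Y):
    """Representer-theorem form: w = sum_i alpha_i * y_i * x_i."""
    return (alpha * Y) @ X

def predict_from_alpha(alpha, X, Y, x):
    """<w, x> computed from alpha using only inner products <x_i, x>."""
    return float(np.sum(alpha * Y * (X @ x)))
```

Both routes give the same value of ⟨w, x⟩; in a kernelized setting the inner products X @ x would simply be replaced by kernel evaluations, which is what the kernelized variant of Sec. 4 exploits.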
Stochastic gradient descent: The Pegasos algorithm is an application of a stochas-
tic sub-gradient method (see for example [25,34]). In the context of machine
learning problems, the efficiency of the stochastic gradient approach has been

studied in [26,1, 3,27,6, 5]. In particular, it has been claimed and experimentally
observed that, “Stochastic algorithms yield the best generalization performance
despite being the worst optimization algorithms”. This claim has recently received
formal treatment in [4,32].
Two concrete algorithms that are closely related to the Pegasos algorithm
and are also variants of stochastic sub-gradient methods are the NORMA algo-
rithm [24] and a stochastic gradient algorithm due to Zhang [37]. The main dif-
ference between Pegasos and these variants is in the procedure for setting the step
size. We elaborate on this issue in Sec. 7. The convergence rate given in [24]
implies that the number of iterations required to achieve an ε-accurate solution is O(1/(λε²)). This bound is inferior to the corresponding bound of Pegasos. The
analysis in [37] for the case of regularized loss shows that the squared Euclidean
distance to the optimal solution converges to zero but the rate of convergence de-
pends on the step size parameter. As we show in Sec. 7, tuning this parameter is
crucial to the success of the method. In contrast, Pegasos is virtually parameter
free. Another related recent work is Nesterov’s general primal-dual subgradient
method for the minimization of non-smooth functions [28]. Intuitively, the ideas
presented in [28] can be combined with the stochastic regime of Pegasos. We leave
this direction and other potential extensions of Pegasos for future research.
Online methods: Online learning methods are very closely related to stochas-
tic gradient methods, as they operate on only a single example at each iteration.
Moreover, many online learning rules, including the Perceptron rule, can be seen
as implementing a stochastic gradient step. Many such methods, including the
Perceptron and the Passive Aggressive method [11] also have strong connections
to the “margin” or norm of the predictor, though they do not directly minimize the
SVM objective. Nevertheless, online learning algorithms were proposed as fast
alternatives to SVMs (e.g. [16]). Such algorithms can be used to obtain a predic-
tor with low generalization error using an online-to-batch conversion scheme [9].
However, the conversion schemes do not necessarily yield an ε-accurate solution
to the original SVM problem and their performance is typically inferior to di-
rect batch optimizers. As noted above, Pegasos shares the simplicity and speed of
online learning algorithms, yet it is guaranteed to converge to the optimal SVM
solution.
Cutting Planes Approach: Recently, Joachims [21] proposed SVM-Perf, which
uses a cutting planes method to find a solution with accuracy ε in time O(md/(λε²)). This bound was later improved by Smola et al. [33] to O(md/(λε)). The complexity guarantee for Pegasos avoids the dependence on the data set size m. In addition, while SVM-Perf yields very significant improvements over decomposition methods for large data sets, our experiments (see Sec. 7) indicate that Pegasos is substantially faster than SVM-Perf.
2 The Pegasos Algorithm
As mentioned above, Pegasos performs stochastic gradient descent on the primal
objective Eq. (1) with a carefully chosen stepsize. We describe in this section the
core of the Pegasos procedure in detail and provide pseudo-code. We also present
a few variants of the basic algorithm and discuss a few implementation issues.

INPUT: S, λ, T
INITIALIZE: Set w_1 = 0
FOR t = 1, 2, . . . , T
  Choose i_t ∈ {1, . . . , |S|} uniformly at random.
  Set η_t = 1/(λt)
  If y_{i_t} ⟨w_t, x_{i_t}⟩ < 1, then:
    Set w_{t+1} ← (1 − η_t λ) w_t + η_t y_{i_t} x_{i_t}
  Else (if y_{i_t} ⟨w_t, x_{i_t}⟩ ≥ 1):
    Set w_{t+1} ← (1 − η_t λ) w_t
  [ Optional: w_{t+1} ← min{1, (1/√λ) / ‖w_{t+1}‖} w_{t+1} ]
OUTPUT: w_{T+1}
Fig. 1 The Pegasos Algorithm.
2.1 The Basic Pegasos Algorithm
On each iteration Pegasos operates as follows. Initially, we set w_1 to the zero vector. On iteration t of the algorithm, we first choose a random training example (x_{i_t}, y_{i_t}) by picking an index i_t ∈ {1, . . . , m} uniformly at random. We then replace the objective in Eq. (1) with an approximation based on the training example (x_{i_t}, y_{i_t}), yielding:

f(w; i_t) = (λ/2) ‖w‖² + ℓ(w; (x_{i_t}, y_{i_t})) .    (3)
We consider the sub-gradient of the above approximate objective, given by:

∇_t = λ w_t − 1[y_{i_t} ⟨w_t, x_{i_t}⟩ < 1] y_{i_t} x_{i_t} ,    (4)
where 1[y ⟨w, x⟩ < 1] is the indicator function which takes a value of one if its argument is true (w yields non-zero loss on the example (x, y)), and zero otherwise. We then update w_{t+1} ← w_t − η_t ∇_t using a step size of η_t = 1/(λt). Note that this update can be written as:

w_{t+1} ← (1 − 1/t) w_t + η_t 1[y_{i_t} ⟨w_t, x_{i_t}⟩ < 1] y_{i_t} x_{i_t} .    (5)

After a predetermined number T of iterations, we output the last iterate w_{T+1}. The pseudo-code of Pegasos is given in Fig. 1.
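The update rule of Eq. (5) translates almost line-for-line into code. The following is a sketch of the basic loop of Fig. 1 in NumPy, without the optional projection step (the function name and signature are ours):

```python
import numpy as np

def pegasos(X, Y, lam, T, seed=0):
    """Basic Pegasos: T stochastic sub-gradient steps with eta_t = 1/(lam*t)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for t in range(1, T + 1):
        i = int(rng.integers(m))           # pick i_t uniformly at random
        eta = 1.0 / (lam * t)              # step size eta_t = 1/(lambda t)
        if Y[i] * np.dot(w, X[i]) < 1:     # margin violated: loss term is active
            w = (1.0 - eta * lam) * w + eta * Y[i] * X[i]
        else:                              # zero loss: only shrink by the regularizer
            w = (1.0 - eta * lam) * w
    return w
```

Note that with η_t = 1/(λt) the shrinkage factor (1 − η_t λ) equals (1 − 1/t), exactly as in Eq. (5); in particular, the first iteration discards the initial weight vector entirely.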
2.2 Incorporating a Projection Step
The above description of Pegasos is a verbatim application of the stochastic gradient-descent method. A potential variation is the gradient-projection approach, where we limit the set of admissible solutions to the ball of radius 1/√λ. To enforce this property, we project w_t after each iteration onto this sphere by performing the update:

w_{t+1} ← min{1, (1/√λ) / ‖w_{t+1}‖} w_{t+1} .    (6)
