
Pegasos: primal estimated sub-gradient solver for SVM

TL;DR: A simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines, which is particularly well suited for large text classification problems, and demonstrates an order-of-magnitude speedup over previous SVM learning methods.
Abstract: We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy $${\epsilon}$$ is $${\tilde{O}(1 / \epsilon)}$$, where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require $${\Omega(1 / \epsilon^2)}$$ iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is $${\tilde{O}(d/(\lambda \epsilon))}$$, where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an order-of-magnitude speedup over previous SVM learning methods.

Summary (2 min read)

1 Introduction

  • Support Vector Machines (SVMs) are an effective and popular classification learning tool [36,12].
  • Alas, IP methods typically require run time which is cubic in the number of examples m.
  • Most existing approaches, including the methods discussed above, focus on the dual of Eq. (1), especially when used in conjunction with non-linear kernels.
  • In contrast, Pegasos is virtually parameter free.
  • Nevertheless, online learning algorithms were proposed as fast alternatives to SVMs (e.g. [16]).

2 The Pegasos Algorithm

  • As mentioned above, Pegasos performs stochastic gradient descent on the primal objective Eq. (1) with a carefully chosen stepsize.
  • The authors describe in this section the core of the Pegasos procedure in detail and provide pseudo-code.
  • The authors also present a few variants of the basic algorithm and discuss a few implementation issues.
  • To underscore the difference between the fully deterministic case and the stochastic case, the authors refer to the subsamples in the latter case as mini-batches and call the process mini-batch iterates.
  • After w is updated, the stored norm ν needs to be updated, which can again be done in time O(d) as before; a sketch of this bookkeeping appears after this list.
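A minimal sketch of the bookkeeping behind this bullet is given below, under the assumption that w is stored implicitly as a scaled vector w = a·v together with a cached norm (one common way to realize the O(d) update the bullet mentions); the class and method names are illustrative, not from the paper.

import numpy as np

class ScaledVector:
    # Represent w = a * v so that scaling w is O(1) and adding a sparse
    # example with d non-zeros is O(d).  The squared norm of w is cached so
    # the optional projection step can be applied cheaply.
    def __init__(self, dim):
        self.v = np.zeros(dim)   # base vector
        self.a = 1.0             # global scale, w = a * v
        self.sq_norm = 0.0       # cached ||w||^2

    def scale(self, c):
        # w <- c * w : only the scalar and the cached norm change, O(1).
        if c == 0.0:
            self.v[:] = 0.0
            self.a, self.sq_norm = 1.0, 0.0
        else:
            self.a *= c
            self.sq_norm *= c * c

    def add_sparse(self, indices, values, coef):
        # w <- w + coef * x for a sparse x given as (indices, values), O(d).
        for j, xj in zip(indices, values):
            old = self.a * self.v[j]
            new = old + coef * xj
            self.sq_norm += new * new - old * old
            self.v[j] = new / self.a

    def dot_sparse(self, indices, values):
        # <w, x> for a sparse x, O(d).
        return self.a * sum(self.v[j] * xj for j, xj in zip(indices, values))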

3 Analysis

  • In this section the authors analyze the convergence properties of Pegasos.
  • The authors start by bounding the average instantaneous objective of the algorithm relative to the average instantaneous objective of the optimal solution.
  • Thus, to prove the theorem it suffices to show that the conditions stated in Lemma 1 hold.
  • The above lemma tells us that on average after two attempts the authors are likely to find a good solution.

4 Using Mercer kernels

  • One of the main benefits of SVMs is that they can be used with kernels rather than with direct access to the feature vectors x.
  • It is now easy to implement the Pegasos algorithm by maintaining the vector α; a sketch of this kernelized variant appears after this list.
  • Since the iterates wt remain as before (just their representation changes), the guarantees on the accuracy after a number of iterations are still valid.
  • Concretely, the Representer theorem guarantees that the optimal solution of Eq. (17) is spanned by the training instances, i.e. it is of the form w = ∑_{i=1}^m α[i] φ(x_i).
  • Interestingly, Chapelle also proposes preconditioning the gradients w.r.t. α by the kernel matrix, which effectively amounts to taking gradients w.r.t. w, as the authors do here.
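The kernelized variant mentioned above can be sketched by keeping one counter α[i] per training example, recording how many times example i was selected while suffering non-zero loss, so that the implicit weight vector at iteration t is (1/(λt)) ∑_j α[j] y_j φ(x_j). This is a hedged sketch: the Gaussian kernel and the function names are illustrative choices, not prescribed by the paper.

import numpy as np

def gaussian_kernel(x, z, gamma=1.0):
    # Illustrative Mercer kernel; any valid kernel K(x, z) could be used.
    return np.exp(-gamma * np.linalg.norm(x - z) ** 2)

def kernelized_pegasos(X, y, lam, T, kernel=gaussian_kernel, seed=0):
    # Maintain alpha[i]: how often example i was picked with non-zero loss.
    rng = np.random.default_rng(seed)
    m = len(y)
    alpha = np.zeros(m)
    for t in range(1, T + 1):
        i = rng.integers(m)
        # Decision value of example i under the implicit w_t.
        decision = sum(alpha[j] * y[j] * kernel(X[i], X[j])
                       for j in range(m) if alpha[j] != 0) / (lam * t)
        if y[i] * decision < 1:
            alpha[i] += 1
    # The classifier is sign((1/(lam*T)) * sum_j alpha[j] * y[j] * K(., x_j)).
    return alpha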

5 Other prediction problems and loss functions

  • So far, the authors focused on the SVM formulation for binary classification using the hinge-loss.
  • The generality of these assumptions implies that the authors can apply Pegasos with any loss function which satisfies these requirements.
  • The log-loss version of the multiclass loss is convex as well, with a bounded sub-gradient whose norm is at most 2 max_{y′} ‖φ(x, y′)‖.
  • Therefore, |Y| is exponential in the length of the sequence.
  • Based on the above two properties, the authors now show explicitly how to calculate a sub-gradient for several loss functions; a sketch for the multiclass case appears after this list.
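As one concrete instance of the recipe in the last bullet, the sketch below returns a sub-gradient of the multiclass hinge loss max_{y′} [Δ(y, y′) + ⟨w, φ(x, y′)⟩ − ⟨w, φ(x, y)⟩], using the rule (also quoted in the FAQ below) that the gradient of a maximizing branch of a max is a sub-gradient of the max. The feature map φ and label loss Δ are placeholders supplied by the caller, not names from the paper.

import numpy as np

def multiclass_hinge_subgradient(w, phi, delta, x, y, labels):
    # Sub-gradient of  max_{y'} [ delta(y, y') + <w, phi(x, y')> - <w, phi(x, y)> ].
    # By the max-of-functions rule, phi(x, y_hat) - phi(x, y) is a valid
    # sub-gradient for any maximizing label y_hat.
    scores = [delta(y, yp) + w @ phi(x, yp) - w @ phi(x, y) for yp in labels]
    y_hat = labels[int(np.argmax(scores))]
    return phi(x, y_hat) - phi(x, y)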

6 Incorporating a bias term

  • In many applications, the weight vector w is augmented with a bias term which is a scalar, typically denoted as b.
  • Once the constant feature is added, the rest of the algorithm remains intact, thus the bias term is not explicitly introduced; a sketch of this augmentation appears after this list.
  • A third method entertains the advantages of the two methods above at the price of a more complex algorithm that is applicable only for large batch sizes (large values of k), but not for the basic Pegasos algorithm (with k = 1).
  • The problem however is how to find a sub-gradient of g(w;At), as g(w;At) is defined through a minimization problem over b.
  • If At is large enough, it might be possible to use more involved measure-concentration tools to show that the expectation of f(w;At) is close enough to f(w;S) so as to still obtain fast convergence properties.
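The constant-feature approach mentioned above (adding a feature that is always 1 so that the last coordinate of w plays the role of the bias b) fits in a few lines; note that the bias is then regularized together with w, unlike the standard SVM bias term. This is a sketch, not code from the paper.

import numpy as np

def add_bias_feature(X, value=1.0):
    # Append a constant feature to every example; the last coordinate of w
    # then acts as the bias term b, and the rest of Pegasos is unchanged.
    ones = np.full((X.shape[0], 1), value)
    return np.hstack([X, ones])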

7 Experiments

  • In this section the authors present experimental results that demonstrate the merits of their algorithm.
  • The authors chose this threshold for each dataset such that a primal suboptimality less than ε guarantees a classification error on test data which is at most 1.1 times the test data classification error at the optimum.
  • It is interesting to note that the performance of Pegasos does not depend on the number of examples but rather on the value of λ.
  • All of the implementations use the same sparse representation for vectors, so the amount of time which it takes to perform a single kernel evaluation should, for each dataset, be roughly the same across all four algorithms.
  • The authors can see that, for three values of k, all significantly greater than 100, the experiments with the largest mini-batch size made the least progress while performing the same amount of computation.

8 Conclusions

  • The authors described and analyzed a simple and effective algorithm for approximately minimizing the objective function of SVM.
  • The authors derived fast rate of convergence results and experimented with the algorithm.
  • The authors' empirical results indicate that, for linear kernels, Pegasos achieves state-of-the-art results despite, or possibly due to, its simplicity.
  • The authors would like to thank Léon Bottou for useful discussions and suggestions, and Thorsten Joachims and Léon Bottou for help with the experiments.
  • Part of this work was done while SS and NS were visiting IBM research labs, Haifa, Israel.


Shai Shalev-Shwartz · Yoram Singer · Nathan Srebro · Andrew Cotter
Pegasos: Primal Estimated sub-GrAdient SOlver for SVM
Abstract We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy ε is Õ(1/ε), where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require Ω(1/ε²) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is Õ(d/(λε)), where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an order-of-magnitude speedup over previous SVM learning methods.
Keywords SVM · Stochastic Gradient Descent
Shai Shalev-Shwartz
School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
E-mail: shais@cs.huji.ac.il
Yoram Singer
Google
E-mail: singer@google.com
Nathan Srebro
Toyota Technological Institute at Chicago
E-mail: nati@uchicago.edu
Andrew Cotter
Toyota Technological Institute at Chicago
E-mail: cotter@tti-c.org

1 Introduction
Support Vector Machines (SVMs) are an effective and popular classification learning tool [36,12]. The task of learning a support vector machine is typically cast as a constrained quadratic programming problem. However, in its native form, it is in fact an unconstrained empirical loss minimization with a penalty term for the norm of the classifier that is being learned. Formally, given a training set S = {(x_i, y_i)}_{i=1}^m, where x_i ∈ ℝ^n and y_i ∈ {+1, −1}, we would like to find the minimizer of the problem

$$\min_{w} \;\frac{\lambda}{2}\,\|w\|^2 \;+\; \frac{1}{m}\sum_{(x,y)\in S} \ell(w;(x,y)) \,, \qquad (1)$$

where

$$\ell(w;(x,y)) \;=\; \max\{0,\; 1 - y\,\langle w, x\rangle\} \,, \qquad (2)$$

and ⟨u, v⟩ denotes the standard inner product between the vectors u and v. We denote the objective function of Eq. (1) by f(w). We say that an optimization method finds an ε-accurate solution ŵ if f(ŵ) ≤ min_w f(w) + ε. The standard SVM problem also includes an unregularized bias term. We omit the bias throughout the coming sections and revisit the incorporation of a bias term in Sec. 6.
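To make Eq. (1) and Eq. (2) concrete, here is a small sketch (not from the paper) that evaluates the hinge loss and the regularized objective f(w) for a data set stored as a NumPy matrix; the function names are illustrative.

import numpy as np

def hinge_loss(w, x, y):
    # Eq. (2): l(w; (x, y)) = max{0, 1 - y <w, x>}
    return max(0.0, 1.0 - y * np.dot(w, x))

def svm_objective(w, X, y, lam):
    # Eq. (1): f(w) = (lam/2) ||w||^2 + (1/m) * sum_i l(w; (x_i, y_i))
    m = X.shape[0]
    avg_loss = sum(hinge_loss(w, X[i], y[i]) for i in range(m)) / m
    return 0.5 * lam * np.dot(w, w) + avg_loss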
We describe and analyze in this paper a simple stochastic sub-gradient descent algorithm, which we call Pegasos, for solving Eq. (1). At each iteration, a single training example is chosen at random and used to estimate a sub-gradient of the objective, and a step with pre-determined step-size is taken in the opposite direction. We show that with high probability over the choice of the random examples, our algorithm finds an ε-accurate solution using only Õ(1/(λε)) iterations, while each iteration involves a single inner product between w and x. Put differently, the overall runtime required to obtain an ε-accurate solution is Õ(n/(λε)), where n is the dimensionality of w and x. Moreover, this runtime can be reduced to Õ(d/(λε)), where d is the number of non-zero features in each example x. Pegasos can also be used with non-linear kernels, as we describe in Sec. 4. We would like to emphasize that a solution is found in probability solely due to the randomization steps employed by the algorithm and not due to the data set. The data set is not assumed to be random, and the analysis holds for any data set S. Furthermore, the runtime does not depend on the number of training examples and thus our algorithm is especially suited for large datasets.
Before delving into the detailed description and analysis of Pegasos, we
would like to draw connections to and put our work in context of some of the
more recent work on SVM. For a more comprehensive and up-to-date overview
of relevant work see the references in the papers cited below as well as the web
site dedicated to kernel methods at http://www.kernel-machines.org . Due to the
centrality of the SVM optimization problem, quite a few methods were devised
and analyzed. The different approaches can be roughly divided into the following
categories.

Interior Point (IP) methods: IP methods (see for instance [7] and the references
therein) cast the SVM learning task as a quadratic optimization problem subject to
linear constraints. The constraints are replaced with a barrier function. The result
is a sequence of unconstrained problems which can be optimized very efficiently
using Newton or Quasi-Newton methods. The advantage of IP methods is that the
dependence on the accuracy is double logarithmic, namely, log(log(1/ε)). Alas, IP methods typically require run time which is cubic in the number of examples m. Moreover, the memory requirements of IP methods are O(m²), which renders
a direct use of IP methods very difficult when the training set consists of many
examples. It should be noted that there have been several attempts to reduce the
complexity based on additional assumptions (see e.g. [15]). However, the depen-
dence on m remains super linear. In addition, while the focus of the paper is the
optimization problem cast by SVM, one needs to bear in mind that the optimiza-
tion problem is a proxy method for obtaining good classification error on unseen
examples. Achieving a very high accuracy in the optimization process is usually
unnecessary and does not translate to a significant increase in the generalization
accuracy. The time spent by IP methods for finding a single accurate solution may,
for instance, be better utilized for trying different regularization values.
Decomposition methods: To overcome the quadratic memory requirement of IP
methods, decomposition methods such as SMO [29] and SVM-Light [20] tackle
the dual representation of the SVM optimization problem, and employ an active
set of constraints thus working on a subset of dual variables. In the extreme case,
called row-action methods [8], the active set consists of a single constraint. While
algorithms in this family are fairly simple to implement and entertain general
asymptotic convergence properties [8], the time complexity of most of the algo-
rithms in this family is typically super linear in the training set size m. Moreover,
since decomposition methods find a feasible dual solution and their goal is to max-
imize the dual objective function, they often result in a rather slow convergence
rate to the optimum of the primal objective function. (See also the discussion
in [19].)
Primal optimization: Most existing approaches, including the methods discussed
above, focus on the dual of Eq. (1), especially when used in conjunction with
non-linear kernels. However, even when non-linear kernels are used, the Representer theorem [23] allows us to re-parametrize w as w = ∑_i α_i y_i x_i and cast the primal objective Eq. (1) as an unconstrained optimization problem with the variables α_1, . . . , α_m (see Sec. 4). Tackling the primal objective directly was studied,
for example, by Chapelle [10], who considered using smooth loss functions in-
stead of the hinge loss, in which case the optimization problem becomes a smooth
unconstrained optimization problem. Chapelle then suggested using various op-
timization approaches such as conjugate gradient descent and Newton’s method.
We take a similar approach here, however we cope with the non-differentiability
of the hinge-loss directly by using sub-gradients instead of gradients. Another im-
portant distinction is that Chapelle views the optimization problem as a function of
the variables α_i. In contrast, though Pegasos maintains the same set of variables,
the optimization process is performed with respect to w, see Sec. 4 for details.
Stochastic gradient descent: The Pegasos algorithm is an application of a stochas-
tic sub-gradient method (see for example [25,34]). In the context of machine
learning problems, the efficiency of the stochastic gradient approach has been

studied in [26,1, 3,27,6, 5]. In particular, it has been claimed and experimentally
observed that, “Stochastic algorithms yield the best generalization performance
despite being the worst optimization algorithms”. This claim has recently received
formal treatment in [4,32].
Two concrete algorithms that are closely related to the Pegasos algorithm
and are also variants of stochastic sub-gradient methods are the NORMA algo-
rithm [24] and a stochastic gradient algorithm due to Zhang [37]. The main dif-
ference between Pegasos and these variants is in the procedure for setting the step
size. We elaborate on this issue in Sec. 7. The convergence rate given in [24]
implies that the number of iterations required to achieve an ε-accurate solution is O(1/(λε)²). This bound is inferior to the corresponding bound of Pegasos. The
analysis in [37] for the case of regularized loss shows that the squared Euclidean
distance to the optimal solution converges to zero but the rate of convergence de-
pends on the step size parameter. As we show in Sec. 7, tuning this parameter is
crucial to the success of the method. In contrast, Pegasos is virtually parameter
free. Another related recent work is Nesterov’s general primal-dual subgradient
method for the minimization of non-smooth functions [28]. Intuitively, the ideas
presented in [28] can be combined with the stochastic regime of Pegasos. We leave
this direction and other potential extensions of Pegasos for future research.
Online methods: Online learning methods are very closely related to stochas-
tic gradient methods, as they operate on only a single example at each iteration.
Moreover, many online learning rules, including the Perceptron rule, can be seen
as implementing a stochastic gradient step. Many such methods, including the
Perceptron and the Passive Aggressive method [11] also have strong connections
to the “margin” or norm of the predictor, though they do not directly minimize the
SVM objective. Nevertheless, online learning algorithms were proposed as fast
alternatives to SVMs (e.g. [16]). Such algorithms can be used to obtain a predic-
tor with low generalization error using an online-to-batch conversion scheme [9].
However, the conversion schemes do not necessarily yield an ε-accurate solution
to the original SVM problem and their performance is typically inferior to di-
rect batch optimizers. As noted above, Pegasos shares the simplicity and speed of
online learning algorithms, yet it is guaranteed to converge to the optimal SVM
solution.
Cutting Planes Approach: Recently, Joachims [21] proposed SVM-Perf, which
uses a cutting planes method to find a solution with accuracy ε in time O(md/(λε²)). This bound was later improved by Smola et al [33] to O(md/(λε)). The complex-
ity guarantee for Pegasos avoids the dependence on the data set size m. In addi-
tion, while SVM-Perf yields very significant improvements over decomposition
methods for large data sets, our experiments (see Sec. 7) indicate that Pegasos is
substantially faster than SVM-Perf.
2 The Pegasos Algorithm
As mentioned above, Pegasos performs stochastic gradient descent on the primal
objective Eq. (1) with a carefully chosen stepsize. We describe in this section the
core of the Pegasos procedure in detail and provide pseudo-code. We also present
a few variants of the basic algorithm and discuss a few implementation issues.

INPUT: S, λ, T
INITIALIZE: Set w_1 = 0
FOR t = 1, 2, . . . , T
    Choose i_t ∈ {1, . . . , |S|} uniformly at random.
    Set η_t = 1/(λt)
    If y_{i_t} ⟨w_t, x_{i_t}⟩ < 1, then:
        Set w_{t+1} ← (1 − η_t λ) w_t + η_t y_{i_t} x_{i_t}
    Else (if y_{i_t} ⟨w_t, x_{i_t}⟩ ≥ 1):
        Set w_{t+1} ← (1 − η_t λ) w_t
    [ Optional: w_{t+1} ← min{ 1, (1/√λ) / ‖w_{t+1}‖ } w_{t+1} ]
OUTPUT: w_{T+1}

Fig. 1 The Pegasos Algorithm.
2.1 The Basic Pegasos Algorithm

On each iteration Pegasos operates as follows. Initially, we set w_1 to the zero vector. On iteration t of the algorithm, we first choose a random training example (x_{i_t}, y_{i_t}) by picking an index i_t ∈ {1, . . . , m} uniformly at random. We then replace the objective in Eq. (1) with an approximation based on the training example (x_{i_t}, y_{i_t}), yielding:

$$f(w; i_t) \;=\; \frac{\lambda}{2}\,\|w\|^2 + \ell(w; (x_{i_t}, y_{i_t})) \,. \qquad (3)$$
We consider the sub-gradient of the above approximate objective, given by:

$$\nabla_t \;=\; \lambda\, w_t \;-\; \mathbb{1}[y_{i_t}\langle w_t, x_{i_t}\rangle < 1]\; y_{i_t} x_{i_t} \,, \qquad (4)$$

where 𝟙[y ⟨w, x⟩ < 1] is the indicator function which takes a value of one if its argument is true (w yields non-zero loss on the example (x, y)), and zero otherwise. We then update w_{t+1} ← w_t − η_t ∇_t using a step size of η_t = 1/(λt). Note that this update can be written as:

$$w_{t+1} \;\leftarrow\; \Big(1 - \tfrac{1}{t}\Big) w_t \;+\; \eta_t\, \mathbb{1}[y_{i_t}\langle w_t, x_{i_t}\rangle < 1]\; y_{i_t} x_{i_t} \,. \qquad (5)$$

After a predetermined number T of iterations, we output the last iterate w_{T+1}. The pseudo-code of Pegasos is given in Fig. 1.
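For readers who prefer running code to pseudo-code, the following is a minimal NumPy sketch of the basic loop of Fig. 1, without the optional projection step discussed next in Sec. 2.2; the dense-vector representation and function name are implementation choices, not part of the paper.

import numpy as np

def pegasos(X, y, lam, T, seed=0):
    # Basic Pegasos (Fig. 1): stochastic sub-gradient descent on Eq. (1)
    # with step size eta_t = 1/(lam * t).  X is (m, n); y has entries in {+1, -1}.
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)                       # w_1 = 0
    for t in range(1, T + 1):
        i = rng.integers(m)               # choose i_t uniformly at random
        eta = 1.0 / (lam * t)             # eta_t = 1/(lam t)
        if y[i] * np.dot(w, X[i]) < 1:    # non-zero hinge loss on example i_t
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1 - eta * lam) * w
    return w                              # output w_{T+1}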
2.2 Incorporating a Projection Step
The above description of Pegasos is a verbatim application of the stochastic gradient-descent method. A potential variation is the gradient-projection approach, where we limit the set of admissible solutions to the ball of radius 1/√λ. To enforce this property, we project w_t after each iteration onto this sphere by performing the update:

$$w_{t+1} \;\leftarrow\; \min\Big\{1,\; \frac{1/\sqrt{\lambda}}{\|w_{t+1}\|}\Big\}\, w_{t+1} \,. \qquad (6)$$
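The projection of Eq. (6) onto the ball of radius 1/√λ can be appended to each iteration of the sketch above with a helper such as the following (again an illustrative sketch, not the authors' implementation).

import numpy as np

def project_to_ball(w, lam):
    # Eq. (6): w <- min{1, (1/sqrt(lam)) / ||w||} * w
    norm = np.linalg.norm(w)
    radius = 1.0 / np.sqrt(lam)
    return w if norm <= radius else (radius / norm) * w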

Citations
Journal ArticleDOI
TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.
Abstract: We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI--SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.

10,501 citations


Cites background from "Pegasos: primal estimated sub-gradi..."

  • ...For linear SVMs, a learning rate of 1/t has been shown to work well [37]....


Book
24 Aug 2012
TL;DR: This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach, and is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
Abstract: Today's Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package--PMTK (probabilistic modeling toolkit)--that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.

8,059 citations


Cites methods from "Pegasos: primal estimated sub-gradi..."

  • ...A specific instance of this is the Pegasos algorithm (Shalev-Shwartz et al. 2007), which stands for “primal estimated sub-gradient solver for SVM”....


Journal ArticleDOI
TL;DR: dlib-ml contains an extensible linear algebra toolkit with built in BLAS support, and implementations of algorithms for performing inference in Bayesian networks and kernel-based methods for classification, regression, clustering, anomaly detection, and feature ranking.
Abstract: There are many excellent toolkits which provide support for developing machine learning software in Python, R, Matlab, and similar environments. Dlib-ml is an open source library, targeted at both engineers and research scientists, which aims to provide a similarly rich environment for developing machine learning software in the C++ language. Towards this end, dlib-ml contains an extensible linear algebra toolkit with built in BLAS support. It also houses implementations of algorithms for performing inference in Bayesian networks and kernel-based methods for classification, regression, clustering, anomaly detection, and feature ranking. To enable easy use of these tools, the entire library has been developed with contract programming, which provides complete and precise documentation as well as powerful debugging tools.

2,701 citations


Cites methods from "Pegasos: primal estimated sub-gradi..."

  • ...The other SVM solver is a kernelized version of the Pegasos algorithm introduced by Shalev-Shwartz et al. (2007)....


Proceedings Article
05 Dec 2013
TL;DR: It is proved that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG), but the analysis is significantly simpler and more intuitive.
Abstract: Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning.

2,539 citations

Book
01 Jan 2018

2,291 citations


Cites methods from "Pegasos: primal estimated sub-gradi..."

  • ...This update is identical to that used by the primal support vector machine (SVM) algorithm [448], except that the updates are performed only for the misclassified points in the perceptron, whereas the SVM also uses the marginally correct points near the decision boundary for updates....


References
Book
01 Mar 2004
TL;DR: A comprehensive introduction to convex optimization that shows in detail how such problems can be solved numerically with great efficiency, with the focus on recognizing convex optimization problems and then finding the most appropriate technique for solving them.
Abstract: Convex optimization problems arise frequently in many different fields. A comprehensive introduction to the subject, this book shows in detail how such problems can be solved numerically with great efficiency. The focus is on recognizing convex optimization problems and then finding the most appropriate technique for solving them. The text contains many worked examples and homework exercises and will appeal to students, researchers and practitioners in fields such as engineering, computer science, mathematics, statistics, finance, and economics.

33,341 citations

01 Jan 1998
TL;DR: Presenting a method for determining the necessary and sufficient conditions for consistency of learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.
Abstract: A comprehensive look at learning and generalization theory. The statistical theory of learning and generalization concerns the problem of choosing desired functions on the basis of empirical data. Highly applicable to a variety of computer science and robotics fields, this book offers lucid coverage of the theory as a whole. Presenting a method for determining the necessary and sufficient conditions for consistency of learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.

26,531 citations


"Pegasos: primal estimated sub-gradi..." refers background in this paper

  • ...Support Vector Machines (SVMs) are effective and popular classification learning tool [36,12]....


Book
01 Jan 1973
TL;DR: In this article, a unified, comprehensive and up-to-date treatment of both statistical and descriptive methods for pattern recognition is provided, including Bayesian decision theory, supervised and unsupervised learning, nonparametric techniques, discriminant analysis, clustering, preprosessing of pictorial data, spatial filtering, shape description techniques, perspective transformations, projective invariants, linguistic procedures, and artificial intelligence techniques for scene analysis.
Abstract: Provides a unified, comprehensive and up-to-date treatment of both statistical and descriptive methods for pattern recognition. The topics treated include Bayesian decision theory, supervised and unsupervised learning, nonparametric techniques, discriminant analysis, clustering, preprocessing of pictorial data, spatial filtering, shape description techniques, perspective transformations, projective invariants, linguistic procedures, and artificial intelligence techniques for scene analysis.

13,647 citations

Proceedings ArticleDOI
08 Feb 1999
TL;DR: An edited collection on support vector learning covering theory, implementations, applications such as time series prediction, and extensions of the algorithm.
Abstract: Introduction to support vector learning roadmap. Part 1 Theory: three remarks on the support vector method of function estimation, Vladimir Vapnik generalization performance of support vector machines and other pattern classifiers, Peter Bartlett and John Shawe-Taylor Bayesian voting schemes and large margin classifiers, Nello Cristianini and John Shawe-Taylor support vector machines, reproducing kernel Hilbert spaces, and randomized GACV, Grace Wahba geometry and invariance in kernel based methods, Christopher J.C. Burges on the annealed VC entropy for margin classifiers - a statistical mechanics study, Manfred Opper entropy numbers, operators and support vector kernels, Robert C. Williamson et al. Part 2 Implementations: solving the quadratic programming problem arising in support vector classification, Linda Kaufman making large-scale support vector machine learning practical, Thorsten Joachims fast training of support vector machines using sequential minimal optimization, John C. Platt. Part 3 Applications: support vector machines for dynamic reconstruction of a chaotic system, Davide Mattera and Simon Haykin using support vector machines for time series prediction, Klaus-Robert Muller et al pairwise classification and support vector machines, Ulrich Kressel. Part 4 Extensions of the algorithm: reducing the run-time complexity in support vector machines, Edgar E. Osuna and Federico Girosi support vector regression with ANOVA decomposition kernels, Mark O. Stitson et al support vector density estimation, Jason Weston et al combining support vector and mathematical programming methods for classification, Bernhard Scholkopf et al.

5,506 citations

Frequently Asked Questions (9)
Q1. What are the contributions in this paper?

The authors describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). The authors prove that the number of iterations required to obtain a solution of accuracy ε is Õ(1/ε), where each iteration operates on a single training example. Their approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Their algorithm is particularly well suited for large text classification problems, where the authors demonstrate an order-of-magnitude speedup over previous SVM learning methods.

(For instance, if full optimization of SVM yields a test classification error of 1%, then the authors chose ε such that an ε-accurate optimization would guarantee a test classification error of at most 1.1%.)

While algorithms in this family are fairly simple to implement and entertain general asymptotic convergence properties [8], the time complexity of most of the algorithms in this family is typically super linear in the training set size m. 

2. If f(w) = max_i f_i(w) for r differentiable functions f_1, . . . , f_r, and j = arg max_i f_i(w_0), then the gradient of f_j at w_0 is a sub-gradient of f at w_0.

Cutting Planes Approach: Recently, Joachims [21] proposed SVM-Perf, which uses a cutting planes method to find a solution with accuracy ε in time O(md/(λε²)).

As the authors show in the sequel, the kernelized Pegasos variant described in Section 4 gives good performance on a range of kernel SVM problems, provided that these problems have sufficient regularization.

In its more traditional form, the SVM learning problem was described as the following constrained optimization problem: $$\min_{w,\xi}\;\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{s.t.}\quad \forall i \in [m]:\; \xi_i \ge 0,\; \xi_i \ge 1 - y_i\,\langle w, x_i\rangle \,.$$

The authors now show that the Pegasos algorithm can be implemented using only kernel evaluations, without direct access to the feature vectors φ(x) or explicit access to the weight vector w. 

As in the linear experiments, the authors chose a primal suboptimality threshold for each dataset which guarantees a testing classification error within 10% of that at the optimum.