
Pegasos: primal estimated sub-gradient solver for SVM

TL;DR: A simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines, which is particularly well suited for large text classification problems, and demonstrates an order-of-magnitude speedup over previous SVM learning methods.
Abstract: We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy $${\epsilon}$$ is $${\tilde{O}(1 / \epsilon)}$$, where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require $${\Omega(1 / \epsilon^2)}$$ iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is $${\tilde{O}(d/(\lambda \epsilon))}$$, where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an order-of-magnitude speedup over previous SVM learning methods.

Summary (2 min read)

1 Introduction

  • Support Vector Machines (SVMs) are an effective and popular classification learning tool [36,12].
  • Alas, IP methods typically require run time which is cubic in the number of examples m.
  • Most existing approaches, including the methods discussed above, focus on the dual of Eq. (1), especially when used in conjunction with non-linear kernels.
  • In contrast, Pegasos is virtually parameter free.
  • Nevertheless, online learning algorithms were proposed as fast alternatives to SVMs (e.g. [16]).

2 The Pegasos Algorithm

  • As mentioned above, Pegasos performs stochastic gradient descent on the primal objective Eq. (1) with a carefully chosen stepsize.
  • The authors describe in this section the core of the Pegasos procedure in detail and provide pseudo-code.
  • The authors also present a few variants of the basic algorithm and discuss a few implementation issues.
  • To underscore the difference between the fully deterministic case and the stochastic case, the authors refer to the subsamples in the latter case as mini-batches and call the process mini-batch iterates.
  • After w is updated, the stored norm ν needs to be updated, which can again be done in time O(d) as before; a sketch of this bookkeeping appears after this list.
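A minimal sketch of the bookkeeping behind this bullet is given below, under the assumption that w is stored implicitly as a scaled vector w = a·v together with a cached norm (one common way to realize the O(d) update the bullet mentions); the class and method names are illustrative, not from the paper.

import numpy as np

class ScaledVector:
    # Represent w = a * v so that scaling w is O(1) and adding a sparse
    # example with d non-zeros is O(d).  The squared norm of w is cached so
    # the optional projection step can be applied cheaply.
    def __init__(self, dim):
        self.v = np.zeros(dim)   # base vector
        self.a = 1.0             # global scale, w = a * v
        self.sq_norm = 0.0       # cached ||w||^2

    def scale(self, c):
        # w <- c * w : only the scalar and the cached norm change, O(1).
        if c == 0.0:
            self.v[:] = 0.0
            self.a, self.sq_norm = 1.0, 0.0
        else:
            self.a *= c
            self.sq_norm *= c * c

    def add_sparse(self, indices, values, coef):
        # w <- w + coef * x for a sparse x given as (indices, values), O(d).
        for j, xj in zip(indices, values):
            old = self.a * self.v[j]
            new = old + coef * xj
            self.sq_norm += new * new - old * old
            self.v[j] = new / self.a

    def dot_sparse(self, indices, values):
        # <w, x> for a sparse x, O(d).
        return self.a * sum(self.v[j] * xj for j, xj in zip(indices, values))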

3 Analysis

  • In this section the authors analyze the convergence properties of Pegasos.
  • The authors start by bounding the average instantaneous objective of the algorithm relative to the average instantaneous objective of the optimal solution.
  • Thus, to prove the theorem it suffices to show that the conditions stated in Lemma 1 hold.
  • The above lemma tells us that on average after two attempts the authors are likely to find a good solution.

4 Using Mercer kernels

  • One of the main benefits of SVMs is that they can be used with kernels rather than with direct access to the feature vectors x.
  • It is now easy to implement the Pegasos algorithm by maintaining the vector α; a sketch of this kernelized variant appears after this list.
  • Since the iterates wt remain as before (just their representation changes), the guarantees on the accuracy after a number of iterations are still valid.
  • Concretely, the Representer theorem guarantees that the optimal solution of Eq. (17) is spanned by the training instances, i.e. it is of the form w = ∑_{i=1}^m α[i] φ(x_i).
  • Interestingly, Chapelle also proposes preconditioning the gradients w.r.t. α by the kernel matrix, which effectively amounts to taking gradients w.r.t. w, as the authors do here.
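The kernelized variant mentioned above can be sketched by keeping one counter α[i] per training example, recording how many times example i was selected while suffering non-zero loss, so that the implicit weight vector at iteration t is (1/(λt)) ∑_j α[j] y_j φ(x_j). This is a hedged sketch: the Gaussian kernel and the function names are illustrative choices, not prescribed by the paper.

import numpy as np

def gaussian_kernel(x, z, gamma=1.0):
    # Illustrative Mercer kernel; any valid kernel K(x, z) could be used.
    return np.exp(-gamma * np.linalg.norm(x - z) ** 2)

def kernelized_pegasos(X, y, lam, T, kernel=gaussian_kernel, seed=0):
    # Maintain alpha[i]: how often example i was picked with non-zero loss.
    rng = np.random.default_rng(seed)
    m = len(y)
    alpha = np.zeros(m)
    for t in range(1, T + 1):
        i = rng.integers(m)
        # Decision value of example i under the implicit w_t.
        decision = sum(alpha[j] * y[j] * kernel(X[i], X[j])
                       for j in range(m) if alpha[j] != 0) / (lam * t)
        if y[i] * decision < 1:
            alpha[i] += 1
    # The classifier is sign((1/(lam*T)) * sum_j alpha[j] * y[j] * K(., x_j)).
    return alpha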

5 Other prediction problems and loss functions

  • So far, the authors focused on the SVM formulation for binary classification using the hinge-loss.
  • The generality of these assumptions implies that the authors can apply Pegasos with any loss function which satisfies these requirements.
  • The log-loss version of the multiclass loss is convex as well, with a bounded sub-gradient whose norm is at most 2 max_{y′} ‖φ(x, y′)‖.
  • Therefore, |Y| is exponential in the length of the sequence.
  • Based on the above two properties, the authors now show explicitly how to calculate a sub-gradient for several loss functions; a sketch for the multiclass case appears after this list.
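As one concrete instance of the recipe in the last bullet, the sketch below returns a sub-gradient of the multiclass hinge loss max_{y′} [Δ(y, y′) + ⟨w, φ(x, y′)⟩ − ⟨w, φ(x, y)⟩], using the rule (also quoted in the FAQ below) that the gradient of a maximizing branch of a max is a sub-gradient of the max. The feature map φ and label loss Δ are placeholders supplied by the caller, not names from the paper.

import numpy as np

def multiclass_hinge_subgradient(w, phi, delta, x, y, labels):
    # Sub-gradient of  max_{y'} [ delta(y, y') + <w, phi(x, y')> - <w, phi(x, y)> ].
    # By the max-of-functions rule, phi(x, y_hat) - phi(x, y) is a valid
    # sub-gradient for any maximizing label y_hat.
    scores = [delta(y, yp) + w @ phi(x, yp) - w @ phi(x, y) for yp in labels]
    y_hat = labels[int(np.argmax(scores))]
    return phi(x, y_hat) - phi(x, y)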

6 Incorporating a bias term

  • In many applications, the weight vector w is augmented with a bias term which is a scalar, typically denoted as b.
  • Once the constant feature is added, the rest of the algorithm remains intact, thus the bias term is not explicitly introduced; a sketch of this augmentation appears after this list.
  • A third method entertains the advantages of the two methods above at the price of a more complex algorithm that is applicable only for large batch sizes (large values of k), but not for the basic Pegasos algorithm (with k = 1).
  • The problem however is how to find a sub-gradient of g(w;At), as g(w;At) is defined through a minimization problem over b.
  • If At is large enough, it might be possible to use more involved measure-concentration tools to show that the expectation of f(w;At) is close enough to f(w;S) so as to still obtain fast convergence properties.
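The constant-feature approach mentioned above (adding a feature that is always 1 so that the last coordinate of w plays the role of the bias b) fits in a few lines; note that the bias is then regularized together with w, unlike the standard SVM bias term. This is a sketch, not code from the paper.

import numpy as np

def add_bias_feature(X, value=1.0):
    # Append a constant feature to every example; the last coordinate of w
    # then acts as the bias term b, and the rest of Pegasos is unchanged.
    ones = np.full((X.shape[0], 1), value)
    return np.hstack([X, ones])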

7 Experiments

  • In this section the authors present experimental results that demonstrate the merits of their algorithm.
  • The authors chose this threshold for each dataset such that a primal suboptimality less than ε guarantees a classification error on test data which is at most 1.1 times the test data classification error at the optimum.
  • It is interesting to note that the performance of Pegasos does not depend on the number of examples but rather on the value of λ.
  • All of the implementations use the same sparse representation for vectors, so the amount of time which it takes to perform a single kernel evaluation should, for each dataset, be roughly the same across all four algorithms.
  • The authors can see that, for three values of k, all significantly greater than 100, the experiments with the largest mini-batch size made the least progress while performing the same amount of computation.

8 Conclusions

  • The authors described and analyzed a simple and effective algorithm for approximately minimizing the objective function of SVM.
  • The authors derived fast rate of convergence results and experimented with the algorithm.
  • The authors' empirical results indicate that, for linear kernels, Pegasos achieves state-of-the-art results despite, or possibly due to, its simplicity.
  • The authors would like to thank Léon Bottou for useful discussions and suggestions, and Thorsten Joachims and Léon Bottou for help with the experiments.
  • Part of this work was done while SS and NS were visiting IBM research labs, Haifa, Israel.


Shai Shalev-Shwartz · Yoram Singer · Nathan Srebro · Andrew Cotter
Pegasos: Primal Estimated sub-GrAdient SOlver for SVM
Abstract We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy ε is Õ(1/ε), where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require Ω(1/ε²) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is Õ(d/(λε)), where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an order-of-magnitude speedup over previous SVM learning methods.
Keywords SVM · Stochastic Gradient Descent
Shai Shalev-Shwartz
School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
E-mail: shais@cs.huji.ac.il
Yoram Singer
Google
E-mail: singer@google.com
Nathan Srebro
Toyota Technological Institute at Chicago
E-mail: nati@uchicago.edu
Andrew Cotter
Toyota Technological Institute at Chicago
E-mail: cotter@tti-c.org

1 Introduction
Support Vector Machines (SVMs) are an effective and popular classification learning tool [36,12]. The task of learning a support vector machine is typically cast as a constrained quadratic programming problem. However, in its native form, it is in fact an unconstrained empirical loss minimization with a penalty term for the norm of the classifier that is being learned. Formally, given a training set S = {(x_i, y_i)}_{i=1}^m, where x_i ∈ ℝ^n and y_i ∈ {+1, −1}, we would like to find the minimizer of the problem

$$\min_{w} \;\frac{\lambda}{2}\,\|w\|^2 \;+\; \frac{1}{m}\sum_{(x,y)\in S} \ell(w;(x,y)) \,, \qquad (1)$$

where

$$\ell(w;(x,y)) \;=\; \max\{0,\; 1 - y\,\langle w, x\rangle\} \,, \qquad (2)$$

and ⟨u, v⟩ denotes the standard inner product between the vectors u and v. We denote the objective function of Eq. (1) by f(w). We say that an optimization method finds an ε-accurate solution ŵ if f(ŵ) ≤ min_w f(w) + ε. The standard SVM problem also includes an unregularized bias term. We omit the bias throughout the coming sections and revisit the incorporation of a bias term in Sec. 6.
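To make Eq. (1) and Eq. (2) concrete, here is a small sketch (not from the paper) that evaluates the hinge loss and the regularized objective f(w) for a data set stored as a NumPy matrix; the function names are illustrative.

import numpy as np

def hinge_loss(w, x, y):
    # Eq. (2): l(w; (x, y)) = max{0, 1 - y <w, x>}
    return max(0.0, 1.0 - y * np.dot(w, x))

def svm_objective(w, X, y, lam):
    # Eq. (1): f(w) = (lam/2) ||w||^2 + (1/m) * sum_i l(w; (x_i, y_i))
    m = X.shape[0]
    avg_loss = sum(hinge_loss(w, X[i], y[i]) for i in range(m)) / m
    return 0.5 * lam * np.dot(w, w) + avg_loss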
We describe and analyze in this paper a simple stochastic sub-gradient descent algorithm, which we call Pegasos, for solving Eq. (1). At each iteration, a single training example is chosen at random and used to estimate a sub-gradient of the objective, and a step with pre-determined step-size is taken in the opposite direction. We show that with high probability over the choice of the random examples, our algorithm finds an ε-accurate solution using only Õ(1/(λε)) iterations, while each iteration involves a single inner product between w and x. Put differently, the overall runtime required to obtain an ε-accurate solution is Õ(n/(λε)), where n is the dimensionality of w and x. Moreover, this runtime can be reduced to Õ(d/(λε)), where d is the number of non-zero features in each example x. Pegasos can also be used with non-linear kernels, as we describe in Sec. 4. We would like to emphasize that a solution is found in probability solely due to the randomization steps employed by the algorithm and not due to the data set. The data set is not assumed to be random, and the analysis holds for any data set S. Furthermore, the runtime does not depend on the number of training examples and thus our algorithm is especially suited for large datasets.
Before delving into the detailed description and analysis of Pegasos, we
would like to draw connections to and put our work in context of some of the
more recent work on SVM. For a more comprehensive and up-to-date overview
of relevant work see the references in the papers cited below as well as the web
site dedicated to kernel methods at http://www.kernel-machines.org . Due to the
centrality of the SVM optimization problem, quite a few methods were devised
and analyzed. The different approaches can be roughly divided into the following
categories.

Interior Point (IP) methods: IP methods (see for instance [7] and the references
therein) cast the SVM learning task as a quadratic optimization problem subject to
linear constraints. The constraints are replaced with a barrier function. The result
is a sequence of unconstrained problems which can be optimized very efficiently
using Newton or Quasi-Newton methods. The advantage of IP methods is that the
dependence on the accuracy is double logarithmic, namely, log(log(1/ε)). Alas, IP methods typically require run time which is cubic in the number of examples m. Moreover, the memory requirements of IP methods are O(m²), which renders
a direct use of IP methods very difficult when the training set consists of many
examples. It should be noted that there have been several attempts to reduce the
complexity based on additional assumptions (see e.g. [15]). However, the depen-
dence on m remains super linear. In addition, while the focus of the paper is the
optimization problem cast by SVM, one needs to bear in mind that the optimiza-
tion problem is a proxy method for obtaining good classification error on unseen
examples. Achieving a very high accuracy in the optimization process is usually
unnecessary and does not translate to a significant increase in the generalization
accuracy. The time spent by IP methods for finding a single accurate solution may,
for instance, be better utilized for trying different regularization values.
Decomposition methods: To overcome the quadratic memory requirement of IP
methods, decomposition methods such as SMO [29] and SVM-Light [20] tackle
the dual representation of the SVM optimization problem, and employ an active
set of constraints thus working on a subset of dual variables. In the extreme case,
called row-action methods [8], the active set consists of a single constraint. While
algorithms in this family are fairly simple to implement and entertain general
asymptotic convergence properties [8], the time complexity of most of the algo-
rithms in this family is typically super linear in the training set size m. Moreover,
since decomposition methods find a feasible dual solution and their goal is to max-
imize the dual objective function, they often result in a rather slow convergence
rate to the optimum of the primal objective function. (See also the discussion
in [19].)
Primal optimization: Most existing approaches, including the methods discussed
above, focus on the dual of Eq. (1), especially when used in conjunction with
non-linear kernels. However, even when non-linear kernels are used, the Representer theorem [23] allows us to re-parametrize w as w = ∑_i α_i y_i x_i and cast the primal objective Eq. (1) as an unconstrained optimization problem with the variables α_1, . . . , α_m (see Sec. 4). Tackling the primal objective directly was studied,
for example, by Chapelle [10], who considered using smooth loss functions in-
stead of the hinge loss, in which case the optimization problem becomes a smooth
unconstrained optimization problem. Chapelle then suggested using various op-
timization approaches such as conjugate gradient descent and Newton’s method.
We take a similar approach here, however we cope with the non-differentiability
of the hinge-loss directly by using sub-gradients instead of gradients. Another im-
portant distinction is that Chapelle views the optimization problem as a function of
the variables α_i. In contrast, though Pegasos maintains the same set of variables,
the optimization process is performed with respect to w, see Sec. 4 for details.
Stochastic gradient descent: The Pegasos algorithm is an application of a stochas-
tic sub-gradient method (see for example [25,34]). In the context of machine
learning problems, the efficiency of the stochastic gradient approach has been

studied in [26,1, 3,27,6, 5]. In particular, it has been claimed and experimentally
observed that, “Stochastic algorithms yield the best generalization performance
despite being the worst optimization algorithms”. This claim has recently received
formal treatment in [4,32].
Two concrete algorithms that are closely related to the Pegasos algorithm
and are also variants of stochastic sub-gradient methods are the NORMA algo-
rithm [24] and a stochastic gradient algorithm due to Zhang [37]. The main dif-
ference between Pegasos and these variants is in the procedure for setting the step
size. We elaborate on this issue in Sec. 7. The convergence rate given in [24]
implies that the number of iterations required to achieve an ε-accurate solution is O(1/(λε)²). This bound is inferior to the corresponding bound of Pegasos. The
analysis in [37] for the case of regularized loss shows that the squared Euclidean
distance to the optimal solution converges to zero but the rate of convergence de-
pends on the step size parameter. As we show in Sec. 7, tuning this parameter is
crucial to the success of the method. In contrast, Pegasos is virtually parameter
free. Another related recent work is Nesterov’s general primal-dual subgradient
method for the minimization of non-smooth functions [28]. Intuitively, the ideas
presented in [28] can be combined with the stochastic regime of Pegasos. We leave
this direction and other potential extensions of Pegasos for future research.
Online methods: Online learning methods are very closely related to stochas-
tic gradient methods, as they operate on only a single example at each iteration.
Moreover, many online learning rules, including the Perceptron rule, can be seen
as implementing a stochastic gradient step. Many such methods, including the
Perceptron and the Passive Aggressive method [11] also have strong connections
to the “margin” or norm of the predictor, though they do not directly minimize the
SVM objective. Nevertheless, online learning algorithms were proposed as fast
alternatives to SVMs (e.g. [16]). Such algorithms can be used to obtain a predic-
tor with low generalization error using an online-to-batch conversion scheme [9].
However, the conversion schemes do not necessarily yield an ε-accurate solution
to the original SVM problem and their performance is typically inferior to di-
rect batch optimizers. As noted above, Pegasos shares the simplicity and speed of
online learning algorithms, yet it is guaranteed to converge to the optimal SVM
solution.
Cutting Planes Approach: Recently, Joachims [21] proposed SVM-Perf, which
uses a cutting planes method to find a solution with accuracy ε in time O(md/(λε²)). This bound was later improved by Smola et al [33] to O(md/(λε)). The complex-
ity guarantee for Pegasos avoids the dependence on the data set size m. In addi-
tion, while SVM-Perf yields very significant improvements over decomposition
methods for large data sets, our experiments (see Sec. 7) indicate that Pegasos is
substantially faster than SVM-Perf.
2 The Pegasos Algorithm
As mentioned above, Pegasos performs stochastic gradient descent on the primal
objective Eq. (1) with a carefully chosen stepsize. We describe in this section the
core of the Pegasos procedure in detail and provide pseudo-code. We also present
a few variants of the basic algorithm and discuss a few implementation issues.

INPUT: S, λ, T
INITIALIZE: Set w_1 = 0
FOR t = 1, 2, . . . , T
    Choose i_t ∈ {1, . . . , |S|} uniformly at random.
    Set η_t = 1/(λt)
    If y_{i_t} ⟨w_t, x_{i_t}⟩ < 1, then:
        Set w_{t+1} ← (1 − η_t λ) w_t + η_t y_{i_t} x_{i_t}
    Else (if y_{i_t} ⟨w_t, x_{i_t}⟩ ≥ 1):
        Set w_{t+1} ← (1 − η_t λ) w_t
    [ Optional: w_{t+1} ← min{ 1, (1/√λ) / ‖w_{t+1}‖ } w_{t+1} ]
OUTPUT: w_{T+1}

Fig. 1 The Pegasos Algorithm.
2.1 The Basic Pegasos Algorithm

On each iteration Pegasos operates as follows. Initially, we set w_1 to the zero vector. On iteration t of the algorithm, we first choose a random training example (x_{i_t}, y_{i_t}) by picking an index i_t ∈ {1, . . . , m} uniformly at random. We then replace the objective in Eq. (1) with an approximation based on the training example (x_{i_t}, y_{i_t}), yielding:

$$f(w; i_t) \;=\; \frac{\lambda}{2}\,\|w\|^2 + \ell(w; (x_{i_t}, y_{i_t})) \,. \qquad (3)$$
We consider the sub-gradient of the above approximate objective, given by:

$$\nabla_t \;=\; \lambda\, w_t \;-\; \mathbb{1}[y_{i_t}\langle w_t, x_{i_t}\rangle < 1]\; y_{i_t} x_{i_t} \,, \qquad (4)$$

where 𝟙[y ⟨w, x⟩ < 1] is the indicator function which takes a value of one if its argument is true (w yields non-zero loss on the example (x, y)), and zero otherwise. We then update w_{t+1} ← w_t − η_t ∇_t using a step size of η_t = 1/(λt). Note that this update can be written as:

$$w_{t+1} \;\leftarrow\; \Big(1 - \tfrac{1}{t}\Big) w_t \;+\; \eta_t\, \mathbb{1}[y_{i_t}\langle w_t, x_{i_t}\rangle < 1]\; y_{i_t} x_{i_t} \,. \qquad (5)$$

After a predetermined number T of iterations, we output the last iterate w_{T+1}. The pseudo-code of Pegasos is given in Fig. 1.
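For readers who prefer running code to pseudo-code, the following is a minimal NumPy sketch of the basic loop of Fig. 1, without the optional projection step discussed next in Sec. 2.2; the dense-vector representation and function name are implementation choices, not part of the paper.

import numpy as np

def pegasos(X, y, lam, T, seed=0):
    # Basic Pegasos (Fig. 1): stochastic sub-gradient descent on Eq. (1)
    # with step size eta_t = 1/(lam * t).  X is (m, n); y has entries in {+1, -1}.
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)                       # w_1 = 0
    for t in range(1, T + 1):
        i = rng.integers(m)               # choose i_t uniformly at random
        eta = 1.0 / (lam * t)             # eta_t = 1/(lam t)
        if y[i] * np.dot(w, X[i]) < 1:    # non-zero hinge loss on example i_t
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1 - eta * lam) * w
    return w                              # output w_{T+1}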
2.2 Incorporating a Projection Step
The above description of Pegasos is a verbatim application of the stochastic gradient-descent method. A potential variation is the gradient-projection approach, where we limit the set of admissible solutions to the ball of radius 1/√λ. To enforce this property, we project w_t after each iteration onto this sphere by performing the update:

$$w_{t+1} \;\leftarrow\; \min\Big\{1,\; \frac{1/\sqrt{\lambda}}{\|w_{t+1}\|}\Big\}\, w_{t+1} \,. \qquad (6)$$
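The projection of Eq. (6) onto the ball of radius 1/√λ can be appended to each iteration of the sketch above with a helper such as the following (again an illustrative sketch, not the authors' implementation).

import numpy as np

def project_to_ball(w, lam):
    # Eq. (6): w <- min{1, (1/sqrt(lam)) / ||w||} * w
    norm = np.linalg.norm(w)
    radius = 1.0 / np.sqrt(lam)
    return w if norm <= radius else (radius / norm) * w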

Citations
Journal ArticleDOI
TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.
Abstract: We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI--SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.

10,501 citations


Cites background from "Pegasos: primal estimated sub-gradi..."

  • ...For linear SVMs, a learning rate of 1/t has been shown to work well [37]....


Book
24 Aug 2012
TL;DR: This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach, and is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
Abstract: Today's Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package--PMTK (probabilistic modeling toolkit)--that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.

8,059 citations


Cites methods from "Pegasos: primal estimated sub-gradi..."

  • ...A specific instance of this is the Pegasos algorithm (Shalev-Shwartz et al. 2007), which stands for “primal estimated sub-gradient solver for SVM”....


Journal ArticleDOI
TL;DR: dlib-ml contains an extensible linear algebra toolkit with built in BLAS support, and implementations of algorithms for performing inference in Bayesian networks and kernel-based methods for classification, regression, clustering, anomaly detection, and feature ranking.
Abstract: There are many excellent toolkits which provide support for developing machine learning software in Python, R, Matlab, and similar environments. Dlib-ml is an open source library, targeted at both engineers and research scientists, which aims to provide a similarly rich environment for developing machine learning software in the C++ language. Towards this end, dlib-ml contains an extensible linear algebra toolkit with built in BLAS support. It also houses implementations of algorithms for performing inference in Bayesian networks and kernel-based methods for classification, regression, clustering, anomaly detection, and feature ranking. To enable easy use of these tools, the entire library has been developed with contract programming, which provides complete and precise documentation as well as powerful debugging tools.

2,701 citations


Cites methods from "Pegasos: primal estimated sub-gradi..."

  • ...The other SVM solver is a kernelized version of the Pegasos algorithm introduced by Shalev-Shwartz et al. (2007)....


Proceedings Article
05 Dec 2013
TL;DR: It is proved that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG), but the analysis is significantly simpler and more intuitive.
Abstract: Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning.

2,539 citations

Book
01 Jan 2018

2,291 citations


Cites methods from "Pegasos: primal estimated sub-gradi..."

  • ...This update is identical to that used by the primal support vector machine (SVM) algorithm [448], except that the updates are performed only for the misclassified points in the perceptron, whereas the SVM also uses the marginally correct points near the decision boundary for updates....


References
Book
01 Mar 2004
TL;DR: A comprehensive introduction to convex optimization that shows in detail how such problems can be solved numerically with great efficiency, with the focus on recognizing convex optimization problems and then finding the most appropriate technique for solving them.
Abstract: Convex optimization problems arise frequently in many different fields. A comprehensive introduction to the subject, this book shows in detail how such problems can be solved numerically with great efficiency. The focus is on recognizing convex optimization problems and then finding the most appropriate technique for solving them. The text contains many worked examples and homework exercises and will appeal to students, researchers and practitioners in fields such as engineering, computer science, mathematics, statistics, finance, and economics.

33,341 citations

01 Jan 1998
TL;DR: Presenting a method for determining the necessary and sufficient conditions for consistency of learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.
Abstract: A comprehensive look at learning and generalization theory. The statistical theory of learning and generalization concerns the problem of choosing desired functions on the basis of empirical data. Highly applicable to a variety of computer science and robotics fields, this book offers lucid coverage of the theory as a whole. Presenting a method for determining the necessary and sufficient conditions for consistency of learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.

26,531 citations


"Pegasos: primal estimated sub-gradi..." refers background in this paper

  • ...Support Vector Machines (SVMs) are effective and popular classification learning tool [36,12]....


Book
01 Jan 1973
TL;DR: In this article, a unified, comprehensive and up-to-date treatment of both statistical and descriptive methods for pattern recognition is provided, including Bayesian decision theory, supervised and unsupervised learning, nonparametric techniques, discriminant analysis, clustering, preprosessing of pictorial data, spatial filtering, shape description techniques, perspective transformations, projective invariants, linguistic procedures, and artificial intelligence techniques for scene analysis.
Abstract: Provides a unified, comprehensive and up-to-date treatment of both statistical and descriptive methods for pattern recognition. The topics treated include Bayesian decision theory, supervised and unsupervised learning, nonparametric techniques, discriminant analysis, clustering, preprocessing of pictorial data, spatial filtering, shape description techniques, perspective transformations, projective invariants, linguistic procedures, and artificial intelligence techniques for scene analysis.

13,647 citations

Proceedings ArticleDOI
08 Feb 1999
TL;DR: An edited collection on support vector learning covering theory, implementations, applications such as time series prediction, and extensions of the algorithm.
Abstract: Introduction to support vector learning roadmap. Part 1 Theory: three remarks on the support vector method of function estimation, Vladimir Vapnik generalization performance of support vector machines and other pattern classifiers, Peter Bartlett and John Shawe-Taylor Bayesian voting schemes and large margin classifiers, Nello Cristianini and John Shawe-Taylor support vector machines, reproducing kernel Hilbert spaces, and randomized GACV, Grace Wahba geometry and invariance in kernel based methods, Christopher J.C. Burges on the annealed VC entropy for margin classifiers - a statistical mechanics study, Manfred Opper entropy numbers, operators and support vector kernels, Robert C. Williamson et al. Part 2 Implementations: solving the quadratic programming problem arising in support vector classification, Linda Kaufman making large-scale support vector machine learning practical, Thorsten Joachims fast training of support vector machines using sequential minimal optimization, John C. Platt. Part 3 Applications: support vector machines for dynamic reconstruction of a chaotic system, Davide Mattera and Simon Haykin using support vector machines for time series prediction, Klaus-Robert Muller et al pairwise classification and support vector machines, Ulrich Kressel. Part 4 Extensions of the algorithm: reducing the run-time complexity in support vector machines, Edgar E. Osuna and Federico Girosi support vector regression with ANOVA decomposition kernels, Mark O. Stitson et al support vector density estimation, Jason Weston et al combining support vector and mathematical programming methods for classification, Bernhard Scholkopf et al.

5,506 citations

Frequently Asked Questions (9)
Q1. What are the contributions in this paper?

The authors describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). The authors prove that the number of iterations required to obtain a solution of accuracy ε is Õ(1/ε), where each iteration operates on a single training example. Their approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Their algorithm is particularly well suited for large text classification problems, where the authors demonstrate an order-of-magnitude speedup over previous SVM learning methods.

(For instance, if full optimization of SVM yields a test classification error of 1%, then the authors chose ε such that an ε-accurate optimization would guarantee a test classification error of at most 1.1%.)

While algorithms in this family are fairly simple to implement and entertain general asymptotic convergence properties [8], the time complexity of most of the algorithms in this family is typically super linear in the training set size m. 

2. If f(w) = max_i f_i(w) for r differentiable functions f_1, . . . , f_r, and j = arg max_i f_i(w_0), then the gradient of f_j at w_0 is a sub-gradient of f at w_0.

Cutting Planes Approach: Recently, Joachims [21] proposed SVM-Perf, which uses a cutting planes method to find a solution with accuracy ε in time O(md/(λε²)).

As the authors show in the sequel, the kernelized Pegasos variant described in Section 4 gives good performance on a range of kernel SVM problems, provided that these problems have sufficient regularization.

In its more traditional form, the SVM learning problem was described as the following constrained optimization problem: $$\min_{w,\xi}\;\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{s.t.}\quad \forall i \in [m]:\; \xi_i \ge 0,\; \xi_i \ge 1 - y_i\,\langle w, x_i\rangle \,.$$

The authors now show that the Pegasos algorithm can be implemented using only kernel evaluations, without direct access to the feature vectors φ(x) or explicit access to the weight vector w. 

As in the linear experiments, the authors chose a primal suboptimality threshold for each dataset which guarantees a testing classification error within 10% of that at the optimum.