Pegasos: primal estimated sub-gradient solver for SVM
Summary (2 min read)
1 Introduction
- Support Vector Machines (SVMs) are an effective and popular classification learning tool [36, 12].
- Alas, interior-point (IP) methods typically require run time that is cubic in the number of examples m.
- Most existing approaches, including the methods discussed above, focus on the dual of Eq. (1), especially when used in conjunction with non-linear kernels.
- In contrast, Pegasos is virtually parameter free.
- Online learning algorithms were also proposed as fast alternatives to SVMs (e.g. [16]).
2 The Pegasos Algorithm
- As mentioned above, Pegasos performs stochastic gradient descent on the primal objective Eq. (1) with a carefully chosen stepsize (a sketch of the update follows this list).
- The authors describe in this section the core of the Pegasos procedure in detail and provide pseudo-code.
- The authors also present a few variants of the basic algorithm and discuss a few implementation issues.
- To underscore the difference between the fully deterministic case and the stochastic case, the authors refer to the subsamples in the latter case as mini-batches and call the process mini-batch iterates.
- After w is updated, the stored norm ν needs to be updated, which can again be done in time O(d) as before.
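For concreteness, here is a minimal sketch of the single-example (k = 1) variant in Python. The data layout (dense NumPy arrays with labels in {−1, +1}) and the exact placement of the optional projection step are assumptions made for illustration, not a transcription of the paper's pseudo-code.

```python
import numpy as np

def pegasos(X, y, lam, T, seed=None):
    """Minimal single-example Pegasos sketch: SGD on the primal SVM objective.

    X: (m, d) array of examples; y: (m,) array of labels in {-1, +1};
    lam: regularization parameter lambda; T: number of iterations.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)              # draw one example uniformly at random
        margin = y[i] * X[i].dot(w)      # margin under the current iterate w_t
        eta = 1.0 / (lam * t)            # step size 1/(lambda * t)
        w *= (1.0 - eta * lam)           # gradient step on the regularizer
        if margin < 1.0:                 # hinge loss active at w_t
            w += eta * y[i] * X[i]       # add eta * y_i * x_i
        # optional projection onto the ball of radius 1/sqrt(lambda);
        # the paper maintains ||w|| incrementally in O(d), recomputing here
        # keeps the sketch short
        nrm = np.linalg.norm(w)
        if nrm > 1.0 / np.sqrt(lam):
            w *= 1.0 / (np.sqrt(lam) * nrm)
    return w
```

The mini-batch variant replaces the single index i with k indices drawn at random and averages their hinge-loss sub-gradients.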
3 Analysis
- In this section the authors analyze the convergence properties of Pegasos.
- The authors start by bounding the average instantaneous objective of the algorithm relative to the average instantaneous objective of the optimal solution (the flavor of this bound is sketched below).
- Thus, to prove the theorem it suffices to show that the conditions stated in Lemma 1 hold.
- The above lemma tells us that, on average, the authors are likely to find a good solution within two attempts.
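For reference, the standard bound for stochastic gradient descent on a λ-strongly convex objective, which the Pegasos analysis instantiates, has the following flavor; G stands for a bound on the norms of the sub-gradients used, and the exact constants in the paper's theorem differ slightly.

```latex
\frac{1}{T}\sum_{t=1}^{T} f(w_t; A_t)
  \;\le\; \frac{1}{T}\sum_{t=1}^{T} f(w^\star; A_t)
  \;+\; \frac{G^2\,(1+\ln T)}{2\lambda T}
```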
4 Using Mercer kernels
- One of the main benefits of SVMs is that they can be used with kernels rather than with direct access to the feature vectors x.
- It is now easy to implement the Pegasos algorithm by maintaining the vector α (see the sketch after this list).
- Since the iterates wt remain as before (just their representation changes), the guarantees on the accuracy after a number of iterations are still valid.
- Concretely, the Representer theorem guarantees that the optimal solution of Eq. (17) is spanned by the training instances, i.e. it is of the form w = ∑_{i=1}^{m} α[i] φ(x_i).
- Interestingly, Chapelle also proposes preconditioning the gradients w.r.t. α by the kernel matrix, which effectively amounts to taking gradients w.r.t. w, as the authors do here.
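A minimal sketch of this kernelized variant in the same spirit: instead of w, one keeps a count vector α, and every operation reduces to kernel evaluations. Precomputing the full Gram matrix K is an assumption made here for brevity; the algorithm only ever needs one column of it per iteration.

```python
import numpy as np

def kernel_pegasos(K, y, lam, T, seed=None):
    """Kernelized Pegasos sketch. K: (m, m) Gram matrix with K[i, j] = K(x_i, x_j);
    y: (m,) labels in {-1, +1}. alpha[j] counts how many times example j
    triggered an update; the implicit iterate is
        w_t = (1 / (lam * t)) * sum_j alpha[j] * y[j] * phi(x_j).
    """
    rng = np.random.default_rng(seed)
    m = K.shape[0]
    alpha = np.zeros(m)
    for t in range(1, T + 1):
        i = rng.integers(m)
        # margin of example i under the implicit w_t, via kernel evaluations only
        margin = y[i] * (alpha * y).dot(K[:, i]) / (lam * t)
        if margin < 1.0:
            alpha[i] += 1.0
    return alpha
```

A test point x is then classified by the sign of (1/(λT)) ∑_j α[j] y_j K(x_j, x).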
5 Other prediction problems and loss functions
- So far, the authors focused on the SVM formulation for binary classification using the hinge-loss.
- The generality of these assumptions implies that the authors can apply Pegasos with any loss function which satisfies these requirements.
- The log-loss version of the multiclass loss is convex as well, with a bounded sub-gradient whose norm is at most 2 max_{y′} ‖φ(x, y′)‖.
- Therefore, |Y| is exponential in the length of the sequence.
- Based on the above two properties, the authors now show explicitly how to calculate a sub-gradient for several loss functions.
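For instance, for the multiclass hinge loss ℓ(w; (x, y)) = max_{y′} (𝟙[y′ ≠ y] + ⟨w, φ(x, y′)⟩ − ⟨w, φ(x, y)⟩), the maximizing label ŷ yields the sub-gradient φ(x, ŷ) − φ(x, y). A sketch follows, where the feature-map interface phi(x, c) is an assumed signature for illustration:

```python
import numpy as np

def multiclass_hinge_subgrad(w, phi, x, y, num_classes):
    """Sub-gradient of the multiclass hinge loss
        l(w) = max_{y'} ( 1[y' != y] + <w, phi(x, y')> - <w, phi(x, y)> ).
    phi(x, c) is assumed to return the joint feature vector for label c.
    """
    base = w.dot(phi(x, y))
    scores = [(0.0 if c == y else 1.0) + w.dot(phi(x, c)) - base
              for c in range(num_classes)]
    y_hat = int(np.argmax(scores))      # most violating label (may be y itself)
    return phi(x, y_hat) - phi(x, y)    # the zero vector whenever y_hat == y
```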
6 Incorporating a bias term
- In many applications, the weight vector w is augmented with a bias term which is a scalar, typically denoted as b.
- Once the constant feature is added, the rest of the algorithm remains intact, thus the bias term is not explicitly introduced (a one-line sketch follows this list).
- A third method entertains the advantages of the two methods above at the price of a more complex algorithm that is applicable only for large batch sizes (large values of k), but not for the basic Pegasos algorithm (with k = 1).
- The problem, however, is how to find a sub-gradient of g(w; At), as g(w; At) is defined through a minimization problem over b.
- If At is large enough, it might be possible to use more involved measure concentration tools to show that the expectation of f(w; At) is close enough to f(w; S) so as to still obtain fast convergence properties.
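The first method amounts to one line of preprocessing, shown below; the scale of the constant feature is a free choice made here for illustration.

```python
import numpy as np

def add_bias_feature(X, scale=1.0):
    """Append a constant feature so that the last coordinate of w plays the
    role of the bias b. Note the trade-off discussed above: the algorithm is
    unchanged, but b is now regularized together with the rest of w."""
    return np.hstack([X, scale * np.ones((X.shape[0], 1))])
```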
7 Experiments
- In this section the authors present experimental results that demonstrate the merits of their algorithm.
- The authors chose this threshold ε for each dataset such that a primal suboptimality of less than ε guarantees a classification error on test data which is at most 1.1 times the test-data classification error at the optimum.
- It is interesting to note that the performance of Pegasos does not depend on the number of examples but rather on the value of λ.
- All of the implementations use the same sparse representation for vectors, so the time taken by a single kernel evaluation should, for each dataset, be roughly the same across all four algorithms.
- The authors can see that, among the three values of k, all significantly greater than 100, the run with the largest mini-batch size made the least progress while performing the same amount of computation.
8 Conclusions
- The authors described and analyzed a simple and effective algorithm for approximately minimizing the SVM objective function.
- The authors derived fast rates of convergence and experimented with the algorithm.
- The authors' empirical results indicate that for linear kernels, Pegasos achieves state-of-the-art results, despite, or possibly due to, its simplicity.
- The authors would like to thank Léon Bottou for useful discussions and suggestions, and Thorsten Joachims and Léon Bottou for help with the experiments.
- Part of this work was done while SS and NS were visiting IBM research labs, Haifa, Israel.
Citations
10,501 citations
Cites background from "Pegasos: primal estimated sub-gradi..."
...For linear SVMs, a learning rate of 1/t has been shown to work well [37]....
8,059 citations
Cites methods from "Pegasos: primal estimated sub-gradi..."
...A specific instance of this the Pegasos algorithm (Shalev-Shwartz et al. 2007), which stands for “primal estimated sub-gradient solver for SVM”....
2,701 citations
Cites methods from "Pegasos: primal estimated sub-gradi..."
...The other SVM solver is a kernelized version of the Pegasos algorithm introduced by Shalev-Shwartz et al. (2007)....
2,291 citations
Cites methods from "Pegasos: primal estimated sub-gradi..."
...This update is identical to that used by the primal support vector machine (SVM) algorithm [448], except that the updates are performed only for the misclassified points in the perceptron, whereas the SVM also uses the marginally correct points near the decision boundary for updates....
Frequently Asked Questions (9)
Q2. What is the criterion for an ε-accurate optimization?
(For instance, if full optimization of the SVM yields a test classification error of 1%, then the authors chose ε such that an ε-accurate optimization would guarantee a test classification error of at most 1.1%.)
Q3. What is the time complexity of the algorithms in this family?
While algorithms in this family are fairly simple to implement and enjoy general asymptotic convergence properties [8], the time complexity of most of them is typically superlinear in the training set size m.
Q4. How does one obtain a sub-gradient of a maximum of differentiable functions?
If f(w) = max_i f_i(w) for r differentiable functions f_1, . . . , f_r, and j = arg max_i f_i(w_0), then the gradient of f_j at w_0 is a sub-gradient of f at w_0.
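For instance, for the hinge loss f(w) = max{0, 1 − y⟨w, x⟩}, this rule gives the sub-gradient −y x at any w_0 where 1 − y⟨w_0, x⟩ > 0, and 0 wherever the loss vanishes.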
Q5. What is the way to find a solution to the SVM objective?
Cutting Planes Approach: Recently, Joachims [21] proposed SVM-Perf, which uses a cutting planes method to find a solution with accuracy ε in time O(md/(λε²)).
Q6. What is the performance of the kernelized Pegasos variant?
As the authors show in the sequel, the kernelized Pegasos variant described in Section 4 gives good performance on a range of kernel SVM problems, provided that these problems have sufficient regularization.
Q7. What is the traditional form of the SVM learning problem?
In its more traditional form, the SVM learning problem was described as the following constrained optimization problem: min_{w,ξ} (1/2)‖w‖² + C ∑_{i=1}^{m} ξ_i s.t. ∀i ∈ [m]: ξ_i ≥ 0, ξ_i ≥ 1 − y_i ⟨w, x_i⟩.
Q8. How can the Pegasos algorithm be implemented using only kernel evaluations?
The authors now show that the Pegasos algorithm can be implemented using only kernel evaluations, without direct access to the feature vectors φ(x) or explicit access to the weight vector w.
Q9. What is the primal suboptimality threshold for the Pegasos variant?
As in the linear experiments, the authors chose a primal suboptimality threshold for each dataset which guarantees a test classification error within 10% of that at the optimum.