
A Dual Coordinate Descent Method for Large-scale Linear SVM
Cho-Jui Hsieh b92085@csie.ntu.edu.tw
Kai-Wei Chang b92084@csie.ntu.edu.tw
Chih-Jen Lin cjlin@csie.ntu.edu.tw
Department of Computer Science, National Taiwan University, Taipei 106, Taiwan
S. Sathiya Keerthi selvarak@yahoo-inc.com
Yahoo! Research, Santa Clara, USA
S. Sundararajan ssrajan@yahoo-inc.com
Yahoo! Labs, Bangalore, India
Abstract
In many applications, data appear with a huge number of instances as well as features. Linear Support Vector Machines (SVM) is one of the most popular tools to deal with such large-scale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1- and L2-loss functions. The proposed method is simple and reaches an ε-accurate solution in O(log(1/ε)) iterations. Experiments indicate that our method is much faster than state-of-the-art solvers such as Pegasos, TRON, SVM^perf, and a recent primal coordinate descent implementation.
1. Introduction
Support vector machines (SVM) (Boser et al., 1992)
are useful for data classification. Given a set of
instance-label pairs (x_i, y_i), i = 1, . . . , l, x_i ∈ R^n,
y_i ∈ {−1, +1}, SVM requires the solution of the following
unconstrained optimization problem:
\min_{w} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi(w; x_i, y_i),   (1)
where ξ(w; x_i, y_i) is a loss function, and C > 0 is a
penalty parameter. Two common loss functions are:
\max(1 - y_i w^T x_i, 0) \quad \text{and} \quad \max(1 - y_i w^T x_i, 0)^2.   (2)
The former is called L1-SVM, while the latter is L2-
SVM. In some applications, an SVM problem appears
with a bias term b. One often deals with this term by
appending each instance with an additional dimension:
x_i^T \leftarrow [x_i^T, 1], \qquad w^T \leftarrow [w^T, b].   (3)
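As a concrete illustration of (2) and (3), the following minimal Python sketch (ours, not from the paper; the variable names are illustrative) evaluates the two losses and appends the constant feature that absorbs the bias term b:

```python
import numpy as np

def l1_loss(w, x, y):
    # L1-SVM (hinge) loss: max(1 - y * w^T x, 0)
    return max(1.0 - y * np.dot(w, x), 0.0)

def l2_loss(w, x, y):
    # L2-SVM (squared hinge) loss: max(1 - y * w^T x, 0)^2
    return l1_loss(w, x, y) ** 2

# Bias handling as in (3): append 1 to every instance so that
# the last component of w plays the role of b.
X = np.array([[0.5, -1.2], [1.0, 0.3]])            # two instances, two features
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # x_i <- [x_i, 1]
w_aug = np.zeros(X_aug.shape[1])                   # w <- [w, b], initialized to 0
print(l1_loss(w_aug, X_aug[0], +1), l2_loss(w_aug, X_aug[0], +1))
```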
Problem (1) is often referred to as the primal form of
SVM. One may instead solve its dual problem:
\min_{\alpha} \quad f(\alpha) = \frac{1}{2} \alpha^T \bar{Q} \alpha - e^T \alpha
\text{subject to} \quad 0 \le \alpha_i \le U, \; \forall i,   (4)
where Q̄ = Q + D, D is a diagonal matrix, and Q_ij =
y_i y_j x_i^T x_j. For L1-SVM, U = C and D_ii = 0, ∀i. For
L2-SVM, U = ∞ and D_ii = 1/(2C), ∀i.
An SVM usually maps training vectors into a high-
dimensional space via a nonlinear function φ(x). Due
to the high dimensionality of the vector variable w,
one solves the dual problem (4) by the kernel trick
(i.e., using a closed form of φ(x_i)^T φ(x_j)). We refer to
such a problem as nonlinear SVM. In some applica-
tions, data appear in a rich-dimensional feature space,
and the performance is similar with or without a nonlinear
mapping. If data are not mapped, we can often train
much larger data sets. We refer to such cases as linear
SVM; these are often encountered in applications such
as document classification. In this paper, we aim at
solving very large linear SVM problems.
Recently, many methods have been proposed for lin-
ear SVM in large-scale scenarios. For L1-SVM, Zhang
(2004), Shalev-Shwartz et al. (2007), Bottou (2007)
propose various stochastic gradient descent methods.
Collins et al. (2008) apply an exponentiated gradi-
ent method. SVM^perf (Joachims, 2006) uses a cutting
plane technique. Smola et al. (2008) apply bundle
methods, and view SVM^perf as a special case. For
L2-SVM, Keerthi and DeCoste (2005) propose mod-
ified Newton methods. A trust region Newton method
(TRON) (Lin et al., 2008) is proposed for logistic re-
gression and L2-SVM. These algorithms focus on dif-
ferent aspects of the training speed. Some aim at
quickly obtaining a usable model, while others achieve
fast final convergence in solving the optimization prob-
lem in (1) or (4). Moreover, among these methods,
Joachims (2006), Smola et al. (2008) and Collins et al.
(2008) solve SVM via the dual (4). Others consider the
primal form (1). The decision of using primal or dual
is of course related to the algorithm design.
Very recently, Chang et al. (2008) propose using co-
ordinate descent methods for solving primal L2-SVM.
Experiments show that their approach more quickly
obtains a useful model than some of the above meth-
ods. Coordinate descent, a popular optimization tech-
nique, updates one variable at a time by minimizing a
single-variable sub-problem. If one can efficiently solve
this sub-problem, then it can be a competitive opti-
mization method. Due to the non-differentiability of
the primal L1-SVM, Chang et al.'s work is restricted to
L2-SVM. Moreover, as primal L2-SVM is differentiable
but not twice differentiable, certain considerations are
needed in solving the single-variable sub-problem.
While the dual form (4) involves bound constraints
0 ≤ α_i ≤ U, its objective function is twice differentiable
for both L1- and L2-SVM. In this paper, we investi-
gate coordinate descent methods for the dual problem
(4). We prove that an ε-optimal solution is obtained
in O(log(1/ε)) iterations. We propose an implemen-
tation using a random order of sub-problems at each
iteration, which leads to very fast training. Experi-
ments indicate that our method is more efficient than
the primal coordinate descent method. As Chang et al.
(2008) solve the primal, they require easy access to
a feature's corresponding data values. However, in
practice one often has easier access to values per in-
stance. Solving the dual exploits this advantage, so our
implementation is simpler than that of Chang et al. (2008).
Early SVM papers (Mangasarian & Musicant, 1999;
Friess et al., 1998) have discussed coordinate descent
methods for the SVM dual form.¹ However, they
do not focus on large data using the linear kernel.
Crammer and Singer (2003) proposed an online setting
for multi-class SVM without considering large sparse
data. Recently, Bordes et al. (2007) applied a coor-
dinate descent method to multi-class SVM, but they
focus on nonlinear kernels. In this paper, we point
out that dual coordinate descent methods take crucial
advantage of the linear kernel and outperform other
solvers when the numbers of data and features are both
large.

¹Note that coordinate descent methods for uncon-
strained quadratic programming can be traced back to
Hildreth (1957).
Coordinate descent methods for (4) are related to the
popular decomposition methods for training nonlinear
SVM. In this paper, we show their key differences and
explain why earlier studies on decomposition meth-
ods failed to modify their algorithms in an efficient
way like ours for large-scale linear SVM. We also dis-
cuss the connection to other linear SVM works such as
(Crammer & Singer, 2003; Collins et al., 2008; Shalev-
Shwartz et al., 2007).
This paper is organized as follows. In Section 2, we
describe our proposed algorithm. Implementation is-
sues are investigated in Section 3. Section 4 discusses
the connection to other methods. In Section 5, we
compare our method with state of the art implemen-
tations for large linear SVM. Results show that the
new method is more efficient.
2. A Dual Coordinate Descent Method
In this section, we describe our coordinate descent
method for L1- and L2-SVM. The optimization pro-
cess starts from an initial point α^0 ∈ R^l and generates
a sequence of vectors {α^k}_{k=0}^∞. We refer to the process
from α^k to α^{k+1} as an outer iteration. In each outer
iteration we have l inner iterations, so that sequentially
α_1, α_2, . . . , α_l are updated. Each outer iteration
thus generates vectors α^{k,i} ∈ R^l, i = 1, . . . , l + 1, such
that α^{k,1} = α^k, α^{k,l+1} = α^{k+1}, and

\alpha^{k,i} = [\alpha_1^{k+1}, \ldots, \alpha_{i-1}^{k+1}, \alpha_i^{k}, \ldots, \alpha_l^{k}]^T, \quad i = 2, \ldots, l.
For updating α^{k,i} to α^{k,i+1}, we solve the following
one-variable sub-problem:

\min_{d} \; f(\alpha^{k,i} + d e_i) \quad \text{subject to} \quad 0 \le \alpha_i^{k} + d \le U,   (5)

where e_i = [0, . . . , 0, 1, 0, . . . , 0]^T.
. The objective func-
tion of (5) is a simple quadratic function of d:
f(\alpha^{k,i} + d e_i) = \frac{1}{2} \bar{Q}_{ii} d^2 + \nabla_i f(\alpha^{k,i}) d + \text{constant},   (6)
where ∇_i f is the ith component of the gradient ∇f.
One can easily see that (5) has an optimum at d = 0
(i.e., no need to update α_i) if and only if

\nabla_i^P f(\alpha^{k,i}) = 0,   (7)
where ∇^P f(α) means the projected gradient

\nabla_i^P f(\alpha) =
\begin{cases}
\nabla_i f(\alpha) & \text{if } 0 < \alpha_i < U, \\
\min(0, \nabla_i f(\alpha)) & \text{if } \alpha_i = 0, \\
\max(0, \nabla_i f(\alpha)) & \text{if } \alpha_i = U.
\end{cases}   (8)

Algorithm 1 A dual coordinate descent method for
Linear SVM

Given α and the corresponding w = Σ_i y_i α_i x_i.
While α is not optimal
   For i = 1, . . . , l
      (a) G = y_i w^T x_i − 1 + D_ii α_i
      (b) PG = min(G, 0)  if α_i = 0,
               max(G, 0)  if α_i = U,
               G          if 0 < α_i < U
      (c) If |PG| ≠ 0,
             ᾱ_i ← α_i
             α_i ← min(max(α_i − G/Q̄_ii, 0), U)
             w ← w + (α_i − ᾱ_i) y_i x_i
If (7) holds, we move to the index i+1 without updat-
ing α_i^{k,i}. Otherwise, we must find the optimal solution
of (5). If Q̄_ii > 0, the solution is easily seen to be:

\alpha_i^{k,i+1} = \min\left(\max\left(\alpha_i^{k,i} - \frac{\nabla_i f(\alpha^{k,i})}{\bar{Q}_{ii}},\, 0\right),\, U\right).   (9)
We thus need to calculate Q̄_ii and ∇_i f(α^{k,i}). First,
Q̄_ii = x_i^T x_i + D_ii can be precomputed and stored in
the memory. Second, to evaluate ∇_i f(α^{k,i}), we use

\nabla_i f(\alpha) = (\bar{Q}\alpha)_i - 1 = \sum_{j=1}^{l} \bar{Q}_{ij} \alpha_j - 1.   (10)
Q̄ may be too large to be stored, so one calculates Q̄'s
ith row when doing (10). If n̄ is the average number
of nonzero elements per instance, and O(n̄) is needed
for each kernel evaluation, then calculating the ith row
of the kernel matrix takes O(ln̄). Such operations are
expensive. However, for a linear SVM, we can define

w = \sum_{j=1}^{l} y_j \alpha_j x_j,   (11)
so (10) becomes

\nabla_i f(\alpha) = y_i w^T x_i - 1 + D_{ii} \alpha_i.   (12)

To evaluate (12), the main cost is O(n̄) for calculating
w^T x_i. This is much smaller than O(ln̄). To apply
(12), w must be maintained throughout the coordinate
descent procedure. Calculating w by (11) takes O(ln̄)
operations, which are too expensive. Fortunately, if
ᾱ_i is the current value and α_i is the value after the
updating, we can maintain w by

w \leftarrow w + (\alpha_i - \bar{\alpha}_i) y_i x_i.   (13)
The number of operations is only O(n̄). To have the
first w, one can use α^0 = 0 so w = 0. In the end, we
obtain the optimal w of the primal problem (1) as the
primal-dual relationship implies (11).
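The following sketch (ours, with made-up helper names) contrasts the O(ln̄) gradient evaluation (10) with the O(n̄) evaluation (12) that maintaining w makes possible; both return the same value:

```python
import numpy as np

def grad_i_via_kernel_row(i, alpha, X, y, D_ii):
    # Eq. (10): needs the ith row of Q_bar, i.e., O(l * n_bar) work.
    Qbar_row = y[i] * y * (X @ X[i]) + (np.arange(len(y)) == i) * D_ii
    return Qbar_row @ alpha - 1.0

def grad_i_via_w(i, alpha, X, y, w, D_ii):
    # Eq. (12): one inner product with the maintained w, i.e., O(n_bar) work.
    return y[i] * np.dot(w, X[i]) - 1.0 + D_ii * alpha[i]

X = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 1.0, 0.0]])
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.3, 0.7, 0.1])
D_ii = 0.0                          # L1-SVM
w = (y * alpha) @ X                 # eq. (11)
print(grad_i_via_kernel_row(1, alpha, X, y, D_ii),
      grad_i_via_w(1, alpha, X, y, w, D_ii))   # identical values
```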
If Q̄_ii = 0, we have D_ii = 0, Q_ii = x_i^T x_i = 0, and
hence x_i = 0. This occurs only in L1-SVM without
the bias term by (3). From (12), if x_i = 0, then
∇_i f(α^{k,i}) = −1. As U = C < ∞ for L1-SVM, the
solution of (5) makes the new α_i^{k,i+1} = U. We can
easily include this case in (9) by setting 1/Q̄_ii = ∞.
Briefly, our algorithm uses (12) to compute ∇_i f(α^{k,i}),
checks the optimality of the sub-problem (5) by (7),
updates α_i by (9), and then maintains w by (13). A
description is in Algorithm 1. The cost per iteration
(i.e., from α^k to α^{k+1}) is O(ln̄). The main memory
requirement is on storing x_1, . . . , x_l. For the conver-
gence, we prove the following theorem using techniques
in (Luo & Tseng, 1992):
Theorem 1 For L1-SVM and L2-SVM, {α^{k,i}} gen-
erated by Algorithm 1 globally converges to an optimal
solution α^∗. The convergence rate is at least linear:
there are 0 < µ < 1 and an iteration k_0 such that

f(\alpha^{k+1}) - f(\alpha^*) \le \mu \left( f(\alpha^{k}) - f(\alpha^*) \right), \quad \forall k \ge k_0.   (14)
The proof is in Appendix 7.1. The global convergence
result is quite remarkable. Usually for a convex but
not strictly convex problem (e.g., L1-SVM), one can
only obtain that any limit point is optimal. We define
an ε-accurate solution α if f(α) ≤ f(α^∗) + ε. By
(14), our algorithm obtains an ε-accurate solution in
O(log(1/ε)) iterations.²
3. Implementation Issues
3.1. Random Permutation of Sub-problems
In Algorithm 1, the coordinate descent algorithm
solves the one-variable sub-problems in the order of
α_1, . . . , α_l. Past results such as (Chang et al., 2008)
. Past results such as (Chang et al., 2008)
show that solving sub-problems in an arbitrary order
may give faster convergence. This inspires us to ran-
domly permute the sub-problems at each outer itera-
tion. Formally, at the kth outer iteration, we permute
{1, . . . , l} to {π(1), . . . , π(l)}, and solve sub-problems
in the order of α_{π(1)}, α_{π(2)}, . . . , α_{π(l)}. Similar to Al-
gorithm 1, the algorithm generates a sequence {α^{k,i}}
such that α^{k,1} = α^k, α^{k,l+1} = α^{k+1,1}, and

\alpha_t^{k,i} =
\begin{cases}
\alpha_t^{k+1} & \text{if } \pi_k^{-1}(t) < i, \\
\alpha_t^{k} & \text{if } \pi_k^{-1}(t) \ge i.
\end{cases}

²A constant k_0 appears in (14). A newer result without
needing k_0 is in Wang and Lin (2014).

The update from α^{k,i} to α^{k,i+1} is by

\alpha_t^{k,i+1} = \alpha_t^{k,i} + \arg\min_{0 \le \alpha_t^{k,i} + d \le U} f(\alpha^{k,i} + d e_t) \quad \text{if } \pi_k^{-1}(t) = i.
We prove that Theorem 1 is still valid. Hence, the new
setting obtains an ε-accurate solution in O(log(1/ε)) it-
erations. A simple experiment reveals that this setting
of permuting sub-problems is much faster than Algo-
rithm 1. The improvement is also bigger than that
observed in (Chang et al., 2008) for primal coordinate
descent methods.
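A sketch (ours, not the paper's implementation) of the only change this permuted variant makes to Algorithm 1: draw a fresh permutation π_k at every outer iteration and sweep the coordinates in that order; `inner_update` stands for steps (a)-(c) applied to one coordinate.

```python
import numpy as np

rng = np.random.default_rng(0)

def outer_iteration_permuted(inner_update, l):
    # One outer iteration of Section 3.1: visit all l coordinates in a random order.
    for i in rng.permutation(l):     # pi_k(1), ..., pi_k(l)
        inner_update(int(i))         # steps (a)-(c) of Algorithm 1 for coordinate i

# Example with a dummy inner update that just records the visiting order.
order = []
outer_iteration_permuted(order.append, 5)
print(order)   # a random permutation of 0..4; it changes at every outer iteration
```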
3.2. Shrinking
Eq. (4) contains constraints 0 ≤ α_i ≤ U. If an
α_i is 0 or U for many iterations, it may remain the
same. To speed up decomposition methods for non-
linear SVM (discussed in Section 4.1), the shrinking
technique (Joachims, 1998) reduces the size of the op-
timization problem without considering some bounded
variables. Below we show it is much easier to apply this
technique to linear SVM than the nonlinear case.
If A is the subset after removing some elements and
Ā = {1, . . . , l} \ A, then the new problem is

\min_{\alpha_A} \quad \frac{1}{2} \alpha_A^T \bar{Q}_{AA} \alpha_A + (\bar{Q}_{A\bar{A}} \alpha_{\bar{A}} - e_A)^T \alpha_A
\text{subject to} \quad 0 \le \alpha_i \le U, \; i \in A,   (15)

where Q̄_AA, Q̄_AĀ are sub-matrices of Q̄, and α_Ā is
problem consumes less time and memory. Once (15) is
solved, we must check if the vector α is optimal for (4).
This check needs the whole gradient f(α). Since
i
f(α) =
¯
Q
i,A
α
A
+
¯
Q
i,
¯
A
α
¯
A
1,
if i A, and one stores
¯
Q
i,
¯
A
α
¯
A
before solving (15), we
already have
i
f(α). However, for all i / A, we must
calculate the corresponding rows of
¯
Q. This step, re-
ferred to as the reconstruction of gradients in training
nonlinear SVM, is very time consuming. It may cost
up to O(l
2
¯n) if each kernel evaluation is O(¯n).
For linear SVM, in solving the smaller problem (15),
we still have the vector
w = \sum_{i \in A} y_i \alpha_i x_i + \sum_{i \in \bar{A}} y_i \alpha_i x_i,

though only the first part Σ_{i∈A} y_i α_i x_i is updated.
Therefore, using (12), ∇f(α) is easily available. Below
we demonstrate a shrinking implementation so that re-
constructing the whole ∇f(α) is never needed.
Our method is related to what LIBSVM (Chang & Lin,
2011) uses. From the optimality condition of bound-
constrained problems, α is optimal for (4) if and only if
Algorithm 2 Coordinate descent algorithm with ran-
domly selecting one instance at a time

Given α and the corresponding w = Σ_i y_i α_i x_i.
While α is not optimal
   Randomly choose i ∈ {1, . . . , l}.
   Do steps (a)-(c) of Algorithm 1 to update α_i.

∇^P f(α) = 0, where ∇^P f(α) is the projected gradient
defined in (8). We then prove the following result:
Theorem 2 Let α^∗ be the convergent point of {α^{k,i}}.

1. If α_i^∗ = 0 and ∇_i f(α^∗) > 0, then ∃ k_i such that
   ∀k ≥ k_i, ∀s, α_i^{k,s} = 0.

2. If α_i^∗ = U and ∇_i f(α^∗) < 0, then ∃ k_i such that
   ∀k ≥ k_i, ∀s, α_i^{k,s} = U.

3. lim_{k→∞} max_j ∇_j^P f(α^{k,j}) = lim_{k→∞} min_j ∇_j^P f(α^{k,j}) = 0.
The proof is in Appendix 7.3. During the opti-
mization procedure, ∇^P f(α^k) ≠ 0, and in general
max_j ∇_j^P f(α^k) > 0 and min_j ∇_j^P f(α^k) < 0. These
two values measure how the current solution violates
the optimality condition. In our iterative procedure,
what we have are ∇_i f(α^{k,i}), i = 1, . . . , l. Hence, at
the (k − 1)st iteration, we obtain

M^{k-1} \equiv \max_j \nabla_j^P f(\alpha^{k-1,j}), \quad m^{k-1} \equiv \min_j \nabla_j^P f(\alpha^{k-1,j}).
Then at each inner step of the kth iteration, before
updating α_i^{k,i} to α_i^{k,i+1}, this element is shrunken if
one of the following two conditions holds:

\alpha_i^{k,i} = 0 \text{ and } \nabla_i f(\alpha^{k,i}) > \bar{M}^{k-1},
\alpha_i^{k,i} = U \text{ and } \nabla_i f(\alpha^{k,i}) < \bar{m}^{k-1},   (16)
where

\bar{M}^{k-1} = \begin{cases} M^{k-1} & \text{if } M^{k-1} > 0, \\ \infty & \text{otherwise,} \end{cases}
\qquad
\bar{m}^{k-1} = \begin{cases} m^{k-1} & \text{if } m^{k-1} < 0, \\ -\infty & \text{otherwise.} \end{cases}
In (16), M̄^{k−1} must be strictly positive, so we set it to
∞ if M^{k−1} ≤ 0. From Theorem 2, elements satisfying
the “if condition” of properties 1 and 2 meet (16) after
certain iterations, and are then correctly removed from
the optimization. To have a more aggressive shrinking,
one may multiply both M̄^{k−1} and m̄^{k−1} in (16) by a
threshold smaller than one.
Property 3 of Theorem 2 indicates that with a toler-
ance ε,

M^{k} - m^{k} < \epsilon   (17)
is satisfied after a finite number of iterations. Hence
(17) is a valid stopping condition. We also use it for

Table 1. A comparison between decomposition methods
(Decomp.) and dual coordinate descent (DCD). For both
methods, we consider that one α_i is updated at a time. We
assume Decomp. maintains gradients, but DCD does not.
The average number of nonzeros per instance is n̄.

                    Nonlinear SVM           Linear SVM
                    Decomp.     DCD         Decomp.     DCD
Update α_i          O(1)        O(ln̄)       O(1)        O(n̄)
Maintain ∇f(α)      O(ln̄)       NA          O(ln̄)       NA
smaller problems (15). If at the kth iteration, (17)
for (15) is reached, we enlarge A to {1, . . . , l}, set
M̄^k = ∞, m̄^k = −∞ (so no shrinking at the (k + 1)st
iteration), and continue regular iterations. Thus, we
do shrinking without reconstructing gradients.
In Appendix 7.4, we provide an algorithm to show the
convergence and finite termination of Algorithm 1
with shrinking.
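The shrinking test (16) and the stopping condition (17) reduce to a few comparisons per coordinate. A condensed sketch (ours, not the paper's exact implementation) of how an active set could be maintained within one outer iteration; `grad_of` and `update` are assumed callables that evaluate (12) and perform steps (b)-(c) of Algorithm 1:

```python
import numpy as np

def sweep_with_shrinking(active, alpha, grad_of, update, U, M_bar, m_bar):
    """One pass over the active set; returns the shrunken set and new M^k, m^k.

    M_bar/m_bar are the thresholds (16) computed from the previous outer iteration."""
    M_k, m_k = -np.inf, np.inf
    kept = []
    for i in active:
        G = grad_of(i)
        # Shrinking conditions (16): bounded variables with strongly violating gradients.
        if alpha[i] == 0.0 and G > M_bar:
            continue
        if alpha[i] == U and G < m_bar:
            continue
        kept.append(i)
        PG = min(G, 0.0) if alpha[i] == 0.0 else (max(G, 0.0) if alpha[i] == U else G)
        M_k, m_k = max(M_k, PG), min(m_k, PG)
        update(i, G)
    return kept, M_k, m_k

def converged(M_k, m_k, eps=1e-3):
    # Stopping condition (17), checked after a sweep over the full index set.
    return M_k - m_k < eps
```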
3.3. An Online Setting
In some applications, the number of instances is huge,
so going over all α_1, . . . , α_l causes an expensive outer
iteration. Instead, one can randomly choose an index
i_k at a time, and update only α_{i_k} at the kth outer
iteration. A description is in Algorithm 2. The setting
is related to (Crammer & Singer, 2003; Collins et al.,
2008). See also the discussion in Section 4.2.
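A sketch (ours) of Algorithm 2's loop; `step_abc` is a hypothetical callable that applies steps (a)-(c) of Algorithm 1 to a single coordinate and is assumed to come from an Algorithm 1 implementation:

```python
import numpy as np

def online_dual_cd(step_abc, l, max_updates=100000, seed=0):
    # Algorithm 2: pick one random index per (cheap) outer iteration.
    rng = np.random.default_rng(seed)
    for _ in range(max_updates):
        i = int(rng.integers(l))   # pick i uniformly at random among the l instances
        step_abc(i)                # update alpha_i and maintain w as in Algorithm 1

# Usage (assuming an Algorithm 1 implementation exposing its inner step):
#   online_dual_cd(step_abc=my_inner_step, l=len(y))
```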
4. Relations with Other Methods
4.1. Decomposition Methods for Nonlinear
SVM
Decomposition methods are one of the most popular
approaches for training nonlinear SVM. As the kernel
matrix is dense and cannot be stored in the computer
memory, decomposition methods solve a sub-problem
of a few variables at each iteration. Only a small num-
ber of corresponding kernel columns are needed, so the
memory problem is resolved. If the number of vari-
ables is restricted to one, a decomposition method is
like the online coordinate descent in Section 3.3, but
it differs in the way it selects variables for updating.
It has been shown (Keerthi & DeCoste, 2005) that,
for linear SVM, decomposition methods are inefficient.
On the other hand, here we are pointing out that dual
coordinate descent is efficient for linear SVM. There-
fore, it is important to discuss the relationship between
decomposition methods and our method.
In early decomposition methods (Osuna et al., 1997;
Platt, 1998), variables minimized at an iteration are
selected by certain heuristics.
However, subsequent developments (Joachims, 1998;
Chang & Lin, 2011; Keerthi et al., 2001) all use gra-
dient information to conduct the selection. The main
reason is that maintaining the whole gradient does not
introduce extra cost. Here we explain the detail by as-
suming that one variable of α is chosen and updated at
a time.³ To set up and solve the sub-problem (6), one
uses (10) to calculate ∇_i f(α). If O(n̄) effort is needed
for each kernel evaluation, obtaining the ith row of
the kernel matrix takes O(ln̄) effort. If instead one
maintains the whole gradient, then ∇_i f(α) is directly
available. After updating α_i^{k,i} to α_i^{k,i+1}, we obtain Q̄'s
ith column (same as the ith row due to the symmetry
of Q̄), and calculate the new whole gradient:

\nabla f(\alpha^{k,i+1}) = \nabla f(\alpha^{k,i}) + \bar{Q}_{:,i} (\alpha_i^{k,i+1} - \alpha_i^{k,i}),   (18)
where Q̄_{:,i} is the ith column of Q̄. The cost is O(ln̄)
for Q̄_{:,i} and O(l) for (18). Therefore, maintaining the
whole gradient does not cost more. As using the whole
gradient implies fewer iterations (i.e., faster conver-
gence due to the ability to choose for updating the vari-
able that violates optimality most), one should take
this advantage. However, the situation for linear SVM
is very different. With the different way (12) to calcu-
late ∇_i f(α), the cost to update one α_i is only O(n̄). If
we still maintain the whole gradient, evaluating (12) l
times takes O(ln̄) effort. We gather this comparison of
different situations in Table 1. Clearly, for nonlinear
SVM, one should use decomposition methods by main-
taining the whole gradient. However, for linear SVM,
if l is large, the cost per iteration without maintaining
gradients is much smaller than that with. Hence, the
coordinate descent method can be faster than the de-
composition method by using many cheap iterations.
An earlier attempt to speed up decomposition methods
for linear SVM is (Kao et al., 2004). However, it failed
to derive our method here because it does not give up
maintaining gradients.
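To make the per-update contrast of Table 1 concrete, the sketch below (illustrative only) shows the bookkeeping of the two strategies for one coordinate i; `Qbar_column(i)` is a hypothetical helper standing in for the O(ln̄) kernel-column computation a decomposition method needs.

```python
import numpy as np

def decomposition_style_update(i, delta, grad, Qbar_column):
    # Maintain the whole gradient as in (18): needs Q_bar's ith column, O(l * n_bar).
    grad += Qbar_column(i) * delta           # delta = alpha_i^{new} - alpha_i^{old}
    return grad

def dcd_style_update(i, delta, w, X, y):
    # Maintain only w as in (13): O(n_bar) per update; gradients recomputed on demand by (12).
    w += delta * y[i] * X[i]
    return w

# Toy check that both views stay consistent with (12).
X = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
alpha = np.zeros(3)
w = np.zeros(2)
grad = -np.ones(3)                            # nabla f(0) = -e
Qbar_column = lambda i: y * y[i] * (X @ X[i])  # L1-SVM: D = 0
delta, i = 0.4, 0
alpha[i] += delta
grad = decomposition_style_update(i, delta, grad, Qbar_column)
w = dcd_style_update(i, delta, w, X, y)
print(np.allclose(grad, y * (X @ w) - 1.0))   # True: both agree with (12)
```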
4.2. Existing Linear SVM Methods
We discussed in Section 1 and other places the dif-
ference between our method and a primal coordinate
descent method (Chang et al., 2008). Below we de-
scribe the relations with other linear SVM methods.
We mentioned in Section 3.3 that our Algorithm 2 is
related to the online mode in (Collins et al., 2008).
They aim at solving multi-class and structured prob-
lems. At each iteration an instance is used; then a
sub-problem of several variables is solved. They ap-
proximately minimize the sub-problem, but for two-
class case, one can exactly solve it by (9). For the
³Solvers like LIBSVM update at least two variables due
to a linear constraint in their dual problems. Here (4) has
no such constraint, so selecting one variable is possible.

References
Boser, B. E., Guyon, I. M., and Vapnik, V. N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT), 1992.

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011.

Platt, J. C. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning. MIT Press, 1998.

Schölkopf, B., Burges, C. J. C., and Smola, A. J. (eds.). Advances in Kernel Methods: Support Vector Learning. MIT Press, 1998.