
A Dual Coordinate Descent Method for Large-scale Linear SVM
Cho-Jui Hsieh b92085@csie.ntu.edu.tw
Kai-Wei Chang b92084@csie.ntu.edu.tw
Chih-Jen Lin cjlin@csie.ntu.edu.tw
Department of Computer Science, National Taiwan University, Taipei 106, Taiwan
S. Sathiya Keerthi selvarak@yahoo-inc.com
Yahoo! Research, Santa Clara, USA
S. Sundararajan ssrajan@yahoo-inc.com
Yahoo! Labs, Bangalore, India
Abstract
In many applications, data appear with a huge number of instances as well as features. Linear Support Vector Machines (SVM) is one of the most popular tools to deal with such large-scale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1- and L2-loss functions. The proposed method is simple and reaches an ε-accurate solution in O(log(1/ε)) iterations. Experiments indicate that our method is much faster than state-of-the-art solvers such as Pegasos, TRON, SVM^perf, and a recent primal coordinate descent implementation.
1. Introduction
Support vector machines (SVM) (Boser et al., 1992)
are useful for data classification. Given a set of
instance-label pairs (x_i, y_i), i = 1, . . . , l, x_i ∈ R^n,
y_i ∈ {−1, +1}, SVM requires the solution of the following
unconstrained optimization problem:
\min_{w} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi(w; x_i, y_i),   (1)
where ξ(w; x_i, y_i) is a loss function, and C > 0 is a
penalty parameter. Two common loss functions are:
\max(1 - y_i w^T x_i, 0) \quad \text{and} \quad \max(1 - y_i w^T x_i, 0)^2.   (2)
The former is called L1-SVM, while the latter is L2-
SVM. In some applications, an SVM problem appears
with a bias term b. One often deals with this term by
appending each instance with an additional dimension:
x_i^T \leftarrow [x_i^T, 1], \qquad w^T \leftarrow [w^T, b].   (3)
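As a concrete illustration of (2) and (3), the following minimal Python sketch (ours, not from the paper; the variable names are illustrative) evaluates the two losses and appends the constant feature that absorbs the bias term b:

```python
import numpy as np

def l1_loss(w, x, y):
    # L1-SVM (hinge) loss: max(1 - y * w^T x, 0)
    return max(1.0 - y * np.dot(w, x), 0.0)

def l2_loss(w, x, y):
    # L2-SVM (squared hinge) loss: max(1 - y * w^T x, 0)^2
    return l1_loss(w, x, y) ** 2

# Bias handling as in (3): append 1 to every instance so that
# the last component of w plays the role of b.
X = np.array([[0.5, -1.2], [1.0, 0.3]])            # two instances, two features
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # x_i <- [x_i, 1]
w_aug = np.zeros(X_aug.shape[1])                   # w <- [w, b], initialized to 0
print(l1_loss(w_aug, X_aug[0], +1), l2_loss(w_aug, X_aug[0], +1))
```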
Problem (1) is often referred to as the primal form of
SVM. One may instead solve its dual problem:
\min_{\alpha} \quad f(\alpha) = \frac{1}{2} \alpha^T \bar{Q} \alpha - e^T \alpha
\text{subject to} \quad 0 \le \alpha_i \le U, \; \forall i,   (4)
where Q̄ = Q + D, D is a diagonal matrix, and Q_ij =
y_i y_j x_i^T x_j. For L1-SVM, U = C and D_ii = 0, ∀i. For
L2-SVM, U = ∞ and D_ii = 1/(2C), ∀i.
An SVM usually maps training vectors into a high-
dimensional space via a nonlinear function φ(x). Due
to the high dimensionality of the vector variable w,
one solves the dual problem (4) by the kernel trick
(i.e., using a closed form of φ(x_i)^T φ(x_j)). We refer to
such a problem as nonlinear SVM. In some applica-
tions, data appear in a rich-dimensional feature space,
and the performance is similar with or without a nonlinear
mapping. If data are not mapped, we can often train
much larger data sets. We refer to such cases as linear
SVM; these are often encountered in applications such
as document classification. In this paper, we aim at
solving very large linear SVM problems.
Recently, many methods have been proposed for lin-
ear SVM in large-scale scenarios. For L1-SVM, Zhang
(2004), Shalev-Shwartz et al. (2007), Bottou (2007)
propose various stochastic gradient descent methods.
Collins et al. (2008) apply an exponentiated gradi-
ent method. SVM^perf (Joachims, 2006) uses a cutting
plane technique. Smola et al. (2008) apply bundle
methods, and view SVM^perf as a special case. For
L2-SVM, Keerthi and DeCoste (2005) propose mod-
ified Newton methods. A trust region Newton method
(TRON) (Lin et al., 2008) is proposed for logistic re-
gression and L2-SVM. These algorithms focus on dif-
ferent aspects of the training speed. Some aim at
quickly obtaining a usable model, while others achieve
fast final convergence in solving the optimization prob-
lem in (1) or (4). Moreover, among these methods,
Joachims (2006), Smola et al. (2008) and Collins et al.
(2008) solve SVM via the dual (4). Others consider the
primal form (1). The decision of using primal or dual
is of course related to the algorithm design.
Very recently, Chang et al. (2008) propose using co-
ordinate descent methods for solving primal L2-SVM.
Experiments show that their approach more quickly
obtains a useful model than some of the above meth-
ods. Coordinate descent, a popular optimization tech-
nique, updates one variable at a time by minimizing a
single-variable sub-problem. If one can efficiently solve
this sub-problem, then it can be a competitive opti-
mization method. Due to the non-differentiability of
the primal L1-SVM, Chang et al.'s work is restricted to
L2-SVM. Moreover, as primal L2-SVM is differentiable
but not twice differentiable, certain considerations are
needed in solving the single-variable sub-problem.
While the dual form (4) involves bound constraints
0 ≤ α_i ≤ U, its objective function is twice differentiable
for both L1- and L2-SVM. In this paper, we investi-
gate coordinate descent methods for the dual problem
(4). We prove that an ε-optimal solution is obtained
in O(log(1/ε)) iterations. We propose an implemen-
tation using a random order of sub-problems at each
iteration, which leads to very fast training. Experi-
ments indicate that our method is more efficient than
the primal coordinate descent method. As Chang et al.
(2008) solve the primal, they require easy access to
a feature's corresponding data values. However, in
practice one often has easier access to values per in-
stance. Solving the dual exploits this advantage, so our
implementation is simpler than that of Chang et al. (2008).
Early SVM papers (Mangasarian & Musicant, 1999;
Friess et al., 1998) have discussed coordinate descent
methods for the SVM dual form.¹ However, they
do not focus on large data using the linear kernel.
Crammer and Singer (2003) proposed an online setting
for multi-class SVM without considering large sparse
data. Recently, Bordes et al. (2007) applied a coor-
dinate descent method to multi-class SVM, but they
focus on nonlinear kernels. In this paper, we point
out that dual coordinate descent methods take crucial
advantage of the linear kernel and outperform other
solvers when the numbers of data and features are both
large.

¹Note that coordinate descent methods for uncon-
strained quadratic programming can be traced back to
Hildreth (1957).
Coordinate descent methods for (4) are related to the
popular decomposition methods for training nonlinear
SVM. In this paper, we show their key differences and
explain why earlier studies on decomposition meth-
ods failed to modify their algorithms in an efficient
way like ours for large-scale linear SVM. We also dis-
cuss the connection to other linear SVM works such as
(Crammer & Singer, 2003; Collins et al., 2008; Shalev-
Shwartz et al., 2007).
This paper is organized as follows. In Section 2, we
describe our proposed algorithm. Implementation is-
sues are investigated in Section 3. Section 4 discusses
the connection to other methods. In Section 5, we
compare our method with state of the art implemen-
tations for large linear SVM. Results show that the
new method is more efficient.
2. A Dual Coordinate Descent Method
In this section, we describe our coordinate descent
method for L1- and L2-SVM. The optimization pro-
cess starts from an initial point α^0 ∈ R^l and generates
a sequence of vectors {α^k}_{k=0}^∞. We refer to the process
from α^k to α^{k+1} as an outer iteration. In each outer
iteration we have l inner iterations, so that sequentially
α_1, α_2, . . . , α_l are updated. Each outer iteration
thus generates vectors α^{k,i} ∈ R^l, i = 1, . . . , l + 1, such
that α^{k,1} = α^k, α^{k,l+1} = α^{k+1}, and

\alpha^{k,i} = [\alpha_1^{k+1}, \ldots, \alpha_{i-1}^{k+1}, \alpha_i^{k}, \ldots, \alpha_l^{k}]^T, \quad i = 2, \ldots, l.
For updating α^{k,i} to α^{k,i+1}, we solve the following
one-variable sub-problem:

\min_{d} \; f(\alpha^{k,i} + d e_i) \quad \text{subject to} \quad 0 \le \alpha_i^{k} + d \le U,   (5)

where e_i = [0, . . . , 0, 1, 0, . . . , 0]^T.
. The objective func-
tion of (5) is a simple quadratic function of d:
f(\alpha^{k,i} + d e_i) = \frac{1}{2} \bar{Q}_{ii} d^2 + \nabla_i f(\alpha^{k,i}) d + \text{constant},   (6)
where ∇_i f is the ith component of the gradient ∇f.
One can easily see that (5) has an optimum at d = 0
(i.e., no need to update α_i) if and only if

\nabla_i^P f(\alpha^{k,i}) = 0,   (7)
where ∇^P f(α) means the projected gradient

\nabla_i^P f(\alpha) =
\begin{cases}
\nabla_i f(\alpha) & \text{if } 0 < \alpha_i < U, \\
\min(0, \nabla_i f(\alpha)) & \text{if } \alpha_i = 0, \\
\max(0, \nabla_i f(\alpha)) & \text{if } \alpha_i = U.
\end{cases}   (8)

Algorithm 1 A dual coordinate descent method for
Linear SVM

Given α and the corresponding w = Σ_i y_i α_i x_i.
While α is not optimal
   For i = 1, . . . , l
      (a) G = y_i w^T x_i − 1 + D_ii α_i
      (b) PG = min(G, 0)  if α_i = 0,
               max(G, 0)  if α_i = U,
               G          if 0 < α_i < U
      (c) If |PG| ≠ 0,
             ᾱ_i ← α_i
             α_i ← min(max(α_i − G/Q̄_ii, 0), U)
             w ← w + (α_i − ᾱ_i) y_i x_i
If (7) holds, we move to the index i+1 without updat-
ing α_i^{k,i}. Otherwise, we must find the optimal solution
of (5). If Q̄_ii > 0, the solution is easily seen to be:

\alpha_i^{k,i+1} = \min\left(\max\left(\alpha_i^{k,i} - \frac{\nabla_i f(\alpha^{k,i})}{\bar{Q}_{ii}},\, 0\right),\, U\right).   (9)
We thus need to calculate Q̄_ii and ∇_i f(α^{k,i}). First,
Q̄_ii = x_i^T x_i + D_ii can be precomputed and stored in
the memory. Second, to evaluate ∇_i f(α^{k,i}), we use

\nabla_i f(\alpha) = (\bar{Q}\alpha)_i - 1 = \sum_{j=1}^{l} \bar{Q}_{ij} \alpha_j - 1.   (10)
Q̄ may be too large to be stored, so one calculates Q̄'s
ith row when doing (10). If n̄ is the average number
of nonzero elements per instance, and O(n̄) is needed
for each kernel evaluation, then calculating the ith row
of the kernel matrix takes O(ln̄). Such operations are
expensive. However, for a linear SVM, we can define

w = \sum_{j=1}^{l} y_j \alpha_j x_j,   (11)
so (10) becomes

\nabla_i f(\alpha) = y_i w^T x_i - 1 + D_{ii} \alpha_i.   (12)

To evaluate (12), the main cost is O(n̄) for calculating
w^T x_i. This is much smaller than O(ln̄). To apply
(12), w must be maintained throughout the coordinate
descent procedure. Calculating w by (11) takes O(ln̄)
operations, which are too expensive. Fortunately, if
ᾱ_i is the current value and α_i is the value after the
updating, we can maintain w by

w \leftarrow w + (\alpha_i - \bar{\alpha}_i) y_i x_i.   (13)
The number of operations is only O(n̄). To have the
first w, one can use α^0 = 0 so w = 0. In the end, we
obtain the optimal w of the primal problem (1) as the
primal-dual relationship implies (11).
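The following sketch (ours, with made-up helper names) contrasts the O(ln̄) gradient evaluation (10) with the O(n̄) evaluation (12) that maintaining w makes possible; both return the same value:

```python
import numpy as np

def grad_i_via_kernel_row(i, alpha, X, y, D_ii):
    # Eq. (10): needs the ith row of Q_bar, i.e., O(l * n_bar) work.
    Qbar_row = y[i] * y * (X @ X[i]) + (np.arange(len(y)) == i) * D_ii
    return Qbar_row @ alpha - 1.0

def grad_i_via_w(i, alpha, X, y, w, D_ii):
    # Eq. (12): one inner product with the maintained w, i.e., O(n_bar) work.
    return y[i] * np.dot(w, X[i]) - 1.0 + D_ii * alpha[i]

X = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 1.0, 0.0]])
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.3, 0.7, 0.1])
D_ii = 0.0                          # L1-SVM
w = (y * alpha) @ X                 # eq. (11)
print(grad_i_via_kernel_row(1, alpha, X, y, D_ii),
      grad_i_via_w(1, alpha, X, y, w, D_ii))   # identical values
```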
If Q̄_ii = 0, we have D_ii = 0, Q_ii = x_i^T x_i = 0, and
hence x_i = 0. This occurs only in L1-SVM without
the bias term by (3). From (12), if x_i = 0, then
∇_i f(α^{k,i}) = −1. As U = C < ∞ for L1-SVM, the
solution of (5) makes the new α_i^{k,i+1} = U. We can
easily include this case in (9) by setting 1/Q̄_ii = ∞.
Briefly, our algorithm uses (12) to compute ∇_i f(α^{k,i}),
checks the optimality of the sub-problem (5) by (7),
updates α_i by (9), and then maintains w by (13). A
description is in Algorithm 1. The cost per iteration
(i.e., from α^k to α^{k+1}) is O(ln̄). The main memory
requirement is on storing x_1, . . . , x_l. For the conver-
gence, we prove the following theorem using techniques
in (Luo & Tseng, 1992):
Theorem 1 For L1-SVM and L2-SVM, {α^{k,i}} gen-
erated by Algorithm 1 globally converges to an optimal
solution α^∗. The convergence rate is at least linear:
there are 0 < µ < 1 and an iteration k_0 such that

f(\alpha^{k+1}) - f(\alpha^*) \le \mu \left( f(\alpha^{k}) - f(\alpha^*) \right), \quad \forall k \ge k_0.   (14)
The proof is in Appendix 7.1. The global convergence
result is quite remarkable. Usually for a convex but
not strictly convex problem (e.g., L1-SVM), one can
only obtain that any limit point is optimal. We define
an ε-accurate solution α if f(α) ≤ f(α^∗) + ε. By
(14), our algorithm obtains an ε-accurate solution in
O(log(1/ε)) iterations.²
3. Implementation Issues
3.1. Random Permutation of Sub-problems
In Algorithm 1, the coordinate descent algorithm
solves the one-variable sub-problems in the order of
α_1, . . . , α_l. Past results such as (Chang et al., 2008)
. Past results such as (Chang et al., 2008)
show that solving sub-problems in an arbitrary order
may give faster convergence. This inspires us to ran-
domly permute the sub-problems at each outer itera-
tion. Formally, at the kth outer iteration, we permute
{1, . . . , l} to {π(1), . . . , π(l)}, and solve sub-problems
in the order of α_{π(1)}, α_{π(2)}, . . . , α_{π(l)}. Similar to Al-
gorithm 1, the algorithm generates a sequence {α^{k,i}}
such that α^{k,1} = α^k, α^{k,l+1} = α^{k+1,1}, and

\alpha_t^{k,i} =
\begin{cases}
\alpha_t^{k+1} & \text{if } \pi_k^{-1}(t) < i, \\
\alpha_t^{k} & \text{if } \pi_k^{-1}(t) \ge i.
\end{cases}

²A constant k_0 appears in (14). A newer result without
needing k_0 is in Wang and Lin (2014).

The update from α^{k,i} to α^{k,i+1} is by

\alpha_t^{k,i+1} = \alpha_t^{k,i} + \arg\min_{0 \le \alpha_t^{k,i} + d \le U} f(\alpha^{k,i} + d e_t) \quad \text{if } \pi_k^{-1}(t) = i.
We prove that Theorem 1 is still valid. Hence, the new
setting obtains an ε-accurate solution in O(log(1/ε)) it-
erations. A simple experiment reveals that this setting
of permuting sub-problems is much faster than Algo-
rithm 1. The improvement is also bigger than that
observed in (Chang et al., 2008) for primal coordinate
descent methods.
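A sketch (ours, not the paper's implementation) of the only change this permuted variant makes to Algorithm 1: draw a fresh permutation π_k at every outer iteration and sweep the coordinates in that order; `inner_update` stands for steps (a)-(c) applied to one coordinate.

```python
import numpy as np

rng = np.random.default_rng(0)

def outer_iteration_permuted(inner_update, l):
    # One outer iteration of Section 3.1: visit all l coordinates in a random order.
    for i in rng.permutation(l):     # pi_k(1), ..., pi_k(l)
        inner_update(int(i))         # steps (a)-(c) of Algorithm 1 for coordinate i

# Example with a dummy inner update that just records the visiting order.
order = []
outer_iteration_permuted(order.append, 5)
print(order)   # a random permutation of 0..4; it changes at every outer iteration
```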
3.2. Shrinking
Eq. (4) contains constraints 0 ≤ α_i ≤ U. If an
α_i is 0 or U for many iterations, it may remain the
same. To speed up decomposition methods for non-
linear SVM (discussed in Section 4.1), the shrinking
technique (Joachims, 1998) reduces the size of the op-
timization problem without considering some bounded
variables. Below we show it is much easier to apply this
technique to linear SVM than the nonlinear case.
If A is the subset after removing some elements and
Ā = {1, . . . , l} \ A, then the new problem is

\min_{\alpha_A} \quad \frac{1}{2} \alpha_A^T \bar{Q}_{AA} \alpha_A + (\bar{Q}_{A\bar{A}} \alpha_{\bar{A}} - e_A)^T \alpha_A
\text{subject to} \quad 0 \le \alpha_i \le U, \; i \in A,   (15)

where Q̄_AA, Q̄_AĀ are sub-matrices of Q̄, and α_Ā is
problem consumes less time and memory. Once (15) is
solved, we must check if the vector α is optimal for (4).
This check needs the whole gradient f(α). Since
i
f(α) =
¯
Q
i,A
α
A
+
¯
Q
i,
¯
A
α
¯
A
1,
if i A, and one stores
¯
Q
i,
¯
A
α
¯
A
before solving (15), we
already have
i
f(α). However, for all i / A, we must
calculate the corresponding rows of
¯
Q. This step, re-
ferred to as the reconstruction of gradients in training
nonlinear SVM, is very time consuming. It may cost
up to O(l
2
¯n) if each kernel evaluation is O(¯n).
For linear SVM, in solving the smaller problem (15),
we still have the vector
w = \sum_{i \in A} y_i \alpha_i x_i + \sum_{i \in \bar{A}} y_i \alpha_i x_i,

though only the first part Σ_{i∈A} y_i α_i x_i is updated.
Therefore, using (12), ∇f(α) is easily available. Below
we demonstrate a shrinking implementation so that re-
constructing the whole ∇f(α) is never needed.
Our method is related to what LIBSVM (Chang & Lin,
2011) uses. From the optimality condition of bound-
constrained problems, α is optimal for (4) if and only if
Algorithm 2 Coordinate descent algorithm with ran-
domly selecting one instance at a time

Given α and the corresponding w = Σ_i y_i α_i x_i.
While α is not optimal
   Randomly choose i ∈ {1, . . . , l}.
   Do steps (a)-(c) of Algorithm 1 to update α_i.

∇^P f(α) = 0, where ∇^P f(α) is the projected gradient
defined in (8). We then prove the following result:
Theorem 2 Let α^∗ be the convergent point of {α^{k,i}}.

1. If α_i^∗ = 0 and ∇_i f(α^∗) > 0, then ∃ k_i such that
   ∀k ≥ k_i, ∀s, α_i^{k,s} = 0.

2. If α_i^∗ = U and ∇_i f(α^∗) < 0, then ∃ k_i such that
   ∀k ≥ k_i, ∀s, α_i^{k,s} = U.

3. lim_{k→∞} max_j ∇_j^P f(α^{k,j}) = lim_{k→∞} min_j ∇_j^P f(α^{k,j}) = 0.
The proof is in Appendix 7.3. During the opti-
mization procedure, ∇^P f(α^k) ≠ 0, and in general
max_j ∇_j^P f(α^k) > 0 and min_j ∇_j^P f(α^k) < 0. These
two values measure how the current solution violates
the optimality condition. In our iterative procedure,
what we have are ∇_i f(α^{k,i}), i = 1, . . . , l. Hence, at
the (k − 1)st iteration, we obtain

M^{k-1} \equiv \max_j \nabla_j^P f(\alpha^{k-1,j}), \quad m^{k-1} \equiv \min_j \nabla_j^P f(\alpha^{k-1,j}).
Then at each inner step of the kth iteration, before
updating α_i^{k,i} to α_i^{k,i+1}, this element is shrunken if
one of the following two conditions holds:

\alpha_i^{k,i} = 0 \text{ and } \nabla_i f(\alpha^{k,i}) > \bar{M}^{k-1},
\alpha_i^{k,i} = U \text{ and } \nabla_i f(\alpha^{k,i}) < \bar{m}^{k-1},   (16)
where

\bar{M}^{k-1} = \begin{cases} M^{k-1} & \text{if } M^{k-1} > 0, \\ \infty & \text{otherwise,} \end{cases}
\qquad
\bar{m}^{k-1} = \begin{cases} m^{k-1} & \text{if } m^{k-1} < 0, \\ -\infty & \text{otherwise.} \end{cases}
In (16), M̄^{k−1} must be strictly positive, so we set it to
∞ if M^{k−1} ≤ 0. From Theorem 2, elements satisfying
the “if condition” of properties 1 and 2 meet (16) after
certain iterations, and are then correctly removed from
the optimization. To have a more aggressive shrinking,
one may multiply both M̄^{k−1} and m̄^{k−1} in (16) by a
threshold smaller than one.
Property 3 of Theorem 2 indicates that with a toler-
ance ε,

M^{k} - m^{k} < \epsilon   (17)
is satisfied after a finite number of iterations. Hence
(17) is a valid stopping condition. We also use it for

Table 1. A comparison between decomposition methods
(Decomp.) and dual coordinate descent (DCD). For both
methods, we consider that one α_i is updated at a time. We
assume Decomp. maintains gradients, but DCD does not.
The average number of nonzeros per instance is n̄.

                    Nonlinear SVM           Linear SVM
                    Decomp.     DCD         Decomp.     DCD
Update α_i          O(1)        O(ln̄)       O(1)        O(n̄)
Maintain ∇f(α)      O(ln̄)       NA          O(ln̄)       NA
smaller problems (15). If at the kth iteration, (17)
for (15) is reached, we enlarge A to {1, . . . , l}, set
M̄^k = ∞, m̄^k = −∞ (so no shrinking at the (k + 1)st
iteration), and continue regular iterations. Thus, we
do shrinking without reconstructing gradients.
In Appendix 7.4, we provide an algorithm to show the
convergence and finite termination of Algorithm 1
with shrinking.
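The shrinking test (16) and the stopping condition (17) reduce to a few comparisons per coordinate. A condensed sketch (ours, not the paper's exact implementation) of how an active set could be maintained within one outer iteration; `grad_of` and `update` are assumed callables that evaluate (12) and perform steps (b)-(c) of Algorithm 1:

```python
import numpy as np

def sweep_with_shrinking(active, alpha, grad_of, update, U, M_bar, m_bar):
    """One pass over the active set; returns the shrunken set and new M^k, m^k.

    M_bar/m_bar are the thresholds (16) computed from the previous outer iteration."""
    M_k, m_k = -np.inf, np.inf
    kept = []
    for i in active:
        G = grad_of(i)
        # Shrinking conditions (16): bounded variables with strongly violating gradients.
        if alpha[i] == 0.0 and G > M_bar:
            continue
        if alpha[i] == U and G < m_bar:
            continue
        kept.append(i)
        PG = min(G, 0.0) if alpha[i] == 0.0 else (max(G, 0.0) if alpha[i] == U else G)
        M_k, m_k = max(M_k, PG), min(m_k, PG)
        update(i, G)
    return kept, M_k, m_k

def converged(M_k, m_k, eps=1e-3):
    # Stopping condition (17), checked after a sweep over the full index set.
    return M_k - m_k < eps
```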
3.3. An Online Setting
In some applications, the number of instances is huge,
so going over all α_1, . . . , α_l causes an expensive outer
iteration. Instead, one can randomly choose an index
i_k at a time, and update only α_{i_k} at the kth outer
iteration. A description is in Algorithm 2. The setting
is related to (Crammer & Singer, 2003; Collins et al.,
2008). See also the discussion in Section 4.2.
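A sketch (ours) of Algorithm 2's loop; `step_abc` is a hypothetical callable that applies steps (a)-(c) of Algorithm 1 to a single coordinate and is assumed to come from an Algorithm 1 implementation:

```python
import numpy as np

def online_dual_cd(step_abc, l, max_updates=100000, seed=0):
    # Algorithm 2: pick one random index per (cheap) outer iteration.
    rng = np.random.default_rng(seed)
    for _ in range(max_updates):
        i = int(rng.integers(l))   # pick i uniformly at random among the l instances
        step_abc(i)                # update alpha_i and maintain w as in Algorithm 1

# Usage (assuming an Algorithm 1 implementation exposing its inner step):
#   online_dual_cd(step_abc=my_inner_step, l=len(y))
```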
4. Relations with Other Methods
4.1. Decomposition Methods for Nonlinear
SVM
Decomposition methods are one of the most popular
approaches for training nonlinear SVM. As the kernel
matrix is dense and cannot be stored in the computer
memory, decomposition methods solve a sub-problem
of a few variables at each iteration. Only a small num-
ber of corresponding kernel columns are needed, so the
memory problem is resolved. If the number of vari-
ables is restricted to one, a decomposition method is
like the online coordinate descent in Section 3.3, but
it differs in the way it selects variables for updating.
It has been shown (Keerthi & DeCoste, 2005) that,
for linear SVM, decomposition methods are inefficient.
On the other hand, here we are pointing out that dual
coordinate descent is efficient for linear SVM. There-
fore, it is important to discuss the relationship between
decomposition methods and our method.
In early decomposition methods (Osuna et al., 1997;
Platt, 1998), variables minimized at an iteration are
selected by certain heuristics.
However, subsequent developments (Joachims, 1998;
Chang & Lin, 2011; Keerthi et al., 2001) all use gra-
dient information to conduct the selection. The main
reason is that maintaining the whole gradient does not
introduce extra cost. Here we explain the detail by as-
suming that one variable of α is chosen and updated at
a time.³ To set up and solve the sub-problem (6), one
uses (10) to calculate ∇_i f(α). If O(n̄) effort is needed
for each kernel evaluation, obtaining the ith row of
the kernel matrix takes O(ln̄) effort. If instead one
maintains the whole gradient, then ∇_i f(α) is directly
available. After updating α_i^{k,i} to α_i^{k,i+1}, we obtain Q̄'s
ith column (same as the ith row due to the symmetry
of Q̄), and calculate the new whole gradient:

\nabla f(\alpha^{k,i+1}) = \nabla f(\alpha^{k,i}) + \bar{Q}_{:,i} (\alpha_i^{k,i+1} - \alpha_i^{k,i}),   (18)
where Q̄_{:,i} is the ith column of Q̄. The cost is O(ln̄)
for Q̄_{:,i} and O(l) for (18). Therefore, maintaining the
whole gradient does not cost more. As using the whole
gradient implies fewer iterations (i.e., faster conver-
gence due to the ability to choose for updating the vari-
able that violates optimality most), one should take
this advantage. However, the situation for linear SVM
is very different. With the different way (12) to calcu-
late ∇_i f(α), the cost to update one α_i is only O(n̄). If
we still maintain the whole gradient, evaluating (12) l
times takes O(ln̄) effort. We gather this comparison of
different situations in Table 1. Clearly, for nonlinear
SVM, one should use decomposition methods by main-
taining the whole gradient. However, for linear SVM,
if l is large, the cost per iteration without maintaining
gradients is much smaller than that with. Hence, the
coordinate descent method can be faster than the de-
composition method by using many cheap iterations.
An earlier attempt to speed up decomposition methods
for linear SVM is (Kao et al., 2004). However, it failed
to derive our method here because it does not give up
maintaining gradients.
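To make the per-update contrast of Table 1 concrete, the sketch below (illustrative only) shows the bookkeeping of the two strategies for one coordinate i; `Qbar_column(i)` is a hypothetical helper standing in for the O(ln̄) kernel-column computation a decomposition method needs.

```python
import numpy as np

def decomposition_style_update(i, delta, grad, Qbar_column):
    # Maintain the whole gradient as in (18): needs Q_bar's ith column, O(l * n_bar).
    grad += Qbar_column(i) * delta           # delta = alpha_i^{new} - alpha_i^{old}
    return grad

def dcd_style_update(i, delta, w, X, y):
    # Maintain only w as in (13): O(n_bar) per update; gradients recomputed on demand by (12).
    w += delta * y[i] * X[i]
    return w

# Toy check that both views stay consistent with (12).
X = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
alpha = np.zeros(3)
w = np.zeros(2)
grad = -np.ones(3)                            # nabla f(0) = -e
Qbar_column = lambda i: y * y[i] * (X @ X[i])  # L1-SVM: D = 0
delta, i = 0.4, 0
alpha[i] += delta
grad = decomposition_style_update(i, delta, grad, Qbar_column)
w = dcd_style_update(i, delta, w, X, y)
print(np.allclose(grad, y * (X @ w) - 1.0))   # True: both agree with (12)
```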
4.2. Existing Linear SVM Methods
We discussed in Section 1 and other places the dif-
ference between our method and a primal coordinate
descent method (Chang et al., 2008). Below we de-
scribe the relations with other linear SVM methods.
We mentioned in Section 3.3 that our Algorithm 2 is
related to the online mode in (Collins et al., 2008).
They aim at solving multi-class and structured prob-
lems. At each iteration an instance is used; then a
sub-problem of several variables is solved. They ap-
proximately minimize the sub-problem, but for two-
class case, one can exactly solve it by (9). For the
³Solvers like LIBSVM update at least two variables due
to a linear constraint in their dual problems. Here (4) has
no such constraint, so selecting one variable is possible.

References
Boser, B. E., Guyon, I. M., and Vapnik, V. N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT), 1992.

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011.

Platt, J. C. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning. MIT Press, 1998.

Schölkopf, B., Burges, C. J. C., and Smola, A. J. (eds.). Advances in Kernel Methods: Support Vector Learning. MIT Press, 1998.