Multiple kernel learning, conic duality, and the SMO algorithm

TL;DR: Experimental results show that the proposed SMO-based algorithm, built on a novel dual formulation of the QCQP as a second-order cone programming problem, is significantly more efficient than the general-purpose interior point methods available in current optimization toolboxes.
Abstract: While classical kernel-based classifiers are based on a single kernel, in practice it is often desirable to base classifiers on combinations of multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for the support vector machine (SVM), and showed that the optimization of the coefficients of such a combination reduces to a convex optimization problem known as a quadratically-constrained quadratic program (QCQP). Unfortunately, current convex optimization toolboxes can solve this problem only for a small number of kernels and a small number of data points; moreover, the sequential minimal optimization (SMO) techniques that are essential in large-scale implementations of the SVM cannot be applied because the cost function is non-differentiable. We propose a novel dual formulation of the QCQP as a second-order cone programming problem, and show how to exploit the technique of Moreau-Yosida regularization to yield a formulation to which SMO techniques can be applied. We present experimental results that show that our SMO-based algorithm is significantly more efficient than the general-purpose interior point methods available in current optimization toolboxes.

Summary (3 min read)

1. Introduction

  • One of the major reasons for the rise to prominence of the support vector machine (SVM) is its ability to cast nonlinear classification as a convex optimization problem, in particular a quadratic program (QP).
  • Convexity implies that the solution is unique and brings a suite of standard numerical software to bear in finding the solution.
  • Recent developments in the literature on the SVM and other kernel methods have emphasized the need to consider multiple kernels, or parameterizations of kernels, and not a single fixed kernel.
  • One class of solutions to non-smooth optimization problems involves constructing a smooth approximate problem out of a non-smooth problem.
  • In this paper the authors show how these problems can be resolved by considering a novel dual formulation of the QCQP as a second-order cone programming (SOCP) problem.

2.2. Support kernel machine

  • The authors now introduce a novel classification algorithm that they refer to as the "support kernel machine" (SKM).
  • But their underlying motivation is the fact that the dual of the SKM is exactly the problem (L).
  • The authors establish this equivalence in the following section.

2.2.1. Linear classification

  • In the spirit of the soft margin SVM, the authors achieve this by minimizing a linear combination of the inverse of the margin and the training error.
  • Various norms can be used to combine the two terms, and indeed many different algorithms have been explored for various combinations of ℓ1-norms and ℓ2-norms.

2.2.2. Conic duality and optimality conditions

  • For a given optimization problem there are many ways of deriving a dual problem.
  • Equations (a) and (b) are the same as in the classical SVM, where they define the notion of a "support vector."
  • While the KKT conditions (a) and (b) refer to the index i over data points, the KKT conditions (c) and (d) refer to the index j over components of the input vector.
  • These conditions thus imply a form of sparsity not over data points but over "input dimensions."
  • Sparsity thus emerges from the optimization problem.

2.2.3. Kernelization

  • The authors now "kernelize" the problem (P ) using this kernel function.
  • The sparsity that emerges via the KKT conditions (c) and (d) now refers to the kernels K_j, and the authors refer to the kernels with nonzero η_j as "support kernels."

2.3. Equivalence of the two formulations

  • Care must be taken here, though: the weights η_j are defined for (L) as Lagrange multipliers and for (D_K) through the anti-proportionality of orthogonal elements of a second-order cone, and a priori they might not coincide: although (D_K) and (L) are equivalent, their dual problems have different formulations.
  • It is straightforward, however, to write the KKT optimality conditions for (α, η) for both problems and verify that they are indeed equivalent.

3. Optimality conditions

  • The authors formulate their problem (in either of its two equivalent forms) as the minimization of a non-differentiable convex function subject to linear constraints.
  • Exact and approximate optimality conditions are then readily derived using subdifferentials.
  • In later sections the authors will show how these conditions lead to an MY-regularized algorithmic formulation that will be amenable to SMO techniques.

3.2. Optimality conditions and subdifferential

  • Elements of the subdifferential ∂J(α) are called subgradients.
  • The notion of subdifferential is especially useful for characterizing optimality conditions of non-smooth problems (Bertsekas, 1995).

3.3. Approximate optimality conditions

  • Note that for one kernel, i.e., when the SKM reduces to the SVM, this corresponds to the approximate KKT conditions usually employed for the standard SVM (Platt, 1998; Keerthi et al., 2001; Joachims, 1998).
  • Indeed, the iterative algorithm that the authors present in Section 4 outputs a pair (α, η) and only these sufficient optimality conditions need to be checked.

3.4. Improving sparsity

  • Indeed, if some of the kernels are close to identical, then some of the η's can potentially be removed: for a general SVM, the optimal α is not unique if data points coincide, and for a general SKM, the optimal α and η are not unique if data points or kernels coincide.
  • When searching for the minimum ℓ0-norm η which satisfies the constraints (OPT3), the authors can thus consider a simple heuristic approach where they loop through all the nonzero η_j and check whether each such component can be removed.

4. Regularized support kernel machine

  • The function J(α) is convex but not differentiable.
  • It is well known that in this situation, steepest descent and coordinate descent methods do not necessarily converge to the global optimum (Bertsekas, 1995).
  • SMO unfortunately falls into this class of methods.
  • Therefore, in order to develop an SMO-like algorithm for the SKM, the authors make use of Moreau-Yosida regularization.

4.2. Solving the MY-regularized SKM using SMO

  • Since the objective function G(α) is differentiable, the authors can now safely envisage an SMO-like approach, which consists in a sequence of local optimizations over only two components of α.
  • In addition, caching and shrinking techniques (Joachims, 1998) that prevent redundant computations of kernel matrix values can also be employed.
  • A difference between their setting and the SVM setting is the line search, which cannot be performed in closed form for the MY-regularized SKM.
  • Since each line search is the minimization of a convex function, one can use efficient one-dimensional root finding, such as Brent's method (Brent, 1973); a sketch of such a line search is given after this list.
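As a concrete illustration of this inner step, here is a minimal sketch (not the authors' implementation) of a two-coordinate SMO-style update in which the one-dimensional minimization is handed to a Brent-type bounded scalar minimizer from SciPy. The objective G, the working-set pair (i, j), and the helper name smo_line_search are hypothetical placeholders.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def smo_line_search(G, alpha, y, i, j, C):
    """One SMO-like step: minimize t -> G(alpha + t*d) over the feasible box,
    where d touches only coordinates i and j and preserves alpha^T y = 0."""
    d = np.zeros_like(alpha)
    d[i], d[j] = 1.0, -y[i] * y[j]        # direction keeps the equality constraint

    # Feasible step interval from the box constraints 0 <= alpha_k + t*d_k <= C.
    t_lo, t_hi = -np.inf, np.inf
    for k in (i, j):
        if d[k] > 0:
            t_lo, t_hi = max(t_lo, -alpha[k] / d[k]), min(t_hi, (C - alpha[k]) / d[k])
        else:
            t_lo, t_hi = max(t_lo, (C - alpha[k]) / d[k]), min(t_hi, -alpha[k] / d[k])

    # The restriction of a convex G to a line is convex, so a Brent-type bounded
    # scalar minimizer is enough when no closed form is available.
    res = minimize_scalar(lambda t: G(alpha + t * d), bounds=(t_lo, t_hi), method="bounded")
    return alpha + res.x * d
```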

4.4. A minimization algorithm

  • In their simulations, the kernel matrices are all normalized, i.e., have unit diagonal, so the authors can choose all d_j equal (a sketch of this unit-diagonal normalization follows this list).
  • Once the approximate optimality conditions are satisfied, the algorithm stops.
  • Since each SMO optimization is performed on a differentiable function with Lipschitz gradient and SMO is equivalent to steepest descent for the ℓ1-norm (Joachims, 1998), classical optimization results show that each of those SMO optimizations is finitely convergent (Bertsekas, 1995).
  • Additional speed-ups can be easily achieved here.
  • If for successive values of κ, some kernels have a zero weight, the authors might as well remove them from the algorithm and check after convergence if they can be safely kept out.
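A minimal sketch of the unit-diagonal normalization mentioned above, assuming each K is a symmetric positive semidefinite Gram matrix with a strictly positive diagonal (the function name is a placeholder, not the authors' code):

```python
import numpy as np

def normalize_unit_diagonal(K, eps=1e-12):
    """Rescale a Gram matrix so that every diagonal entry equals 1,
    i.e. K_ij <- K_ij / sqrt(K_ii * K_jj)."""
    d = np.sqrt(np.clip(np.diag(K), eps, None))
    return K / np.outer(d, d)
```

With this normalization every kernel has tr K_j = n, so the weights d_j = sqrt(tr K_j / c) of Section 2.3 can indeed all be chosen equal.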

5. Simulations

  • The authors compare the algorithm presented in Section 4.4 with solving the QCQP (L) using Mosek for two datasets, ionosphere and breast cancer, from the UCI repository, and nested subsets of the adult dataset from Platt (1998) .
  • The basis kernels are Gaussian kernels on random subsets of features, with varying widths (a sketch of this basis-kernel construction is given after this list).
  • The authors vary the number of kernels m for fixed number of data points n, and vice versa.
  • Thus the algorithm presented in this paper appears to provide a significant improvement over Mosek in computational complexity, both in terms of the number of kernels and the number of data points.
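For illustration, a minimal sketch of such basis kernels (assumed choices for the subset size and the list of widths; the helper names are hypothetical, not the authors' experimental code):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(X, width, features):
    """Gaussian (RBF) kernel computed on a subset of the columns of X."""
    Z = X[:, features]
    sq = np.sum(Z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-d2 / (2.0 * width**2))

def random_basis_kernels(X, m, widths, subset_size):
    """Build m basis kernels on random feature subsets with varying widths."""
    n_features = X.shape[1]
    kernels = []
    for j in range(m):
        feats = rng.choice(n_features, size=subset_size, replace=False)
        kernels.append(gaussian_kernel(X, widths[j % len(widths)], feats))
    return kernels
```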


Multiple Kernel Learning, Conic Duality, and the SMO Algorithm
Francis R. Bach & Gert R. G. Lanckriet {fbach,gert}@cs.berkeley.edu
Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720, USA
Michael I. Jordan jordan@cs.berkeley.edu
Computer Science Division and Department of Statistics, University of California, Berkeley, CA 94720, USA
Abstract

While classical kernel-based classifiers are based on a single kernel, in practice it is often desirable to base classifiers on combinations of multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for the support vector machine (SVM), and showed that the optimization of the coefficients of such a combination reduces to a convex optimization problem known as a quadratically-constrained quadratic program (QCQP). Unfortunately, current convex optimization toolboxes can solve this problem only for a small number of kernels and a small number of data points; moreover, the sequential minimal optimization (SMO) techniques that are essential in large-scale implementations of the SVM cannot be applied because the cost function is non-differentiable. We propose a novel dual formulation of the QCQP as a second-order cone programming problem, and show how to exploit the technique of Moreau-Yosida regularization to yield a formulation to which SMO techniques can be applied. We present experimental results that show that our SMO-based algorithm is significantly more efficient than the general-purpose interior point methods available in current optimization toolboxes.
1. Introduction

One of the major reasons for the rise to prominence of the support vector machine (SVM) is its ability to cast nonlinear classification as a convex optimization problem, in particular a quadratic program (QP). Convexity implies that the solution is unique and brings a suite of standard numerical software to bear in finding the solution. Convexity alone, however, does not imply that the available algorithms scale well to problems of interest. Indeed, off-the-shelf algorithms do not suffice in large-scale applications of the SVM, and a second major reason for the rise to prominence of the SVM is the development of special-purpose algorithms for solving the QP (Platt, 1998; Joachims, 1998; Keerthi et al., 2001).

Recent developments in the literature on the SVM and other kernel methods have emphasized the need to consider multiple kernels, or parameterizations of kernels, and not a single fixed kernel. This provides needed flexibility and also reflects the fact that practical learning problems often involve multiple, heterogeneous data sources. While this so-called "multiple kernel learning" problem can in principle be solved via cross-validation, several recent papers have focused on more efficient methods for kernel learning (Chapelle et al., 2002; Grandvalet & Canu, 2003; Lanckriet et al., 2004; Ong et al., 2003). In this paper we focus on the framework proposed by Lanckriet et al. (2004), which involves joint optimization of the coefficients in a conic combination of kernel matrices and the coefficients of a discriminative classifier. In the SVM setting, this problem turns out to again be a convex optimization problem—a quadratically-constrained quadratic program (QCQP). This problem is more challenging than a QP, but it can also be solved in principle by general-purpose optimization toolboxes such as Mosek (Andersen & Andersen, 2000). Again, however, this existing algorithmic solution suffices only for small problems (small numbers of kernels and data points), and improved algorithmic solutions akin to sequential minimal optimization (SMO) are needed.

While the multiple kernel learning problem is convex, it is also non-smooth—it can be cast as the minimization of a non-differentiable function subject to linear constraints (see Section 3.1). Unfortunately, as is well known in the non-smooth optimization literature, this means that simple local descent algorithms such as SMO may fail to converge or may converge to incorrect values (Bertsekas, 1995). Indeed, in preliminary attempts to solve the QCQP using SMO we ran into exactly these convergence problems.

One class of solutions to non-smooth optimization problems involves constructing a smooth approximate problem out of a non-smooth problem. In particular, Moreau-Yosida (MY) regularization is an effective general solution methodology that is based on inf-convolution (Lemarechal & Sagastizabal, 1997). It can be viewed in terms of the dual problem as simply adding a quadratic regularization term to the dual objective function. Unfortunately, in our setting, this creates a new difficulty—we lose the sparsity that makes the SVM amenable to SMO optimization. In particular, the QCQP formulation of Lanckriet et al. (2004) does not lead to an MY-regularized problem that can be solved efficiently by SMO techniques.

In this paper we show how these problems can be resolved by considering a novel dual formulation of the QCQP as a second-order cone programming (SOCP) problem. This new formulation is of interest on its own merit, because of various connections to existing algorithms. In particular, it is closely related to the classical maximum margin formulation of the SVM, differing only by the choice of the norm of the inverse margin. Moreover, the KKT conditions arising in the new formulation not only lead to support vectors as in the classical SVM, but also to a dual notion of "support kernels"—those kernels that are active in the conic combination. We thus refer to the new formulation as the support kernel machine (SKM).

As we will show, the conic dual problem defining the SKM is exactly the multiple kernel learning problem of Lanckriet et al. (2004).¹ Moreover, given this new formulation, we can design a Moreau-Yosida regularization which preserves the sparse SVM structure, and therefore we can apply SMO techniques. Making this circle of ideas precise requires a number of tools from convex analysis. In particular, Section 3 defines appropriate approximate optimality conditions for the SKM in terms of subdifferentials and approximate subdifferentials. These conditions are then used in Section 4 in the design of an MY regularization for the SKM and an SMO-based algorithm. We present the results of numerical experiments with the new method in Section 5.

¹ It is worth noting that this dual problem cannot be obtained directly as the Lagrangian dual of the QCQP problem—Lagrangian duals of QCQPs are semidefinite programming problems.

(Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the first author.)
2. Learning the kernel matrix

In this section, we (1) begin with a brief review of the multiple kernel learning problem of Lanckriet et al. (2004), (2) introduce the support kernel machine (SKM), and (3) show that the dual of the SKM is equivalent to the multiple kernel learning primal.

2.1. Multiple kernel learning problem

In the multiple kernel learning problem, we assume that we are given n data points (x_i, y_i), where x_i ∈ X for some input space X, and where y_i ∈ {−1, 1}. We also assume that we are given m matrices K_j ∈ ℝ^{n×n}, which are assumed to be symmetric positive semidefinite (and might or might not be obtained from evaluating a kernel function on the data {x_i}). We consider the problem of learning the best linear combination $\sum_{j=1}^m \eta_j K_j$ of the kernels K_j with nonnegative coefficients η_j ≥ 0 and with a trace constraint $\operatorname{tr} \sum_{j=1}^m \eta_j K_j = \sum_{j=1}^m \eta_j \operatorname{tr} K_j = c$, where c > 0 is fixed. Lanckriet et al. (2004) show that this setup yields the following optimization problem:

$$
\begin{aligned}
(L)\quad \min\ \ & \zeta - 2 e^\top \alpha \\
\text{w.r.t.}\ \ & \zeta \in \mathbb{R},\ \alpha \in \mathbb{R}^n \\
\text{s.t.}\ \ & 0 \le \alpha \le C, \quad \alpha^\top y = 0 \\
& \alpha^\top D(y) K_j D(y)\, \alpha \le \frac{\operatorname{tr} K_j}{c}\, \zeta, \quad j \in \{1,\dots,m\},
\end{aligned}
$$

where D(y) is the diagonal matrix with diagonal y, e ∈ ℝⁿ the vector of all ones, and C a positive constant. The coefficients η_j are recovered as Lagrange multipliers for the constraints $\alpha^\top D(y) K_j D(y)\, \alpha \le \frac{\operatorname{tr} K_j}{c}\, \zeta$.
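To make the objects in (L) concrete, here is a minimal NumPy sketch (an illustration under assumed inputs, not the authors' code) that forms the combination K(η) = Σ_j η_j K_j after rescaling η so that the trace constraint Σ_j η_j tr K_j = c holds, and that evaluates the quadratic constraint terms α^⊤D(y)K_jD(y)α appearing in (L):

```python
import numpy as np

def combine_kernels(kernels, eta, c=1.0):
    """Form K(eta) = sum_j eta_j K_j after rescaling eta >= 0 so that
    sum_j eta_j * tr(K_j) = c (the trace constraint of the MKL problem)."""
    eta = np.asarray(eta, dtype=float)
    traces = np.array([np.trace(K) for K in kernels])
    eta = c * eta / np.dot(eta, traces)          # enforce the trace constraint
    return sum(e * K for e, K in zip(eta, kernels)), eta

def qcqp_constraint_values(kernels, alpha, y):
    """Left-hand sides alpha^T D(y) K_j D(y) alpha of the constraints in (L)."""
    v = y * alpha                                 # D(y) alpha
    return np.array([v @ K @ v for K in kernels])
```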
2.2. Support kernel machine

We now introduce a novel classification algorithm that we refer to as the "support kernel machine" (SKM). It will be motivated as a block-based variant of the SVM and related margin-based classification algorithms. But our underlying motivation is the fact that the dual of the SKM is exactly the problem (L). We establish this equivalence in the following section.
2.2.1. Linear classification

In this section we let X = ℝ^k. We also assume we are given a decomposition of ℝ^k as a product of m blocks: ℝ^k = ℝ^{k_1} × ⋯ × ℝ^{k_m}, so that each data point x_i can be decomposed into m block components, i.e., x_i = (x_{1i}, …, x_{mi}), where each x_{ji} is in general a vector. The goal is to find a linear classifier of the form y = sign(w^⊤x + b) where w has the same block decomposition w = (w_1, …, w_m) ∈ ℝ^{k_1 + ⋯ + k_m}. In the spirit of the soft margin SVM, we achieve this by minimizing a linear combination of the inverse of the margin and the training error. Various norms can be used to combine the two terms, and indeed many different algorithms have been explored for various combinations of ℓ1-norms and ℓ2-norms. In this paper, our goal is to encourage the sparsity of the vector w at the level of blocks; in particular, we want most of its (multivariate) components w_i to be zero. A natural way to achieve this is to penalize the ℓ1-norm of w. Since w is defined by blocks, we minimize the square of a weighted block ℓ1-norm, $\big(\sum_{j=1}^m d_j \|w_j\|_2\big)^2$, where within every block, an ℓ2-norm is used. Note that a standard ℓ2-based SVM is obtained if we minimize the square of a block ℓ2-norm, $\sum_{j=1}^m \|w_j\|_2^2$, which corresponds to $\|w\|_2^2$, i.e., ignoring the block structure. On the other hand, if m = k and d_j = 1, we minimize the square of the ℓ1-norm of w, which is very similar to the LP-SVM proposed by Bradley and Mangasarian (1998). The primal problem for the SKM is thus:

$$
\begin{aligned}
(P)\quad \min\ \ & \tfrac{1}{2}\Big(\sum_{j=1}^m d_j \|w_j\|_2\Big)^2 + C \sum_{i=1}^n \xi_i \\
\text{w.r.t.}\ \ & w \in \mathbb{R}^{k_1} \times \cdots \times \mathbb{R}^{k_m},\ \xi \in \mathbb{R}^n_+,\ b \in \mathbb{R} \\
\text{s.t.}\ \ & y_i\Big(\sum_j w_j^\top x_{ji} + b\Big) \ge 1 - \xi_i, \quad \forall i \in \{1,\dots,n\}.
\end{aligned}
$$
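For illustration, a minimal sketch (assumed shapes, not the authors' implementation) of the SKM primal objective in (P): the squared weighted block ℓ1-norm of w plus the hinge slack term, with w given as a list of per-block vectors w_j and the data given as a list of per-block design matrices X_j whose rows are x_ji^⊤:

```python
import numpy as np

def skm_primal_objective(w_blocks, b, X_blocks, y, d, C):
    """Objective of (P): 0.5*(sum_j d_j*||w_j||_2)^2 + C*sum_i xi_i,
    where xi_i = max(0, 1 - y_i*(sum_j w_j^T x_ji + b)) is the hinge slack."""
    block_l1 = sum(dj * np.linalg.norm(wj) for dj, wj in zip(d, w_blocks))
    scores = sum(Xj @ wj for Xj, wj in zip(X_blocks, w_blocks)) + b
    xi = np.maximum(0.0, 1.0 - y * scores)
    return 0.5 * block_l1**2 + C * np.sum(xi)
```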
2.2.2. Conic duality and optimality conditions

For a given optimization problem there are many ways of deriving a dual problem. In our particular case, we treat problem (P) as a second-order cone program (SOCP) (Lobo et al., 1998), which yields the following dual (see Appendix A for the derivation):

$$
\begin{aligned}
(D)\quad \min\ \ & \tfrac{1}{2}\gamma^2 - \alpha^\top e \\
\text{w.r.t.}\ \ & \gamma \in \mathbb{R},\ \alpha \in \mathbb{R}^n \\
\text{s.t.}\ \ & 0 \le \alpha \le C, \quad \alpha^\top y = 0 \\
& \Big\|\sum_i \alpha_i y_i x_{ji}\Big\|_2 \le d_j\,\gamma, \quad \forall j \in \{1,\dots,m\}.
\end{aligned}
$$

In addition, the Karush-Kuhn-Tucker (KKT) optimality conditions give the following complementary slackness equations (here t_j denotes the auxiliary cone variable of the SOCP formulation of (P), equal to ‖w_j‖₂ at the optimum):

$$
\begin{aligned}
\text{(a)}\quad & \alpha_i\,\Big(y_i\Big(\sum_j w_j^\top x_{ji} + b\Big) - 1 + \xi_i\Big) = 0, \quad \forall i \\
\text{(b)}\quad & (C - \alpha_i)\,\xi_i = 0, \quad \forall i \\
\text{(c)}\quad & \begin{pmatrix} w_j \\ \|w_j\|_2 \end{pmatrix}^{\!\top} \begin{pmatrix} -\sum_i \alpha_i y_i x_{ji} \\ d_j\,\gamma \end{pmatrix} = 0, \quad \forall j \\
\text{(d)}\quad & \gamma\,\Big(\sum_j d_j\, t_j - \gamma\Big) = 0.
\end{aligned}
$$
Equations (a) and (b) are the same as in the classical SVM, where they define the notion of a "support vector." That is, at the optimum, we can divide the data points into three disjoint sets: I_0 = {i : α_i = 0}, I_M = {i : α_i ∈ (0, C)}, and I_C = {i : α_i = C}, such that points belonging to I_0 are correctly classified points not on the margin and such that ξ_i = 0; points in I_M are correctly classified points on the margin such that ξ_i = 0 and $y_i(\sum_j w_j^\top x_{ji} + b) = 1$; and points in I_C are points on the "wrong" side of the margin for which ξ_i ≥ 0 (incorrectly classified if ξ_i > 1) and $y_i(\sum_j w_j^\top x_{ji} + b) = 1 - \xi_i$. The points whose indices i are in I_M or I_C are the support vectors.

[Figure 1. Orthogonality of elements of the second-order cone K₂ = {w = (u, v) : u ∈ ℝ², v ∈ ℝ, ‖u‖₂ ≤ v}: two elements w, w′ of K₂ are orthogonal and nonzero if and only if they belong to the boundary and are anti-proportional.]
While the KKT conditions (a) and (b) refer to the index i over data points, the KKT conditions (c) and (d) refer to the index j over components of the input vector. These conditions thus imply a form of sparsity not over data points but over "input dimensions." Indeed, two non-zero elements (u, v) and (u′, v′) of a second-order cone K_d = {(u, v) ∈ ℝ^d × ℝ : ‖u‖₂ ≤ v} are orthogonal if and only if they both belong to the boundary, and they are "anti-proportional" (Lobo et al., 1998); that is, there exists η > 0 such that $\|u\|_2 = v$, $\|u'\|_2 = v'$, and $(u, v) = \eta\,(-u',\, v')$ (see Figure 1).

Thus, if γ > 0, we have:

  • if $\big\|\sum_i \alpha_i y_i x_{ji}\big\|_2 < d_j\,\gamma$, then w_j = 0;
  • if $\big\|\sum_i \alpha_i y_i x_{ji}\big\|_2 = d_j\,\gamma$, then there exists η_j ≥ 0 such that $w_j = \eta_j \sum_i \alpha_i y_i x_{ji}$ and $\|w_j\|_2 = \eta_j\, d_j\, \gamma$.

Sparsity thus emerges from the optimization problem. Let J denote the set of active dimensions, i.e., $J(\alpha, \gamma) = \{\, j : \|\sum_i \alpha_i y_i x_{ji}\|_2 = d_j\,\gamma \,\}$. We can rewrite the optimality conditions as $\forall j,\ w_j = \eta_j \sum_i \alpha_i y_i x_{ji}$, with η_j = 0 if j ∉ J. Equation (d) implies that $\gamma = \sum_j d_j \|w_j\|_2 = \sum_j d_j (\eta_j d_j \gamma)$, which in turn implies $\sum_{j \in J} d_j^2\, \eta_j = 1$.
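The following NumPy sketch (illustrative only; shapes, helper names and the tolerance are assumptions) computes the quantities appearing in these conditions for a candidate dual solution: the block norms ‖Σ_i α_i y_i x_ji‖₂, the set of active dimensions J(α, γ), and the reconstruction w_j = η_j Σ_i α_i y_i x_ji given block weights η_j:

```python
import numpy as np

def active_dimensions(X_blocks, alpha, y, gamma, d, tol=1e-8):
    """Return the block norms ||sum_i alpha_i y_i x_ji||_2 and the index set
    J(alpha, gamma) = {j : block norm equals d_j * gamma (up to tol)}."""
    v = alpha * y
    norms = np.array([np.linalg.norm(Xj.T @ v) for Xj in X_blocks])
    active = [j for j, nj in enumerate(norms) if nj >= d[j] * gamma - tol]
    return norms, active

def recover_blocks(X_blocks, alpha, y, eta):
    """Recover w_j = eta_j * sum_i alpha_i y_i x_ji for each block j."""
    v = alpha * y
    return [ej * (Xj.T @ v) for ej, Xj in zip(eta, X_blocks)]
```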
2.2.3. Kernelization

We now remove the assumption that X is a Euclidean space, and consider embeddings of the data points x_i in a Euclidean space via a mapping φ : X → ℝ^f. In correspondence with our block-based formulation of the classification problem, we assume that φ(x) has m distinct block components φ(x) = (φ₁(x), …, φ_m(x)). Following the usual recipe for kernel methods, we assume that this embedding is performed implicitly, by specifying the inner product in ℝ^f using a kernel function, which in this case is the sum of individual kernel functions on each block component:

$$
k(x_i, x_j) = \varphi(x_i)^\top \varphi(x_j) = \sum_{s=1}^m \varphi_s(x_i)^\top \varphi_s(x_j) = \sum_{s=1}^m k_s(x_i, x_j).
$$

We now "kernelize" the problem (P) using this kernel function. In particular, we consider the dual of (P) and substitute the kernel function for the inner products in (D):

$$
\begin{aligned}
(D_K)\quad \min\ \ & \tfrac{1}{2}\gamma^2 - e^\top \alpha \\
\text{w.r.t.}\ \ & \gamma \in \mathbb{R},\ \alpha \in \mathbb{R}^n \\
\text{s.t.}\ \ & 0 \le \alpha \le C, \quad \alpha^\top y = 0 \\
& \big(\alpha^\top D(y) K_j D(y)\, \alpha\big)^{1/2} \le \gamma\, d_j, \quad \forall j,
\end{aligned}
$$

where K_j is the j-th Gram matrix of the points {x_i} corresponding to the j-th kernel.

The sparsity that emerges via the KKT conditions (c) and (d) now refers to the kernels K_j, and we refer to the kernels with nonzero η_j as "support kernels." The resulting classifier has the same form as the SVM classifier, but is based on the kernel matrix combination $K = \sum_j \eta_j K_j$, which is a sparse combination of "support kernels."
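To illustrate the resulting classifier, here is a minimal sketch (hypothetical names, assumed inputs) of the SVM-style decision function based on the sparse combination K = Σ_j η_j K_j, where K_test_list[j] holds the j-th kernel evaluated between test and training points:

```python
import numpy as np

def skm_decision_function(K_test_list, eta, alpha, y, b):
    """f(x) = sum_i alpha_i y_i sum_j eta_j k_j(x, x_i) + b, evaluated for a
    batch of test points; K_test_list[j] has shape (n_test, n_train)."""
    K_comb = sum(ej * Kt for ej, Kt in zip(eta, K_test_list) if ej > 0)
    return K_comb @ (alpha * y) + b
```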
2.3. Equivalence of the two formulations

By simply taking $d_j = \sqrt{\operatorname{tr} K_j / c}$, we see that problems (D_K) and (L) are indeed equivalent—thus the dual of the SKM is the multiple kernel learning primal. Care must be taken here though—the weights η_j are defined for (L) as Lagrange multipliers and for (D_K) through the anti-proportionality of orthogonal elements of a second-order cone, and a priori they might not coincide: although (D_K) and (L) are equivalent, their dual problems have different formulations. It is straightforward, however, to write the KKT optimality conditions for (α, η) for both problems and verify that they are indeed equivalent. One direct consequence is that for an optimal pair (α, η), α is an optimal solution of the SVM with kernel matrix $\sum_j \eta_j K_j$.
3. Optimality conditions

In this section, we formulate our problem (in either of its two equivalent forms) as the minimization of a non-differentiable convex function subject to linear constraints. Exact and approximate optimality conditions are then readily derived using subdifferentials. In later sections we will show how these conditions lead to an MY-regularized algorithmic formulation that will be amenable to SMO techniques.
3.1. Max-function formulation

A rearrangement of the problem (D_K) yields an equivalent formulation in which the quadratic constraints are moved into the objective function:

$$
\begin{aligned}
(S)\quad \min\ \ & \max_j\ \Big\{ \tfrac{1}{2 d_j^2}\, \alpha^\top D(y) K_j D(y)\, \alpha - \alpha^\top e \Big\} \\
\text{w.r.t.}\ \ & \alpha \in \mathbb{R}^n \\
\text{s.t.}\ \ & 0 \le \alpha \le C, \quad \alpha^\top y = 0.
\end{aligned}
$$

We let $J_j(\alpha)$ denote $\tfrac{1}{2 d_j^2}\, \alpha^\top D(y) K_j D(y)\, \alpha - \alpha^\top e$ and $J(\alpha) = \max_j J_j(\alpha)$. Problem (S) is the minimization of the non-differentiable convex function J(α) subject to linear constraints. Let $\mathcal{J}(\alpha)$ be the set of active kernels, i.e., the set of indices j such that $J_j(\alpha) = J(\alpha)$. We let $F_j(\alpha) \in \mathbb{R}^n$ denote the gradient of $J_j$, that is, $F_j = \frac{\partial J_j}{\partial \alpha} = \frac{1}{d_j^2}\, D(y) K_j D(y)\, \alpha - e$.
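A minimal NumPy sketch of these quantities (illustrative helper names; kernels, alpha, y and d are assumed inputs):

```python
import numpy as np

def Jj_values(kernels, alpha, y, d):
    """J_j(alpha) = (1/(2 d_j^2)) * alpha^T D(y) K_j D(y) alpha - alpha^T e."""
    v = alpha * y
    return np.array([(v @ K @ v) / (2.0 * dj**2) - alpha.sum()
                     for K, dj in zip(kernels, d)])

def J_and_gradients(kernels, alpha, y, d, eps=0.0):
    """Return J(alpha) = max_j J_j(alpha), the (eps-)active kernel set, and
    the gradients F_j = (1/d_j^2) D(y) K_j D(y) alpha - e of each J_j."""
    vals = Jj_values(kernels, alpha, y, d)
    J = vals.max()
    active = np.flatnonzero(vals >= J - eps)
    v = alpha * y
    F = [y * (K @ v) / dj**2 - 1.0 for K, dj in zip(kernels, d)]
    return J, active, F
```

With eps = 0 the active set is the set of active kernels used here; with eps > 0 the same computation yields the ε-active set used in Section 3.3.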
3.2. Optimality conditions and subdifferential

Given any function J(α), the subdifferential of J at α, denoted ∂J(α), is defined as (Bertsekas, 1995):

$$
\partial J(\alpha) = \{\, g \in \mathbb{R}^n : \forall \alpha',\ J(\alpha') \ge J(\alpha) + g^\top(\alpha' - \alpha) \,\}.
$$

Elements of the subdifferential ∂J(α) are called subgradients. When J is convex and differentiable at α, then the subdifferential is a singleton and reduces to the gradient. The notion of subdifferential is especially useful for characterizing optimality conditions of non-smooth problems (Bertsekas, 1995).

The function J(α) defined in the earlier section is a pointwise maximum of convex differentiable functions, and using subgradient calculus we can easily see that the subdifferential ∂J(α) of J at α is equal to the convex hull of the gradients F_j of J_j for the active kernels. That is:

$$
\partial J(\alpha) = \text{convex hull}\{\, F_j(\alpha),\ j \in \mathcal{J}(\alpha) \,\}.
$$

The Lagrangian for (S) is equal to $L(\alpha) = J(\alpha) - \delta^\top \alpha + \xi^\top(\alpha - Ce) + b\,\alpha^\top y$, where $b \in \mathbb{R}$ and $\xi, \delta \in \mathbb{R}^n_+$, and the global minimum of L(α, δ, ξ, b) with respect to α is characterized by the equation

$$
0 \in \partial L(\alpha) \ \Longleftrightarrow\ \delta - \xi - b\,y \in \partial J(\alpha).
$$

The optimality conditions are thus the following: α, (b, δ, ξ) is a pair of optimal primal/dual variables if and only if:

$$
(\mathrm{OPT}_0)\qquad
\begin{aligned}
& \delta - \xi - b\,y \in \partial J(\alpha) \\
& \forall i,\ \ \delta_i\,\alpha_i = 0, \quad \xi_i\,(C - \alpha_i) = 0 \\
& \alpha^\top y = 0, \quad 0 \le \alpha \le C.
\end{aligned}
$$
As before, we define I_M(α) = {i : α_i ∈ (0, C)}, I_0(α) = {i : α_i = 0}, and I_C(α) = {i : α_i = C}. We also define, following Keerthi et al. (2001), I_{0+} = I_0 ∩ {i : y_i = 1}, I_{0−} = I_0 ∩ {i : y_i = −1}, I_{C+} = I_C ∩ {i : y_i = 1}, and I_{C−} = I_C ∩ {i : y_i = −1}. We can then rewrite the optimality conditions as

$$
(\mathrm{OPT}_1)\qquad
\begin{aligned}
& \nu - b\,e = D(y) \sum_{j \in \mathcal{J}(\alpha)} d_j^2\, \eta_j\, F_j(\alpha) \\
& \eta \ge 0, \qquad \sum_j d_j^2\, \eta_j = 1 \\
& \forall i \in I_M \cup I_{0+} \cup I_{C-},\ \ \nu_i \ge 0 \\
& \forall i \in I_M \cup I_{0-} \cup I_{C+},\ \ \nu_i \le 0.
\end{aligned}
$$
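A small sketch (assumed tolerance handling, hypothetical helper name) of the index sets used above, computed from a candidate α:

```python
import numpy as np

def svm_index_sets(alpha, y, C, tol=1e-8):
    """Index sets I_M, I_0+, I_0-, I_C+, I_C- in the style of Keerthi et al. (2001)."""
    at_zero = alpha <= tol
    at_C = alpha >= C - tol
    I_M = np.flatnonzero(~at_zero & ~at_C)
    I_0p, I_0m = np.flatnonzero(at_zero & (y > 0)), np.flatnonzero(at_zero & (y < 0))
    I_Cp, I_Cm = np.flatnonzero(at_C & (y > 0)), np.flatnonzero(at_C & (y < 0))
    return I_M, I_0p, I_0m, I_Cp, I_Cm
```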
3.3. Approximate optimality conditions

Exact optimality conditions such as (OPT0) or (OPT1) are generally not suitable for numerical optimization. In non-smooth optimization theory, one instead formulates optimality criteria in terms of the ε-subdifferential, which is defined as

$$
\partial_\varepsilon J(\alpha) = \{\, g \in \mathbb{R}^n : \forall \alpha',\ J(\alpha') \ge J(\alpha) - \varepsilon + g^\top(\alpha' - \alpha) \,\}.
$$

When $J(\alpha) = \max_j J_j(\alpha)$, the ε-subdifferential contains (potentially strictly) the convex hull of the gradients F_j(α) for all ε-active functions, i.e., for all j such that $\max_i J_i(\alpha) - \varepsilon \le J_j(\alpha)$. We let $\mathcal{J}_\varepsilon(\alpha)$ denote the set of all such kernels. So we have

$$
C_\varepsilon(\alpha) = \text{convex hull}\{\, F_j(\alpha),\ j \in \mathcal{J}_\varepsilon(\alpha) \,\} \subset \partial_\varepsilon J(\alpha).
$$

Our stopping criterion, referred to as (ε₁, ε₂)-optimality, requires that the ε₁-subdifferential is within ε₂ of zero, and that the usual KKT conditions are met. That is, we stop whenever there exist ν, b, g such that

$$
(\mathrm{OPT}_2)\qquad
\begin{aligned}
& g \in \partial_{\varepsilon_1} J(\alpha) \\
& \forall i \in I_M \cup I_{0+} \cup I_{C-},\ \ \nu_i \ge 0 \\
& \forall i \in I_M \cup I_{0-} \cup I_{C+},\ \ \nu_i \le 0 \\
& \|\nu - b\,e - D(y)\,g\| \le \varepsilon_2.
\end{aligned}
$$

Note that for one kernel, i.e., when the SKM reduces to the SVM, this corresponds to the approximate KKT conditions usually employed for the standard SVM (Platt, 1998; Keerthi et al., 2001; Joachims, 1998). For a given α, checking optimality is hard, since even computing $\partial_{\varepsilon_1} J(\alpha)$ is hard in closed form. However, a sufficient condition for optimality can be obtained by using the inner approximation $C_{\varepsilon_1}(\alpha)$ of this ε₁-subdifferential, i.e., the convex hull of gradients of ε₁-active kernels. Checking this sufficient condition is a linear programming (LP) existence problem, i.e., find η such that:

$$
(\mathrm{OPT}_3)\qquad
\begin{aligned}
& \eta \ge 0, \qquad \eta_j = 0 \ \text{ if } j \notin \mathcal{J}_{\varepsilon_1}(\alpha), \qquad \sum_j d_j^2\, \eta_j = 1 \\
& \max_{i \in I_M \cup I_{0-} \cup I_{C+}} \big\{ (K(\eta) D(y)\,\alpha)_i - y_i \big\}
  \ \le\ \min_{i \in I_M \cup I_{0+} \cup I_{C-}} \big\{ (K(\eta) D(y)\,\alpha)_i - y_i \big\} + 2\varepsilon_2,
\end{aligned}
$$

where $K(\eta) = \sum_{j \in \mathcal{J}_{\varepsilon_1}(\alpha)} \eta_j K_j$. Given α, we can determine whether it is (ε₁, ε₂)-optimal by solving the potentially large LP (OPT3). If in addition to having α, we know a potential candidate for η, then a sufficient condition for optimality is that this η verifies (OPT3), which doesn't require solving the LP. Indeed, the iterative algorithm that we present in Section 4 outputs a pair (α, η) and only these sufficient optimality conditions need to be checked.
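The following self-contained sketch (hypothetical function name, assumed array inputs) checks the sufficient (ε₁, ε₂)-optimality condition (OPT3) for a given candidate pair (α, η) without solving an LP:

```python
import numpy as np

def check_opt3(kernels, alpha, y, eta, d, C, eps1, eps2, tol=1e-8):
    """Sufficient (eps1, eps2)-optimality check of (OPT3) for a candidate
    pair (alpha, eta); all inputs are NumPy arrays of matching sizes."""
    d = np.asarray(d, dtype=float)
    eta = np.asarray(eta, dtype=float)
    v = alpha * y                                       # D(y) alpha

    # eps1-active kernels: J_j(alpha) within eps1 of max_j J_j(alpha).
    vals = np.array([(v @ K @ v) / (2 * dj**2) - alpha.sum()
                     for K, dj in zip(kernels, d)])
    active = vals >= vals.max() - eps1

    # eta must be nonnegative, supported on active kernels, with sum_j d_j^2 eta_j = 1.
    if np.any(eta < -tol) or np.any((eta > tol) & ~active):
        return False
    if abs(np.dot(d**2, eta) - 1.0) > tol:
        return False

    # Index sets of Keerthi et al. (2001), as boolean masks.
    at0, atC = alpha <= tol, alpha >= C - tol
    I_M = ~at0 & ~atC
    upper = I_M | (at0 & (y < 0)) | (atC & (y > 0))     # I_M + I_0- + I_C+
    lower = I_M | (at0 & (y > 0)) | (atC & (y < 0))     # I_M + I_0+ + I_C-

    G = sum(ej * K for ej, K in zip(eta, kernels)) @ v - y   # (K(eta)D(y)alpha)_i - y_i
    return G[upper].max() <= G[lower].min() + 2 * eps2
```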
3.4. Improving sparsity

Once we have an approximate solution, i.e., values α and η that satisfy (OPT3), we can ask whether η can be made sparser. Indeed, if some of the kernels are close to identical, then some of the η's can potentially be removed—for a general SVM, the optimal α is not unique if data points coincide, and for a general SKM, the optimal α and η are not unique if data points or kernels coincide. When searching for the minimum ℓ0-norm η which satisfies the constraints (OPT3), we can thus consider a simple heuristic approach where we loop through all the nonzero η_j and check whether each such component can be removed. That is, for all $j \in \mathcal{J}_{\varepsilon_1}(\alpha)$, we force η_j to zero and solve the LP. If it is feasible, then the j-th kernel can be removed.
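A sketch of this heuristic under stated assumptions (hypothetical function names; `upper` and `lower` are the boolean masks for I_M ∪ I_{0−} ∪ I_{C+} and I_M ∪ I_{0+} ∪ I_{C−}, assumed nonempty; SciPy's linprog is used only as a generic LP feasibility oracle, and a more careful implementation would also keep the η returned by the LP):

```python
import numpy as np
from scipy.optimize import linprog

def removable(kernels, alpha, y, d, keep, upper, lower, eps2):
    """Feasibility of the (OPT3) LP restricted to the kernels flagged in `keep`."""
    d = np.asarray(d, dtype=float)
    cols = np.flatnonzero(keep)
    if cols.size == 0:
        return False
    v = alpha * y
    M = np.column_stack([kernels[j] @ v for j in cols])   # columns: (K_j D(y) alpha)_i
    p = cols.size
    # Variables z = (eta_restricted, t); the auxiliary t plays the role of -b.
    A_ub = np.vstack([np.hstack([M[upper], -np.ones((upper.sum(), 1))]),
                      np.hstack([-M[lower], np.ones((lower.sum(), 1))])])
    b_ub = np.concatenate([y[upper] + eps2, -y[lower] + eps2])
    A_eq = np.append(d[cols] ** 2, 0.0)[None, :]           # sum_j d_j^2 eta_j = 1
    res = linprog(np.zeros(p + 1), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * p + [(None, None)])
    return res.success

def prune_support_kernels(kernels, alpha, y, d, eta, upper, lower, eps2):
    """Heuristic of Section 3.4: try to force each nonzero eta_j to zero and
    keep the removal whenever the (OPT3) LP remains feasible."""
    keep = np.asarray(eta) > 0
    for j in np.flatnonzero(keep):
        trial = keep.copy()
        trial[j] = False
        if removable(kernels, alpha, y, d, trial, upper, lower, eps2):
            keep = trial
    return keep
```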
4. Regularized support kernel machine

The function J(α) is convex but not differentiable. It is well known that in this situation, steepest descent and coordinate descent methods do not necessarily converge to the global optimum (Bertsekas, 1995). SMO unfortunately falls into this class of methods. Therefore, in order to develop an SMO-like algorithm for the SKM, we make use of Moreau-Yosida regularization. In our specific case, this simply involves adding a second regularization term to the objective function of the SKM, as follows:

$$
\begin{aligned}
(R)\quad \min\ \ & \tfrac{1}{2}\Big(\sum_j d_j \|w_j\|_2\Big)^2 + \tfrac{1}{2}\sum_j a_j^2\, \|w_j\|_2^2 + C \sum_i \xi_i \\
\text{w.r.t.}\ \ & w \in \mathbb{R}^{k_1} \times \cdots \times \mathbb{R}^{k_m},\ \xi \in \mathbb{R}^n_+,\ b \in \mathbb{R} \\
\text{s.t.}\ \ & y_i\Big(\sum_j w_j^\top x_{ji} + b\Big) \ge 1 - \xi_i, \quad \forall i \in \{1,\dots,n\},
\end{aligned}
$$

where (a_j) are the MY-regularization parameters.
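For reference, a minimal sketch of the regularized primal objective in (R), reusing the block layout assumed in the earlier sketch of (P) (hypothetical helper name; the choice of the parameters a_j is not addressed here):

```python
import numpy as np

def my_regularized_objective(w_blocks, b, X_blocks, y, d, a, C):
    """Objective of (R): the SKM block-l1 term plus the Moreau-Yosida
    quadratic term 0.5 * sum_j a_j^2 * ||w_j||_2^2 and the hinge slacks."""
    norms = np.array([np.linalg.norm(wj) for wj in w_blocks])
    scores = sum(Xj @ wj for Xj, wj in zip(X_blocks, w_blocks)) + b
    xi = np.maximum(0.0, 1.0 - y * scores)
    return (0.5 * np.dot(d, norms) ** 2
            + 0.5 * np.dot(np.asarray(a) ** 2, norms ** 2)
            + C * np.sum(xi))
```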

Citations
BookDOI
31 Mar 2010
TL;DR: Semi-supervised learning (SSL) as discussed by the authors is the middle ground between supervised learning (in which all training examples are labeled) and unsupervised learning (in which no label data are given).
Abstract: In the field of machine learning, semi-supervised learning (SSL) occupies the middle ground, between supervised learning (in which all training examples are labeled) and unsupervised learning (in which no label data are given). Interest in SSL has increased in recent years, particularly because of application domains in which unlabeled data are plentiful, such as images, text, and bioinformatics. This first comprehensive overview of SSL presents state-of-the-art algorithms, a taxonomy of the field, selected applications, benchmark experiments, and perspectives on ongoing and future research. Semi-Supervised Learning first presents the key assumptions and ideas underlying the field: smoothness, cluster or low-density separation, manifold structure, and transduction. The core of the book is the presentation of SSL methods, organized according to algorithmic strategies. After an examination of generative models, the book describes algorithms that implement the low-density separation assumption, graph-based methods, and algorithms that perform two-step learning. The book then discusses SSL applications and offers guidelines for SSL practitioners by analyzing the results of extensive benchmark experiments. Finally, the book looks at interesting directions for SSL research. The book closes with a discussion of the relationship between semi-supervised learning and transduction. Adaptive Computation and Machine Learning series

3,773 citations

Journal ArticleDOI
10 Jul 2015-PLOS ONE
TL;DR: This work proposes a general solution to the problem of understanding classification decisions by pixel-wise decomposition of nonlinear classifiers by introducing a methodology that allows to visualize the contributions of single pixels to predictions for kernel-based classifiers over Bag of Words features and for multilayered neural networks.
Abstract: Understanding and interpreting classification decisions of automated image classification systems is of high value in many applications, as it allows to verify the reasoning of the system and provides additional information to the human expert. Although machine learning methods are solving very successfully a plethora of tasks, they have in most cases the disadvantage of acting as a black box, not providing any information about what made them arrive at a particular decision. This work proposes a general solution to the problem of understanding classification decisions by pixel-wise decomposition of nonlinear classifiers. We introduce a methodology that allows to visualize the contributions of single pixels to predictions for kernel-based classifiers over Bag of Words features and for multilayered neural networks. These pixel contributions can be visualized as heatmaps and are provided to a human expert who can intuitively not only verify the validity of the classification decision, but also focus further analysis on regions of potential interest. We evaluate our method for classifiers trained on PASCAL VOC 2009 images, synthetic image data containing geometric shapes, the MNIST handwritten digits data set and for the pre-trained ImageNet model available as part of the Caffe open source package.

3,330 citations

Proceedings ArticleDOI
16 Dec 2008
TL;DR: Results show that learning the optimum kernel combination of multiple features vastly improves the performance, from 55.1% for the best single feature to 72.8% for the combination of all features.
Abstract: We investigate to what extent combinations of features can improve classification performance on a large dataset of similar classes. To this end we introduce a 103 class flower dataset. We compute four different features for the flowers, each describing different aspects, namely the local shape/texture, the shape of the boundary, the overall spatial distribution of petals, and the colour. We combine the features using a multiple kernel framework with a SVM classifier. The weights for each class are learnt using the method of Varma and Ray, which has achieved state of the art performance on other large dataset, such as Caltech 101/256. Our dataset has a similar challenge in the number of classes, but with the added difficulty of large between class similarity and small within class similarity. Results show that learning the optimum kernel combination of multiple features vastly improves the performance, from 55.1% for the best single feature to 72.8% for the combination of all features.

2,619 citations


Cites methods from "Multiple kernel learning, conic duality, and the SMO algorithm":

  • ...The classifier is a SVM [15] using multiple kernels [1]....

Journal Article
TL;DR: Overall, using multiple kernels instead of a single one is useful and it is believed that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels.
Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels.

1,762 citations


Cites background or methods or results from "Multiple kernel learning, conic duality, and the SMO algorithm":

  • ...They show that their formulation is the multiclass generalization of the previously developed binary classification methods of Bach et al. (2004) and Sonnenburg et al. (2006b)....

  • ...Özen et al. (2009) use the formulation of Bach et al. (2004) in order to combine different feature subsets for the protein stability prediction problem and extract information about the importance of these subsets by looking at the learned kernel weights....

  • ...This method gives similar performance results when compared to the SMO-like algorithm of Bach et al. (2004) for a protein-protein interaction prediction problem using much less time and memory....

  • ...Sonnenburg et al. (2006a,b) rewrite the QCQP formulation of Bach et al. (2004): minimize γ with respect to γ ∈ ℝ, α ∈ ℝ^N_+, subject to Σ_{i=1}^N α_i y_i = 0, C ≥ α_i ≥ 0 ∀i, and γ ≥ (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j k_m(x_i, x_j) − Σ_{i=1}^N α_i =: S_m(α) for all m, and convert this problem into the following SILP problem: maximize…

  • ...Yan et al. (2009) compare the l1-norm and l2-norm for image and video classification tasks, and conclude that the l2-norm should be used when the combined kernels carry complementary information....

Proceedings ArticleDOI
09 Jul 2007
TL;DR: This work introduces a descriptor that represents local image shape and its spatial layout, together with a spatial pyramid kernel that is designed so that the shape correspondence between two images can be measured by the distance between their descriptors using the kernel.
Abstract: The objective of this paper is classifying images by the object categories they contain, for example motorbikes or dolphins. There are three areas of novelty. First, we introduce a descriptor that represents local image shape and its spatial layout, together with a spatial pyramid kernel. These are designed so that the shape correspondence between two images can be measured by the distance between their descriptors using the kernel. Second, we generalize the spatial pyramid kernel, and learn its level weighting parameters (on a validation set). This significantly improves classification performance. Third, we show that shape and appearance kernels may be combined (again by learning parameters on a validation set). Results are reported for classification on Caltech-101 and retrieval on the TRECVID 2006 data sets. For Caltech-101 it is shown that the class specific optimization that we introduce exceeds the state of the art performance by more than 10%.

1,496 citations

References
Book
01 Jan 1995

12,671 citations

01 Jan 1999
TL;DR: SMO breaks this large quadratic programming problem into a series of smallest possible QP problems, which avoids using a time-consuming numerical QP optimization as an inner loop and hence SMO is fastest for linear SVMs and sparse data sets.

5,350 citations

"Multiple kernel learning, conic duality, and the SMO algorithm" refers to methods in this paper:

  • ...4 with solving the QCQP (L) using Mosek for two datasets, ionosphere and breast cancer, from the UCI repository, and nested subsets of the adult dataset from Platt (1998). The basis kernels are Gaussian kernels on random subsets of features, with varying widths....

  • ...Since the ε-optimality conditions for the MY-regularized SKM are exactly the same as for the SVM, but with a different objective function (Platt, 1998; Keerthi et al., 2001):...

  • ..., when the SKM reduces to the SVM, this corresponds to the approximate KKT conditions usually employed for the standard SVM (Platt, 1998; Keerthi et al., 2001; Joachims, 1998)....

  • ...Indeed, off-the-shelf algorithms do not suffice in large-scale applications of the SVM, and a second major reason for the rise to prominence of the SVM is the development of special-purpose algorithms for solving the QP (Platt, 1998; Joachims, 1998; Keerthi et al., 2001)....

Book
John Platt1
08 Feb 1999
TL;DR: In this article, the authors proposed a new algorithm for training Support Vector Machines (SVM) called SMO (Sequential Minimal Optimization), which breaks this large QP problem into a series of smallest possible QP problems.
Abstract: This chapter describes a new algorithm for training Support Vector Machines: Sequential Minimal Optimization, or SMO. Training a Support Vector Machine (SVM) requires the solution of a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of smallest possible QP problems. These small QP problems are solved analytically, which avoids using a time-consuming numerical QP optimization as an inner loop. The amount of memory required for SMO is linear in the training set size, which allows SMO to handle very large training sets. Because large matrix computation is avoided, SMO scales somewhere between linear and quadratic in the training set size for various test problems, while a standard projected conjugate gradient (PCG) chunking algorithm scales somewhere between linear and cubic in the training set size. SMO's computation time is dominated by SVM evaluation, hence SMO is fastest for linear SVMs and sparse data sets. For the MNIST database, SMO is as fast as PCG chunking; while for the UCI Adult database and linear SVMs, SMO can be more than 1000 times faster than the PCG chunking algorithm.

5,019 citations

Book
01 Jan 1972
TL;DR: This monograph describes and analyzes some practical methods for finding approximate zeros and minima of functions.
Abstract: This monograph describes and analyzes some practical methods for finding approximate zeros and minima of functions.

2,477 citations

Journal ArticleDOI
TL;DR: This paper shows how the kernel matrix can be learned from data via semidefinite programming (SDP) techniques and leads directly to a convex method for learning the 2-norm soft margin parameter in support vector machines, solving an important open problem.
Abstract: Kernel-based learning algorithms work by embedding the data into a Euclidean space, and then searching for linear relations among the embedded data points. The embedding is performed implicitly, by specifying the inner products between each pair of points in the embedding space. This information is contained in the so-called kernel matrix, a symmetric and positive semidefinite matrix that encodes the relative positions of all points. Specifying this matrix amounts to specifying the geometry of the embedding space and inducing a notion of similarity in the input space---classical model selection problems in machine learning. In this paper we show how the kernel matrix can be learned from data via semidefinite programming (SDP) techniques. When applied to a kernel matrix associated with both training and test data this gives a powerful transductive algorithm---using the labeled part of the data one can learn an embedding also for the unlabeled part. The similarity between test points is inferred from training points and their labels. Importantly, these learning problems are convex, so we obtain a method for learning both the model class and the function without local minima. Furthermore, this approach leads directly to a convex method for learning the 2-norm soft margin parameter in support vector machines, solving an important open problem.

2,419 citations

"Multiple kernel learning, conic duality, and the SMO algorithm" refers to background or methods in this paper:

  • ...As we will show, the conic dual problem defining the SKM is exactly the multiple kernel learning problem of Lanckriet et al. (2004). Moreover, given this new formulation, we can design a Moreau-Yosida regularization which preserves the sparse SVM structure, and therefore we can apply SMO…

  • ...Lanckriet et al. (2004) show that this setup yields the following optimization problem: min ζ − 2e^⊤α (L) w.r.t. ζ ∈ ℝ, α ∈ ℝⁿ, s.t. 0 ≤ α ≤ C, α^⊤y = 0, α^⊤D(y)K_jD(y)α ≤ (tr K_j / c) ζ, j ∈ {1, …, m}, where D(y) is the diagonal matrix with diagonal y, e ∈ ℝⁿ the vector of all ones, and C a positive…

  • ...In particular, the QCQP formulation of Lanckriet et al. (2004) does not lead to an MY-regularized problem that can be solved efficiently by SMO techniques....

  • ...Unfortunately, in our setting, this creates a new difficulty—we lose the sparsity that makes the SVM amenable to SMO optimization....

  • ...In this paper we focus on the framework proposed by Lanckriet et al. (2004), which involves joint optimization of the coefficients in a conic combination of kernel matrices and the coefficients of a discriminative classifier....

Frequently Asked Questions (11)
Q1. What are the contributions in "Multiple kernel learning, conic duality, and the SMO algorithm"?

The authors propose a novel dual formulation of the QCQP as a second-order cone programming problem, and show how to exploit the technique of Moreau-Yosida regularization to yield a formulation to which SMO techniques can be applied. The authors present experimental results that show that their SMO-based algorithm is significantly more efficient than the general-purpose interior point methods available in current optimization toolboxes. 

The good scaling with respect to the number of data points makes it possible to learn kernels for large-scale problems, while the good scaling with respect to the number of basis kernels opens up the possibility of application to large-scale feature selection, in which the algorithm selects kernels that define non-linear mappings on subsets of input features.

Their algorithm is based on applying sequential minimization techniques to a smoothed version of a convex nonsmooth optimization problem. 

One of the major reasons for the rise to prominence of the support vector machine (SVM) is its ability to cast nonlinear classification as a convex optimization problem, in particular a quadratic program (QP). 

Their stopping criterion, referred to as (ε1, ε2)-optimality, requires that the ε1-subdifferential is within ε2 of zero, and that the usual KKT conditions are met.

Checking this sufficient condition is a linear programming (LP) existence problem, i.e., find η such that: η ≥ 0, η_j = 0 if j ∉ 𝒥_{ε1}(α), Σ_j d_j² η_j = 1, and (OPT3) max_{i∈I_M∪I_{0−}∪I_{C+}} {(K(η)D(y)α)_i − y_i} ≤ min_{i∈I_M∪I_{0+}∪I_{C−}} {(K(η)D(y)α)_i − y_i} + 2ε2, where K(η) = Σ_{j∈𝒥_{ε1}(α)} η_j K_j.

In this section, the authors show that if (a_j) are small enough, then an ε2/2-optimal solution of the MY-regularized SKM α, together with η̃(α), is an (ε1, ε2)-optimal solution of the SKM, and an a priori bound on (a_j) is obtained that does not depend on the solution α. Theorem 1: Let ε ∈ (0, 1). Let y ∈ {−1, 1}ⁿ and K_j, j = 1, …, m, be m positive semidefinite kernel matrices.

Convexity implies that the solution is unique and brings a suite of standard numerical software to bear in finding the solution.

If in addition to having α, the authors know a potential candidate for η, then a sufficient condition for optimality is that this η verifies (OPT3), which doesn’t require solving the LP. 

While the multiple kernel learning problem is convex, it is also non-smooth—it can be cast as the minimization of a non-differentiable function subject to linear constraints (see Section 3.1).

If the authors define the function G(α) as G(α) = min_{γ∈ℝ₊, μ∈ℝᵐ} { (1/2)γ² + (1/2) Σ_j (μ_j − γ d_j)²/a_j² − Σ_i α_i : ‖Σ_i α_i y_i x_{ji}‖₂ ≤ μ_j, ∀j }, then the dual problem is equivalent to minimizing G(α) subject to 0 ≤ α ≤ C and α^⊤y = 0.