Multiple Kernel Learning, Conic Duality, and the SMO Algorithm
Francis R. Bach & Gert R. G. Lanckriet {fbach,gert}@cs.berkeley.edu
Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720, USA
Michael I. Jordan jordan@cs.berkeley.edu
Computer Science Division and Department of Statistics, University of California, Berkeley, CA 94720, USA
Abstract

While classical kernel-based classifiers are based on a single kernel, in practice it is often desirable to base classifiers on combinations of multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for the support vector machine (SVM), and showed that the optimization of the coefficients of such a combination reduces to a convex optimization problem known as a quadratically-constrained quadratic program (QCQP). Unfortunately, current convex optimization toolboxes can solve this problem only for a small number of kernels and a small number of data points; moreover, the sequential minimal optimization (SMO) techniques that are essential in large-scale implementations of the SVM cannot be applied because the cost function is non-differentiable. We propose a novel dual formulation of the QCQP as a second-order cone programming problem, and show how to exploit the technique of Moreau-Yosida regularization to yield a formulation to which SMO techniques can be applied. We present experimental results that show that our SMO-based algorithm is significantly more efficient than the general-purpose interior point methods available in current optimization toolboxes.
1. Introduction
One of the major reasons for the rise to prominence of the support vector machine (SVM) is its ability to cast nonlinear classification as a convex optimization problem, in particular a quadratic program (QP). Convexity implies that the solution is unique and brings a suite of standard numerical software to bear in finding the solution. Convexity alone, however, does not imply that the available algorithms scale well to problems of interest. Indeed, off-the-shelf algorithms do not suffice in large-scale applications of the SVM, and a second major reason for the rise to prominence of the SVM is the development of special-purpose algorithms for solving the QP (Platt, 1998; Joachims, 1998; Keerthi et al., 2001).

[Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the first author.]
Recent developments in the literature on the SVM and other kernel methods have emphasized the need to consider multiple kernels, or parameterizations of kernels, and not a single fixed kernel. This provides needed flexibility and also reflects the fact that practical learning problems often involve multiple, heterogeneous data sources. While this so-called “multiple kernel learning” problem can in principle be solved via cross-validation, several recent papers have focused on more efficient methods for kernel learning (Chapelle et al., 2002; Grandvalet & Canu, 2003; Lanckriet et al., 2004; Ong et al., 2003). In this paper we focus on the framework proposed by Lanckriet et al. (2004), which involves joint optimization of the coefficients in a conic combination of kernel matrices and the coefficients of a discriminative classifier. In the SVM setting, this problem turns out to again be a convex optimization problem—a quadratically-constrained quadratic program (QCQP). This problem is more challenging than a QP, but it can also be solved in principle by general-purpose optimization toolboxes such as Mosek (Andersen & Andersen, 2000). Again, however, this existing algorithmic solution suffices only for small problems (small numbers of kernels and data points), and improved algorithmic solutions akin to sequential minimal optimization (SMO) are needed.
While the multiple kernel learning problem is convex, it is also non-smooth—it can be cast as the minimization of a non-differentiable function subject to linear constraints (see Section 3.1). Unfortunately, as is well known in the non-smooth optimization literature, this means that simple local descent algorithms such as SMO may fail to converge or may converge to incorrect values (Bertsekas, 1995). Indeed, in preliminary attempts to solve the QCQP using SMO we ran into exactly these convergence problems.
One class of solutions to non-smooth optimization problems involves constructing a smooth approximate problem out of a non-smooth problem. In particular, Moreau-Yosida (MY) regularization is an effective general solution methodology that is based on inf-convolution (Lemarechal & Sagastizabal, 1997). It can be viewed in terms of the dual problem as simply adding a quadratic regularization term to the dual objective function. Unfortunately, in our setting, this creates a new difficulty—we lose the sparsity that makes the SVM amenable to SMO optimization. In particular, the QCQP formulation of Lanckriet et al. (2004) does not lead to an MY-regularized problem that can be solved efficiently by SMO techniques.
In this paper we show how these problems can be resolved by considering a novel dual formulation of the QCQP as a second-order cone programming (SOCP) problem. This new formulation is of interest on its own merit, because of various connections to existing algorithms. In particular, it is closely related to the classical maximum margin formulation of the SVM, differing only by the choice of the norm of the inverse margin. Moreover, the KKT conditions arising in the new formulation not only lead to support vectors as in the classical SVM, but also to a dual notion of “support kernels”—those kernels that are active in the conic combination. We thus refer to the new formulation as the support kernel machine (SKM).

As we will show, the conic dual problem defining the SKM is exactly the multiple kernel learning problem of Lanckriet et al. (2004).¹ Moreover, given this new formulation, we can design a Moreau-Yosida regularization which preserves the sparse SVM structure, and therefore we can apply SMO techniques.

Making this circle of ideas precise requires a number of tools from convex analysis. In particular, Section 3 defines appropriate approximate optimality conditions for the SKM in terms of subdifferentials and approximate subdifferentials. These conditions are then used in Section 4 in the design of an MY regularization for the SKM and an SMO-based algorithm. We present the results of numerical experiments with the new method in Section 5.

¹ It is worth noting that this dual problem cannot be obtained directly as the Lagrangian dual of the QCQP problem—Lagrangian duals of QCQPs are semidefinite programming problems.
2. Learning the kernel matrix

In this section, we (1) begin with a brief review of the multiple kernel learning problem of Lanckriet et al. (2004), (2) introduce the support kernel machine (SKM), and (3) show that the dual of the SKM is equivalent to the multiple kernel learning primal.
2.1. Multiple kernel learning problem

In the multiple kernel learning problem, we assume that we are given $n$ data points $(x_i, y_i)$, where $x_i \in \mathcal{X}$ for some input space $\mathcal{X}$, and where $y_i \in \{-1, 1\}$. We also assume that we are given $m$ matrices $K_j \in \mathbb{R}^{n \times n}$, which are assumed to be symmetric positive semidefinite (and might or might not be obtained from evaluating a kernel function on the data $\{x_i\}$). We consider the problem of learning the best linear combination $\sum_{j=1}^m \eta_j K_j$ of the kernels $K_j$ with nonnegative coefficients $\eta_j \geq 0$ and with a trace constraint $\mathrm{tr} \sum_{j=1}^m \eta_j K_j = \sum_{j=1}^m \eta_j\, \mathrm{tr}\, K_j = c$, where $c > 0$ is fixed. Lanckriet et al. (2004) show that this setup yields the following optimization problem:

$$\begin{aligned}
\min \quad & \zeta - 2 e^\top \alpha \\
(L)\ \ \text{w.r.t.} \quad & \zeta \in \mathbb{R},\ \alpha \in \mathbb{R}^n \\
\text{s.t.} \quad & 0 \leq \alpha \leq C,\quad \alpha^\top y = 0 \\
& \alpha^\top D(y) K_j D(y) \alpha \leq \frac{\mathrm{tr}\, K_j}{c}\, \zeta,\quad \forall j \in \{1, \ldots, m\},
\end{aligned}$$

where $D(y)$ is the diagonal matrix with diagonal $y$, $e \in \mathbb{R}^n$ the vector of all ones, and $C$ a positive constant. The coefficients $\eta_j$ are recovered as Lagrange multipliers for the constraints $\alpha^\top D(y) K_j D(y) \alpha \leq \frac{\mathrm{tr}\, K_j}{c}\, \zeta$.
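For concreteness, the QCQP $(L)$ can be prototyped directly with an off-the-shelf modeling tool. The sketch below uses CVXPY with randomly generated Gram matrices; this is our own illustration under stated assumptions, not the paper's implementation (the paper compares against general-purpose interior point solvers such as Mosek and its own SMO-based method), and all data and parameter values are hypothetical.

```python
import numpy as np
import cvxpy as cp

# Hypothetical toy data (for illustration only): n points, m random PSD Gram matrices.
n, m, C, c = 40, 3, 1.0, 40.0
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=n)
K_list = []
for _ in range(m):
    A = rng.standard_normal((n, n))
    K_list.append(A @ A.T + 1e-6 * np.eye(n))   # symmetric positive definite stand-in

D = np.diag(y)
alpha = cp.Variable(n)
zeta = cp.Variable()

constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
for K in K_list:
    G = np.linalg.cholesky(K)                    # K = G G^T
    # alpha^T D(y) K_j D(y) alpha = ||G^T D(y) alpha||^2 <= (tr K_j / c) * zeta
    constraints.append(cp.sum_squares(G.T @ (D @ alpha)) <= (np.trace(K) / c) * zeta)

prob = cp.Problem(cp.Minimize(zeta - 2 * cp.sum(alpha)), constraints)
prob.solve()

# The kernel weights eta_j are recovered as Lagrange multipliers (dual variables)
# of the m quadratic constraints, as stated above.
eta = [float(constraints[3 + j].dual_value) for j in range(m)]
print(prob.value, eta)
```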
2.2. Support kernel machine

We now introduce a novel classification algorithm that we refer to as the “support kernel machine” (SKM). It will be motivated as a block-based variant of the SVM and related margin-based classification algorithms. But our underlying motivation is the fact that the dual of the SKM is exactly the problem $(L)$. We establish this equivalence in the following section.
2.2.1. Linear classification

In this section we let $\mathcal{X} = \mathbb{R}^k$. We also assume we are given a decomposition of $\mathbb{R}^k$ as a product of $m$ blocks: $\mathbb{R}^k = \mathbb{R}^{k_1} \times \cdots \times \mathbb{R}^{k_m}$, so that each data point $x_i$ can be decomposed into $m$ block components, i.e. $x_i = (x_{1i}, \ldots, x_{mi})$, where each $x_{ji}$ is in general a vector. The goal is to find a linear classifier of the form $y = \mathrm{sign}(w^\top x + b)$ where $w$ has the same block decomposition $w = (w_1, \ldots, w_m) \in \mathbb{R}^{k_1 + \cdots + k_m}$. In the spirit of the soft margin SVM, we achieve this by minimizing a linear combination of the inverse of the margin and the training error. Various norms can be used to combine the two terms, and indeed many different algorithms have been explored for various combinations of $\ell_1$-norms and $\ell_2$-norms. In this paper, our goal is to encourage the sparsity of the vector $w$ at the level of blocks; in particular, we want most of its (multivariate) components $w_i$ to be zero. A natural way to achieve this is to penalize the $\ell_1$-norm of $w$. Since $w$ is defined by blocks, we minimize the square of a weighted block $\ell_1$-norm, $\big(\sum_{j=1}^m d_j \|w_j\|_2\big)^2$, where within every block, an $\ell_2$-norm is used. Note that a standard $\ell_2$-based SVM is obtained if we minimize the square of a block $\ell_2$-norm, $\sum_{j=1}^m \|w_j\|_2^2$, which corresponds to $\|w\|_2^2$, i.e., ignoring the block structure. On the other hand, if $m = k$ and $d_j = 1$, we minimize the square of the $\ell_1$-norm of $w$, which is very similar to the LP-SVM proposed by Bradley and Mangasarian (1998). The primal problem for the SKM is thus:

$$\begin{aligned}
\min \quad & \tfrac{1}{2}\Big(\textstyle\sum_{j=1}^m d_j \|w_j\|_2\Big)^2 + C \textstyle\sum_{i=1}^n \xi_i \\
(P)\ \ \text{w.r.t.} \quad & w \in \mathbb{R}^{k_1} \times \cdots \times \mathbb{R}^{k_m},\ \xi \in \mathbb{R}^n_+,\ b \in \mathbb{R} \\
\text{s.t.} \quad & y_i\Big(\textstyle\sum_j w_j^\top x_{ji} + b\Big) \geq 1 - \xi_i,\quad \forall i \in \{1, \ldots, n\}.
\end{aligned}$$
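As an illustration of the block-$\ell_1$ primal, here is a minimal CVXPY sketch of $(P)$; again this is our own prototyping choice rather than the paper's code. The squared weighted block $\ell_1$-norm is handled through an epigraph variable $t$, and the data, block sizes, and weights $d_j$ are hypothetical.

```python
import numpy as np
import cvxpy as cp

# Hypothetical toy setup: n points whose features are split into m blocks of the given
# sizes, with unit block weights d_j = 1 (assumptions for illustration only).
n, sizes, C = 30, [2, 3, 4], 1.0
rng = np.random.default_rng(1)
X_blocks = [rng.standard_normal((n, k_j)) for k_j in sizes]
y = rng.choice([-1.0, 1.0], size=n)
d = np.ones(len(sizes))

w = [cp.Variable(k_j) for k_j in sizes]
xi = cp.Variable(n, nonneg=True)
b = cp.Variable()
t = cp.Variable(nonneg=True)  # epigraph variable for the weighted block l1-norm

margins = sum(Xj @ wj for Xj, wj in zip(X_blocks, w)) + b
constraints = [
    cp.multiply(y, margins) >= 1 - xi,                                  # hinge constraints
    sum(d[j] * cp.norm(w[j], 2) for j in range(len(sizes))) <= t,       # sum_j d_j ||w_j||_2 <= t
]
prob = cp.Problem(cp.Minimize(0.5 * cp.square(t) + C * cp.sum(xi)), constraints)
prob.solve()

# Block sparsity: blocks whose norm is (numerically) zero are inactive.
print([float(np.linalg.norm(wj.value)) for wj in w])
```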
2.2.2. Conic duality and optimality conditions

For a given optimization problem there are many ways of deriving a dual problem. In our particular case, we treat problem $(P)$ as a second-order cone program (SOCP) (Lobo et al., 1998), which yields the following dual (see Appendix A for the derivation):

$$\begin{aligned}
\min \quad & \tfrac{1}{2}\gamma^2 - \alpha^\top e \\
(D)\ \ \text{w.r.t.} \quad & \gamma \in \mathbb{R},\ \alpha \in \mathbb{R}^n \\
\text{s.t.} \quad & 0 \leq \alpha \leq C,\quad \alpha^\top y = 0 \\
& \Big\|\textstyle\sum_i \alpha_i y_i x_{ji}\Big\|_2 \leq d_j \gamma,\quad \forall j \in \{1, \ldots, m\}.
\end{aligned}$$

In addition, the Karush-Kuhn-Tucker (KKT) optimality conditions give the following complementary slackness equations:

$$\begin{aligned}
&\text{(a)}\quad \alpha_i \Big(y_i \big(\textstyle\sum_j w_j^\top x_{ji} + b\big) - 1 + \xi_i\Big) = 0,\ \forall i \\
&\text{(b)}\quad (C - \alpha_i)\,\xi_i = 0,\ \forall i \\
&\text{(c)}\quad \begin{pmatrix} w_j \\ \|w_j\|_2 \end{pmatrix}^{\!\top} \begin{pmatrix} -\sum_i \alpha_i y_i x_{ji} \\ d_j \gamma \end{pmatrix} = 0,\ \forall j \\
&\text{(d)}\quad \gamma\Big(\textstyle\sum_j d_j t_j - \gamma\Big) = 0,
\end{aligned}$$

where the $t_j$ are the auxiliary variables of the SOCP reformulation of $(P)$, equal to $\|w_j\|_2$ at the optimum.
Equations (a) and (b) are the same as in the classical SVM, where they define the notion of a “support vector.” That is, at the optimum, we can divide the data points into three disjoint sets: $I_0 = \{i : \alpha_i = 0\}$, $I_M = \{i : \alpha_i \in (0, C)\}$, and $I_C = \{i : \alpha_i = C\}$, such that points belonging to $I_0$ are correctly classified points not on the margin and such that $\xi_i = 0$; points in $I_M$ are correctly classified points on the margin such that $\xi_i = 0$ and $y_i(\sum_j w_j^\top x_{ji} + b) = 1$; and points in $I_C$ are points on the “wrong” side of the margin for which $\xi_i \geq 0$ (incorrectly classified if $\xi_i > 1$) and $y_i(\sum_j w_j^\top x_{ji} + b) = 1 - \xi_i$. The points whose indices $i$ are in $I_M$ or $I_C$ are the support vectors.

[Figure 1. Orthogonality of elements of the second-order cone $\mathcal{K}_2 = \{w = (u, v),\ u \in \mathbb{R}^2,\ v \in \mathbb{R},\ \|u\|_2 \leq v\}$: two elements $w, w'$ of $\mathcal{K}_2$ are orthogonal and nonzero if and only if they belong to the boundary and are anti-proportional.]

While the KKT conditions (a) and (b) refer to the index $i$ over data points, the KKT conditions (c) and (d) refer to the index $j$ over components of the input vector. These conditions thus imply a form of sparsity not over data points but over “input dimensions.” Indeed, two non-zero elements $(u, v)$ and $(u', v')$ of a second-order cone $\mathcal{K}_d = \{(u, v) \in \mathbb{R}^d \times \mathbb{R},\ \|u\|_2 \leq v\}$ are orthogonal if and only if they both belong to the boundary, and they are “anti-proportional” (Lobo et al., 1998); that is, $\exists \eta > 0$ such that $\|u\|_2 = v$, $\|u'\|_2 = v'$, and $(u, v) = \eta\,(-u', v')$ (see Figure 1).

Thus, if $\gamma > 0$, we have:

if $\|\sum_i \alpha_i y_i x_{ji}\|_2 < d_j \gamma$, then $w_j = 0$;
if $\|\sum_i \alpha_i y_i x_{ji}\|_2 = d_j \gamma$, then $\exists \eta_j \geq 0$ such that $w_j = \eta_j \sum_i \alpha_i y_i x_{ji}$ and $\|w_j\|_2 = \eta_j d_j \gamma$.

Sparsity thus emerges from the optimization problem. Let $\mathcal{J}$ denote the set of active dimensions, i.e., $\mathcal{J}(\alpha, \gamma) = \{j : \|\sum_i \alpha_i y_i x_{ji}\|_2 = d_j \gamma\}$. We can rewrite the optimality conditions as

$$\forall j,\quad w_j = \eta_j \sum_i \alpha_i y_i x_{ji}, \quad \text{with } \eta_j = 0 \text{ if } j \notin \mathcal{J}.$$

Equation (d) implies that $\gamma = \sum_j d_j \|w_j\|_2 = \sum_j d_j (\eta_j d_j \gamma)$, which in turn implies $\sum_{j \in \mathcal{J}} d_j^2 \eta_j = 1$.
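The case analysis above translates directly into a check for which blocks can carry nonzero weight. The following numpy helper is a sketch under our own naming conventions (X_blocks, d, and tol are assumptions, not the paper's notation):

```python
import numpy as np

def active_blocks(alpha, gamma, X_blocks, y, d, tol=1e-6):
    """Identify the active set J(alpha, gamma) = {j : ||sum_i alpha_i y_i x_ji||_2 = d_j gamma}.

    Hypothetical helper: X_blocks[j] is the (n, k_j) matrix whose rows are the j-th block
    components x_ji, and d[j] are the block weights.
    """
    active = []
    for j, Xj in enumerate(X_blocks):
        norm_j = np.linalg.norm(Xj.T @ (alpha * y))   # ||sum_i alpha_i y_i x_ji||_2
        if norm_j >= d[j] * gamma - tol:              # boundary case: block j may be nonzero
            active.append(j)
    return active
```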
2.2.3. Kernelization

We now remove the assumption that $\mathcal{X}$ is a Euclidean space, and consider embeddings of the data points $x_i$ in a Euclidean space via a mapping $\phi : \mathcal{X} \rightarrow \mathbb{R}^f$. In correspondence with our block-based formulation of the classification problem, we assume that $\phi(x)$ has $m$ distinct block components $\phi(x) = (\phi_1(x), \ldots, \phi_m(x))$. Following the usual recipe for kernel methods, we assume that this embedding is performed implicitly, by specifying the inner product in $\mathbb{R}^f$ using a kernel function, which in this case is the sum of individual kernel functions on each block component:

$$k(x_i, x_j) = \phi(x_i)^\top \phi(x_j) = \sum_{s=1}^m \phi_s(x_i)^\top \phi_s(x_j) = \sum_{s=1}^m k_s(x_i, x_j).$$

We now “kernelize” the problem $(P)$ using this kernel function. In particular, we consider the dual of $(P)$ and substitute the kernel function for the inner products in $(D)$:

$$\begin{aligned}
\min \quad & \tfrac{1}{2}\gamma^2 - e^\top \alpha \\
(D_K)\ \ \text{w.r.t.} \quad & \gamma \in \mathbb{R},\ \alpha \in \mathbb{R}^n \\
\text{s.t.} \quad & 0 \leq \alpha \leq C,\quad \alpha^\top y = 0 \\
& \big(\alpha^\top D(y) K_j D(y) \alpha\big)^{1/2} \leq \gamma d_j,\quad \forall j,
\end{aligned}$$

where $K_j$ is the $j$-th Gram matrix of the points $\{x_i\}$ corresponding to the $j$-th kernel.

The sparsity that emerges via the KKT conditions (c) and (d) now refers to the kernels $K_j$, and we refer to the kernels with nonzero $\eta_j$ as “support kernels.” The resulting classifier has the same form as the SVM classifier, but is based on the kernel matrix combination $K = \sum_j \eta_j K_j$, which is a sparse combination of “support kernels.”
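Once $\alpha$, $b$, and the weights $\eta$ are available, the resulting classifier is simply the SVM decision function under the combined kernel $K = \sum_j \eta_j K_j$. A minimal numpy sketch, with hypothetical argument names of our own choosing:

```python
import numpy as np

def skm_decision_function(alpha, b, eta, y, K_test_list):
    """Evaluate the SKM classifier on test points (illustrative sketch, not the paper's code).

    K_test_list[j] is the (n_test, n) cross Gram matrix between test and training points for
    the j-th kernel, and eta[j] are the learned kernel weights, so the classifier is the SVM
    decision function under K = sum_j eta_j K_j.
    """
    K_test = sum(e * Kt for e, Kt in zip(eta, K_test_list))  # (n_test, n) combined kernel
    return np.sign(K_test @ (alpha * y) + b)
```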
2.3. Equivalence of the two formulations

By simply taking $d_j = \sqrt{\mathrm{tr}\, K_j / c}$, we see that problems $(D_K)$ and $(L)$ are indeed equivalent—thus the dual of the SKM is the multiple kernel learning primal. Care must be taken here though—the weights $\eta_j$ are defined for $(L)$ as Lagrange multipliers and for $(D_K)$ through the anti-proportionality of orthogonal elements of a second-order cone, and a priori they might not coincide: although $(D_K)$ and $(L)$ are equivalent, their dual problems have different formulations. It is straightforward, however, to write the KKT optimality conditions for $(\alpha, \eta)$ for both problems and verify that they are indeed equivalent. One direct consequence is that for an optimal pair $(\alpha, \eta)$, $\alpha$ is an optimal solution of the SVM with kernel matrix $\sum_j \eta_j K_j$.
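For reference, the weight choice that makes the two problems coincide is a one-liner; the Gram matrices and trace level below are random stand-ins used only for illustration:

```python
import numpy as np

# Illustrative only: random PSD Gram matrices standing in for real kernels, trace level c.
rng = np.random.default_rng(2)
K_list = []
for _ in range(3):
    A = rng.standard_normal((5, 5))
    K_list.append(A @ A.T)
c = 5.0

# The weight choice under which (D_K) coincides with (L): d_j = sqrt(tr K_j / c).
d = np.array([np.sqrt(np.trace(K) / c) for K in K_list])
print(d)
```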
3. Optimality conditions

In this section, we formulate our problem (in either of its two equivalent forms) as the minimization of a non-differentiable convex function subject to linear constraints. Exact and approximate optimality conditions are then readily derived using subdifferentials. In later sections we will show how these conditions lead to an MY-regularized algorithmic formulation that will be amenable to SMO techniques.
3.1. Max-function formulation

A rearrangement of the problem $(D_K)$ yields an equivalent formulation in which the quadratic constraints are moved into the objective function:

$$\begin{aligned}
\min \quad & \max_j \Big\{ \tfrac{1}{2 d_j^2}\, \alpha^\top D(y) K_j D(y) \alpha - \alpha^\top e \Big\} \\
(S)\ \ \text{w.r.t.} \quad & \alpha \in \mathbb{R}^n \\
\text{s.t.} \quad & 0 \leq \alpha \leq C,\quad \alpha^\top y = 0.
\end{aligned}$$

We let $J_j(\alpha)$ denote $\tfrac{1}{2 d_j^2}\, \alpha^\top D(y) K_j D(y) \alpha - \alpha^\top e$ and $J(\alpha) = \max_j J_j(\alpha)$. Problem $(S)$ is the minimization of the non-differentiable convex function $J(\alpha)$ subject to linear constraints. Let $\mathcal{J}(\alpha)$ be the set of active kernels, i.e., the set of indices $j$ such that $J_j(\alpha) = J(\alpha)$. We let $F_j(\alpha) \in \mathbb{R}^n$ denote the gradient of $J_j$, that is, $F_j = \frac{\partial J_j}{\partial \alpha} = \frac{1}{d_j^2} D(y) K_j D(y) \alpha - e$.
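The quantities $J_j(\alpha)$, $J(\alpha)$, the active set, and the gradients $F_j(\alpha)$ are cheap to evaluate. A small numpy sketch (our own helper, with assumed argument names):

```python
import numpy as np

def max_function_pieces(alpha, y, K_list, d, eps=0.0):
    """Compute J_j(alpha), J(alpha) = max_j J_j(alpha), the (eps-)active set, and the
    gradients F_j(alpha) of the max-function formulation (S).

    Illustrative helper (not the paper's code); K_list[j] is the j-th Gram matrix and
    d[j] the corresponding weight.
    """
    Dy_alpha = y * alpha                      # D(y) alpha
    J_vals, grads = [], []
    for Kj, dj in zip(K_list, d):
        KDa = Kj @ Dy_alpha
        J_vals.append(0.5 / dj**2 * Dy_alpha @ KDa - alpha.sum())   # J_j(alpha)
        grads.append(y * KDa / dj**2 - 1.0)                         # F_j(alpha)
    J_vals = np.array(J_vals)
    J = J_vals.max()
    active = np.flatnonzero(J_vals >= J - eps)  # eps = 0 gives the exact active set
    return J, J_vals, active, grads
```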
3.2. Optimality conditions and subdifferential

Given any function $J(\alpha)$, the subdifferential of $J$ at $\alpha$, $\partial J(\alpha)$, is defined as (Bertsekas, 1995):

$$\partial J(\alpha) = \{g \in \mathbb{R}^n : \forall \alpha',\ J(\alpha') \geq J(\alpha) + g^\top (\alpha' - \alpha)\}.$$

Elements of the subdifferential $\partial J(\alpha)$ are called subgradients. When $J$ is convex and differentiable at $\alpha$, then the subdifferential is a singleton and reduces to the gradient. The notion of subdifferential is especially useful for characterizing optimality conditions of non-smooth problems (Bertsekas, 1995).

The function $J(\alpha)$ defined in the earlier section is a pointwise maximum of convex differentiable functions, and using subgradient calculus we can easily see that the subdifferential $\partial J(\alpha)$ of $J$ at $\alpha$ is equal to the convex hull of the gradients $F_j$ of $J_j$ for the active kernels. That is:

$$\partial J(\alpha) = \mathrm{convex\ hull}\{F_j(\alpha),\ j \in \mathcal{J}(\alpha)\}.$$

The Lagrangian for $(S)$ is equal to $\mathcal{L}(\alpha) = J(\alpha) - \delta^\top \alpha + \xi^\top(\alpha - Ce) + b\,\alpha^\top y$, where $b \in \mathbb{R}$ and $\xi, \delta \in \mathbb{R}^n_+$, and the global minimum of $\mathcal{L}(\alpha, \delta, \xi, b)$ with respect to $\alpha$ is characterized by the equation

$$0 \in \partial \mathcal{L}(\alpha) = \partial J(\alpha) - \delta + \xi + b y.$$

The optimality conditions are thus the following: $\alpha$ and $(b, \delta, \xi)$ form a pair of optimal primal/dual variables if and only if:

$$(OPT_0)\qquad
\begin{cases}
\delta - \xi - b y \in \partial J(\alpha) \\
\forall i,\ \delta_i \alpha_i = 0,\ \xi_i (C - \alpha_i) = 0 \\
\alpha^\top y = 0,\quad 0 \leq \alpha \leq C.
\end{cases}$$
As before, we define $I_M(\alpha) = \{i : 0 < \alpha_i < C\}$, $I_0(\alpha) = \{i : \alpha_i = 0\}$, and $I_C(\alpha) = \{i : \alpha_i = C\}$. We also define, following Keerthi et al. (2001), $I_{0+} = I_0 \cap \{i : y_i = 1\}$, $I_{0-} = I_0 \cap \{i : y_i = -1\}$, $I_{C+} = I_C \cap \{i : y_i = 1\}$, and $I_{C-} = I_C \cap \{i : y_i = -1\}$. We can then rewrite the optimality conditions as

$$(OPT_1)\qquad
\begin{cases}
\nu - b e = D(y) \displaystyle\sum_{j \in \mathcal{J}(\alpha)} d_j^2 \eta_j F_j(\alpha) \\
\eta \geq 0,\quad \sum_j d_j^2 \eta_j = 1 \\
\forall i \in I_M \cup I_{0+} \cup I_{C-},\ \nu_i \geq 0 \\
\forall i \in I_M \cup I_{0-} \cup I_{C+},\ \nu_i \leq 0.
\end{cases}$$
3.3. Approximate optimality conditions

Exact optimality conditions such as $(OPT_0)$ or $(OPT_1)$ are generally not suitable for numerical optimization. In non-smooth optimization theory, one instead formulates optimality criteria in terms of the $\varepsilon$-subdifferential, which is defined as

$$\partial_\varepsilon J(\alpha) = \{g \in \mathbb{R}^n : \forall \alpha',\ J(\alpha') \geq J(\alpha) - \varepsilon + g^\top(\alpha' - \alpha)\}.$$

When $J(\alpha) = \max_j J_j(\alpha)$, then the $\varepsilon$-subdifferential contains (potentially strictly) the convex hull of the gradients $F_j(\alpha)$, for all $\varepsilon$-active functions, i.e., for all $j$ such that $\max_i J_i(\alpha) - \varepsilon \leq J_j(\alpha)$. We let $\mathcal{J}_\varepsilon(\alpha)$ denote the set of all such kernels. So, we have $C_\varepsilon(\alpha) = \mathrm{convex\ hull}\{F_j(\alpha),\ j \in \mathcal{J}_\varepsilon(\alpha)\} \subset \partial_\varepsilon J(\alpha)$.

Our stopping criterion, referred to as $(\varepsilon_1, \varepsilon_2)$-optimality, requires that the $\varepsilon_1$-subdifferential is within $\varepsilon_2$ of zero, and that the usual KKT conditions are met. That is, we stop whenever there exist $\nu, b, g$ such that

$$(OPT_2)\qquad
\begin{cases}
g \in \partial_{\varepsilon_1} J(\alpha) \\
\forall i \in I_M \cup I_{0+} \cup I_{C-},\ \nu_i \geq 0 \\
\forall i \in I_M \cup I_{0-} \cup I_{C+},\ \nu_i \leq 0 \\
\|\nu - b e - D(y) g\| \leq \varepsilon_2.
\end{cases}$$

Note that for one kernel, i.e., when the SKM reduces to the SVM, this corresponds to the approximate KKT conditions usually employed for the standard SVM (Platt, 1998; Keerthi et al., 2001; Joachims, 1998). For a given $\alpha$, checking optimality is hard, since even computing $\partial_{\varepsilon_1} J(\alpha)$ is hard in closed form. However, a sufficient condition for optimality can be obtained by using the inner approximation $C_{\varepsilon_1}(\alpha)$ of this $\varepsilon_1$-subdifferential, i.e., the convex hull of gradients of $\varepsilon_1$-active kernels. Checking this sufficient condition is a linear programming (LP) existence problem, i.e., find $\eta$ such that:

$$(OPT_3)\qquad
\begin{cases}
\eta \geq 0,\quad \eta_j = 0 \ \text{if}\ j \notin \mathcal{J}_{\varepsilon_1}(\alpha),\quad \sum_j d_j^2 \eta_j = 1 \\
\displaystyle\max_{i \in I_M \cup I_{0-} \cup I_{C+}} \big\{(K(\eta) D(y)\alpha)_i - y_i\big\} \ \leq\ \min_{i \in I_M \cup I_{0+} \cup I_{C-}} \big\{(K(\eta) D(y)\alpha)_i - y_i\big\} + 2\varepsilon_2,
\end{cases}$$

where $K(\eta) = \sum_{j \in \mathcal{J}_{\varepsilon_1}(\alpha)} \eta_j K_j$. Given $\alpha$, we can determine whether it is $(\varepsilon_1, \varepsilon_2)$-optimal by solving the potentially large LP $(OPT_3)$. If in addition to having $\alpha$ we know a potential candidate for $\eta$, then a sufficient condition for optimality is that this $\eta$ verifies $(OPT_3)$, which does not require solving the LP. Indeed, the iterative algorithm that we present in Section 4 outputs a pair $(\alpha, \eta)$, and only these sufficient optimality conditions need to be checked.
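As a concrete illustration of the LP existence check, the sketch below uses scipy.optimize.linprog (an implementation choice of ours, not the paper's). It encodes the max/min condition of $(OPT_3)$ through an auxiliary level variable $b$, which is an equivalent linear reformulation; the helper names and data layout are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def check_opt3(alpha, y, K_list, d, I_M, I_0p, I_0m, I_Cp, I_Cm, active, eps2):
    """LP feasibility check for the sufficient (eps1, eps2)-optimality condition (OPT3).

    Illustrative sketch: `active` lists the eps1-active kernel indices J_eps1(alpha), and the
    index sets are integer arrays. The condition max_A <= min_B + 2*eps2 is rewritten with an
    auxiliary level b as: f_i <= b + eps2 on A and f_i >= b - eps2 on B, with
    f_i = (K(eta) D(y) alpha)_i - y_i.
    """
    A_idx = np.concatenate([I_M, I_0m, I_Cp]).astype(int)   # indices where the max is taken
    B_idx = np.concatenate([I_M, I_0p, I_Cm]).astype(int)   # indices where the min is taken

    Dy_alpha = y * alpha
    cols = np.column_stack([K_list[j] @ Dy_alpha for j in active])  # cols[i, j] = (K_j D(y) alpha)_i

    m_act = len(active)
    nvar = m_act + 1                       # variables: (eta over active kernels, b)
    A_ub, b_ub = [], []
    for i in A_idx:                        # sum_j eta_j cols[i, j] - b <= y_i + eps2
        A_ub.append(np.append(cols[i], -1.0)); b_ub.append(y[i] + eps2)
    for i in B_idx:                        # -sum_j eta_j cols[i, j] + b <= -y_i + eps2
        A_ub.append(np.append(-cols[i], 1.0)); b_ub.append(-y[i] + eps2)
    A_eq = [np.append(np.array([d[j] ** 2 for j in active]), 0.0)]  # sum_j d_j^2 eta_j = 1
    b_eq = [1.0]
    bounds = [(0, None)] * m_act + [(None, None)]

    res = linprog(np.zeros(nvar), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=bounds, method="highs")
    feasible = (res.status == 0)
    return feasible, (res.x[:m_act] if feasible else None)
```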
3.4. Improving sparsity

Once we have an approximate solution, i.e., values $\alpha$ and $\eta$ that satisfy $(OPT_3)$, we can ask whether $\eta$ can be made sparser. Indeed, if some of the kernels are close to identical, then some of the $\eta$'s can potentially be removed—for a general SVM, the optimal $\alpha$ is not unique if data points coincide, and for a general SKM, the optimal $\alpha$ and $\eta$ are not unique if data points or kernels coincide. When searching for the minimum $\ell_0$-norm $\eta$ which satisfies the constraints $(OPT_3)$, we can thus consider a simple heuristic approach where we loop through all the nonzero $\eta_j$ and check whether each such component can be removed. That is, for all $j \in \mathcal{J}_{\varepsilon_1}(\alpha)$, we force $\eta_j$ to zero and solve the LP. If it is feasible, then the $j$-th kernel can be removed.
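The pruning heuristic is a straightforward loop around any such LP feasibility check; a sketch with a generic callable (names are ours, not the paper's):

```python
def prune_kernels(active, lp_feasible):
    """Greedy sparsification heuristic of Section 3.4 (illustrative, assumptions only):
    try to drop each currently active kernel in turn and keep the drop if the (OPT3) LP
    remains feasible. `lp_feasible(subset)` is any callable that solves the LP restricted
    to `subset`, e.g. a thin wrapper around the check_opt3 sketch above.
    """
    kept = list(active)
    for j in list(kept):
        trial = [k for k in kept if k != j]   # force eta_j = 0
        if trial and lp_feasible(trial):      # still feasible: kernel j can be removed
            kept = trial
    return kept
```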
4. Regularized support kernel machine

The function $J(\alpha)$ is convex but not differentiable. It is well known that in this situation, steepest descent and coordinate descent methods do not necessarily converge to the global optimum (Bertsekas, 1995). SMO unfortunately falls into this class of methods. Therefore, in order to develop an SMO-like algorithm for the SKM, we make use of Moreau-Yosida regularization. In our specific case, this simply involves adding a second regularization term to the objective function of the SKM, as follows:

$$\begin{aligned}
\min \quad & \tfrac{1}{2}\Big(\textstyle\sum_j d_j \|w_j\|_2\Big)^2 + \tfrac{1}{2} \textstyle\sum_j a_j^2 \|w_j\|_2^2 + C \textstyle\sum_i \xi_i \\
(R)\ \ \text{w.r.t.} \quad & w \in \mathbb{R}^{k_1} \times \cdots \times \mathbb{R}^{k_m},\ \xi \in \mathbb{R}^n_+,\ b \in \mathbb{R} \\
\text{s.t.} \quad & y_i\Big(\textstyle\sum_j w_j^\top x_{ji} + b\Big) \geq 1 - \xi_i,\quad \forall i \in \{1, \ldots, n\},
\end{aligned}$$

where $(a_j)$ are the MY-regularization parameters.
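Relative to the SKM primal $(P)$, the only change in $(R)$ is the extra quadratic term per block. A small numpy helper that evaluates the regularized objective for given primal variables (illustrative, with assumed argument names):

```python
import numpy as np

def my_regularized_primal_objective(w_blocks, xi, d, a, C):
    """Objective of the MY-regularized primal (R) for given block weights w_j, slacks xi,
    block weights d_j, and MY parameters a_j (illustrative helper, assumptions only)."""
    block_norms = np.array([np.linalg.norm(wj) for wj in w_blocks])
    return (0.5 * (d @ block_norms) ** 2            # squared weighted block l1-norm
            + 0.5 * np.sum(a ** 2 * block_norms ** 2)  # added MY-regularization term
            + C * np.sum(xi))                        # hinge-loss slack penalty
```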

References

Bertsekas, D. P. (1995). Nonlinear Programming. Athena Scientific.

Brent, R. P. (1973). Algorithms for Minimization Without Derivatives. Prentice-Hall.

Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27-72.

Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning. MIT Press.