Localized Multiple Kernel Learning
Mehmet Gönen gonen@boun.edu.tr
Ethem Alpaydın alpaydin@boun.edu.tr
Department of Computer Engineering, Boğaziçi University, TR-34342, Bebek, İstanbul, Turkey
Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).
Abstract
Recently, instead of selecting a single kernel,
multiple kernel learning (MKL) has been pro-
posed which uses a convex combination of
kernels, where the weight of each kernel is
optimized during training. However, MKL
assigns the same weight to a kernel over the
whole input space. In this paper, we develop
a localized multiple kernel learning (LMKL)
algorithm using a gating model for select-
ing the appropriate kernel function locally.
The localizing gating model and the kernel-
based classifier are coupled and their opti-
mization is done in a joint manner. Empiri-
cal results on ten benchmark and two bioin-
formatics data sets validate the applicability
of our approach. LMKL achieves statistically
similar accuracy results compared with MKL
by storing fewer support vectors. LMKL can
also combine multiple copies of the same ker-
nel function localized in different parts. For
example, LMKL with multiple linear kernels
gives better accuracy results than using a sin-
gle linear kernel on bioinformatics data sets.
1. Introduction
Kernel-based methods such as the support vector ma-
chine (SVM) gained much popularity due to their suc-
cess. For classification tasks, the basic idea is to map
the training instances from the input space to a feature
space (generally a higher dimensional space than the
input space) where they are linearly separable. The
SVM discriminant function obtained after training is:
f(x) = \langle w, \Phi(x) \rangle + b \quad (1)
where w is the vector of weight coefficients, b is the threshold, and \Phi(x) is the mapping function to the corresponding feature space. We do not need to define the mapping function explicitly; if we plug the w vector obtained from the dual formulation into (1), we obtain the discriminant:
f(x) = \sum_{i=1}^{n} \alpha_i y_i \underbrace{\langle \Phi(x), \Phi(x_i) \rangle}_{K(x, x_i)} + b
where n is the number of training instances x_i and K(x, x_i) = \langle \Phi(x), \Phi(x_i) \rangle is the corresponding kernel function.
Each mapping \Phi(x) has its own characteristics: it corresponds to a different kernel function and leads to a different discriminant function in the original space. Selecting the kernel function (i.e., selecting the mapping function) is an important step in SVM training and is generally performed using cross-validation.
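To make the dual-form discriminant concrete, here is a minimal sketch (not from the paper; the names svm_discriminant, alpha, and linear_kernel are illustrative) that evaluates f(x) for a test point given already-trained support vector coefficients:

import numpy as np

def linear_kernel(x, z):
    # K(x, z) = <x, z>, the kernel induced by the identity mapping
    return float(np.dot(x, z))

def svm_discriminant(x, X_train, y_train, alpha, b, kernel=linear_kernel):
    # f(x) = sum_i alpha_i y_i K(x, x_i) + b; only support vectors (alpha_i > 0) contribute
    return sum(a * yi * kernel(x, xi)
               for a, yi, xi in zip(alpha, y_train, X_train)) + b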
In recent studies (Lanckriet et al., 2004a; Sonnenburg
et al., 2006), it is reported that using multiple different
kernels instead of a single kernel improves the classifi-
cation performance. The simplest way is to use an un-
weighted sum of kernel functions (Pavlidis et al., 2001;
Moguerza et al., 2004). Using an unweighted sum gives
equal preference to all kernels and this may not be
ideal. A better strategy is to learn a weighted sum
(e.g., convex combination); this also allows extract-
ing information from the weights assigned to kernels.
Lanckriet et al. (2004b) formulate this as a semidef-
inite programming problem which allows finding the
combination weights and support vector coefficients
together. Bach et al. (2004) reformulate the prob-
lem and propose an efficient algorithm using sequen-
tial minimal optimization (SMO). Their discriminant
function can be seen as an unweighted summation of
discriminant values (but a weighted summation of ker-
nel functions) in different feature spaces:
f(x) = \sum_{m=1}^{p} \langle w_m, \Phi_m(x) \rangle + b \quad (2)
where m indexes kernels, w_m is the vector of weight coefficients, \Phi_m(x) is the mapping function for feature space m, and p is the number of kernels. By plugging w_m derived from the duality conditions into (2), we obtain:
f(x) = \sum_{m=1}^{p} \eta_m \sum_{i=1}^{n} \alpha_i y_i \underbrace{\langle \Phi_m(x), \Phi_m(x_i) \rangle}_{K_m(x, x_i)} + b \quad (3)
where the kernel weights satisfy \eta_m \geq 0 and \sum_{m=1}^{p} \eta_m = 1. The kernels we combine can be the same kernel with different hyperparameters (e.g., the degree in the polynomial kernel) or different kernels (e.g., linear, polynomial, and Gaussian kernels). We can also combine kernels over different data representations or different feature subsets.
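As a small illustration of such a fixed-weight combination, the following sketch (illustrative names; not part of the paper) builds the combined kernel matrix \sum_m \eta_m K_m from precomputed kernel matrices:

import numpy as np

def combine_kernels_fixed(kernel_matrices, eta):
    # Fixed-weight MKL combination: K = sum_m eta_m K_m with eta_m >= 0 and sum_m eta_m = 1,
    # i.e., the same weight eta_m is used everywhere in the input space.
    eta = np.asarray(eta, dtype=float)
    assert np.all(eta >= 0.0) and np.isclose(eta.sum(), 1.0)
    return sum(w * K for w, K in zip(eta, kernel_matrices))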
Using a fixed combination rule (unweighted or
weighted) assigns the same weight to a kernel over the
whole input space. Assigning different weights to a
kernel in different regions of the input space may pro-
duce a better classifier. If data has underlying locali-
ties, we should give higher weights to appropriate ker-
nel functions (i.e., kernels which match the complexity
of data distribution) for each local region. Lewis et al.
(2006) propose to use a nonstationary combination
method derived with a large-margin latent variable
generative method. They use a log-ratio of Gaussian
mixtures as the classifier. Lee et al. (2007) combine
Gaussian kernels with different width parameters to
capture the underlying local distributions, by forming
a compositional kernel matrix from Gaussian kernels
and using it to train a single classifier.
In this paper, we introduce a localized formulation of the multiple kernel learning (MKL) problem. In Section 2, we modify the discriminant function of the MKL framework proposed by Bach et al. (2004) with a localized one and describe how to optimize the parameters with a two-step optimization procedure. Section 3 explains the key properties of the proposed algorithm.
We then demonstrate the performance of our local-
ized multiple kernel learning (LMKL) method on toy,
benchmark, and bioinformatics data sets in Section 4.
We conclude in Section 5.
2. Localized Multiple Kernel Learning
We describe the LMKL framework for binary classi-
fication SVM but the derivations in this section can
easily be extended to other kernel-based learning algo-
rithms. We propose to rewrite the discriminant func-
tion (2) of Bach et al. (2004) as follows, in order to
allow local combinations of kernels:
f(x) = \sum_{m=1}^{p} \eta_m(x) \langle w_m, \Phi_m(x) \rangle + b \quad (4)
where \eta_m(x) is the gating function which chooses feature space m as a function of the input x. \eta_m(x) is defined up to a set of parameters which are also learned from data, as we will discuss below. By modifying the original SVM formulation with this new discriminant function, we get the following optimization problem:
\min \; \frac{1}{2} \sum_{m=1}^{p} \|w_m\|^2 + C \sum_{i=1}^{n} \xi_i
w.r.t. w_m, b, \xi, \eta_m(x)
s.t. y_i \left( \sum_{m=1}^{p} \eta_m(x_i) \langle w_m, \Phi_m(x_i) \rangle + b \right) \geq 1 - \xi_i \quad \forall i
\xi_i \geq 0 \quad \forall i \quad (5)
where C is the regularization parameter and the \xi_i are the slack variables as usual. Note that the optimization problem in (5) is not convex due to the nonlinearity introduced in the separation constraints.
Instead of trying to solve (5) directly, we can use a two-step alternating optimization algorithm, inspired from Rakotomamonjy et al. (2007), to find the parameters of \eta_m(x) and the discriminant function. The first step is to solve (5) with respect to w_m, b, and \xi while fixing \eta_m(x), and the second step is to update the parameters of \eta_m(x) using a gradient-descent step calculated from the objective function in (5). The objective value obtained for a fixed \eta_m(x) is an upper bound for (5), and the parameters of \eta_m(x) are updated according to the current solution. The objective value obtained at the next iteration cannot be greater than the current one due to the gradient-descent procedure, and as iterations progress with a proper step size selection procedure (see Section 3.1), the objective value of (5) never increases. Note that this does not guarantee convergence to the global optimum, and the initial parameters of \eta_m(x) may affect the solution quality.
For a fixed \eta_m(x), we obtain the Lagrangian of the primal problem in (5) as follows:
L_D = \frac{1}{2} \sum_{m=1}^{p} \|w_m\|^2 + \sum_{i=1}^{n} (C - \alpha_i - \beta_i) \xi_i + \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \alpha_i y_i \left( \sum_{m=1}^{p} \eta_m(x_i) \langle w_m, \Phi_m(x_i) \rangle + b \right)
and taking the derivatives of L_D with respect to the primal variables gives:
\frac{\partial L_D}{\partial w_m} \Rightarrow w_m = \sum_{i=1}^{n} \alpha_i y_i \eta_m(x_i) \Phi_m(x_i) \quad \forall m
\frac{\partial L_D}{\partial b} \Rightarrow \sum_{i=1}^{n} \alpha_i y_i = 0
\frac{\partial L_D}{\partial \xi_i} \Rightarrow C = \alpha_i + \beta_i \quad \forall i \quad (6)

From (5) and (6), the dual formulation is obtained as:
\max \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K_\eta(x_i, x_j)
w.r.t. \alpha
s.t. \sum_{i=1}^{n} \alpha_i y_i = 0
C \geq \alpha_i \geq 0 \quad \forall i \quad (7)
where the locally combined kernel matrix is defined as:
K_\eta(x_i, x_j) = \sum_{m=1}^{p} \eta_m(x_i) \underbrace{\langle \Phi_m(x_i), \Phi_m(x_j) \rangle}_{K_m(x_i, x_j)} \eta_m(x_j) .
This formulation corresponds to solving a canonical SVM dual problem with the kernel matrix K_\eta(x_i, x_j), which should be positive semidefinite. We know that multiplying a kernel function by the outputs of a nonnegative function evaluated at both input instances, known as a quasi-conformal transformation, gives a positive semidefinite kernel matrix (Amari & Wu, 1998). So, the locally combined kernel matrix can be viewed as applying a quasi-conformal transformation to each kernel function and summing them to construct a combined kernel matrix. The only restriction is that \eta_m(x) must be nonnegative to get a positive semidefinite kernel matrix.
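A minimal sketch of how the locally combined kernel matrix K_\eta can be assembled from precomputed base kernel matrices and gating outputs is given below; the function and variable names are illustrative rather than taken from the authors' implementation:

import numpy as np

def locally_combined_kernel(kernel_matrices, gate_values):
    # kernel_matrices: list of p arrays K_m of shape (n, n)
    # gate_values: array of shape (n, p), row i holding eta_1(x_i), ..., eta_p(x_i)
    # Returns K_eta with K_eta[i, j] = sum_m eta_m(x_i) K_m[i, j] eta_m(x_j).
    n, p = gate_values.shape
    K_eta = np.zeros((n, n))
    for m, K_m in enumerate(kernel_matrices):
        g = gate_values[:, m]
        K_eta += np.outer(g, g) * K_m   # quasi-conformal scaling of K_m
    return K_eta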
Choosing among possible kernels can be considered as
a classification problem and we assume that the re-
gions of use of kernels are linearly separable. In this
case, the gating model can be expressed as:
\eta_m(x) = \frac{\exp(\langle v_m, x \rangle + v_{m0})}{\sum_{k=1}^{p} \exp(\langle v_k, x \rangle + v_{k0})}
where v_m, v_{m0} are the parameters of this gating model and the softmax guarantees nonnegativity. One can use more complex gating models for \eta_m(x), or equivalently implement the gating not in the original input space but in a space defined by a basis function, which can be one or some combination of the \Phi_m(x) in which the SVM works (thereby also allowing the use of nonvectorial data). If we use a gating model which is constant (not a function of x), our algorithm finds a fixed combination over the whole input space, similar to the original MKL formulation.
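The softmax gating model can be sketched as follows (a numerically stabilized version with illustrative names; the stabilization shift is an implementation detail, not something stated in the paper):

import numpy as np

def gating_outputs(X, V, v0):
    # Softmax gating over p kernels: eta_m(x) = softmax_m(<v_m, x> + v_m0).
    # X: (n, d) inputs, V: (p, d) gating weights, v0: (p,) gating biases.
    scores = X @ V.T + v0                        # (n, p)
    scores -= scores.max(axis=1, keepdims=True)  # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)      # rows sum to 1, entries >= 0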
The proposed method differs from taking subsets of the training set, training a classifier on each subset, and then combining them. For example, Collobert et al. (2001) define such a procedure which learns an independent SVM for each subset and reassigns instances to subsets by training a gating model with a cost function. Our approach is different in that LMKL couples subset selection and the combination of local classifiers in a joint optimization problem. LMKL also resembles the mixture of experts framework (Jacobs et al., 1991) in that the gating model combines kernel-based experts and is learned together with them; the difference is that in the mixture of experts, each expert is itself a classifier, whereas in our formulation there is no separate discriminant per kernel.
For a given \eta_m(x), we can say that the objective value of (7) is equal to the objective value of (5) due to strong duality. We can therefore safely use the objective function of (7) as J(\eta) to calculate the gradients of the primal objective with respect to the parameters of \eta_m(x). To train the gating model, we take derivatives of J(\eta) with respect to v_m, v_{m0} and use gradient descent:
\frac{\partial J(\eta)}{\partial v_{m0}} = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{p} \alpha_i \alpha_j y_i y_j \eta_k(x_i) K_k(x_i, x_j) \eta_k(x_j) \left( \delta_m^k - \eta_m(x_i) + \delta_m^k - \eta_m(x_j) \right)
\frac{\partial J(\eta)}{\partial v_m} = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{p} \alpha_i \alpha_j y_i y_j \eta_k(x_i) K_k(x_i, x_j) \eta_k(x_j) \left( x_i \left[ \delta_m^k - \eta_m(x_i) \right] + x_j \left[ \delta_m^k - \eta_m(x_j) \right] \right)
where \delta_m^k is 1 if m = k and 0 otherwise. After updating the parameters of \eta_m(x), we are required to solve a single-kernel SVM with K_\eta(x_i, x_j) at each step.
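Assuming the gradient expressions reconstructed above (including the leading minus sign), a straightforward, unoptimized sketch of the gradient computation could look like this; all names are illustrative and y is assumed to be in {-1, +1}:

import numpy as np

def gating_gradients(alpha, y, kernel_matrices, X, eta):
    # Gradients of J(eta) w.r.t. the gating parameters v_m (rows of grad_v) and v_m0,
    # following the formulas above; alpha, y: (n,), X: (n, d), eta: (n, p).
    n, p = eta.shape
    d = X.shape[1]
    grad_v = np.zeros((p, d))
    grad_v0 = np.zeros(p)
    ay = alpha * y                                   # alpha_i y_i
    for m in range(p):
        for k, K_k in enumerate(kernel_matrices):
            # W[i, j] = alpha_i alpha_j y_i y_j eta_k(x_i) K_k(x_i, x_j) eta_k(x_j)
            W = np.outer(ay * eta[:, k], ay * eta[:, k]) * K_k
            s = (1.0 if k == m else 0.0) - eta[:, m]  # delta_m^k - eta_m(x_i)
            grad_v0[m] += -0.5 * np.sum(W * (s[:, None] + s[None, :]))
            grad_v[m] += -0.5 * ((W.sum(axis=1) * s) @ X + (W.sum(axis=0) * s) @ X)
    return grad_v, grad_v0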
The complete algorithm of LMKL with the linear gating model is summarized in Algorithm 1. Convergence of the algorithm can be determined by observing the change in \alpha or in the parameters of \eta_m(x).
Algorithm 1 LMKL with the linear gating model
1: Initialize v_m and v_{m0} to small random numbers for m = 1, \dots, p
2: repeat
3:   Calculate K_\eta(x_i, x_j) with the gating model
4:   Solve the canonical SVM with K_\eta(x_i, x_j)
5:   v_{m0}^{(t+1)} \leftarrow v_{m0}^{(t)} - \mu^{(t)} \frac{\partial J(\eta)}{\partial v_{m0}} for m = 1, \dots, p
6:   v_m^{(t+1)} \leftarrow v_m^{(t)} - \mu^{(t)} \frac{\partial J(\eta)}{\partial v_m} for m = 1, \dots, p
7: until convergence
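A minimal end-to-end sketch of Algorithm 1 is given below. It reuses the gating_outputs, locally_combined_kernel, and gating_gradients sketches above, and substitutes scikit-learn's precomputed-kernel SVC for the canonical SVM solver (the authors use MOSEK and an SMO-based solver); the fixed step size and iteration count follow the settings reported in Section 4. Labels y are assumed to be in {-1, +1}.

import numpy as np
from sklearn.svm import SVC

def train_lmkl(X, y, kernel_matrices, C=10.0, step=0.01, iters=50, seed=0):
    # Two-step LMKL training: alternately solve a canonical SVM on K_eta
    # and take a gradient step on the gating parameters v_m, v_m0.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    p = len(kernel_matrices)
    V = 0.01 * rng.standard_normal((p, d))    # v_m initialized to small random numbers
    v0 = 0.01 * rng.standard_normal(p)        # v_m0
    for _ in range(iters):
        eta = gating_outputs(X, V, v0)
        K_eta = locally_combined_kernel(kernel_matrices, eta)
        svm = SVC(C=C, kernel="precomputed").fit(K_eta, y)
        alpha = np.zeros(n)
        alpha[svm.support_] = np.abs(svm.dual_coef_[0])   # recover alpha_i >= 0
        grad_v, grad_v0 = gating_gradients(alpha, y, kernel_matrices, X, eta)
        V -= step * grad_v
        v0 -= step * grad_v0
    # refit once with the final gating parameters to get the final SVM solution
    eta = gating_outputs(X, V, v0)
    K_eta = locally_combined_kernel(kernel_matrices, eta)
    svm = SVC(C=C, kernel="precomputed").fit(K_eta, y)
    alpha = np.zeros(n)
    alpha[svm.support_] = np.abs(svm.dual_coef_[0])
    return V, v0, alpha, float(svm.intercept_[0])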
After determining the final \eta_m(x) and the SVM solution, the resulting discriminant function is:
f(x) = \sum_{i=1}^{n} \sum_{m=1}^{p} \alpha_i y_i \eta_m(x) K_m(x, x_i) \eta_m(x_i) + b . \quad (8)
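The following sketch evaluates (8) for a single test instance, skipping kernel evaluations whose gating outputs are zero (the property discussed in Section 3.1); the names are illustrative and gating_outputs refers to the sketch above:

import numpy as np

def lmkl_discriminant(x, X_train, y_train, alpha, b, V, v0, kernel_funcs):
    # Evaluate (8): f(x) = sum_i sum_m alpha_i y_i eta_m(x) K_m(x, x_i) eta_m(x_i) + b.
    # kernel_funcs is a list of p callables K_m(x, z); V, v0 are trained gating parameters.
    eta_x = gating_outputs(x[None, :], V, v0)[0]   # eta_m(x), shape (p,)
    eta_sv = gating_outputs(X_train, V, v0)        # eta_m(x_i), shape (n, p)
    f = b
    for i, (xi, yi, ai) in enumerate(zip(X_train, y_train, alpha)):
        if ai == 0.0:
            continue                               # non-support vectors do not contribute
        for m, K_m in enumerate(kernel_funcs):
            if eta_x[m] == 0.0 or eta_sv[i, m] == 0.0:
                continue                           # kernel inactive in this region
            f += ai * yi * eta_x[m] * K_m(x, xi) * eta_sv[i, m]
    return float(f)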

3. Discussions
We explain the key properties and possible extensions
of the proposed algorithm in this section.
3.1. Computational Complexity
In each iteration, we are required to solve a canonical SVM problem with the combined kernel obtained from the current gating model and to calculate the gradients of J(\eta). The gradient calculation step has negligible time complexity compared to the SVM solver. The step size of each iteration, \mu^{(t)}, should be determined with a line search method, which requires additional SVM optimizations for better convergence. The computational complexity of our algorithm mainly depends on the complexity of the canonical SVM solver used in the main loop, which can be reduced by using hot-start (i.e., giving the previous \alpha as input). The number of iterations before convergence clearly depends on the training data and the step size selection procedure. The time complexity for testing is also reduced as a result of localization: K_m(x, x_i) in (8) needs to be evaluated only if both \eta_m(x) and \eta_m(x_i) are nonzero.
3.2. Extensions to Other Kernel-Based
Algorithms
LMKL can also be applied to kernel-based algorithms
other than binary classification SVM, such as regres-
sion and one-class SVMs. We need to make two basic changes: (a) the optimization problem that is solved and (b) the gradient calculations derived from its objective value. Otherwise, the same algorithm applies.
3.3. Knowledge Extraction
The MKL framework is used to extract knowledge
about the relative contributions of kernel functions
used in combination. If kernel functions are evaluated
over different feature subsets or data representations,
the important ones have higher combination weights.
With our LMKL framework, we can deduce similar in-
formation based on different regions of the input space.
Our proposed method also allows combining multiple
copies of the same kernel to obtain localized discrim-
inants, thanks to the nonlinearity introduced by the
gating model. For example, we can combine linear
kernels with the gating model to obtain nearly piece-
wise linear boundaries.
4. Experiments
We implement the main body of our algorithm in C++
and solve the optimization problems with MOSEK op-
timization software (Mosek, 2008). Our experimental
methodology is as follows: Given a data set, a random
one-third is reserved as the test set and the remaining
two-thirds is resampled using 5 × 2 cross-validation to
generate ten training and validation sets, with strat-
ification. The validation sets of all folds are used to
optimize C by trying values 0.01, 0.1, 1, 10, and 100.
The best configuration (the one that has the highest
average accuracy on the validation folds) is used to
train the final SVMs on the training folds and their
performance is measured over the test set. So, for
each data set, we have ten test set results.
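A sketch of this resampling protocol, using scikit-learn splitters as a stand-in for the authors' own setup, might look as follows; the C grid {0.01, 0.1, 1, 10, 100} would then be searched over the ten validation folds:

import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def five_by_two_cv_splits(X, y, seed=0):
    # Reserve a random third as the test set, then resample the remaining
    # two-thirds with stratified 5 x 2 cross-validation, as described above.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=1.0 / 3.0, stratify=y, random_state=seed)
    folds = []
    for r in range(5):
        skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed + r)
        for tr_idx, va_idx in skf.split(X_rest, y_rest):
            folds.append((X_rest[tr_idx], y_rest[tr_idx], X_rest[va_idx], y_rest[va_idx]))
    return folds, (X_test, y_test)   # ten (train, validation) pairs and the test set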
We perform simulations with three commonly used kernels: the linear kernel (K_L), the polynomial kernel (K_P), and the Gaussian kernel (K_G):
K_L(x_i, x_j) = \langle x_i, x_j \rangle
K_P(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^q
K_G(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / s^2) .
We use the second-degree (q = 2) polynomial kernel and estimate s in the Gaussian kernel as the average nearest-neighbor distance between instances of the training set. All kernel matrices are calculated and normalized to unit trace before training. The step size of each iteration, \mu^{(t)}, is fixed as 0.01 without performing line search, and a total of 50 iterations are performed.
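The kernel setup just described can be sketched as follows; the helper name and the exact handling of ties in the nearest-neighbor computation are illustrative assumptions:

import numpy as np

def kernel_matrices_for(X, q=2):
    # Linear, polynomial (degree q), and Gaussian kernel matrices on the training set X,
    # each normalized to unit trace; s is the average nearest-neighbor distance.
    G = X @ X.T                                             # pairwise inner products
    sq = np.maximum(np.diag(G)[:, None] + np.diag(G)[None, :] - 2.0 * G, 0.0)
    s = np.sqrt(np.sort(sq, axis=1)[:, 1]).mean()           # nearest *other* instance per row
    K_L = G
    K_P = (G + 1.0) ** q
    K_G = np.exp(-sq / s ** 2)
    return [K / np.trace(K) for K in (K_L, K_P, K_G)]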
4.1. Toy Data Set
In order to illustrate our proposed algorithm, we create
a toy data set, named Gauss4, which consists of 1200
data instances generated from four Gaussian compo-
nents (two for each class) with the following prior prob-
abilities, mean vectors and covariance matrices:
p_{11} = 0.25, \quad \mu_{11} = (-3.0, +1.0)^T, \quad \Sigma_{11} = \begin{pmatrix} 0.8 & 0.0 \\ 0.0 & 2.0 \end{pmatrix}
p_{12} = 0.25, \quad \mu_{12} = (+1.0, +1.0)^T, \quad \Sigma_{12} = \begin{pmatrix} 0.8 & 0.0 \\ 0.0 & 2.0 \end{pmatrix}
p_{21} = 0.25, \quad \mu_{21} = (-1.0, -2.2)^T, \quad \Sigma_{21} = \begin{pmatrix} 0.8 & 0.0 \\ 0.0 & 4.0 \end{pmatrix}
p_{22} = 0.25, \quad \mu_{22} = (+3.0, -2.2)^T, \quad \Sigma_{22} = \begin{pmatrix} 0.8 & 0.0 \\ 0.0 & 4.0 \end{pmatrix}
where data instances from the first two components are of class 1 (labeled as positive) and the others are of class 2 (labeled as negative).¹ We perform two sets of experiments on the Gauss4 data set: (K_L-K_P) and (K_L-K_L-K_L).
¹A MATLAB implementation of LMKL with an SMO-based canonical SVM solver and the Gauss4 data set are available at http://www.cmpe.boun.edu.tr/~gonen/lmkl.
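For reference, a sketch that samples a Gauss4-like data set from the mixture specification above (assuming the reconstructed signs of the component means) is:

import numpy as np

def sample_gauss4(n=1200, seed=0):
    # Draw the Gauss4 toy data from the four Gaussian components specified
    # above (equal priors of 0.25); the first two components are class +1.
    rng = np.random.default_rng(seed)
    means = np.array([[-3.0, 1.0], [1.0, 1.0], [-1.0, -2.2], [3.0, -2.2]])
    covs = [np.diag([0.8, 2.0]), np.diag([0.8, 2.0]),
            np.diag([0.8, 4.0]), np.diag([0.8, 4.0])]
    labels = np.array([+1, +1, -1, -1])
    comp = rng.integers(0, 4, size=n)                  # component of each instance
    X = np.vstack([rng.multivariate_normal(means[c], covs[c]) for c in comp])
    y = labels[comp]
    return X, y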

First, we train both MKL and LMKL for the (K_L-K_P) combination. Figure 1(a) shows the classification boundaries calculated and the support vectors stored by MKL, which assigns combination weights 0.30 and 0.70 to K_L and K_P, respectively. Using the kernel matrix obtained by combining K_L and K_P with these weights, we do not achieve a good approximation to the optimal Bayes' boundary. As we see in Figure 1(b), LMKL divides the input space into two regions and uses the polynomial kernel to separate one component from two others quadratically and the linear kernel for the other component. We see that the locally combined kernel matrix obtained from K_L and K_P with the linear gating model learns a classification boundary very similar to the optimal Bayes' boundary. Note that the softmax function in the gating model achieves a smooth transition between kernels.
The effect of combining multiple copies of the same kernel can be seen in Figure 1(c), which shows the classification and gating model boundaries of LMKL with the (K_L-K_L-K_L) combination. Using linear kernels in three different regions enables us to approximate the optimal Bayes' boundary in a piecewise linear manner. Instead of using complex kernels such as the Gaussian kernel, local combination of simple kernels (e.g., linear and polynomial kernels) can produce accurate classifiers and avoid overfitting. For example, the Gaussian kernel achieves 89.67 per cent average testing accuracy by storing all training instances as support vectors. However, LMKL with three linear kernels achieves 92.00 per cent average testing accuracy by storing 23.18 per cent of training instances as support vectors on the average.
Initially, we assign small random numbers to the gating model parameters and this gives nearly equal combination weights for each kernel. This is equivalent to taking an unweighted summation of the original kernel matrices. The gating model starts to give crisp outputs as iterations progress and the locally combined kernel matrix becomes more sparse (see Figure 2). The kernel function values between data instances from different regions become 0 due to the multiplication of the gating model outputs. This localizing characteristic is also effective for the test instances. If the gating model gives crisp outputs for a test instance, the discriminant function in (8) is calculated over only the support vectors having nonzero gating model outputs for the selected kernels. Hence, the discriminant function value for a data instance is mainly determined by the neighboring training instances and the active kernel function in its region.
[Figure 1: three panels. (a) MKL with (K_L-K_P). (b) LMKL with (K_L-K_P); gating regions labeled P and L. (c) LMKL with (K_L-K_L-K_L); gating regions labeled L, L, L.]
Figure 1. Separating hyperplanes (black solid lines) and support vectors (filled points) on the Gauss4 data set. Dashed lines show the Gaussians from which data are sampled and the optimal Bayes' discriminant. The gray solid lines show the boundaries calculated from the gating models by considering them as classifiers which select a kernel function.

References
Amari, S., & Wu, S. (1998). Improving support vector machine classifiers by modifying kernel functions. Neural Networks.
Bach, F. R., Lanckriet, G. R. G., & Jordan, M. I. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. Proceedings of the 21st International Conference on Machine Learning.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research.
Sonnenburg, S., Rätsch, G., Schäfer, C., & Schölkopf, B. (2006). Large scale multiple kernel learning. Journal of Machine Learning Research.