Localized Multiple Kernel Learning
Mehmet Gönen gonen@boun.edu.tr
Ethem Alpaydın alpaydin@boun.edu.tr
Department of Computer Engineering, Boğaziçi University, TR-34342, Bebek, İstanbul, Turkey
Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).
Abstract
Recently, instead of selecting a single kernel,
multiple kernel learning (MKL) has been pro-
posed which uses a convex combination of
kernels, where the weight of each kernel is
optimized during training. However, MKL
assigns the same weight to a kernel over the
whole input space. In this paper, we develop
a localized multiple kernel learning (LMKL)
algorithm using a gating model for select-
ing the appropriate kernel function locally.
The localizing gating model and the kernel-
based classifier are coupled and their opti-
mization is done in a joint manner. Empiri-
cal results on ten benchmark and two bioin-
formatics data sets validate the applicability
of our approach. LMKL achieves statistically
similar accuracy results compared with MKL
by storing fewer support vectors. LMKL can
also combine multiple copies of the same ker-
nel function localized in different parts. For
example, LMKL with multiple linear kernels
gives better accuracy results than using a sin-
gle linear kernel on bioinformatics data sets.
1. Introduction
Kernel-based methods such as the support vector ma-
chine (SVM) gained much popularity due to their suc-
cess. For classification tasks, the basic idea is to map
the training instances from the input space to a feature
space (generally a higher dimensional space than the
input space) where they are linearly separable. The
SVM discriminant function obtained after training is:
f(x) = \langle w, \Phi(x) \rangle + b \quad (1)
where w is the vector of weight coefficients, b is the threshold, and \Phi(x) is the mapping function to the corresponding feature space. We do not need to define the mapping function explicitly; if we plug the w vector obtained from the dual formulation into (1), we obtain the discriminant:
f(x) = \sum_{i=1}^{n} \alpha_i y_i \underbrace{\langle \Phi(x), \Phi(x_i) \rangle}_{K(x, x_i)} + b
where n is the number of training instances x_i and K(x, x_i) = \langle \Phi(x), \Phi(x_i) \rangle is the corresponding kernel function.
Each mapping \Phi(x) has its own characteristics: it corresponds to a different kernel function and leads to a different discriminant function in the original space. Selecting the kernel function (i.e., selecting the mapping function) is an important step in SVM training and is generally performed using cross-validation.
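To make the dual-form discriminant concrete, here is a minimal sketch (not from the paper; the names svm_discriminant, alpha, and linear_kernel are illustrative) that evaluates f(x) for a test point given already-trained support vector coefficients:

import numpy as np

def linear_kernel(x, z):
    # K(x, z) = <x, z>, the kernel induced by the identity mapping
    return float(np.dot(x, z))

def svm_discriminant(x, X_train, y_train, alpha, b, kernel=linear_kernel):
    # f(x) = sum_i alpha_i y_i K(x, x_i) + b; only support vectors (alpha_i > 0) contribute
    return sum(a * yi * kernel(x, xi)
               for a, yi, xi in zip(alpha, y_train, X_train)) + b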
In recent studies (Lanckriet et al., 2004a; Sonnenburg
et al., 2006), it is reported that using multiple different
kernels instead of a single kernel improves the classifi-
cation performance. The simplest way is to use an un-
weighted sum of kernel functions (Pavlidis et al., 2001;
Moguerza et al., 2004). Using an unweighted sum gives
equal preference to all kernels and this may not be
ideal. A better strategy is to learn a weighted sum
(e.g., convex combination); this also allows extract-
ing information from the weights assigned to kernels.
Lanckriet et al. (2004b) formulate this as a semidef-
inite programming problem which allows finding the
combination weights and support vector coefficients
together. Bach et al. (2004) reformulate the prob-
lem and propose an efficient algorithm using sequen-
tial minimal optimization (SMO). Their discriminant
function can be seen as an unweighted summation of
discriminant values (but a weighted summation of ker-
nel functions) in different feature spaces:
f(x) = \sum_{m=1}^{p} \langle w_m, \Phi_m(x) \rangle + b \quad (2)
where m indexes kernels, w_m is the vector of weight coefficients, \Phi_m(x) is the mapping function for feature space m, and p is the number of kernels. By plugging w_m derived from the duality conditions into (2), we obtain:
f(x) = \sum_{m=1}^{p} \eta_m \sum_{i=1}^{n} \alpha_i y_i \underbrace{\langle \Phi_m(x), \Phi_m(x_i) \rangle}_{K_m(x, x_i)} + b \quad (3)
where the kernel weights satisfy \eta_m \geq 0 and \sum_{m=1}^{p} \eta_m = 1. The kernels we combine can be the same kernel with different hyperparameters (e.g., the degree in the polynomial kernel) or different kernels (e.g., linear, polynomial, and Gaussian kernels). We can also combine kernels over different data representations or different feature subsets.
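As a small illustration of such a fixed-weight combination, the following sketch (illustrative names; not part of the paper) builds the combined kernel matrix \sum_m \eta_m K_m from precomputed kernel matrices:

import numpy as np

def combine_kernels_fixed(kernel_matrices, eta):
    # Fixed-weight MKL combination: K = sum_m eta_m K_m with eta_m >= 0 and sum_m eta_m = 1,
    # i.e., the same weight eta_m is used everywhere in the input space.
    eta = np.asarray(eta, dtype=float)
    assert np.all(eta >= 0.0) and np.isclose(eta.sum(), 1.0)
    return sum(w * K for w, K in zip(eta, kernel_matrices))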
Using a fixed combination rule (unweighted or
weighted) assigns the same weight to a kernel over the
whole input space. Assigning different weights to a
kernel in different regions of the input space may pro-
duce a better classifier. If data has underlying locali-
ties, we should give higher weights to appropriate ker-
nel functions (i.e., kernels which match the complexity
of data distribution) for each local region. Lewis et al.
(2006) propose to use a nonstationary combination
method derived with a large-margin latent variable
generative method. They use a log-ratio of Gaussian
mixtures as the classifier. Lee et al. (2007) combine
Gaussian kernels with different width parameters to
capture the underlying local distributions, by forming
a compositional kernel matrix from Gaussian kernels
and using it to train a single classifier.
In this paper, we introduce a localized formulation of the multiple kernel learning (MKL) problem. In Section 2, we modify the discriminant function of the MKL framework proposed by Bach et al. (2004) with a localized one and describe how to optimize the parameters with a two-step optimization procedure. Section 3 explains the key properties of the proposed algorithm.
We then demonstrate the performance of our local-
ized multiple kernel learning (LMKL) method on toy,
benchmark, and bioinformatics data sets in Section 4.
We conclude in Section 5.
2. Localized Multiple Kernel Learning
We describe the LMKL framework for binary classi-
fication SVM but the derivations in this section can
easily be extended to other kernel-based learning algo-
rithms. We propose to rewrite the discriminant func-
tion (2) of Bach et al. (2004) as follows, in order to
allow local combinations of kernels:
f(x) = \sum_{m=1}^{p} \eta_m(x) \langle w_m, \Phi_m(x) \rangle + b \quad (4)
where \eta_m(x) is the gating function which chooses feature space m as a function of the input x. \eta_m(x) is defined up to a set of parameters which are also learned from data, as we will discuss below. By modifying the original SVM formulation with this new discriminant function, we get the following optimization problem:
\min \; \frac{1}{2} \sum_{m=1}^{p} \|w_m\|^2 + C \sum_{i=1}^{n} \xi_i
w.r.t. w_m, b, \xi, \eta_m(x)
s.t. y_i \left( \sum_{m=1}^{p} \eta_m(x_i) \langle w_m, \Phi_m(x_i) \rangle + b \right) \geq 1 - \xi_i \quad \forall i
\xi_i \geq 0 \quad \forall i \quad (5)
where C is the regularization parameter and the \xi_i are the slack variables as usual. Note that the optimization problem in (5) is not convex due to the nonlinearity introduced in the separation constraints.
Instead of trying to solve (5) directly, we can use a two-step alternating optimization algorithm, inspired from Rakotomamonjy et al. (2007), to find the parameters of \eta_m(x) and the discriminant function. The first step is to solve (5) with respect to w_m, b, and \xi while fixing \eta_m(x), and the second step is to update the parameters of \eta_m(x) using a gradient-descent step calculated from the objective function in (5). The objective value obtained for a fixed \eta_m(x) is an upper bound for (5), and the parameters of \eta_m(x) are updated according to the current solution. The objective value obtained at the next iteration cannot be greater than the current one due to the gradient-descent procedure, and as iterations progress with a proper step size selection procedure (see Section 3.1), the objective value of (5) never increases. Note that this does not guarantee convergence to the global optimum, and the initial parameters of \eta_m(x) may affect the solution quality.
For a fixed \eta_m(x), we obtain the Lagrangian of the primal problem in (5) as follows:
L_D = \frac{1}{2} \sum_{m=1}^{p} \|w_m\|^2 + \sum_{i=1}^{n} (C - \alpha_i - \beta_i) \xi_i + \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \alpha_i y_i \left( \sum_{m=1}^{p} \eta_m(x_i) \langle w_m, \Phi_m(x_i) \rangle + b \right)
and taking the derivatives of L_D with respect to the primal variables gives:
\frac{\partial L_D}{\partial w_m} \Rightarrow w_m = \sum_{i=1}^{n} \alpha_i y_i \eta_m(x_i) \Phi_m(x_i) \quad \forall m
\frac{\partial L_D}{\partial b} \Rightarrow \sum_{i=1}^{n} \alpha_i y_i = 0
\frac{\partial L_D}{\partial \xi_i} \Rightarrow C = \alpha_i + \beta_i \quad \forall i \quad (6)

From (5) and (6), the dual formulation is obtained as:
\max \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K_\eta(x_i, x_j)
w.r.t. \alpha
s.t. \sum_{i=1}^{n} \alpha_i y_i = 0
C \geq \alpha_i \geq 0 \quad \forall i \quad (7)
where the locally combined kernel matrix is defined as:
K_\eta(x_i, x_j) = \sum_{m=1}^{p} \eta_m(x_i) \underbrace{\langle \Phi_m(x_i), \Phi_m(x_j) \rangle}_{K_m(x_i, x_j)} \eta_m(x_j) .
This formulation corresponds to solving a canonical SVM dual problem with the kernel matrix K_\eta(x_i, x_j), which should be positive semidefinite. We know that multiplying a kernel function by the outputs of a nonnegative function evaluated at both input instances, known as a quasi-conformal transformation, gives a positive semidefinite kernel matrix (Amari & Wu, 1998). So, the locally combined kernel matrix can be viewed as applying a quasi-conformal transformation to each kernel function and summing them to construct a combined kernel matrix. The only restriction is that \eta_m(x) must be nonnegative to get a positive semidefinite kernel matrix.
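A minimal sketch of how the locally combined kernel matrix K_\eta can be assembled from precomputed base kernel matrices and gating outputs is given below; the function and variable names are illustrative rather than taken from the authors' implementation:

import numpy as np

def locally_combined_kernel(kernel_matrices, gate_values):
    # kernel_matrices: list of p arrays K_m of shape (n, n)
    # gate_values: array of shape (n, p), row i holding eta_1(x_i), ..., eta_p(x_i)
    # Returns K_eta with K_eta[i, j] = sum_m eta_m(x_i) K_m[i, j] eta_m(x_j).
    n, p = gate_values.shape
    K_eta = np.zeros((n, n))
    for m, K_m in enumerate(kernel_matrices):
        g = gate_values[:, m]
        K_eta += np.outer(g, g) * K_m   # quasi-conformal scaling of K_m
    return K_eta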
Choosing among possible kernels can be considered as
a classification problem and we assume that the re-
gions of use of kernels are linearly separable. In this
case, the gating model can be expressed as:
\eta_m(x) = \frac{\exp(\langle v_m, x \rangle + v_{m0})}{\sum_{k=1}^{p} \exp(\langle v_k, x \rangle + v_{k0})}
where v_m, v_{m0} are the parameters of this gating model and the softmax guarantees nonnegativity. One can use more complex gating models for \eta_m(x), or equivalently implement the gating not in the original input space but in a space defined by a basis function, which can be one or some combination of the \Phi_m(x) in which the SVM works (thereby also allowing the use of nonvectorial data). If we use a gating model which is constant (not a function of x), our algorithm finds a fixed combination over the whole input space, similar to the original MKL formulation.
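The softmax gating model can be sketched as follows (a numerically stabilized version with illustrative names; the stabilization shift is an implementation detail, not something stated in the paper):

import numpy as np

def gating_outputs(X, V, v0):
    # Softmax gating over p kernels: eta_m(x) = softmax_m(<v_m, x> + v_m0).
    # X: (n, d) inputs, V: (p, d) gating weights, v0: (p,) gating biases.
    scores = X @ V.T + v0                        # (n, p)
    scores -= scores.max(axis=1, keepdims=True)  # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)      # rows sum to 1, entries >= 0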
The proposed method differs from taking subsets of the training set, training a classifier on each subset, and then combining them. For example, Collobert et al. (2001) define such a procedure which learns an independent SVM for each subset and reassigns instances to subsets by training a gating model with a cost function. Our approach is different in that LMKL couples subset selection and the combination of local classifiers in a joint optimization problem. LMKL also resembles the mixture of experts framework (Jacobs et al., 1991) in that the gating model combines kernel-based experts and is learned together with them; the difference is that in the mixture of experts, each expert is itself a classifier, whereas in our formulation there is no separate discriminant per kernel.
For a given \eta_m(x), we can say that the objective value of (7) is equal to the objective value of (5) due to strong duality. We can therefore safely use the objective function of (7) as J(\eta) to calculate the gradients of the primal objective with respect to the parameters of \eta_m(x). To train the gating model, we take derivatives of J(\eta) with respect to v_m, v_{m0} and use gradient descent:
\frac{\partial J(\eta)}{\partial v_{m0}} = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{p} \alpha_i \alpha_j y_i y_j \eta_k(x_i) K_k(x_i, x_j) \eta_k(x_j) \left( \delta_m^k - \eta_m(x_i) + \delta_m^k - \eta_m(x_j) \right)
\frac{\partial J(\eta)}{\partial v_m} = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{p} \alpha_i \alpha_j y_i y_j \eta_k(x_i) K_k(x_i, x_j) \eta_k(x_j) \left( x_i \left[ \delta_m^k - \eta_m(x_i) \right] + x_j \left[ \delta_m^k - \eta_m(x_j) \right] \right)
where \delta_m^k is 1 if m = k and 0 otherwise. After updating the parameters of \eta_m(x), we are required to solve a single-kernel SVM with K_\eta(x_i, x_j) at each step.
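Assuming the gradient expressions reconstructed above (including the leading minus sign), a straightforward, unoptimized sketch of the gradient computation could look like this; all names are illustrative and y is assumed to be in {-1, +1}:

import numpy as np

def gating_gradients(alpha, y, kernel_matrices, X, eta):
    # Gradients of J(eta) w.r.t. the gating parameters v_m (rows of grad_v) and v_m0,
    # following the formulas above; alpha, y: (n,), X: (n, d), eta: (n, p).
    n, p = eta.shape
    d = X.shape[1]
    grad_v = np.zeros((p, d))
    grad_v0 = np.zeros(p)
    ay = alpha * y                                   # alpha_i y_i
    for m in range(p):
        for k, K_k in enumerate(kernel_matrices):
            # W[i, j] = alpha_i alpha_j y_i y_j eta_k(x_i) K_k(x_i, x_j) eta_k(x_j)
            W = np.outer(ay * eta[:, k], ay * eta[:, k]) * K_k
            s = (1.0 if k == m else 0.0) - eta[:, m]  # delta_m^k - eta_m(x_i)
            grad_v0[m] += -0.5 * np.sum(W * (s[:, None] + s[None, :]))
            grad_v[m] += -0.5 * ((W.sum(axis=1) * s) @ X + (W.sum(axis=0) * s) @ X)
    return grad_v, grad_v0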
The complete algorithm of LMKL with the linear gating model is summarized in Algorithm 1. Convergence of the algorithm can be determined by observing the change in \alpha or in the parameters of \eta_m(x).
Algorithm 1 LMKL with the linear gating model
1: Initialize v_m and v_{m0} to small random numbers for m = 1, \dots, p
2: repeat
3:   Calculate K_\eta(x_i, x_j) with the gating model
4:   Solve the canonical SVM with K_\eta(x_i, x_j)
5:   v_{m0}^{(t+1)} \leftarrow v_{m0}^{(t)} - \mu^{(t)} \frac{\partial J(\eta)}{\partial v_{m0}} for m = 1, \dots, p
6:   v_m^{(t+1)} \leftarrow v_m^{(t)} - \mu^{(t)} \frac{\partial J(\eta)}{\partial v_m} for m = 1, \dots, p
7: until convergence
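A minimal end-to-end sketch of Algorithm 1 is given below. It reuses the gating_outputs, locally_combined_kernel, and gating_gradients sketches above, and substitutes scikit-learn's precomputed-kernel SVC for the canonical SVM solver (the authors use MOSEK and an SMO-based solver); the fixed step size and iteration count follow the settings reported in Section 4. Labels y are assumed to be in {-1, +1}.

import numpy as np
from sklearn.svm import SVC

def train_lmkl(X, y, kernel_matrices, C=10.0, step=0.01, iters=50, seed=0):
    # Two-step LMKL training: alternately solve a canonical SVM on K_eta
    # and take a gradient step on the gating parameters v_m, v_m0.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    p = len(kernel_matrices)
    V = 0.01 * rng.standard_normal((p, d))    # v_m initialized to small random numbers
    v0 = 0.01 * rng.standard_normal(p)        # v_m0
    for _ in range(iters):
        eta = gating_outputs(X, V, v0)
        K_eta = locally_combined_kernel(kernel_matrices, eta)
        svm = SVC(C=C, kernel="precomputed").fit(K_eta, y)
        alpha = np.zeros(n)
        alpha[svm.support_] = np.abs(svm.dual_coef_[0])   # recover alpha_i >= 0
        grad_v, grad_v0 = gating_gradients(alpha, y, kernel_matrices, X, eta)
        V -= step * grad_v
        v0 -= step * grad_v0
    # refit once with the final gating parameters to get the final SVM solution
    eta = gating_outputs(X, V, v0)
    K_eta = locally_combined_kernel(kernel_matrices, eta)
    svm = SVC(C=C, kernel="precomputed").fit(K_eta, y)
    alpha = np.zeros(n)
    alpha[svm.support_] = np.abs(svm.dual_coef_[0])
    return V, v0, alpha, float(svm.intercept_[0])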
After determining the final \eta_m(x) and the SVM solution, the resulting discriminant function is:
f(x) = \sum_{i=1}^{n} \sum_{m=1}^{p} \alpha_i y_i \eta_m(x) K_m(x, x_i) \eta_m(x_i) + b . \quad (8)
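The following sketch evaluates (8) for a single test instance, skipping kernel evaluations whose gating outputs are zero (the property discussed in Section 3.1); the names are illustrative and gating_outputs refers to the sketch above:

import numpy as np

def lmkl_discriminant(x, X_train, y_train, alpha, b, V, v0, kernel_funcs):
    # Evaluate (8): f(x) = sum_i sum_m alpha_i y_i eta_m(x) K_m(x, x_i) eta_m(x_i) + b.
    # kernel_funcs is a list of p callables K_m(x, z); V, v0 are trained gating parameters.
    eta_x = gating_outputs(x[None, :], V, v0)[0]   # eta_m(x), shape (p,)
    eta_sv = gating_outputs(X_train, V, v0)        # eta_m(x_i), shape (n, p)
    f = b
    for i, (xi, yi, ai) in enumerate(zip(X_train, y_train, alpha)):
        if ai == 0.0:
            continue                               # non-support vectors do not contribute
        for m, K_m in enumerate(kernel_funcs):
            if eta_x[m] == 0.0 or eta_sv[i, m] == 0.0:
                continue                           # kernel inactive in this region
            f += ai * yi * eta_x[m] * K_m(x, xi) * eta_sv[i, m]
    return float(f)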

3. Discussions
We explain the key properties and possible extensions
of the proposed algorithm in this section.
3.1. Computational Complexity
In each iteration, we are required to solve a canonical SVM problem with the combined kernel obtained from the current gating model and to calculate the gradients of J(\eta). The gradient calculation step has negligible time complexity compared to the SVM solver. The step size of each iteration, \mu^{(t)}, should be determined with a line search method, which requires additional SVM optimizations for better convergence. The computational complexity of our algorithm mainly depends on the complexity of the canonical SVM solver used in the main loop, which can be reduced by using hot-start (i.e., giving the previous \alpha as input). The number of iterations before convergence clearly depends on the training data and the step size selection procedure. The time complexity for testing is also reduced as a result of localization: K_m(x, x_i) in (8) needs to be evaluated only if both \eta_m(x) and \eta_m(x_i) are nonzero.
3.2. Extensions to Other Kernel-Based
Algorithms
LMKL can also be applied to kernel-based algorithms
other than binary classification SVM, such as regres-
sion and one-class SVMs. We need to make two basic changes: (a) the optimization problem that is solved and (b) the gradient calculations derived from its objective value. Otherwise, the same algorithm applies.
3.3. Knowledge Extraction
The MKL framework is used to extract knowledge
about the relative contributions of kernel functions
used in combination. If kernel functions are evaluated
over different feature subsets or data representations,
the important ones have higher combination weights.
With our LMKL framework, we can deduce similar in-
formation based on different regions of the input space.
Our proposed method also allows combining multiple
copies of the same kernel to obtain localized discrim-
inants, thanks to the nonlinearity introduced by the
gating model. For example, we can combine linear
kernels with the gating model to obtain nearly piece-
wise linear boundaries.
4. Experiments
We implement the main body of our algorithm in C++
and solve the optimization problems with MOSEK op-
timization software (Mosek, 2008). Our experimental
methodology is as follows: Given a data set, a random
one-third is reserved as the test set and the remaining
two-thirds is resampled using 5 × 2 cross-validation to
generate ten training and validation sets, with strat-
ification. The validation sets of all folds are used to
optimize C by trying values 0.01, 0.1, 1, 10, and 100.
The best configuration (the one that has the highest
average accuracy on the validation folds) is used to
train the final SVMs on the training folds and their
performance is measured over the test set. So, for
each data set, we have ten test set results.
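A sketch of this resampling protocol, using scikit-learn splitters as a stand-in for the authors' own setup, might look as follows; the C grid {0.01, 0.1, 1, 10, 100} would then be searched over the ten validation folds:

import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def five_by_two_cv_splits(X, y, seed=0):
    # Reserve a random third as the test set, then resample the remaining
    # two-thirds with stratified 5 x 2 cross-validation, as described above.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=1.0 / 3.0, stratify=y, random_state=seed)
    folds = []
    for r in range(5):
        skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed + r)
        for tr_idx, va_idx in skf.split(X_rest, y_rest):
            folds.append((X_rest[tr_idx], y_rest[tr_idx], X_rest[va_idx], y_rest[va_idx]))
    return folds, (X_test, y_test)   # ten (train, validation) pairs and the test set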
We perform simulations with three commonly used kernels: the linear kernel (K_L), the polynomial kernel (K_P), and the Gaussian kernel (K_G):
K_L(x_i, x_j) = \langle x_i, x_j \rangle
K_P(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^q
K_G(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / s^2) .
We use the second-degree (q = 2) polynomial kernel and estimate s in the Gaussian kernel as the average nearest-neighbor distance between instances of the training set. All kernel matrices are calculated and normalized to unit trace before training. The step size of each iteration, \mu^{(t)}, is fixed as 0.01 without performing line search, and a total of 50 iterations are performed.
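The kernel setup just described can be sketched as follows; the helper name and the exact handling of ties in the nearest-neighbor computation are illustrative assumptions:

import numpy as np

def kernel_matrices_for(X, q=2):
    # Linear, polynomial (degree q), and Gaussian kernel matrices on the training set X,
    # each normalized to unit trace; s is the average nearest-neighbor distance.
    G = X @ X.T                                             # pairwise inner products
    sq = np.maximum(np.diag(G)[:, None] + np.diag(G)[None, :] - 2.0 * G, 0.0)
    s = np.sqrt(np.sort(sq, axis=1)[:, 1]).mean()           # nearest *other* instance per row
    K_L = G
    K_P = (G + 1.0) ** q
    K_G = np.exp(-sq / s ** 2)
    return [K / np.trace(K) for K in (K_L, K_P, K_G)]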
4.1. Toy Data Set
In order to illustrate our proposed algorithm, we create
a toy data set, named Gauss4, which consists of 1200
data instances generated from four Gaussian compo-
nents (two for each class) with the following prior prob-
abilities, mean vectors and covariance matrices:
p_{11} = 0.25, \quad \mu_{11} = (-3.0, +1.0)^T, \quad \Sigma_{11} = \begin{pmatrix} 0.8 & 0.0 \\ 0.0 & 2.0 \end{pmatrix}
p_{12} = 0.25, \quad \mu_{12} = (+1.0, +1.0)^T, \quad \Sigma_{12} = \begin{pmatrix} 0.8 & 0.0 \\ 0.0 & 2.0 \end{pmatrix}
p_{21} = 0.25, \quad \mu_{21} = (-1.0, -2.2)^T, \quad \Sigma_{21} = \begin{pmatrix} 0.8 & 0.0 \\ 0.0 & 4.0 \end{pmatrix}
p_{22} = 0.25, \quad \mu_{22} = (+3.0, -2.2)^T, \quad \Sigma_{22} = \begin{pmatrix} 0.8 & 0.0 \\ 0.0 & 4.0 \end{pmatrix}
where data instances from the first two components are of class 1 (labeled as positive) and the others are of class 2 (labeled as negative).¹ We perform two sets of experiments on the Gauss4 data set: (K_L-K_P) and (K_L-K_L-K_L).
¹A MATLAB implementation of LMKL with an SMO-based canonical SVM solver and the Gauss4 data set are available at http://www.cmpe.boun.edu.tr/~gonen/lmkl.
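For reference, a sketch that samples a Gauss4-like data set from the mixture specification above (assuming the reconstructed signs of the component means) is:

import numpy as np

def sample_gauss4(n=1200, seed=0):
    # Draw the Gauss4 toy data from the four Gaussian components specified
    # above (equal priors of 0.25); the first two components are class +1.
    rng = np.random.default_rng(seed)
    means = np.array([[-3.0, 1.0], [1.0, 1.0], [-1.0, -2.2], [3.0, -2.2]])
    covs = [np.diag([0.8, 2.0]), np.diag([0.8, 2.0]),
            np.diag([0.8, 4.0]), np.diag([0.8, 4.0])]
    labels = np.array([+1, +1, -1, -1])
    comp = rng.integers(0, 4, size=n)                  # component of each instance
    X = np.vstack([rng.multivariate_normal(means[c], covs[c]) for c in comp])
    y = labels[comp]
    return X, y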

First, we train both MKL and LMKL for the (K_L-K_P) combination. Figure 1(a) shows the classification boundaries calculated and the support vectors stored by MKL, which assigns combination weights 0.30 and 0.70 to K_L and K_P, respectively. Using the kernel matrix obtained by combining K_L and K_P with these weights, we do not achieve a good approximation to the optimal Bayes' boundary. As we see in Figure 1(b), LMKL divides the input space into two regions and uses the polynomial kernel to separate one component from two others quadratically and the linear kernel for the other component. We see that the locally combined kernel matrix obtained from K_L and K_P with the linear gating model learns a classification boundary very similar to the optimal Bayes' boundary. Note that the softmax function in the gating model achieves a smooth transition between kernels.
The effect of combining multiple copies of the same kernel can be seen in Figure 1(c), which shows the classification and gating model boundaries of LMKL with the (K_L-K_L-K_L) combination. Using linear kernels in three different regions enables us to approximate the optimal Bayes' boundary in a piecewise linear manner. Instead of using complex kernels such as the Gaussian kernel, local combination of simple kernels (e.g., linear and polynomial kernels) can produce accurate classifiers and avoid overfitting. For example, the Gaussian kernel achieves 89.67 per cent average testing accuracy by storing all training instances as support vectors. However, LMKL with three linear kernels achieves 92.00 per cent average testing accuracy by storing 23.18 per cent of training instances as support vectors on the average.
Initially, we assign small random numbers to the gating model parameters and this gives nearly equal combination weights for each kernel. This is equivalent to taking an unweighted summation of the original kernel matrices. The gating model starts to give crisp outputs as iterations progress and the locally combined kernel matrix becomes more sparse (see Figure 2). The kernel function values between data instances from different regions become 0 due to the multiplication of the gating model outputs. This localizing characteristic is also effective for the test instances. If the gating model gives crisp outputs for a test instance, the discriminant function in (8) is calculated over only the support vectors having nonzero gating model outputs for the selected kernels. Hence, the discriminant function value for a data instance is mainly determined by the neighboring training instances and the active kernel function in its region.
[Figure 1: three panels. (a) MKL with (K_L-K_P). (b) LMKL with (K_L-K_P); gating regions labeled P and L. (c) LMKL with (K_L-K_L-K_L); gating regions labeled L, L, L.]
Figure 1. Separating hyperplanes (black solid lines) and support vectors (filled points) on the Gauss4 data set. Dashed lines show the Gaussians from which data are sampled and the optimal Bayes' discriminant. The gray solid lines show the boundaries calculated from the gating models by considering them as classifiers which select a kernel function.

References
Amari, S., & Wu, S. (1998). Improving support vector machine classifiers by modifying kernel functions. Neural Networks.
Bach, F. R., Lanckriet, G. R. G., & Jordan, M. I. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. Proceedings of the 21st International Conference on Machine Learning.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research.
Sonnenburg, S., Rätsch, G., Schäfer, C., & Schölkopf, B. (2006). Large scale multiple kernel learning. Journal of Machine Learning Research.