# Multiple kernel learning, conic duality, and the SMO algorithm

## Summary (3 min read)

### 1. Introduction

- One of the major reasons for the rise to prominence of the support vector machine (SVM) is its ability to cast nonlinear classification as a convex optimization problem, in particular a quadratic program (QP).
- Convexity implies that the solution is unique and brings a suite of standard numerical software to bear in finding the solution.
- Recent developments in the literature on the SVM and other kernel methods have emphasized the need to consider multiple kernels, or parameterizations of kernels, and not a single fixed kernel.
- One class of solutions to non-smooth optimization problems involves constructing a smooth approximate problem out of a non-smooth problem.
- In this paper the authors show how these problems can be resolved by considering a novel dual formulation of the QCQP as a second-order cone programming (SOCP) problem.
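As background (a standard definition from convex optimization, not specific to this paper), an SOCP is a convex program whose constraints restrict affine images of the variables to second-order cones:

$$
\|Ax + b\|_2 \;\le\; c^\top x + d,
$$

i.e., the pair $(Ax + b,\; c^\top x + d)$ must lie in the second-order (Lorentz) cone $\mathcal{K} = \{(u, t) : \|u\|_2 \le t\}$.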

### 2.2. Support kernel machine

- The authors now introduce a novel classification algorithm that they refer to as the "support kernel machine" (SKM).
- Their underlying motivation is the fact that the dual of the SKM is exactly the problem (L).
- The authors establish this equivalence in the following section.
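For reference, the problem (L) is the multiple kernel learning QCQP of Lanckriet et al. (2004), which the paper states as:

$$
\begin{aligned}
\min_{\zeta,\,\alpha}\quad & \zeta - 2\,e^\top \alpha \qquad (L)\\
\text{w.r.t.}\quad & \zeta \in \mathbb{R},\ \alpha \in \mathbb{R}^n\\
\text{s.t.}\quad & 0 \le \alpha \le C,\qquad \alpha^\top y = 0,\\
& \alpha^\top D(y)\, K_j\, D(y)\,\alpha \;\le\; \frac{\operatorname{tr} K_j}{c}\,\zeta,\qquad j \in \{1, \dots, m\},
\end{aligned}
$$

where $D(y)$ is the diagonal matrix with diagonal $y$, $e \in \mathbb{R}^n$ is the vector of all ones, and $C$ is a positive regularization constant.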

### 2.2.1. Linear classification

- In the spirit of the soft-margin SVM, the authors learn the linear classifier by minimizing a linear combination of the inverse of the margin and the training error.
- Various norms can be used to combine the two terms, and indeed many different algorithms have been explored for various combinations of ℓ1-norms and ℓ2-norms.
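Concretely (a sketch following the paper's notation, with the input partitioned into blocks $x_{ji}$, per-block weight vectors $w_j$, and fixed positive coefficients $d_j$), the SKM primal combines a weighted block ℓ1/ℓ2-norm of $w$ with hinge-loss slacks:

$$
\min_{w,\,b,\,\xi}\ \ \tfrac{1}{2}\Big(\sum_{j=1}^{m} d_j \|w_j\|_2\Big)^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.}\quad y_i\Big(\sum_{j} w_j^\top x_{ji} + b\Big) \ge 1 - \xi_i,\ \ \xi_i \ge 0.
$$

The ℓ1-norm across blocks is what drives entire blocks $w_j$ to zero, while the ℓ2-norm within each block does not penalize individual components toward sparsity.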

### 2.2.2. Conic duality and optimality conditions

- For a given optimization problem there are many ways of deriving a dual problem.
- Equations (a) and (b) are the same as in the classical SVM, where they define the notion of a "support vector".
- While the KKT conditions (a) and (b) refer to the index i over data points, the KKT conditions (c) and (d) refer to the index j over components of the input vector.
- These conditions thus imply a form of sparsity not over data points but over "input dimensions".
- Sparsity thus emerges from the optimization problem.

### 2.2.3. Kernelization

- The authors now "kernelize" the problem (P ) using this kernel function.
- The sparsity that emerges via the KKT conditions (c) and (d) now refers to the kernels K_j, and the authors refer to the kernels with nonzero η_j as "support kernels".
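A minimal sketch (plain Python for illustration; the helper name is hypothetical) of forming the conic combination K(η) = Σ_j η_j K_j of Gram matrices that underlies the kernelized problem:

```python
def combine_kernels(kernels, eta):
    """Conic combination K(eta) = sum_j eta_j * K_j of Gram matrices.

    `kernels` is a list of n x n matrices (lists of lists); `eta` holds
    nonnegative weights -- a conic combination of positive semidefinite
    kernels is again positive semidefinite, hence a valid kernel.
    """
    assert all(e >= 0 for e in eta), "conic combination needs eta >= 0"
    n = len(kernels[0])
    return [[sum(e * Kj[a][b] for e, Kj in zip(eta, kernels))
             for b in range(n)]
            for a in range(n)]

# Two toy 2x2 basis kernels; a weight of zero would mean the corresponding
# kernel is not a "support kernel".
K1 = [[1.0, 0.0], [0.0, 1.0]]
K2 = [[2.0, 1.0], [1.0, 2.0]]
K = combine_kernels([K1, K2], [1.0, 0.5])
```
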

### 2.3. Equivalence of the two formulations

- Care must be taken here, though. The weights η_j are defined for (L) as Lagrange multipliers, and for (D_K) through the anti-proportionality of orthogonal elements of a second-order cone; a priori they might not coincide: although (D_K) and (L) are equivalent, their dual problems have different formulations.
- It is straightforward, however, to write the KKT optimality conditions for (α, η) for both problems and verify that they are indeed equivalent.

### 3. Optimality conditions

- The authors formulate their problem (in either of its two equivalent forms) as the minimization of a non-differentiable convex function subject to linear constraints.
- Exact and approximate optimality conditions are then readily derived using subdifferentials.
- In later sections the authors will show how these conditions lead to an MY-regularized algorithmic formulation that will be amenable to SMO techniques.

### 3.2. Optimality conditions and subdifferential

- Elements of the subdifferential ∂J(α) are called subgradients.
- The notion of subdifferential is especially useful for characterizing optimality conditions of nonsmooth problems (Bertsekas, 1995).
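In standard convex-analysis terms (Bertsekas, 1995), a point $\alpha^\star$ minimizes a convex function $J$ over a convex set $\mathcal{C}$ if and only if some subgradient rules out all feasible descent directions:

$$
\exists\, g \in \partial J(\alpha^\star)\ \text{such that}\ g^\top(\alpha - \alpha^\star) \ge 0 \quad \forall\, \alpha \in \mathcal{C};
$$

in the unconstrained case this reduces to $0 \in \partial J(\alpha^\star)$.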

### 3.3. Approximate optimality conditions

- Note that for one kernel, i.e., when the SKM reduces to the SVM, this corresponds to the approximate KKT conditions usually employed for the standard SVM (Platt, 1998; Keerthi et al., 2001; Joachims, 1998).
- Indeed, the iterative algorithm that the authors present in Section 4 outputs a pair (α, η) and only these sufficient optimality conditions need to be checked.

### 3.4. Improving sparsity

- Indeed, if some of the kernels are close to identical, then some of the η's can potentially be removed-for a general SVM, the optimal α is not unique if data points coincide, and for a general SKM, the optimal α and η are not unique if data points or kernels coincide.
- When searching for the minimum ℓ0-norm η which satisfies the constraints (OPT3), the authors can thus consider a simple heuristic approach where they loop through all the nonzero η_j and check whether each such component can be removed.
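The heuristic can be sketched as follows (an illustrative sketch, not the authors' code; `still_feasible` is a hypothetical oracle standing in for the (OPT3)-style feasibility check):

```python
def prune_weights(eta, still_feasible):
    """Greedy sparsification of kernel weights (illustrative sketch).

    `still_feasible(candidate)` returns True when a candidate weight
    vector still satisfies the optimality constraints.  We loop over the
    nonzero components and drop each one whose removal keeps the
    candidate feasible, reducing the number of active kernels.
    """
    eta = list(eta)
    for j in range(len(eta)):
        if eta[j] == 0.0:
            continue
        trial = eta[:j] + [0.0] + eta[j + 1:]
        if still_feasible(trial):
            eta = trial  # component j can be safely removed
    return eta

# Toy oracle: "feasible" here just means the weights still sum to >= 1.
pruned = prune_weights([0.6, 0.6, 0.6], lambda e: sum(e) >= 1.0)
```

With this toy oracle the first component is dropped, after which neither remaining component can be removed without violating feasibility.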

### 4. Regularized support kernel machine

- The function J(α) is convex but not differentiable.
- It is well known that in this situation, steepest descent and coordinate descent methods do not necessarily converge to the global optimum (Bertsekas, 1995).
- SMO unfortunately falls into this class of methods.
- Therefore, in order to develop an SMO-like algorithm for the SKM, the authors make use of Moreau-Yosida regularization.
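Generically (the paper uses a weighted variant with per-block parameters $a_j$), the Moreau-Yosida envelope of a convex $J$ is

$$
J_\mu(\beta) \;=\; \min_{\alpha} \Big\{ J(\alpha) + \tfrac{1}{2\mu}\,\|\alpha - \beta\|_2^2 \Big\}, \qquad \mu > 0,
$$

which is convex and differentiable with $\tfrac{1}{\mu}$-Lipschitz gradient and has the same minimizers as $J$: it smooths the objective without moving its solutions.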

### 4.2. Solving the MY-regularized SKM using SMO

- Since the objective function G(α) is differentiable, the authors can now safely envisage an SMO-like approach, which consists in a sequence of local optimizations over only two components of α.
- In addition, caching and shrinking techniques (Joachims, 1998) that prevent redundant computations of kernel matrix values can also be employed.
- A difference between their setting and the SVM setting is the line search, which cannot be performed in closed form for the MY-regularized SKM.
- Since each line search is the minimization of a convex function, one can use efficient one-dimensional root finding, such as Brent's method (Brent, 1973).
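Any derivative-free one-dimensional minimizer works here. As a stand-in for Brent's method (which adds parabolic-interpolation steps for faster convergence), a minimal golden-section search on a convex function (illustrative sketch, not the authors' code):

```python
import math

def golden_section_min(f, lo, hi, tol=1e-8):
    """Minimize a convex (unimodal) function f on [lo, hi] by
    golden-section search: shrink the bracket by the inverse golden
    ratio each step, keeping the side with the smaller function value.
    """
    invphi = (math.sqrt(5.0) - 1.0) / 2.0  # 1/phi ~ 0.618
    a, b = lo, hi
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            # Minimum lies in [a, d]; old c becomes the new d.
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            # Minimum lies in [c, b]; old d becomes the new c.
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2.0

# Toy line search along one direction, with a convex 1-D objective.
t_star = golden_section_min(lambda t: (t - 0.3) ** 2, 0.0, 1.0)
```
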

### 4.4. A minimization algorithm

- In their simulations, the kernel matrices are all normalized, i.e., have unit diagonal, so the authors can choose all d_j equal.
- Once they are satisfied, the algorithm stops.
- Since each SMO optimization is performed on a differentiable function with Lipschitz gradient, and SMO is equivalent to steepest descent for the ℓ1-norm (Joachims, 1998), classical optimization results show that each of those SMO optimizations is finitely convergent (Bertsekas, 1995).
- Additional speed-ups can be easily achieved here.
- If for successive values of κ, some kernels have a zero weight, the authors might as well remove them from the algorithm and check after convergence if they can be safely kept out.

### 5. Simulations

- The authors compare the algorithm presented in Section 4.4 with solving the QCQP (L) using Mosek for two datasets, ionosphere and breast cancer, from the UCI repository, and nested subsets of the adult dataset from Platt (1998) .
- The basis kernels are Gaussian kernels on random subsets of features, with varying widths.
- The authors vary the number of kernels m for fixed number of data points n, and vice versa.
- Thus the algorithm presented in this paper appears to provide a significant improvement over Mosek in computational complexity, both in terms of the number of kernels and the number of data points.



##### Frequently Asked Questions

###### Q2. What have the authors stated for future works in "Multiple kernel learning, conic duality, and the smo algorithm" ?

The good scaling with respect to the number of data points makes it possible to learn kernels for large scale problems, while the good scaling with respect to the number of basis kernels opens up the possibility of application to largescale feature selection, in which the algorithm selects kernels that define non-linear mappings on subsets of input features.

###### Q3. What is the algorithm for learning kernels?

Their algorithm is based on applying sequential minimization techniques to a smoothed version of a convex nonsmooth optimization problem.

###### Q4. What is the main reason for the rise to prominence of the support vector machine?

One of the major reasons for the rise to prominence of the support vector machine (SVM) is its ability to cast nonlinear classification as a convex optimization problem, in particular a quadratic program (QP).

###### Q5. What is the optimality of the function J()?

Their stopping criterion, referred to as (ε1, ε2)-optimality, requires that the ε1-subdifferential is within ε2 of zero, and that the usual KKT conditions are met.

###### Q6. What is the simplest way to check the optimality of a given?

Checking this sufficient condition is a linear programming (LP) existence problem, i.e., find η such that

$$
\eta \ge 0,\qquad \eta_j = 0\ \text{if}\ j \notin \mathcal{J}_{\varepsilon_1}(\alpha),\qquad \sum_j d_j^2\,\eta_j = 1,\qquad (\mathrm{OPT3})
$$

$$
\max_{i \in I_M \cup I_{0-} \cup I_{C+}} \big\{ (K(\eta)\,D(y)\,\alpha)_i - y_i \big\} \;\le\; \min_{i \in I_M \cup I_{0+} \cup I_{C-}} \big\{ (K(\eta)\,D(y)\,\alpha)_i - y_i \big\} + 2\varepsilon_2,
$$

where $K(\eta) = \sum_{j \in \mathcal{J}_{\varepsilon_1}(\alpha)} \eta_j K_j$.

###### Q7. what is the a priori bound on aj?

In this section, the authors show that if the (a_j) are small enough, then an ε2/2-optimal solution α of the MY-regularized SKM, together with η̃(α), is an (ε1, ε2)-optimal solution of the SKM; an a priori bound on the (a_j) is obtained that does not depend on the solution α. Theorem 1: let 0 < ε < 1, let y ∈ {−1, 1}^n, and let K_j, j = 1, …, m, be m positive semidefinite kernel matrices.

###### Q8. What does the author mean by the title of the paper?

The title names the paper's three central ingredients: the multiple kernel learning problem, the conic (second-order cone) duality through which the authors reformulate it, and the SMO algorithm they adapt, via Moreau-Yosida regularization, to solve it efficiently.

###### Q9. What is the way to check the optimality of a given LP?

If in addition to having α, the authors know a potential candidate for η, then a sufficient condition for optimality is that this η verifies (OPT3), which doesn’t require solving the LP.

###### Q10. What is the difference between a multiple kernel learning problem and a quadratic program?

While the multiple kernel learning problem is convex, it is also non-smooth: it can be cast as the minimization of a non-differentiable function subject to linear constraints (see Section 3.1).

###### Q11. What is the inverse of the conic dual problem?

If the authors define the function G(α) as

$$
G(\alpha) = \min_{\gamma \in \mathbb{R}_+,\ \mu \in \mathbb{R}^m} \Big\{ \tfrac{1}{2}\gamma^2 + \tfrac{1}{2}\sum_j \frac{(\mu_j - \gamma d_j)^2}{a_j^2} - \sum_i \alpha_i \;:\; \Big\| \sum_i \alpha_i y_i x_{ji} \Big\|_2 \le \mu_j,\ \forall j \Big\},
$$

then the dual problem is equivalent to minimizing G(α) subject to 0 ≤ α ≤ C and α^⊤y = 0.