A Kernel Method for the Optimization of the Margin Distribution
F. Aiolli, G. Da San Martino, and A. Sperduti
Dept. of Pure and Applied Mathematics, Via Trieste 63, 35131 Padova - Italy
Abstract. Recent results in theoretical machine learning seem to suggest that nice properties of the margin distribution over a training set translate into good generalization performance of a classifier. The same principle has already been used in SVM and other kernel based methods, as the associated optimization problems try to maximize the minimum of these margins.
In this paper, we propose a kernel based method for the direct optimization of the margin distribution (KM-OMD). The method is motivated and analyzed from a game theoretical perspective. A quite efficient optimization algorithm is then proposed. Experimental results over a standard benchmark of 13 datasets clearly show state-of-the-art performance.
Keywords: Kernel Methods, Margin Distribution, Game Theory
1 Introduction
Much of the last decade's theoretical work on learning machines has been devoted to studying the aspects of learning methods that control the generalization performance. In essence, two main features seem to be responsible for the generalization performance of a classifier, namely, keeping the complexity of the hypothesis space low (e.g. by limiting the VC dimension) and producing models which achieve a large margin (i.e. confidence in the prediction) over the training set.
The good empirical effectiveness of two of the most popular algorithms, Support Vector Machines (SVM) and AdaBoost, has in fact been explained by the high margin classifiers they are able to produce. Specifically, hard margin SVMs return the hyperplane which keeps all the examples farthest away from it, thus maximizing the minimum of the margin over the training set (worst-case optimization of the margin distribution). Similarly, AdaBoost has been demonstrated to greedily minimize a loss function which is tightly related to the distribution of the margins on the training set. Despite AdaBoost's ability to optimize the margin distribution on the training set, it has been shown in [1] that in certain cases it can also increase the complexity of the weak hypotheses, thus possibly leading to overfitting phenomena.
The effect of the margin distribution on the generalization ability of learning machines has been studied in [2] and [3], while algorithms trying to explicitly optimize the margin distribution include [4], [5] and [6]. More recently, it has been shown [7] that quite good effectiveness can even be obtained by the optimization of the first moment of the margin distribution (the simple average value over the training set). In this
case, the problem can be solved very efficiently, since computing the model has time
complexity O(n).
In this paper, we propose a kernel machine which explicitly tries to optimize the
margin distribution. Specifically, this boils down to an optimization of a weighted com-
bination of margins, via a distribution over the examples, with appropriate constraints
related to the entropy (as a measure of complexity) of the distribution.
In Section 1.1 some notation used throughout the paper is introduced. In Section 2 a game-theoretical interpretation of the hard margin SVM is given in the bipartite instance ranking framework (i.e. the problem of inducing a separation between positive and negative instances in a binary task), and the problem of optimizing the margin distribution is studied from the same perspective. This game-theoretic analysis leads us to a simple method for optimizing the distribution of the margins. Then, in Section 3, an efficient optimization algorithm is derived. Experimental results are presented in Section 4. Finally, conclusions are drawn.
1.1 Notation and Background
In the context of binary classification tasks, the aim of a learning algorithm is to return a classifier which minimizes the error on an (unknown) distribution $\mathcal{D}_{X \times Y}$ of input/output pairs $(x_i, y_i)$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$. The input to the algorithm is a set of pre-classified examples $S = \{(x_1, y_1), \dots, (x_N, y_N)\}$. With $S^+ = \{x^+_1, \dots, x^+_p\}$ we denote the set of $p$ positive instances, where $x^+_i$ is the $i$-th positive instance in $S$. Similarly, $S^- = \{x^-_1, \dots, x^-_n\}$ denotes the set of $n$ negative instances. Clearly, $N = n + p$.
In this paper, we denote by $\Gamma_m \subset \mathbb{R}^m$ the set of $m$-dimensional probability vectors, i.e. $\Gamma_m = \{\gamma \in \mathbb{R}^m \mid \sum_{i=1}^m \gamma_i = 1,\ \gamma_i \geq 0\}$. The convex hull $ch(C)$ of a set $C = \{c_1, \dots, c_m \mid c_i \in \mathbb{R}^d\}$ is the set of all affine combinations of points in $C$ such that the weights $\gamma_i$ of the combination are non-negative, i.e. $ch(C) = \{\gamma_1 c_1 + \dots + \gamma_m c_m \mid \gamma \in \Gamma_m\}$. We also generalize this definition by defining the $\eta$-norm-convex hull of a set $C \subseteq \mathbb{R}^d$ as the subset of $ch(C)$ whose weights have (squared) norm smaller than a given value $\eta$, i.e. $ch_\eta(C) = \{\gamma_1 c_1 + \dots + \gamma_m c_m \in ch(C) \mid \|\gamma\|^2 \leq \eta,\ \tfrac{1}{m} \leq \eta \leq 1\}$. Note that, whenever $\eta = \tfrac{1}{m}$, a trivial set consisting of a single point (the average of the points in $C$) is obtained, while whenever $\eta = 1$ this set coincides with the convex hull.
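To make the definitions above concrete, here is a minimal Python sketch (not from the paper; the function names `in_eta_simplex` and `norm_convex_point` are ours) that checks whether a weight vector belongs to $\Gamma_m$ with $\|\gamma\|^2 \leq \eta$ and, if so, returns the corresponding point of $ch_\eta(C)$:

```python
import numpy as np

def in_eta_simplex(gamma, eta, tol=1e-9):
    """Check that gamma is in Gamma_m (non-negative, sums to 1) and ||gamma||^2 <= eta."""
    gamma = np.asarray(gamma, dtype=float)
    return (np.all(gamma >= -tol)
            and abs(gamma.sum() - 1.0) <= tol
            and gamma @ gamma <= eta + tol)

def norm_convex_point(C, gamma, eta):
    """Return the point gamma_1*c_1 + ... + gamma_m*c_m of ch_eta(C), if gamma is admissible."""
    C, gamma = np.asarray(C, dtype=float), np.asarray(gamma, dtype=float)
    if not in_eta_simplex(gamma, eta):
        raise ValueError("gamma is not an admissible weighting for ch_eta(C)")
    return gamma @ C  # convex combination of the rows of C

# With eta = 1/m only the uniform weighting is admissible, and the result is the centroid of C.
C = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
print(norm_convex_point(C, np.ones(3) / 3, eta=1/3))  # ≈ [0.667, 0.667] (the centroid of C)
```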
2 Game theory, learning and margin distribution
A binary classification problem can be viewed from two different points of view. Specifically, let $h \in H$ be a hypothesis from a hypothesis space $H$ of functions mapping instances to real values. In a first scenario, let us call it instance classification, a given hypothesis is said to be consistent with a training set if $y_i h(x_i) > 0$ (classification constraints) for each example of the training set. In this case, the prediction on new instances can be performed by using the sign of the decision function.
In a second scenario, which we may call bipartite instance ranking, a given hypothesis is said to be consistent with a training set if $h(x^+_i) - h(x^-_j) > 0$ (order constraints) for any positive instance $x^+_i$ and any negative instance $x^-_j$. Note that when a hypothesis is consistent, it is always possible to define a threshold which correctly separates positive from negative instances in the training set. In this paper, we mainly focus on this second view, even if a similar treatment can be pursued for the other setting.
In the following, we give an interpretation of the hard-margin SVM as a two-player zero-sum game in the bipartite instance ranking scenario presented above. First of all, we recall that, in the classification context, the formulation of the learning problem is based on the maximization of the minimum margin in the training set. Then, we propose to slightly modify the pay-off function of the game in order to have a flexible way to control the optimization w.r.t. the distribution of the margin in the training set.
2.1 Hard Margin SVM as a zero-sum game
Consider the following zero-sum game defined for a bipartite instance ranking scenario. Let $P_{MIN}$ (nature) and $P_{MAX}$ (the learner) be the two players. On each round of the game, $P_{MAX}$ picks a hypothesis $h$ from a given hypothesis space $H$, while (simultaneously) $P_{MIN}$ picks a pair of instances of different classes $z = (x^+, x^-) \in S^+ \times S^-$. $P_{MAX}$ wants to maximize its pay-off, defined as the margin $\rho_h(z)$ achieved on the pair of examples, which, in this particular setting, can be defined as $h(x^+) - h(x^-)$. Note that the value of the margin defined in this way is consistent with the bipartite instance ranking setting, since it is greater than zero whenever the order constraint is satisfied for the pair, and less than zero otherwise.
Considering the hypothesis space of hyperplanes defined by unit-length weight vectors
$$H = \{h(x) = w^\top x - \theta \mid w \in \mathbb{R}^d \text{ s.t. } \|w\| = 1, \text{ and } \theta \in \mathbb{R}\},$$
the margin is defined by the difference of the scores of the instances, that is
$$\rho_w(x^+, x^-) = w^\top x^+ - \theta - w^\top x^- + \theta = w^\top (x^+ - x^-).$$
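As a quick sanity check that the bias $\theta$ cancels in the pairwise margin, here is a tiny illustrative sketch (ours, not the paper's):

```python
import numpy as np

def score(w, theta, x):
    """Hyperplane score h(x) = w^T x - theta."""
    return w @ x - theta

def pair_margin(w, x_pos, x_neg):
    """Margin rho_w(x+, x-) = w^T (x+ - x-); note that theta cancels out."""
    return w @ (x_pos - x_neg)

w = np.array([0.6, 0.8])                      # unit-length weight vector
x_pos, x_neg = np.array([2.0, 1.0]), np.array([0.5, 0.5])
theta = 0.3
assert np.isclose(score(w, theta, x_pos) - score(w, theta, x_neg),
                  pair_margin(w, x_pos, x_neg))
```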
Let now a mixed strategy for the $P_{MIN}$ player be given, defined by $\gamma^+ \in \Gamma_p$, the probability of each positive instance to be selected, and $\gamma^- \in \Gamma_n$, the corresponding probabilities for the negative instances. We can assume that these probabilities can be marginalized, as the associated events are independent. In other words, the probability of picking a pair $(x^+_i, x^-_j)$ is simply given by $\gamma^+_i \gamma^-_j$. Hence, the value of the game, i.e. the expected margin obtained in a game, will be:
$$
\begin{aligned}
V((\gamma^+, \gamma^-), w) &= \sum_{i,j} \gamma^+_i \gamma^-_j\, w^\top (x^+_i - x^-_j) \\
&= w^\top \Big( \sum_i \gamma^+_i x^+_i \big(\textstyle\sum_j \gamma^-_j\big) - \sum_j \gamma^-_j x^-_j \big(\textstyle\sum_i \gamma^+_i\big) \Big) \\
&= w^\top \Big( \sum_i \gamma^+_i x^+_i - \sum_j \gamma^-_j x^-_j \Big)
\end{aligned}
$$
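The marginalization step above can be checked numerically; the following sketch (illustrative only, with randomly generated data) computes the expected margin both as the explicit double sum over pairs and in the compact form $w^\top\big(\sum_i \gamma^+_i x^+_i - \sum_j \gamma^-_j x^-_j\big)$:

```python
import numpy as np

rng = np.random.default_rng(0)
S_pos = rng.normal(1.0, 1.0, size=(5, 3))     # positive instances
S_neg = rng.normal(-1.0, 1.0, size=(8, 3))    # negative instances
w = rng.normal(size=3)
w /= np.linalg.norm(w)                        # unit-length hypothesis
g_pos = rng.dirichlet(np.ones(len(S_pos)))    # mixed strategy over positives
g_neg = rng.dirichlet(np.ones(len(S_neg)))    # mixed strategy over negatives

# Expected margin as the explicit double sum over pairs (x+_i, x-_j) ...
V_pairs = sum(g_pos[i] * g_neg[j] * (w @ (S_pos[i] - S_neg[j]))
              for i in range(len(S_pos)) for j in range(len(S_neg)))
# ... and in the compact form w^T (sum_i g+_i x+_i - sum_j g-_j x-_j); the two coincide.
V_compact = w @ (g_pos @ S_pos - g_neg @ S_neg)
assert np.isclose(V_pairs, V_compact)
```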
Then, when the player $P_{MIN}$ is left free to choose its strategy, we obtain the following problem, which determines the equilibrium of the game, that is
$$\min_{\gamma^+ \in \Gamma_p,\ \gamma^- \in \Gamma_n}\ \max_{w \in H}\ w^\top \Big( \sum_i \gamma^+_i x^+_i - \sum_j \gamma^-_j x^-_j \Big)$$
Now, it is easy to realize that a pure strategy is readily available to the $P_{MAX}$ player. In fact, it can maximize its pay-off by setting
$$\hat{w} = \begin{cases} \text{any } w \in H & \text{if } v(\gamma^+, \gamma^-) = 0 \\[4pt] \dfrac{v(\gamma^+, \gamma^-)}{\|v(\gamma^+, \gamma^-)\|} & \text{otherwise} \end{cases}, \qquad \text{where } v(\gamma^+, \gamma^-) = \sum_i \gamma^+_i x^+_i - \sum_i \gamma^-_i x^-_i.$$
Note that the condition $v(\gamma^+, \gamma^-) = 0$ implies that the (signed) examples $y_i x_i$ are not linearly independent, i.e. there exists a linear combination of these instances, with not all coefficients null, which is equal to the null vector. This condition is demonstrated to be necessary and sufficient for the non-linear-separability of a set (see [8]).
When the optimal strategy for $P_{MAX}$ has been chosen, the expected value of the margin according to the probability distributions $(\gamma^+, \gamma^-)$, i.e. the value of the game, will be:
$$E[\rho_{\hat{w}}(x^+, x^-)] = \hat{w}^\top v(\gamma^+, \gamma^-) = \|v(\gamma^+, \gamma^-)\|$$
Note that the vector $v(\gamma^+, \gamma^-)$ is defined by the difference of two vectors lying in the convex hulls of the positive and negative instances respectively, $v^+ = \sum_i \gamma^+_i x^+_i \in ch(S^+)$ and $v^- = \sum_i \gamma^-_i x^-_i \in ch(S^-)$. Moreover, when $\gamma^+$ and $\gamma^-$ are uniform on their respective sets, the vector $v(\gamma^+, \gamma^-)$ will be the difference between the average points of the two sets (a.k.a. their centroids).
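The best response of $P_{MAX}$ and the resulting game value can be made concrete with a few lines of code (again an illustrative sketch with random data, assuming $v(\gamma^+, \gamma^-) \neq 0$):

```python
import numpy as np

rng = np.random.default_rng(1)
S_pos = rng.normal(1.0, 1.0, size=(5, 3))
S_neg = rng.normal(-1.0, 1.0, size=(8, 3))
g_pos, g_neg = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(8))

v = g_pos @ S_pos - g_neg @ S_neg        # v(g+, g-) = v+ - v-, difference of two hull points
w_hat = v / np.linalg.norm(v)            # pure best response of P_MAX (assuming v != 0)
game_value = w_hat @ v                   # expected margin E[rho_w_hat] ...
assert np.isclose(game_value, np.linalg.norm(v))  # ... equals ||v(g+, g-)||

# With uniform strategies, v is simply the difference of the two class centroids.
v_uniform = S_pos.mean(axis=0) - S_neg.mean(axis=0)
```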
Now, we are able to show that the best strategy for $P_{MIN}$ is the solution obtained by an SVM. For this, let us rewrite the vector $v(\gamma^+, \gamma^-)$ using a single vector of parameters,
$$v(\gamma^+, \gamma^-) \equiv v(\gamma) = \sum_{i=1}^{N} y_i \gamma_i x_i,$$
which can be obtained by a simple change of variables
$$\gamma_i = \begin{cases} \gamma^+_r & \text{if } x_i \text{ is the } r\text{-th positive example} \\ \gamma^-_r & \text{if } x_i \text{ is the } r\text{-th negative example.} \end{cases}$$
Using the fact that minimizing the squared norm is equivalent to minimizing the norm itself, we may formulate the optimization problem to compute the best strategy for $P_{MIN}$ (which aims to minimize the value of the game):
$$\min_{\gamma^+ \in \Gamma_p,\ \gamma^- \in \Gamma_n} \|v(\gamma^+, \gamma^-)\| \;\equiv\; \left\{ \begin{array}{l} \min_\gamma \|v(\gamma)\|^2 \\ \text{s.t. } \sum_{i: y_i = y} \gamma_i = 1,\ y \in \{-1, +1\},\ \text{and } \gamma_i \geq 0 \end{array} \right. \qquad (1)$$
As already demonstrated in [9], the problem on the right of Eq. (1) is the same as the hard margin SVM when a bias term is present. Specifically, the bias term is chosen as the score of the point standing in the middle between the points $v^+$ and $v^-$, i.e.
$$\theta = \tfrac{1}{2}\, \hat{w}^\top (v^+ + v^-).$$
Then, the solutions of the two problems are the same. Specifically, the solution maximizes the minimum of the margins in the training set. Clearly, when the training set is not linearly separable, the solution of the problem in Eq. (1) will be $v(\gamma) = 0$.
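For illustration, problem (1) can be solved with any generic quadratic programming routine; the sketch below uses SciPy's SLSQP solver on a linear (input-space) version of the problem. This is not the paper's algorithm (Section 3 derives a dedicated, more efficient procedure), and the function name `hard_margin_gamma` is ours:

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_gamma(X, y):
    """Solve min_gamma ||sum_i y_i gamma_i x_i||^2 subject to the per-class weights
    summing to 1 and gamma >= 0, i.e. the right-hand problem of Eq. (1)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    N = len(y)
    Z = X * y[:, None]                                    # signed examples y_i x_i
    fun = lambda g: np.sum((Z.T @ g) ** 2)                # ||v(gamma)||^2
    jac = lambda g: 2.0 * Z @ (Z.T @ g)                   # its gradient
    cons = [{'type': 'eq', 'fun': lambda g, s=s: g[y == s].sum() - 1.0}
            for s in (+1, -1)]                            # one simplex constraint per class
    g0 = np.where(y == +1, 1.0 / np.sum(y == +1), 1.0 / np.sum(y == -1))
    res = minimize(fun, g0, jac=jac, bounds=[(0.0, 1.0)] * N,
                   constraints=cons, method='SLSQP')
    return res.x

# Example usage with toy data:
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
gamma = hard_margin_gamma(X, y)
```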
2.2 Playing with Margin Distributions
The maximization of the minimum margin is not necessarily the optimal choice when dealing with a classification task. In fact, many recent works, including [4, 3], have demonstrated that the generalization error depends more properly on the distribution of the (lowest) margins in the training set.
Our main idea is then to construct a problem which makes it easy to play with the margin distribution. Specifically, we aim at a formulation that allows us to specify a given trade-off between the minimal value and the average value of the margin on the training set.
For this, we extend the previous game by considering a further cost for the player $P_{MIN}$. Specifically, we want to penalize, to a given extent, strategies that are too pure, so as to obtain solutions which are robust with respect to different training example distributions. In this way, we expect to reduce the variance of the best-strategy estimate when different training sets are drawn from the true distribution of examples $\mathcal{D}_{X \times Y}$. A good measure of the complexity of the $P_{MIN}$ behavior would certainly be the normalized entropy of its strategy, which can be defined by
$$E(\gamma) = \frac{1}{2} \left( \frac{1}{\log(p)} \sum_i \gamma^+_i \log\frac{1}{\gamma^+_i} + \frac{1}{\log(n)} \sum_i \gamma^-_i \log\frac{1}{\gamma^-_i} \right),$$
which has maximum value 1 whenever $\gamma$ is the uniform distribution on both sets (completely unpredictable strategy) and is 0 when the distribution is peaked on a single example per set (completely predictable pure strategy).
However, a (simpler) approximate version of the entropy defined above can be obtained by considering the 2-norm of the distribution. In fact, it is well known that, for any distribution $\gamma \in \Gamma_m$, it always holds that $\|\gamma\|_2 \leq \|\gamma\|_1$. Moreover, $\|\gamma\|_2$ is minimal whenever $\gamma$ is a uniform distribution and is equal to 1 whenever $\gamma$ is a pure strategy. Specifically, we can consider the following approximation:
$$E(\gamma) \approx \frac{m}{m-1}\left(1 - \|\gamma\|_2^2\right)$$
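A small numerical check of the relation between the normalized entropy and its 2-norm approximation (illustrative sketch; function names are ours):

```python
import numpy as np

def normalized_entropy(gamma_pos, gamma_neg):
    """E(gamma): average of the two entropies, each normalized by the log of its set size."""
    def h(g):
        m = len(g)
        nz = g[g > 0]                       # convention: 0 * log(1/0) = 0
        return np.sum(nz * np.log(1.0 / nz)) / np.log(m)
    return 0.5 * (h(gamma_pos) + h(gamma_neg))

def entropy_approx(gamma):
    """2-norm based approximation m/(m-1) * (1 - ||gamma||^2) for a single distribution."""
    m = len(gamma)
    return m / (m - 1) * (1.0 - gamma @ gamma)

# Uniform strategies give value 1, pure strategies give value 0, for both measures.
p, n = 5, 8
uniform_p, uniform_n = np.ones(p) / p, np.ones(n) / n
pure_p = np.eye(p)[0]
print(normalized_entropy(uniform_p, uniform_n))           # ≈ 1.0
print(entropy_approx(uniform_p), entropy_approx(pure_p))  # ≈ 1.0 and 0.0
```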
Considering the squared norm of the distribution, we can reformulate the strategy of $P_{MIN}$ as a trade-off between two objective functions, with a trade-off parameter $\lambda$:
$$\min_{\gamma^+ \in \Gamma_p,\ \gamma^- \in \Gamma_n}\ (1 - \lambda)\|v(\gamma)\|^2 + \lambda\|\gamma\|^2 \qquad (2)$$
It can be shown that the optimal vector $v(\hat{\gamma})$ which solves the problem above represents the vector joining two points, $v^+$ in the positive norm-restricted convex hull, i.e. $v^+ \in ch_\eta(S^+)$, and $v^-$ in the negative norm-restricted convex hull, i.e. $v^- \in ch_\eta(S^-)$, for an appropriate $\eta$.
Similarly to the hard margin SVM, the threshold is defined as the score of the point which is in the middle between these two points, i.e. $\theta = \tfrac{1}{2}\, \hat{w}^\top (v^+ + v^-)$.
Finally, it is straightforward to see that this method generalizes (when $\lambda = 1$) the baseline method presented in [10], where the simple difference between the centroid of the positives and the centroid of the negatives is used as the weight vector, and it obviously generalizes the hard-margin SVM for $\lambda = 0$.
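To make Eq. (2) concrete, the following sketch solves a linear (input-space) version of the problem with a generic solver and then recovers $\hat{w}$ and $\theta$ as described in the text. It is only an illustration under these assumptions: the paper's KM-OMD works with kernels and uses the dedicated optimization algorithm of Section 3, and the name `km_omd_gamma` is ours.

```python
import numpy as np
from scipy.optimize import minimize

def km_omd_gamma(X, y, lam):
    """Sketch of Eq. (2): minimize (1 - lam) * ||v(gamma)||^2 + lam * ||gamma||^2
    over the two per-class probability simplexes (linear, input-space version)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    N = len(y)
    Z = X * y[:, None]                                    # signed examples y_i x_i
    fun = lambda g: (1 - lam) * np.sum((Z.T @ g) ** 2) + lam * np.sum(g ** 2)
    jac = lambda g: 2 * (1 - lam) * Z @ (Z.T @ g) + 2 * lam * g
    cons = [{'type': 'eq', 'fun': lambda g, s=s: g[y == s].sum() - 1.0}
            for s in (+1, -1)]                            # per-class simplex constraints
    g0 = np.where(y == +1, 1.0 / np.sum(y == +1), 1.0 / np.sum(y == -1))
    g = minimize(fun, g0, jac=jac, bounds=[(0.0, 1.0)] * N,
                 constraints=cons, method='SLSQP').x
    # Recover the direction and threshold as described in the text.
    v_pos = g[y == +1] @ X[y == +1]
    v_neg = g[y == -1] @ X[y == -1]
    w_hat = (v_pos - v_neg) / np.linalg.norm(v_pos - v_neg)
    theta = 0.5 * w_hat @ (v_pos + v_neg)
    return g, w_hat, theta
```

Note that for $\lambda = 1$ the squared-norm term alone is minimized, so $\gamma$ becomes uniform on each class and $\hat{w}$ reduces to the (normalized) difference of the class centroids, while for $\lambda = 0$ the sketch reduces to problem (1).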
References
Soft Margins for AdaBoost
A fast iterative nearest point algorithm for support vector machine classifier design
How boosting the margin can also boost classifier complexity
A Mathematical Programming Approach to the Kernel Fisher Algorithm
Frequently Asked Questions
Q1. What contributions have the authors mentioned in the paper "A kernel method for the optimization of the margin distribution"?

In this paper, the authors propose a kernel based method for the direct optimization of the margin distribution (KM-OMD).

In future work, the authors would like to study under which conditions (e.g. conditions related to the data distribution) their method is to be preferred to other state-of-the-art methods. Htm and its optimization could be another direction of their future research. All datasets can be downloaded from: http://ida.first.

The two plots clearly show that learning with low λ values requires more training time, whereas models for higher λ values are faster to compute. 

It is worth noting that, in most cases, even high λ values (for which the models are much faster to train) still give good performance, or at least acceptable performance when computational time is an issue.