A Kernel Method for the Optimization of the Margin Distribution
F. Aiolli, G. Da San Martino, and A. Sperduti
Dept. of Pure and Applied Mathematics, Via Trieste 63, 35131 Padova - Italy
Abstract. Recent results in theoretical machine learning seem to suggest that nice properties of the margin distribution over a training set translate into good generalization performance of a classifier. The same principle has already been used in SVM and other kernel based methods, as the associated optimization problems try to maximize the minimum of these margins.
In this paper, we propose a kernel based method for the direct optimization of the margin distribution (KM-OMD). The method is motivated and analyzed from a game theoretical perspective. A quite efficient optimization algorithm is then proposed. Experimental results over a standard benchmark of 13 datasets clearly show state-of-the-art performance.
Keywords: Kernel Methods, Margin Distribution, Game Theory
1 Introduction
Much of the last decade's theoretical work on learning machines has been devoted to studying the aspects of learning methods that control the generalization performance. In essence, two main features seem to be responsible for the generalization performance of a classifier, namely, keeping the complexity of the hypothesis space low (e.g. by limiting the VC dimension) and producing models which achieve a large margin (i.e. confidence in the prediction) over the training set.
The good empirical effectiveness of two of the most popular algorithms, Support Vector Machines (SVM) and AdaBoost, has in fact been explained by the high margin classifiers they are able to produce. Specifically, hard margin SVMs return the hyperplane which keeps all the examples farthest away from it, thus maximizing the minimum of the margin over the training set (worst-case optimization of the margin distribution). Similarly, AdaBoost has been demonstrated to greedily minimize a loss function which is tightly related to the distribution of the margins on the training set. Despite AdaBoost's ability to optimize the margin distribution on the training set, it has been shown in [1] that in certain cases it can also increase the complexity of the weak hypotheses, thus possibly leading to overfitting phenomena.
The effect of the margin distribution on the generalization ability of learning machines has been studied in [2] and [3], while algorithms trying to explicitly optimize the margin distribution include [4], [5] and [6]. More recently, it has been shown [7] that quite good effectiveness can even be obtained by the optimization of the first moment of the margin distribution (the simple average value over the training set). In this
case, the problem can be solved very efficiently, since computing the model has time
complexity O(n).
In this paper, we propose a kernel machine which explicitly tries to optimize the
margin distribution. Specifically, this boils down to an optimization of a weighted com-
bination of margins, via a distribution over the examples, with appropriate constraints
related to the entropy (as a measure of complexity) of the distribution.
In Section 1.1 some notation used throughout the paper is introduced. In Section 2 a game-theoretical interpretation of the hard margin SVM is given in the bipartite instance ranking framework (i.e. the problem of inducing a separation between positive and negative instances in a binary task), and the problem of optimizing the margin distribution is studied from the same perspective. This game-theoretic analysis leads us to a simple method for optimizing the distribution of the margins. Then, in Section 3, an efficient optimization algorithm is derived. Experimental results are presented in Section 4. Finally, conclusions are drawn.
1.1 Notation and Background
In the context of binary classification tasks, the aim of a learning algorithm is to return a classifier which minimizes the error on an (unknown) distribution $\mathcal{D}_{X \times Y}$ of input/output pairs $(x_i, y_i)$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$. The input to the algorithm is a set of pre-classified examples $S = \{(x_1, y_1), \dots, (x_N, y_N)\}$. With $S^+ = \{x^+_1, \dots, x^+_p\}$ we denote the set of $p$ positive instances, where $x^+_i$ is the $i$-th positive instance in $S$. Similarly, $S^- = \{x^-_1, \dots, x^-_n\}$ denotes the set of $n$ negative instances. Clearly, $N = n + p$.
In this paper, we denote by $\Gamma_m \subset \mathbb{R}^m$ the set of $m$-dimensional probability vectors, i.e. $\Gamma_m = \{\gamma \in \mathbb{R}^m \mid \sum_{i=1}^m \gamma_i = 1,\ \gamma_i \geq 0\}$. The convex hull $ch(C)$ of a set $C = \{c_1, \dots, c_m \mid c_i \in \mathbb{R}^d\}$ is the set of all affine combinations of points in $C$ such that the weights $\gamma_i$ of the combination are non-negative, i.e. $ch(C) = \{\gamma_1 c_1 + \dots + \gamma_m c_m \mid \gamma \in \Gamma_m\}$. We also generalize this definition by defining the $\eta$-norm-convex hull of a set $C \subseteq \mathbb{R}^d$ as the subset of $ch(C)$ whose weights have (squared) norm smaller than a given value $\eta$, i.e. $ch_\eta(C) = \{\gamma_1 c_1 + \dots + \gamma_m c_m \in ch(C) \mid \|\gamma\|^2 \leq \eta,\ \tfrac{1}{m} \leq \eta \leq 1\}$. Note that, whenever $\eta = \tfrac{1}{m}$, a trivial set consisting of a single point (the average of the points in $C$) is obtained, while whenever $\eta = 1$ this set coincides with the convex hull.
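To make the definitions above concrete, here is a minimal Python sketch (not from the paper; the function names `in_eta_simplex` and `norm_convex_point` are ours) that checks whether a weight vector belongs to $\Gamma_m$ with $\|\gamma\|^2 \leq \eta$ and, if so, returns the corresponding point of $ch_\eta(C)$:

```python
import numpy as np

def in_eta_simplex(gamma, eta, tol=1e-9):
    """Check that gamma is in Gamma_m (non-negative, sums to 1) and ||gamma||^2 <= eta."""
    gamma = np.asarray(gamma, dtype=float)
    return (np.all(gamma >= -tol)
            and abs(gamma.sum() - 1.0) <= tol
            and gamma @ gamma <= eta + tol)

def norm_convex_point(C, gamma, eta):
    """Return the point gamma_1*c_1 + ... + gamma_m*c_m of ch_eta(C), if gamma is admissible."""
    C, gamma = np.asarray(C, dtype=float), np.asarray(gamma, dtype=float)
    if not in_eta_simplex(gamma, eta):
        raise ValueError("gamma is not an admissible weighting for ch_eta(C)")
    return gamma @ C  # convex combination of the rows of C

# With eta = 1/m only the uniform weighting is admissible, and the result is the centroid of C.
C = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
print(norm_convex_point(C, np.ones(3) / 3, eta=1/3))  # ≈ [0.667, 0.667] (the centroid of C)
```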
2 Game theory, learning and margin distribution
A binary classification problem can be viewed from two different points of view. Specifically, let $h \in H$ be a hypothesis from a hypothesis space $H$ of functions mapping instances to real values. In a first scenario, let us call it instance classification, a given hypothesis is said to be consistent with a training set if $y_i h(x_i) > 0$ (classification constraints) for each example of the training set. In this case, the prediction on new instances can be performed by using the sign of the decision function.
In a second scenario, which we may call bipartite instance ranking, a given hypothesis is said to be consistent with a training set if $h(x^+_i) - h(x^-_j) > 0$ (order constraints) for any positive instance $x^+_i$ and any negative instance $x^-_j$. Note that when a hypothesis is consistent, it is always possible to define a threshold which correctly separates positive from negative instances in the training set. In this paper, we mainly focus on this second view, even if a similar treatment can be pursued for the other setting.
In the following, we give an interpretation of the hard-margin SVM as a two-player zero-sum game in the bipartite instance ranking scenario presented above. First of all, we recall that, in the classification context, the formulation of the learning problem is based on the maximization of the minimum margin in the training set. Then, we propose to slightly modify the pay-off function of the game in order to have a flexible way to control the optimization w.r.t. the distribution of the margin in the training set.
2.1 Hard Margin SVM as a zero-sum game
Consider the following zero-sum game defined for a bipartite instance ranking scenario. Let $P_{MIN}$ (nature) and $P_{MAX}$ (the learner) be the two players. On each round of the game, $P_{MAX}$ picks a hypothesis $h$ from a given hypothesis space $H$, while (simultaneously) $P_{MIN}$ picks a pair of instances of different classes $z = (x^+, x^-) \in S^+ \times S^-$. $P_{MAX}$ wants to maximize its pay-off, defined as the margin $\rho_h(z)$ achieved on the pair of examples, which, in this particular setting, can be defined as $h(x^+) - h(x^-)$. Note that the value of the margin defined in this way is consistent with the bipartite instance ranking setting, since it is greater than zero whenever the order constraint is satisfied for the pair, and less than zero otherwise.
Considering the hypothesis space of hyperplanes defined by unit-length weight vectors
$$H = \{h(x) = w^\top x - \theta \mid w \in \mathbb{R}^d \text{ s.t. } \|w\| = 1, \text{ and } \theta \in \mathbb{R}\},$$
the margin is defined by the difference of the scores of the instances, that is
$$\rho_w(x^+, x^-) = w^\top x^+ - \theta - w^\top x^- + \theta = w^\top (x^+ - x^-).$$
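As a quick sanity check that the bias $\theta$ cancels in the pairwise margin, here is a tiny illustrative sketch (ours, not the paper's):

```python
import numpy as np

def score(w, theta, x):
    """Hyperplane score h(x) = w^T x - theta."""
    return w @ x - theta

def pair_margin(w, x_pos, x_neg):
    """Margin rho_w(x+, x-) = w^T (x+ - x-); note that theta cancels out."""
    return w @ (x_pos - x_neg)

w = np.array([0.6, 0.8])                      # unit-length weight vector
x_pos, x_neg = np.array([2.0, 1.0]), np.array([0.5, 0.5])
theta = 0.3
assert np.isclose(score(w, theta, x_pos) - score(w, theta, x_neg),
                  pair_margin(w, x_pos, x_neg))
```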
Let now a mixed strategy for the $P_{MIN}$ player be given, defined by $\gamma^+ \in \Gamma_p$, the probability of each positive instance to be selected, and $\gamma^- \in \Gamma_n$, the corresponding probabilities for the negative instances. We can assume that these probabilities can be marginalized, as the associated events are independent. In other words, the probability of picking a pair $(x^+_i, x^-_j)$ is simply given by $\gamma^+_i \gamma^-_j$. Hence, the value of the game, i.e. the expected margin obtained in a game, will be:
$$
\begin{aligned}
V((\gamma^+, \gamma^-), w) &= \sum_{i,j} \gamma^+_i \gamma^-_j\, w^\top (x^+_i - x^-_j) \\
&= w^\top \Big( \sum_i \gamma^+_i x^+_i \big(\textstyle\sum_j \gamma^-_j\big) - \sum_j \gamma^-_j x^-_j \big(\textstyle\sum_i \gamma^+_i\big) \Big) \\
&= w^\top \Big( \sum_i \gamma^+_i x^+_i - \sum_j \gamma^-_j x^-_j \Big)
\end{aligned}
$$
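The marginalization step above can be checked numerically; the following sketch (illustrative only, with randomly generated data) computes the expected margin both as the explicit double sum over pairs and in the compact form $w^\top\big(\sum_i \gamma^+_i x^+_i - \sum_j \gamma^-_j x^-_j\big)$:

```python
import numpy as np

rng = np.random.default_rng(0)
S_pos = rng.normal(1.0, 1.0, size=(5, 3))     # positive instances
S_neg = rng.normal(-1.0, 1.0, size=(8, 3))    # negative instances
w = rng.normal(size=3)
w /= np.linalg.norm(w)                        # unit-length hypothesis
g_pos = rng.dirichlet(np.ones(len(S_pos)))    # mixed strategy over positives
g_neg = rng.dirichlet(np.ones(len(S_neg)))    # mixed strategy over negatives

# Expected margin as the explicit double sum over pairs (x+_i, x-_j) ...
V_pairs = sum(g_pos[i] * g_neg[j] * (w @ (S_pos[i] - S_neg[j]))
              for i in range(len(S_pos)) for j in range(len(S_neg)))
# ... and in the compact form w^T (sum_i g+_i x+_i - sum_j g-_j x-_j); the two coincide.
V_compact = w @ (g_pos @ S_pos - g_neg @ S_neg)
assert np.isclose(V_pairs, V_compact)
```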
Then, when the player $P_{MIN}$ is left free to choose its strategy, we obtain the following problem, which determines the equilibrium of the game, that is
$$\min_{\gamma^+ \in \Gamma_p,\ \gamma^- \in \Gamma_n}\ \max_{w \in H}\ w^\top \Big( \sum_i \gamma^+_i x^+_i - \sum_j \gamma^-_j x^-_j \Big)$$
Now, it is easy to realize that a pure strategy is readily available to the $P_{MAX}$ player. In fact, it can maximize its pay-off by setting
$$\hat{w} = \begin{cases} \text{any } w \in H & \text{if } v(\gamma^+, \gamma^-) = 0 \\[4pt] \dfrac{v(\gamma^+, \gamma^-)}{\|v(\gamma^+, \gamma^-)\|} & \text{otherwise} \end{cases}, \qquad \text{where } v(\gamma^+, \gamma^-) = \sum_i \gamma^+_i x^+_i - \sum_i \gamma^-_i x^-_i.$$
Note that the condition $v(\gamma^+, \gamma^-) = 0$ implies that the (signed) examples $y_i x_i$ are not linearly independent, i.e. there exists a linear combination of these instances, with not all coefficients null, which is equal to the null vector. This condition is demonstrated to be necessary and sufficient for the non-linear-separability of a set (see [8]).
When the optimal strategy for $P_{MAX}$ has been chosen, the expected value of the margin according to the probability distributions $(\gamma^+, \gamma^-)$, i.e. the value of the game, will be:
$$E[\rho_{\hat{w}}(x^+, x^-)] = \hat{w}^\top v(\gamma^+, \gamma^-) = \|v(\gamma^+, \gamma^-)\|$$
Note that the vector $v(\gamma^+, \gamma^-)$ is defined by the difference of two vectors lying in the convex hulls of the positive and negative instances respectively, $v^+ = \sum_i \gamma^+_i x^+_i \in ch(S^+)$ and $v^- = \sum_i \gamma^-_i x^-_i \in ch(S^-)$. Moreover, when $\gamma^+$ and $\gamma^-$ are uniform on their respective sets, the vector $v(\gamma^+, \gamma^-)$ will be the difference between the average points of the two sets (a.k.a. their centroids).
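The best response of $P_{MAX}$ and the resulting game value can be made concrete with a few lines of code (again an illustrative sketch with random data, assuming $v(\gamma^+, \gamma^-) \neq 0$):

```python
import numpy as np

rng = np.random.default_rng(1)
S_pos = rng.normal(1.0, 1.0, size=(5, 3))
S_neg = rng.normal(-1.0, 1.0, size=(8, 3))
g_pos, g_neg = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(8))

v = g_pos @ S_pos - g_neg @ S_neg        # v(g+, g-) = v+ - v-, difference of two hull points
w_hat = v / np.linalg.norm(v)            # pure best response of P_MAX (assuming v != 0)
game_value = w_hat @ v                   # expected margin E[rho_w_hat] ...
assert np.isclose(game_value, np.linalg.norm(v))  # ... equals ||v(g+, g-)||

# With uniform strategies, v is simply the difference of the two class centroids.
v_uniform = S_pos.mean(axis=0) - S_neg.mean(axis=0)
```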
Now, we are able to show that the best strategy for $P_{MIN}$ is the solution obtained by an SVM. For this, let us rewrite the vector $v(\gamma^+, \gamma^-)$ using a single vector of parameters,
$$v(\gamma^+, \gamma^-) \equiv v(\gamma) = \sum_{i=1}^{N} y_i \gamma_i x_i,$$
which can be obtained by a simple change of variables
$$\gamma_i = \begin{cases} \gamma^+_r & \text{if } x_i \text{ is the } r\text{-th positive example} \\ \gamma^-_r & \text{if } x_i \text{ is the } r\text{-th negative example.} \end{cases}$$
Using the fact that minimizing the squared norm is equivalent to minimizing the norm itself, we may formulate the optimization problem to compute the best strategy for $P_{MIN}$ (which aims to minimize the value of the game):
$$\min_{\gamma^+ \in \Gamma_p,\ \gamma^- \in \Gamma_n} \|v(\gamma^+, \gamma^-)\| \;\equiv\; \left\{ \begin{array}{l} \min_\gamma \|v(\gamma)\|^2 \\ \text{s.t. } \sum_{i: y_i = y} \gamma_i = 1,\ y \in \{-1, +1\},\ \text{and } \gamma_i \geq 0 \end{array} \right. \qquad (1)$$
As already demonstrated in [9], the problem on the right of Eq. (1) is the same as the hard margin SVM when a bias term is present. Specifically, the bias term is chosen as the score of the point standing in the middle between the points $v^+$ and $v^-$, i.e.
$$\theta = \tfrac{1}{2}\, \hat{w}^\top (v^+ + v^-).$$
Then, the solutions of the two problems are the same. Specifically, the solution maximizes the minimum of the margins in the training set. Clearly, when the training set is not linearly separable, the solution of the problem in Eq. (1) will be $v(\gamma) = 0$.
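For illustration, problem (1) can be solved with any generic quadratic programming routine; the sketch below uses SciPy's SLSQP solver on a linear (input-space) version of the problem. This is not the paper's algorithm (Section 3 derives a dedicated, more efficient procedure), and the function name `hard_margin_gamma` is ours:

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_gamma(X, y):
    """Solve min_gamma ||sum_i y_i gamma_i x_i||^2 subject to the per-class weights
    summing to 1 and gamma >= 0, i.e. the right-hand problem of Eq. (1)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    N = len(y)
    Z = X * y[:, None]                                    # signed examples y_i x_i
    fun = lambda g: np.sum((Z.T @ g) ** 2)                # ||v(gamma)||^2
    jac = lambda g: 2.0 * Z @ (Z.T @ g)                   # its gradient
    cons = [{'type': 'eq', 'fun': lambda g, s=s: g[y == s].sum() - 1.0}
            for s in (+1, -1)]                            # one simplex constraint per class
    g0 = np.where(y == +1, 1.0 / np.sum(y == +1), 1.0 / np.sum(y == -1))
    res = minimize(fun, g0, jac=jac, bounds=[(0.0, 1.0)] * N,
                   constraints=cons, method='SLSQP')
    return res.x

# Example usage with toy data:
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
gamma = hard_margin_gamma(X, y)
```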
2.2 Playing with Margin Distributions
The maximization of the minimum margin is not necessarily the optimal choice when dealing with a classification task. In fact, many recent works, including [4, 3], have demonstrated that the generalization error depends more properly on the distribution of the (lowest) margins in the training set.
Our main idea is then to construct a problem which makes it easy to play with the margin distribution. Specifically, we aim at a formulation that allows us to specify a given trade-off between the minimal value and the average value of the margin on the training set.
For this, we extend the previous game by considering a further cost for the player $P_{MIN}$. Specifically, we want to penalize, to a given extent, strategies that are too pure, so as to obtain solutions which are robust with respect to different training example distributions. In this way, we expect to reduce the variance of the best-strategy estimate when different training sets are drawn from the true distribution of examples $\mathcal{D}_{X \times Y}$. A good measure of the complexity of the $P_{MIN}$ behavior would certainly be the normalized entropy of its strategy, which can be defined by
$$E(\gamma) = \frac{1}{2} \left( \frac{1}{\log(p)} \sum_i \gamma^+_i \log\frac{1}{\gamma^+_i} + \frac{1}{\log(n)} \sum_i \gamma^-_i \log\frac{1}{\gamma^-_i} \right),$$
which has maximum value 1 whenever $\gamma$ is the uniform distribution on both sets (completely unpredictable strategy) and is 0 when the distribution is peaked on a single example per set (completely predictable pure strategy).
However, a (simpler) approximate version of the entropy defined above can be obtained by considering the 2-norm of the distribution. In fact, it is well known that, for any distribution $\gamma \in \Gamma_m$, it always holds that $\|\gamma\|_2 \leq \|\gamma\|_1$. Moreover, $\|\gamma\|_2$ is minimal whenever $\gamma$ is a uniform distribution and is equal to 1 whenever $\gamma$ is a pure strategy. Specifically, we can consider the following approximation:
$$E(\gamma) \approx \frac{m}{m-1}\left(1 - \|\gamma\|_2^2\right)$$
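A small numerical check of the relation between the normalized entropy and its 2-norm approximation (illustrative sketch; function names are ours):

```python
import numpy as np

def normalized_entropy(gamma_pos, gamma_neg):
    """E(gamma): average of the two entropies, each normalized by the log of its set size."""
    def h(g):
        m = len(g)
        nz = g[g > 0]                       # convention: 0 * log(1/0) = 0
        return np.sum(nz * np.log(1.0 / nz)) / np.log(m)
    return 0.5 * (h(gamma_pos) + h(gamma_neg))

def entropy_approx(gamma):
    """2-norm based approximation m/(m-1) * (1 - ||gamma||^2) for a single distribution."""
    m = len(gamma)
    return m / (m - 1) * (1.0 - gamma @ gamma)

# Uniform strategies give value 1, pure strategies give value 0, for both measures.
p, n = 5, 8
uniform_p, uniform_n = np.ones(p) / p, np.ones(n) / n
pure_p = np.eye(p)[0]
print(normalized_entropy(uniform_p, uniform_n))           # ≈ 1.0
print(entropy_approx(uniform_p), entropy_approx(pure_p))  # ≈ 1.0 and 0.0
```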
Considering the squared norm of the distribution, we can reformulate the strategy of $P_{MIN}$ as a trade-off between two objective functions, with a trade-off parameter $\lambda$:
$$\min_{\gamma^+ \in \Gamma_p,\ \gamma^- \in \Gamma_n}\ (1 - \lambda)\|v(\gamma)\|^2 + \lambda\|\gamma\|^2 \qquad (2)$$
It can be shown that the optimal vector $v(\hat{\gamma})$ which solves the problem above represents the vector joining two points, $v^+$ in the positive norm-restricted convex hull, i.e. $v^+ \in ch_\eta(S^+)$, and $v^-$ in the negative norm-restricted convex hull, i.e. $v^- \in ch_\eta(S^-)$, for an appropriate $\eta$.
Similarly to the hard margin SVM, the threshold is defined as the score of the point which is in the middle between these two points, i.e. $\theta = \tfrac{1}{2}\, \hat{w}^\top (v^+ + v^-)$.
Finally, it is straightforward to see that this method generalizes (when $\lambda = 1$) the baseline method presented in [10], where the simple difference between the centroid of the positives and the centroid of the negatives is used as the weight vector, and it obviously generalizes the hard-margin SVM for $\lambda = 0$.
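To make Eq. (2) concrete, the following sketch solves a linear (input-space) version of the problem with a generic solver and then recovers $\hat{w}$ and $\theta$ as described in the text. It is only an illustration under these assumptions: the paper's KM-OMD works with kernels and uses the dedicated optimization algorithm of Section 3, and the name `km_omd_gamma` is ours.

```python
import numpy as np
from scipy.optimize import minimize

def km_omd_gamma(X, y, lam):
    """Sketch of Eq. (2): minimize (1 - lam) * ||v(gamma)||^2 + lam * ||gamma||^2
    over the two per-class probability simplexes (linear, input-space version)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    N = len(y)
    Z = X * y[:, None]                                    # signed examples y_i x_i
    fun = lambda g: (1 - lam) * np.sum((Z.T @ g) ** 2) + lam * np.sum(g ** 2)
    jac = lambda g: 2 * (1 - lam) * Z @ (Z.T @ g) + 2 * lam * g
    cons = [{'type': 'eq', 'fun': lambda g, s=s: g[y == s].sum() - 1.0}
            for s in (+1, -1)]                            # per-class simplex constraints
    g0 = np.where(y == +1, 1.0 / np.sum(y == +1), 1.0 / np.sum(y == -1))
    g = minimize(fun, g0, jac=jac, bounds=[(0.0, 1.0)] * N,
                 constraints=cons, method='SLSQP').x
    # Recover the direction and threshold as described in the text.
    v_pos = g[y == +1] @ X[y == +1]
    v_neg = g[y == -1] @ X[y == -1]
    w_hat = (v_pos - v_neg) / np.linalg.norm(v_pos - v_neg)
    theta = 0.5 * w_hat @ (v_pos + v_neg)
    return g, w_hat, theta
```

Note that for $\lambda = 1$ the squared-norm term alone is minimized, so $\gamma$ becomes uniform on each class and $\hat{w}$ reduces to the (normalized) difference of the class centroids, while for $\lambda = 0$ the sketch reduces to problem (1).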
References
Soft Margins for AdaBoost
A fast iterative nearest point algorithm for support vector machine classifier design
How boosting the margin can also boost classifier complexity
A Mathematical Programming Approach to the Kernel Fisher Algorithm
Frequently Asked Questions
Q1. What contributions have the authors mentioned in the paper "A kernel method for the optimization of the margin distribution"?

In this paper, the authors propose a kernel based method for the direct optimization of the margin distribution (KM-OMD).

In future work, the authors would like to study under which conditions (e.g. conditions related to the data distribution) their method is to be preferred to other state-of-the-art methods. Htm and its optimization could be another direction of their future research. All datasets can be downloaded from: http://ida.first.

The two plots clearly show that learning with low λ values requires more training time, whereas models for higher λ values are faster to compute. 

It is worth noting that, in most cases, even high λ values (for which the models are much faster to train) still give good performance, or at least acceptable performance when computational time is an issue.