
1432 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 38, NO. 6, DECEMBER 2008
Two Criteria for Model Selection in Multiclass
Support Vector Machines
Lei Wang, Member, IEEE, Ping Xue, Senior Member, IEEE, and Kap Luk Chan, Member, IEEE
Abstract—Practical applications call for efficient model selection criteria for multiclass support vector machine (SVM) classification. To solve this problem, this paper develops two model selection criteria by combining or redefining the radius–margin bound used in binary SVMs. The combination is justified by linking the test error rate of a multiclass SVM with that of a set of binary SVMs. The redefinition, which is relatively heuristic, is inspired by the conceptual relationship between the radius–margin bound and the class separability measure. Hence, the two criteria are developed from the perspective of model selection rather than as a generalization of the radius–margin bound for multiclass SVMs. As demonstrated by an extensive experimental study, the minimization of these two criteria achieves good model selection on most data sets. Compared with k-fold cross validation, which is often regarded as a benchmark, these two criteria give rise to comparable performance with much less computational overhead, particularly when a large number of model parameters are to be optimized.
Index Terms—Class separability measure, model selection, multiclass classification, multiclass support vector machines (SVMs), radius–margin bound.
I. INTRODUCTION

In recent years, multiclass support vector machines (SVMs) have attracted much attention due to the demands for multicategory classification in many practical applications and the success of SVMs in binary classification. The methods realizing multiclass SVMs roughly fall into three categories, namely, the methods using the strategies of one-versus-all [1] or one-versus-one [2], [3], the methods based on the error-correcting output codes (ECOC) approach [4], [5], and those using the single-machine approach [6]–[8]. Comparative studies of these methods can be found in [1] and [9]. The one-versus-one- and one-versus-all-based methods are often recommended for practical use because of their lower computational cost or conceptual simplicity.
Manuscript received February 1, 2007; revised October 31, 2007. First published September 16, 2008; current version published November 20, 2008. This work was supported in part by Nanyang Technological University under Grant LIT 2002-4 of A-STAR and Grant RGM 14/02 and in part by the Australian Research Council Discovery Project under Grant DP0773761. The early work of this paper was carried out at Nanyang Technological University, Singapore, and the further work of this paper was carried out at The Australian National University, Canberra, A.C.T., Australia. This paper was recommended by Associate Editor S. Singh.
L. Wang is with the Research School of Information Sciences and Engineering, The Australian National University, Canberra, A.C.T. 0200, Australia (e-mail: Lei.Wang@mail.rsise.anu.edu.au).
P. Xue and K. L. Chan are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (e-mail: epxue@ntu.edu.sg; eklchan@ntu.edu.sg).
Digital Object Identifier 10.1109/TSMCB.2008.927272
Similar to binary SVMs, multiclass SVMs also require model selection to achieve good classification performance. Overcomplex models will overfit the training data, whereas oversimple models cannot effectively represent the intrinsic data structure. Both will result in poor classification performance when the classifiers are put into use. Just as for its binary counterpart, the model selection of multiclass SVMs is used to select the parameters of a kernel function and the regularization parameter that balances training error and machine complexity. Very often, a single model parameter set is uniformly used across all the involved classifiers (for example, the binary SVM classifiers in the one-versus-one- or one-versus-all-based methods), rather than using different parameter sets in different binary classifiers. This is favored because of the following: 1) many fewer model parameters need to be determined, particularly when the kernel function has multiple parameters; 2) past studies show little difference in classification performance [10], [11]; and 3) the risk of overfitting is reduced by using a simpler model. Hence, the focus of this paper is on model selection for multiclass SVMs by finding the best single model parameter set.
In most of the existing work, the model selection for multiclass SVMs uses an exhaustive grid-based search method. The criterion is the k-fold or leave-one-out cross-validation error rate. Although straightforward, the model selection process in this way can become unbearably time consuming because, for multiclass SVMs, we are often required to solve larger scale optimization problems. A few methods have been proposed to speed up this process. In [12], generalized approximate cross validation, which is an estimator of the leave-one-out test error rate, is extended to the multiclass setting to tune model parameters. In [13], an error bound for a multiclass SVM using the ECOC approach is developed and applied to model selection. The grid search is still needed to find the best parameter set. These methods soon become intractable when three or more model parameters are to be tuned. A genetic algorithm has been used to search the model parameter space for model selection [14], [15]. Again, the selection process becomes very slow when the number of model parameters is large.
Practical applications of multiclass SVMs call for efficient model selection criteria, which should be able to handle more model parameters without leading to unacceptable computational cost. In recent years, model selection for binary SVMs has been well studied, and many selection criteria and methods have been developed [16]–[18]. Our proposed approach in this paper is to develop new criteria for the multiclass setting based on the principles of the successful criteria in binary SVMs. In the model selection for binary SVMs, a class of methods uses nonlinear optimization techniques to maximize or minimize a certain criterion to obtain an optimal model parameter set [19], [20]. They can achieve much more efficient model selection than the straightforward grid search. A significant advance along this direction is the method of minimizing the radius–margin bound of a binary SVM classifier [16], [21]. Chapelle et al. derived the derivatives of this bound with respect to the model parameters, making iterative gradient-based optimization techniques applicable. The optimal model parameter set can be efficiently found after a number of iterations. This method not only significantly shortens the model selection process but can also optimize multiple model parameters simultaneously. It would be much desired if such a criterion could also be extended to the multiclass setting. However, such a theoretical generalization of this bound is not straightforward because the bound is rooted in the theoretical basis of binary SVMs. In [22], a theoretical generalization of this bound was reported but without further experimental investigation.
Although an error bound can certainly be used as a model selection criterion, it is not necessary for a model selection criterion to be a valid error bound. As pointed out in [16], when model selection is of concern, whether the minimum (or maximum) of a criterion aligns well with lower test error rates is more important. Hence, instead of aiming to derive an error bound for a multiclass SVM, this paper focuses on developing practical and efficient model selection criteria by observing the principle of such criteria in the binary setting. In detail, the radius–margin bound for binary SVMs is exploited in the following two ways: 1) by linking the test error rates of binary and multiclass SVM classifiers, the first criterion is developed based on the pairwise combination of the radius–margin bounds of a set of binary SVMs; and 2) inspired by the relationship between the radius–margin bound and the class separability measure, the second criterion defines a new radius and margin to accommodate multiple classes. As shown later, both criteria inherit the elegant properties of the original radius–margin bound. Their derivatives with respect to the model parameters can be analytically computed, and thus, gradient-based optimization techniques remain applicable. The two criteria allow for efficient simultaneous optimization of several hundred model parameters. As before, the optimized kernel parameters can be used to identify more discriminative features, which can in turn be used to perform feature selection in a multiclass scenario. To evaluate the model selection performance of the two criteria, extensive experiments were conducted on a variety of benchmark data sets with different numbers of model parameters. Although the two criteria are developed for a multiclass SVM classifier using the one-versus-one classification strategy, the model parameters selected by them are also tested on classifiers using other classification strategies, including one-versus-all, ECOC, and the single-machine approach. The experimental results demonstrate the simplicity, effectiveness, and efficiency of the two criteria for model selection in multiclass SVMs.
The rest of this paper is organized as follows. In Section II, the radius–margin bound is briefly introduced. To stay in focus, the details of binary and multiclass SVMs are omitted, and readers are referred to the papers cited earlier. Sections III and IV present the two model selection criteria in detail. In Section V, computational issues are discussed. Section VI presents the experimental results, and the concluding remarks are drawn in Section VII.
II. RADIUS–MARGIN BOUND FOR BINARY SVMS

Let $D$ denote a set of $l$ training samples, $D = \{(x_1, y_1), \ldots, (x_l, y_l)\} \in (\mathbb{R}^d \times \mathcal{Y})^l$, where $\mathbb{R}^d$ denotes a $d$-dimensional input space, $\mathcal{Y}$ denotes the label set of $x$, and $y \in \{\pm 1\}$ in binary classification. A kernel is defined as $k_\theta(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, where $\phi(\cdot)$ is a possibly nonlinear mapping from $\mathbb{R}^d$ to a feature space $\mathcal{F}$, and $\theta$ denotes the kernel parameter set. For nonseparable data, a regularization parameter $C$ will be used, and the model parameter set becomes $\{\theta, C\}$.
Let $L(D)$ be the number of errors in a leave-one-out procedure performed on $D$. The radius–margin bound is an upper bound of $L(D)$. For a hard-margin binary SVM, it is shown in [16] that

$$L(D) \leq \frac{4R^2}{\gamma^2} = 4R^2\|w\|^2 \qquad (1)$$

where $R$ is the radius of the smallest sphere enclosing the $l$ training samples in $\mathcal{F}$, $\gamma$ is the margin, $w$ is the normal vector of the optimal separating hyperplane, and $\gamma = 1/\|w\|$. For nonseparable data, an $L_2$-norm soft-margin SVM will be used, and the aforementioned result still holds. This is because an $L_2$-norm soft-margin SVM can be shown to be a hard-margin SVM with a slightly modified kernel function $\tilde{k}(x_i, x_j)$ [16], [23]. The relationship between $\tilde{k}$ and $k$ is $\tilde{k}(x_i, x_j) = k(x_i, x_j) + (1/C)$ if $i = j$ and $\tilde{k}(x_i, x_j) = k(x_i, x_j)$ otherwise, where $C$ is the regularization parameter mentioned earlier. This modification is also adopted
in this paper. The squared radius $R^2$ is expressed as $R^2 = \min_{\|\phi(x_i) - \hat{c}\|^2 \leq \hat{R}^2\ (\forall i)} \hat{R}^2$, where $\phi(x_i)\ (i = 1, \ldots, l)$ is the image of $x_i$ in $\mathcal{F}$, $\hat{R}$ is the radius of a sphere enclosing all the $\phi(x_i)$, and $\hat{c}$ is the center of this sphere. This leads to a quadratic optimization problem, and it can be obtained that

$$R^2 = \max_{\beta \in \mathbb{R}^l} \left[ \sum_{i=1}^{l} \beta_i k(x_i, x_i) - \sum_{i,j=1}^{l} \beta_i \beta_j k(x_i, x_j) \right]$$
$$\text{subject to:} \quad \sum_{i=1}^{l} \beta_i = 1; \quad \beta_i \geq 0\ (i = 1, 2, \ldots, l) \qquad (2)$$
where $\beta_i$ is the $i$th Lagrange multiplier and the center of the sphere is represented as $\hat{c} = \sum_{i=1}^{l} \beta_i \phi(x_i)$. As for $\|w\|^2$, it can be obtained once the SVM optimization problem is solved. In detail,

$$\frac{1}{2}\|w\|^2 = \max_{\alpha \in \mathbb{R}^l} \left[ \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \right]$$
$$\text{subject to:} \quad \sum_{i=1}^{l} \alpha_i y_i = 0; \quad \alpha_i \geq 0\ (i = 1, 2, \ldots, l) \qquad (3)$$
where $\alpha_i$ is the $i$th Lagrange multiplier. The derivatives of $R^2$ and $\|w\|^2$ with respect to the model parameters are given in [16]. Let $\theta_t\ (\theta_t \in \theta)$ be the $t$th model parameter:

$$\frac{\partial R^2}{\partial \theta_t} = \sum_{i=1}^{l} \beta_i \frac{\partial k(x_i, x_i)}{\partial \theta_t} - \sum_{i,j=1}^{l} \beta_i \beta_j \frac{\partial k(x_i, x_j)}{\partial \theta_t} \qquad (4)$$
where $\beta_i\ (i = 1, 2, \ldots, l)$ is the solution of (2). The derivative of $\|w\|^2$ with respect to $\theta_t$ is given as

$$\frac{\partial \|w\|^2}{\partial \theta_t} = (-1) \cdot \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \frac{\partial k(x_i, x_j)}{\partial \theta_t} \qquad (5)$$
where $\alpha_i\ (i = 1, 2, \ldots, l)$ is the solution of (3). This way, the derivative of the radius–margin bound with respect to $\theta_t$ is

$$\frac{\partial \left(R^2\|w\|^2\right)}{\partial \theta_t} = \|w\|^2 \frac{\partial R^2}{\partial \theta_t} + R^2 \frac{\partial \|w\|^2}{\partial \theta_t}. \qquad (6)$$
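For concreteness, the following is a minimal sketch, under our own assumptions, of how (2)–(6) can be evaluated once the solutions $\beta$ of (2) and $\alpha$ of (3) are available; it is not the authors' code. A Gaussian (RBF) kernel with width $\sigma$ is assumed as the kernel whose parameter is tuned, and the function names are illustrative only.

```python
# Illustrative sketch (not from the paper): given the optimal beta of (2) and alpha of (3),
# evaluate R^2, ||w||^2, and the derivative (6) of the radius-margin bound with respect to
# a Gaussian-kernel width sigma. The RBF kernel choice and all names are assumptions.
import numpy as np

def rbf_kernel(X, sigma):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T       # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma**2)), d2

def radius_margin_derivative(X, y, alpha, beta, sigma):
    K, d2 = rbf_kernel(X, sigma)
    dK = K * d2 / sigma**3                                # d k(x_i, x_j) / d sigma
    R2 = beta @ np.diag(K) - beta @ K @ beta              # squared radius, from (2)
    w2 = 2.0 * (alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y))   # ||w||^2, from (3)
    dR2 = beta @ np.diag(dK) - beta @ dK @ beta           # derivative (4)
    dw2 = -(alpha * y) @ dK @ (alpha * y)                 # derivative (5)
    return R2, w2, w2 * dR2 + R2 * dw2                    # product rule (6)
```

In practice, $\alpha$ and $\beta$ would come from an SVM solver and a small quadratic program, respectively, and both are re-solved whenever $\sigma$ changes.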
The model selection with the radius–margin bound is briefly described as follows.

1) Set $\theta_r$ to an initial value $\theta_0$.
2) Based on the current $\theta_r$, optimize for $\alpha$ and $\beta$ based on (3) and (2), respectively, and denote the optimal solutions by $\alpha_r$ and $\beta_r$.
3) Once $\alpha_r$ and $\beta_r$ are obtained, the derivative in (6) can be explicitly computed for the given $\theta_r$. Thus, a gradient-based search method can be used to minimize $R^2\|w\|^2$ with respect to $\theta_r$. The minimizer is denoted by $\theta_{r+1}$.
4) Stop if a given stopping criterion is satisfied, and $\theta_{r+1}$ is the selected model. Otherwise, let $\theta_r \leftarrow \theta_{r+1}$ and go to Step 2).

As demonstrated, the radius–margin bound is rooted in the theoretical basis of binary SVMs, and it cannot be directly used in model selection for multiclass SVMs. In the rest of this paper, two criteria are developed based on this bound to deal with model selection in multiclass SVMs.
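The iterative procedure above can be outlined in code. The following is a minimal sketch under stated assumptions, not the authors' implementation: solve_qp_2_and_3 is a hypothetical helper that returns the optimal $\beta$ of (2) and $\alpha$ of (3) for the current kernel width, radius_margin_derivative is the function sketched earlier, and a fixed-step gradient update stands in for the gradient-based search of Step 3).

```python
# Hedged sketch of Steps 1)-4): alternately re-solve (2) and (3) and take a gradient step
# on the bound R^2 * ||w||^2 with respect to the kernel width sigma.
def select_sigma(X, y, solve_qp_2_and_3, sigma0=1.0, step=0.01, n_iter=50, tol=1e-4):
    sigma = sigma0                                        # Step 1): initial value
    for _ in range(n_iter):
        beta, alpha = solve_qp_2_and_3(X, y, sigma)       # Step 2): solve (2) and (3)
        _, _, grad = radius_margin_derivative(X, y, alpha, beta, sigma)
        new_sigma = sigma - step * grad                   # Step 3): gradient-based update
        if abs(new_sigma - sigma) < tol:                  # Step 4): stopping criterion
            return new_sigma
        sigma = new_sigma
    return sigma
```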
III. MODEL SELECTION CRITERION I

Let $D$ and $D_t$ denote the training and test data sets, respectively. $E(D_t)$ denotes the number of misclassified samples obtained by applying a multiclass SVM classifier to $D_t$. The classifier and the test set are assumed to be fixed but unknown. For a $c$-class problem,

$$E(D_t) = \sum_{i=1}^{c} \sum_{j=1, j \neq i}^{c} E_{ij}(D_t). \qquad (7)$$
$E_{ij}(D_t)$ denotes the number of samples misclassified from class $i$ to class $j$,¹ and it is expressed as

$$E_{ij}(D_t) = \left| \left\{ x \mid x \in D_t,\ y_0(x) = i,\ y_m(x) = j \right\} \right| \qquad (8)$$
where $|\cdot|$ denotes the size of a set.² A sample $x$ in $D_t$ will be counted into $E_{ij}(D_t)$ if and only if its true label $y_0$ is $i$, whereas the label $y_m$ predicted by a multiclass SVM classifier is $j$. Considering that both true and predicted labels are unique for each sample,³ a misclassified sample will not be counted into two different $E_{ij}$'s. Hence, there is no overlapping among these $E_{ij}$'s.

¹Without loss of generality, the cost of misclassification is considered identical among the classes in (7). The case having different misclassification costs will be discussed at the end of Section IV.

²Please note that, according to the definition of $D_t$, "$x \in D_t$" in (8) should be written as "$(x, y_0(x)) \in D_t$." However, the former is used in this paper for the convenience of notation.

³In multilabel classification, the true and predicted labels may not be unique for a sample. This paper confines itself to multiclass problems.
Let us focus on the one-versus-one strategy with the max-wins classification rule [9]. It is commonly used to solve multiclass SVM problems. With this strategy, a set of $c(c-1)/2$ pairwise binary SVM classifiers is constructed. Let $\mathrm{SVM}_{ij}$ denote the binary SVM classifier trained with the samples from classes $i$ and $j$. $E'_{ij}(D_t)$ is the number of test samples which belong to class $i$ but are misclassified to class $j$ when $\mathrm{SVM}_{ij}$ is applied to classes $i$ and $j$. For the convenience of notation, the label predicted by $\mathrm{SVM}_{ij}$ is written as $i$ or $j$, although it is $+1$ or $-1$ in general. $E'_{ij}(D_t)$ is formally expressed as

$$E'_{ij}(D_t) = \left| \left\{ x \mid x \in D_t,\ y_0(x) = i,\ y^b_{ij}(x) = j \right\} \right| \qquad (9)$$

where $y^b_{ij}(x)$ stands for the label predicted by the binary SVM classifier $\mathrm{SVM}_{ij}$. The total number of errors made by the $c(c-1)/2$ binary SVM classifiers is

$$E'(D_t) = \sum_{1 \leq i,j \leq c,\ i \neq j} E'_{ij}(D_t). \qquad (10)$$

The following proves that $E(D_t)$ is upper bounded by $E'(D_t)$. Under the rule of max wins [9], the label of a test sample $x$ is decided by

$$y_m(x) = \arg\max_{i=1,\ldots,c} S_i(x) = \arg\max_{i=1,\ldots,c} \sum_{j=1, j \neq i}^{c} \mathrm{sign}\left[\langle w_{ij}, \phi(x)\rangle + b_{ij}\right] \qquad (11)$$

where $\langle w_{ij}, \phi(x)\rangle + b_{ij}$ is positive if $x$ is classified to class $i$. Here, $\mathrm{sign}(a)$ denotes the sign function, which is $+1$ for $a > 0$, $0$ for $a = 0$, and $-1$ otherwise. The summation over the $(c-1)$ sign functions is a score, denoted by $S_i(x)$ for class $i$, and the sample $x$ is assigned to the class having the highest score. This rule immediately leads to the following three results.

1) $\forall x \in D_t$, there must be $S_i(x) \leq (c-1)\ (i = 1, \ldots, c)$, and the equality is achieved if and only if all the $(c-1)$ binary SVM classifiers $\mathrm{SVM}_{ij}\ (j = 1, \ldots, c,\ j \neq i)$ classify $x$ to class $i$.
2) If $S_i(x) < S_j(x)$, there must be $S_i(x) < (c-1)$. Referring to result 1), this indicates that at least one of the $(c-1)$ binary SVM classifiers does not classify $x$ to class $i$.
3) If $S_i(x) = S_j(x)$, then both of them must be smaller than $(c-1)$. This is because the binary $\mathrm{SVM}_{ij}$ cannot classify $x$ to both classes $i$ and $j$ simultaneously.
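To make the two error counts concrete, the following is a small illustrative sketch (not from the paper) that applies the max-wins voting of (11) to the outputs of the $c(c-1)/2$ pairwise classifiers and tallies both the multiclass errors $E_{ij}$ of (8) and the pairwise errors $E'_{ij}$ of (9). The dictionary-of-pairs representation of the pairwise predictions and the function name are assumptions made for this example.

```python
# Illustrative sketch (not the authors' code): count E_ij (multiclass errors under
# max-wins voting) and E'_ij (errors of the individual pairwise classifiers).
# pair_pred[(i, j)] is assumed to hold the labels (i or j) predicted by SVM_ij for all
# test samples; y_true holds the true labels.
from collections import defaultdict
from itertools import combinations

def count_errors(pair_pred, y_true, classes):
    E = defaultdict(int)        # E_ij:  multiclass errors, eq. (8)
    E_prime = defaultdict(int)  # E'_ij: pairwise binary SVM errors, eq. (9)
    for t in range(len(y_true)):
        scores = {i: 0 for i in classes}
        for (i, j) in combinations(classes, 2):
            winner = pair_pred[(i, j)][t]                 # label (i or j) predicted by SVM_ij
            scores[winner] += 1                           # vote for the winning class
            if y_true[t] == i and winner == j:
                E_prime[(i, j)] += 1
            elif y_true[t] == j and winner == i:
                E_prime[(j, i)] += 1
        y_hat = max(scores, key=scores.get)               # max-wins rule of (11)
        if y_hat != y_true[t]:
            E[(y_true[t], y_hat)] += 1
    return E, E_prime
```

Summed over all pairs, the E_prime counts give $E'(D_t)$ of (10), whereas the E counts give $E(D_t)$ of (7); the argument that follows shows the former is never smaller than the latter.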
Assume that a multiclass SVM misclassifies a test sample $x_t\ (x_t \in D_t)$. That is, the true label $y_0(x_t)$ is $i$, whereas the predicted label $y_m(x_t)$ is $j$. This contributes one count to $E_{ij}(D_t)$ based on (8). By referring to (11), this means that $S_j(x_t)$ is the highest score, and hence, $S_i(x_t) \leq S_j(x_t)$. By applying results 2) and 3), it is obtained that $S_i(x_t) < (c-1)$, indicating that at least one of the $(c-1)$ binary SVMs has misclassified the sample $x_t$. This contributes one count to $E'_{ik}(D_t)$; however, please note that $k$ is not necessarily the same as the $j$ in $E_{ij}(D_t)$. Therefore, any test sample misclassified by a multiclass SVM must have been misclassified by at least one binary SVM classifier. Summing $E_{ij}$ and $E'_{ik}$ over $i$ and $j$ (or $k$) gives rise to

$$\sum_{1 \leq i,j \leq c,\ i \neq j} E_{ij} \leq \sum_{1 \leq i,k \leq c,\ i \neq k} E'_{ik} \iff E(D_t) \leq E'(D_t). \qquad (12)$$

This proves that $E(D_t)$ is upper bounded by $E'(D_t)$. Meanwhile, it is worth mentioning that $E_{ij}(D_t) \leq E'_{ij}(D_t)$ is not necessarily true.

The aforementioned result suggests that to reduce the value of $E(D_t)$, we could seek to minimize its upper bound $E'(D_t)$. This leads to one model selection criterion as follows. As known from (1) in Section II, the test error $(E'_{ij} + E'_{ji})$ can be estimated through the leave-one-out error of $\mathrm{SVM}_{ij}$, which is denoted by $L_{ij}$ and satisfies

$$L_{ij} \leq 4R^2_{ij}\|w_{ij}\|^2. \qquad (13)$$

Thus, $E'(D_t)$ can be estimated by $\sum_{1 \leq i < j \leq c} L_{ij}$, which satisfies

$$\sum_{1 \leq i < j \leq c} L_{ij} \leq \sum_{1 \leq i < j \leq c} 4R^2_{ij}\|w_{ij}\|^2. \qquad (14)$$

To minimize $E'(D_t)$ (or, more precisely, to minimize its estimate), the right-hand side has to be minimized. Based on the aforementioned analysis, $\sum_{1 \leq i < j \leq c} R^2_{ij}\|w_{ij}\|^2$ is defined as a model selection criterion for multiclass SVMs. It is a pairwise combination of the radius–margin bounds of the binary SVM classifiers. The optimal model parameter set is obtained by

$$\theta^\star = \arg\min_{\theta \in \Theta} \sum_{1 \leq i < j \leq c} R^2_{ij}\|w_{ij}\|^2. \qquad (15)$$

The derivative of this criterion with respect to the $t$th model parameter $\theta_t$ is

$$\frac{\partial}{\partial \theta_t} \left[ \sum_{1 \leq i < j \leq c} R^2_{ij}\|w_{ij}\|^2 \right] = \sum_{1 \leq i < j \leq c} \left[ \|w_{ij}\|^2 \frac{\partial R^2_{ij}}{\partial \theta_t} + R^2_{ij} \frac{\partial \|w_{ij}\|^2}{\partial \theta_t} \right]. \qquad (16)$$

The calculation of $\partial R^2_{ij}/\partial \theta_t$ and $\partial \|w_{ij}\|^2/\partial \theta_t$ follows (4) and (5). As in binary classification, the optimal model parameter set $\theta^\star$ can be found by using gradient-based optimization techniques.
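As a concrete illustration of (15) and (16), the sketch below assembles Criterion I and its gradient from the per-pair quantities of Section II. It reuses the radius_margin_derivative function and the hypothetical pairwise solver solve_qp_2_and_3 assumed in the earlier sketches and is only a schematic outline, not the authors' implementation.

```python
# Hedged sketch of Criterion I: sum the pairwise products R_ij^2 * ||w_ij||^2 over all
# class pairs, eq. (15), together with the gradient (16) w.r.t. the kernel width sigma.
from itertools import combinations
import numpy as np

def criterion_one(X, y, classes, solve_qp_2_and_3, sigma):
    value, grad = 0.0, 0.0
    for (i, j) in combinations(classes, 2):
        mask = (y == i) | (y == j)
        y_ij = np.where(y[mask] == i, 1.0, -1.0)          # relabel the pair as +1 / -1
        beta, alpha = solve_qp_2_and_3(X[mask], y_ij, sigma)
        R2, w2, d = radius_margin_derivative(X[mask], y_ij, alpha, beta, sigma)
        value += R2 * w2                                   # one term of (15)
        grad += d                                          # one term of (16)
    return value, grad
```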
Before ending this section, it is interesting to look into the relationship between the proposed model selection criterion and the radius–margin bound generalized for multiclass SVMs in [22]. In that work, the multiclass SVM is solved by the single-machine approach. With the notation of this paper, the generalized bound in [22] can be expressed as

$$L(D) \leq \frac{4K}{c} \left[ R^2 \sum_{1 \leq i < j \leq c} \|w_i - w_j\|^2 \right] = \frac{4K}{c} \left[ R^2 \sum_{1 \leq i < j \leq c} \|\tilde{w}_{ij}\|^2 \right] \qquad (17)$$

where $K$ is a constant and $c$ is the number of classes. In [22], $R$ denotes the radius of the smallest sphere enclosing the support vectors only. In this paper, $R$ is changed to enclose all the training samples. Note that such a change will not affect the $\leq$ in (17) because the new $R$ is an upper bound of the original one. The work in [22] adopts the multiclass SVMs proposed in [6]. There, $(w_i - w_j)$ can be understood as a $\tilde{w}_{ij}$, which is a normal vector of an SVM hyperplane separating classes $i$ and $j$. For the proposed Criterion I in (15), $R_{ij}$ is the radius of the smallest sphere enclosing the training samples from classes $i$ and $j$, and therefore, $R^2_{ij} \leq R^2$. Replacing all $R^2_{ij}$ in (15) with $R^2$ and moving $R^2$ out of the summation sign turn Criterion I into $R^2 \sum_{1 \leq i < j \leq c} \|w_{ij}\|^2$. If the constant $4K/c$ is ignored, the proposed Criterion I and the generalized bound in [22] share a similar structure. Of course, from the perspective of generalizing a bound in a strict theoretical sense, the approach in [22] is more suitable.
IV. MODEL SELECTION CRITERION II

Class separability is a concept widely used in pattern recognition [24]–[26]. The scatter-matrix-based measure is often favored, thanks to its simplicity and applicability to both binary and multiclass problems. The scatter matrices are defined as

$$S_W = \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i)(x - m_i)^\top$$
$$S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^\top$$
$$S_T = \sum_{i=1}^{c} \sum_{x \in D_i} (x - m)(x - m)^\top = S_W + S_B. \qquad (18)$$
$c$ is the number of classes, $D_i$ is the set of training samples from class $i$, and $n_i$ is the size of $D_i$. $m_i$ and $m$ are the class and total means, respectively. Many combinations of two of $S_W$, $S_B$, and $S_T$ can be used as a class separability measure. The commonly used ones include $\mathrm{tr}(S_B)/\mathrm{tr}(S_W)$ and $|S_B|/|S_W|$, where $\mathrm{tr}(A)$ and $|A|$ denote the trace and determinant of a square matrix $A$, respectively. Other combinations can be found in [26].
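For reference, the following is a small illustrative sketch (our own, not from the paper) that computes the scatter matrices of (18) and the trace-based separability measure directly in the input space; the function name and the use of NumPy are assumptions for this example.

```python
# Illustrative sketch: scatter matrices S_W, S_B, S_T of (18) and the trace ratio
# tr(S_B)/tr(S_W) computed in the input space R^d.
import numpy as np

def scatter_separability(X, y):
    m = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        X_c = X[y == c]
        m_c = X_c.mean(axis=0)
        S_W += (X_c - m_c).T @ (X_c - m_c)                # within-class scatter
        S_B += len(X_c) * np.outer(m_c - m, m_c - m)      # between-class scatter
    S_T = S_W + S_B                                       # total scatter
    return np.trace(S_B) / np.trace(S_W), S_W, S_B, S_T
```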
In our previous work [27], we restricted the discussion to binary classification and preliminarily discussed the relationship between the scatter-matrix-based class separability measure and the radius–margin bound. Now, this discussion is extended to the multiclass case and is used to develop the second model selection criterion. To do so, the following first extends the class separability to a kernel-induced feature space $\mathcal{F}$. Considering that the high dimensionality of $\mathcal{F}$ can easily make the scatter matrices singular and their determinants zero, the trace-based measure is used instead. In the following, the superscript $\phi$ is used to distinguish the variables in $\mathcal{F}$ from those in $\mathbb{R}^d$. Recall that $D_i$ denotes the training samples from the $i$th class. $D$ is defined as the union of the $D_i\ (i = 1, 2, \ldots, c)$, which is expressed as $D = \cup_{i=1}^{c} D_i$. $K_{\mathcal{A},\mathcal{B}}$ is a kernel matrix where $\{K_{\mathcal{A},\mathcal{B}}\}_{ij} = k(x_i, x_j)$, with the constraints $x_i \in \mathcal{A}$ and $x_j \in \mathcal{B}$. $\mathrm{Sum}(\cdot)$ denotes the summation of all the elements in a matrix. The traces are obtained as
$$\mathrm{tr}\left(S^\phi_B\right) = \sum_{i=1}^{c} \frac{\mathrm{Sum}(K_{D_i, D_i})}{n_i} - \frac{\mathrm{Sum}(K_{D,D})}{n} \qquad (19)$$
$$\mathrm{tr}\left(S^\phi_W\right) = \mathrm{tr}(K_{D,D}) - \sum_{i=1}^{c} \frac{\mathrm{Sum}(K_{D_i, D_i})}{n_i} \qquad (20)$$
$$\mathrm{tr}\left(S^\phi_T\right) = \mathrm{tr}(K_{D,D}) - \frac{\mathrm{Sum}(K_{D,D})}{n}. \qquad (21)$$
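A minimal sketch of (19)–(21), assuming a precomputed kernel matrix K whose rows and columns are ordered consistently with the label vector; again, this is an illustration with names chosen for the example rather than the authors' code.

```python
# Illustrative sketch: trace-based class separability in the kernel-induced feature space,
# computed from a full kernel matrix K according to equations (19)-(21).
import numpy as np

def kernel_space_traces(K, y):
    n = K.shape[0]
    tr_ST = np.trace(K) - K.sum() / n                     # eq. (21)
    per_class = 0.0
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        per_class += K[np.ix_(idx, idx)].sum() / len(idx)
    tr_SB = per_class - K.sum() / n                       # eq. (19)
    tr_SW = np.trace(K) - per_class                       # eq. (20)
    return tr_SB, tr_SW, tr_ST
```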
To facilitate the analysis, the class separability measure in $\mathcal{F}$ is defined as $\mathrm{tr}(S^\phi_B)/\mathrm{tr}(S^\phi_T)$ instead of $\mathrm{tr}(S^\phi_B)/\mathrm{tr}(S^\phi_W)$. Note that they are essentially equivalent because $\mathrm{tr}(S^\phi_T) = \mathrm{tr}(S^\phi_B) + \mathrm{tr}(S^\phi_W)$.
Recall that $n_1$ and $n_2$ are the sizes of $D_1$ and $D_2$, respectively. The relationship between $\mathrm{tr}(S^\phi_B)$ and the squared margin $\gamma^2$ can be proven to be (the proof is omitted)

$$\gamma^2 \leq \frac{1}{4} \cdot \frac{n_1 + n_2}{n_1 n_2} \, \mathrm{tr}\left(S^\phi_B\right) = \frac{1}{4} \left\| m^\phi_1 - m^\phi_2 \right\|^2. \qquad (22)$$
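The paper omits the proof of (22). As a side note (our own algebra, not the omitted proof), the equality between the two expressions on the right-hand side follows directly from the two-class form of $\mathrm{tr}(S^\phi_B)$ with $n = n_1 + n_2$ and $m^\phi = (n_1 m^\phi_1 + n_2 m^\phi_2)/n$:

$$\mathrm{tr}\left(S^\phi_B\right) = n_1\left\|m^\phi_1 - m^\phi\right\|^2 + n_2\left\|m^\phi_2 - m^\phi\right\|^2 = \frac{n_1 n_2^2 + n_2 n_1^2}{n^2}\left\|m^\phi_1 - m^\phi_2\right\|^2 = \frac{n_1 n_2}{n_1 + n_2}\left\|m^\phi_1 - m^\phi_2\right\|^2$$

so that $\frac{1}{4}\cdot\frac{n_1+n_2}{n_1 n_2}\,\mathrm{tr}(S^\phi_B) = \frac{1}{4}\|m^\phi_1 - m^\phi_2\|^2$. The inequality itself, which bounds the margin by half the distance between the class means, is the part whose proof is omitted.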
This result indicates that $\frac{1}{4}\|m^\phi_1 - m^\phi_2\|^2$ is an upper bound of $\gamma^2$. The equality in (22) is achieved if and only if the solution of the problem in (3), denoted by $\alpha_i$, is $1/n_1$ for $x_i \in D_1$ and $1/n_2$ for $x_i \in D_2$. Considering that such a solution seldom occurs in practice, $\frac{1}{4}\|m^\phi_1 - m^\phi_2\|^2$ is a strict upper bound in general. Recall that when minimizing the radius–margin bound for model selection, $\gamma^2$ is to be maximized. Based on (22), to allow $\gamma^2$ to be maximized, its upper bound needs to be adequately large; otherwise, it will prevent $\gamma^2$ from being increased. This, in turn, requires $\|m^\phi_1 - m^\phi_2\|^2$ to be adequately large. Meanwhile, decreasing the value of $\|m^\phi_1 - m^\phi_2\|^2$ will reduce the upper bound, forcing $\gamma^2$ to be kept small. Please note that although a larger (or smaller) $\|m^\phi_1 - m^\phi_2\|^2$ does not necessarily lead to a larger (or smaller) $\gamma^2$, their values are often strongly positively correlated in practice, as can be seen from the results comparing the values of $\mathrm{tr}(S^\phi_B)$ and $\|w\|^2$ in our previous work [27].
A similar result can be proven for the squared radius $R^2$:

$$R^2 \geq \frac{1}{n_1 + n_2} \, \mathrm{tr}\left(S^\phi_T\right). \qquad (23)$$

It shows that $\mathrm{tr}(S^\phi_T)/(n_1 + n_2)$ is a lower bound of $R^2$. The equality in (23) is achieved if and only if the solution of the problem in (2), denoted by $\beta_i$, is $1/(n_1 + n_2)$ for all the training samples. Again, such a solution is rare in practice, and this is a strict lower bound in general. When minimizing the radius–margin bound for model selection, $R^2$ is to be minimized. Based on (23), this requires $\mathrm{tr}(S^\phi_T)$ to be adequately small to avoid hindering the decrease of $R^2$. In addition, it can be seen from [27] that the values of $\mathrm{tr}(S^\phi_T)$ and $R^2$ are often strongly positively correlated.
Conceptually speaking, $\|m^\phi_1 - m^\phi_2\|^2$ and $\gamma^2$ reflect a similar property of data separation, whereas $\mathrm{tr}(S^\phi_T)$ and $R^2$ measure a similar property of data scattering. Inspired by the aforementioned results, this paper transplants the radius–margin bound to a multiclass scenario by mimicking the class separability measure. At the same time, please note that this new model selection criterion will still be based on $R$ and $\|w\|$ rather than the traces of the scatter matrices.
In a multiclass case, $\mathrm{tr}(S^\phi_T)/(n_1 + n_2)$ measures the average of the squared scattering radius of the training samples in $\mathcal{F}$. Considering the analogy between $\mathrm{tr}(S^\phi_T)/(n_1 + n_2)$ and $R^2$ in binary classification, the new criterion redefines $R^2$ as the squared radius of the smallest sphere enclosing all the training samples from the $c$ classes

$$R^2_c = \min_{\|\phi(x) - \hat{c}\|^2 \leq \hat{R}^2\ (\forall x \in D)} \hat{R}^2. \qquad (24)$$
For $\mathrm{tr}(S^\phi_B)$, it can be shown that in the case of $c$ classes

$$\mathrm{tr}\left(S^\phi_B\right) = \sum_{1 \leq i < j \leq c} \frac{n_i n_j \left\|m^\phi_i - m^\phi_j\right\|^2}{n^2}. \qquad (25)$$
By noting the analogy between $\|m^\phi_1 - m^\phi_2\|^2$ and $\gamma^2$ in binary classification, the margin in the new criterion is redefined as

$$\gamma^2 = \sum_{1 \leq i < j \leq c} \frac{n_i n_j \gamma^2_{ij}}{n^2} = \sum_{1 \leq i < j \leq c} \frac{P_i P_j}{\|w_{ij}\|^2} \qquad (26)$$

where $\gamma_{ij}$ is the margin of the binary SVM classifier trained with the training samples of classes $i$ and $j$, and $P_i = n_i/n$ is the prior probability of class $i$ estimated from the training samples. The redefined margin is a weighted average of the margins of the pairwise binary SVM classifiers, where the weight is the product of the prior probabilities of the two involved classes. This implies that the margins between the classes dominating the training and test sets need to be emphasized; otherwise, the number of misclassified samples will be high. This agrees with intuition. In this way, the second model
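As a concrete illustration of the quantities defined so far for Criterion II, the sketch below computes the common squared radius $R^2_c$ of (24) and the prior-weighted margin of (26). It reuses the rbf_kernel function and the hypothetical solvers assumed in the earlier sketches (solve_radius_qp solving problem (2) on the whole training set) and is only an outline under those assumptions, not the authors' implementation.

```python
# Hedged sketch of the two ingredients of Criterion II: the squared radius R_c^2 of the
# sphere enclosing all training samples, eq. (24), and the prior-weighted margin, eq. (26).
from itertools import combinations
import numpy as np

def criterion_two_terms(X, y, classes, solve_radius_qp, solve_qp_2_and_3, sigma):
    K, _ = rbf_kernel(X, sigma)
    beta_all = solve_radius_qp(K)                          # problem (2) over all of D
    R2_c = beta_all @ np.diag(K) - beta_all @ K @ beta_all     # eq. (24)
    n = len(y)
    margin2 = 0.0
    for (i, j) in combinations(classes, 2):
        mask = (y == i) | (y == j)
        y_ij = np.where(y[mask] == i, 1.0, -1.0)
        _, alpha = solve_qp_2_and_3(X[mask], y_ij, sigma)
        K_ij, _ = rbf_kernel(X[mask], sigma)
        w2 = 2.0 * (alpha.sum() - 0.5 * (alpha * y_ij) @ K_ij @ (alpha * y_ij))  # ||w_ij||^2
        P_i, P_j = np.sum(y == i) / n, np.sum(y == j) / n  # class priors
        margin2 += P_i * P_j / w2                          # term of eq. (26)
    return R2_c, margin2
```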
