
1432 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 38, NO. 6, DECEMBER 2008
Two Criteria for Model Selection in Multiclass
Support Vector Machines
Lei Wang, Member, IEEE, Ping Xue, Senior Member, IEEE, and Kap Luk Chan, Member, IEEE
Abstract—Practical applications call for efficient model selection criteria for multiclass support vector machine (SVM) classification. To solve this problem, this paper develops two model selection criteria by combining or redefining the radius–margin bound used in binary SVMs. The combination is justified by linking the test error rate of a multiclass SVM with that of a set of binary SVMs. The redefinition, which is relatively heuristic, is inspired by the conceptual relationship between the radius–margin bound and the class separability measure. Hence, the two criteria are developed from the perspective of model selection rather than as a generalization of the radius–margin bound for multiclass SVMs. As demonstrated by an extensive experimental study, the minimization of these two criteria achieves good model selection on most data sets. Compared with k-fold cross validation, which is often regarded as a benchmark, these two criteria give rise to comparable performance with much less computational overhead, particularly when a large number of model parameters are to be optimized.
Index Terms—Class separability measure, model selection, multiclass classification, multiclass support vector machines (SVMs), radius–margin bound.
I. INTRODUCTION

In recent years, multiclass support vector machines (SVMs) have attracted much attention due to the demands for multicategory classification in many practical applications and the success of SVMs in binary classification. The methods realizing multiclass SVMs roughly fall into three categories, namely, the methods using the strategies of one-versus-all [1] or one-versus-one [2], [3], the methods based on the error-correcting output codes (ECOC) approach [4], [5], and those using the single-machine approach [6]–[8]. Comparative studies of these methods can be found in [1] and [9]. The one-versus-one- and one-versus-all-based methods are often recommended for practical use because of their lower computational cost or conceptual simplicity.
Manuscript received February 1, 2007; revised October 31, 2007. First published September 16, 2008; current version published November 20, 2008. This work was supported in part by Nanyang Technological University under Grant LIT 2002-4 of A-STAR and Grant RGM 14/02 and in part by the Australian Research Council Discovery Project under Grant DP0773761. The early work of this paper was carried out at Nanyang Technological University, Singapore, and the further work of this paper was carried out at The Australian National University, Canberra, A.C.T., Australia. This paper was recommended by Associate Editor S. Singh.
L. Wang is with the Research School of Information Sciences and Engineering, The Australian National University, Canberra, A.C.T. 0200, Australia (e-mail: Lei.Wang@mail.rsise.anu.edu.au).
P. Xue and K. L. Chan are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (e-mail: epxue@ntu.edu.sg; eklchan@ntu.edu.sg).
Digital Object Identifier 10.1109/TSMCB.2008.927272
Similar to binary SVMs, multiclass SVMs also require model selection to achieve good classification performance. Overcomplex models will overfit the training data, whereas oversimple models cannot effectively represent the intrinsic data structure. Both will result in poor classification performance when the classifiers are put into use. Just as for its binary counterpart, the model selection of multiclass SVMs is used to select the parameters of a kernel function and the regularization parameter that balances training error and machine complexity. Very often, a single model parameter set is uniformly used across all the involved classifiers (for example, the binary SVM classifiers in the one-versus-one- or one-versus-all-based methods), rather than using different parameter sets in different binary classifiers. This is favored because of the following: 1) many fewer model parameters need to be determined, particularly when the kernel function has multiple parameters; 2) past studies show little difference in classification performance [10], [11]; and 3) the risk of overfitting is reduced by using a simpler model. Hence, the focus of this paper is on model selection for multiclass SVMs by finding the best single model parameter set.
In most of the existing work, the model selection for multiclass SVMs uses an exhaustive grid-based search method. The criterion is the k-fold or leave-one-out cross-validation error rate. Although straightforward, the model selection process in this way can become unbearably time consuming because, for multiclass SVMs, we are often required to solve larger scale optimization problems. A few methods have been proposed to speed up this process. In [12], generalized approximate cross validation, which is an estimator of the leave-one-out test error rate, is extended to the multiclass setting to tune model parameters. In [13], an error bound for a multiclass SVM using the ECOC approach is developed and applied to model selection. The grid search is still needed to find the best parameter set. These methods soon become intractable when three or more model parameters are to be tuned. A genetic algorithm has been used to search the model parameter space for model selection [14], [15]. Again, the selection process becomes very slow when the number of model parameters is large.
Practical applications of multiclass SVMs call for efficient model selection criteria, which should be able to handle more model parameters without leading to unacceptable computational cost. In recent years, model selection for binary SVMs has been well studied, and many selection criteria and methods have been developed [16]–[18]. Our proposed approach in this paper is to develop new criteria for the multiclass setting based on the principles of the successful criteria in binary SVMs. In the model selection for binary SVMs, a class of methods uses nonlinear optimization techniques to maximize or minimize a certain criterion to obtain an optimal model parameter set [19], [20]. They can achieve much more efficient model selection than the straightforward grid search. A significant advance along this direction is the method of minimizing the radius–margin bound of a binary SVM classifier [16], [21]. Chapelle et al. derived the derivatives of this bound with respect to the model parameters, making iterative gradient-based optimization techniques applicable. The optimal model parameter set can be efficiently found after a number of iterations. This method not only significantly shortens the model selection process but can also optimize multiple model parameters simultaneously. It would be much desired if such a criterion could also be extended to the multiclass setting. However, such a theoretical generalization of this bound is not straightforward because the bound is rooted in the theoretical basis of binary SVMs. In [22], a theoretical generalization of this bound was reported but without further experimental investigation.
Although an error bound can certainly be used as a model selection criterion, it is not necessary for a model selection criterion to be a valid error bound. As pointed out in [16], when model selection is of concern, whether the minimum (or maximum) of a criterion aligns well with lower test error rates is more important. Hence, instead of aiming to derive an error bound for a multiclass SVM, this paper focuses on developing practical and efficient model selection criteria by observing the principle of such criteria in the binary setting. In detail, the radius–margin bound for binary SVMs is exploited in the following two ways: 1) by linking the test error rates of binary and multiclass SVM classifiers, the first criterion is developed based on the pairwise combination of the radius–margin bounds of a set of binary SVMs; and 2) inspired by the relationship between the radius–margin bound and the class separability measure, the second criterion defines a new radius and margin to accommodate multiple classes. As shown later, both criteria inherit the elegant properties of the original radius–margin bound. Their derivatives with respect to the model parameters can be analytically computed, and thus, gradient-based optimization techniques remain applicable. The two criteria allow for efficient simultaneous optimization of several hundred model parameters. As before, the optimized kernel parameters can be used to identify more discriminative features, which can in turn be used to perform feature selection in a multiclass scenario. To evaluate the model selection performance of the two criteria, extensive experiments were conducted on a variety of benchmark data sets with different numbers of model parameters. Although the two criteria are developed for a multiclass SVM classifier using the one-versus-one classification strategy, the model parameters selected by them are also tested on classifiers using other classification strategies, including one-versus-all, ECOC, and the single-machine approach. The experimental results demonstrate the simplicity, effectiveness, and efficiency of the two criteria for model selection in multiclass SVMs.
The rest of this paper is organized as follows. In Section II, the radius–margin bound is briefly introduced. To stay in focus, the details of binary and multiclass SVMs are omitted, and readers are referred to the papers cited earlier. Sections III and IV present the two model selection criteria in detail. In Section V, computational issues are discussed. Section VI presents the experimental results, and the concluding remarks are drawn in Section VII.
II. RADIUS–MARGIN BOUND FOR BINARY SVMS

Let $D$ denote a set of $l$ training samples, $D = \{(x_1, y_1), \ldots, (x_l, y_l)\} \in (\mathbb{R}^d \times \mathcal{Y})^l$, where $\mathbb{R}^d$ denotes a $d$-dimensional input space, $\mathcal{Y}$ denotes the label set of $x$, and $y \in \{\pm 1\}$ in binary classification. A kernel is defined as $k_\theta(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, where $\phi(\cdot)$ is a possibly nonlinear mapping from $\mathbb{R}^d$ to a feature space $\mathcal{F}$, and $\theta$ denotes the kernel parameter set. For nonseparable data, a regularization parameter $C$ will be used, and the model parameter set becomes $\{\theta, C\}$.
Let $L(D)$ be the number of errors in a leave-one-out procedure performed on $D$. The radius–margin bound is an upper bound of $L(D)$. For a hard-margin binary SVM, it is shown in [16] that

$$L(D) \leq \frac{4R^2}{\gamma^2} = 4R^2\|w\|^2 \qquad (1)$$

where $R$ is the radius of the smallest sphere enclosing the $l$ training samples in $\mathcal{F}$, $\gamma$ is the margin, $w$ is the normal vector of the optimal separating hyperplane, and $\gamma = 1/\|w\|$. For nonseparable data, an $L_2$-norm soft-margin SVM will be used, and the aforementioned result still holds. This is because an $L_2$-norm soft-margin SVM can be shown to be a hard-margin SVM with a slightly modified kernel function $\tilde{k}(x_i, x_j)$ [16], [23]. The relationship between $\tilde{k}$ and $k$ is $\tilde{k}(x_i, x_j) = k(x_i, x_j) + (1/C)$ if $i = j$ and $\tilde{k}(x_i, x_j) = k(x_i, x_j)$ otherwise, where $C$ is the regularization parameter mentioned earlier. This modification is also adopted
in this paper. The squared radius $R^2$ is expressed as $R^2 = \min_{\|\phi(x_i) - \hat{c}\|^2 \leq \hat{R}^2\ (\forall i)} \hat{R}^2$, where $\phi(x_i)\ (i = 1, \ldots, l)$ is the image of $x_i$ in $\mathcal{F}$, $\hat{R}$ is the radius of a sphere enclosing all the $\phi(x_i)$, and $\hat{c}$ is the center of this sphere. This leads to a quadratic optimization problem, and it can be obtained that

$$R^2 = \max_{\beta \in \mathbb{R}^l} \left[ \sum_{i=1}^{l} \beta_i k(x_i, x_i) - \sum_{i,j=1}^{l} \beta_i \beta_j k(x_i, x_j) \right]$$
$$\text{subject to:} \quad \sum_{i=1}^{l} \beta_i = 1; \quad \beta_i \geq 0\ (i = 1, 2, \ldots, l) \qquad (2)$$
where $\beta_i$ is the $i$th Lagrange multiplier and the center of the sphere is represented as $\hat{c} = \sum_{i=1}^{l} \beta_i \phi(x_i)$. As for $\|w\|^2$, it can be obtained once the SVM optimization problem is solved. In detail,

$$\frac{1}{2}\|w\|^2 = \max_{\alpha \in \mathbb{R}^l} \left[ \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \right]$$
$$\text{subject to:} \quad \sum_{i=1}^{l} \alpha_i y_i = 0; \quad \alpha_i \geq 0\ (i = 1, 2, \ldots, l) \qquad (3)$$
where $\alpha_i$ is the $i$th Lagrange multiplier. The derivatives of $R^2$ and $\|w\|^2$ with respect to the model parameters are given in [16]. Let $\theta_t\ (\theta_t \in \theta)$ be the $t$th model parameter:

$$\frac{\partial R^2}{\partial \theta_t} = \sum_{i=1}^{l} \beta_i \frac{\partial k(x_i, x_i)}{\partial \theta_t} - \sum_{i,j=1}^{l} \beta_i \beta_j \frac{\partial k(x_i, x_j)}{\partial \theta_t} \qquad (4)$$
where $\beta_i\ (i = 1, 2, \ldots, l)$ is the solution of (2). The derivative of $\|w\|^2$ with respect to $\theta_t$ is given as

$$\frac{\partial \|w\|^2}{\partial \theta_t} = (-1) \cdot \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \frac{\partial k(x_i, x_j)}{\partial \theta_t} \qquad (5)$$
where $\alpha_i\ (i = 1, 2, \ldots, l)$ is the solution of (3). This way, the derivative of the radius–margin bound with respect to $\theta_t$ is

$$\frac{\partial \left(R^2\|w\|^2\right)}{\partial \theta_t} = \|w\|^2 \frac{\partial R^2}{\partial \theta_t} + R^2 \frac{\partial \|w\|^2}{\partial \theta_t}. \qquad (6)$$
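For concreteness, the following is a minimal sketch, under our own assumptions, of how (2)–(6) can be evaluated once the solutions $\beta$ of (2) and $\alpha$ of (3) are available; it is not the authors' code. A Gaussian (RBF) kernel with width $\sigma$ is assumed as the kernel whose parameter is tuned, and the function names are illustrative only.

```python
# Illustrative sketch (not from the paper): given the optimal beta of (2) and alpha of (3),
# evaluate R^2, ||w||^2, and the derivative (6) of the radius-margin bound with respect to
# a Gaussian-kernel width sigma. The RBF kernel choice and all names are assumptions.
import numpy as np

def rbf_kernel(X, sigma):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T       # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma**2)), d2

def radius_margin_derivative(X, y, alpha, beta, sigma):
    K, d2 = rbf_kernel(X, sigma)
    dK = K * d2 / sigma**3                                # d k(x_i, x_j) / d sigma
    R2 = beta @ np.diag(K) - beta @ K @ beta              # squared radius, from (2)
    w2 = 2.0 * (alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y))   # ||w||^2, from (3)
    dR2 = beta @ np.diag(dK) - beta @ dK @ beta           # derivative (4)
    dw2 = -(alpha * y) @ dK @ (alpha * y)                 # derivative (5)
    return R2, w2, w2 * dR2 + R2 * dw2                    # product rule (6)
```

In practice, $\alpha$ and $\beta$ would come from an SVM solver and a small quadratic program, respectively, and both are re-solved whenever $\sigma$ changes.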
The model selection with the radius–margin bound is briefly described as follows.

1) Set $\theta_r$ to an initial value $\theta_0$.
2) Based on the current $\theta_r$, optimize for $\alpha$ and $\beta$ based on (3) and (2), respectively, and denote the optimal solutions by $\alpha_r$ and $\beta_r$.
3) Once $\alpha_r$ and $\beta_r$ are obtained, the derivative in (6) can be explicitly computed for the given $\theta_r$. Thus, a gradient-based search method can be used to minimize $R^2\|w\|^2$ with respect to $\theta_r$. The minimizer is denoted by $\theta_{r+1}$.
4) Stop if a given stopping criterion is satisfied, and $\theta_{r+1}$ is the selected model. Otherwise, let $\theta_r \leftarrow \theta_{r+1}$ and go to Step 2).

As demonstrated, the radius–margin bound is rooted in the theoretical basis of binary SVMs, and it cannot be directly used in model selection for multiclass SVMs. In the rest of this paper, two criteria are developed based on this bound to deal with model selection in multiclass SVMs.
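The iterative procedure above can be outlined in code. The following is a minimal sketch under stated assumptions, not the authors' implementation: solve_qp_2_and_3 is a hypothetical helper that returns the optimal $\beta$ of (2) and $\alpha$ of (3) for the current kernel width, radius_margin_derivative is the function sketched earlier, and a fixed-step gradient update stands in for the gradient-based search of Step 3).

```python
# Hedged sketch of Steps 1)-4): alternately re-solve (2) and (3) and take a gradient step
# on the bound R^2 * ||w||^2 with respect to the kernel width sigma.
def select_sigma(X, y, solve_qp_2_and_3, sigma0=1.0, step=0.01, n_iter=50, tol=1e-4):
    sigma = sigma0                                        # Step 1): initial value
    for _ in range(n_iter):
        beta, alpha = solve_qp_2_and_3(X, y, sigma)       # Step 2): solve (2) and (3)
        _, _, grad = radius_margin_derivative(X, y, alpha, beta, sigma)
        new_sigma = sigma - step * grad                   # Step 3): gradient-based update
        if abs(new_sigma - sigma) < tol:                  # Step 4): stopping criterion
            return new_sigma
        sigma = new_sigma
    return sigma
```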
III. MODEL SELECTION CRITERION I

Let $D$ and $D_t$ denote the training and test data sets, respectively. $E(D_t)$ denotes the number of misclassified samples obtained by applying a multiclass SVM classifier to $D_t$. The classifier and the test set are assumed to be fixed but unknown. For a $c$-class problem,

$$E(D_t) = \sum_{i=1}^{c} \sum_{j=1, j \neq i}^{c} E_{ij}(D_t). \qquad (7)$$
$E_{ij}(D_t)$ denotes the number of samples misclassified from class $i$ to class $j$,¹ and it is expressed as

$$E_{ij}(D_t) = \left| \left\{ x \mid x \in D_t,\ y_0(x) = i,\ y_m(x) = j \right\} \right| \qquad (8)$$
where $|\cdot|$ denotes the size of a set.² A sample $x$ in $D_t$ will be counted into $E_{ij}(D_t)$ if and only if its true label $y_0$ is $i$, whereas the label $y_m$ predicted by a multiclass SVM classifier is $j$. Considering that both true and predicted labels are unique for each sample,³ a misclassified sample will not be counted into two different $E_{ij}$'s. Hence, there is no overlapping among these $E_{ij}$'s.

¹Without loss of generality, the cost of misclassification is considered identical among the classes in (7). The case having different misclassification costs will be discussed at the end of Section IV.

²Please note that, according to the definition of $D_t$, "$x \in D_t$" in (8) should be written as "$(x, y_0(x)) \in D_t$." However, the former is used in this paper for the convenience of notation.

³In multilabel classification, the true and predicted labels may not be unique for a sample. This paper confines itself to multiclass problems.
Let us focus on the one-versus-one strategy with the max-wins classification rule [9]. It is commonly used to solve multiclass SVM problems. With this strategy, a set of $c(c-1)/2$ pairwise binary SVM classifiers is constructed. Let $\mathrm{SVM}_{ij}$ denote the binary SVM classifier trained with the samples from classes $i$ and $j$. $E'_{ij}(D_t)$ is the number of test samples which belong to class $i$ but are misclassified to class $j$ when $\mathrm{SVM}_{ij}$ is applied to classes $i$ and $j$. For the convenience of notation, the label predicted by $\mathrm{SVM}_{ij}$ is written as $i$ or $j$, although it is $+1$ or $-1$ in general. $E'_{ij}(D_t)$ is formally expressed as

$$E'_{ij}(D_t) = \left| \left\{ x \mid x \in D_t,\ y_0(x) = i,\ y^b_{ij}(x) = j \right\} \right| \qquad (9)$$

where $y^b_{ij}(x)$ stands for the label predicted by the binary SVM classifier $\mathrm{SVM}_{ij}$. The total number of errors made by the $c(c-1)/2$ binary SVM classifiers is

$$E'(D_t) = \sum_{1 \leq i,j \leq c,\ i \neq j} E'_{ij}(D_t). \qquad (10)$$

The following proves that $E(D_t)$ is upper bounded by $E'(D_t)$. Under the rule of max wins [9], the label of a test sample $x$ is decided by

$$y_m(x) = \arg\max_{i=1,\ldots,c} S_i(x) = \arg\max_{i=1,\ldots,c} \sum_{j=1, j \neq i}^{c} \mathrm{sign}\left[\langle w_{ij}, \phi(x)\rangle + b_{ij}\right] \qquad (11)$$

where $\langle w_{ij}, \phi(x)\rangle + b_{ij}$ is positive if $x$ is classified to class $i$. Here, $\mathrm{sign}(a)$ denotes the sign function, which is $+1$ for $a > 0$, $0$ for $a = 0$, and $-1$ otherwise. The summation over the $(c-1)$ sign functions is a score, denoted by $S_i(x)$ for class $i$, and the sample $x$ is assigned to the class having the highest score. This rule immediately leads to the following three results.

1) $\forall x \in D_t$, there must be $S_i(x) \leq (c-1)\ (i = 1, \ldots, c)$, and the equality is achieved if and only if all the $(c-1)$ binary SVM classifiers $\mathrm{SVM}_{ij}\ (j = 1, \ldots, c,\ j \neq i)$ classify $x$ to class $i$.
2) If $S_i(x) < S_j(x)$, there must be $S_i(x) < (c-1)$. Referring to result 1), this indicates that at least one of the $(c-1)$ binary SVM classifiers does not classify $x$ to class $i$.
3) If $S_i(x) = S_j(x)$, then both of them must be smaller than $(c-1)$. This is because the binary $\mathrm{SVM}_{ij}$ cannot classify $x$ to both classes $i$ and $j$ simultaneously.
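To make the two error counts concrete, the following is a small illustrative sketch (not from the paper) that applies the max-wins voting of (11) to the outputs of the $c(c-1)/2$ pairwise classifiers and tallies both the multiclass errors $E_{ij}$ of (8) and the pairwise errors $E'_{ij}$ of (9). The dictionary-of-pairs representation of the pairwise predictions and the function name are assumptions made for this example.

```python
# Illustrative sketch (not the authors' code): count E_ij (multiclass errors under
# max-wins voting) and E'_ij (errors of the individual pairwise classifiers).
# pair_pred[(i, j)] is assumed to hold the labels (i or j) predicted by SVM_ij for all
# test samples; y_true holds the true labels.
from collections import defaultdict
from itertools import combinations

def count_errors(pair_pred, y_true, classes):
    E = defaultdict(int)        # E_ij:  multiclass errors, eq. (8)
    E_prime = defaultdict(int)  # E'_ij: pairwise binary SVM errors, eq. (9)
    for t in range(len(y_true)):
        scores = {i: 0 for i in classes}
        for (i, j) in combinations(classes, 2):
            winner = pair_pred[(i, j)][t]                 # label (i or j) predicted by SVM_ij
            scores[winner] += 1                           # vote for the winning class
            if y_true[t] == i and winner == j:
                E_prime[(i, j)] += 1
            elif y_true[t] == j and winner == i:
                E_prime[(j, i)] += 1
        y_hat = max(scores, key=scores.get)               # max-wins rule of (11)
        if y_hat != y_true[t]:
            E[(y_true[t], y_hat)] += 1
    return E, E_prime
```

Summed over all pairs, the E_prime counts give $E'(D_t)$ of (10), whereas the E counts give $E(D_t)$ of (7); the argument that follows shows the former is never smaller than the latter.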
Assume that a multiclass SVM misclassifies a test sample $x_t\ (x_t \in D_t)$. That is, the true label $y_0(x_t)$ is $i$, whereas the predicted label $y_m(x_t)$ is $j$. This contributes one count to $E_{ij}(D_t)$ based on (8). By referring to (11), this means that $S_j(x_t)$ is the highest score, and hence, $S_i(x_t) \leq S_j(x_t)$. By applying results 2) and 3), it is obtained that $S_i(x_t) < (c-1)$, indicating that at least one of the $(c-1)$ binary SVMs has misclassified the sample $x_t$. This contributes one count to $E'_{ik}(D_t)$; however, please note that $k$ is not necessarily the same as the $j$ in $E_{ij}(D_t)$. Therefore, any test sample misclassified by a multiclass SVM must have been misclassified by at least one binary SVM classifier. Summing $E_{ij}$ and $E'_{ik}$ over $i$ and $j$ (or $k$) gives rise to

$$\sum_{1 \leq i,j \leq c,\ i \neq j} E_{ij} \leq \sum_{1 \leq i,k \leq c,\ i \neq k} E'_{ik} \iff E(D_t) \leq E'(D_t). \qquad (12)$$

This proves that $E(D_t)$ is upper bounded by $E'(D_t)$. Meanwhile, it is worth mentioning that $E_{ij}(D_t) \leq E'_{ij}(D_t)$ is not necessarily true.

The aforementioned result suggests that to reduce the value of $E(D_t)$, we could seek to minimize its upper bound $E'(D_t)$. This leads to one model selection criterion as follows. As known from (1) in Section II, the test error $(E'_{ij} + E'_{ji})$ can be estimated through the leave-one-out error of $\mathrm{SVM}_{ij}$, which is denoted by $L_{ij}$ and satisfies

$$L_{ij} \leq 4R^2_{ij}\|w_{ij}\|^2. \qquad (13)$$

Thus, $E'(D_t)$ can be estimated by $\sum_{1 \leq i < j \leq c} L_{ij}$, which satisfies

$$\sum_{1 \leq i < j \leq c} L_{ij} \leq \sum_{1 \leq i < j \leq c} 4R^2_{ij}\|w_{ij}\|^2. \qquad (14)$$

To minimize $E'(D_t)$ (or, more precisely, to minimize its estimate), the right-hand side has to be minimized. Based on the aforementioned analysis, $\sum_{1 \leq i < j \leq c} R^2_{ij}\|w_{ij}\|^2$ is defined as a model selection criterion for multiclass SVMs. It is a pairwise combination of the radius–margin bounds of the binary SVM classifiers. The optimal model parameter set is obtained by

$$\theta^\star = \arg\min_{\theta \in \Theta} \sum_{1 \leq i < j \leq c} R^2_{ij}\|w_{ij}\|^2. \qquad (15)$$

The derivative of this criterion with respect to the $t$th model parameter $\theta_t$ is

$$\frac{\partial}{\partial \theta_t} \left[ \sum_{1 \leq i < j \leq c} R^2_{ij}\|w_{ij}\|^2 \right] = \sum_{1 \leq i < j \leq c} \left[ \|w_{ij}\|^2 \frac{\partial R^2_{ij}}{\partial \theta_t} + R^2_{ij} \frac{\partial \|w_{ij}\|^2}{\partial \theta_t} \right]. \qquad (16)$$

The calculation of $\partial R^2_{ij}/\partial \theta_t$ and $\partial \|w_{ij}\|^2/\partial \theta_t$ follows (4) and (5). As in binary classification, the optimal model parameter set $\theta^\star$ can be found by using gradient-based optimization techniques.
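As a concrete illustration of (15) and (16), the sketch below assembles Criterion I and its gradient from the per-pair quantities of Section II. It reuses the radius_margin_derivative function and the hypothetical pairwise solver solve_qp_2_and_3 assumed in the earlier sketches and is only a schematic outline, not the authors' implementation.

```python
# Hedged sketch of Criterion I: sum the pairwise products R_ij^2 * ||w_ij||^2 over all
# class pairs, eq. (15), together with the gradient (16) w.r.t. the kernel width sigma.
from itertools import combinations
import numpy as np

def criterion_one(X, y, classes, solve_qp_2_and_3, sigma):
    value, grad = 0.0, 0.0
    for (i, j) in combinations(classes, 2):
        mask = (y == i) | (y == j)
        y_ij = np.where(y[mask] == i, 1.0, -1.0)          # relabel the pair as +1 / -1
        beta, alpha = solve_qp_2_and_3(X[mask], y_ij, sigma)
        R2, w2, d = radius_margin_derivative(X[mask], y_ij, alpha, beta, sigma)
        value += R2 * w2                                   # one term of (15)
        grad += d                                          # one term of (16)
    return value, grad
```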
Before ending this section, it is interesting to look into the relationship between the proposed model selection criterion and the radius–margin bound generalized for multiclass SVMs in [22]. In that work, the multiclass SVM is solved by the single-machine approach. With the notation of this paper, the generalized bound in [22] can be expressed as

$$L(D) \leq \frac{4K}{c} \left[ R^2 \sum_{1 \leq i < j \leq c} \|w_i - w_j\|^2 \right] = \frac{4K}{c} \left[ R^2 \sum_{1 \leq i < j \leq c} \|\tilde{w}_{ij}\|^2 \right] \qquad (17)$$

where $K$ is a constant and $c$ is the number of classes. In [22], $R$ denotes the radius of the smallest sphere enclosing the support vectors only. In this paper, $R$ is changed to enclose all the training samples. Note that such a change will not affect the $\leq$ in (17) because the new $R$ is an upper bound of the original one. The work in [22] adopts the multiclass SVMs proposed in [6]. There, $(w_i - w_j)$ can be understood as a $\tilde{w}_{ij}$, which is a normal vector of an SVM hyperplane separating classes $i$ and $j$. For the proposed Criterion I in (15), $R_{ij}$ is the radius of the smallest sphere enclosing the training samples from classes $i$ and $j$, and therefore, $R^2_{ij} \leq R^2$. Replacing all $R^2_{ij}$ in (15) with $R^2$ and moving $R^2$ out of the summation sign turn Criterion I into $R^2 \sum_{1 \leq i < j \leq c} \|w_{ij}\|^2$. If the constant $4K/c$ is ignored, the proposed Criterion I and the generalized bound in [22] share a similar structure. Of course, from the perspective of generalizing a bound in a strict theoretical sense, the approach in [22] is more suitable.
IV. MODEL SELECTION CRITERION II

Class separability is a concept widely used in pattern recognition [24]–[26]. The scatter-matrix-based measure is often favored, thanks to its simplicity and applicability to both binary and multiclass problems. The scatter matrices are defined as

$$S_W = \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i)(x - m_i)^\top$$
$$S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^\top$$
$$S_T = \sum_{i=1}^{c} \sum_{x \in D_i} (x - m)(x - m)^\top = S_W + S_B. \qquad (18)$$
$c$ is the number of classes, $D_i$ is the set of training samples from class $i$, and $n_i$ is the size of $D_i$. $m_i$ and $m$ are the class and total means, respectively. Many combinations of two of $S_W$, $S_B$, and $S_T$ can be used as a class separability measure. The commonly used ones include $\mathrm{tr}(S_B)/\mathrm{tr}(S_W)$ and $|S_B|/|S_W|$, where $\mathrm{tr}(A)$ and $|A|$ denote the trace and determinant of a square matrix $A$, respectively. Other combinations can be found in [26].
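For reference, the following is a small illustrative sketch (our own, not from the paper) that computes the scatter matrices of (18) and the trace-based separability measure directly in the input space; the function name and the use of NumPy are assumptions for this example.

```python
# Illustrative sketch: scatter matrices S_W, S_B, S_T of (18) and the trace ratio
# tr(S_B)/tr(S_W) computed in the input space R^d.
import numpy as np

def scatter_separability(X, y):
    m = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        X_c = X[y == c]
        m_c = X_c.mean(axis=0)
        S_W += (X_c - m_c).T @ (X_c - m_c)                # within-class scatter
        S_B += len(X_c) * np.outer(m_c - m, m_c - m)      # between-class scatter
    S_T = S_W + S_B                                       # total scatter
    return np.trace(S_B) / np.trace(S_W), S_W, S_B, S_T
```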
In our previous work [27], we restricted the discussion to binary classification and preliminarily discussed the relationship between the scatter-matrix-based class separability measure and the radius–margin bound. Now, this discussion is extended to the multiclass case and is used to develop the second model selection criterion. To do so, the following first extends the class separability to a kernel-induced feature space $\mathcal{F}$. Considering that the high dimensionality of $\mathcal{F}$ can easily make the scatter matrices singular and their determinants zero, the trace-based measure is used instead. In the following, the superscript $\phi$ is used to distinguish the variables in $\mathcal{F}$ from those in $\mathbb{R}^d$. Recall that $D_i$ denotes the training samples from the $i$th class. $D$ is defined as the union of the $D_i\ (i = 1, 2, \ldots, c)$, which is expressed as $D = \cup_{i=1}^{c} D_i$. $K_{\mathcal{A},\mathcal{B}}$ is a kernel matrix where $\{K_{\mathcal{A},\mathcal{B}}\}_{ij} = k(x_i, x_j)$, with the constraints $x_i \in \mathcal{A}$ and $x_j \in \mathcal{B}$. $\mathrm{Sum}(\cdot)$ denotes the summation of all the elements in a matrix. The traces are obtained as
$$\mathrm{tr}\left(S^\phi_B\right) = \sum_{i=1}^{c} \frac{\mathrm{Sum}(K_{D_i, D_i})}{n_i} - \frac{\mathrm{Sum}(K_{D,D})}{n} \qquad (19)$$
$$\mathrm{tr}\left(S^\phi_W\right) = \mathrm{tr}(K_{D,D}) - \sum_{i=1}^{c} \frac{\mathrm{Sum}(K_{D_i, D_i})}{n_i} \qquad (20)$$
$$\mathrm{tr}\left(S^\phi_T\right) = \mathrm{tr}(K_{D,D}) - \frac{\mathrm{Sum}(K_{D,D})}{n}. \qquad (21)$$
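A minimal sketch of (19)–(21), assuming a precomputed kernel matrix K whose rows and columns are ordered consistently with the label vector; again, this is an illustration with names chosen for the example rather than the authors' code.

```python
# Illustrative sketch: trace-based class separability in the kernel-induced feature space,
# computed from a full kernel matrix K according to equations (19)-(21).
import numpy as np

def kernel_space_traces(K, y):
    n = K.shape[0]
    tr_ST = np.trace(K) - K.sum() / n                     # eq. (21)
    per_class = 0.0
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        per_class += K[np.ix_(idx, idx)].sum() / len(idx)
    tr_SB = per_class - K.sum() / n                       # eq. (19)
    tr_SW = np.trace(K) - per_class                       # eq. (20)
    return tr_SB, tr_SW, tr_ST
```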
To facilitate the analysis, the class separability measure in $\mathcal{F}$ is defined as $\mathrm{tr}(S^\phi_B)/\mathrm{tr}(S^\phi_T)$ instead of $\mathrm{tr}(S^\phi_B)/\mathrm{tr}(S^\phi_W)$. Note that they are essentially equivalent because $\mathrm{tr}(S^\phi_T) = \mathrm{tr}(S^\phi_B) + \mathrm{tr}(S^\phi_W)$.
Recall that $n_1$ and $n_2$ are the sizes of $D_1$ and $D_2$, respectively. The relationship between $\mathrm{tr}(S^\phi_B)$ and the squared margin $\gamma^2$ can be proven to be (the proof is omitted)

$$\gamma^2 \leq \frac{1}{4} \cdot \frac{n_1 + n_2}{n_1 n_2} \, \mathrm{tr}\left(S^\phi_B\right) = \frac{1}{4} \left\| m^\phi_1 - m^\phi_2 \right\|^2. \qquad (22)$$
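The paper omits the proof of (22). As a side note (our own algebra, not the omitted proof), the equality between the two expressions on the right-hand side follows directly from the two-class form of $\mathrm{tr}(S^\phi_B)$ with $n = n_1 + n_2$ and $m^\phi = (n_1 m^\phi_1 + n_2 m^\phi_2)/n$:

$$\mathrm{tr}\left(S^\phi_B\right) = n_1\left\|m^\phi_1 - m^\phi\right\|^2 + n_2\left\|m^\phi_2 - m^\phi\right\|^2 = \frac{n_1 n_2^2 + n_2 n_1^2}{n^2}\left\|m^\phi_1 - m^\phi_2\right\|^2 = \frac{n_1 n_2}{n_1 + n_2}\left\|m^\phi_1 - m^\phi_2\right\|^2$$

so that $\frac{1}{4}\cdot\frac{n_1+n_2}{n_1 n_2}\,\mathrm{tr}(S^\phi_B) = \frac{1}{4}\|m^\phi_1 - m^\phi_2\|^2$. The inequality itself, which bounds the margin by half the distance between the class means, is the part whose proof is omitted.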
This result indicates that $\frac{1}{4}\|m^\phi_1 - m^\phi_2\|^2$ is an upper bound of $\gamma^2$. The equality in (22) is achieved if and only if the solution of the problem in (3), denoted by $\alpha_i$, is $1/n_1$ for $x_i \in D_1$ and $1/n_2$ for $x_i \in D_2$. Considering that such a solution seldom occurs in practice, $\frac{1}{4}\|m^\phi_1 - m^\phi_2\|^2$ is a strict upper bound in general. Recall that when minimizing the radius–margin bound for model selection, $\gamma^2$ is to be maximized. Based on (22), to allow $\gamma^2$ to be maximized, its upper bound needs to be adequately large; otherwise, it will prevent $\gamma^2$ from being increased. This, in turn, requires $\|m^\phi_1 - m^\phi_2\|^2$ to be adequately large. Meanwhile, decreasing the value of $\|m^\phi_1 - m^\phi_2\|^2$ will reduce the upper bound, forcing $\gamma^2$ to be kept small. Please note that although a larger (or smaller) $\|m^\phi_1 - m^\phi_2\|^2$ does not necessarily lead to a larger (or smaller) $\gamma^2$, their values are often strongly positively correlated in practice, as can be seen from the results comparing the values of $\mathrm{tr}(S^\phi_B)$ and $\|w\|^2$ in our previous work [27].
A similar result can be proven for the squared radius $R^2$:

$$R^2 \geq \frac{1}{n_1 + n_2} \, \mathrm{tr}\left(S^\phi_T\right). \qquad (23)$$

It shows that $\mathrm{tr}(S^\phi_T)/(n_1 + n_2)$ is a lower bound of $R^2$. The equality in (23) is achieved if and only if the solution of the problem in (2), denoted by $\beta_i$, is $1/(n_1 + n_2)$ for all the training samples. Again, such a solution is rare in practice, and this is a strict lower bound in general. When minimizing the radius–margin bound for model selection, $R^2$ is to be minimized. Based on (23), this requires $\mathrm{tr}(S^\phi_T)$ to be adequately small to avoid hindering the decrease of $R^2$. In addition, it can be seen from [27] that the values of $\mathrm{tr}(S^\phi_T)$ and $R^2$ are often strongly positively correlated.
Conceptually speaking, $\|m^\phi_1 - m^\phi_2\|^2$ and $\gamma^2$ reflect a similar property of data separation, whereas $\mathrm{tr}(S^\phi_T)$ and $R^2$ measure a similar property of data scattering. Inspired by the aforementioned results, this paper transplants the radius–margin bound to a multiclass scenario by mimicking the class separability measure. At the same time, please note that this new model selection criterion will still be based on $R$ and $\|w\|$ rather than the traces of the scatter matrices.
In a multiclass case, $\mathrm{tr}(S^\phi_T)/(n_1 + n_2)$ measures the average of the squared scattering radius of the training samples in $\mathcal{F}$. Considering the analogy between $\mathrm{tr}(S^\phi_T)/(n_1 + n_2)$ and $R^2$ in binary classification, the new criterion redefines $R^2$ as the squared radius of the smallest sphere enclosing all the training samples from the $c$ classes

$$R^2_c = \min_{\|\phi(x) - \hat{c}\|^2 \leq \hat{R}^2\ (\forall x \in D)} \hat{R}^2. \qquad (24)$$
For $\mathrm{tr}(S^\phi_B)$, it can be shown that in the case of $c$ classes

$$\mathrm{tr}\left(S^\phi_B\right) = \sum_{1 \leq i < j \leq c} \frac{n_i n_j \left\|m^\phi_i - m^\phi_j\right\|^2}{n^2}. \qquad (25)$$
By noting the analogy between $\|m^\phi_1 - m^\phi_2\|^2$ and $\gamma^2$ in binary classification, the margin in the new criterion is redefined as

$$\gamma^2 = \sum_{1 \leq i < j \leq c} \frac{n_i n_j \gamma^2_{ij}}{n^2} = \sum_{1 \leq i < j \leq c} \frac{P_i P_j}{\|w_{ij}\|^2} \qquad (26)$$

where $\gamma_{ij}$ is the margin of the binary SVM classifier trained with the training samples of classes $i$ and $j$, and $P_i = n_i/n$ is the prior probability of class $i$ estimated from the training samples. The redefined margin is a weighted average of the margins of the pairwise binary SVM classifiers, where the weight is the product of the prior probabilities of the two involved classes. This implies that the margins between the classes dominating the training and test sets need to be emphasized; otherwise, the number of misclassified samples will be high. This agrees with intuition. In this way, the second model
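As a concrete illustration of the quantities defined so far for Criterion II, the sketch below computes the common squared radius $R^2_c$ of (24) and the prior-weighted margin of (26). It reuses the rbf_kernel function and the hypothetical solvers assumed in the earlier sketches (solve_radius_qp solving problem (2) on the whole training set) and is only an outline under those assumptions, not the authors' implementation.

```python
# Hedged sketch of the two ingredients of Criterion II: the squared radius R_c^2 of the
# sphere enclosing all training samples, eq. (24), and the prior-weighted margin, eq. (26).
from itertools import combinations
import numpy as np

def criterion_two_terms(X, y, classes, solve_radius_qp, solve_qp_2_and_3, sigma):
    K, _ = rbf_kernel(X, sigma)
    beta_all = solve_radius_qp(K)                          # problem (2) over all of D
    R2_c = beta_all @ np.diag(K) - beta_all @ K @ beta_all     # eq. (24)
    n = len(y)
    margin2 = 0.0
    for (i, j) in combinations(classes, 2):
        mask = (y == i) | (y == j)
        y_ij = np.where(y[mask] == i, 1.0, -1.0)
        _, alpha = solve_qp_2_and_3(X[mask], y_ij, sigma)
        K_ij, _ = rbf_kernel(X[mask], sigma)
        w2 = 2.0 * (alpha.sum() - 0.5 * (alpha * y_ij) @ K_ij @ (alpha * y_ij))  # ||w_ij||^2
        P_i, P_j = np.sum(y == i) / n, np.sum(y == j) / n  # class priors
        margin2 += P_i * P_j / w2                          # term of eq. (26)
    return R2_c, margin2
```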
