Neural Processing Letters 9: 293–300, 1999.

Least Squares Support Vector Machine Classifiers

J.A.K. SUYKENS and J. VANDEWALLE

Katholieke Universiteit Leuven, Department of Electrical Engineering, ESAT-SISTA, Kardinaal Mercierlaan 94, B–3001 Leuven (Heverlee), Belgium, e-mail: johan.suykens@esat.kuleuven.ac.be

Abstract. In this letter we discuss a least squares version for support vector machine (SVM) classifiers. Due to equality type constraints in the formulation, the solution follows from solving a set of linear equations, instead of quadratic programming for classical SVM’s. The approach is illustrated on a two-spiral benchmark classification problem.

Key words: classification, support vector machines, linear least squares, radial basis function kernel

Abbreviations: SVM – Support Vector Machines; VC – Vapnik-Chervonenkis; RBF – Radial Basis Function

1. Introduction

Recently, support vector machines (Vapnik, 1995; Vapnik, 1998a; Vapnik, 1998b) have been introduced for solving pattern recognition problems. In this method one maps the data into a higher dimensional space and constructs an optimal separating hyperplane in this space. This basically involves solving a quadratic programming problem, while gradient based training methods for neural network architectures on the other hand suffer from the existence of many local minima (Bishop, 1995; Cherkassky & Mulier, 1998; Haykin, 1994; Zurada, 1992). Kernel functions and parameters are chosen such that a bound on the VC dimension is minimized. Later, the support vector method was extended for solving function estimation problems. For this purpose Vapnik’s epsilon insensitive loss function and Huber’s loss function have been employed. Besides the linear case, SVM’s based on polynomials, splines, radial basis function networks and multilayer perceptrons have been successfully applied. Being based on the structural risk minimization principle and capacity concept with pure combinatorial definitions, the quality and complexity of the SVM solution do not depend directly on the dimensionality of the input space (Vapnik, 1995; Vapnik, 1998a; Vapnik, 1998b).

In this paper we formulate a least squares version of SVM’s for classification problems with two classes. For the function estimation problem a support vector interpretation of ridge regression (Golub & Van Loan, 1989) has been given in (Saunders et al., 1998), which considers equality type constraints instead of inequalities from the classical SVM approach. Here, we also consider equality constraints for the classification problem with a formulation in least squares sense. As a result the solution follows directly from solving a set of linear equations, instead of quadratic programming. While in classical SVM’s many support values are zero (nonzero values correspond to support vectors), in least squares SVM’s the support values are proportional to the errors.

This paper is organized as follows. In Section 2 we review some basic work about support vector machine classifiers. In Section 3 we discuss the least squares support vector machine classifiers. In Section 4 examples are given to illustrate the support values and the performance on a two-spiral benchmark problem.

2. Support Vector Machines for Classification

In this Section we briefly review some basic work on support vector machines (SVM) for classification problems. For all further details we refer to (Vapnik, 1995; Vapnik, 1998a; Vapnik, 1998b).

Given a training set of N data points $\{x_k, y_k\}_{k=1}^{N}$, where $x_k \in \mathbb{R}^n$ is the kth input pattern and $y_k \in \mathbb{R}$ is the kth output pattern, the support vector method approach aims at constructing a classifier of the form:

$$
y(x) = \mathrm{sign}\left[\sum_{k=1}^{N} \alpha_k y_k \psi(x, x_k) + b\right], \tag{1}
$$

where $\alpha_k$ are positive real constants and $b$ is a real constant. For $\psi(\cdot, \cdot)$ one typically has the following choices: $\psi(x, x_k) = x_k^T x$ (linear SVM); $\psi(x, x_k) = (x_k^T x + 1)^d$ (polynomial SVM of degree $d$); $\psi(x, x_k) = \exp\{-\|x - x_k\|_2^2/\sigma^2\}$ (RBF SVM); $\psi(x, x_k) = \tanh[\kappa\, x_k^T x + \theta]$ (two layer neural SVM), where $\sigma$, $\kappa$ and $\theta$ are constants.
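The four kernel choices listed above can be written down directly. The following is a minimal Python sketch using only the standard library; the function names and the default parameter values are illustrative assumptions, not part of the paper.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def linear_kernel(x, xk):
    # psi(x, x_k) = x_k^T x
    return dot(xk, x)

def polynomial_kernel(x, xk, d=2):
    # psi(x, x_k) = (x_k^T x + 1)^d
    return (dot(xk, x) + 1.0) ** d

def rbf_kernel(x, xk, sigma=1.0):
    # psi(x, x_k) = exp(-||x - x_k||_2^2 / sigma^2)
    sq_dist = sum((xi - xki) ** 2 for xi, xki in zip(x, xk))
    return math.exp(-sq_dist / sigma ** 2)

def mlp_kernel(x, xk, kappa=1.0, theta=0.0):
    # psi(x, x_k) = tanh(kappa * x_k^T x + theta)
    # Mercer's condition holds only for certain kappa, theta.
    return math.tanh(kappa * dot(xk, x) + theta)
```

Each function takes two input patterns as plain sequences of numbers and returns the scalar kernel value.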

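The classifier is constructed as follows. One assumes that

$$
\begin{cases}
w^T \varphi(x_k) + b \ge 1, & \text{if } y_k = +1,\\
w^T \varphi(x_k) + b \le -1, & \text{if } y_k = -1,
\end{cases} \tag{2}
$$

which is equivalent to

$$
y_k[w^T \varphi(x_k) + b] \ge 1, \quad k = 1, ..., N, \tag{3}
$$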

where $\varphi(\cdot)$ is a nonlinear function which maps the input space into a higher dimensional space. However, this function is not explicitly constructed. In order to have the possibility to violate (3), in case a separating hyperplane in this higher dimensional space does not exist, variables $\xi_k$ are introduced such that

$$
\begin{cases}
y_k[w^T \varphi(x_k) + b] \ge 1 - \xi_k, & k = 1, ..., N,\\
\xi_k \ge 0, & k = 1, ..., N.
\end{cases} \tag{4}
$$

According to the structural risk minimization principle, the risk bound is minimized by formulating the optimization problem

$$
\min_{w, \xi_k} J_1(w, \xi_k) = \frac{1}{2} w^T w + c \sum_{k=1}^{N} \xi_k \tag{5}
$$

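subject to (4). Therefore, one constructs the Lagrangian

$$
\mathcal{L}_1(w, b, \xi_k; \alpha_k, \nu_k) = J_1(w, \xi_k) - \sum_{k=1}^{N} \alpha_k \{ y_k[w^T \varphi(x_k) + b] - 1 + \xi_k \} - \sum_{k=1}^{N} \nu_k \xi_k \tag{6}
$$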

by introducing Lagrange multipliers $\alpha_k \ge 0$, $\nu_k \ge 0$ ($k = 1, ..., N$). The solution is given by the saddle point of the Lagrangian by computing

$$
\max_{\alpha_k, \nu_k} \min_{w, b, \xi_k} \mathcal{L}_1(w, b, \xi_k; \alpha_k, \nu_k). \tag{7}
$$

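One obtains

$$
\begin{cases}
\dfrac{\partial \mathcal{L}_1}{\partial w} = 0 \;\rightarrow\; w = \sum_{k=1}^{N} \alpha_k y_k \varphi(x_k),\\[2mm]
\dfrac{\partial \mathcal{L}_1}{\partial b} = 0 \;\rightarrow\; \sum_{k=1}^{N} \alpha_k y_k = 0,\\[2mm]
\dfrac{\partial \mathcal{L}_1}{\partial \xi_k} = 0 \;\rightarrow\; 0 \le \alpha_k \le c, \quad k = 1, ..., N,
\end{cases} \tag{8}
$$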

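which leads to the solution of the following quadratic programming problem

$$
\max_{\alpha_k} Q_1(\alpha_k; \varphi(x_k)) = -\frac{1}{2} \sum_{k,l=1}^{N} y_k y_l\, \varphi(x_k)^T \varphi(x_l)\, \alpha_k \alpha_l + \sum_{k=1}^{N} \alpha_k, \tag{9}
$$

such that

$$
\sum_{k=1}^{N} \alpha_k y_k = 0, \quad 0 \le \alpha_k \le c, \quad k = 1, ..., N.
$$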
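The step from the Lagrangian (6) and the conditions (8) to the dual problem (9) can be spelled out as follows. This is a sketch of the standard substitution, written with the symbols defined above; the intermediate identity $c - \alpha_k - \nu_k = 0$ is the explicit form of the third condition in (8).

```latex
\begin{aligned}
\frac{\partial \mathcal{L}_1}{\partial \xi_k} = 0
  &\;\Rightarrow\; c - \alpha_k - \nu_k = 0
   \;\Rightarrow\; 0 \le \alpha_k \le c
   \quad (\text{since } \alpha_k \ge 0,\ \nu_k \ge 0), \\
\mathcal{L}_1
  &= \tfrac{1}{2} w^T w
     - \sum_{k=1}^{N} \alpha_k y_k\, w^T \varphi(x_k)
     - b \sum_{k=1}^{N} \alpha_k y_k
     + \sum_{k=1}^{N} \alpha_k
     + \sum_{k=1}^{N} (c - \alpha_k - \nu_k)\, \xi_k \\
  &= \tfrac{1}{2} w^T w - w^T w + \sum_{k=1}^{N} \alpha_k
     \qquad \Big(\text{using } w = \sum_{k} \alpha_k y_k \varphi(x_k),\ \sum_{k} \alpha_k y_k = 0\Big) \\
  &= -\tfrac{1}{2} \sum_{k,l=1}^{N} y_k y_l\, \varphi(x_k)^T \varphi(x_l)\, \alpha_k \alpha_l
     + \sum_{k=1}^{N} \alpha_k
   \;=\; Q_1(\alpha_k; \varphi(x_k)).
\end{aligned}
```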

The function $\varphi(x_k)$ in (9) is then related to $\psi(x, x_k)$ by imposing

$$
\varphi(x)^T \varphi(x_k) = \psi(x, x_k), \tag{10}
$$

which is motivated by Mercer’s Theorem. Note that for the two layer neural SVM, Mercer’s condition only holds for certain parameter values of $\kappa$ and $\theta$.

The classifier (1) is designed by solving

$$
\max_{\alpha_k} Q_1(\alpha_k; \psi(x_k, x_l)) = -\frac{1}{2} \sum_{k,l=1}^{N} y_k y_l\, \psi(x_k, x_l)\, \alpha_k \alpha_l + \sum_{k=1}^{N} \alpha_k, \tag{11}
$$

subject to the constraints in (9). One does not have to calculate $w$ nor $\varphi(x_k)$ in order to determine the decision surface. Because the matrix associated with this quadratic programming problem is not indefinite, the solution to (11) will be global (Fletcher, 1987).
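To make the objective in (11) and the constraints in (9) concrete, the following Python sketch evaluates $Q_1$ for a given multiplier vector and checks feasibility. It does not solve the quadratic program; the function names, the tolerance, and the two-point example are assumptions made for illustration only.

```python
def dual_objective(alpha, y, K):
    """Q_1 in (11): -1/2 sum_{k,l} y_k y_l K[k][l] alpha_k alpha_l + sum_k alpha_k.

    K[k][l] holds the kernel value psi(x_k, x_l).
    """
    n = len(alpha)
    quad = sum(y[k] * y[l] * K[k][l] * alpha[k] * alpha[l]
               for k in range(n) for l in range(n))
    return -0.5 * quad + sum(alpha)

def is_feasible(alpha, y, c, tol=1e-9):
    """Constraints of (9): sum_k alpha_k y_k = 0 and 0 <= alpha_k <= c."""
    balanced = abs(sum(a * yk for a, yk in zip(alpha, y))) <= tol
    boxed = all(-tol <= a <= c + tol for a in alpha)
    return balanced and boxed

# Hypothetical example: inputs x_1 = 1, x_2 = -1 with a linear kernel.
K = [[1.0, -1.0], [-1.0, 1.0]]
y = [1, -1]
value = dual_objective([0.5, 0.5], y, K)
```

For this two-point example the equality constraint forces $\alpha_1 = \alpha_2$, so the objective reduces to a concave function of one variable, which is consistent with the remark that the solution of (11) is global.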

Furthermore, one can show that hyperplanes (3) satisfying the constraint $\|w\|_2 \le a$ have a VC-dimension $h$ which is bounded by

$$
h \le \min([r^2 a^2], n) + 1, \tag{12}
$$

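where $[\cdot]$ denotes the integer part and $r$ is the radius of the smallest ball containing the points $\varphi(x_1), ..., \varphi(x_N)$. Finding this ball is done by defining the Lagrangian

$$
\mathcal{L}_2(r, q, \lambda_k) = r^2 - \sum_{k=1}^{N} \lambda_k \left( r^2 - \|\varphi(x_k) - q\|_2^2 \right), \tag{13}
$$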

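where $q$ is the center of the ball and $\lambda_k$ are positive Lagrange multipliers. In a similar way as for (5) one finds that the center is equal to $q = \sum_{k} \lambda_k \varphi(x_k)$, where the Lagrange multipliers follow from

$$
\max_{\lambda_k} Q_2(\lambda_k; \varphi(x_k)) = -\sum_{k,l=1}^{N} \varphi(x_k)^T \varphi(x_l)\, \lambda_k \lambda_l + \sum_{k=1}^{N} \lambda_k\, \varphi(x_k)^T \varphi(x_k), \tag{14}
$$

such that

$$
\sum_{k=1}^{N} \lambda_k = 1, \quad \lambda_k \ge 0, \quad k = 1, ..., N.
$$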

Based on (10), $Q_2$ can also be expressed in terms of $\psi(x_k, x_l)$. Finally, one selects a support vector machine with minimal VC dimension by solving (11) and computing (12) from (14).
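As a small numerical illustration of (12) and (14), the following Python sketch evaluates $Q_2$ from a kernel matrix and inserts a squared radius into the bound (12). It assumes that $r^2$ is taken as the value of $Q_2$ at the maximizing $\lambda_k$ (the usual dual relation for the smallest enclosing ball); the function names and the two-point example are illustrative and not taken from the paper.

```python
import math

def q2_objective(lam, K):
    """Q_2 in (14): -sum_{k,l} K[k][l] lam_k lam_l + sum_k lam_k K[k][k].

    K[k][l] holds phi(x_k)^T phi(x_l) = psi(x_k, x_l), using (10).
    """
    n = len(lam)
    quad = sum(K[k][l] * lam[k] * lam[l] for k in range(n) for l in range(n))
    return -quad + sum(lam[k] * K[k][k] for k in range(n))

def vc_bound(r_sq, a, n):
    """Right-hand side of (12): min([r^2 a^2], n) + 1, [.] being the integer part."""
    return min(math.floor(r_sq * a * a), n) + 1

# Hypothetical example: two points mapped to +1 and -1 on a line.
K = [[1.0, -1.0], [-1.0, 1.0]]
r_sq = q2_objective([0.5, 0.5], K)   # lambda = (1/2, 1/2) puts the center q at 0
bound = vc_bound(r_sq, a=2.0, n=1)
```

In this example the smallest ball is centered at the origin with radius 1, and placing all weight on a single point gives a strictly smaller value of $Q_2$.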

3. Least Squares Support Vector Machines

Here we introduce a least squares version to the SVM classifier by formulating the classification problem as

$$
\min_{w, b, e} J_3(w, b, e) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{k=1}^{N} e_k^2, \tag{15}
$$

subject to the equality constraints

$$
y_k[w^T \varphi(x_k) + b] = 1 - e_k, \quad k = 1, ..., N. \tag{16}
$$

One defines the Lagrangian

$$
\mathcal{L}_3(w, b, e; \alpha) = J_3(w, b, e) - \sum_{k=1}^{N} \alpha_k \{ y_k[w^T \varphi(x_k) + b] - 1 + e_k \}, \tag{17}
$$

where $\alpha_k$ are Lagrange multipliers (which can be either positive or negative now due to the equality constraints as follows from the Kuhn-Tucker conditions (Fletcher, 1987)).

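The conditions for optimality

$$
\begin{cases}
\dfrac{\partial \mathcal{L}_3}{\partial w} = 0 \;\rightarrow\; w = \sum_{k=1}^{N} \alpha_k y_k \varphi(x_k),\\[2mm]
\dfrac{\partial \mathcal{L}_3}{\partial b} = 0 \;\rightarrow\; \sum_{k=1}^{N} \alpha_k y_k = 0,\\[2mm]
\dfrac{\partial \mathcal{L}_3}{\partial e_k} = 0 \;\rightarrow\; \alpha_k = \gamma e_k, \quad k = 1, ..., N,\\[2mm]
\dfrac{\partial \mathcal{L}_3}{\partial \alpha_k} = 0 \;\rightarrow\; y_k[w^T \varphi(x_k) + b] - 1 + e_k = 0, \quad k = 1, ..., N
\end{cases} \tag{18}
$$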

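can be written immediately as the solution to the following set of linear equations (Fletcher, 1987)

$$
\begin{bmatrix}
I & 0 & 0 & -Z^T \\
0 & 0 & 0 & -Y^T \\
0 & 0 & \gamma I & -I \\
Z & Y & I & 0
\end{bmatrix}
\begin{bmatrix} w \\ b \\ e \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \\ \vec{1} \end{bmatrix}, \tag{19}
$$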

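where $Z = [\varphi(x_1)^T y_1; ...; \varphi(x_N)^T y_N]$, $Y = [y_1; ...; y_N]$, $\vec{1} = [1; ...; 1]$, $e = [e_1; ...; e_N]$, $\alpha = [\alpha_1; ...; \alpha_N]$. The solution is also given by

$$
\begin{bmatrix}
0 & -Y^T \\
Y & ZZ^T + \gamma^{-1} I
\end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ \vec{1} \end{bmatrix}. \tag{20}
$$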
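The reduction from (19) to (20) amounts to eliminating $w$ and $e$. A sketch of the elimination, reading (19) row by row with the definitions of $Z$ and $Y$ given above:

```latex
\begin{aligned}
\text{row 1:}\quad & w - Z^T \alpha = 0 &&\Rightarrow\; w = Z^T \alpha, \\
\text{row 2:}\quad & -Y^T \alpha = 0, \\
\text{row 3:}\quad & \gamma e - \alpha = 0 &&\Rightarrow\; e = \gamma^{-1} \alpha, \\
\text{row 4:}\quad & Z w + Y b + e = \vec{1}
  &&\Rightarrow\; Y b + \left( Z Z^T + \gamma^{-1} I \right) \alpha = \vec{1}.
\end{aligned}
```

Rows 2 and 4, after the substitution, are exactly the two block rows of (20).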

Mercer’s condition can be applied again to the matrix $\Omega = ZZ^T$, where

$$
\Omega_{kl} = y_k y_l\, \varphi(x_k)^T \varphi(x_l) = y_k y_l\, \psi(x_k, x_l). \tag{21}
$$

Hence, the classifier (1) is found by solving the linear set of Equations (20)–(21) instead of quadratic programming. The parameters of the kernels such as $\sigma$ for the RBF kernel can be optimally chosen according to (12). The support values $\alpha_k$ are proportional to the errors at the data points (18), while in the case of (14) most values are equal to zero. Hence, one could rather speak of a support value spectrum in the least squares case.
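The procedure of this Section can be sketched in a few lines of Python: build $\Omega$ from (21), solve the linear system (20) for $b$ and $\alpha$, and classify with (1). The sketch below uses only the standard library; the Gaussian elimination routine, the function names, and the default values of `gamma` and `sigma` are assumptions made for illustration, and a linear algebra library would normally replace `solve`.

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """psi(x, z) = exp(-||x - z||_2^2 / sigma^2), the RBF choice of Section 2."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / sigma ** 2)

def solve(A, rhs):
    """Solve A u = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [list(row) + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for j in range(col, n + 1):
                M[r][j] -= f * M[col][j]
    u = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * u[j] for j in range(i + 1, n))
        u[i] = (M[i][n] - s) / M[i][i]
    return u

def lssvm_train(X, y, gamma=1.0, kernel=rbf_kernel):
    """Assemble and solve system (20); returns (b, alpha)."""
    n = len(X)
    # Omega_kl = y_k y_l psi(x_k, x_l), equation (21)
    omega = [[y[k] * y[l] * kernel(X[k], X[l]) for l in range(n)]
             for k in range(n)]
    A = [[0.0] + [-yk for yk in y]]        # first block row: [0, -Y^T]
    for k in range(n):
        row = [y[k]] + omega[k]            # block row: [Y, Omega + I / gamma]
        row[k + 1] += 1.0 / gamma
        A.append(row)
    sol = solve(A, [0.0] + [1.0] * n)      # right-hand side [0; 1, ..., 1]
    return sol[0], sol[1:]

def lssvm_decision(x, X, y, b, alpha, kernel=rbf_kernel):
    """Argument of sign[.] in (1): sum_k alpha_k y_k psi(x, x_k) + b."""
    return sum(a * yk * kernel(x, xk) for a, yk, xk in zip(alpha, y, X)) + b

def lssvm_predict(x, X, y, b, alpha, kernel=rbf_kernel):
    """Classifier (1); a zero decision value is assigned to class +1."""
    return 1 if lssvm_decision(x, X, y, b, alpha, kernel) >= 0 else -1
```

A call such as `b, alpha = lssvm_train(X, y, gamma=1.0)` on a list of input patterns `X` and labels `y` in {+1, -1} returns the bias and the support values. Since every training point enters the system, all $\alpha_k$ are in general nonzero, in line with the support value spectrum described above.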

4. Examples

In a first example (Figure 1) we illustrate the support values for a linearly separable problem of two classes in a two dimensional space. The size of the circles indicated at the training data is chosen proportionally to the absolute values of the support values. A linear SVM has been taken with $\gamma = 1$. Clearly, points located close to and far from the decision line have the largest support values. This is different from SVM’s based on inequality constraints, where only points that are near the decision line have nonzero support values. This can be understood from the fact that the signed distance from a point $x_k$ to the decision line is equal to $(w^T x_k + b)/\|w\| = (1 - e_k)/(y_k \|w\|)$ and $\alpha_k = \gamma e_k$ in the least squares SVM case.
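The relation between signed distance and support value in the linear case can be checked numerically. The following Python sketch assumes that $w$ and $b$ are already known; the values used in the example are hypothetical and are not the solution shown in Figure 1.

```python
import math

def signed_distance(w, b, x):
    """(w^T x + b) / ||w||: signed distance from x to the decision line."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w

def support_value(w, b, x, y, gamma=1.0):
    """alpha_k = gamma * e_k with e_k = 1 - y_k (w^T x_k + b), from (16) and (18)."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    e = 1.0 - y * signed_distance(w, b, x) * norm_w
    return gamma * e

# Hypothetical decision line x_1 = 0, i.e. w = (1, 0), b = 0.
w, b = [1.0, 0.0], 0.0
near = support_value(w, b, [0.2, 0.0], y=1)     # close to the line
on_margin = support_value(w, b, [1.0, 0.0], y=1)  # at distance 1 / ||w||
far = support_value(w, b, [3.0, 0.0], y=1)      # far from the line
```

Under these assumed values a point at distance $1/\|w\|$ on the correct side has zero error and hence zero support value, while points closer to the line get positive values and points farther away get negative values, so both groups appear with large circles when absolute values are plotted.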

In a second example (Figure 2) we illustrate a least squares support vector machine RBF classifier on a two-spiral benchmark problem. The training data are shown on Figure 2 with two classes indicated by ’o’ and ’∗’ (360 points with 180 for each class) in a two dimensional input space. Points in between the training data located on the two spirals are often considered as test data for this problem but are not shown on the figure. The excellent generalization performance is clear from the decision boundaries shown on the figures. In this case $\sigma = 1$ and $\gamma = 1$ were chosen as parameters. Other methods which have been applied to the two-spiral

References

The Nature of Statistical Learning Theory

Matrix computations

Neural Networks: A Comprehensive Foundation

Statistical learning theory

Neural networks for pattern recognition
