Proceedings ArticleDOI

A sequential approach for multi-class discriminant analysis with kernels

17 May 2004 - Vol. 5, pp. 453-456
TL;DR: This work presents a sequential algorithm for GDA that avoids the computational and storage problems arising when nonlinear discriminant analysis with kernel functions is applied to large numbers of datapoints.
Abstract: Linear discriminant analysis (LDA) is a standard statistical tool for data analysis. Recently, a method called generalized discriminant analysis (GDA) has been developed to deal with nonlinear discriminant analysis using kernel functions. Difficulties for the GDA method can arise in the form of both computational complexity and storage requirements. We present a sequential algorithm for GDA avoiding these problems when one deals with large numbers of datapoints.

Summary (2 min read)

1. INTRODUCTION

  • Fisher linear discriminant analysis (LDA) is a classical multivariate technique both for dimension reduction and classification.
  • This strategy yields low-dimensional representations by keeping the first variates, those corresponding to the largest eigenvalues, so that most of the information in the data is conserved.
  • Kernel-based methods are categorized into nonlinear transformation techniques for representation and for classification.
  • Recently, a powerful method for obtaining a nonlinear extension of the LDA method has been proposed and referred to as the Generalized Discriminant Analysis (GDA) [1].
  • Experiments showing the validity of their approach are proposed in Section 5.

2. SCATTER MATRICES FOR SEPARABILITY CRITERIA

  • For convenience and intelligibility, the authors show hereafter all scatter matrices in the feature space F .
  • This reflects the notion that performing a nonlinear data transformation into some specific high dimensional feature spaces increases the probability of having linearly separable classes within the transformed space.
  • The mixture scatter matrix in F is the covariance matrix of all samples regardless of their class assignments.

3. GDA METHOD IN FEATURE SPACE

  • The GDA method consists in finding the transformation matrix W that in some sense maximizes the ratio of the between-class scatter to the within-class scatter.
  • The columns of an optimal W are the generalized eigenvectors that correspond to the largest eigenvalues in $Bw_i = \lambda_i S w_i$ (7). By observing (7), the authors conclude that deriving the GDA solutions may be a computationally intractable problem, since they have to work in F, which may be a very high, or even infinite, dimensional space.
  • Note that the largest eigenvalue of (7) leads to the maximum quotient of the inertia [1], $\lambda = \frac{w^t B w}{w^t S w}$.
  • Here, κ can be any kernel that satisfies the Mercer condition.
  • Memory and complexity problems can arise for the GDA method when dealing with large numbers of patterns, since an eigenvector decomposition of K must be performed.

4. SEQUENTIAL GDA METHOD

  • Thus the eigenvector decomposition may raise the same storage and complexity problems as the standard GDA.
  • Note that calculating (19) may be a computationally intractable problem.
  • Figure 2(b) shows the projection of all examples on the first two axes using the sequential approach.
  • Then, the same algorithm described at the beginning of this section can be applied with $K^{new}$.

5. EXPERIMENTS

  • The Iris data consist of 150 four-dimensional examples from three classes [1] (each class consists of 50 examples).
  • One class is linearly separable from the two other, non-linearly separable, classes.
  • Figure 1(a) shows the projection of the three classes on the first axis, obtained with the sequential GDA.
  • The first axis seems to be sufficient to separate the data.

6. CONCLUSION

  • The authors have presented a sequential approach to calculate nonlinear features based on the GDA method proposed by [1].
  • The key advantage of the proposed sequential GDA algorithm is that it does not need the inversion, or even the storage, of the Gram matrix of size (n, n).
  • The weakness of their approach is that the complexity increases with the number of axes to be found.


A SEQUENTIAL APPROACH FOR MULTI-CLASS DISCRIMINANT ANALYSIS WITH KERNELS
Fahed ABDALLAH, Cédric RICHARD, Régis LENGELLE
Laboratoire LM2S, Université de Technologie de Troyes
B.P. 2060, F-10010 Troyes Cedex, FRANCE
fahed.abdallah@utt.fr, cedric.richard@utt.fr, regis.lengelle@utt.fr
ABSTRACT
Linear discriminant analysis (LDA) is a standard statistical tool for data analysis. Recently, a method called generalized discriminant analysis (GDA) has been developed to deal with nonlinear discriminant analysis using kernel functions. Difficulties for the GDA method can arise both in the form of computational complexity and storage requirements. In this paper, we present a sequential algorithm for GDA avoiding these problems when one deals with large numbers of datapoints.
1. INTRODUCTION
Fisher linear discriminant analysis (LDA) is a classical multivariate technique both for dimension reduction and classification. It is based on a projection of the data vectors $\{x_i\}_{i=1,\dots,n}$, which belong to $c$ different classes, into a $(c-1)$-dimensional space in such a way that the quotient between the inter-class inertia and the intra-class inertia is maximized [3]. The method consists of an eigenvalue resolution which leads to the so-called canonical variates [2] that contain the whole class-specific information in the $(c-1)$-dimensional space. This strategy allows low-dimensional representations by using the first variates, corresponding to the largest eigenvalues, so that the major part of the information in the data is conserved. It can also be used as a multi-class classification technique by partitioning the projected space into regions that define class membership. The LDA method for classification leads to linear decision boundaries and hence fails for nonlinear problems. Several attempts have been made to incorporate nonlinearity into the original algorithm [2].
In the last few years there have been very significant developments in classification algorithms based on kernels. Kernel-based methods are categorized into nonlinear transformation techniques for representation and for classification. Support Vector Machines (SVMs) were introduced and first applied as alternatives to multilayer neural networks [4]. The high generalization ability provided by these learning machines has inspired recent work in discriminant analysis and feature extraction. Recently, a powerful method for obtaining a nonlinear extension of the LDA method has been proposed and referred to as the Generalized Discriminant Analysis (GDA) [1]. An equivalent approach to the GDA can be found in [2]. The main idea is to map the data into a convenient higher dimensional feature space F and then to perform the LDA algorithm in F instead of the original input space. Fortunately, the LDA algorithm can be reformulated in dot-product form in F, and there is a highly effective trick to compute scalar products in some feature spaces using kernel functions which satisfy the Mercer condition [4]. Hence, the GDA method can be expressed efficiently as a linear algebraic formula in the transformed space using kernel operators. Nevertheless, this formulation implies that we have to manipulate the Gram matrix K of size (n, n) containing all dot products in F. This can cause problems when the number of patterns is very large. Our goal in this paper is to present a sequential method based on a gradient descent strategy that avoids the need to invert or even store the kernel matrix K. The presented method allows us to calculate a discriminant axis at every stage after manipulating the elements of the matrix K in a way that removes the contribution of the discriminant axes calculated in the preceding stages.
In this paper, we first introduce scatter matrices, which are very useful for designing separability criteria. In Section 3 we give a brief review of the GDA, which is an efficient extension of the LDA after mapping the data into the high dimensional feature space F. In Section 4, the sequential GDA is reformulated with a gradient descent procedure which avoids the manipulation of the kernel matrix K of size (n, n). Experiments showing the validity of our approach are proposed in Section 5. Brief conclusions are provided in Section 6.
2. SCATTER MATRICES FOR SEPARABILITY CRITERIA
Recall that we consider a multi-class classification problem with $d$-dimensional patterns $\{x_i\}_{i=1,\dots,n}$ belonging to $c$ different classes $\{C_l\}_{l=1,\dots,c}$. Let $n_l$ be the number of patterns of class $l$, thus $\sum_{l=1}^{c} n_l = n$. For convenience and intelligibility, we show hereafter all scatter matrices in the feature space F.
Every kernel-based algorithm starts with the following idea: via a nonlinear mapping
$$\phi : \mathbb{R}^d \to F, \quad x \mapsto \phi(x),$$
the patterns are mapped into a high dimensional feature space F. LDA can then be performed in F on the set $\{\phi(x_i)\}_{i=1,\dots,n}$. This reflects the notion that performing a nonlinear data transformation into some specific high dimensional feature space increases the probability of having linearly separable classes within the transformed space.
The between-class scatter matrix is the covariance matrix of the class centers in F and is defined as
$$B = \frac{1}{n} \sum_{l=1}^{c} n_l\,(m_l^\phi - m^\phi)(m_l^\phi - m^\phi)^t, \qquad (1)$$
where
$$m^\phi = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i)$$
is the mean calculated over all the data in F and
$$m_l^\phi = \frac{1}{n_l} \sum_{x \in C_l} \phi(x)$$
is the mean in F of the data belonging to class $C_l$. The within-class scatter matrix in F is the scatter of all samples around their respective class means,
$$V = \frac{1}{n} \sum_{l=1}^{c} \sum_{x \in C_l} (\phi(x) - m_l^\phi)(\phi(x) - m_l^\phi)^t. \qquad (2)$$
The mixture scatter matrix in F is the covariance matrix of all samples regardless of their class assignments. It is defined as
$$S = \frac{1}{n} \sum_{i=1}^{n} (\phi(x_i) - m^\phi)(\phi(x_i) - m^\phi)^t. \qquad (3)$$
Note that B represents the inter-class inertia, V corresponds to the intra-class inertia and S is the total inertia of the data in F. The three matrices are related by the MANOVA equation $S = V + B$. We show in the next section how to use these matrices in order to get a judicious criterion of separability between different classes.
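When F is a high- or infinite-dimensional feature space, the scatter matrices above cannot be formed explicitly. The following sketch therefore illustrates equations (1)-(3) in input space, i.e. with $\phi$ taken as the identity map; the NumPy implementation and the function name are illustrative assumptions, not material from the paper.

```python
import numpy as np

def scatter_matrices(X, y):
    """Between-class (B), within-class (V) and mixture (S) scatter matrices of
    Eqs. (1)-(3), computed with phi = identity (i.e. in input space)."""
    n, d = X.shape
    m = X.mean(axis=0)                      # global mean, plays the role of m^phi
    B = np.zeros((d, d))
    V = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                # class mean, plays the role of m_l^phi
        diff = (mc - m)[:, None]
        B += len(Xc) * (diff @ diff.T)      # Eq. (1), 1/n factor applied below
        Zc = Xc - mc
        V += Zc.T @ Zc                      # Eq. (2), 1/n factor applied below
    B /= n
    V /= n
    Z = X - m
    S = Z.T @ Z / n                         # Eq. (3)
    assert np.allclose(S, B + V)            # MANOVA relation S = V + B
    return B, V, S
```

The closing assertion checks the MANOVA relation $S = V + B$ stated above.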
3. GDA METHOD IN FEATURE SPACE
The projection of a pattern $\phi(x)$ from the feature space F to a $(c-1)$-dimensional space is performed by $(c-1)$ discriminant functions
$$z(i) = w_i^t\,\phi(x), \quad i = 1, \dots, c-1. \qquad (4)$$
This can be reduced to a single matrix equation
$$z = W^t \phi(x), \qquad (5)$$
where $z$ has components $z(i)$ and $W$ is a $(D, c-1)$ matrix having the $w_i$ as columns, with $i \in \{1, \dots, c-1\}$. $D$ represents the dimension of the feature space F, which can be infinite.
The GDA method consists in finding the transformation matrix W that in some sense maximizes the ratio of the between-class scatter to the within-class scatter. A judicious criterion function is the ratio [1, 3, 5]
$$J(W) = \frac{|W^t B W|}{|W^t V W|}, \qquad (6)$$
where $|X|$ indicates the determinant of a matrix $X$. The columns of an optimal W are the generalized eigenvectors that correspond to the largest eigenvalues in¹
$$B w_i = \lambda_i\, S w_i. \qquad (7)$$
By observing (7), we can conclude that deriving the GDA solutions may be a computationally intractable problem, since we have to work in F, which may be a very high, or even infinite, dimensional space. However, by using the theory of reproducing kernels [4, 6, 7, 8, 9], such a problem can be solved without explicitly mapping the data to the feature space F. Hence, any vector $w \in F$ of W must lie in the span of all training samples in F. Therefore w can be written as follows:
$$w = \sum_{i=1}^{n} \alpha(i)\,\phi(x_i), \qquad (8)$$
where the $\alpha(i)$ denote the components of a dual vector $\alpha$ of size $n$. Note that the largest eigenvalue of (7) leads to the maximum quotient of the inertia [1]
$$\lambda = \frac{w^t B w}{w^t S w}. \qquad (9)$$
It can be shown that (9) is equivalent to [1]
$$\lambda = \frac{\alpha^t K L K \alpha}{\alpha^t K K \alpha}, \qquad (10)$$
where K is the Gram matrix whose components correspond to the inner products of $x_i$ and $x_j$ in F, $\kappa(x_i, x_j) = \phi(x_i)^t \phi(x_j)$. Here, $\kappa$ can be any kernel that satisfies the Mercer condition. L is a $(n, n)$ block diagonal matrix
$$L = (L_l)_{l=1,\dots,c},$$
where $L_l$ is a $(n_l, n_l)$ matrix with all terms equal to $\frac{1}{n_l}$. The resolution of the eigenvector system (10) requires an eigenvector decomposition of the matrix K, $K = P \Gamma P^t$. It can be shown that the solution $\alpha$ is of the form [1]
$$\alpha = P\,\Gamma^{-1} \beta, \qquad (11)$$
where the $\beta$ can be obtained by maximizing $\lambda$ in
$$\lambda \beta = P^t L P\, \beta. \qquad (12)$$
Note that the coefficients $\alpha$ should be divided by $\sqrt{\alpha^t K \alpha}$ in order to get a normalized w such that $w^t w = 1$.
The $i$-th component of the projected pattern $\phi(x)$ on a vector $w_i$ is given by using (4) and (8):
$$z(i) = \sum_{j=1}^{n} \alpha_i(j)\,\kappa(x, x_j) = \alpha_i^t\,\tilde{\kappa}(x), \qquad (13)$$
where $\alpha_i$ is the dual vector corresponding to $w_i$ and the vector $\tilde{\kappa}(x) = (\kappa(x, x_1), \dots, \kappa(x, x_n))^t$. Memory and complexity problems can arise for the GDA method when we deal with large numbers of patterns, since we have to perform an eigenvector decomposition of K. In the next section, we present the sequential GDA, which is less memory-costly than the previous method since we do not need to manipulate the $(n, n)$ matrix K.

¹ $B w_i = \rho_i V w_i$ and $B w_i = \lambda_i S w_i$ are equivalent eigenvalue equations with identical solutions $w_i$. See [2] for a demonstration.
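As a hedged illustration of the batch GDA solution just described, the sketch below builds the Gram matrix K and the block-diagonal matrix L, performs the eigendecomposition $K = P\Gamma P^t$, solves (12) for $\beta$, recovers $\alpha$ through (11), and normalizes it with $\sqrt{\alpha^t K \alpha}$. The Gaussian kernel, the function names and the eigenvalue threshold are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K with kappa(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def batch_gda(X, y, sigma=1.0, n_axes=None):
    """Dual coefficients alpha (one column per discriminant axis), via Eqs. (10)-(12)."""
    n = len(y)
    classes = np.unique(y)
    n_axes = n_axes or (len(classes) - 1)

    K = gaussian_gram(X, sigma)
    L = np.zeros((n, n))
    for c in classes:                         # block-diagonal L, blocks filled with 1/n_l
        idx = np.where(y == c)[0]
        L[np.ix_(idx, idx)] = 1.0 / len(idx)

    gamma, P = np.linalg.eigh(K)              # K = P Gamma P^t
    keep = gamma > 1e-10                      # drop numerically null eigenvalues
    gamma, P = gamma[keep], P[:, keep]

    lam, beta = np.linalg.eigh(P.T @ L @ P)   # Eq. (12): lambda beta = P^t L P beta
    beta = beta[:, np.argsort(lam)[::-1][:n_axes]]

    alpha = P @ np.diag(1.0 / gamma) @ beta   # Eq. (11)
    alpha /= np.sqrt(np.sum(alpha * (K @ alpha), axis=0))   # so that w^t w = 1
    return alpha, K

# Projections of the training patterns on the axes, Eq. (13): z = K @ alpha
# alpha, K = batch_gda(X, y); z = K @ alpha
```

For small n this batch solution is straightforward, but its cost is dominated by the eigendecomposition of the (n, n) matrix K, which is exactly what the sequential approach of the next section avoids.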
4. SEQUENTIAL GDA METHOD
It can be shown straightforwardly that (10) is equivalent to the following quotient
$$\lambda = \frac{\sum_{l=1}^{c} n_l\,(\alpha^t \mu_{ll})^2}{\alpha^t N \alpha}, \qquad (14)$$
where the $(n, n)$ matrix $N = K K^t - n\,\mu\mu^t$, $\mu = \frac{1}{n}\sum_{i=1}^{n}\tilde{\kappa}(x_i)$ and $\mu_{ll} = \mu - \mu_l$ with $\mu_l = \frac{1}{n_l}\sum_{x \in C_l}\tilde{\kappa}(x)$.

Fig. 1. (a) Projection of each class (50 elements) of the Iris data on the first axis of the sequential algorithm. A Gaussian kernel was used with σ = 1. (b) Value of the criterion in (14) as a function of the number of iterations.

In order to use a gradient descent strategy, we shall now take the inverse of (14) as the objective function to be minimized²
$$J(\alpha) = \frac{\alpha^t N \alpha}{\sum_{l=1}^{c} n_l\,(\alpha^t \mu_{ll})^2}. \qquad (15)$$
The gradient of (15) is given by the following expression:
$$\nabla_{\alpha} J(\alpha) = \frac{2}{\left(\sum_{l=1}^{c} n_l\,(\alpha^t \mu_{ll})^2\right)^2} \left[ N\alpha \sum_{l=1}^{c} n_l\,(\alpha^t \mu_{ll})^2 - \sum_{l=1}^{c} n_l\,(\alpha^t \mu_{ll})\,\mu_{ll}\,(\alpha^t N \alpha) \right]. \qquad (16)$$
From (15), it is obvious that the norm $\|\alpha\|$ of $\alpha$ is irrelevant. This implies that we can impose in (16) $\sum_{l=1}^{c} n_l\,(\alpha^t \mu_{ll})^2 = \alpha^t \left(\sum_{l=1}^{c} n_l\,\mu_{ll}\mu_{ll}^t\right) \alpha = \alpha^t \Sigma \alpha = 1$, where $\Sigma = A A^t$ with $A = [\sqrt{n_1}\,\mu_{11}, \dots, \sqrt{n_c}\,\mu_{cc}]$. This can be done by dividing $\alpha$ by $\|\alpha^t M D^{\frac{1}{2}}\|$, where $M$ and $D$ are respectively the matrices containing the eigenvectors and eigenvalues of $\Sigma$. Here $\Sigma$ is a $(n, n)$ matrix, so its eigenvector decomposition may raise the same storage and complexity problems as the standard GDA. We solve this by observing that $\Sigma$ has a maximum rank of $c$: its eigenvectors can be deduced from those of the $(c, c)$ matrix $A^t A$. Finally, we get the expression of the update of $\alpha$, where the factor of 2 has been ignored:
$$\nabla\alpha = N\alpha - (\alpha^t N \alpha) \sum_{l=1}^{c} n_l\,(\alpha^t \mu_{ll})\,\mu_{ll}. \qquad (17)$$
This can be written using only vector calculations:
$$\nabla\alpha = \sum_{i=1}^{n} y(i)\left[\tilde{\kappa}(x_i) - y(i)\sum_{l=1}^{c} n_l\,(\alpha^t \mu_{ll})\,\mu_{ll}\right] + n\,\mu^t\alpha\left[\mu^t\alpha \sum_{l=1}^{c} n_l\,(\alpha^t \mu_{ll})\,\mu_{ll} - \mu\right], \qquad (18)$$
where $y(i) = \alpha^t \tilde{\kappa}(x_i)$. The sequential algorithm can then be described as follows:
1. Initialize $\alpha$.
2. Calculate $\mu_l$, $\mu_{ll}$ and $\mu$.
3. Calculate the eigenvectors and eigenvalues of $\Sigma$ from $A^t A$, normalize $\alpha$ as $\alpha \leftarrow \alpha / \|\alpha^t M D^{\frac{1}{2}}\|$, and then calculate $\nabla\alpha$ from (18).
4. Update $\alpha$: $\alpha \leftarrow \alpha - \eta\,\nabla\alpha$, with $\eta > 0$ the learning step size.
5. If finished, exit; otherwise, go to 3.

² The reason for using the inverse of (14) will become clear in the following.
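A minimal sketch of this sequential procedure is given below, assuming a Gaussian kernel; the kernel rows $\tilde{\kappa}(x_i)$ are recomputed on the fly so that the full (n, n) Gram matrix is never stored. The function names, the fixed iteration count and the random initialization are illustrative assumptions, and the normalization uses $\sqrt{\alpha^t \Sigma \alpha} = \|A^t\alpha\|$, which is equivalent to the division by $\|\alpha^t M D^{\frac{1}{2}}\|$ prescribed in step 3.

```python
import numpy as np

def kernel_row(X, x, sigma=1.0):
    """Vector kappa~(x) = (kappa(x, x_1), ..., kappa(x, x_n))^t, Gaussian kernel."""
    d2 = ((X - x) ** 2).sum(axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def sequential_gda_axis(X, y, sigma=1.0, eta=0.02, n_iter=400, seed=0):
    """One discriminant axis (dual vector alpha) obtained by minimizing Eq. (15)
    with the gradient of Eqs. (17)-(18)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    classes, counts = np.unique(y, return_counts=True)
    n_of = dict(zip(classes, counts))

    # Step 2: mu, mu_l and mu_ll (vectors of size n), built from one pass over
    # the kernel rows -- the (n, n) matrix K is never held in memory.
    mu = np.zeros(n)
    mu_l = {c: np.zeros(n) for c in classes}
    for i in range(n):
        k_i = kernel_row(X, X[i], sigma)
        mu += k_i / n
        mu_l[y[i]] += k_i / n_of[y[i]]
    mu_ll = np.stack([mu - mu_l[c] for c in classes])        # shape (c, n)

    # Sigma = A A^t with A = [sqrt(n_l) mu_ll]; its spectrum lives in the small
    # (c, c) matrix A^t A, and alpha^t Sigma alpha = ||A^t alpha||^2.
    A = (np.sqrt(counts)[:, None] * mu_ll).T                 # shape (n, c)

    alpha = rng.standard_normal(n)
    for _ in range(n_iter):
        alpha = alpha / np.linalg.norm(A.T @ alpha)          # step 3: alpha^t Sigma alpha = 1

        proj = mu_ll @ alpha                                 # alpha^t mu_ll, per class
        v = (counts * proj) @ mu_ll                          # sum_l n_l (alpha^t mu_ll) mu_ll
        mu_a = mu @ alpha

        grad = np.zeros(n)                                   # gradient of Eq. (18)
        for i in range(n):
            k_i = kernel_row(X, X[i], sigma)
            y_i = alpha @ k_i
            grad += y_i * (k_i - y_i * v)
        grad += n * mu_a * (mu_a * v - mu)

        alpha = alpha - eta * grad                           # step 4
    return alpha
```

With a step size η of the order used in Section 5 (0.02 to 0.04), one pass of the outer loop roughly corresponds to one gradient update of the criterion, as tracked in Figure 1(b).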
In order to calculate a second vector $\alpha_2$, we should first eliminate from the data the contribution of the first vector $\alpha_1$ obtained by maximizing (14). This can be done by observing that in (7), when S is of full rank, the two axes $w_1$ and $w_2$ corresponding to the first two eigenvectors of $S^{-1}B$ verify the equation $w_1^t B w_2 = 0$. Thus, in F, we should replace $\phi(x_i)$ using the rule
$$\phi(x_i)^{new} = \phi(x_i) - \frac{(Bw_1)(Bw_1)^t\,\phi(x_i)}{\|Bw_1\|^2} \qquad (19)$$
for all the data in the training set. Note that calculating (19) may be a computationally intractable problem. However, observing that our algorithm is formulated using only dot products, we only need to compute the dot products
$$\kappa^{new}(x_i, x_j) = \left(\phi(x_i)^{new}\right)^t \phi(x_j)^{new}. \qquad (20)$$
Developing (20), we obtain the expression of the dot product as
$$\kappa^{new}(x_i, x_j) = \kappa(x_i, x_j) - \frac{\phi(x_i)^t (Bw_1)(Bw_1)^t \phi(x_j)}{\|Bw_1\|^2} = \kappa(x_i, x_j) - \frac{M(x_i, x_j)}{L}, \qquad (21)$$
where $M(x_i, x_j)$ is given by
$$M(x_i, x_j) = \frac{1}{n^2} \sum_{l,f=1}^{c} \sum_{a=1}^{n_l} \sum_{b=1}^{n_f} \kappa(x_a, x_i)\,\kappa(x_b, x_j)\,\Gamma(l)\,\Gamma(f), \qquad (22)$$
and
$$L = \frac{1}{n^2} \sum_{l,f=1}^{c} \sum_{a=1}^{n_l} \sum_{b=1}^{n_f} \kappa(x_a, x_b)\,\Gamma(l)\,\Gamma(f). \qquad (23)$$
Here $\Gamma(r) = \frac{1}{n_r}\sum_{k=1}^{n_r} y(k) - \frac{1}{n}\sum_{p=1}^{n} y(p)$ for a class $r$.

Fig. 2. (a) Synthetic data consisting of three classes, each shown with a different marker. (b) Projection of all examples on the first two axes using the sequential approach. A Gaussian kernel was used with σ = 1.

Then, the same algorithm described at the beginning of this section can be applied with $K^{new}$. If we need to calculate a third axis, we can apply (21) to the elements of $K^{new}$ instead of the elements of K. It is obvious that the complexity of the method increases with the number of axes to be found. Indeed, our approach is computationally efficient for the first axis; however, obtaining the second, third, ..., axes requires computing the elements of $K^{new}$, which increases the complexity of the method.
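The deflation rule (19)-(21) is applied to kernel values only, but its effect is easiest to see with an explicit, finite-dimensional feature map. The sketch below is offered as an illustration under that simplifying assumption (φ taken as the identity map and $w_1$ given explicitly); it applies (19) and returns the deflated Gram matrix of (20). The function name is hypothetical.

```python
import numpy as np

def deflate_gram(Phi, y, w1):
    """Remove the direction B w1 from explicit features Phi (Eq. (19)) and
    return the deflated Gram matrix K_new of Eq. (20)."""
    n, d = Phi.shape
    m = Phi.mean(axis=0)
    B = np.zeros((d, d))
    for c in np.unique(y):                       # between-class scatter, Eq. (1)
        Phic = Phi[y == c]
        diff = (Phic.mean(axis=0) - m)[:, None]
        B += len(Phic) * (diff @ diff.T)
    B /= n
    u = B @ w1                                   # direction to be eliminated
    Phi_new = Phi - np.outer(Phi @ u, u) / (u @ u)   # Eq. (19)
    assert np.allclose(Phi_new @ u, 0.0)         # eliminated direction no longer contributes
    return Phi_new @ Phi_new.T                   # K_new, Eq. (20)

# Toy usage on synthetic data (three classes of ten 2-d points, arbitrary w1):
rng = np.random.default_rng(0)
Phi = rng.standard_normal((30, 2))
y = np.repeat([0, 1, 2], 10)
K_new = deflate_gram(Phi, y, rng.standard_normal(2))
```

In the kernel setting of (21)-(23), the same quantity is obtained without ever forming the features, at the price of the extra kernel evaluations noted above.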
5. EXPERIMENTS
The Iris data consist of 150 four-dimensional examples from three classes [1] (each class consists of 50 examples). One class is linearly separable from the two other, non-linearly separable, classes. Figure 1(a) shows the projection of the three classes on the first axis, obtained with the sequential GDA. A Gaussian kernel was used with width σ = 1. The first axis seems to be sufficient to separate the data. Figure 1(b) gives the value of the criterion in (14) as a function of the number of iterations. The step size η was set to 0.02. Next, we consider in Figure 2(a) three non-linearly separable synthetic classes consisting of samples uniformly located on three circles with the same center but different radii. Each class contains 200 two-dimensional samples and is shown with a different marker in the figure. Even if the first axis is sufficient to separate the classes, we give in Figure 2(b) the projection of all examples on the first two axes using our sequential approach. A Gaussian kernel was used with σ = 1. The step size η was set to 0.04. Note that the second axis only separates classes 1 and 2 from class 3.
6. CONCLUSION
In this paper, we have presented a sequential approach to calculate nonlinear features based on the GDA method proposed in [1]. The key advantage of the proposed sequential GDA algorithm is that it does not need the inversion, or even the storage, of the Gram matrix of size (n, n). However, the weakness of our approach is that the complexity increases with the number of axes to be found.
7. REFERENCES
[1] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, pp. 2385-2404, 2000.
[2] V. Roth and V. Steinhage, "Nonlinear discriminant analysis using kernel functions," in Advances in Neural Information Processing Systems, S. A. Solla, T. K. Leen, and K.-R. Müller, editors, vol. 12, pp. 568-574, MIT Press, 2000.
[3] K. Fukunaga, Statistical Pattern Recognition, San Diego: Academic Press, 1990.
[4] V. Vapnik, The Nature of Statistical Learning Theory, New York: Springer Verlag, 1995.
[5] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, New York: Wiley and Sons, 2001.
[6] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K. R. Müller, "Fisher discriminant analysis with kernels," in Neural Networks for Signal Processing, Y. H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, pp. 41-48, 1999.
[7] K. R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An introduction to kernel-based learning algorithms," IEEE Transactions on Neural Networks, vol. 12, no. 2, pp. 181-201, May 2001.
[8] S. Saitoh, Theory of Reproducing Kernels and its Applications, Longman Scientific & Technical, 1988.
[9] B. Schölkopf, C. Burges, and A. Smola, Advances in Kernel Methods - Support Vector Learning, MIT Press, 1999.
Frequently Asked Questions (1)
Q1. What are the contributions in "A sequential approach for multi-class discriminant analysis with kernels"?

In this paper, the authors present a sequential algorithm for GDA that avoids the computational complexity and storage problems of the standard method when one deals with large numbers of datapoints.