Logistic Discriminant Analysis
Takio Kurita
Neuroscience Research Institute
AIST
Tsukuba, Japan
takio-kurita@aist.go.jp
Kenji Watanabe
AIST
Tsukuba, Japan
kenji-watanabe@aist.go.jp
Nobuyuki Otsu
AIST Fellow
AIST
Tsukuba, Japan
otsu.n@aist.go.jp
Abstract—Linear discriminant analysis (LDA) is one of the well-known methods to extract the best features for multi-class discrimination. Otsu derived the optimal nonlinear discriminant analysis (ONDA) by assuming the underlying probabilities and showed that ONDA is closely related to Bayesian decision theory (the posterior probabilities). Otsu also pointed out that LDA can be regarded as a linear approximation of ONDA obtained through linear approximations of the Bayesian posterior probabilities. Based on this theory, we propose a novel nonlinear discriminant analysis, named logistic discriminant analysis (LgDA), in which the posterior probabilities are estimated by multinomial logistic regression (MLR). Experimental results are shown by comparing the discriminant spaces constructed by LgDA and LDA for standard repository datasets.
Keywords—linear discriminant analysis, nonlinear discriminant analysis, multinomial logistic regression, logistic discriminant analysis, Bayesian decision theory
I. INTRODUCTION

Feature extraction is one of the most important problems in pattern recognition. Linear discriminant analysis (LDA) is one of the well-known methods to extract the best features for multi-class discrimination. LDA is formulated as the problem of finding an optimal linear mapping under which the within-class scatter in the mapped feature space is made as small as possible relative to the between-class scatter. LDA is useful for linearly separable cases, but for more complicated cases it is necessary to extend it to a nonlinear mapping.

Otsu derived the optimal nonlinear discriminant analysis (ONDA) by assuming the underlying probabilities [1, 2, 3]. He showed that the optimal nonlinear discriminant mapping is closely related to Bayesian decision theory (the posterior probabilities). Otsu also pointed out that LDA can be regarded as a linear approximation of the ultimate ONDA obtained through linear approximations of the Bayesian posterior probabilities. This theory suggests that we can construct a novel nonlinear discriminant mapping if we utilize nonlinear estimates of the posterior probabilities. Since the outputs of a trained multilayer perceptron (MLP) for pattern classification problems can be regarded as approximations of the posterior probabilities [4], Kurita et al. [5] proposed a neural-network-based nonlinear discriminant analysis using the outputs of the trained MLP. More recently, nonlinear discriminant spaces have been constructed by kernel discriminant analysis [6, 7]; this can also be interpreted as an approximation of the ultimate ONDA.

LDA is the linear approximation of ONDA through linear approximations of the Bayesian posterior probabilities. However, a linear model is not well suited to estimating the posterior probabilities. Logistic regression (LR) is one of the simplest models for binary classification and can directly estimate the posterior probabilities. Multinomial logistic regression (MLR) is a natural extension of LR to multi-class classification problems. Both are members of the family of generalized linear models (GLMs), a flexible generalization of ordinary least squares regression. By transforming the output of the linear predictor through the link function, MLR can estimate the posterior probabilities more naturally.

In this paper, we propose a novel nonlinear discriminant analysis in which the Bayesian posterior probabilities are estimated by MLR. The proposed method is named logistic discriminant analysis (LgDA). The discriminant space constructed by LgDA is expected to be better than the one constructed by LDA, because MLR is a more natural probability estimator than the linear approximation of the posterior probabilities used in LDA. Experimental results are shown by comparing the discriminant spaces constructed by LgDA and LDA for standard repository datasets.
II. LINEAR AND NON-LINEAR DISCRIMINANT ANALYSIS

A. Linear Discriminant Analysis
Let an $m$-dimensional feature vector be $\mathbf{x} = (x_1, \ldots, x_m)^{\top}$ and consider $K$ classes $\{C_k\}_{k=1}^{K}$. As training samples, we have $N$ feature vectors, each labeled as one of the $K$ classes. LDA then constructs a dimension-reducing linear mapping from the input feature vector $\mathbf{x}$ to a new feature vector $\mathbf{y}$,
$$\mathbf{y} = A^{\top}\mathbf{x}, \qquad (1)$$
where $A = [a_{ij}]$ is the coefficient matrix. The discriminant criterion
$$J = \mathrm{tr}\big(\hat{\Sigma}_T^{-1}\hat{\Sigma}_B\big) \qquad (2)$$
is used to evaluate how well the new feature vectors $\mathbf{y}$ discriminate the classes. The objective is to maximize the discriminant criterion $J$, where $\hat{\Sigma}_T$ and $\hat{\Sigma}_B$ are respectively the total covariance matrix and the between-class covariance matrix of the new feature vectors $\mathbf{y}$.

The optimal coefficient matrix $A$ is obtained by solving the eigen-equation
$$\Sigma_B A = \Sigma_T A \Lambda, \qquad A^{\top}\Sigma_T A = I, \qquad (3)$$
where $\Lambda$ is the diagonal matrix of eigenvalues and $I$ denotes the unit matrix. The matrices $\Sigma_T$ and $\Sigma_B$ are respectively the total covariance matrix and the between-class covariance matrix of the input vectors $\mathbf{x}$, computed as
$$\Sigma_T = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \bar{\mathbf{x}}_T)(\mathbf{x}_i - \bar{\mathbf{x}}_T)^{\top}, \qquad
\Sigma_B = \sum_{k=1}^{K} P(C_k)\,(\bar{\mathbf{x}}_k - \bar{\mathbf{x}}_T)(\bar{\mathbf{x}}_k - \bar{\mathbf{x}}_T)^{\top}, \qquad (4)$$
where $P(C_k)$, $\bar{\mathbf{x}}_k$ and $\bar{\mathbf{x}}_T$ denote the a priori probability of class $C_k$ ($P(C_k) = N_k/N$, with $N_k$ the number of input vectors of class $C_k$ and $N$ the total number of input vectors), the mean vector of class $C_k$, and the total mean vector, respectively.

The $j$-th column of $A$ is the eigenvector corresponding to the $j$-th largest eigenvalue, so the importance of each element of the new feature vector $\mathbf{y}$ is evaluated by the corresponding eigenvalue. The dimension of the new feature vector $\mathbf{y}$ is bounded by $\min(K-1, N)$.
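As a concrete illustration, the following is a minimal NumPy/SciPy sketch of LDA as the generalized eigenproblem (3) with the covariance matrices of (4). It is not the authors' implementation; the function and variable names (lda_mapping, X, labels, n_dims) are illustrative.

```python
# Minimal sketch of LDA via eqs. (3)-(4); assumes Sigma_T is nonsingular
# (add a small ridge to Sigma_T otherwise).  Illustrative, not the paper's code.
import numpy as np
from scipy.linalg import eigh

def lda_mapping(X, labels, n_dims):
    """X: (N, m) inputs; labels: (N,) class indices; returns mapped data and A."""
    N, m = X.shape
    x_bar_T = X.mean(axis=0)                        # total mean vector
    Sigma_T = (X - x_bar_T).T @ (X - x_bar_T) / N   # total covariance, eq. (4)
    Sigma_B = np.zeros((m, m))
    for k in np.unique(labels):
        Xk = X[labels == k]
        P_k = len(Xk) / N                           # prior P(C_k) = N_k / N
        d = Xk.mean(axis=0) - x_bar_T
        Sigma_B += P_k * np.outer(d, d)             # between-class covariance, eq. (4)
    # Generalized eigenproblem Sigma_B a = lambda Sigma_T a; eigh also enforces
    # A^T Sigma_T A = I, as required by eq. (3).
    evals, evecs = eigh(Sigma_B, Sigma_T)
    A = evecs[:, np.argsort(evals)[::-1][:n_dims]]  # keep the largest eigenvalues
    return X @ A, A                                 # y = A^T x for every sample
```

With this choice of $A$, the discriminant criterion $J$ of the mapped features equals the sum of the retained eigenvalues.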
B. Optimal Nonlinear Discriminant Analysis

Otsu derived the optimal nonlinear discriminant analysis (ONDA) by assuming the underlying probabilities [1, 2]. Similarly to LDA, ONDA constructs the dimension-reducing nonlinear mapping which maximizes the discriminant criterion $J$. The optimal nonlinear discriminant mapping is given by
$$\mathbf{y}(\mathbf{x}) = \sum_{k=1}^{K} P(C_k \mid \mathbf{x})\,\mathbf{u}_k, \qquad (5)$$
where $P(C_k \mid \mathbf{x})$ is the Bayesian posterior probability of class $C_k$ given the input $\mathbf{x}$. The vectors $\mathbf{u}_k$ ($k = 1, \ldots, K$) are class representative vectors determined by the following eigen-equation:
$$\Gamma U = P U \Lambda, \qquad (6)$$
where $\Gamma$ is a $K \times K$ matrix whose elements $\gamma_{ij}$ are
$$\gamma_{ij} = \int \big(P(C_i \mid \mathbf{x}) - P(C_i)\big)\big(P(C_j \mid \mathbf{x}) - P(C_j)\big)\,p(\mathbf{x})\,d\mathbf{x}, \qquad (7)$$
and the other matrices are
$$U = [\mathbf{u}_1, \ldots, \mathbf{u}_K]^{\top}, \qquad
P = \mathrm{diag}\big(P(C_1), \ldots, P(C_K)\big), \qquad
\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_K). \qquad (8)$$
It is important to notice that the optimal nonlinear mapping is closely related to Bayesian decision theory, namely the posterior probabilities $P(C_k \mid \mathbf{x})$. Along this line, Fukunaga et al. discussed various properties of the criterion from the viewpoint of nonlinear mappings [8].

Thus, we can construct the optimal nonlinear discriminant features by ONDA from the given input features if we know, or can correctly estimate, all the Bayesian posterior probabilities. However, it is usually difficult to estimate them from the input features.
C. Linear Approximation of ONDA

In the previous subsection, we explained ONDA as the ultimate nonlinear extension of LDA. A natural question then arises: in what sense does LDA approximate ONDA?

Let
$$L_k(\mathbf{x}) = \mathbf{b}^{(k)\top}\mathbf{x} + b_0^{(k)} \qquad (9)$$
be a linear approximation of the Bayesian posterior probability which minimizes the mean square error
$$\varepsilon^2 = \int \big(P(C_k \mid \mathbf{x}) - L_k(\mathbf{x})\big)^2 p(\mathbf{x})\,d\mathbf{x}. \qquad (10)$$
Otsu [2, 11] already pointed out that the optimal linear function is given by
$$L_k(\mathbf{x}) = P(C_k)\Big\{(\bar{\mathbf{x}}_k - \bar{\mathbf{x}}_T)^{\top}\Sigma_T^{-1}(\mathbf{x} - \bar{\mathbf{x}}_T) + 1\Big\}, \qquad (11)$$
where $\Sigma_T$ denotes the total covariance matrix. It is interesting to note that this function has the unit-sum property
$$\sum_{k=1}^{K} L_k(\mathbf{x}) = 1. \qquad (12)$$

Let us substitute these linear approximations $L_k(\mathbf{x})$ for the Bayesian posterior probabilities $P(C_k \mid \mathbf{x})$ in (5) and (6) of ONDA. By this substitution, (5) becomes
$$\mathbf{y} = \sum_{k=1}^{K} L_k(\mathbf{x})\,\mathbf{u}_k
= U^{\top} P M \Sigma_T^{-1}(\mathbf{x} - \bar{\mathbf{x}}_T) + U^{\top}\mathbf{p}, \qquad (13)$$
where $M = [\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_T, \ldots, \bar{\mathbf{x}}_K - \bar{\mathbf{x}}_T]^{\top}$ and $\mathbf{p} = (P(C_1), \ldots, P(C_K))^{\top}$. The matrix $\Gamma$ in (6) becomes
$$\Gamma = P M \Sigma_T^{-1} M^{\top} P. \qquad (14)$$
Substituting (14) into (6), cancelling $P$, multiplying by $M^{\top}P$ from the left, and setting $A = \Sigma_T^{-1}M^{\top}PU$, we obtain the same eigen-equation as (3). This means that LDA is the linear approximation of ONDA through the linear approximations $L_k(\mathbf{x})$ of the posterior probabilities $P(C_k \mid \mathbf{x})$.
III. LOGISTIC DISCRIMINANT ANALYSIS
A. Multinomial Logistic Regression

Logistic regression (LR) is one of the simplest models for binary classification and can directly estimate the posterior probabilities. Multinomial logistic regression (MLR) is a natural extension of LR to multi-class classification problems [9]. It is a member of the family of generalized linear models (GLMs), a flexible generalization of ordinary least squares regression. By transforming the output of the linear predictor through the link function, MLR can naturally estimate the posterior probabilities.
For a $K$-class classification problem, let $D = \{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N}$ be the given training data, where $\mathbf{x}_i = (x_{i1}, \ldots, x_{im})^{\top}$ is the $i$-th input vector and $\mathbf{t}_i \in T = \{\mathbf{t} \mid t^k \in \{0, 1\},\ \sum_{k=1}^{K} t^k = 1\}$ is the class representative (1-of-$K$) vector for the $i$-th input vector. The outputs of MLR estimate the posterior probabilities $P(t^k \mid \mathbf{x}_i)$. They are defined as
$$\hat{y}_i^k = P(t^k \mid \mathbf{x}_i) = \frac{\exp(\eta_i^k)}{1 + \sum_{j=1}^{K-1}\exp(\eta_i^j)}, \qquad k = 1, \ldots, K-1, \qquad (15)$$
$$\eta_i^k = \hat{\mathbf{w}}^{k\top}\mathbf{x}_i + b_k = \mathbf{w}^{k\top}\hat{\mathbf{x}}_i, \qquad (16)$$
where $\hat{\mathbf{w}}^k = (w_{k1}, \ldots, w_{km})^{\top}$ and $b_k$ are the weight vector and the bias term of the $k$-th class, respectively. The $K$-th class serves as the reference class, whose probability is the residual $\hat{y}_i^K = 1/(1 + \sum_{j=1}^{K-1}\exp(\eta_i^j))$. To simplify the notation, we include the bias term in the vectors as $\mathbf{w}^k = (w_{k1}, \ldots, w_{km}, b_k)^{\top}$ and $\hat{\mathbf{x}}_i = (x_{i1}, \ldots, x_{im}, 1)^{\top}$. In matrix notation, we use $W = (\mathbf{w}^1, \ldots, \mathbf{w}^{K-1})$ and $\hat{X} = (\hat{\mathbf{x}}_1, \ldots, \hat{\mathbf{x}}_N)^{\top}$.
The optimal parameters of MLR are obtained by minimizing the negative log-likelihood
$$W^{*} = \arg\min_{W} E_D, \qquad (17)$$
$$E_D = -\sum_{i=1}^{N}\left\{\sum_{j=1}^{K-1} t_i^j \eta_i^j - \log\Big(1 + \sum_{l=1}^{K-1}\exp(\eta_i^l)\Big)\right\}. \qquad (18)$$
Equation (17) represents a convex optimization problem and has a single, global minimum. The optimal parameter $W$ can be found efficiently using the Newton-Raphson method or an iteratively re-weighted least squares (IRLS) procedure. In each iteration step, $W$ is updated by
$$W^{(t+1)} = H^{-1} G^{\top} Z, \qquad (19)$$
where $H = G^{\top} R\, G$ is the block Hessian matrix and $H^{-1}$ is its inverse. $G = \mathrm{diag}(\hat{X}_1, \ldots, \hat{X}_{K-1})$ is the block-diagonal matrix built from $\hat{X}$, with $\hat{X}_k = \hat{X}$. The matrix $R$ is the block matrix defined as
$$R = \begin{pmatrix} R_{11} & \cdots & R_{1,K-1} \\ \vdots & \ddots & \vdots \\ R_{K-1,1} & \cdots & R_{K-1,K-1} \end{pmatrix}, \qquad
R_{jk} = \mathrm{diag}\big(r_1^{jk}, \ldots, r_N^{jk}\big), \qquad
r_n^{jk} = y_n^j\big(\delta_{jk} - y_n^k\big), \qquad
\delta_{jk} = \begin{cases} 1 & \text{if } j = k, \\ 0 & \text{otherwise.} \end{cases} \qquad (20)$$
The vector $Z$ is the block vector with elements
$$\mathbf{z}^k = \sum_{j=1}^{K-1} R_{kj}\,\boldsymbol{\eta}^j - (\mathbf{y}^k - \mathbf{t}^k), \qquad (21)$$
where $\boldsymbol{\eta}^j$, $\mathbf{y}^k$ and $\mathbf{t}^k$ collect the values of $\eta_i^j$, $\hat{y}_i^k$ and $t_i^k$ over the $N$ training samples. Equation (19) is repeated until convergence.
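The IRLS update (19)-(21) can be spelled out by forming the block matrices explicitly, which is only practical for small problems but makes the algebra concrete. The following is a sketch under the assumptions stated in the comments, not the authors' code; names such as fit_mlr_irls and n_iter are illustrative, and plain exponentials are used for clarity.

```python
# Compact IRLS sketch for eqs. (19)-(21).  Builds G, R, H explicitly,
# so it is written for clarity rather than speed or numerical robustness.
import numpy as np

def fit_mlr_irls(X_hat, T, n_iter=50):
    """X_hat: (N, d) inputs with a bias column; T: (N, K) one-hot targets,
       with the last column taken as the reference class K."""
    N, d = X_hat.shape
    Km1 = T.shape[1] - 1
    W = np.zeros((d, Km1))
    G = np.kron(np.eye(Km1), X_hat)                  # block-diagonal G, (Km1*N, Km1*d)
    for _ in range(n_iter):
        eta = X_hat @ W                              # (N, Km1) linear predictors, eq. (16)
        expe = np.exp(eta)
        Y = expe / (1.0 + expe.sum(axis=1, keepdims=True))   # eq. (15), classes 1..K-1
        # Block matrix R with R_jk = diag(y^j (delta_jk - y^k)), eq. (20)
        R = np.zeros((Km1 * N, Km1 * N))
        for j in range(Km1):
            for k in range(Km1):
                r = Y[:, j] * ((1.0 if j == k else 0.0) - Y[:, k])
                R[j*N:(j+1)*N, k*N:(k+1)*N] = np.diag(r)
        H = G.T @ R @ G                              # block Hessian, H = G^T R G
        # Adjusted responses z^k = sum_j R_kj eta^j - (y^k - t^k), eq. (21)
        eta_stack = eta.T.reshape(-1)                # stacked (eta^1; ...; eta^{K-1})
        resid = (Y - T[:, :Km1]).T.reshape(-1)
        Z = R @ eta_stack - resid
        W_new = np.linalg.solve(H, G.T @ Z)          # update of eq. (19)
        W_new = W_new.reshape(Km1, d).T
        if np.max(np.abs(W_new - W)) < 1e-6:
            W = W_new
            break
        W = W_new
    return W
```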
B. Regularization of MLR

In general, a regularization term is introduced to control over-fitting. Regularization methods for MLR have been proposed, such as the shrinkage method (regularized MLR) and locality preserving multinomial logistic regression (LPMLR) [10]. In the shrinkage method, unnecessary growth of the parameters is penalized by introducing the regularization term $E_W$ defined as
$$E_W = \sum_{j=1}^{K-1}\sum_{k=1}^{K-1}\mathbf{w}^{j\top}\mathbf{w}^{k}. \qquad (22)$$
The optimal parameters of the regularized MLR are then determined by minimizing the penalized negative log-likelihood
$$W^{*} = \arg\min_{W}\big(E_D + \lambda_{w}E_W\big), \qquad (23)$$
where $\lambda_{w}$ is the pre-specified regularization parameter of $E_W$; equation (23) again represents a convex optimization problem.

The update rule for the regularized MLR is the same as (19). However, the elements of the block Hessian matrix $H$ differ from those of plain MLR [10]. The block Hessian matrix $H$ of the regularized MLR is defined as
$$H = \begin{pmatrix} H_{11} & \cdots & H_{1,K-1} \\ \vdots & \ddots & \vdots \\ H_{K-1,1} & \cdots & H_{K-1,K-1} \end{pmatrix}, \qquad
H_{jk} = \begin{cases} \hat{X}^{\top}R_{jk}\hat{X} + 2\lambda_{w}I & \text{if } j = k, \\ \hat{X}^{\top}R_{jk}\hat{X} + \lambda_{w}I & \text{otherwise,} \end{cases} \qquad (24)$$
where $I$ is the identity matrix, $R$ is the block matrix analogous to (20), and $Z$ is the block vector with elements analogous to (21).
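Under the reconstruction of (24) given above, the only change to the IRLS step is an additive ridge-like term on the Hessian blocks. The small sketch below shows that modification; lambda_w, d and Km1 are illustrative parameters, and the function simply post-processes the unpenalized Hessian from the previous sketch.

```python
# Sketch of the Hessian modification in eq. (24): add 2*lambda_w*I to the
# diagonal blocks and lambda_w*I to the off-diagonal blocks.
import numpy as np

def regularized_hessian(H, d, Km1, lambda_w):
    """H: (Km1*d, Km1*d) unpenalized block Hessian G^T R G."""
    penalty = lambda_w * (np.kron(np.ones((Km1, Km1)), np.eye(d))   # lambda_w*I everywhere
                          + np.kron(np.eye(Km1), np.eye(d)))        # extra lambda_w*I on diagonal
    return H + penalty
```

Because the penalty is a homogeneous quadratic in $W$, its Hessian multiplied by $W$ equals its gradient, so these terms cancel in the Newton step; the vector $Z$ of (21) is therefore left unchanged, consistent with the statement that the update rule remains (19).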
C. Logistic Discriminant Analysis

After training the parameters with a sufficient number of samples, the outputs of the ordinary MLR or the regularized MLR can be interpreted as estimates of the Bayesian posterior probabilities $(P(C_1 \mid \mathbf{x}), \ldots, P(C_K \mid \mathbf{x}))^{\top}$. By substituting the Bayesian posterior probabilities in ONDA with the outputs of the ordinary MLR or the regularized MLR, we can directly construct an approximation of ONDA. We call this method logistic discriminant analysis (LgDA). The discriminant space constructed by LgDA is expected to be better than the one constructed by LDA, because MLR gives more natural estimates of the posterior probabilities than the linear approximation used in LDA.

Let the outputs of the ordinary MLR or the regularized MLR for an input vector $\mathbf{x}$ be $\mathbf{y}(\mathbf{x}) = (y_1(\mathbf{x}), \ldots, y_K(\mathbf{x}))^{\top}$. Then the a priori probability $P(C_k)$ is approximated as
$$\tilde{P}(C_k) = \frac{1}{N}\sum_{i=1}^{N} y_k(\mathbf{x}_i) = \bar{y}_k, \qquad k = 1, \ldots, K. \qquad (25)$$
The approximation of the matrix $\Gamma$ is also given by
$$\tilde{\Gamma} = \frac{1}{N}\sum_{i=1}^{N}\big(\mathbf{y}(\mathbf{x}_i) - \bar{\mathbf{y}}\big)\big(\mathbf{y}(\mathbf{x}_i) - \bar{\mathbf{y}}\big)^{\top}, \qquad (26)$$
where $\bar{\mathbf{y}} = (\bar{y}_1, \ldots, \bar{y}_K)^{\top}$. Thus the nonlinear discriminant mapping is obtained as
$$\tilde{\mathbf{y}} = \sum_{k=1}^{K} y_k(\mathbf{x})\,\tilde{\mathbf{u}}_k. \qquad (27)$$
The representative vectors $\tilde{\mathbf{u}}_k$ of each class are determined by the following eigen-equation:
$$\tilde{\Gamma}\tilde{U} = \tilde{P}\tilde{U}\tilde{\Lambda}, \qquad (28)$$
where the matrices $\tilde{P}$, $\tilde{U}$ and $\tilde{\Lambda}$ are defined as
$$\tilde{U} = [\tilde{\mathbf{u}}_1, \ldots, \tilde{\mathbf{u}}_K]^{\top}, \qquad
\tilde{P} = \mathrm{diag}\big(\tilde{P}(C_1), \ldots, \tilde{P}(C_K)\big), \qquad
\tilde{\Lambda} = \mathrm{diag}(\tilde{\lambda}_1, \ldots, \tilde{\lambda}_K). \qquad (29)$$
If the outputs of the ordinary MLR or the regularized MLR give a sufficiently good approximation of the Bayesian posterior probabilities, the nonlinear discriminant mapping defined by (27) is expected to be a good approximation, in terms of the discriminant criterion, of the ultimate nonlinear discriminant mapping of ONDA.
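Putting the pieces together, the following end-to-end sketch of LgDA estimates the posteriors, forms $\tilde{P}$ and $\tilde{\Gamma}$ from (25)-(26), and solves the eigen-equation (28). scikit-learn's LogisticRegression is used here only as a convenient stand-in posterior estimator; the paper fits its own (regularized) MLR by IRLS, and all names in the sketch are illustrative.

```python
# Minimal end-to-end sketch of LgDA, eqs. (25)-(28).
import numpy as np
from scipy.linalg import eigh
from sklearn.linear_model import LogisticRegression

def lgda_fit(X_train, labels, n_dims=2):
    mlr = LogisticRegression(max_iter=1000).fit(X_train, labels)
    Y = mlr.predict_proba(X_train)                  # (N, K) posterior estimates
    p_tilde = Y.mean(axis=0)                        # eq. (25): approximate priors
    Yc = Y - p_tilde
    Gamma = Yc.T @ Yc / len(Y)                      # eq. (26)
    P = np.diag(p_tilde)
    # eq. (28): Gamma~ U~ = P~ U~ Lambda~, a generalized symmetric eigenproblem
    evals, evecs = eigh(Gamma, P)
    U = evecs[:, np.argsort(evals)[::-1][:n_dims]]  # rows of U are the u~_k
    return mlr, U

def lgda_transform(mlr, U, X):
    return mlr.predict_proba(X) @ U                 # eq. (27): y~ = sum_k y_k(x) u~_k

# Example usage (hypothetical data):
# mlr, U = lgda_fit(X_train, y_train, n_dims=2)
# Z_test = lgda_transform(mlr, U, X_test)
```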
IV. EXPERIMENTS
To show the effectiveness of the proposed LgDA, the constructed discriminant space was compared with that of LDA using standard repository datasets for multi-class classification [11]. Fig. 1 shows the 2-dimensional discriminant spaces constructed by LDA and LgDA for the Satimage dataset, which has 36-dimensional features from 6 classes and consists of 4435 training samples and 2000 test samples. The regularization parameter $\lambda_w$ of LgDA was determined by grid search. Fig. 2 shows the 2-dimensional discriminant spaces for the Balance dataset, which has 4-dimensional features from 3 classes and consists of 625 samples; this dataset was randomly divided into 90 training samples and 535 test samples. The test samples are plotted in the constructed discriminant spaces. It is noticeable that the samples of each class gather around the class representative vectors in the discriminant space constructed by LgDA, whereas the samples are more spread out in the discriminant space constructed by LDA. Especially in the case of the Balance dataset, the discriminant space constructed by LgDA is more class-dependent, while that of LDA inherits the topology of the input feature space.

TABLE I shows the values of the discriminant criterion for the constructed discriminant spaces. The proposed LgDA achieves higher values than LDA. This means that the discriminant space constructed by LgDA is better than that constructed by LDA in terms of the discriminant criterion, which is the objective function of the discriminant analysis. The improvement of the discriminant criterion is especially large for the Balance dataset.

TABLE II shows the recognition rates for the Satimage and Balance datasets obtained using LDA, MLR and LgDA. They are calculated with a k-nearest-neighbor (k-NN) classifier in the discriminant space for the test samples. In TABLE II, LgDA ($\lambda_w = 0$) denotes LgDA built on MLR without regularization. The recognition rates for Satimage by LgDA are higher than those by LDA. In particular, LgDA based on MLR without regularization gave a higher recognition rate than MLR itself and gave the best recognition rate for Satimage. The recognition rate of LgDA with regularization was slightly lower than that of the regularized MLR and also slightly lower than that of LgDA without regularization. The reason is probably that the regularization parameter of LgDA was not tuned but was set to the same value as that of the regularized MLR; the recognition rate of LgDA with regularization would probably improve by tuning the regularization parameter for the LgDA classifier. For the Balance dataset, the recognition rate of LDA was the lowest, and that of the other methods was equal to 92.15%. These results suggest that LgDA can achieve higher recognition rates than LDA even when the samples show a structured distribution in the input feature space.
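For completeness, a sketch of the evaluation protocol described above: test samples are projected into the constructed discriminant space and scored with a k-NN classifier. The value of k and the grid for $\lambda_w$ are not stated in this excerpt, so the values below are placeholders.

```python
# Sketch of the k-NN evaluation in the constructed discriminant space.
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(Z_train, y_train, Z_test, y_test, k=5):
    """Z_*: samples already mapped into the 2-D discriminant space."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(Z_train, y_train)
    return knn.score(Z_test, y_test)

# A grid search over lambda_w would wrap the LgDA fitting step and keep the
# value giving the best validation accuracy.
```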
TABLE I. DISCRIMINANT CRITERIA

                        Satimage    Balance
LDA                     0.4900      0.3333
LgDA (lambda_w = 0)     0.7541      0.6783
LgDA                    0.7462      0.6773
Figure 1. Discriminant spaces by LDA (above) and by LgDA (below) for Satimage.
Figure 2. Discriminant spaces by LDA (above) and by LgDA (below) for Balance.