Logistic Discriminant Analysis
Takio Kurita
Neuroscience Research Institute
AIST
Tsukuba, Japan
takio-kurita@aist.go.jp
Kenji Watanabe
AIST
Tsukuba, Japan
kenji-watanabe@aist.go.jp
Nobuyuki Otsu
AIST Fellow
AIST
Tsukuba, Japan
otsu.n@aist.go.jp
Abstract—Linear discriminant analysis (LDA) is one of the well-known methods to extract the best features for multi-class discrimination. Otsu derived the optimal nonlinear discriminant analysis (ONDA) by assuming the underlying probabilities and showed that ONDA is closely related to Bayesian decision theory (the posterior probabilities). Otsu also pointed out that LDA can be regarded as a linear approximation of ONDA obtained through linear approximations of the Bayesian posterior probabilities. Based on this theory, we propose a novel nonlinear discriminant analysis, named logistic discriminant analysis (LgDA), in which the posterior probabilities are estimated by multinomial logistic regression (MLR). Experimental results are shown by comparing the discriminant spaces constructed by LgDA and LDA for standard repository datasets.
Keywords—linear discriminant analysis, nonlinear discriminant analysis, multinomial logistic regression, logistic discriminant analysis, Bayesian decision theory
I. INTRODUCTION

Feature extraction is one of the most important problems in pattern recognition. Linear discriminant analysis (LDA) is one of the well-known methods to extract the best features for multi-class discrimination. LDA is formulated as the problem of finding an optimal linear mapping under which the within-class scatter in the mapped feature space is made as small as possible relative to the between-class scatter. LDA is useful for linearly separable cases, but for more complicated cases it is necessary to extend it to a nonlinear mapping.

Otsu derived the optimal nonlinear discriminant analysis (ONDA) by assuming the underlying probabilities [1, 2, 3]. He showed that the optimal nonlinear discriminant mapping is closely related to Bayesian decision theory (the posterior probabilities). Otsu also pointed out that LDA can be regarded as a linear approximation of the ultimate ONDA obtained through linear approximations of the Bayesian posterior probabilities. This theory suggests that we can construct a novel nonlinear discriminant mapping if we utilize nonlinear estimates of the posterior probabilities. Since the outputs of a trained multilayer perceptron (MLP) for pattern classification problems can be regarded as approximations of the posterior probabilities [4], Kurita et al. [5] proposed a neural-network-based nonlinear discriminant analysis using the outputs of the trained MLP. More recently, nonlinear discriminant spaces have been constructed by kernel discriminant analysis [6, 7]; this can also be interpreted as an approximation of the ultimate ONDA.

LDA is the linear approximation of ONDA through linear approximations of the Bayesian posterior probabilities. However, a linear model is not well suited to estimating the posterior probabilities. Logistic regression (LR) is one of the simplest models for binary classification and can directly estimate the posterior probabilities. Multinomial logistic regression (MLR) is a natural extension of LR to multi-class classification problems. Both are members of the family of generalized linear models (GLMs), a flexible generalization of ordinary least squares regression. By transforming the output of the linear predictor through the link function, MLR can estimate the posterior probabilities more naturally.

In this paper, we propose a novel nonlinear discriminant analysis in which the Bayesian posterior probabilities are estimated by MLR. The proposed method is named logistic discriminant analysis (LgDA). The discriminant space constructed by LgDA is expected to be better than the one constructed by LDA, because MLR is a more natural probability estimator than the linear approximation of the posterior probabilities used in LDA. Experimental results are shown by comparing the discriminant spaces constructed by LgDA and LDA for standard repository datasets.
II. LINEAR AND NON-LINEAR DISCRIMINANT ANALYSIS

A. Linear Discriminant Analysis
Let an $m$-dimensional feature vector be $\mathbf{x} = (x_1, \ldots, x_m)^{\top}$ and consider $K$ classes $\{C_k\}_{k=1}^{K}$. As training samples, we have $N$ feature vectors, each labeled as one of the $K$ classes. LDA then constructs a dimension-reducing linear mapping from the input feature vector $\mathbf{x}$ to a new feature vector $\mathbf{y}$,
$$\mathbf{y} = A^{\top}\mathbf{x}, \qquad (1)$$
where $A = [a_{ij}]$ is the coefficient matrix. The discriminant criterion
$$J = \mathrm{tr}\big(\hat{\Sigma}_T^{-1}\hat{\Sigma}_B\big) \qquad (2)$$
is used to evaluate how well the new feature vectors $\mathbf{y}$ discriminate the classes. The objective is to maximize the discriminant criterion $J$, where $\hat{\Sigma}_T$ and $\hat{\Sigma}_B$ are respectively the total covariance matrix and the between-class covariance matrix of the new feature vectors $\mathbf{y}$.

The optimal coefficient matrix $A$ is obtained by solving the eigen-equation
$$\Sigma_B A = \Sigma_T A \Lambda, \qquad A^{\top}\Sigma_T A = I, \qquad (3)$$
where $\Lambda$ is the diagonal matrix of eigenvalues and $I$ denotes the unit matrix. The matrices $\Sigma_T$ and $\Sigma_B$ are respectively the total covariance matrix and the between-class covariance matrix of the input vectors $\mathbf{x}$, computed as
$$\Sigma_T = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \bar{\mathbf{x}}_T)(\mathbf{x}_i - \bar{\mathbf{x}}_T)^{\top}, \qquad
\Sigma_B = \sum_{k=1}^{K} P(C_k)\,(\bar{\mathbf{x}}_k - \bar{\mathbf{x}}_T)(\bar{\mathbf{x}}_k - \bar{\mathbf{x}}_T)^{\top}, \qquad (4)$$
where $P(C_k)$, $\bar{\mathbf{x}}_k$ and $\bar{\mathbf{x}}_T$ denote the a priori probability of class $C_k$ ($P(C_k) = N_k/N$, with $N_k$ the number of input vectors of class $C_k$ and $N$ the total number of input vectors), the mean vector of class $C_k$, and the total mean vector, respectively.

The $j$-th column of $A$ is the eigenvector corresponding to the $j$-th largest eigenvalue, so the importance of each element of the new feature vector $\mathbf{y}$ is evaluated by the corresponding eigenvalue. The dimension of the new feature vector $\mathbf{y}$ is bounded by $\min(K-1, N)$.
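As a concrete illustration, the following is a minimal NumPy/SciPy sketch of LDA as the generalized eigenproblem (3) with the covariance matrices of (4). It is not the authors' implementation; the function and variable names (lda_mapping, X, labels, n_dims) are illustrative.

```python
# Minimal sketch of LDA via eqs. (3)-(4); assumes Sigma_T is nonsingular
# (add a small ridge to Sigma_T otherwise).  Illustrative, not the paper's code.
import numpy as np
from scipy.linalg import eigh

def lda_mapping(X, labels, n_dims):
    """X: (N, m) inputs; labels: (N,) class indices; returns mapped data and A."""
    N, m = X.shape
    x_bar_T = X.mean(axis=0)                        # total mean vector
    Sigma_T = (X - x_bar_T).T @ (X - x_bar_T) / N   # total covariance, eq. (4)
    Sigma_B = np.zeros((m, m))
    for k in np.unique(labels):
        Xk = X[labels == k]
        P_k = len(Xk) / N                           # prior P(C_k) = N_k / N
        d = Xk.mean(axis=0) - x_bar_T
        Sigma_B += P_k * np.outer(d, d)             # between-class covariance, eq. (4)
    # Generalized eigenproblem Sigma_B a = lambda Sigma_T a; eigh also enforces
    # A^T Sigma_T A = I, as required by eq. (3).
    evals, evecs = eigh(Sigma_B, Sigma_T)
    A = evecs[:, np.argsort(evals)[::-1][:n_dims]]  # keep the largest eigenvalues
    return X @ A, A                                 # y = A^T x for every sample
```

With this choice of $A$, the discriminant criterion $J$ of the mapped features equals the sum of the retained eigenvalues.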
B. Optimal Nonlinear Discriminant Analysis

Otsu derived the optimal nonlinear discriminant analysis (ONDA) by assuming the underlying probabilities [1, 2]. Similarly to LDA, ONDA constructs the dimension-reducing nonlinear mapping which maximizes the discriminant criterion $J$. The optimal nonlinear discriminant mapping is given by
$$\mathbf{y}(\mathbf{x}) = \sum_{k=1}^{K} P(C_k \mid \mathbf{x})\,\mathbf{u}_k, \qquad (5)$$
where $P(C_k \mid \mathbf{x})$ is the Bayesian posterior probability of class $C_k$ given the input $\mathbf{x}$. The vectors $\mathbf{u}_k$ ($k = 1, \ldots, K$) are class representative vectors determined by the following eigen-equation:
$$\Gamma U = P U \Lambda, \qquad (6)$$
where $\Gamma$ is a $K \times K$ matrix whose elements $\gamma_{ij}$ are
$$\gamma_{ij} = \int \big(P(C_i \mid \mathbf{x}) - P(C_i)\big)\big(P(C_j \mid \mathbf{x}) - P(C_j)\big)\,p(\mathbf{x})\,d\mathbf{x}, \qquad (7)$$
and the other matrices are
$$U = [\mathbf{u}_1, \ldots, \mathbf{u}_K]^{\top}, \qquad
P = \mathrm{diag}\big(P(C_1), \ldots, P(C_K)\big), \qquad
\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_K). \qquad (8)$$
It is important to notice that the optimal nonlinear mapping is closely related to Bayesian decision theory, namely the posterior probabilities $P(C_k \mid \mathbf{x})$. Along this line, Fukunaga et al. discussed various properties of the criterion from the viewpoint of nonlinear mappings [8].

Thus, we can construct the optimal nonlinear discriminant features by ONDA from the given input features if we know, or can correctly estimate, all the Bayesian posterior probabilities. However, it is usually difficult to estimate them from the input features.
C. Linear Approximation of ONDA

In the previous subsection, we explained ONDA as the ultimate nonlinear extension of LDA. A natural question then arises: in what sense does LDA approximate ONDA?

Let
$$L_k(\mathbf{x}) = \mathbf{b}^{(k)\top}\mathbf{x} + b_0^{(k)} \qquad (9)$$
be a linear approximation of the Bayesian posterior probability which minimizes the mean square error
$$\varepsilon^2 = \int \big(P(C_k \mid \mathbf{x}) - L_k(\mathbf{x})\big)^2 p(\mathbf{x})\,d\mathbf{x}. \qquad (10)$$
Otsu [2, 11] already pointed out that the optimal linear function is given by
$$L_k(\mathbf{x}) = P(C_k)\Big\{(\bar{\mathbf{x}}_k - \bar{\mathbf{x}}_T)^{\top}\Sigma_T^{-1}(\mathbf{x} - \bar{\mathbf{x}}_T) + 1\Big\}, \qquad (11)$$
where $\Sigma_T$ denotes the total covariance matrix. It is interesting to note that this function has the unit-sum property
$$\sum_{k=1}^{K} L_k(\mathbf{x}) = 1. \qquad (12)$$

Let us substitute these linear approximations $L_k(\mathbf{x})$ for the Bayesian posterior probabilities $P(C_k \mid \mathbf{x})$ in (5) and (6) of ONDA. By this substitution, (5) becomes
$$\mathbf{y} = \sum_{k=1}^{K} L_k(\mathbf{x})\,\mathbf{u}_k
= U^{\top} P M \Sigma_T^{-1}(\mathbf{x} - \bar{\mathbf{x}}_T) + U^{\top}\mathbf{p}, \qquad (13)$$
where $M = [\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_T, \ldots, \bar{\mathbf{x}}_K - \bar{\mathbf{x}}_T]^{\top}$ and $\mathbf{p} = (P(C_1), \ldots, P(C_K))^{\top}$. The matrix $\Gamma$ in (6) becomes
$$\Gamma = P M \Sigma_T^{-1} M^{\top} P. \qquad (14)$$
Substituting (14) into (6), cancelling $P$, multiplying by $M^{\top}P$ from the left, and setting $A = \Sigma_T^{-1}M^{\top}PU$, we obtain the same eigen-equation as (3). This means that LDA is the linear approximation of ONDA through the linear approximations $L_k(\mathbf{x})$ of the posterior probabilities $P(C_k \mid \mathbf{x})$.
III. LOGISTIC DISCRIMINANT ANALYSIS
A. Multinomial Logistic Regression

Logistic regression (LR) is one of the simplest models for binary classification and can directly estimate the posterior probabilities. Multinomial logistic regression (MLR) is a natural extension of LR to multi-class classification problems [9]. It is a member of the family of generalized linear models (GLMs), a flexible generalization of ordinary least squares regression. By transforming the output of the linear predictor through the link function, MLR can naturally estimate the posterior probabilities.
For a $K$-class classification problem, let $D = \{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N}$ be the given training data, where $\mathbf{x}_i = (x_{i1}, \ldots, x_{im})^{\top}$ is the $i$-th input vector and $\mathbf{t}_i \in T = \{\mathbf{t} \mid t^k \in \{0, 1\},\ \sum_{k=1}^{K} t^k = 1\}$ is the class representative (1-of-$K$) vector for the $i$-th input vector. The outputs of MLR estimate the posterior probabilities $P(t^k \mid \mathbf{x}_i)$. They are defined as
$$\hat{y}_i^k = P(t^k \mid \mathbf{x}_i) = \frac{\exp(\eta_i^k)}{1 + \sum_{j=1}^{K-1}\exp(\eta_i^j)}, \qquad k = 1, \ldots, K-1, \qquad (15)$$
$$\eta_i^k = \hat{\mathbf{w}}^{k\top}\mathbf{x}_i + b_k = \mathbf{w}^{k\top}\hat{\mathbf{x}}_i, \qquad (16)$$
where $\hat{\mathbf{w}}^k = (w_{k1}, \ldots, w_{km})^{\top}$ and $b_k$ are the weight vector and the bias term of the $k$-th class, respectively. The $K$-th class serves as the reference class, whose probability is the residual $\hat{y}_i^K = 1/(1 + \sum_{j=1}^{K-1}\exp(\eta_i^j))$. To simplify the notation, we include the bias term in the vectors as $\mathbf{w}^k = (w_{k1}, \ldots, w_{km}, b_k)^{\top}$ and $\hat{\mathbf{x}}_i = (x_{i1}, \ldots, x_{im}, 1)^{\top}$. In matrix notation, we use $W = (\mathbf{w}^1, \ldots, \mathbf{w}^{K-1})$ and $\hat{X} = (\hat{\mathbf{x}}_1, \ldots, \hat{\mathbf{x}}_N)^{\top}$.
The optimal parameters of MLR are obtained by minimizing the negative log-likelihood
$$W^{*} = \arg\min_{W} E_D, \qquad (17)$$
$$E_D = -\sum_{i=1}^{N}\left\{\sum_{j=1}^{K-1} t_i^j \eta_i^j - \log\Big(1 + \sum_{l=1}^{K-1}\exp(\eta_i^l)\Big)\right\}. \qquad (18)$$
Equation (17) represents a convex optimization problem and has a single, global minimum. The optimal parameter $W$ can be found efficiently using the Newton-Raphson method or an iteratively re-weighted least squares (IRLS) procedure. In each iteration step, $W$ is updated by
$$W^{(t+1)} = H^{-1} G^{\top} Z, \qquad (19)$$
where $H = G^{\top} R\, G$ is the block Hessian matrix and $H^{-1}$ is its inverse. $G = \mathrm{diag}(\hat{X}_1, \ldots, \hat{X}_{K-1})$ is the block-diagonal matrix built from $\hat{X}$, with $\hat{X}_k = \hat{X}$. The matrix $R$ is the block matrix defined as
$$R = \begin{pmatrix} R_{11} & \cdots & R_{1,K-1} \\ \vdots & \ddots & \vdots \\ R_{K-1,1} & \cdots & R_{K-1,K-1} \end{pmatrix}, \qquad
R_{jk} = \mathrm{diag}\big(r_1^{jk}, \ldots, r_N^{jk}\big), \qquad
r_n^{jk} = y_n^j\big(\delta_{jk} - y_n^k\big), \qquad
\delta_{jk} = \begin{cases} 1 & \text{if } j = k, \\ 0 & \text{otherwise.} \end{cases} \qquad (20)$$
The vector $Z$ is the block vector with elements
$$\mathbf{z}^k = \sum_{j=1}^{K-1} R_{kj}\,\boldsymbol{\eta}^j - (\mathbf{y}^k - \mathbf{t}^k), \qquad (21)$$
where $\boldsymbol{\eta}^j$, $\mathbf{y}^k$ and $\mathbf{t}^k$ collect the values of $\eta_i^j$, $\hat{y}_i^k$ and $t_i^k$ over the $N$ training samples. Equation (19) is repeated until convergence.
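The IRLS update (19)-(21) can be spelled out by forming the block matrices explicitly, which is only practical for small problems but makes the algebra concrete. The following is a sketch under the assumptions stated in the comments, not the authors' code; names such as fit_mlr_irls and n_iter are illustrative, and plain exponentials are used for clarity.

```python
# Compact IRLS sketch for eqs. (19)-(21).  Builds G, R, H explicitly,
# so it is written for clarity rather than speed or numerical robustness.
import numpy as np

def fit_mlr_irls(X_hat, T, n_iter=50):
    """X_hat: (N, d) inputs with a bias column; T: (N, K) one-hot targets,
       with the last column taken as the reference class K."""
    N, d = X_hat.shape
    Km1 = T.shape[1] - 1
    W = np.zeros((d, Km1))
    G = np.kron(np.eye(Km1), X_hat)                  # block-diagonal G, (Km1*N, Km1*d)
    for _ in range(n_iter):
        eta = X_hat @ W                              # (N, Km1) linear predictors, eq. (16)
        expe = np.exp(eta)
        Y = expe / (1.0 + expe.sum(axis=1, keepdims=True))   # eq. (15), classes 1..K-1
        # Block matrix R with R_jk = diag(y^j (delta_jk - y^k)), eq. (20)
        R = np.zeros((Km1 * N, Km1 * N))
        for j in range(Km1):
            for k in range(Km1):
                r = Y[:, j] * ((1.0 if j == k else 0.0) - Y[:, k])
                R[j*N:(j+1)*N, k*N:(k+1)*N] = np.diag(r)
        H = G.T @ R @ G                              # block Hessian, H = G^T R G
        # Adjusted responses z^k = sum_j R_kj eta^j - (y^k - t^k), eq. (21)
        eta_stack = eta.T.reshape(-1)                # stacked (eta^1; ...; eta^{K-1})
        resid = (Y - T[:, :Km1]).T.reshape(-1)
        Z = R @ eta_stack - resid
        W_new = np.linalg.solve(H, G.T @ Z)          # update of eq. (19)
        W_new = W_new.reshape(Km1, d).T
        if np.max(np.abs(W_new - W)) < 1e-6:
            W = W_new
            break
        W = W_new
    return W
```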
B. Regularization of MLR

In general, a regularization term is introduced to control over-fitting. Regularization methods for MLR have been proposed, such as the shrinkage method (regularized MLR) and locality preserving multinomial logistic regression (LPMLR) [10]. In the shrinkage method, unnecessary growth of the parameters is penalized by introducing the regularization term $E_W$ defined as
$$E_W = \sum_{j=1}^{K-1}\sum_{k=1}^{K-1}\mathbf{w}^{j\top}\mathbf{w}^{k}. \qquad (22)$$
The optimal parameters of the regularized MLR are then determined by minimizing the penalized negative log-likelihood
$$W^{*} = \arg\min_{W}\big(E_D + \lambda_{w}E_W\big), \qquad (23)$$
where $\lambda_{w}$ is the pre-specified regularization parameter of $E_W$; equation (23) again represents a convex optimization problem.

The update rule for the regularized MLR is the same as (19). However, the elements of the block Hessian matrix $H$ differ from those of plain MLR [10]. The block Hessian matrix $H$ of the regularized MLR is defined as
$$H = \begin{pmatrix} H_{11} & \cdots & H_{1,K-1} \\ \vdots & \ddots & \vdots \\ H_{K-1,1} & \cdots & H_{K-1,K-1} \end{pmatrix}, \qquad
H_{jk} = \begin{cases} \hat{X}^{\top}R_{jk}\hat{X} + 2\lambda_{w}I & \text{if } j = k, \\ \hat{X}^{\top}R_{jk}\hat{X} + \lambda_{w}I & \text{otherwise,} \end{cases} \qquad (24)$$
where $I$ is the identity matrix, $R$ is the block matrix analogous to (20), and $Z$ is the block vector with elements analogous to (21).
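Under the reconstruction of (24) given above, the only change to the IRLS step is an additive ridge-like term on the Hessian blocks. The small sketch below shows that modification; lambda_w, d and Km1 are illustrative parameters, and the function simply post-processes the unpenalized Hessian from the previous sketch.

```python
# Sketch of the Hessian modification in eq. (24): add 2*lambda_w*I to the
# diagonal blocks and lambda_w*I to the off-diagonal blocks.
import numpy as np

def regularized_hessian(H, d, Km1, lambda_w):
    """H: (Km1*d, Km1*d) unpenalized block Hessian G^T R G."""
    penalty = lambda_w * (np.kron(np.ones((Km1, Km1)), np.eye(d))   # lambda_w*I everywhere
                          + np.kron(np.eye(Km1), np.eye(d)))        # extra lambda_w*I on diagonal
    return H + penalty
```

Because the penalty is a homogeneous quadratic in $W$, its Hessian multiplied by $W$ equals its gradient, so these terms cancel in the Newton step; the vector $Z$ of (21) is therefore left unchanged, consistent with the statement that the update rule remains (19).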
C. Logistic Discriminant Analysis

After training the parameters with a sufficient number of samples, the outputs of the ordinary MLR or the regularized MLR can be interpreted as estimates of the Bayesian posterior probabilities $(P(C_1 \mid \mathbf{x}), \ldots, P(C_K \mid \mathbf{x}))^{\top}$. By substituting the Bayesian posterior probabilities in ONDA with the outputs of the ordinary MLR or the regularized MLR, we can directly construct an approximation of ONDA. We call this method logistic discriminant analysis (LgDA). The discriminant space constructed by LgDA is expected to be better than the one constructed by LDA, because MLR gives more natural estimates of the posterior probabilities than the linear approximation used in LDA.

Let the outputs of the ordinary MLR or the regularized MLR for an input vector $\mathbf{x}$ be $\mathbf{y}(\mathbf{x}) = (y_1(\mathbf{x}), \ldots, y_K(\mathbf{x}))^{\top}$. Then the a priori probability $P(C_k)$ is approximated as
$$\tilde{P}(C_k) = \frac{1}{N}\sum_{i=1}^{N} y_k(\mathbf{x}_i) = \bar{y}_k, \qquad k = 1, \ldots, K. \qquad (25)$$
The approximation of the matrix $\Gamma$ is also given by
$$\tilde{\Gamma} = \frac{1}{N}\sum_{i=1}^{N}\big(\mathbf{y}(\mathbf{x}_i) - \bar{\mathbf{y}}\big)\big(\mathbf{y}(\mathbf{x}_i) - \bar{\mathbf{y}}\big)^{\top}, \qquad (26)$$
where $\bar{\mathbf{y}} = (\bar{y}_1, \ldots, \bar{y}_K)^{\top}$. Thus the nonlinear discriminant mapping is obtained as
$$\tilde{\mathbf{y}} = \sum_{k=1}^{K} y_k(\mathbf{x})\,\tilde{\mathbf{u}}_k. \qquad (27)$$
The representative vectors $\tilde{\mathbf{u}}_k$ of each class are determined by the following eigen-equation:
$$\tilde{\Gamma}\tilde{U} = \tilde{P}\tilde{U}\tilde{\Lambda}, \qquad (28)$$
where the matrices $\tilde{P}$, $\tilde{U}$ and $\tilde{\Lambda}$ are defined as
$$\tilde{U} = [\tilde{\mathbf{u}}_1, \ldots, \tilde{\mathbf{u}}_K]^{\top}, \qquad
\tilde{P} = \mathrm{diag}\big(\tilde{P}(C_1), \ldots, \tilde{P}(C_K)\big), \qquad
\tilde{\Lambda} = \mathrm{diag}(\tilde{\lambda}_1, \ldots, \tilde{\lambda}_K). \qquad (29)$$
If the outputs of the ordinary MLR or the regularized MLR give a sufficiently good approximation of the Bayesian posterior probabilities, the nonlinear discriminant mapping defined by (27) is expected to be a good approximation, in terms of the discriminant criterion, of the ultimate nonlinear discriminant mapping of ONDA.
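Putting the pieces together, the following end-to-end sketch of LgDA estimates the posteriors, forms $\tilde{P}$ and $\tilde{\Gamma}$ from (25)-(26), and solves the eigen-equation (28). scikit-learn's LogisticRegression is used here only as a convenient stand-in posterior estimator; the paper fits its own (regularized) MLR by IRLS, and all names in the sketch are illustrative.

```python
# Minimal end-to-end sketch of LgDA, eqs. (25)-(28).
import numpy as np
from scipy.linalg import eigh
from sklearn.linear_model import LogisticRegression

def lgda_fit(X_train, labels, n_dims=2):
    mlr = LogisticRegression(max_iter=1000).fit(X_train, labels)
    Y = mlr.predict_proba(X_train)                  # (N, K) posterior estimates
    p_tilde = Y.mean(axis=0)                        # eq. (25): approximate priors
    Yc = Y - p_tilde
    Gamma = Yc.T @ Yc / len(Y)                      # eq. (26)
    P = np.diag(p_tilde)
    # eq. (28): Gamma~ U~ = P~ U~ Lambda~, a generalized symmetric eigenproblem
    evals, evecs = eigh(Gamma, P)
    U = evecs[:, np.argsort(evals)[::-1][:n_dims]]  # rows of U are the u~_k
    return mlr, U

def lgda_transform(mlr, U, X):
    return mlr.predict_proba(X) @ U                 # eq. (27): y~ = sum_k y_k(x) u~_k

# Example usage (hypothetical data):
# mlr, U = lgda_fit(X_train, y_train, n_dims=2)
# Z_test = lgda_transform(mlr, U, X_test)
```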
IV. EXPERIMENTS
To show the effectiveness of the proposed LgDA, the constructed discriminant space was compared with that of LDA using standard repository datasets for multi-class classification [11]. Fig. 1 shows the 2-dimensional discriminant spaces constructed by LDA and LgDA for the Satimage dataset, which has 36-dimensional features from 6 classes and consists of 4435 training samples and 2000 test samples. The regularization parameter $\lambda_w$ of LgDA was determined by grid search. Fig. 2 shows the 2-dimensional discriminant spaces for the Balance dataset, which has 4-dimensional features from 3 classes and consists of 625 samples; this dataset was randomly divided into 90 training samples and 535 test samples. The test samples are plotted in the constructed discriminant spaces. It is noticeable that the samples of each class gather around the class representative vectors in the discriminant space constructed by LgDA, whereas the samples are more spread out in the discriminant space constructed by LDA. Especially in the case of the Balance dataset, the discriminant space constructed by LgDA is more class-dependent, while that of LDA inherits the topology of the input feature space.

TABLE I shows the values of the discriminant criterion for the constructed discriminant spaces. The proposed LgDA achieves higher values than LDA. This means that the discriminant space constructed by LgDA is better than that constructed by LDA in terms of the discriminant criterion, which is the objective function of the discriminant analysis. The improvement of the discriminant criterion is especially large for the Balance dataset.

TABLE II shows the recognition rates for the Satimage and Balance datasets obtained using LDA, MLR and LgDA. They are calculated with a k-nearest-neighbor (k-NN) classifier in the discriminant space for the test samples. In TABLE II, LgDA ($\lambda_w = 0$) denotes LgDA built on MLR without regularization. The recognition rates for Satimage by LgDA are higher than those by LDA. In particular, LgDA based on MLR without regularization gave a higher recognition rate than MLR itself and gave the best recognition rate for Satimage. The recognition rate of LgDA with regularization was slightly lower than that of the regularized MLR and also slightly lower than that of LgDA without regularization. The reason is probably that the regularization parameter of LgDA was not tuned but was set to the same value as that of the regularized MLR; the recognition rate of LgDA with regularization would probably improve by tuning the regularization parameter for the LgDA classifier. For the Balance dataset, the recognition rate of LDA was the lowest, and that of the other methods was equal to 92.15%. These results suggest that LgDA can achieve higher recognition rates than LDA even when the samples show a structured distribution in the input feature space.
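For completeness, a sketch of the evaluation protocol described above: test samples are projected into the constructed discriminant space and scored with a k-NN classifier. The value of k and the grid for $\lambda_w$ are not stated in this excerpt, so the values below are placeholders.

```python
# Sketch of the k-NN evaluation in the constructed discriminant space.
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(Z_train, y_train, Z_test, y_test, k=5):
    """Z_*: samples already mapped into the 2-D discriminant space."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(Z_train, y_train)
    return knn.score(Z_test, y_test)

# A grid search over lambda_w would wrap the LgDA fitting step and keep the
# value giving the best validation accuracy.
```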
TABLE I. DISCRIMINANT CRITERIA

                        Satimage    Balance
LDA                     0.4900      0.3333
LgDA (lambda_w = 0)     0.7541      0.6783
LgDA                    0.7462      0.6773
Figure 1. Discriminant spaces by LDA (above) and by LgDA (below) for Satimage.
Figure 2. Discriminant spaces by LDA (above) and by LgDA (below) for Balance.