Journal ArticleDOI

Nonlinear component analysis as a kernel eigenvalue problem

01 Jul 1998-Neural Computation (MIT Press)-Vol. 10, Iss: 5, pp 1299-1319
TL;DR: A new method for performing a nonlinear form of principal component analysis by the use of integral operator kernel functions is proposed and experimental results on polynomial feature extraction for pattern recognition are presented.
Abstract: A new method for performing a nonlinear form of principal component analysis is proposed. By the use of integral operator kernel functions, one can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map—for instance, the space of all possible five-pixel products in 16 × 16 images. We give the derivation of the method and present experimental results on polynomial feature extraction for pattern recognition.

Summary (2 min read)

Introduction

  • The authors describe a new method for performing a nonlinear form of Principal Component Analysis.
  • In this paper, the authors give some examples of nonlinear methods constructed by this approach.
  • Together, these two sections form the basis for Sec. 4, which presents the proposed kernel-based algorithm for nonlinear PCA. Following that, Sec. 5 will discuss some differences between kernel-based PCA and other generalizations of PCA.
  • To this end, they substitute a priori chosen kernel functions for all occurrences of dot products.
  • In experiments on classification based on the extracted principal components, the authors found that in the nonlinear case it was sufficient to use a linear Support Vector machine to construct the decision boundary. Linear Support Vector machines, moreover, are much faster in classification speed than nonlinear ones.

B Kernels Corresponding to Dot Products in Another Space

  • In practice, the authors are free to also try to use symmetric kernels of indefinite operators.
  • In that case, the matrix K can still be diagonalized and the authors can extract nonlinear feature values, with the one modification that they need to adjust their normalization condition in order to deal with possible negative Eigenvalues; K then induces a mapping to a Riemannian space with indefinite metric.
  • In fact, many symmetric forms may induce spaces with indefinite signature.
  • In the following sections, the authors give some examples of kernels that can be used for kernel PCA.

B Kernels Chosen A Priori

  • The fact that the authors can use indefinite operators distinguishes this approach from the usage of kernels in the Support Vector machine; in the latter, the definiteness is necessary for the optimization procedure.
  • The choice of c should depend on the range of the input variables; Neural Network type kernels have the form k(x, y) = tanh((x · y) + b). Interestingly, these different types of kernels allow the construction of Polynomial Classifiers, Radial Basis Function Classifiers, and Neural Networks with the Support Vector algorithm, which exhibit very similar accuracy (see the sketch below).
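
As a concrete illustration of these a priori kernel choices, the sketch below writes the three families as plain NumPy functions. The polynomial and Neural Network forms follow the expressions quoted above; the radial basis form exp(-||x - y||^2 / c) is the standard one that the parameter c alludes to, and all parameter values here are arbitrary placeholders rather than settings from the paper.

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    """Polynomial kernel k(x, y) = (x . y)^d."""
    return np.dot(x, y) ** d

def rbf_kernel(x, y, c=1.0):
    """Radial basis function kernel k(x, y) = exp(-||x - y||^2 / c)."""
    return np.exp(-np.sum((x - y) ** 2) / c)

def sigmoid_kernel(x, y, b=0.0):
    """Neural-network-type kernel k(x, y) = tanh((x . y) + b)."""
    return np.tanh(np.dot(x, y) + b)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```
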

B Local Kernels

  • Locality in their context means that the principal component extraction should take into account only neighbourhoods.
  • Depending on whether the authors consider neighbourhoods in input space or in another space (say, the image space, where the input vectors correspond to functions), locality can assume different meanings.
  • This additional degree of freedom can greatly improve statistical estimates which are computed from a limited amount of data (Bottou & Vapnik).

B Constructing Kernels from other Kernels

  • In other words, the admissible kernels form a cone in the space of all integral operators. Clearly, k1 + k2 corresponds to mapping into the direct sum of the respective spaces into which k1 and k2 map (a numerical check of this is sketched below).
  • Of course, the authors could also explicitly do the principal component extraction twice, for both kernels, and decide themselves on the respective numbers of components to extract.
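
A small numerical check of the direct-sum statement, using two kernels whose feature maps are easy to write out: a linear kernel and the degree-2 polynomial kernel of Eq. (20) in the main text. This particular choice is ours, made only so the mapped vectors stay short; the kernel matrix of k1 + k2 equals the Gram matrix of the concatenated (direct-sum) feature vectors and remains positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))           # six points in R^2

def phi1(x):                          # feature map of k1(x, y) = (x . y)
    return x

def phi2(x):                          # feature map of k2(x, y) = (x . y)^2, cf. Eq. (20)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

K1 = X @ X.T                          # k1 on all pairs
K2 = (X @ X.T) ** 2                   # k2 on all pairs

# Direct-sum feature map: concatenate the two feature vectors of each point.
Phi = np.array([np.concatenate([phi1(x), phi2(x)]) for x in X])

assert np.allclose(K1 + K2, Phi @ Phi.T)             # k1 + k2 is the direct-sum kernel
assert np.all(np.linalg.eigvalsh(K1 + K2) > -1e-10)  # and it stays positive semidefinite
```
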


Max-Planck-Institut für biologische Kybernetik
Spemannstraße 38, 72076 Tübingen, Germany
Arbeitsgruppe Bülthoff

Technical Report No. 44, December 1996

Nonlinear Component Analysis as a Kernel Eigenvalue Problem

Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller

Abstract

We describe a new method for performing a nonlinear form of Principal Component Analysis. By the use of integral operator kernel functions, we can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map, for instance the space of all possible 5-pixel products in 16 × 16 images. We give the derivation of the method, along with a discussion of other techniques which can be made nonlinear with the kernel approach, and present first experimental results on nonlinear feature extraction for pattern recognition.

AS and KRM are with GMD First (Forschungszentrum Informationstechnik), Rudower Chaussee 5, 12489 Berlin. AS and BS were supported by grants from the Studienstiftung des deutschen Volkes. BS thanks the GMD First for hospitality during two visits. AS and BS thank V. Vapnik for introducing them to kernel representations of dot products during joint work on Support Vector machines. This work profited from discussions with V. Blanz, L. Bottou, C. Burges, H. Bülthoff, K. Gegenfurtner, P. Haffner, N. Murata, P. Simard, S. Solla, V. Vapnik, and T. Vetter. We are grateful to V. Blanz, C. Burges, and S. Solla for reading a preliminary version of the manuscript.

This document is available as /pub/mpi-memos/TR-044.ps via anonymous ftp from ftp.mpik-tueb.mpg.de or from the World Wide Web, http://www.mpik-tueb.mpg.de/bu.html.


1 Introduction

Principal Component Analysis (PCA) is a powerful technique for extracting structure from possibly high-dimensional data sets. It is readily performed by solving an Eigenvalue problem, or by using iterative algorithms which estimate principal components; for reviews of the existing literature, see Jolliffe (1986) and Diamantaras & Kung (1996). PCA is an orthogonal transformation of the coordinate system in which we describe our data. The new coordinate values by which we represent our data are called principal components. It is often the case that a small number of principal components is sufficient to account for most of the structure in the data. These are sometimes called the factors or latent variables of the data.

The present work generalizes PCA to the case where we are not interested in principal components in input space, but rather in principal components of variables, or features, which are nonlinearly related to the input variables. Among these are, for instance, variables obtained by taking higher-order correlations between input variables. In the case of image analysis, this would amount to finding principal components in the space of products of input pixels.

To this end, we are using the method of expressing dot products in feature space in terms of kernel functions in input space. Given any algorithm which can be expressed solely in terms of dot products, i.e. without explicit usage of the variables themselves, this kernel method enables us to construct different nonlinear versions of it (Aizerman, Braverman, & Rozonoer, 1964; Boser, Guyon, & Vapnik, 1992). Even though this general fact was known (Burges, 1996), the machine learning community has made little use of it, the exception being Support Vector machines (Vapnik, 1995).

In this paper, we give some examples of nonlinear methods constructed by this approach. For one example, the case of a nonlinear form of principal component analysis, we shall give details and experimental results (Sections 2-6); for some other cases, we shall briefly sketch the algorithms (Sec. 7).

In the next section, we will first review the standard PCA algorithm. In order to be able to generalize it to the nonlinear case, we shall then formulate it in a way which uses exclusively dot products. In Sec. 3, we shall discuss the kernel method for computing dot products in feature spaces. Together, these two sections form the basis for Sec. 4, which presents the proposed kernel-based algorithm for nonlinear PCA. Following that, Sec. 5 will discuss some differences between kernel-based PCA and other generalizations of PCA. In Sec. 6, we shall give some first experimental results on kernel-based feature extraction for pattern recognition. After a discussion of other applications of the kernel method (Sec. 7), we conclude with a discussion (Sec. 8). Finally, some technical material which is not essential for the main thread of the argument has been relegated to the appendix.

2 PCA in Feature Spaces

Given a set of M centered observations x_k, k = 1, ..., M, with x_k ∈ R^N and \sum_{k=1}^{M} x_k = 0, PCA diagonalizes the covariance matrix¹

    C = \frac{1}{M} \sum_{j=1}^{M} x_j x_j^\top.    (1)

To do this, one has to solve the Eigenvalue equation

    \lambda v = C v    (2)

for Eigenvalues λ ≥ 0 and v ∈ R^N \ {0}. As C v = \frac{1}{M} \sum_{j=1}^{M} (x_j \cdot v)\, x_j, all solutions v must lie in the span of x_1, ..., x_M; hence (2) is equivalent to

    (x_k \cdot v) = (x_k \cdot C v) \quad \text{for all } k = 1, \dots, M.    (3)

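
For concreteness, here is a minimal NumPy sketch of this linear case: build the covariance matrix (1), solve the Eigenvalue problem (2), and read off the principal components as projections onto the leading Eigenvectors. The data and variable names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # M = 100 observations x_k in R^N, N = 5
X = X - X.mean(axis=0)               # center the data so that sum_k x_k = 0

C = (X.T @ X) / X.shape[0]           # covariance matrix, Eq. (1)
eigvals, eigvecs = np.linalg.eigh(C) # solve lambda v = C v, Eq. (2)

order = np.argsort(eigvals)[::-1]    # sort Eigenvalues in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

projections = X @ eigvecs[:, :2]     # principal components: projections onto the top-2 Eigenvectors
print(eigvals[:2], projections.shape)
```
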
The remainder of this section is devoted to a straightforward translation to a nonlinear scenario, in order to prepare the ground for the method proposed in the present paper. We shall now describe this computation in another dot product space F, which is related to the input space by a possibly nonlinear map

    \Phi : \mathbb{R}^N \to F, \qquad x \mapsto X.    (4)

Note that F, which we will refer to as the feature space, could have an arbitrarily large, possibly infinite, dimensionality. Here and in the following, upper case characters are used for elements of F, while lower case characters denote elements of R^N.

Again, we make the assumption that we are dealing with centered data, i.e. \sum_{k=1}^{M} \Phi(x_k) = 0; we shall return to this point later. Using the covariance matrix in F,

    \bar{C} = \frac{1}{M} \sum_{j=1}^{M} \Phi(x_j)\,\Phi(x_j)^\top    (5)

(if F is infinite-dimensional, we think of \Phi(x_j)\Phi(x_j)^\top as the linear operator which maps X ∈ F to \Phi(x_j)\,(\Phi(x_j) \cdot X)), we now have to find Eigenvalues λ ≥ 0 and Eigenvectors V ∈ F \ {0} satisfying

    \lambda V = \bar{C} V.    (6)

¹ More precisely, the covariance matrix is defined as the expectation of x x^\top; for convenience, we shall use the same term to refer to the maximum likelihood estimate (1) of the covariance matrix from a finite sample.

By the same argument as above, the solutions V lie in the span of Φ(x_1), ..., Φ(x_M). For us, this has two useful consequences: first, we can consider the equivalent equation

    \lambda\,(\Phi(x_k) \cdot V) = (\Phi(x_k) \cdot \bar{C} V) \quad \text{for all } k = 1, \dots, M,    (7)

and second, there exist coefficients α_i (i = 1, ..., M) such that

    V = \sum_{i=1}^{M} \alpha_i\, \Phi(x_i).    (8)

Combining (7) and (8), we get

    \lambda \sum_{i=1}^{M} \alpha_i\, (\Phi(x_k) \cdot \Phi(x_i)) = \frac{1}{M} \sum_{i=1}^{M} \alpha_i \Big( \Phi(x_k) \cdot \sum_{j=1}^{M} \Phi(x_j)\, (\Phi(x_j) \cdot \Phi(x_i)) \Big) \quad \text{for all } k = 1, \dots, M.    (9)

Defining an M × M matrix K by

    K_{ij} := (\Phi(x_i) \cdot \Phi(x_j)),    (10)

this reads

    M \lambda K \alpha = K^2 \alpha,    (11)

where α denotes the column vector with entries α_1, ..., α_M. As K is symmetric, it has a set of Eigenvectors which spans the whole space; thus

    M \lambda \alpha = K \alpha    (12)

gives us all solutions α of Eq. (11). Note that K is positive semidefinite, which can be seen by noticing that it equals

    \big( \Phi(x_1), \dots, \Phi(x_M) \big)^\top \big( \Phi(x_1), \dots, \Phi(x_M) \big),    (13)

which implies that for all coefficient vectors α ∈ R^M,

    (\alpha \cdot K \alpha) = \big\| \big( \Phi(x_1), \dots, \Phi(x_M) \big)\, \alpha \big\|^2 \;\ge\; 0.    (14)

Consequently, K's Eigenvalues will be nonnegative, and will exactly give the solutions Mλ of Eq. (11). We therefore only need to diagonalize K. Let λ_1 ≤ λ_2 ≤ ... ≤ λ_M denote the Eigenvalues, and α^1, ..., α^M the corresponding complete set of Eigenvectors, with λ_p being the first nonzero Eigenvalue.² We normalize α^p, ..., α^M by requiring that the corresponding vectors in F be normalized, i.e.

    (V^k \cdot V^k) = 1 \quad \text{for all } k = p, \dots, M.    (15)

By virtue of (8) and (12), this translates into a normalization condition for α^p, ..., α^M:

    1 = \sum_{i,j=1}^{M} \alpha_i^k \alpha_j^k\, (\Phi(x_i) \cdot \Phi(x_j)) = \sum_{i,j=1}^{M} \alpha_i^k \alpha_j^k K_{ij} = (\alpha^k \cdot K \alpha^k) = \lambda_k\, (\alpha^k \cdot \alpha^k).    (16)

For the purpose of principal component extraction, we need to compute projections onto the Eigenvectors V^k in F (k = p, ..., M). Let x be a test point, with an image Φ(x) in F; then

    (V^k \cdot \Phi(x)) = \sum_{i=1}^{M} \alpha_i^k\, (\Phi(x_i) \cdot \Phi(x))    (17)

may be called its nonlinear principal components corresponding to Φ.

In summary, the following steps were necessary to compute the principal components: first, compute the dot product matrix K defined by (10);³ second, compute its Eigenvectors and normalize them in F; third, compute projections of a test point onto the Eigenvectors by (17).

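
These three steps translate almost line by line into code. The sketch below is a minimal, uncentered version (the centering correction of Appendix A is omitted), with a polynomial kernel chosen only as an example; function names such as kernel_pca_fit are ours, not the paper's.

```python
import numpy as np

def kernel_pca_fit(X, kernel):
    """Steps 1 and 2: dot product matrix K (10), Eigenvectors of K (12), normalization (16)."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # Eq. (10)
    lam, alpha = np.linalg.eigh(K)                            # K alpha^k = lambda_k alpha^k, cf. Eq. (12)
    keep = lam > 1e-10                                        # keep the nonzero Eigenvalues (k >= p)
    lam, alpha = lam[keep][::-1], alpha[:, keep][:, ::-1]     # largest Eigenvalue first
    return alpha / np.sqrt(lam)                               # enforce 1 = lambda_k (alpha^k . alpha^k), Eq. (16)

def kernel_pca_project(x, X_train, alpha, kernel):
    """Step 3: nonlinear principal components of a test point x, Eq. (17)."""
    return np.array([kernel(xi, x) for xi in X_train]) @ alpha

poly2 = lambda x, y: np.dot(x, y) ** 2                        # example kernel, cf. Eq. (22) with d = 2
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
alpha = kernel_pca_fit(X, poly2)
print(kernel_pca_project(X[0], X, alpha, poly2)[:3])          # first nonlinear components of x_1
```
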
For the sake of simplicity, we have above made the assumption that the observations are centered. This is easy to achieve in input space, but more difficult in F, as we cannot explicitly compute the mean of the mapped observations in F. There is, however, a way to do it, and this leads to slightly modified equations for kernel-based PCA (see Appendix A).

Before we proceed to the next section, which more closely investigates the role of the map Φ, the following observation is essential. The mapping Φ used in the matrix computation can be an arbitrary nonlinear map into the possibly high-dimensional space F, e.g. the space of all n-th order monomials in the entries of an input vector. In that case, we need to compute dot products of input vectors mapped by Φ, with a possibly prohibitive computational cost. The solution to this problem, which will be described in the following section, builds on the fact that we exclusively need to compute dot products between mapped patterns (in (10) and (17)); we never need the mapped patterns explicitly.

² If we require that Φ should not map all observations to zero, then such a p will always exist.

³ Note that in our derivation we could have used the known result (e.g. Kirby & Sirovich, 1990) that PCA can be carried out on the dot product matrix (x_i · x_j)_{ij} instead of (1); however, for the sake of clarity and extendability (in Appendix A, we shall consider the case where the data must be centered in F), we gave a detailed derivation.

3 Computing Dot Products in Feature Space

In order to compute dot products of the form (Φ(x) · Φ(y)), we use kernel representations of the form

    k(x, y) = (\Phi(x) \cdot \Phi(y)),    (18)

which allow us to compute the value of the dot product in F without having to carry out the map Φ. This method was used by Boser, Guyon, & Vapnik (1992) to extend the "Generalized Portrait" hyperplane classifier of Vapnik & Chervonenkis (1974) to nonlinear Support Vector machines. To this end, they substitute a priori chosen kernel functions for all occurrences of dot products. This way, the powerful results of Vapnik & Chervonenkis (1974) for the Generalized Portrait carry over to the nonlinear case. Aizerman, Braverman & Rozonoer (1964) call F the "linearization space", and use it in the context of the potential function classification method to express the dot product between elements of F in terms of elements of the input space. If F is high-dimensional, we would like to be able to find a closed form expression for k which can be efficiently computed. Aizerman et al. (1964) consider the possibility of choosing k a priori, without being directly concerned with the corresponding mapping into F. A specific choice of k might then correspond to a dot product between patterns mapped with a suitable Φ. A particularly useful example, which is a direct generalization of a result proved by Poggio (1975, Lemma 2.1) in the context of polynomial approximation, is

    (x \cdot y)^d = (C_d(x) \cdot C_d(y)),    (19)

where C_d maps x to the vector C_d(x) whose entries are all possible d-th degree ordered products of the entries of x. For instance (Vapnik, 1995), if x = (x_1, x_2), then C_2(x) = (x_1^2, x_2^2, x_1 x_2, x_2 x_1), or, yielding the same value of the dot product,

    c_2(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2).    (20)

For this example, it is easy to verify that

    \big( (x_1, x_2)(y_1, y_2)^\top \big)^2 = \big( x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2 \big) \big( y_1^2,\; y_2^2,\; \sqrt{2}\, y_1 y_2 \big)^\top = \big( c_2(x) \cdot c_2(y) \big).    (21)

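
The identity (19)-(21) can be checked numerically in a few lines; the sketch below compares the kernel value (x · y)^2 computed in input space with the explicit dot product of the mapped patterns c_2(x) and c_2(y). The test points are arbitrary.

```python
import numpy as np

def c2(x):
    """Explicit degree-2 feature map of Eq. (20): (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([3.0, -1.0])
y = np.array([0.5,  2.0])

lhs = np.dot(x, y) ** 2          # kernel value (x . y)^2, computed in input space
rhs = np.dot(c2(x), c2(y))       # dot product after mapping into F, cf. Eq. (21)
assert np.isclose(lhs, rhs)
print(lhs, rhs)                  # both equal (3*0.5 + (-1)*2)^2 = 0.25
```
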
In general, the function

    k(x, y) = (x \cdot y)^d    (22)

corresponds to a dot product in the space of d-th order monomials of the input coordinates. If x represents an image with the entries being pixel values, we can thus easily work in the space spanned by products of any d pixels, provided that we are able to do our work solely in terms of dot products, without any explicit usage of a mapped pattern c_d(x). The latter lives in a possibly very high-dimensional space: even though we identify terms like x_1 x_2 and x_2 x_1 into one coordinate of F as in (20), the dimensionality of F, the image of R^N under c_d, still is

    \frac{(N + d - 1)!}{d!\,(N - 1)!}

and thus grows like N^d. For instance, 16 × 16 input images and a polynomial degree d = 5 yield a dimensionality of 10^10. Thus, using kernels of the form (22) is our only way to take into account higher-order statistics without a combinatorial explosion of time complexity.

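
The quoted figure follows directly from the formula above, which equals the binomial coefficient C(N + d - 1, d); a one-line check for 16 × 16 images and d = 5:

```python
from math import comb

N, d = 16 * 16, 5              # 16 x 16 images, polynomial degree 5
dim_F = comb(N + d - 1, d)     # (N + d - 1)! / (d! (N - 1)!)
print(dim_F)                   # 9525431552, i.e. roughly 10^10, as stated in the text
```
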
The general question which function k corresponds to a dot product in some space F has been discussed by Boser, Guyon, & Vapnik (1992) and Vapnik (1995): Mercer's theorem of functional analysis states that if k is a continuous kernel of a positive integral operator, we can construct a mapping into a space where k acts as a dot product (for details, see Appendix B).

The application of (18) to our problem is straightforward: we simply substitute an a priori chosen kernel function k(x, y) for all occurrences of (Φ(x) · Φ(y)). This was the reason why we had to formulate the problem in Sec. 2 in a way which only makes use of the values of dot products in F. The choice of k then implicitly determines the mapping Φ and the feature space F.

In Appendix B, we give some examples of kernels other than (22) which may be used.

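
A practical consequence of Mercer's condition is that the matrix K built from such a kernel on any finite sample is positive semidefinite. The sketch below checks this numerically for a Gaussian radial basis kernel, used here purely as an illustration; the bandwidth value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# Gaussian (RBF) kernel matrix K_ij = exp(-||x_i - x_j||^2 / c)
c = 2.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / c)

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min())             # nonnegative, up to numerical round-off
```
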
4 Kernel PCA

4.1 The Algorithm

To perform kernel-based PCA (Fig. 1), from now on referred to as kernel PCA, the following steps have to be carried out. First, we compute the dot product matrix (cf. Eq. (10))

    K_{ij} = \big( k(x_i, x_j) \big)_{ij}.    (23)

Next, we solve (12) by diagonalizing K, and normalize the Eigenvector expansion coefficients α^n by requiring Eq. (16),

    1 = \lambda_n\, (\alpha^n \cdot \alpha^n).

Figure 1: The basic idea of kernel PCA. In some high-dimensional feature space F (bottom right), we are performing linear PCA, just as a PCA in input space (top). Since F is nonlinearly related to input space (via Φ), the contour lines of constant projections onto the principal Eigenvector (drawn as an arrow) become nonlinear in input space. Note that we cannot draw a pre-image of the Eigenvector in input space, as it may not even exist. Crucial to kernel PCA is the fact that we do not actually perform the map into F, but instead perform all necessary computations by the use of a kernel function k in input space (here: R^2).

To extract the principal components (corresponding to the kernel k) of a test point x, we then compute projections onto the Eigenvectors by (cf. Eq. (17))

    (k\mathrm{PC})_n(x) = (V^n \cdot \Phi(x)) = \sum_{i=1}^{M} \alpha_i^n\, k(x_i, x).    (24)

If we use a kernel as described in Sec. 3, we know that this procedure exactly corresponds to standard PCA in some high-dimensional feature space, except that we do not need to perform expensive computations in that space.

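
Readers who prefer not to code Eqs. (23)-(24) by hand can use an off-the-shelf implementation; the sketch below uses scikit-learn's KernelPCA with a polynomial kernel. Note that, unlike the plain derivation above, scikit-learn also centers the data in feature space (in the spirit of Appendix A), so its outputs can differ slightly from an uncentered implementation.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))

# Polynomial kernel (x . y)^2: gamma=1, coef0=0 reduce sklearn's (gamma <x, y> + coef0)^degree
kpca = KernelPCA(n_components=3, kernel="poly", degree=2, gamma=1.0, coef0=0.0)
Z = kpca.fit_transform(X)        # rows: nonlinear principal components of each x_k
print(Z.shape)                   # (30, 3)
```
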
4.2 Properties of (Kernel-) PCA

If we use a kernel which satisfies the conditions given in Sec. 3, we know that we are in fact doing a standard PCA in F. Consequently, all mathematical and statistical properties of PCA (see for instance Jolliffe, 1986) carry over to kernel-based PCA, with the modification that they become statements about a set of points Φ(x_i), i = 1, ..., M, in F rather than in R^N. In F, we can thus assert that PCA is the orthogonal basis transformation with the following properties (assuming that the Eigenvectors are sorted in descending order of the Eigenvalue size):

  • the first q (q ∈ {1, ..., M}) principal components, i.e. projections on Eigenvectors, carry more variance than any other q orthogonal directions;
  • the mean-squared approximation error in representing the observations by the first q principal components is minimal;
  • the principal components are uncorrelated;
  • the representation entropy is minimized;
  • the first q principal components have maximal mutual information with respect to the inputs.

For more details, see Diamantaras & Kung (1996).

To translate these properties of PCA in F into statements about the data in input space, they need to be investigated for specific choices of kernels. We shall not go into detail on that matter, but rather proceed in our discussion of kernel PCA.

4.3 Dimensionality Reduction and Feature Extraction

Unlike linear PCA, the proposed method allows the extraction of a number of principal components which can exceed the input dimensionality. Suppose that the number of observations M exceeds the input dimensionality N. Linear PCA, even when it is based on the M × M dot product matrix, can find at most N nonzero Eigenvalues; they are identical to the nonzero Eigenvalues of the N × N covariance matrix. In contrast, kernel PCA can find up to M nonzero Eigenvalues⁴, a fact that illustrates that it is impossible to perform kernel PCA based on an N × N covariance matrix.

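
This difference is easy to observe numerically. In the toy illustration below (our own setup, not an experiment from the paper), M = 50 points in N = 2 dimensions give a linear kernel matrix with only 2 nonzero Eigenvalues, while a Gaussian kernel matrix on the same points typically has all 50 nonzero.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 50, 2
X = rng.normal(size=(M, N))
X -= X.mean(axis=0)

K_lin = X @ X.T                                      # linear kernel: rank at most N
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K_rbf = np.exp(-sq)                                  # Gaussian kernel: typically full rank

def n_nonzero(K, tol=1e-8):
    """Count Eigenvalues above a small numerical tolerance."""
    return int(np.sum(np.linalg.eigvalsh(K) > tol))

print(n_nonzero(K_lin), n_nonzero(K_rbf))            # 2 and 50, up to numerical tolerance
```
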
4.4 Computational Complexity

As mentioned in Sec. 3, a fifth order polynomial kernel on a 256-dimensional input space yields a 10^10-dimensional feature space. It would seem that looking for principal components in this space should pose intractable computational problems. However, as we have explained above, this is not the case. First, as pointed out in Sec. 2, we do not need to look for Eigenvectors in the full space F, but just in the subspace spanned by the images of our observations x_k in F. Second, we do not need to compute dot products explicitly between vectors in F, as we know that in our case this can be done directly in the input space, using kernel

⁴ If we use one kernel (of course, we could extract features with several kernels, to get even more).


Citations
Book
18 Nov 2016
TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Journal ArticleDOI
TL;DR: There are several arguments which support the observed high accuracy of SVMs, which are reviewed and numerous examples and proofs of most of the key theorems are given.
Abstract: The tutorial starts with an overview of the concepts of VC dimension and structural risk minimization. We then describe linear Support Vector Machines (SVMs) for separable and non-separable data, working through a non-trivial example in detail. We describe a mechanical analogy, and discuss when SVM solutions are unique and when they are global. We describe how support vector training can be practically implemented, and discuss in detail the kernel mapping technique which is used to construct SVM solutions which are nonlinear in the data. We show how Support Vector machines can have very large (even infinite) VC dimension by computing the VC dimension for homogeneous polynomial and Gaussian radial basis function kernels. While very high VC dimension would normally bode ill for generalization performance, and while at present there exists no theory which shows that good generalization performance is guaranteed for SVMs, there are several arguments which support the observed high accuracy of SVMs, which we review. Results of some experiments which were inspired by these arguments are also presented. We give numerous examples and proofs of most of the key theorems. There is new material, and I hope that the reader will find that even old material is cast in a fresh light.

15,696 citations


Cites background from "Nonlinear component analysis as a k..."

  • ...Recent work has generalized the basic ideas (Smola, Schölkopf and Müller, 1998a; Smola and Schölkopf, 1998), shown connections to regularization theory (Smola, Schölkopf and Müller, 1998b; Girosi, 1998; Wahba, 1998), and shown how SVM ideas can be incorporated in a wide range of other algorithms (Schölkopf, Smola and Müller, 1998b; Schölkopf et al, 1998c)....


  • ...This fact has been used to derive a nonlinear version of principal component analysis by (Schölkopf, Smola and Müller, 1998b); it seems likely that this trick will continue to find uses elsewhere....


  • ...…Müller, 1998a; Smola and Schölkopf, 1998), shown connections to regularization theory (Smola, Schölkopf and Müller, 1998b; Girosi, 1998; Wahba, 1998), and shown how SVM ideas can be incorporated in a wide range of other algorithms (Schölkopf, Smola and Müller, 1998b; Schölkopf et al, 1998c)....


  • ...…(with each image suffering the same permutation), an act of vandalism that would leave the best performing neural networks severely handicapped) and much work has been done on incorporating prior knowledge into SVMs (Schölkopf, Burges andVapnik, 1996; Schölkopf et al., 1998a; Burges, 1998)....


  • ...Keywords: support vector machines, statistical learning theory, VC dimension, pattern recognition...


Journal ArticleDOI
TL;DR: Recent work in the area of unsupervised feature learning and deep learning is reviewed, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks.
Abstract: The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.

11,201 citations


Cites methods from "Nonlinear component analysis as a k..."

  • ...this geometric perspective adopt a non-parametric approach, based on a training set nearest neighbor graph (Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul, 2004; Hinton and Roweis, 2003; van der Maaten and Hinton, 2008)....


  • ...The large majority of algorithms built on this geometric perspective adopt a non-parametric approach, based on a training set nearest neighbor graph (Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and…...


Journal ArticleDOI
TL;DR: This tutorial gives an overview of the basic ideas underlying Support Vector (SV) machines for function estimation, and includes a summary of currently used algorithms for training SV machines, covering both the quadratic programming part and advanced methods for dealing with large datasets.
Abstract: In this tutorial we give an overview of the basic ideas underlying Support Vector (SV) machines for function estimation. Furthermore, we include a summary of currently used algorithms for training SV machines, covering both the quadratic (or convex) programming part and advanced methods for dealing with large datasets. Finally, we mention some modifications and extensions that have been applied to the standard SV algorithm, and discuss the aspect of regularization from a SV perspective.

10,696 citations

Christopher M. Bishop
01 Jan 2006
TL;DR: Probability distributions of linear models for regression and classification are given in this article, along with a discussion of combining models in the context of machine learning and classification.
Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

10,141 citations

References
Book
Vladimir Vapnik
01 Jan 1995
TL;DR: Setting of the learning problem, consistency of learning processes, bounds on the rate of convergence of learning processes, controlling the generalization ability of learning processes, constructing learning algorithms, and what is important in learning theory.
Abstract: Setting of the learning problem, consistency of learning processes, bounds on the rate of convergence of learning processes, controlling the generalization ability of learning processes, constructing learning algorithms, and what is important in learning theory.

40,147 citations


"Nonlinear component analysis as a k..." refers background or methods in this paper

  • ...Clearly, the last point has yet to be evaluated in practice; however, for the Support Vector machine, the utility of different kernels has already been established (Schölkopf, Burges, & Vapnik, 1995)....

  • ...The general question which function k corresponds to a dot product in some space F has been discussed by Boser, Guyon, & Vapnik (1992) and Vapnik (1995): Mercer's theorem of functional analysis states that if k is a continuous kernel of a positive integral operator, we can construct a mapping into a…...

  • ...In addition, they all construct their decision functions from an almost identical subset of a small number of training patterns, the Support Vectors (Schölkopf, Burges, & Vapnik, 1995)....

  • ...The number of components extracted then determines the size of the first hidden layer. Combining (24) with the Support Vector decision function (Vapnik, 1995), we thus get machines of the type f(x) = sgn(Σ_{i=1}^ℓ λ_i K_2(g̃(x_i), g̃(x)) + b)...

  • ...…convolutional 5-layer neural networks (5.0% were reported by LeCun et al., 1989) and nonlinear Support Vector classifiers (4.0%, Schölkopf, Burges, & Vapnik, 1995); it is far superior to linear classifiers operating directly on the image data (a linear Support Vector machine achieves 8.9%; Sch…...

Journal ArticleDOI
TL;DR: High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated and the performance of the support- vector network is compared to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
Abstract: The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensures high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.

37,861 citations

Book
01 May 1986
TL;DR: In this article, the authors present a graphical representation of data using Principal Component Analysis (PCA) for time series and other non-independent data, as well as a generalization and adaptation of principal component analysis.
Abstract: Introduction * Properties of Population Principal Components * Properties of Sample Principal Components * Interpreting Principal Components: Examples * Graphical Representation of Data Using Principal Components * Choosing a Subset of Principal Components or Variables * Principal Component Analysis and Factor Analysis * Principal Components in Regression Analysis * Principal Components Used with Other Multivariate Techniques * Outlier Detection, Influential Observations and Robust Estimation * Rotation and Interpretation of Principal Components * Principal Component Analysis for Time Series and Other Non-Independent Data * Principal Component Analysis for Special Types of Data * Generalizations and Adaptations of Principal Component Analysis

17,446 citations

Journal ArticleDOI
TL;DR: A near-real-time computer system that can locate and track a subject's head, and then recognize the person by comparing characteristics of the face to those of known individuals, and that is easy to implement using a neural network architecture.
Abstract: We have developed a near-real-time computer system that can locate and track a subject's head, and then recognize the person by comparing characteristics of the face to those of known individuals. The computational approach taken in this system is motivated by both physiology and information theory, as well as by the practical requirements of near-real-time performance and accuracy. Our approach treats the face recognition problem as an intrinsically two-dimensional (2-D) recognition problem rather than requiring recovery of three-dimensional geometry, taking advantage of the fact that faces are normally upright and thus may be described by a small set of 2-D characteristic views. The system functions by projecting face images onto a feature space that spans the significant variations among known face images. The significant features are known as "eigenfaces," because they are the eigenvectors (principal components) of the set of faces; they do not necessarily correspond to features such as eyes, ears, and noses. The projection operation characterizes an individual face by a weighted sum of the eigenface features, and so to recognize a particular face it is necessary only to compare these weights to those of known individuals. Some particular advantages of our approach are that it provides for the ability to learn and later recognize new faces in an unsupervised manner, and that it is easy to implement using a neural network architecture.

14,562 citations


"Nonlinear component analysis as a k..." refers background or methods in this paper

  • ...PCA has been successfully used for face recognition (Turk & Pentland, 1991) and face representation (Vetter & Poggio, 1995)....

  • ...This is due to the fact that for k(x, y) = (x · y), the Support Vector decision function (Boser, Guyon, & Vapnik, 1992), f(x) = sgn(Σ_{i=1}^ℓ λ_i k(x, x_i) + b) (28), can be expressed with a single weight vector w = Σ_{i=1}^ℓ λ_i x_i as f(x) = sgn((x · w) + b) (29). Thus the final stage of classification can be done extremely fast; the speed of the principal component extraction phase, on the other hand, and thus the accuracy-speed tradeoff of the whole classifier, can be controlled by the number of components which we extract, or by the above reduced set parameter m....

Proceedings ArticleDOI
01 Jul 1992
TL;DR: A training algorithm that maximizes the margin between the training patterns and the decision boundary is presented, applicable to a wide variety of the classification functions, including Perceptrons, polynomials, and Radial Basis Functions.
Abstract: A training algorithm that maximizes the margin between the training patterns and the decision boundary is presented. The technique is applicable to a wide variety of the classification functions, including Perceptrons, polynomials, and Radial Basis Functions. The effective number of parameters is adjusted automatically to match the complexity of the problem. The solution is expressed as a linear combination of supporting patterns. These are the subset of training patterns that are closest to the decision boundary. Bounds on the generalization performance based on the leave-one-out method and the VC-dimension are given. Experimental results on optical character recognition problems demonstrate the good generalization obtained when compared with other learning algorithms.

11,211 citations

Frequently Asked Questions (3)
Q1. What are the contributions in this paper?

The authors describe a new method for performing a nonlinear form of Principal Component Analysis. They give the derivation of the method, along with a discussion of other techniques which can be made nonlinear with the kernel approach, and present first experimental results on nonlinear feature extraction for pattern recognition. By the use of integral operator kernel functions, the authors can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map, for instance the space of all possible five-pixel products in 16 × 16 images. AS and KRM are with GMD First (Forschungszentrum Informationstechnik), Rudower Chaussee 5, 12489 Berlin. AS and BS were supported by grants from the Studienstiftung des deutschen Volkes. BS thanks the GMD First for hospitality during two visits. AS and BS thank V. Vapnik for introducing them to kernel representations of dot products during joint work on Support Vector machines. This work profited from discussions with V. Blanz, L. Bottou, C. Burges, H. Bülthoff, K. Gegenfurtner, P. Haffner, N. Murata, P. Simard, S. Solla, V. Vapnik, and T. Vetter.

In other words, the admissible kernels form a cone in the space of all integral operators. Clearly, k1 + k2 corresponds to mapping into the direct sum of the respective spaces into which k1 and k2 map.

In input space, locality consists of basing the component extraction for a point x on other points in an appropriately chosen neighbourhood of x.