Journal ArticleDOI

Nonlinear component analysis as a kernel eigenvalue problem

01 Jul 1998-Neural Computation (MIT Press)-Vol. 10, Iss: 5, pp 1299-1319
TL;DR: A new method for performing a nonlinear form of principal component analysis by the use of integral operator kernel functions is proposed and experimental results on polynomial feature extraction for pattern recognition are presented.
Abstract: A new method for performing a nonlinear form of principal component analysis is proposed. By the use of integral operator kernel functions, one can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map—for instance, the space of all possible five-pixel products in 16 × 16 images. We give the derivation of the method and present experimental results on polynomial feature extraction for pattern recognition.

Summary (2 min read)

Introduction

  • The authors describe a new method for performing a nonlinear form of Principal Component Analysis.
  • In this paper, the authors give some examples of nonlinear methods constructed by this approach.
  • Together, these two sections form the basis for Sec. 4, which presents the proposed kernel-based algorithm for nonlinear PCA. Following that, Sec. 5 will discuss some differences between kernel-based PCA and other generalizations of PCA.
  • To this end, they substitute a priori chosen kernel functions for all occurrences of dot products.
  • In experiments on classification based on the extracted principal components, the authors found that in the nonlinear case it was sufficient to use a linear Support Vector machine to construct the decision boundary. Linear Support Vector machines, moreover, are much faster in classification speed than nonlinear ones.

B Kernels Corresponding to Dot Products in Another Space

  • In practice, the authors are free to also try to use symmetric kernels of indefinite operators.
  • In that case, the matrix K can still be diagonalized and the authors can extract nonlinear feature values, with the one modification that they need to adjust their normalization condition in order to deal with possible negative Eigenvalues; K then induces a mapping to a Riemannian space with indefinite metric.
  • In fact, many symmetric forms may induce spaces with indefinite signature.
  • In the following sections, the authors give some examples of kernels that can be used for kernel PCA.

B Kernels Chosen A Priori

  • The fact that the authors can use indefinite operators distinguishes this approach from the usage of kernels in the Support Vector machine; in the latter, the definiteness is necessary for the optimization procedure.
  • The choice of c should depend on the range of the input variables; Neural Network type kernels have the form k(x, y) = tanh((x · y) + b). Interestingly, these different types of kernels allow the construction of Polynomial Classifiers, Radial Basis Function Classifiers, and Neural Networks with the Support Vector algorithm, which exhibit very similar accuracy (see the sketch below).
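
As a concrete illustration of these a priori kernel choices, the sketch below writes the three families as plain NumPy functions. The polynomial and Neural Network forms follow the expressions quoted above; the radial basis form exp(-||x - y||^2 / c) is the standard one that the parameter c alludes to, and all parameter values here are arbitrary placeholders rather than settings from the paper.

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    """Polynomial kernel k(x, y) = (x . y)^d."""
    return np.dot(x, y) ** d

def rbf_kernel(x, y, c=1.0):
    """Radial basis function kernel k(x, y) = exp(-||x - y||^2 / c)."""
    return np.exp(-np.sum((x - y) ** 2) / c)

def sigmoid_kernel(x, y, b=0.0):
    """Neural-network-type kernel k(x, y) = tanh((x . y) + b)."""
    return np.tanh(np.dot(x, y) + b)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```
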

B Local Kernels

  • Locality in their context means that the principal component extraction should take into account only neighbourhoods.
  • Depending on whether the authors consider neighbourhoods in input space or in another space (say, the image space, where the input vectors correspond to functions), locality can assume different meanings.
  • This additional degree of freedom can greatly improve statistical estimates which are computed from a limited amount of data (Bottou & Vapnik).

B Constructing Kernels from other Kernels

  • In other words, the admissible kernels form a cone in the space of all integral operators. Clearly, k1 + k2 corresponds to mapping into the direct sum of the respective spaces into which k1 and k2 map (a numerical check of this is sketched below).
  • Of course, the authors could also explicitly do the principal component extraction twice, for both kernels, and decide themselves on the respective numbers of components to extract.
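
A small numerical check of the direct-sum statement, using two kernels whose feature maps are easy to write out: a linear kernel and the degree-2 polynomial kernel of Eq. (20) in the main text. This particular choice is ours, made only so the mapped vectors stay short; the kernel matrix of k1 + k2 equals the Gram matrix of the concatenated (direct-sum) feature vectors and remains positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))           # six points in R^2

def phi1(x):                          # feature map of k1(x, y) = (x . y)
    return x

def phi2(x):                          # feature map of k2(x, y) = (x . y)^2, cf. Eq. (20)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

K1 = X @ X.T                          # k1 on all pairs
K2 = (X @ X.T) ** 2                   # k2 on all pairs

# Direct-sum feature map: concatenate the two feature vectors of each point.
Phi = np.array([np.concatenate([phi1(x), phi2(x)]) for x in X])

assert np.allclose(K1 + K2, Phi @ Phi.T)             # k1 + k2 is the direct-sum kernel
assert np.all(np.linalg.eigvalsh(K1 + K2) > -1e-10)  # and it stays positive semidefinite
```
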


Max-Planck-Institut für biologische Kybernetik
Spemannstraße 38, 72076 Tübingen, Germany
Arbeitsgruppe Bülthoff

Technical Report No. 44, December 1996

Nonlinear Component Analysis as a Kernel Eigenvalue Problem

Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller

Abstract

We describe a new method for performing a nonlinear form of Principal Component Analysis. By the use of integral operator kernel functions, we can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map, for instance the space of all possible 5-pixel products in 16 × 16 images. We give the derivation of the method, along with a discussion of other techniques which can be made nonlinear with the kernel approach, and present first experimental results on nonlinear feature extraction for pattern recognition.

AS and KRM are with GMD First (Forschungszentrum Informationstechnik), Rudower Chaussee 5, 12489 Berlin. AS and BS were supported by grants from the Studienstiftung des deutschen Volkes. BS thanks the GMD First for hospitality during two visits. AS and BS thank V. Vapnik for introducing them to kernel representations of dot products during joint work on Support Vector machines. This work profited from discussions with V. Blanz, L. Bottou, C. Burges, H. Bülthoff, K. Gegenfurtner, P. Haffner, N. Murata, P. Simard, S. Solla, V. Vapnik, and T. Vetter. We are grateful to V. Blanz, C. Burges, and S. Solla for reading a preliminary version of the manuscript.

This document is available as /pub/mpi-memos/TR-044.ps via anonymous ftp from ftp.mpik-tueb.mpg.de or from the World Wide Web, http://www.mpik-tueb.mpg.de/bu.html.


1 Introduction

Principal Component Analysis (PCA) is a powerful technique for extracting structure from possibly high-dimensional data sets. It is readily performed by solving an Eigenvalue problem, or by using iterative algorithms which estimate principal components; for reviews of the existing literature, see Jolliffe (1986) and Diamantaras & Kung (1996). PCA is an orthogonal transformation of the coordinate system in which we describe our data. The new coordinate values by which we represent our data are called principal components. It is often the case that a small number of principal components is sufficient to account for most of the structure in the data. These are sometimes called the factors or latent variables of the data.

The present work generalizes PCA to the case where we are not interested in principal components in input space, but rather in principal components of variables, or features, which are nonlinearly related to the input variables. Among these are, for instance, variables obtained by taking higher-order correlations between input variables. In the case of image analysis, this would amount to finding principal components in the space of products of input pixels.

To this end, we are using the method of expressing dot products in feature space in terms of kernel functions in input space. Given any algorithm which can be expressed solely in terms of dot products, i.e. without explicit usage of the variables themselves, this kernel method enables us to construct different nonlinear versions of it (Aizerman, Braverman, & Rozonoer, 1964; Boser, Guyon, & Vapnik, 1992). Even though this general fact was known (Burges, 1996), the machine learning community has made little use of it, the exception being Support Vector machines (Vapnik, 1995).

In this paper, we give some examples of nonlinear methods constructed by this approach. For one example, the case of a nonlinear form of principal component analysis, we shall give details and experimental results (Sections 2-6); for some other cases, we shall briefly sketch the algorithms (Sec. 7).

In the next section, we will first review the standard PCA algorithm. In order to be able to generalize it to the nonlinear case, we shall then formulate it in a way which uses exclusively dot products. In Sec. 3, we shall discuss the kernel method for computing dot products in feature spaces. Together, these two sections form the basis for Sec. 4, which presents the proposed kernel-based algorithm for nonlinear PCA. Following that, Sec. 5 will discuss some differences between kernel-based PCA and other generalizations of PCA. In Sec. 6, we shall give some first experimental results on kernel-based feature extraction for pattern recognition. After a discussion of other applications of the kernel method (Sec. 7), we conclude with a discussion (Sec. 8). Finally, some technical material which is not essential for the main thread of the argument has been relegated to the appendix.

2 PCA in Feature Spaces

Given a set of M centered observations x_k, k = 1, ..., M, with x_k ∈ R^N and \sum_{k=1}^{M} x_k = 0, PCA diagonalizes the covariance matrix¹

    C = \frac{1}{M} \sum_{j=1}^{M} x_j x_j^\top.    (1)

To do this, one has to solve the Eigenvalue equation

    \lambda v = C v    (2)

for Eigenvalues λ ≥ 0 and v ∈ R^N \ {0}. As C v = \frac{1}{M} \sum_{j=1}^{M} (x_j \cdot v)\, x_j, all solutions v must lie in the span of x_1, ..., x_M; hence (2) is equivalent to

    (x_k \cdot v) = (x_k \cdot C v) \quad \text{for all } k = 1, \dots, M.    (3)

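
For concreteness, here is a minimal NumPy sketch of this linear case: build the covariance matrix (1), solve the Eigenvalue problem (2), and read off the principal components as projections onto the leading Eigenvectors. The data and variable names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # M = 100 observations x_k in R^N, N = 5
X = X - X.mean(axis=0)               # center the data so that sum_k x_k = 0

C = (X.T @ X) / X.shape[0]           # covariance matrix, Eq. (1)
eigvals, eigvecs = np.linalg.eigh(C) # solve lambda v = C v, Eq. (2)

order = np.argsort(eigvals)[::-1]    # sort Eigenvalues in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

projections = X @ eigvecs[:, :2]     # principal components: projections onto the top-2 Eigenvectors
print(eigvals[:2], projections.shape)
```
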
The remainder of this section is devoted to a straightforward translation to a nonlinear scenario, in order to prepare the ground for the method proposed in the present paper. We shall now describe this computation in another dot product space F, which is related to the input space by a possibly nonlinear map

    \Phi : \mathbb{R}^N \to F, \qquad x \mapsto X.    (4)

Note that F, which we will refer to as the feature space, could have an arbitrarily large, possibly infinite, dimensionality. Here and in the following, upper case characters are used for elements of F, while lower case characters denote elements of R^N.

Again, we make the assumption that we are dealing with centered data, i.e. \sum_{k=1}^{M} \Phi(x_k) = 0; we shall return to this point later. Using the covariance matrix in F,

    \bar{C} = \frac{1}{M} \sum_{j=1}^{M} \Phi(x_j)\,\Phi(x_j)^\top    (5)

(if F is infinite-dimensional, we think of \Phi(x_j)\Phi(x_j)^\top as the linear operator which maps X ∈ F to \Phi(x_j)\,(\Phi(x_j) \cdot X)), we now have to find Eigenvalues λ ≥ 0 and Eigenvectors V ∈ F \ {0} satisfying

    \lambda V = \bar{C} V.    (6)

¹ More precisely, the covariance matrix is defined as the expectation of x x^\top; for convenience, we shall use the same term to refer to the maximum likelihood estimate (1) of the covariance matrix from a finite sample.

By the same argument as above, the solutions V lie in the span of Φ(x_1), ..., Φ(x_M). For us, this has two useful consequences: first, we can consider the equivalent equation

    \lambda\,(\Phi(x_k) \cdot V) = (\Phi(x_k) \cdot \bar{C} V) \quad \text{for all } k = 1, \dots, M,    (7)

and second, there exist coefficients α_i (i = 1, ..., M) such that

    V = \sum_{i=1}^{M} \alpha_i\, \Phi(x_i).    (8)

Combining (7) and (8), we get

    \lambda \sum_{i=1}^{M} \alpha_i\, (\Phi(x_k) \cdot \Phi(x_i)) = \frac{1}{M} \sum_{i=1}^{M} \alpha_i \Big( \Phi(x_k) \cdot \sum_{j=1}^{M} \Phi(x_j)\, (\Phi(x_j) \cdot \Phi(x_i)) \Big) \quad \text{for all } k = 1, \dots, M.    (9)

Defining an M × M matrix K by

    K_{ij} := (\Phi(x_i) \cdot \Phi(x_j)),    (10)

this reads

    M \lambda K \alpha = K^2 \alpha,    (11)

where α denotes the column vector with entries α_1, ..., α_M. As K is symmetric, it has a set of Eigenvectors which spans the whole space; thus

    M \lambda \alpha = K \alpha    (12)

gives us all solutions α of Eq. (11). Note that K is positive semidefinite, which can be seen by noticing that it equals

    \big( \Phi(x_1), \dots, \Phi(x_M) \big)^\top \big( \Phi(x_1), \dots, \Phi(x_M) \big),    (13)

which implies that for all coefficient vectors α ∈ R^M,

    (\alpha \cdot K \alpha) = \big\| \big( \Phi(x_1), \dots, \Phi(x_M) \big)\, \alpha \big\|^2 \;\ge\; 0.    (14)

Consequently, K's Eigenvalues will be nonnegative, and will exactly give the solutions Mλ of Eq. (11). We therefore only need to diagonalize K. Let λ_1 ≤ λ_2 ≤ ... ≤ λ_M denote the Eigenvalues, and α^1, ..., α^M the corresponding complete set of Eigenvectors, with λ_p being the first nonzero Eigenvalue.² We normalize α^p, ..., α^M by requiring that the corresponding vectors in F be normalized, i.e.

    (V^k \cdot V^k) = 1 \quad \text{for all } k = p, \dots, M.    (15)

By virtue of (8) and (12), this translates into a normalization condition for α^p, ..., α^M:

    1 = \sum_{i,j=1}^{M} \alpha_i^k \alpha_j^k\, (\Phi(x_i) \cdot \Phi(x_j)) = \sum_{i,j=1}^{M} \alpha_i^k \alpha_j^k K_{ij} = (\alpha^k \cdot K \alpha^k) = \lambda_k\, (\alpha^k \cdot \alpha^k).    (16)

For the purpose of principal component extraction, we need to compute projections onto the Eigenvectors V^k in F (k = p, ..., M). Let x be a test point, with an image Φ(x) in F; then

    (V^k \cdot \Phi(x)) = \sum_{i=1}^{M} \alpha_i^k\, (\Phi(x_i) \cdot \Phi(x))    (17)

may be called its nonlinear principal components corresponding to Φ.

In summary, the following steps were necessary to compute the principal components: first, compute the dot product matrix K defined by (10);³ second, compute its Eigenvectors and normalize them in F; third, compute projections of a test point onto the Eigenvectors by (17).

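
These three steps translate almost line by line into code. The sketch below is a minimal, uncentered version (the centering correction of Appendix A is omitted), with a polynomial kernel chosen only as an example; function names such as kernel_pca_fit are ours, not the paper's.

```python
import numpy as np

def kernel_pca_fit(X, kernel):
    """Steps 1 and 2: dot product matrix K (10), Eigenvectors of K (12), normalization (16)."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # Eq. (10)
    lam, alpha = np.linalg.eigh(K)                            # K alpha^k = lambda_k alpha^k, cf. Eq. (12)
    keep = lam > 1e-10                                        # keep the nonzero Eigenvalues (k >= p)
    lam, alpha = lam[keep][::-1], alpha[:, keep][:, ::-1]     # largest Eigenvalue first
    return alpha / np.sqrt(lam)                               # enforce 1 = lambda_k (alpha^k . alpha^k), Eq. (16)

def kernel_pca_project(x, X_train, alpha, kernel):
    """Step 3: nonlinear principal components of a test point x, Eq. (17)."""
    return np.array([kernel(xi, x) for xi in X_train]) @ alpha

poly2 = lambda x, y: np.dot(x, y) ** 2                        # example kernel, cf. Eq. (22) with d = 2
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
alpha = kernel_pca_fit(X, poly2)
print(kernel_pca_project(X[0], X, alpha, poly2)[:3])          # first nonlinear components of x_1
```
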
For the sake of simplicity, we have above made the assumption that the observations are centered. This is easy to achieve in input space, but more difficult in F, as we cannot explicitly compute the mean of the mapped observations in F. There is, however, a way to do it, and this leads to slightly modified equations for kernel-based PCA (see Appendix A).

Before we proceed to the next section, which more closely investigates the role of the map Φ, the following observation is essential. The mapping Φ used in the matrix computation can be an arbitrary nonlinear map into the possibly high-dimensional space F, e.g. the space of all n-th order monomials in the entries of an input vector. In that case, we need to compute dot products of input vectors mapped by Φ, with a possibly prohibitive computational cost. The solution to this problem, which will be described in the following section, builds on the fact that we exclusively need to compute dot products between mapped patterns (in (10) and (17)); we never need the mapped patterns explicitly.

² If we require that Φ should not map all observations to zero, then such a p will always exist.

³ Note that in our derivation we could have used the known result (e.g. Kirby & Sirovich, 1990) that PCA can be carried out on the dot product matrix (x_i · x_j)_{ij} instead of (1); however, for the sake of clarity and extendability (in Appendix A, we shall consider the case where the data must be centered in F), we gave a detailed derivation.

3 Computing Dot Products in Feature Space

In order to compute dot products of the form (Φ(x) · Φ(y)), we use kernel representations of the form

    k(x, y) = (\Phi(x) \cdot \Phi(y)),    (18)

which allow us to compute the value of the dot product in F without having to carry out the map Φ. This method was used by Boser, Guyon, & Vapnik (1992) to extend the "Generalized Portrait" hyperplane classifier of Vapnik & Chervonenkis (1974) to nonlinear Support Vector machines. To this end, they substitute a priori chosen kernel functions for all occurrences of dot products. This way, the powerful results of Vapnik & Chervonenkis (1974) for the Generalized Portrait carry over to the nonlinear case. Aizerman, Braverman & Rozonoer (1964) call F the "linearization space", and use it in the context of the potential function classification method to express the dot product between elements of F in terms of elements of the input space. If F is high-dimensional, we would like to be able to find a closed form expression for k which can be efficiently computed. Aizerman et al. (1964) consider the possibility of choosing k a priori, without being directly concerned with the corresponding mapping into F. A specific choice of k might then correspond to a dot product between patterns mapped with a suitable Φ. A particularly useful example, which is a direct generalization of a result proved by Poggio (1975, Lemma 2.1) in the context of polynomial approximation, is

    (x \cdot y)^d = (C_d(x) \cdot C_d(y)),    (19)

where C_d maps x to the vector C_d(x) whose entries are all possible d-th degree ordered products of the entries of x. For instance (Vapnik, 1995), if x = (x_1, x_2), then C_2(x) = (x_1^2, x_2^2, x_1 x_2, x_2 x_1), or, yielding the same value of the dot product,

    c_2(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2).    (20)

For this example, it is easy to verify that

    \big( (x_1, x_2)(y_1, y_2)^\top \big)^2 = \big( x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2 \big) \big( y_1^2,\; y_2^2,\; \sqrt{2}\, y_1 y_2 \big)^\top = \big( c_2(x) \cdot c_2(y) \big).    (21)

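
The identity (19)-(21) can be checked numerically in a few lines; the sketch below compares the kernel value (x · y)^2 computed in input space with the explicit dot product of the mapped patterns c_2(x) and c_2(y). The test points are arbitrary.

```python
import numpy as np

def c2(x):
    """Explicit degree-2 feature map of Eq. (20): (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([3.0, -1.0])
y = np.array([0.5,  2.0])

lhs = np.dot(x, y) ** 2          # kernel value (x . y)^2, computed in input space
rhs = np.dot(c2(x), c2(y))       # dot product after mapping into F, cf. Eq. (21)
assert np.isclose(lhs, rhs)
print(lhs, rhs)                  # both equal (3*0.5 + (-1)*2)^2 = 0.25
```
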
In general, the function

    k(x, y) = (x \cdot y)^d    (22)

corresponds to a dot product in the space of d-th order monomials of the input coordinates. If x represents an image with the entries being pixel values, we can thus easily work in the space spanned by products of any d pixels, provided that we are able to do our work solely in terms of dot products, without any explicit usage of a mapped pattern c_d(x). The latter lives in a possibly very high-dimensional space: even though we identify terms like x_1 x_2 and x_2 x_1 into one coordinate of F as in (20), the dimensionality of F, the image of R^N under c_d, still is

    \frac{(N + d - 1)!}{d!\,(N - 1)!}

and thus grows like N^d. For instance, 16 × 16 input images and a polynomial degree d = 5 yield a dimensionality of 10^10. Thus, using kernels of the form (22) is our only way to take into account higher-order statistics without a combinatorial explosion of time complexity.

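
The quoted figure follows directly from the formula above, which equals the binomial coefficient C(N + d - 1, d); a one-line check for 16 × 16 images and d = 5:

```python
from math import comb

N, d = 16 * 16, 5              # 16 x 16 images, polynomial degree 5
dim_F = comb(N + d - 1, d)     # (N + d - 1)! / (d! (N - 1)!)
print(dim_F)                   # 9525431552, i.e. roughly 10^10, as stated in the text
```
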
The general question which function k corresponds to a dot product in some space F has been discussed by Boser, Guyon, & Vapnik (1992) and Vapnik (1995): Mercer's theorem of functional analysis states that if k is a continuous kernel of a positive integral operator, we can construct a mapping into a space where k acts as a dot product (for details, see Appendix B).

The application of (18) to our problem is straightforward: we simply substitute an a priori chosen kernel function k(x, y) for all occurrences of (Φ(x) · Φ(y)). This was the reason why we had to formulate the problem in Sec. 2 in a way which only makes use of the values of dot products in F. The choice of k then implicitly determines the mapping Φ and the feature space F.

In Appendix B, we give some examples of kernels other than (22) which may be used.

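
A practical consequence of Mercer's condition is that the matrix K built from such a kernel on any finite sample is positive semidefinite. The sketch below checks this numerically for a Gaussian radial basis kernel, used here purely as an illustration; the bandwidth value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# Gaussian (RBF) kernel matrix K_ij = exp(-||x_i - x_j||^2 / c)
c = 2.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / c)

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min())             # nonnegative, up to numerical round-off
```
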
4 Kernel PCA

4.1 The Algorithm

To perform kernel-based PCA (Fig. 1), from now on referred to as kernel PCA, the following steps have to be carried out. First, we compute the dot product matrix (cf. Eq. (10))

    K_{ij} = \big( k(x_i, x_j) \big)_{ij}.    (23)

Next, we solve (12) by diagonalizing K, and normalize the Eigenvector expansion coefficients α^n by requiring Eq. (16),

    1 = \lambda_n\, (\alpha^n \cdot \alpha^n).

Figure 1: The basic idea of kernel PCA. In some high-dimensional feature space F (bottom right), we are performing linear PCA, just as a PCA in input space (top). Since F is nonlinearly related to input space (via Φ), the contour lines of constant projections onto the principal Eigenvector (drawn as an arrow) become nonlinear in input space. Note that we cannot draw a pre-image of the Eigenvector in input space, as it may not even exist. Crucial to kernel PCA is the fact that we do not actually perform the map into F, but instead perform all necessary computations by the use of a kernel function k in input space (here: R^2).

To extract the principal components (corresponding to the kernel k) of a test point x, we then compute projections onto the Eigenvectors by (cf. Eq. (17))

    (k\mathrm{PC})_n(x) = (V^n \cdot \Phi(x)) = \sum_{i=1}^{M} \alpha_i^n\, k(x_i, x).    (24)

If we use a kernel as described in Sec. 3, we know that this procedure exactly corresponds to standard PCA in some high-dimensional feature space, except that we do not need to perform expensive computations in that space.

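
Readers who prefer not to code Eqs. (23)-(24) by hand can use an off-the-shelf implementation; the sketch below uses scikit-learn's KernelPCA with a polynomial kernel. Note that, unlike the plain derivation above, scikit-learn also centers the data in feature space (in the spirit of Appendix A), so its outputs can differ slightly from an uncentered implementation.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))

# Polynomial kernel (x . y)^2: gamma=1, coef0=0 reduce sklearn's (gamma <x, y> + coef0)^degree
kpca = KernelPCA(n_components=3, kernel="poly", degree=2, gamma=1.0, coef0=0.0)
Z = kpca.fit_transform(X)        # rows: nonlinear principal components of each x_k
print(Z.shape)                   # (30, 3)
```
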
4.2 Properties of (Kernel-) PCA

If we use a kernel which satisfies the conditions given in Sec. 3, we know that we are in fact doing a standard PCA in F. Consequently, all mathematical and statistical properties of PCA (see for instance Jolliffe, 1986) carry over to kernel-based PCA, with the modification that they become statements about a set of points Φ(x_i), i = 1, ..., M, in F rather than in R^N. In F, we can thus assert that PCA is the orthogonal basis transformation with the following properties (assuming that the Eigenvectors are sorted in descending order of the Eigenvalue size):

  • the first q (q ∈ {1, ..., M}) principal components, i.e. projections on Eigenvectors, carry more variance than any other q orthogonal directions;
  • the mean-squared approximation error in representing the observations by the first q principal components is minimal;
  • the principal components are uncorrelated;
  • the representation entropy is minimized;
  • the first q principal components have maximal mutual information with respect to the inputs.

For more details, see Diamantaras & Kung (1996).

To translate these properties of PCA in F into statements about the data in input space, they need to be investigated for specific choices of kernels. We shall not go into detail on that matter, but rather proceed in our discussion of kernel PCA.

4.3 Dimensionality Reduction and Feature Extraction

Unlike linear PCA, the proposed method allows the extraction of a number of principal components which can exceed the input dimensionality. Suppose that the number of observations M exceeds the input dimensionality N. Linear PCA, even when it is based on the M × M dot product matrix, can find at most N nonzero Eigenvalues; they are identical to the nonzero Eigenvalues of the N × N covariance matrix. In contrast, kernel PCA can find up to M nonzero Eigenvalues⁴, a fact that illustrates that it is impossible to perform kernel PCA based on an N × N covariance matrix.

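
This difference is easy to observe numerically. In the toy illustration below (our own setup, not an experiment from the paper), M = 50 points in N = 2 dimensions give a linear kernel matrix with only 2 nonzero Eigenvalues, while a Gaussian kernel matrix on the same points typically has all 50 nonzero.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 50, 2
X = rng.normal(size=(M, N))
X -= X.mean(axis=0)

K_lin = X @ X.T                                      # linear kernel: rank at most N
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K_rbf = np.exp(-sq)                                  # Gaussian kernel: typically full rank

def n_nonzero(K, tol=1e-8):
    """Count Eigenvalues above a small numerical tolerance."""
    return int(np.sum(np.linalg.eigvalsh(K) > tol))

print(n_nonzero(K_lin), n_nonzero(K_rbf))            # 2 and 50, up to numerical tolerance
```
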
4.4 Computational Complexity

As mentioned in Sec. 3, a fifth order polynomial kernel on a 256-dimensional input space yields a 10^10-dimensional feature space. It would seem that looking for principal components in this space should pose intractable computational problems. However, as we have explained above, this is not the case. First, as pointed out in Sec. 2, we do not need to look for Eigenvectors in the full space F, but just in the subspace spanned by the images of our observations x_k in F. Second, we do not need to compute dot products explicitly between vectors in F, as we know that in our case this can be done directly in the input space, using kernel

⁴ If we use one kernel (of course, we could extract features with several kernels, to get even more).


Citations
Book
18 Nov 2016
TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Journal ArticleDOI
TL;DR: There are several arguments which support the observed high accuracy of SVMs, which are reviewed and numerous examples and proofs of most of the key theorems are given.
Abstract: The tutorial starts with an overview of the concepts of VC dimension and structural risk minimization. We then describe linear Support Vector Machines (SVMs) for separable and non-separable data, working through a non-trivial example in detail. We describe a mechanical analogy, and discuss when SVM solutions are unique and when they are global. We describe how support vector training can be practically implemented, and discuss in detail the kernel mapping technique which is used to construct SVM solutions which are nonlinear in the data. We show how Support Vector machines can have very large (even infinite) VC dimension by computing the VC dimension for homogeneous polynomial and Gaussian radial basis function kernels. While very high VC dimension would normally bode ill for generalization performance, and while at present there exists no theory which shows that good generalization performance is guaranteed for SVMs, there are several arguments which support the observed high accuracy of SVMs, which we review. Results of some experiments which were inspired by these arguments are also presented. We give numerous examples and proofs of most of the key theorems. There is new material, and I hope that the reader will find that even old material is cast in a fresh light.

15,696 citations


Cites background from "Nonlinear component analysis as a k..."

  • ...Recent work has generalized the basic ideas (Smola, Schölkopf and Müller, 1998a; Smola and Schölkopf, 1998), shown connections to regularization theory (Smola, Schölkopf and Müller, 1998b; Girosi, 1998; Wahba, 1998), and shown how SVM ideas can be incorporated in a wide range of other algorithms (Schölkopf, Smola and Müller, 1998b; Schölkopf et al, 1998c)....


  • ...This fact has been used to derive a nonlinear version of principal component analysis by (Schölkopf, Smola and Müller, 1998b); it seems likely that this trick will continue to find uses elsewhere....


  • ...…Müller, 1998a; Smola and Schölkopf, 1998), shown connections to regularization theory (Smola, Schölkopf and Müller, 1998b; Girosi, 1998; Wahba, 1998), and shown how SVM ideas can be incorporated in a wide range of other algorithms (Schölkopf, Smola and Müller, 1998b; Schölkopf et al, 1998c)....


  • ...…(with each image suffering the same permutation), an act of vandalism that would leave the best performing neural networks severely handicapped) and much work has been done on incorporating prior knowledge into SVMs (Schölkopf, Burges andVapnik, 1996; Schölkopf et al., 1998a; Burges, 1998)....


  • ...Keywords: support vector machines, statistical learning theory, VC dimension, pattern recognition...


Journal ArticleDOI
TL;DR: Recent work in the area of unsupervised feature learning and deep learning is reviewed, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks.
Abstract: The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.

11,201 citations


Cites methods from "Nonlinear component analysis as a k..."

  • ...this geometric perspective adopt a non-parametric approach, based on a training set nearest neighbor graph (Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul, 2004; Hinton and Roweis, 2003; van der Maaten and Hinton, 2008)....


  • ...The large majority of algorithms built on this geometric perspective adopt a non-parametric approach, based on a training set nearest neighbor graph (Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and…...


Journal ArticleDOI
TL;DR: This tutorial gives an overview of the basic ideas underlying Support Vector (SV) machines for function estimation, and includes a summary of currently used algorithms for training SV machines, covering both the quadratic programming part and advanced methods for dealing with large datasets.
Abstract: In this tutorial we give an overview of the basic ideas underlying Support Vector (SV) machines for function estimation. Furthermore, we include a summary of currently used algorithms for training SV machines, covering both the quadratic (or convex) programming part and advanced methods for dealing with large datasets. Finally, we mention some modifications and extensions that have been applied to the standard SV algorithm, and discuss the aspect of regularization from a SV perspective.

10,696 citations

Christopher M. Bishop
01 Jan 2006
TL;DR: Probability distributions of linear models for regression and classification are given in this article, along with a discussion of combining models in the context of machine learning and classification.
Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

10,141 citations

References
Book
Vladimir Vapnik
01 Jan 1995
TL;DR: Setting of the learning problem, consistency of learning processes, bounds on the rate of convergence of learning processes, controlling the generalization ability of learning processes, constructing learning algorithms, and what is important in learning theory.
Abstract: Setting of the learning problem, consistency of learning processes, bounds on the rate of convergence of learning processes, controlling the generalization ability of learning processes, constructing learning algorithms, and what is important in learning theory.

40,147 citations


"Nonlinear component analysis as a k..." refers background or methods in this paper

  • ...Clearly, the last point has yet to be evaluated in practice; however, for the Support Vector machine, the utility of different kernels has already been established (Schölkopf, Burges, & Vapnik, 1995)....

  • ...The general question which function k corresponds to a dot product in some space F has been discussed by Boser, Guyon, & Vapnik (1992) and Vapnik (1995): Mercer's theorem of functional analysis states that if k is a continuous kernel of a positive integral operator, we can construct a mapping into a…...

  • ...In addition, they all construct their decision functions from an almost identical subset of a small number of training patterns, the Support Vectors (Schölkopf, Burges, & Vapnik, 1995)....

  • ...The number of components extracted then determines the size of the first hidden layer. Combining (24) with the Support Vector decision function (Vapnik, 1995), we thus get machines of the type f(x) = sgn(Σ_{i=1}^ℓ λ_i K_2(g̃(x_i), g̃(x)) + b)...

  • ...…convolutional 5-layer neural networks (5.0% were reported by LeCun et al., 1989) and nonlinear Support Vector classifiers (4.0%, Schölkopf, Burges, & Vapnik, 1995); it is far superior to linear classifiers operating directly on the image data (a linear Support Vector machine achieves 8.9%; Sch…...

Journal ArticleDOI
TL;DR: High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated and the performance of the support- vector network is compared to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
Abstract: The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensures high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.

37,861 citations

Book
01 May 1986
TL;DR: In this article, the authors present a graphical representation of data using Principal Component Analysis (PCA) for time series and other non-independent data, as well as a generalization and adaptation of principal component analysis.
Abstract: Introduction * Properties of Population Principal Components * Properties of Sample Principal Components * Interpreting Principal Components: Examples * Graphical Representation of Data Using Principal Components * Choosing a Subset of Principal Components or Variables * Principal Component Analysis and Factor Analysis * Principal Components in Regression Analysis * Principal Components Used with Other Multivariate Techniques * Outlier Detection, Influential Observations and Robust Estimation * Rotation and Interpretation of Principal Components * Principal Component Analysis for Time Series and Other Non-Independent Data * Principal Component Analysis for Special Types of Data * Generalizations and Adaptations of Principal Component Analysis

17,446 citations

Journal ArticleDOI
TL;DR: A near-real-time computer system that can locate and track a subject's head, and then recognize the person by comparing characteristics of the face to those of known individuals, and that is easy to implement using a neural network architecture.
Abstract: We have developed a near-real-time computer system that can locate and track a subject's head, and then recognize the person by comparing characteristics of the face to those of known individuals. The computational approach taken in this system is motivated by both physiology and information theory, as well as by the practical requirements of near-real-time performance and accuracy. Our approach treats the face recognition problem as an intrinsically two-dimensional (2-D) recognition problem rather than requiring recovery of three-dimensional geometry, taking advantage of the fact that faces are normally upright and thus may be described by a small set of 2-D characteristic views. The system functions by projecting face images onto a feature space that spans the significant variations among known face images. The significant features are known as "eigenfaces," because they are the eigenvectors (principal components) of the set of faces; they do not necessarily correspond to features such as eyes, ears, and noses. The projection operation characterizes an individual face by a weighted sum of the eigenface features, and so to recognize a particular face it is necessary only to compare these weights to those of known individuals. Some particular advantages of our approach are that it provides for the ability to learn and later recognize new faces in an unsupervised manner, and that it is easy to implement using a neural network architecture.

14,562 citations


"Nonlinear component analysis as a k..." refers background or methods in this paper

  • ...PCA has been successfully used for face recognition (Turk & Pentland, 1991) and face representation (Vetter & Poggio, 1995)....

  • ...This is due to the fact that for k(x, y) = (x · y), the Support Vector decision function (Boser, Guyon, & Vapnik, 1992), f(x) = sgn(Σ_{i=1}^ℓ λ_i k(x, x_i) + b) (28), can be expressed with a single weight vector w = Σ_{i=1}^ℓ λ_i x_i as f(x) = sgn((x · w) + b) (29). Thus the final stage of classification can be done extremely fast; the speed of the principal component extraction phase, on the other hand, and thus the accuracy-speed tradeoff of the whole classifier, can be controlled by the number of components which we extract, or by the above reduced set parameter m....

Proceedings ArticleDOI
01 Jul 1992
TL;DR: A training algorithm that maximizes the margin between the training patterns and the decision boundary is presented, applicable to a wide variety of the classification functions, including Perceptrons, polynomials, and Radial Basis Functions.
Abstract: A training algorithm that maximizes the margin between the training patterns and the decision boundary is presented. The technique is applicable to a wide variety of the classification functions, including Perceptrons, polynomials, and Radial Basis Functions. The effective number of parameters is adjusted automatically to match the complexity of the problem. The solution is expressed as a linear combination of supporting patterns. These are the subset of training patterns that are closest to the decision boundary. Bounds on the generalization performance based on the leave-one-out method and the VC-dimension are given. Experimental results on optical character recognition problems demonstrate the good generalization obtained when compared with other learning algorithms.

11,211 citations

Frequently Asked Questions (3)
Q1. What are the contributions in this paper?

The authors describe a new method for performing a nonlinear form of Principal Component Analysis. They give the derivation of the method, along with a discussion of other techniques which can be made nonlinear with the kernel approach, and present first experimental results on nonlinear feature extraction for pattern recognition. By the use of integral operator kernel functions, the authors can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map, for instance the space of all possible five-pixel products in 16 × 16 images. AS and KRM are with GMD First (Forschungszentrum Informationstechnik), Rudower Chaussee 5, 12489 Berlin. AS and BS were supported by grants from the Studienstiftung des deutschen Volkes. BS thanks the GMD First for hospitality during two visits. AS and BS thank V. Vapnik for introducing them to kernel representations of dot products during joint work on Support Vector machines. This work profited from discussions with V. Blanz, L. Bottou, C. Burges, H. Bülthoff, K. Gegenfurtner, P. Haffner, N. Murata, P. Simard, S. Solla, V. Vapnik, and T. Vetter.

In other words, the admissible kernels form a cone in the space of all integral operators. Clearly, k1 + k2 corresponds to mapping into the direct sum of the respective spaces into which k1 and k2 map.

In input space, locality consists of basing the component extraction for a point x on other points in an appropriately chosen neighbourhood of x.