
Candid covariance-free incremental principal component analysis

01 Aug 2003 - IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE Computer Society) - Vol. 25, Iss. 8, pp. 1034-1040
TL;DR: A fast incremental principal component analysis (IPCA) algorithm, called candid covariance-free IPCA (CCIPCA), computes the principal components of a sequence of samples incrementally without estimating the covariance matrix (so covariance-free).
Abstract: Appearance-based image analysis techniques require fast computation of principal components of high-dimensional image vectors. We introduce a fast incremental principal component analysis (IPCA) algorithm, called candid covariance-free IPCA (CCIPCA), used to compute the principal components of a sequence of samples incrementally without estimating the covariance matrix (so covariance-free). The new method is motivated by the concept of statistical efficiency (the estimate has the smallest variance given the observed data). To do this, it keeps the scale of observations and computes the mean of observations incrementally, which is an efficient estimate for some well known distributions (e.g., Gaussian), although the highest possible efficiency is not guaranteed in our case because of unknown sample distribution. The method is for real-time applications and, thus, it does not allow iterations. It converges very fast for high-dimensional image vectors. Some links between IPCA and the development of the cerebral cortex are also discussed.

Summary

1 INTRODUCTION

  • A class of image analysis techniques called appearance-based approach has now become very popular.
  • Further, when the dimension of the image is high, both the computation and storage complexity grow dramatically.
  • Several IPCA techniques have been proposed to compute principal components without the covariance matrix [9], [10], [11].
  • An amnesic average technique is also used to dynamically determine the retaining rate of the old and new data, instead of a fixed learning rate.

2.1 The First Eigenvector

  • Suppose that sample vectors are acquired sequentially, u(1), u(2), ..., possibly infinite.
  • Now, the question is how to estimate x(i) in (2).
  • Procedure (6) is at the mercy of the magnitude of the observation u(n), where the first term has a unit norm, but the second can take any magnitude.
  • In (4), the statistical efficiency is realized by keeping the scale of the estimate at the same order as the new observations (the first and second terms are properly weighted on the right side of (4) to get the sample mean), which allows full use of every observation in terms of statistical efficiency.

2.2 Intuitive Explanation

  • An intuitive explanation of procedure (4) is as follows: Consider a set of two-dimensional data with a Gaussian probability distribution function (for any other physically arising distribution, the authors can consider its first two orders of statistics since PCA does so).
  • Noticing that u^T(n) v_1(n-1)/||v_1(n-1)|| is a scalar, the authors know that (1/n) u(n) u^T(n) v_1(n-1)/||v_1(n-1)|| is essentially a scaled vector of u(n).
  • For the points u_u in the upper half plane, the net pulling force will pull v_1(n-1) toward the direction of v_1, since there are more data points to the right side of v_1(n-1) than to the left side.
  • v_1(n-1) will not stop moving until it is aligned with v_1, when the pulling forces from both sides are balanced.
  • In other words, v_1(n) in (4) will converge to the first eigenvector.

2.3 Higher-Order Eigenvectors

  • Procedure (4) only estimates the first dominant eigenvector.
  • To compute the second-order eigenvector, the authors first subtract from the data its projection on the estimated first-order eigenvector v_1(n), as shown in (9): u_2(n) = u_1(n) - (u_1^T(n) v_1(n)/||v_1(n)||) v_1(n)/||v_1(n)||, where u_1(n) = u(n).
  • The orthogonality is always enforced when the convergence is reached, although not exactly so at early stages.
  • In either case, the statistical efficiency was not considered.
  • One may notice that the expensive steps in both SGA and CCIPCA are the dot products in the high-dimensional data space.

2.4 Equal Eigenvalues

  • Let us consider the case where there are equal eigenvalues.
  • Therefore, the estimate of eigenvectors ei, where i < l, will not be affected anyway.
  • Since their eigenvalues are equal, the shape of the distribution in Fig. 1 is a hypersphere within the subspace.
  • Thus, the estimates of the multiple eigenvectors will converge to any set of the orthogonal basis of that subspace.
  • Where it converges to depends mainly on the early samples because of the averaging effect in (2), where the contribution of new data gets infinitely small when n increases without a bound.

2.5 Algorithm Summary

  • Combining the mechanisms discussed above, the authors obtain the candid covariance-free IPCA algorithm (Procedure 1): at step n, for each component i = 1, ..., min(k, n), with u_1(n) = u(n), (a) if i = n, initialize v_i(n) = u_i(n); (b) otherwise, update
    v_i(n) = \frac{n-1-l}{n} v_i(n-1) + \frac{1+l}{n} u_i(n) u_i^T(n) \frac{v_i(n-1)}{\|v_i(n-1)\|},    (10)
    u_{i+1}(n) = u_i(n) - \left( u_i^T(n) \frac{v_i(n)}{\|v_i(n)\|} \right) \frac{v_i(n)}{\|v_i(n)\|}.    (11)
  • A mathematical proof of the convergence of CCIPCA can be found in [12]. A runnable sketch of the full procedure is given below.
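The following NumPy sketch is one reading of Procedure 1, not the authors' code: the function name ccipca, the default of k = 10 components, and the clamping of the old-data weight at zero for the very first samples (where (n - 1 - l)/n would be negative) are assumptions made here for illustration.

import numpy as np

def ccipca(samples, k=10, amnesic=2.0):
    # One pass of candid covariance-free IPCA over an iterable of zero-mean
    # d-dimensional vectors u(n).  Row i of the returned array estimates
    # lambda_i * e_i, so ||V[i]|| is the eigenvalue and V[i]/||V[i]|| the eigenvector.
    V = None
    for n, u in enumerate(samples, start=1):
        u = np.asarray(u, dtype=float).copy()
        if V is None:
            V = np.zeros((k, u.size))
        for i in range(min(k, n)):
            if i == n - 1:
                # Step (a): initialize the i-th estimate with the current residual.
                V[i] = u
            else:
                # Step (b), eq. (10): amnesic mean update of v_i.
                w_old = max((n - 1.0 - amnesic) / n, 0.0)   # clamp: our choice, not the paper's
                w_new = (1.0 + amnesic) / n
                d = V[i] / np.linalg.norm(V[i])
                V[i] = w_old * V[i] + w_new * (u @ d) * u
                # Eq. (11): deflate u so the next component sees only the residual.
                d = V[i] / np.linalg.norm(V[i])
                u = u - (u @ d) * d
    return V

After one pass, np.linalg.norm(V[i]) plays the role of the ith eigenvalue and V[i] / np.linalg.norm(V[i]) of the ith eigenvector, matching lambda = ||v|| and x = v/||v|| in Section 2.1.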

3 EMPIRICAL RESULTS ON CONVERGENCE

  • The authors performed experiments to study the statistical efficiency of the new algorithm as well as the existing IPCA algorithms, especially for high-dimensional data such as images.
  • In contrast, the proposed CCIPCA converges fast.
  • Shown in Fig. 7 are the first 10 eigenfaces estimated by batch PCA and CCIPCA (with the amnesic parameter l = 2) after one epoch and 20 epochs, respectively.
  • For the general readership, an experiment was done on a lower dimension data set.
  • The authors extracted 10 x 10 pixel subimages around the right eye area in each image of the FERET data set, estimated their sample covariance matrix Σ, and used MATLAB to generate 1,000 samples with the Gaussian distribution N(0, Σ) in the 100-dimensional space (see the sketch below).
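A rough sketch of that lower-dimensional experiment, reusing the ccipca helper sketched earlier; since the FERET eye-region covariance is not available here, a synthetic covariance with a decaying spectrum stands in for Σ, and correctness is measured, as in the paper's figures, by the dot product between each CCIPCA direction and the corresponding batch-PCA eigenvector.

import numpy as np

rng = np.random.default_rng(0)
d, n_samples, k = 100, 1000, 10

# Stand-in for the estimated eye-region covariance: random orthogonal basis, 1/i spectrum.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma = Q @ np.diag(1.0 / np.arange(1, d + 1)) @ Q.T

X = rng.multivariate_normal(np.zeros(d), Sigma, size=n_samples)    # ~ N(0, Sigma)

V = ccipca(X, k=k, amnesic=2.0)                                    # incremental estimates
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)  # batch PCA directions

for i in range(k):
    v_hat = V[i] / np.linalg.norm(V[i])
    print(i + 1, round(abs(v_hat @ Vt[i]), 3))     # 1.0 means the directions agree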

4 CONCLUSIONS AND DISCUSSIONS

  • This short paper concentrates on a challenging issue of computing dominating eigenvectors and eigenvalues from an incrementally arriving high-dimensional data stream without computing the corresponding covariance matrix and without knowing data in advance.
  • An amnesic average technique is implemented to further improve the convergence rate.
  • The importance of the result presented here potentially goes beyond the apparent technical scope of interest to the computer vision community.
  • As discussed in [7], what a human brain does is not just computing—processing data—but, more importantly and more fundamentally, developing the computing engine itself, from real-world, online sensory data streams.
  • The link between incremental PCA and the developmental mechanisms of the brain is probably more intimate than one can fully appreciate now.


Candid Covariance-Free Incremental
Principal Component Analysis
Juyang Weng, Member, IEEE,
Yilu Zhang, Student Member, IEEE, and
Wey-Shiuan Hwang, Member, IEEE
Abstract—Appearance-based image analysis techniques require fast
computation of principal components of high-dimensional image vectors. We
introduce a fast incremental principal component analysis (IPCA) algorithm, called
candid covariance-free IPCA (CCIPCA), used to compute the principal
components of a sequence of samples incrementally without estimating the
covariance matrix (so covariance-free). The new method is motivated by the
concept of statistical efficiency (the estimate has the smallest variance given the
observed data). To do this, it keeps the scale of observations and computes the
mean of observations incrementally, which is an efficient estimate for some well-
known distributions (e.g., Gaussian), although the highest possible efficiency is not
guaranteed in our case because of unknown sample distribution. The method is for
real-time applications and, thus, it does not allow iterations. It converges very fast
for high-dimensional image vectors. Some links between IPCA and the
development of the cerebral cortex are also discussed.
Index Terms—Principal component analysis, incremental principal component
analysis, stochastic gradient ascent (SGA), generalized Hebbian algorithm (GHA),
orthogonal complement.
1 INTRODUCTION
A class of image analysis techniques called appearance-based
approach has now become very popular. A major reason that leads
to its popularity is the use of statistics tools to automatically derive
features instead of relying on humans to define features. Although
principal component analysis is a well-known technique, Sirovich
and Kirby [1] appear to be among the first who used the technique
directly on the characterization of human faces—each image is
considered simply as a high-dimensional vector, each pixel
corresponding to a component. Turk and Pentland [2] were among
the first who used this representation for face recognition. The
technique has been extended to 3D object recognition [3], sign
recognition [4], and autonomous navigation [5] among many other
image analysis problems.
A well-known computational approach to PCA involves solving
an eigensystem problem, i.e., computing the eigenvectors and
eigenvalues of the sample covariance matrix, using a numerical
method such as the power method and the QR method [6]. This
approach requires that all the training images be available before the
principal components can be estimated. This is called a batch
method. This type of method no longer satisfies an upcoming new
trend of computer vision research [7] in which all visual filters are
incrementally derived from very long online real-time video stream,
motivated by the development of animal vision systems. Online
development of visual filters requires that the system perform while
new sensory signals flow in. Further, when the dimension of the
image is high, both the computation and storage complexity grow
dramatically. For example, in the eigenface method, a moderate
gray image of 64 rows and 88 columns results in a d-dimensional
vector with d = 5,632. The symmetric covariance matrix requires d(d+1)/2 elements, which amounts to 15,862,528 entries! A clever
saving method can be used when the number of images is smaller
than the number of pixels in the image [1]. However, an online
developing system must observe an open number of images and the
number is larger than the dimension of the observed vectors. Thus,
an incremental method is required to compute the principal
components for observations arriving sequentially, where the
estimate of principal components are updated by each arriving
observation vector. No covariance matrix is allowed to be estimated
as an intermediate result. There is evidence that biological neural
networks use an incremental method to perform various learning,
e.g., Hebbian learning [8].
Several IPCA techniques have been proposed to compute
principal components without the covariance matrix [9], [10], [11].
However, they ran into convergence problems when facing high-
dimensional image vectors. We explain in this article why. We
propose a new method, candid covariance-free IPCA (CCIPCA),
based on the work of Oja and Karhunen [10] and Sanger [11]. It is
motivated by a well-known statistical concept called efficient
estimate. An amnesic average technique is also used to dynamically
determine the retaining rate of the old and new data, instead of a
fixed learning rate.
2 DERIVATION OF THE ALGORITHM
2.1 The First Eigenvector
Suppose that sample vectors are acquired sequentially, u(1), u(2), ..., possibly infinite. Each u(n), n = 1, 2, ..., is a d-dimensional vector and d can be as large as 5,000 and beyond. Without loss of generality, we can assume that u(n) has a zero mean (the mean may be incrementally estimated and subtracted out). A = E{u(n) u^T(n)} is the d × d covariance matrix, which is neither known nor allowed to be estimated as an intermediate result.
By definition, an eigenvector x of matrix A satisfies

\lambda x = A x,    (1)

where \lambda is the corresponding eigenvalue. By replacing the unknown A with the sample covariance matrix and replacing the x of (1) with its estimate x(i) at each time step i, we obtain an illuminating expression for v = \lambda x:

v(n) = \frac{1}{n} \sum_{i=1}^{n} u(i) u^T(i) x(i),    (2)

where v(n) is the nth step estimate of v. As we will see soon, this equation is motivated by statistical efficiency. Once we have the estimate of v, it is easy to get the eigenvector and the eigenvalue since \lambda = ||v|| and x = v/||v||.
Now, the question is how to estimate x(i) in (2). Considering x = v/||v||, we may choose x(i) as v(i-1)/||v(i-1)||, which leads to the following incremental expression:

v(n) = \frac{1}{n} \sum_{i=1}^{n} u(i) u^T(i) \frac{v(i-1)}{\|v(i-1)\|}.    (3)
To begin with, we set v(0) = u(1), the first direction of data spread. For incremental estimation, (3) is written in a recursive form,

v(n) = \frac{n-1}{n} v(n-1) + \frac{1}{n} u(n) u^T(n) \frac{v(n-1)}{\|v(n-1)\|},    (4)

where (n-1)/n is the weight for the last estimate and 1/n is the weight for the new data. We have proven that, with the algorithm given by (4), v_1(n) \to \lambda_1 e_1 when n \to \infty, where \lambda_1 is the largest eigenvalue of the covariance matrix of {u(n)} and e_1 is the corresponding eigenvector [12].
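As a quick sanity check (our illustration, not the paper's) that (4) is nothing more than the recursive form of the average in (3), the sketch below runs both on the same random stream and confirms they give the same v(n).

import numpy as np

rng = np.random.default_rng(1)
U = rng.standard_normal((500, 20))          # a stream of 500 zero-mean 20-D samples

# Recursive form (4), keeping every intermediate v(n).
v = U[0].copy()                             # v(0) = u(1), the first direction of data spread
history = [v.copy()]
for n, u in enumerate(U, start=1):
    d = history[-1] / np.linalg.norm(history[-1])      # v(n-1)/||v(n-1)||
    v = (n - 1) / n * v + (1 / n) * (u @ d) * u
    history.append(v.copy())

# Batch form (3), using the same x(i) = v(i-1)/||v(i-1)||.
terms = [(u @ (history[i] / np.linalg.norm(history[i]))) * u for i, u in enumerate(U)]
print(np.allclose(v, np.mean(terms, axis=0)))           # True: (4) reproduces (3)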
The authors are with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824. E-mail: {weng, zhangyil, hwangwey}@cse.msu.edu. Manuscript received 20 Feb. 2002; revised 4 Oct. 2002; accepted 28 Oct. 2002. Recommended for acceptance by R. Beveridge.

Fig. 1. Intuitive explanation of the incremental PCA algorithm.
Fig. 2. The correctness, or the correlation, represented by dot products, of the first 10 eigenvectors computed by (a) SGA, (b) GHA, and (c) the proposed CCIPCA with the amnesic parameter l = 2.

The derivation of (2), (3), and (4) is motivated by statistical efficiency. An unbiased estimate \hat{Q} of the parameter Q is said to be the efficient estimate for the class D of distribution functions if, for every distribution density function f(u, Q) of D, the variance D^2(\hat{Q}) (squared error) has reached the minimal value given by

D^2(\hat{Q}) = E[(\hat{Q} - Q)^2] \ge \frac{1}{n \int_{-\infty}^{+\infty} \left[ \frac{\partial \log f(u, Q)}{\partial Q} \right]^2 f(u, Q) \, du}.    (5)
The right side of (5) is called the Cramér-Rao bound. It says that the efficient estimate is one that has the least variance from the real parameter, and its variance is bounded below by the Cramér-Rao bound. For example, the sample mean \bar{w} = \frac{1}{n} \sum_{i=1}^{n} w(i) is the efficient estimate of the mean of a Gaussian distribution with a known standard deviation [13]. For a vector version of the Cramér-Rao bound, the reader is referred to [14, pp. 203-204].
If we define w(i) = u(i) u^T(i) x(i), then v(n) in (2) can be viewed as the mean of "samples" w(i). That is exactly why our method is motivated by statistical efficiency in using averaging in (2). In other words, statistically, the method tends to converge most quickly, or the estimate has the smallest error variance, given the currently observed samples. Of course, w(i) is not necessarily drawn from a Gaussian distribution independently and, thus, the estimate using the sample mean in (4) is not strictly efficient. However, the estimate v(n) still has a high statistical efficiency and a fairly low error variance, as we will show experimentally.
The Cramér-Rao lower error bound in (5) can also be used to estimate the error variance or, equivalently, the convergence rate, using a Gaussian distribution model, as proposed and experimented with by Weng et al. [14, Section 4.6]. This is a reasonable estimate because of our near-optimal statistical efficiency here. Weng et al. [14] demonstrated that the actual error variance is not very sensitive to the distribution (e.g., uniform or Gaussian distributions). This error estimator is especially useful to estimate roughly how many samples are needed for a given tolerable error variance.
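For the scalar Gaussian example above, the bound in (5) works out to sigma^2/n, and the sample mean attains it; a small Monte Carlo check (ours), with assumed values sigma = 3 and n = 50:

import numpy as np

rng = np.random.default_rng(2)
sigma, n, trials = 3.0, 50, 20000

sample_means = rng.normal(0.0, sigma, size=(trials, n)).mean(axis=1)
print(round(sample_means.var(), 4))   # empirical variance of the sample mean, ~0.18
print(sigma**2 / n)                   # Cramer-Rao bound for the mean with known sigma: 0.18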
IPCA algorithms have been studied by several researchers [15], [16], [9], [10]. An early work with a rigorous proof of convergence was given by Oja [9] and Oja and Karhunen [10], where they introduced their stochastic gradient ascent (SGA) algorithm. SGA computes

\tilde{v}_i(n) = v_i(n-1) + \gamma_i(n) u(n) u^T(n) v_i(n-1),    (6)

v_i(n) = \text{orthonormalize } \tilde{v}_i(n) \text{ w.r.t. } v_j(n), \quad j = 1, 2, ..., i-1,    (7)

where v_i(n) is the estimate of the ith dominant eigenvector of the sample covariance matrix A = E{u(n) u^T(n)} and \tilde{v}_i(n) is the new estimate. In practice, the orthonormalization in (7) can be done by a standard Gram-Schmidt orthonormalization (GSO) procedure. The parameter \gamma_i(n) is a stochastic approximation gain. The convergence of SGA has been proven under some assumptions on A and \gamma_i(n) [10].
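For contrast with (4), here is a minimal sketch of one SGA step, eqs. (6)-(7); the schedule gamma_i(n) = c/n used here is only a rough stand-in for the tuned gains discussed in the next paragraph, and the Gram-Schmidt loop implements the orthonormalization in (7).

import numpy as np

def sga_step(V, u, n, c=1.0):
    # V: (k, d) array whose rows are the current unit-norm estimates v_i(n-1);
    # u: the new zero-mean sample; gamma_i(n) = c/n is an assumed schedule.
    gamma = c / n
    V = V + gamma * np.outer(V @ u, u)        # eq. (6): v_i += gamma * (v_i . u) * u
    for i in range(V.shape[0]):               # eq. (7): Gram-Schmidt orthonormalization (GSO)
        for j in range(i):
            V[i] -= (V[i] @ V[j]) * V[j]
        V[i] /= np.linalg.norm(V[i])
    return V

Because each row is renormalized to unit length, SGA tracks only the eigenvector directions; the eigenvalues have to be estimated separately, unlike v(n) in (4), which converges to \lambda e.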
SGA is essentially a gradient method associated with the problem of choosing \gamma_i(n), the learning rate. Simply speaking, the learning rate should be appropriate so that the second term (the correction term) on the right side of (6) is comparable to the first term, neither too large nor too small. In practice, \gamma_i(n) depends very much on the nature of the data and usually requires a trial-and-error procedure, which is impractical for online applications. Oja gave some suggestions on \gamma_i(n) in [9], which is typically 1/n multiplied by some constants. However, procedure (6) is at the mercy of the magnitude of the observation u(n), where the first term has a unit norm, but the second can take any magnitude. If u(n) has a very small magnitude, the second term will be too small to make any changes in the new estimate. If u(n) has a large magnitude, which is the case with high-dimensional images, the second term will dominate the right side before a very large number n (and, hence, a small \gamma_i(n)) has been reached. In either case, the updating is inefficient and the convergence will be slow.
Contrasted with SGA, the first term on the right side of (4) is not normalized. In effect, v(n) in (4) converges to \lambda e instead of e as it does in (6), where \lambda is the eigenvalue and e is the eigenvector. In (4), the statistical efficiency is realized by keeping the scale of the estimate at the same order as the new observations (the first and second terms are properly weighted on the right side of (4) to get the sample mean), which allows full use of every observation in terms of statistical efficiency. Note that the coefficient (n-1)/n in (4) is as important as the "learning rate" 1/n in the second term to realize the sample mean. Although (n-1)/n is close to 1 when n is large, it is very important for fast convergence with early samples. The point is that, if the estimate does not converge well at the beginning, it is harder to pull it back later when n is large. Thus, one does not need to worry about the nature of the observations. This is also the reason that we used "candid" in naming the new algorithm.
It is true that the series of parameters \gamma_i(n), i = 1, 2, ..., k, in SGA can be manually tuned in an offline application so that it takes into account the magnitude of u(n). But a predefined \gamma_i(n) cannot accomplish statistical efficiency no matter how \gamma_i(n) is tuned. This is true because all the "observations," i.e., the last term in (4) and (6), contribute to the estimate in (4) with the same weight for statistical efficiency, but they contribute unequally in (6) due to the normalization of v(n-1) in the first term and, thus, damage the efficiency. Further, manual tuning is not suited for an online learning algorithm since the user cannot predict signals in advance. An online algorithm must automatically compute data-sensitive parameters.
There is a further improvement to procedure (4). In (4), all the "samples"

w(i) = u(i) u^T(i) \frac{v(i-1)}{\|v(i-1)\|}

are weighted equally. However, since w(i) is generated by v(i-1), and v(i-1) is far away from its real value at an early estimation stage, w(i) is a "sample" with large "noise" when i is small. To speed up the convergence of the estimation, it is preferable to give smaller weight to these early "samples." A way to implement this idea is to use an amnesic average by changing (4) into

v(n) = \frac{n-1-l}{n} v(n-1) + \frac{1+l}{n} u(n) u^T(n) \frac{v(n-1)}{\|v(n-1)\|},    (8)

where the positive parameter l is called the amnesic parameter. Note that the two modified weights still sum to 1. With the presence of l, larger weight is given to new "samples" and the effect of old "samples" will fade out gradually. Typically, l ranges from 2 to 4.

Fig. 3. The correctness of the eigenvalue, ||v_i|| / \lambda_i, by CCIPCA.
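A minimal sketch of the amnesic update (8) for a single component, assuming n > l + 1 so that both weights are positive (how the very first samples are treated is an implementation choice this excerpt does not spell out; the full-procedure sketch earlier simply clamps the old weight at zero).

import numpy as np

def amnesic_update(v, u, n, l=2.0):
    # Eq. (8): amnesic mean update of the unnormalized estimate v of lambda_1 * e_1.
    w_old = (n - 1.0 - l) / n
    w_new = (1.0 + l) / n                 # the two weights still sum to 1 for any l
    d = v / np.linalg.norm(v)
    return w_old * v + w_new * (u @ d) * u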
2.2 Intuitive Explanation
An intuitive explanation of procedure (4) is as follows: Consider a set of two-dimensional data with a Gaussian probability distribution function (for any other physically arising distribution, we can consider its first two orders of statistics since PCA does so). The data is characterized by an ellipse, as shown in Fig. 1. According to the geometrical meaning of eigenvectors, we know that the first eigenvector is aligned with the long axis (v_1) of the ellipse. Suppose v_1(n-1) is the (n-1)th-step estimate of the first eigenvector.
Noticing that u^T(n) v_1(n-1)/||v_1(n-1)|| is a scalar, we know that (1/n) u(n) u^T(n) v_1(n-1)/||v_1(n-1)|| is essentially a scaled vector of u(n). According to (4), v_1(n) is a weighted combination of the last estimate, v_1(n-1), and the scaled vector of u(n). Therefore, geometrically speaking, v_1(n) is obtained by pulling v_1(n-1) toward u(n) by a small amount.
A line l_2 orthogonal to v_1(n-1) divides the whole plane into two halves, the upper and the lower one. Because every point u_l in the lower half plane has an obtuse angle with v_1(n-1), u_l^T v_1(n-1)/||v_1(n-1)|| is a negative scalar. So, for u_l, (4) may be written as

v(n) = \frac{n-1}{n} v(n-1) + \frac{1}{n} \left( u_l^T \frac{v(n-1)}{\|v(n-1)\|} \right) (-u_l),

where -u_l is an upper half plane point obtained by rotating u_l by 180 degrees w.r.t. the origin. Since the ellipse is centrally symmetric, we may rotate all the lower half plane points to the upper half plane and only consider the pulling effect of upper half plane points. For the points u_u in the upper half plane, the net pulling force will pull v_1(n-1) toward the direction of v_1 since there are more data points to the right side of v_1(n-1) than to the left side. As long as the first two eigenvalues are different, this pulling force always exists and the pulling direction is toward the eigenvector corresponding to the larger eigenvalue. v_1(n-1) will not stop moving until it is aligned with v_1, when the pulling forces from both sides are balanced. In other words, v_1(n) in (4) will converge to the first eigenvector. As we can imagine, the larger the ratio of the first eigenvalue over the second eigenvalue, the more unbalanced the force is and the faster the pulling, or the convergence, will be. However, when \lambda_1 = \lambda_2, the ellipse degenerates to a circle. The movement will not stop, which seems to suggest that the algorithm does not converge. Actually, since any vector in that circle can represent the eigenvector, it does not hurt not to converge. We will get back to the cases of equal eigenvalues in Section 2.4.
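The pulling argument can be watched numerically; the sketch below (ours, with an assumed covariance diag(9, 1) so that the true first eigenvector is (1, 0)) runs update (4) on 2-D Gaussian data and prints the angle between the running estimate and the long axis, which shrinks toward zero.

import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[9.0, 0.0],
                  [0.0, 1.0]])            # elongated ellipse, long axis along e1 = (1, 0)
U = rng.multivariate_normal([0.0, 0.0], Sigma, size=5000)

e1 = np.array([1.0, 0.0])
v = U[0].copy()                           # v(0) = u(1)
for n, u in enumerate(U, start=1):
    d = v / np.linalg.norm(v)
    v = (n - 1) / n * v + (1 / n) * (u @ d) * u
    if n in (10, 100, 1000, 5000):
        angle = np.degrees(np.arccos(min(1.0, abs(d @ e1))))
        print(n, round(angle, 2))         # degrees between the estimate and the true first eigenvector

As predicted by v(n) -> \lambda_1 e_1, np.linalg.norm(v) also approaches the first eigenvalue, 9.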
2.3 Higher-Order Eigenvectors
Procedure (4) only estimates the first dominant eigenvector. One way to compute the other higher-order eigenvectors is to follow what SGA does: start with a set of orthonormalized vectors, update them using the suggested iteration step, and recover the orthogonality using GSO. For real-time online computation, we need to avoid the time-consuming GSO. Further, breaking-then-recovering orthogonality slows down the convergence compared with keeping orthogonality all along. We know eigenvectors are orthogonal to each other. So, it helps to generate "observations" only in a complementary space for the computation of the higher-order eigenvectors. For example, to compute the second-order eigenvector, we first subtract from the data its projection on the estimated first-order eigenvector v_1(n), as shown in (9),

u_2(n) = u_1(n) - \left( u_1^T(n) \frac{v_1(n)}{\|v_1(n)\|} \right) \frac{v_1(n)}{\|v_1(n)\|},    (9)

where u_1(n) = u(n). The obtained residual, u_2(n), which is in the complementary space of v_1(n), serves as the input data to the iteration step. In this way, the orthogonality is always enforced when convergence is reached, although not exactly so at early stages. This, in effect, makes better use of the samples available and, thus, speeds up the convergence.
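A one-function sketch of the residual step (9); the name residual is ours.

import numpy as np

def residual(u_i, v_i):
    # Eq. (9): remove from u_i its projection on the current estimate v_i,
    # so the next component is estimated only in the complementary space.
    d = v_i / np.linalg.norm(v_i)
    return u_i - (u_i @ d) * d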
A similar idea has been used by some other researchers. Kreyszig proposed an algorithm which finds the first eigenvector using a method equivalent to SGA and subtracts the first component from the samples before computing the next component [17]. Sanger suggested an algorithm, called the generalized Hebbian algorithm (GHA), based on the same idea, except that all the components are computed at the same time [11]. However, in either case, the statistical efficiency was not considered.

Fig. 4. The absolute values of the first 10 eigenvalues.
Fig. 5. The effect of the amnesic parameter: the correctness of the first 10 eigenvectors computed by CCIPCA, with the amnesic parameter l = 0. A comparison with Fig. 2c.
The new CCIPCA also saves computations. One may notice that the expensive steps in both SGA and CCIPCA are the dot products in the high-dimensional data space. CCIPCA requires one extra dot product, i.e., u_i^T(n) v_i(n) in (9), for each principal component in each estimation step. For SGA, to do orthonormalization over k new estimates of eigenvectors using GSO, we have a total of k(k+1)/2 dot products. So, the average number of dot products saved by CCIPCA over SGA for each eigenvector is (k-1)/2.
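For example, with k = 10 eigenvectors, GSO costs k(k+1)/2 = 55 dot products per step, CCIPCA's extra cost is one per component, i.e., 10, and the saving per eigenvector is (55 - 10)/10 = 4.5 = (k - 1)/2; a two-line check:

k = 10
gso, extra = k * (k + 1) // 2, k
print(gso - extra, (gso - extra) / k)    # 45 dot products saved per step, 4.5 per eigenvector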
2.4 Equal Eigenvalues
Let us consider the case where there are equal eigenvalues. Suppose the ordered eigenvalues between \lambda_l and \lambda_m are equal:

\lambda_{l-1} > \lambda_l = \lambda_{l+1} = ... = \lambda_m > \lambda_{m+1}.
According to the explanation in Section 2.2, the vector estimate will converge to the one with a larger eigenvalue first. Therefore, the estimates of the eigenvectors e_i, where i < l, will not be affected. The vector estimates of e_l to e_m will converge into the subspace spanned by the corresponding eigenvectors. Since their eigenvalues are equal, the shape of the distribution in Fig. 1 is a hypersphere within the subspace. Thus, the estimates of the multiple eigenvectors will converge to some orthogonal basis of that subspace. Where they converge to depends mainly on the early samples, because of the averaging effect in (2), where the contribution of new data gets infinitely small as n increases without bound.
Fig. 6. A longer data stream. The correctness of the first 10 eigenvectors computed by (a) SGA, (b) GHA, and (c) CCIPCA (with the amnesic parameter l = 2), respectively, over 20 epochs.


