
Candid covariance-free incremental principal component analysis

01 Aug 2003 - IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE Computer Society) - Vol. 25, Iss. 8, pp. 1034-1040
TL;DR: A fast incremental principal component analysis (IPCA) algorithm, called candid covariance-free IPCA (CCIPCA), computes the principal components of a sequence of samples incrementally without estimating the covariance matrix (so covariance-free).
Abstract: Appearance-based image analysis techniques require fast computation of principal components of high-dimensional image vectors. We introduce a fast incremental principal component analysis (IPCA) algorithm, called candid covariance-free IPCA (CCIPCA), used to compute the principal components of a sequence of samples incrementally without estimating the covariance matrix (so covariance-free). The new method is motivated by the concept of statistical efficiency (the estimate has the smallest variance given the observed data). To do this, it keeps the scale of observations and computes the mean of observations incrementally, which is an efficient estimate for some well known distributions (e.g., Gaussian), although the highest possible efficiency is not guaranteed in our case because of unknown sample distribution. The method is for real-time applications and, thus, it does not allow iterations. It converges very fast for high-dimensional image vectors. Some links between IPCA and the development of the cerebral cortex are also discussed.

Summary

1 INTRODUCTION

  • A class of image analysis techniques called appearance-based approach has now become very popular.
  • Further, when the dimension of the image is high, both the computation and storage complexity grow dramatically.
  • Several IPCA techniques have been proposed to compute principal components without the covariance matrix [9], [10], [11].
  • An amnesic average technique is also used to dynamically determine the retaining rate of the old and new data, instead of a fixed learning rate.

2.1 The First Eigenvector

  • Suppose that sample vectors are acquired sequentially, u(1), u(2), ..., possibly infinite.
  • Now, the question is how to estimate x(i) in (2).
  • Procedure (6) is at the mercy of the magnitude of the observation u(n), where the first term has a unit norm, but the second can take any magnitude.
  • In (4), the statistical efficiency is realized by keeping the scale of the estimate at the same order as the new observations (the first and second terms are properly weighted on the right side of (4) to get the sample mean), which allows full use of every observation in terms of statistical efficiency.

2.2 Intuitive Explanation

  • An intuitive explanation of procedure (4) is as follows: Consider a set of two-dimensional data with a Gaussian probability distribution function (for any other physically arising distribution, the authors can consider its first two orders of statistics since PCA does so).
  • Noticing that u^T(n) v_1(n-1)/||v_1(n-1)|| is a scalar, the authors know that (1/n) u(n) u^T(n) v_1(n-1)/||v_1(n-1)|| is essentially a scaled vector of u(n).
  • For the points u_u in the upper half plane, the net pulling force will pull v_1(n-1) toward the direction of v_1, since there are more data points to the right side of v_1(n-1) than to the left side.
  • v_1(n-1) will not stop moving until it is aligned with v_1, when the pulling forces from both sides are balanced.
  • In other words, v_1(n) in (4) will converge to the first eigenvector.

2.3 Higher-Order Eigenvectors

  • Procedure (4) only estimates the first dominant eigenvector.
  • To compute the second-order eigenvector, the authors first subtract from the data its projection on the estimated first-order eigenvector v_1(n), as shown in (9): u_2(n) = u_1(n) - (u_1^T(n) v_1(n)/||v_1(n)||) v_1(n)/||v_1(n)||, where u_1(n) = u(n).
  • The orthogonality is always enforced when the convergence is reached, although not exactly so at early stages.
  • In either case, the statistical efficiency was not considered.
  • One may notice that the expensive steps in both SGA and CCIPCA are the dot products in the high-dimensional data space.

2.4 Equal Eigenvalues

  • Let us consider the case where there are equal eigenvalues.
  • Therefore, the estimate of eigenvectors ei, where i < l, will not be affected anyway.
  • Since their eigenvalues are equal, the shape of the distribution in Fig. 1 is a hypersphere within the subspace.
  • Thus, the estimates of the multiple eigenvectors will converge to any set of the orthogonal basis of that subspace.
  • Where it converges to depends mainly on the early samples because of the averaging effect in (2), where the contribution of new data gets infinitely small when n increases without a bound.

2.5 Algorithm Summary

  • Combining the mechanisms discussed above, the authors obtain the candid covariance-free IPCA algorithm (Procedure 1): at step n, for each component i = 1, ..., min(k, n), with u_1(n) = u(n), (a) if i = n, initialize v_i(n) = u_i(n); (b) otherwise, update
    v_i(n) = \frac{n-1-l}{n} v_i(n-1) + \frac{1+l}{n} u_i(n) u_i^T(n) \frac{v_i(n-1)}{\|v_i(n-1)\|},    (10)
    u_{i+1}(n) = u_i(n) - \left( u_i^T(n) \frac{v_i(n)}{\|v_i(n)\|} \right) \frac{v_i(n)}{\|v_i(n)\|}.    (11)
  • A mathematical proof of the convergence of CCIPCA can be found in [12]. A runnable sketch of the full procedure is given below.
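The following NumPy sketch is one reading of Procedure 1, not the authors' code: the function name ccipca, the default of k = 10 components, and the clamping of the old-data weight at zero for the very first samples (where (n - 1 - l)/n would be negative) are assumptions made here for illustration.

import numpy as np

def ccipca(samples, k=10, amnesic=2.0):
    # One pass of candid covariance-free IPCA over an iterable of zero-mean
    # d-dimensional vectors u(n).  Row i of the returned array estimates
    # lambda_i * e_i, so ||V[i]|| is the eigenvalue and V[i]/||V[i]|| the eigenvector.
    V = None
    for n, u in enumerate(samples, start=1):
        u = np.asarray(u, dtype=float).copy()
        if V is None:
            V = np.zeros((k, u.size))
        for i in range(min(k, n)):
            if i == n - 1:
                # Step (a): initialize the i-th estimate with the current residual.
                V[i] = u
            else:
                # Step (b), eq. (10): amnesic mean update of v_i.
                w_old = max((n - 1.0 - amnesic) / n, 0.0)   # clamp: our choice, not the paper's
                w_new = (1.0 + amnesic) / n
                d = V[i] / np.linalg.norm(V[i])
                V[i] = w_old * V[i] + w_new * (u @ d) * u
                # Eq. (11): deflate u so the next component sees only the residual.
                d = V[i] / np.linalg.norm(V[i])
                u = u - (u @ d) * d
    return V

After one pass, np.linalg.norm(V[i]) plays the role of the ith eigenvalue and V[i] / np.linalg.norm(V[i]) of the ith eigenvector, matching lambda = ||v|| and x = v/||v|| in Section 2.1.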

3 EMPIRICAL RESULTS ON CONVERGENCE

  • The authors performed experiments to study the statistical efficiency of the new algorithm as well as the existing IPCA algorithms, especially for high-dimensional data such as images.
  • In contrast, the proposed CCIPCA converges fast.
  • Shown in Fig. 7 are the first 10 eigenfaces estimated by batch PCA and CCIPCA (with the amnesic parameter l = 2) after one epoch and 20 epochs, respectively.
  • For the general readership, an experiment was done on a lower dimension data set.
  • The authors extracted 10 x 10 pixel subimages around the right eye area in each image of the FERET data set, estimated their sample covariance matrix Σ, and used MATLAB to generate 1,000 samples with the Gaussian distribution N(0, Σ) in the 100-dimensional space (see the sketch below).
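A rough sketch of that lower-dimensional experiment, reusing the ccipca helper sketched earlier; since the FERET eye-region covariance is not available here, a synthetic covariance with a decaying spectrum stands in for Σ, and correctness is measured, as in the paper's figures, by the dot product between each CCIPCA direction and the corresponding batch-PCA eigenvector.

import numpy as np

rng = np.random.default_rng(0)
d, n_samples, k = 100, 1000, 10

# Stand-in for the estimated eye-region covariance: random orthogonal basis, 1/i spectrum.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma = Q @ np.diag(1.0 / np.arange(1, d + 1)) @ Q.T

X = rng.multivariate_normal(np.zeros(d), Sigma, size=n_samples)    # ~ N(0, Sigma)

V = ccipca(X, k=k, amnesic=2.0)                                    # incremental estimates
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)  # batch PCA directions

for i in range(k):
    v_hat = V[i] / np.linalg.norm(V[i])
    print(i + 1, round(abs(v_hat @ Vt[i]), 3))     # 1.0 means the directions agree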

4 CONCLUSIONS AND DISCUSSIONS

  • This short paper concentrates on a challenging issue of computing dominating eigenvectors and eigenvalues from an incrementally arriving high-dimensional data stream without computing the corresponding covariance matrix and without knowing data in advance.
  • An amnesic average technique is implemented to further improve the convergence rate.
  • The importance of the result presented here potentially goes beyond the apparent technical scope of interest to the computer vision community.
  • As discussed in [7], what a human brain does is not just computing—processing data—but, more importantly and more fundamentally, developing the computing engine itself, from real-world, online sensory data streams.
  • The link between incremental PCA and the developmental mechanisms of the brain is probably more intimate than one can fully appreciate now.


Candid Covariance-Free Incremental
Principal Component Analysis
Juyang Weng, Member, IEEE,
Yilu Zhang, Student Member, IEEE, and
Wey-Shiuan Hwang, Member, IEEE
Abstract—Appearance-based image analysis techniques require fast
computation of principal components of high-dimensional image vectors. We
introduce a fast incremental principal component analysis (IPCA) algorithm, called
candid covariance-free IPCA (CCIPCA), used to compute the principal
components of a sequence of samples incrementally without estimating the
covariance matrix (so covariance-free). The new method is motivated by the
concept of statistical efficiency (the estimate has the smallest variance given the
observed data). To do this, it keeps the scale of observations and computes the
mean of observations incrementally, which is an efficient estimate for some well-
known distributions (e.g., Gaussian), although the highest possible efficiency is not
guaranteed in our case because of unknown sample distribution. The method is for
real-time applications and, thus, it does not allow iterations. It converges very fast
for high-dimensional image vectors. Some links between IPCA and the
development of the cerebral cortex are also discussed.
Index Terms—Principal component analysis, incremental principal component
analysis, stochastic gradient ascent (SGA), generalized Hebbian algorithm (GHA),
orthogonal complement.
1 INTRODUCTION
A class of image analysis techniques called appearance-based
approach has now become very popular. A major reason that leads
to its popularity is the use of statistics tools to automatically derive
features instead of relying on humans to define features. Although
principal component analysis is a well-known technique, Sirovich
and Kirby [1] appear to be among the first who used the technique
directly on the characterization of human faces—each image is
considered simply as a high-dimensional vector, each pixel
corresponding to a component. Turk and Pentland [2] were among
the first who used this representation for face recognition. The
technique has been extended to 3D object recognition [3], sign
recognition [4], and autonomous navigation [5] among many other
image analysis problems.
A well-known computational approach to PCA involves solving
an eigensystem problem, i.e., computing the eigenvectors and
eigenvalues of the sample covariance matrix, using a numerical
method such as the power method and the QR method [6]. This
approach requires that all the training images be available before the
principal components can be estimated. This is called a batch
method. This type of method no longer satisfies an upcoming new
trend of computer vision research [7] in which all visual filters are
incrementally derived from very long online real-time video stream,
motivated by the development of animal vision systems. Online
development of visual filters requires that the system perform while
new sensory signals flow in. Further, when the dimension of the
image is high, both the computation and storage complexity grow
dramatically. For example, in the eigenface method, a moderate
gray image of 64 rows and 88 columns results in a d-dimensional
vector with d = 5,632. The symmetric covariance matrix requires d(d+1)/2 elements, which amounts to 15,862,528 entries! A clever
saving method can be used when the number of images is smaller
than the number of pixels in the image [1]. However, an online
developing system must observe an open number of images and the
number is larger than the dimension of the observed vectors. Thus,
an incremental method is required to compute the principal
components for observations arriving sequentially, where the
estimate of principal components are updated by each arriving
observation vector. No covariance matrix is allowed to be estimated
as an intermediate result. There is evidence that biological neural
networks use an incremental method to perform various learning,
e.g., Hebbian learning [8].
Several IPCA techniques have been proposed to compute
principal components without the covariance matrix [9], [10], [11].
However, they ran into convergence problems when facing high-
dimensional image vectors. We explain in this article why. We
propose a new method, candid covariance-free IPCA (CCIPCA),
based on the work of Oja and Karhunen [10] and Sanger [11]. It is
motivated by a well-known statistical concept called efficient
estimate. An amnesic average technique is also used to dynamically
determine the retaining rate of the old and new data, instead of a
fixed learning rate.
2 DERIVATION OF THE ALGORITHM
2.1 The First Eigenvector
Suppose that sample vectors are acquired sequentially, u(1), u(2), ..., possibly infinite. Each u(n), n = 1, 2, ..., is a d-dimensional vector and d can be as large as 5,000 and beyond. Without loss of generality, we can assume that u(n) has a zero mean (the mean may be incrementally estimated and subtracted out). A = E{u(n) u^T(n)} is the d × d covariance matrix, which is neither known nor allowed to be estimated as an intermediate result.
By definition, an eigenvector x of matrix A satisfies

\lambda x = A x,    (1)

where \lambda is the corresponding eigenvalue. By replacing the unknown A with the sample covariance matrix and replacing the x of (1) with its estimate x(i) at each time step i, we obtain an illuminating expression for v = \lambda x:

v(n) = \frac{1}{n} \sum_{i=1}^{n} u(i) u^T(i) x(i),    (2)

where v(n) is the nth step estimate of v. As we will see soon, this equation is motivated by statistical efficiency. Once we have the estimate of v, it is easy to get the eigenvector and the eigenvalue since \lambda = ||v|| and x = v/||v||.
Now, the question is how to estimate x(i) in (2). Considering x = v/||v||, we may choose x(i) as v(i-1)/||v(i-1)||, which leads to the following incremental expression:

v(n) = \frac{1}{n} \sum_{i=1}^{n} u(i) u^T(i) \frac{v(i-1)}{\|v(i-1)\|}.    (3)
To begin with, we set v(0) = u(1), the first direction of data spread. For incremental estimation, (3) is written in a recursive form,

v(n) = \frac{n-1}{n} v(n-1) + \frac{1}{n} u(n) u^T(n) \frac{v(n-1)}{\|v(n-1)\|},    (4)

where (n-1)/n is the weight for the last estimate and 1/n is the weight for the new data. We have proven that, with the algorithm given by (4), v_1(n) \to \lambda_1 e_1 when n \to \infty, where \lambda_1 is the largest eigenvalue of the covariance matrix of {u(n)} and e_1 is the corresponding eigenvector [12].
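As a quick sanity check (our illustration, not the paper's) that (4) is nothing more than the recursive form of the average in (3), the sketch below runs both on the same random stream and confirms they give the same v(n).

import numpy as np

rng = np.random.default_rng(1)
U = rng.standard_normal((500, 20))          # a stream of 500 zero-mean 20-D samples

# Recursive form (4), keeping every intermediate v(n).
v = U[0].copy()                             # v(0) = u(1), the first direction of data spread
history = [v.copy()]
for n, u in enumerate(U, start=1):
    d = history[-1] / np.linalg.norm(history[-1])      # v(n-1)/||v(n-1)||
    v = (n - 1) / n * v + (1 / n) * (u @ d) * u
    history.append(v.copy())

# Batch form (3), using the same x(i) = v(i-1)/||v(i-1)||.
terms = [(u @ (history[i] / np.linalg.norm(history[i]))) * u for i, u in enumerate(U)]
print(np.allclose(v, np.mean(terms, axis=0)))           # True: (4) reproduces (3)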
The authors are with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824. E-mail: {weng, zhangyil, hwangwey}@cse.msu.edu. Manuscript received 20 Feb. 2002; revised 4 Oct. 2002; accepted 28 Oct. 2002. Recommended for acceptance by R. Beveridge.

Fig. 1. Intuitive explanation of the incremental PCA algorithm.
Fig. 2. The correctness, or the correlation, represented by dot products, of the first 10 eigenvectors computed by (a) SGA, (b) GHA, and (c) the proposed CCIPCA with the amnesic parameter l = 2.

The derivation of (2), (3), and (4) is motivated by statistical efficiency. An unbiased estimate \hat{Q} of the parameter Q is said to be the efficient estimate for the class D of distribution functions if, for every distribution density function f(u, Q) of D, the variance D^2(\hat{Q}) (squared error) has reached the minimal value given by

D^2(\hat{Q}) = E[(\hat{Q} - Q)^2] \ge \frac{1}{n \int_{-\infty}^{+\infty} \left[ \frac{\partial \log f(u, Q)}{\partial Q} \right]^2 f(u, Q) \, du}.    (5)
The right side of (5) is called the Cramér-Rao bound. It says that the efficient estimate is one that has the least variance from the real parameter, and its variance is bounded below by the Cramér-Rao bound. For example, the sample mean \bar{w} = \frac{1}{n} \sum_{i=1}^{n} w(i) is the efficient estimate of the mean of a Gaussian distribution with a known standard deviation [13]. For a vector version of the Cramér-Rao bound, the reader is referred to [14, pp. 203-204].
If we define w(i) = u(i) u^T(i) x(i), then v(n) in (2) can be viewed as the mean of "samples" w(i). That is exactly why our method is motivated by statistical efficiency in using averaging in (2). In other words, statistically, the method tends to converge most quickly, or the estimate has the smallest error variance, given the currently observed samples. Of course, w(i) is not necessarily drawn from a Gaussian distribution independently and, thus, the estimate using the sample mean in (4) is not strictly efficient. However, the estimate v(n) still has a high statistical efficiency and a fairly low error variance, as we will show experimentally.
The Cramér-Rao lower error bound in (5) can also be used to estimate the error variance or, equivalently, the convergence rate, using a Gaussian distribution model, as proposed and experimented with by Weng et al. [14, Section 4.6]. This is a reasonable estimate because of our near-optimal statistical efficiency here. Weng et al. [14] demonstrated that the actual error variance is not very sensitive to the distribution (e.g., uniform or Gaussian distributions). This error estimator is especially useful to estimate roughly how many samples are needed for a given tolerable error variance.
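For the scalar Gaussian example above, the bound in (5) works out to sigma^2/n, and the sample mean attains it; a small Monte Carlo check (ours), with assumed values sigma = 3 and n = 50:

import numpy as np

rng = np.random.default_rng(2)
sigma, n, trials = 3.0, 50, 20000

sample_means = rng.normal(0.0, sigma, size=(trials, n)).mean(axis=1)
print(round(sample_means.var(), 4))   # empirical variance of the sample mean, ~0.18
print(sigma**2 / n)                   # Cramer-Rao bound for the mean with known sigma: 0.18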
IPCA algorithms have been studied by several researchers [15], [16], [9], [10]. An early work with a rigorous proof of convergence was given by Oja [9] and Oja and Karhunen [10], where they introduced their stochastic gradient ascent (SGA) algorithm. SGA computes

\tilde{v}_i(n) = v_i(n-1) + \gamma_i(n) u(n) u^T(n) v_i(n-1),    (6)

v_i(n) = \text{orthonormalize } \tilde{v}_i(n) \text{ w.r.t. } v_j(n), \quad j = 1, 2, ..., i-1,    (7)

where v_i(n) is the estimate of the ith dominant eigenvector of the sample covariance matrix A = E{u(n) u^T(n)} and \tilde{v}_i(n) is the new estimate. In practice, the orthonormalization in (7) can be done by a standard Gram-Schmidt orthonormalization (GSO) procedure. The parameter \gamma_i(n) is a stochastic approximation gain. The convergence of SGA has been proven under some assumptions on A and \gamma_i(n) [10].
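For contrast with (4), here is a minimal sketch of one SGA step, eqs. (6)-(7); the schedule gamma_i(n) = c/n used here is only a rough stand-in for the tuned gains discussed in the next paragraph, and the Gram-Schmidt loop implements the orthonormalization in (7).

import numpy as np

def sga_step(V, u, n, c=1.0):
    # V: (k, d) array whose rows are the current unit-norm estimates v_i(n-1);
    # u: the new zero-mean sample; gamma_i(n) = c/n is an assumed schedule.
    gamma = c / n
    V = V + gamma * np.outer(V @ u, u)        # eq. (6): v_i += gamma * (v_i . u) * u
    for i in range(V.shape[0]):               # eq. (7): Gram-Schmidt orthonormalization (GSO)
        for j in range(i):
            V[i] -= (V[i] @ V[j]) * V[j]
        V[i] /= np.linalg.norm(V[i])
    return V

Because each row is renormalized to unit length, SGA tracks only the eigenvector directions; the eigenvalues have to be estimated separately, unlike v(n) in (4), which converges to \lambda e.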
SGA is essentially a gradient method associated with the problem of choosing \gamma_i(n), the learning rate. Simply speaking, the learning rate should be appropriate so that the second term (the correction term) on the right side of (6) is comparable to the first term, neither too large nor too small. In practice, \gamma_i(n) depends very much on the nature of the data and usually requires a trial-and-error procedure, which is impractical for online applications. Oja gave some suggestions on \gamma_i(n) in [9], which is typically 1/n multiplied by some constants. However, procedure (6) is at the mercy of the magnitude of the observation u(n), where the first term has a unit norm, but the second can take any magnitude. If u(n) has a very small magnitude, the second term will be too small to make any changes in the new estimate. If u(n) has a large magnitude, which is the case with high-dimensional images, the second term will dominate the right side before a very large number n (and, hence, a small \gamma_i(n)) has been reached. In either case, the updating is inefficient and the convergence will be slow.
Contrasted with SGA, the first term on the right side of (4) is not normalized. In effect, v(n) in (4) converges to \lambda e instead of e as it does in (6), where \lambda is the eigenvalue and e is the eigenvector. In (4), the statistical efficiency is realized by keeping the scale of the estimate at the same order as the new observations (the first and second terms are properly weighted on the right side of (4) to get the sample mean), which allows full use of every observation in terms of statistical efficiency. Note that the coefficient (n-1)/n in (4) is as important as the "learning rate" 1/n in the second term to realize the sample mean. Although (n-1)/n is close to 1 when n is large, it is very important for fast convergence with early samples. The point is that, if the estimate does not converge well at the beginning, it is harder to pull it back later when n is large. Thus, one does not need to worry about the nature of the observations. This is also the reason that we used "candid" in naming the new algorithm.
It is true that the series of parameters \gamma_i(n), i = 1, 2, ..., k, in SGA can be manually tuned in an offline application so that it takes into account the magnitude of u(n). But a predefined \gamma_i(n) cannot accomplish statistical efficiency no matter how \gamma_i(n) is tuned. This is true because all the "observations," i.e., the last term in (4) and (6), contribute to the estimate in (4) with the same weight for statistical efficiency, but they contribute unequally in (6) due to the normalization of v(n-1) in the first term and, thus, damage the efficiency. Further, manual tuning is not suited for an online learning algorithm since the user cannot predict signals in advance. An online algorithm must automatically compute data-sensitive parameters.
There is a further improvement to procedure (4). In (4), all the "samples"

w(i) = u(i) u^T(i) \frac{v(i-1)}{\|v(i-1)\|}

are weighted equally. However, since w(i) is generated by v(i-1), and v(i-1) is far away from its real value at an early estimation stage, w(i) is a "sample" with large "noise" when i is small. To speed up the convergence of the estimation, it is preferable to give smaller weight to these early "samples." A way to implement this idea is to use an amnesic average by changing (4) into

v(n) = \frac{n-1-l}{n} v(n-1) + \frac{1+l}{n} u(n) u^T(n) \frac{v(n-1)}{\|v(n-1)\|},    (8)

where the positive parameter l is called the amnesic parameter. Note that the two modified weights still sum to 1. With the presence of l, larger weight is given to new "samples" and the effect of old "samples" will fade out gradually. Typically, l ranges from 2 to 4.

Fig. 3. The correctness of the eigenvalue, ||v_i|| / \lambda_i, by CCIPCA.
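A minimal sketch of the amnesic update (8) for a single component, assuming n > l + 1 so that both weights are positive (how the very first samples are treated is an implementation choice this excerpt does not spell out; the full-procedure sketch earlier simply clamps the old weight at zero).

import numpy as np

def amnesic_update(v, u, n, l=2.0):
    # Eq. (8): amnesic mean update of the unnormalized estimate v of lambda_1 * e_1.
    w_old = (n - 1.0 - l) / n
    w_new = (1.0 + l) / n                 # the two weights still sum to 1 for any l
    d = v / np.linalg.norm(v)
    return w_old * v + w_new * (u @ d) * u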
2.2 Intuitive Explanation
An intuitive explanation of procedure (4) is as follows: Consider a set of two-dimensional data with a Gaussian probability distribution function (for any other physically arising distribution, we can consider its first two orders of statistics since PCA does so). The data is characterized by an ellipse, as shown in Fig. 1. According to the geometrical meaning of eigenvectors, we know that the first eigenvector is aligned with the long axis (v_1) of the ellipse. Suppose v_1(n-1) is the (n-1)th-step estimate of the first eigenvector.
Noticing that u^T(n) v_1(n-1)/||v_1(n-1)|| is a scalar, we know that (1/n) u(n) u^T(n) v_1(n-1)/||v_1(n-1)|| is essentially a scaled vector of u(n). According to (4), v_1(n) is a weighted combination of the last estimate, v_1(n-1), and the scaled vector of u(n). Therefore, geometrically speaking, v_1(n) is obtained by pulling v_1(n-1) toward u(n) by a small amount.
A line l_2 orthogonal to v_1(n-1) divides the whole plane into two halves, the upper and the lower one. Because every point u_l in the lower half plane has an obtuse angle with v_1(n-1), u_l^T v_1(n-1)/||v_1(n-1)|| is a negative scalar. So, for u_l, (4) may be written as

v(n) = \frac{n-1}{n} v(n-1) + \frac{1}{n} \left( u_l^T \frac{v(n-1)}{\|v(n-1)\|} \right) (-u_l),

where -u_l is an upper half plane point obtained by rotating u_l by 180 degrees w.r.t. the origin. Since the ellipse is centrally symmetric, we may rotate all the lower half plane points to the upper half plane and only consider the pulling effect of upper half plane points. For the points u_u in the upper half plane, the net pulling force will pull v_1(n-1) toward the direction of v_1 since there are more data points to the right side of v_1(n-1) than to the left side. As long as the first two eigenvalues are different, this pulling force always exists and the pulling direction is toward the eigenvector corresponding to the larger eigenvalue. v_1(n-1) will not stop moving until it is aligned with v_1, when the pulling forces from both sides are balanced. In other words, v_1(n) in (4) will converge to the first eigenvector. As we can imagine, the larger the ratio of the first eigenvalue over the second eigenvalue, the more unbalanced the force is and the faster the pulling, or the convergence, will be. However, when \lambda_1 = \lambda_2, the ellipse degenerates to a circle. The movement will not stop, which seems to suggest that the algorithm does not converge. Actually, since any vector in that circle can represent the eigenvector, it does not hurt not to converge. We will get back to the cases of equal eigenvalues in Section 2.4.
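The pulling argument can be watched numerically; the sketch below (ours, with an assumed covariance diag(9, 1) so that the true first eigenvector is (1, 0)) runs update (4) on 2-D Gaussian data and prints the angle between the running estimate and the long axis, which shrinks toward zero.

import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[9.0, 0.0],
                  [0.0, 1.0]])            # elongated ellipse, long axis along e1 = (1, 0)
U = rng.multivariate_normal([0.0, 0.0], Sigma, size=5000)

e1 = np.array([1.0, 0.0])
v = U[0].copy()                           # v(0) = u(1)
for n, u in enumerate(U, start=1):
    d = v / np.linalg.norm(v)
    v = (n - 1) / n * v + (1 / n) * (u @ d) * u
    if n in (10, 100, 1000, 5000):
        angle = np.degrees(np.arccos(min(1.0, abs(d @ e1))))
        print(n, round(angle, 2))         # degrees between the estimate and the true first eigenvector

As predicted by v(n) -> \lambda_1 e_1, np.linalg.norm(v) also approaches the first eigenvalue, 9.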
2.3 Higher-Order Eigenvectors
Procedure (4) only estimates the first dominant eigenvector. One way to compute the other higher-order eigenvectors is to follow what SGA does: start with a set of orthonormalized vectors, update them using the suggested iteration step, and recover the orthogonality using GSO. For real-time online computation, we need to avoid the time-consuming GSO. Further, breaking-then-recovering orthogonality slows down the convergence compared with keeping orthogonality all along. We know eigenvectors are orthogonal to each other. So, it helps to generate "observations" only in a complementary space for the computation of the higher-order eigenvectors. For example, to compute the second-order eigenvector, we first subtract from the data its projection on the estimated first-order eigenvector v_1(n), as shown in (9),

u_2(n) = u_1(n) - \left( u_1^T(n) \frac{v_1(n)}{\|v_1(n)\|} \right) \frac{v_1(n)}{\|v_1(n)\|},    (9)

where u_1(n) = u(n). The obtained residual, u_2(n), which is in the complementary space of v_1(n), serves as the input data to the iteration step. In this way, the orthogonality is always enforced when convergence is reached, although not exactly so at early stages. This, in effect, makes better use of the samples available and, thus, speeds up the convergence.
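A one-function sketch of the residual step (9); the name residual is ours.

import numpy as np

def residual(u_i, v_i):
    # Eq. (9): remove from u_i its projection on the current estimate v_i,
    # so the next component is estimated only in the complementary space.
    d = v_i / np.linalg.norm(v_i)
    return u_i - (u_i @ d) * d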
A similar idea has been used by some other researchers. Kreyszig proposed an algorithm which finds the first eigenvector using a method equivalent to SGA and subtracts the first component from the samples before computing the next component [17]. Sanger suggested an algorithm, called the generalized Hebbian algorithm (GHA), based on the same idea, except that all the components are computed at the same time [11]. However, in either case, the statistical efficiency was not considered.

Fig. 4. The absolute values of the first 10 eigenvalues.
Fig. 5. The effect of the amnesic parameter: the correctness of the first 10 eigenvectors computed by CCIPCA, with the amnesic parameter l = 0. A comparison with Fig. 2c.
The new CCIPCA also saves computations. One may notice that the expensive steps in both SGA and CCIPCA are the dot products in the high-dimensional data space. CCIPCA requires one extra dot product, i.e., u_i^T(n) v_i(n) in (9), for each principal component in each estimation step. For SGA, to do orthonormalization over k new estimates of eigenvectors using GSO, we have a total of k(k+1)/2 dot products. So, the average number of dot products saved by CCIPCA over SGA for each eigenvector is (k-1)/2.
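For example, with k = 10 eigenvectors, GSO costs k(k+1)/2 = 55 dot products per step, CCIPCA's extra cost is one per component, i.e., 10, and the saving per eigenvector is (55 - 10)/10 = 4.5 = (k - 1)/2; a two-line check:

k = 10
gso, extra = k * (k + 1) // 2, k
print(gso - extra, (gso - extra) / k)    # 45 dot products saved per step, 4.5 per eigenvector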
2.4 Equal Eigenvalues
Let us consider the case where there are equal eigenvalues. Suppose the ordered eigenvalues between \lambda_l and \lambda_m are equal:

\lambda_{l-1} > \lambda_l = \lambda_{l+1} = ... = \lambda_m > \lambda_{m+1}.
According to the explanation in Section 2.2, the vector estimate will converge to the one with a larger eigenvalue first. Therefore, the estimates of the eigenvectors e_i, where i < l, will not be affected. The vector estimates of e_l to e_m will converge into the subspace spanned by the corresponding eigenvectors. Since their eigenvalues are equal, the shape of the distribution in Fig. 1 is a hypersphere within the subspace. Thus, the estimates of the multiple eigenvectors will converge to some orthogonal basis of that subspace. Where they converge to depends mainly on the early samples, because of the averaging effect in (2), where the contribution of new data gets infinitely small as n increases without bound.
Fig. 6. A longer data stream. The correctness of the first 10 eigenvectors computed by (a) SGA, (b) GHA, and (c) CCIPCA (with the amnesic parameter l = 2), respectively, over 20 epochs.


