Neural Computing Research Group
Dept of Computer Science & Applied Mathematics
Aston University
Birmingham B4 7ET
United Kingdom
Tel: +44 (0)121 333 4631
Fax: +44 (0)121 333 4586
http://www.ncrg.aston.ac.uk/
Probabilistic Principal Component Analysis
Michael E. Tipping
M.E.Tipping@aston.ac.uk
Christopher M. Bishop
C.M.Bishop@aston.ac.uk
Technical Report NCRG/97/010    September 4, 1997
Submitted for publication.
Abstract
Principal component analysis (PCA) is a ubiquitous technique for data analysis and processing, but one which is not based upon a probability model. In this paper we demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of parameters in a latent variable model closely related to factor analysis. We consider the properties of the associated likelihood function, giving an EM algorithm for estimating the principal subspace iteratively, and discuss the advantages conveyed by the definition of a probability density function for PCA.

1 Introduction

Principal component analysis (PCA) (Jolliffe 1986) is a well-established technique for dimension reduction, and a chapter on the subject may be found in practically every text on multivariate analysis. Examples of its many applications include data compression, image processing, visualization, exploratory data analysis, pattern recognition and time series prediction.
The most common derivation of PCA is in terms of a standardised linear projection which maximises the variance in the projected space (Hotelling 1933). For a set of observed $d$-dimensional data vectors $\{\mathbf{t}_n\}$, $n \in \{1, \ldots, N\}$, the $q$ principal axes $\mathbf{w}_j$, $j \in \{1, \ldots, q\}$, are those orthonormal axes onto which the retained variance under projection is maximal. It can be shown that the vectors $\mathbf{w}_j$ are given by the $q$ dominant eigenvectors (i.e. those with the largest associated eigenvalues $\lambda_j$) of the sample covariance matrix $\mathbf{S} = E[(\mathbf{t} - \boldsymbol{\mu})(\mathbf{t} - \boldsymbol{\mu})^{\mathrm{T}}]$, such that $\mathbf{S}\mathbf{w}_j = \lambda_j \mathbf{w}_j$. The $q$ principal components of the observed vector $\mathbf{t}_n$ are given by the vector $\mathbf{x}_n = \mathbf{W}^{\mathrm{T}}(\mathbf{t}_n - \boldsymbol{\mu})$, where $\mathbf{W} = (\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_q)$. The variables $x_j$ are then decorrelated, such that the covariance matrix $E[\mathbf{x}\mathbf{x}^{\mathrm{T}}]$ is diagonal with elements $\lambda_j$.
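To make this definition concrete, the following sketch (not taken from the paper; the synthetic data, variable names and use of NumPy are purely illustrative) computes the $q$ dominant eigenvectors of the sample covariance matrix and the corresponding principal components:

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))  # N x d synthetic data

q = 2
mu = T.mean(axis=0)
S = (T - mu).T @ (T - mu) / T.shape[0]     # sample covariance matrix S

# Eigen-decomposition of S; retain the q eigenvectors with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order][:q]                   # lambda_1 >= ... >= lambda_q
W = eigvecs[:, order[:q]]                  # columns w_j are the principal axes

X = (T - mu) @ W                           # principal components x_n = W^T (t_n - mu)
# The projected variables are decorrelated: their covariance is diag(lambda_j).
print(np.allclose((X - X.mean(0)).T @ (X - X.mean(0)) / X.shape[0], np.diag(lam)))
```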
A complementary property of PCA, and that most closely related to the original discussions of Pearson (1901), is that, of all orthogonal linear projections $\mathbf{x}_n = \mathbf{W}^{\mathrm{T}}(\mathbf{t}_n - \boldsymbol{\mu})$, the principal component projection minimises the squared reconstruction error $\sum_n \|\mathbf{t}_n - \hat{\mathbf{t}}_n\|^2$, where the optimal linear reconstruction of $\mathbf{t}_n$ is given by $\hat{\mathbf{t}}_n = \mathbf{W}\mathbf{x}_n + \boldsymbol{\mu}$.
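Continuing the illustrative sketch above, the optimal linear reconstruction and its squared error can be checked directly; the residual equals $N$ times the sum of the discarded eigenvalues (a standard identity noted here as an aside, not a claim quoted from the paper):

```python
T_hat = X @ W.T + mu                       # reconstruction  t_hat_n = W x_n + mu
sq_err = np.sum((T - T_hat) ** 2)          # squared reconstruction error, summed over n
print(np.isclose(sq_err, T.shape[0] * eigvals[order][q:].sum()))
```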
One limiting disadvantage of both these definitions of PCA is the absence of a probability density model and associated likelihood measure. Deriving PCA from the perspective of density estimation would offer a number of important advantages, including:

- The definition of a likelihood measure permits comparison with other density-estimation techniques and facilitates statistical testing.
- Bayesian inference methods may be applied (e.g. for model comparison) by combining the likelihood with a prior.
- If PCA is used to model the class-conditional densities in a classification problem, the posterior probabilities of class membership may be computed.
- The probability density function gives a measure of the novelty of a new data point.
- The single PCA model may be extended to a mixture of such models.
The key result of this paper is to show that principal component analysis may indeed be obtained from a probability model. This follows from incorporating $\mathbf{W}$ within a particular form of latent variable density model which is closely related to statistical factor analysis. Under this formulation, the maximum-likelihood estimator of $\mathbf{W}$ is the matrix of (scaled and rotated) principal axes of the data. Estimation of $\mathbf{W}$ in this way, using an iterative EM algorithm for example, is generally more computationally expensive than the standard eigen-decomposition approach. However, using the given derivation, $\mathbf{W}$ may be computed in the standard fashion and subsequently incorporated in the model in order to realise the advantages listed above.

In the next section we briefly introduce the concept of latent variable models, and outline factor analysis in particular. Section 3 then shows how principal component analysis emerges from a particular model parameterisation, and we conclude with a discussion in Section 4. Proofs of key results are left to the appendix.

2 Latent Variable Models

A latent variable model seeks to relate the set of $d$-dimensional observed data vectors $\{\mathbf{t}_n\}$ to a corresponding set of $q$-dimensional latent variables $\{\mathbf{x}_n\}$:
\[
\mathbf{t} = \mathbf{y}(\mathbf{x}) + \boldsymbol{\epsilon}, \tag{1}
\]
where $\mathbf{y}(\mathbf{x})$ is a parameterised function of the latent variable $\mathbf{x}$, and $\boldsymbol{\epsilon}$ is an $\mathbf{x}$-independent noise process. Generally, $q < d$, such that the latent variables offer a more parsimonious description of the data. By defining a prior distribution over $\mathbf{x}$, equation (1) induces a corresponding distribution in the data space, and the model parameters may be determined by maximum likelihood.
In standard factor analysis (Bartholomew 1987) the mapping $\mathbf{y}(\mathbf{x})$ is linear:
\[
\mathbf{t} = \mathbf{W}\mathbf{x} + \boldsymbol{\mu} + \boldsymbol{\epsilon}, \tag{2}
\]
where the latent variables $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ have a unit isotropic Gaussian distribution. The error, or noise, model is Gaussian, such that $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Psi})$ with $\boldsymbol{\Psi}$ diagonal; the $d \times q$ parameter matrix $\mathbf{W}$ contains the factor loadings; and $\boldsymbol{\mu}$ is a constant whose maximum-likelihood estimator is the mean of the data. Given this formulation, the model for $\mathbf{t}$ is also normal, $\mathcal{N}(\boldsymbol{\mu}, \mathbf{C})$, where the covariance $\mathbf{C} = \boldsymbol{\Psi} + \mathbf{W}\mathbf{W}^{\mathrm{T}}$. The motivation, and indeed key assumption, for this model is that, because of the diagonality of $\boldsymbol{\Psi}$, the observed variables $\mathbf{t}$ are conditionally independent given the values of the latent variables $\mathbf{x}$. Thus the reduced-dimensional distribution of $\mathbf{x}$ is intended to model the dependencies between the observed variables, while $\boldsymbol{\Psi}$ represents the independent noise. This is in contrast to PCA, which treats the inter-variable dependencies and the independent noise identically. In factor analysis the columns of $\mathbf{W}$ will generally not correspond to the principal subspace of the data. Furthermore, unlike PCA, there is no analytic solution for $\mathbf{W}$ and $\boldsymbol{\Psi}$, and so their values must be determined by iterative procedures. Note also that, because of the $\mathbf{W}\mathbf{W}^{\mathrm{T}}$ term, the covariance $\mathbf{C}$, and thus the likelihood, is invariant with respect to orthogonal post-multiplication of $\mathbf{W}$: that is, $\mathbf{W}\mathbf{R}$, where $\mathbf{R}$ is an arbitrary $q \times q$ orthogonal matrix, gives an equivalent $\mathbf{C}$.
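A minimal generative sketch of the factor analysis model (2), assuming illustrative parameter values that are not taken from the paper; it simply draws samples and confirms that their covariance approaches $\mathbf{C} = \boldsymbol{\Psi} + \mathbf{W}\mathbf{W}^{\mathrm{T}}$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, q, N = 5, 2, 200_000

W = rng.standard_normal((d, q))            # factor loadings (illustrative values)
mu = rng.standard_normal(d)                # data mean
psi = rng.uniform(0.1, 1.0, size=d)        # diagonal of the noise covariance Psi

x = rng.standard_normal((N, q))                       # x ~ N(0, I)
eps = rng.standard_normal((N, d)) * np.sqrt(psi)      # eps ~ N(0, Psi), Psi diagonal
t = x @ W.T + mu + eps                                # t = W x + mu + eps, equation (2)

C = np.diag(psi) + W @ W.T                            # model covariance C = Psi + W W^T
S_emp = np.cov(t, rowvar=False)
print(np.max(np.abs(S_emp - C)))                      # shrinks towards 0 as N grows
```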
3 A Probability Model for PCA

Because of the diagonal noise model, the factor loadings $\mathbf{W}$ will, in general, differ from the principal axes (even when taking the arbitrary rotation into account). As considered by Anderson (1963), principal components emerge when the data is assumed to comprise a systematic component, plus an independent error term for each variable with common variance $\sigma^2$. This implies that the diagonal elements of the error matrix $\boldsymbol{\Psi}$ in factor analysis above should be identical. Indeed, the similarity between the factor loadings and the principal axes has often been observed in situations in which the elements of $\boldsymbol{\Psi}$ are approximately equal (Rao 1955). Basilevsky (1994) further notes that when the model $\mathbf{W}\mathbf{W}^{\mathrm{T}} + \sigma^2\mathbf{I}$ is exact, and therefore equal to $\mathbf{S}$, the factor loadings are identifiable and can be determined analytically through eigen-decomposition of $\mathbf{S}$, without resort to iteration.

As well as assuming the accuracy of the model, such observations do not consider the maximum-likelihood context. By considering the model given by (2) with an isotropic noise structure, such that $\boldsymbol{\Psi} = \sigma^2\mathbf{I}$, we show in this paper that even when the covariance model is approximate, the maximum-likelihood estimator $\mathbf{W}_{\mathrm{ML}}$ is that matrix whose columns are the scaled and rotated principal eigenvectors of the sample covariance matrix $\mathbf{S}$. An important consequence of this derivation is that PCA may be expressed in terms of a density model, the definition of which now follows.

3.1 The Probability Model

For the isotropic noise model $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$, equation (2) implies a probability distribution over $\mathbf{t}$-space for a given $\mathbf{x}$ given by
\[
p(\mathbf{t} \mid \mathbf{x}) = (2\pi\sigma^2)^{-d/2} \exp\left\{ -\frac{1}{2\sigma^2} \|\mathbf{t} - \mathbf{W}\mathbf{x} - \boldsymbol{\mu}\|^2 \right\}. \tag{3}
\]
With a Gaussian prior over the latent variables defined by
\[
p(\mathbf{x}) = (2\pi)^{-q/2} \exp\left\{ -\tfrac{1}{2}\, \mathbf{x}^{\mathrm{T}}\mathbf{x} \right\}, \tag{4}
\]
we obtain the marginal distribution of $\mathbf{t}$ in the form
\[
p(\mathbf{t}) = \int p(\mathbf{t} \mid \mathbf{x})\, p(\mathbf{x})\, \mathrm{d}\mathbf{x} \tag{5}
\]
\[
= (2\pi)^{-d/2} |\mathbf{C}|^{-1/2} \exp\left\{ -\tfrac{1}{2} (\mathbf{t} - \boldsymbol{\mu})^{\mathrm{T}} \mathbf{C}^{-1} (\mathbf{t} - \boldsymbol{\mu}) \right\}, \tag{6}
\]
where the model covariance is
\[
\mathbf{C} = \sigma^2\mathbf{I} + \mathbf{W}\mathbf{W}^{\mathrm{T}}. \tag{7}
\]
Using Bayes' rule, the posterior distribution of the latent variables $\mathbf{x}$ given the observed $\mathbf{t}$ may be calculated:
\[
p(\mathbf{x} \mid \mathbf{t}) = (2\pi)^{-q/2} |\sigma^{-2}\mathbf{M}|^{1/2} \exp\left[ -\tfrac{1}{2} \left\{ \mathbf{x} - \mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}(\mathbf{t} - \boldsymbol{\mu}) \right\}^{\mathrm{T}} (\sigma^{-2}\mathbf{M}) \left\{ \mathbf{x} - \mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}(\mathbf{t} - \boldsymbol{\mu}) \right\} \right], \tag{8}
\]
where the posterior covariance matrix is given by
\[
\sigma^2 \mathbf{M}^{-1} = \sigma^2 \left( \sigma^2\mathbf{I} + \mathbf{W}^{\mathrm{T}}\mathbf{W} \right)^{-1}. \tag{9}
\]
Note that $\mathbf{M}$ is $q \times q$ while $\mathbf{C}$ is $d \times d$.
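Since the posterior (8) is Gaussian, it is fully specified by its mean $\mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}(\mathbf{t} - \boldsymbol{\mu})$ and covariance $\sigma^2\mathbf{M}^{-1}$. A short illustrative sketch (the parameter values are arbitrary and not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
d, q, sigma2 = 5, 2, 0.3
W = rng.standard_normal((d, q))            # illustrative loading matrix
mu = np.zeros(d)
t = rng.standard_normal(d)                 # a single observation

M = sigma2 * np.eye(q) + W.T @ W                   # M = sigma^2 I + W^T W  (q x q)
post_mean = np.linalg.solve(M, W.T @ (t - mu))     # posterior mean  M^{-1} W^T (t - mu)
post_cov = sigma2 * np.linalg.inv(M)               # posterior covariance  sigma^2 M^{-1}, equation (9)
```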
The log-likelihood of observing the data under this model is
\[
\mathcal{L} = \sum_{n=1}^{N} \ln\{p(\mathbf{t}_n)\} = -\frac{Nd}{2}\ln(2\pi) - \frac{N}{2}\ln|\mathbf{C}| - \frac{N}{2}\mathrm{tr}\left(\mathbf{C}^{-1}\mathbf{S}\right), \tag{10}
\]
where
\[
\mathbf{S} = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{t}_n - \boldsymbol{\mu})(\mathbf{t}_n - \boldsymbol{\mu})^{\mathrm{T}} \tag{11}
\]
is the sample covariance matrix of the observed $\{\mathbf{t}_n\}$. The parameters for this model can thus be estimated by maximising the log-likelihood $\mathcal{L}$, and an EM algorithm to achieve this is given in Appendix B.
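The paper defers the EM algorithm to Appendix B, which is not included in this extract. The sketch below is therefore an illustrative implementation of the log-likelihood (10) together with one EM-style update of $\mathbf{W}$ and $\sigma^2$ of the standard form used for this model; it should not be read as a transcription of the appendix.

```python
import numpy as np

def ppca_loglik(T, mu, W, sigma2):
    """Log-likelihood (10) of the data matrix T (N x d) under the PPCA model."""
    N, d = T.shape
    Tc = T - mu
    S = Tc.T @ Tc / N                                      # sample covariance, equation (11)
    C = sigma2 * np.eye(d) + W @ W.T                       # model covariance, equation (7)
    _, logdetC = np.linalg.slogdet(C)
    return -0.5 * N * (d * np.log(2 * np.pi) + logdetC
                       + np.trace(np.linalg.solve(C, S)))

def ppca_em_step(T, mu, W, sigma2):
    """One EM iteration for W and sigma^2 (illustrative form, not the paper's Appendix B)."""
    N, d = T.shape
    q = W.shape[1]
    Tc = T - mu
    M = sigma2 * np.eye(q) + W.T @ W
    Minv = np.linalg.inv(M)
    Ex = Tc @ W @ Minv                                     # posterior means <x_n>, one per row
    sumExx = N * sigma2 * Minv + Ex.T @ Ex                 # sum_n <x_n x_n^T>
    W_new = np.linalg.solve(sumExx, Ex.T @ Tc).T           # [sum_n (t_n - mu)<x_n>^T][sum_n <x_n x_n^T>]^{-1}
    resid = (np.sum(Tc ** 2)
             - 2.0 * np.sum(Ex * (Tc @ W_new))
             + np.trace(sumExx @ W_new.T @ W_new))
    sigma2_new = resid / (N * d)
    return W_new, sigma2_new
```

Iterating ppca_em_step and monitoring ppca_loglik until it stops increasing gives a local maximum of (10); Section 3.2 characterises what that maximum looks like.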
3.2 Properties of the Maximum-Likelihood Estimators

The log-likelihood (10) is maximised when the columns of $\mathbf{W}$ span the principal subspace of the data. To show this we consider the derivative of (10) with respect to $\mathbf{W}$:
\[
\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = N\left( \mathbf{C}^{-1}\mathbf{S}\mathbf{C}^{-1}\mathbf{W} - \mathbf{C}^{-1}\mathbf{W} \right), \tag{12}
\]
which may be obtained from standard matrix differentiation results [see Krzanowski and Marriott 1994, pp. 133]. In Appendix A it is shown, with $\mathbf{C}$ given by (7), that the only non-zero stationary points of (12) occur for
\[
\mathbf{W} = \mathbf{U}_q \left( \boldsymbol{\Lambda}_q - \sigma^2\mathbf{I} \right)^{1/2} \mathbf{R}, \tag{13}
\]
where the $q$ column vectors in $\mathbf{U}_q$ are eigenvectors of $\mathbf{S}$, with corresponding eigenvalues in the diagonal matrix $\boldsymbol{\Lambda}_q$, and $\mathbf{R}$ is an arbitrary $q \times q$ orthogonal rotation matrix. Furthermore, it is also shown that the stationary point corresponding to the global maximum of the likelihood occurs when $\mathbf{U}_q$ comprises the principal eigenvectors of $\mathbf{S}$, and that all other combinations of eigenvectors represent saddle-points of the likelihood surface. Thus, from (13), the columns of the maximum-likelihood estimator $\mathbf{W}_{\mathrm{ML}}$ contain the principal eigenvectors of $\mathbf{S}$, with a scaling determined by the corresponding eigenvalue and the parameter $\sigma^2$, and with arbitrary rotation.

It may also be shown that, for $\mathbf{W} = \mathbf{W}_{\mathrm{ML}}$, the maximum-likelihood estimator for $\sigma^2$ is given by
\[
\sigma^2_{\mathrm{ML}} = \frac{1}{d - q} \sum_{j=q+1}^{d} \lambda_j, \tag{14}
\]
which has a clear interpretation as the variance `lost' in the projection, averaged over the lost dimensions.
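Equations (13) and (14) give the maximum-likelihood solution in closed form. A sketch of how they might be evaluated (illustrative only; $\mathbf{R}$ is taken to be the identity):

```python
import numpy as np

def ppca_ml(T, q):
    """Closed-form ML estimates via equations (13) (with R = I) and (14)."""
    N, d = T.shape
    mu = T.mean(axis=0)                              # the ML estimator of mu is the sample mean
    Tc = T - mu
    S = Tc.T @ Tc / N                                # sample covariance, equation (11)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]                # sort eigenvalues into descending order
    lam, U = eigvals[order], eigvecs[:, order]
    sigma2_ml = lam[q:].mean()                       # (14): mean of the d - q discarded eigenvalues
    W_ml = U[:, :q] * np.sqrt(lam[:q] - sigma2_ml)   # (13): U_q (Lambda_q - sigma^2 I)^{1/2}, R = I
    return mu, W_ml, sigma2_ml
```

Substituting these estimates into the log-likelihood sketch of the previous section should match, up to numerical error, the value reached by iterating the EM update to convergence.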
It should be noted that the columns of $\mathbf{W}_{\mathrm{ML}}$ are not orthogonal, since
\[
\mathbf{W}_{\mathrm{ML}}^{\mathrm{T}} \mathbf{W}_{\mathrm{ML}} = \mathbf{R}^{\mathrm{T}} \left( \boldsymbol{\Lambda}_q - \sigma^2\mathbf{I} \right) \mathbf{R}, \tag{15}
\]
which is not diagonal for $\mathbf{R} \neq \mathbf{I}$. In common with factor analysis, and indeed many other iterative PCA algorithms, there exists an element of rotational ambiguity. An orthonormal basis for the principal subspace may easily be extracted using standard techniques if required. Furthermore, the actual principal axes may also be determined by noting that equation (15) represents an eigenvector decomposition of $\mathbf{W}_{\mathrm{ML}}^{\mathrm{T}} \mathbf{W}_{\mathrm{ML}}$, where the transposed rotation matrix $\mathbf{R}^{\mathrm{T}}$ is simply the matrix whose columns are the eigenvectors of the $q \times q$ matrix $\mathbf{W}_{\mathrm{ML}}^{\mathrm{T}} \mathbf{W}_{\mathrm{ML}}$.
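For instance, if a loading matrix has been estimated by an iterative procedure and is therefore only determined up to a post-multiplying rotation $\mathbf{R}$, the rotation can be stripped off exactly as described. The sketch below is an illustrative continuation of the earlier code (W_ml is assumed to come from ppca_ml above, and the eigenvalue-distinctness needed for a unique decomposition is assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
q = W_ml.shape[1]                                    # W_ml as returned by ppca_ml above

# Simulate the ambiguity: post-multiply W_ml by an arbitrary q x q orthogonal matrix.
R_arbitrary, _ = np.linalg.qr(rng.standard_normal((q, q)))
W_hat = W_ml @ R_arbitrary

# Equation (15): the eigenvectors of W_hat^T W_hat form the columns of R^T.
_, V = np.linalg.eigh(W_hat.T @ W_hat)
U_q = W_hat @ V                                      # rotation removed; columns lie along the principal axes
U_q /= np.linalg.norm(U_q, axis=0)                   # orthonormal basis, up to column order and sign
```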
However, with reference to the optimal reconstruction property of PCA, further processing of the parameters is not necessary. From (8) it may be seen that the posterior mean projection of $\mathbf{t}_n$ is given by $\langle \mathbf{x}_n \rangle = \mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}(\mathbf{t}_n - \boldsymbol{\mu})$. When $\sigma^2 \to 0$, $\mathbf{M}^{-1} \to (\mathbf{W}^{\mathrm{T}}\mathbf{W})^{-1}$ and $\mathbf{W}\mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}$ then becomes an orthogonal projection, and so PCA is recovered. However, the density model then becomes singular, and thus undefined, while for $\sigma^2 > 0$ the projection onto the manifold becomes skewed towards the origin as a result of the prior over $\mathbf{x}$. Because of this, $\mathbf{W}\langle \mathbf{x}_n \rangle$ is not an orthogonal projection of $\mathbf{t}_n$. However, each data point may still be optimally reconstructed from the latent variable by taking this skewing into account. With $\mathbf{W} = \mathbf{W}_{\mathrm{ML}}$ the required reconstruction is given by
\[
\hat{\mathbf{t}}_n = \mathbf{W}_{\mathrm{ML}} \left\{ \mathbf{W}_{\mathrm{ML}}^{\mathrm{T}} \mathbf{W}_{\mathrm{ML}} \right\}^{-1} \mathbf{M} \langle \mathbf{x}_n \rangle + \boldsymbol{\mu}, \tag{16}
\]
and is derived in Appendix C. Thus the latent variables convey the necessary information to reconstruct the original data vector optimally, even in the case of $\sigma^2 > 0$.
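A sketch of the posterior-mean projection and the reconstruction (16); the function below is illustrative and simply restates the equations above in code:

```python
import numpy as np

def ppca_reconstruct(T, mu, W, sigma2):
    """Posterior means <x_n> and the optimal reconstructions of equation (16)."""
    q = W.shape[1]
    Tc = T - mu
    M = sigma2 * np.eye(q) + W.T @ W
    Ex = Tc @ W @ np.linalg.inv(M)                 # <x_n> = M^{-1} W^T (t_n - mu), one per row
    # Equation (16): t_hat_n = W (W^T W)^{-1} M <x_n> + mu, undoing the shrinkage towards the origin.
    T_hat = Ex @ M @ np.linalg.inv(W.T @ W) @ W.T + mu
    return Ex, T_hat
```

As $\sigma^2 \to 0$, $\mathbf{M} \to \mathbf{W}^{\mathrm{T}}\mathbf{W}$ and this reduces to the ordinary PCA reconstruction $\mathbf{W}\mathbf{x}_n + \boldsymbol{\mu}$.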
4 Discussion

In this paper we have shown how principal component analysis may be viewed as a maximum-likelihood procedure based on a probability density model of the observed data.

In addition, we have given an EM algorithm for determining the necessary model parameters, and although we are not necessarily advocating that standard principal components should be estimated in this way, the EM algorithm plays a crucial rôle when, for example, extending the approach to mixture models. (Even for standard PCA, there may be an advantage in an iterative

References

- The Nature of Statistical Learning Theory (book)
- Support-Vector Networks (journal article)
- Statistical Analysis with Missing Data (book)
- Principal Component Analysis (book)
- A training algorithm for optimal margin classifiers (conference proceedings)