Neural Computing Research Group
Dept of Computer Science & Applied Mathematics
Aston University
Birmingham B4 7ET
United Kingdom
Tel: +44 (0)121 333 4631
Fax: +44 (0)121 333 4586
http://www.ncrg.aston.ac.uk/
Probabilistic Principal Component Analysis
Michael E. Tipping
M.E.Tipping@aston.ac.uk
Christopher M. Bishop
C.M.Bishop@aston.ac.uk
Technical Report NCRG/97/010    September 4, 1997
Submitted for publication.
Abstract
Principal component analysis (PCA) is a ubiquitous technique for data analysis and processing, but one which is not based upon a probability model. In this paper we demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of parameters in a latent variable model closely related to factor analysis. We consider the properties of the associated likelihood function, giving an EM algorithm for estimating the principal subspace iteratively, and discuss the advantages conveyed by the definition of a probability density function for PCA.

1 Introduction

Principal component analysis (PCA) (Jolliffe 1986) is a well-established technique for dimension reduction, and a chapter on the subject may be found in practically every text on multivariate analysis. Examples of its many applications include data compression, image processing, visualization, exploratory data analysis, pattern recognition and time series prediction.
The most common derivation of PCA is in terms of a standardised linear projection which maximises the variance in the projected space (Hotelling 1933). For a set of observed $d$-dimensional data vectors $\{\mathbf{t}_n\}$, $n \in \{1, \ldots, N\}$, the $q$ principal axes $\mathbf{w}_j$, $j \in \{1, \ldots, q\}$, are those orthonormal axes onto which the retained variance under projection is maximal. It can be shown that the vectors $\mathbf{w}_j$ are given by the $q$ dominant eigenvectors (i.e. those with the largest associated eigenvalues $\lambda_j$) of the sample covariance matrix $\mathbf{S} = E[(\mathbf{t} - \boldsymbol{\mu})(\mathbf{t} - \boldsymbol{\mu})^{\mathrm{T}}]$, such that $\mathbf{S}\mathbf{w}_j = \lambda_j \mathbf{w}_j$. The $q$ principal components of the observed vector $\mathbf{t}_n$ are given by the vector $\mathbf{x}_n = \mathbf{W}^{\mathrm{T}}(\mathbf{t}_n - \boldsymbol{\mu})$, where $\mathbf{W} = (\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_q)$. The variables $x_j$ are then decorrelated, such that the covariance matrix $E[\mathbf{x}\mathbf{x}^{\mathrm{T}}]$ is diagonal with elements $\lambda_j$.
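To make this definition concrete, the following sketch (not taken from the paper; the synthetic data, variable names and use of NumPy are purely illustrative) computes the $q$ dominant eigenvectors of the sample covariance matrix and the corresponding principal components:

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))  # N x d synthetic data

q = 2
mu = T.mean(axis=0)
S = (T - mu).T @ (T - mu) / T.shape[0]     # sample covariance matrix S

# Eigen-decomposition of S; retain the q eigenvectors with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order][:q]                   # lambda_1 >= ... >= lambda_q
W = eigvecs[:, order[:q]]                  # columns w_j are the principal axes

X = (T - mu) @ W                           # principal components x_n = W^T (t_n - mu)
# The projected variables are decorrelated: their covariance is diag(lambda_j).
print(np.allclose((X - X.mean(0)).T @ (X - X.mean(0)) / X.shape[0], np.diag(lam)))
```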
A complementary property of PCA, and that most closely related to the original discussions of Pearson (1901), is that, of all orthogonal linear projections $\mathbf{x}_n = \mathbf{W}^{\mathrm{T}}(\mathbf{t}_n - \boldsymbol{\mu})$, the principal component projection minimises the squared reconstruction error $\sum_n \|\mathbf{t}_n - \hat{\mathbf{t}}_n\|^2$, where the optimal linear reconstruction of $\mathbf{t}_n$ is given by $\hat{\mathbf{t}}_n = \mathbf{W}\mathbf{x}_n + \boldsymbol{\mu}$.
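Continuing the illustrative sketch above, the optimal linear reconstruction and its squared error can be checked directly; the residual equals $N$ times the sum of the discarded eigenvalues (a standard identity noted here as an aside, not a claim quoted from the paper):

```python
T_hat = X @ W.T + mu                       # reconstruction  t_hat_n = W x_n + mu
sq_err = np.sum((T - T_hat) ** 2)          # squared reconstruction error, summed over n
print(np.isclose(sq_err, T.shape[0] * eigvals[order][q:].sum()))
```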
One limiting disadvantage of both these definitions of PCA is the absence of a probability density model and associated likelihood measure. Deriving PCA from the perspective of density estimation would offer a number of important advantages, including:

- The definition of a likelihood measure permits comparison with other density-estimation techniques and facilitates statistical testing.
- Bayesian inference methods may be applied (e.g. for model comparison) by combining the likelihood with a prior.
- If PCA is used to model the class-conditional densities in a classification problem, the posterior probabilities of class membership may be computed.
- The probability density function gives a measure of the novelty of a new data point.
- The single PCA model may be extended to a mixture of such models.
The key result of this paper is to show that principal component analysis may indeed be obtained from a probability model. This follows from incorporating $\mathbf{W}$ within a particular form of latent variable density model which is closely related to statistical factor analysis. Under this formulation, the maximum-likelihood estimator of $\mathbf{W}$ is the matrix of (scaled and rotated) principal axes of the data. Estimation of $\mathbf{W}$ in this way, using an iterative EM algorithm for example, is generally more computationally expensive than the standard eigen-decomposition approach. However, using the given derivation, $\mathbf{W}$ may be computed in the standard fashion and subsequently incorporated in the model in order to realise the advantages listed above.

In the next section we briefly introduce the concept of latent variable models, and outline factor analysis in particular. Section 3 then shows how principal component analysis emerges from a particular model parameterisation, and we conclude with a discussion in Section 4. Proofs of key results are left to the appendix.

2 Latent Variable Models

A latent variable model seeks to relate the set of $d$-dimensional observed data vectors $\{\mathbf{t}_n\}$ to a corresponding set of $q$-dimensional latent variables $\{\mathbf{x}_n\}$:
\[
\mathbf{t} = \mathbf{y}(\mathbf{x}) + \boldsymbol{\epsilon}, \tag{1}
\]
where $\mathbf{y}(\mathbf{x})$ is a parameterised function of the latent variable $\mathbf{x}$, and $\boldsymbol{\epsilon}$ is an $\mathbf{x}$-independent noise process. Generally, $q < d$, such that the latent variables offer a more parsimonious description of the data. By defining a prior distribution over $\mathbf{x}$, equation (1) induces a corresponding distribution in the data space, and the model parameters may be determined by maximum likelihood.
In standard factor analysis (Bartholomew 1987) the mapping $\mathbf{y}(\mathbf{x})$ is linear:
\[
\mathbf{t} = \mathbf{W}\mathbf{x} + \boldsymbol{\mu} + \boldsymbol{\epsilon}, \tag{2}
\]
where the latent variables $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ have a unit isotropic Gaussian distribution. The error, or noise, model is Gaussian, such that $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Psi})$ with $\boldsymbol{\Psi}$ diagonal; the $d \times q$ parameter matrix $\mathbf{W}$ contains the factor loadings; and $\boldsymbol{\mu}$ is a constant whose maximum-likelihood estimator is the mean of the data. Given this formulation, the model for $\mathbf{t}$ is also normal, $\mathcal{N}(\boldsymbol{\mu}, \mathbf{C})$, where the covariance $\mathbf{C} = \boldsymbol{\Psi} + \mathbf{W}\mathbf{W}^{\mathrm{T}}$. The motivation, and indeed key assumption, for this model is that, because of the diagonality of $\boldsymbol{\Psi}$, the observed variables $\mathbf{t}$ are conditionally independent given the values of the latent variables $\mathbf{x}$. Thus the reduced-dimensional distribution of $\mathbf{x}$ is intended to model the dependencies between the observed variables, while $\boldsymbol{\Psi}$ represents the independent noise. This is in contrast to PCA, which treats the inter-variable dependencies and the independent noise identically. In factor analysis the columns of $\mathbf{W}$ will generally not correspond to the principal subspace of the data. Furthermore, unlike PCA, there is no analytic solution for $\mathbf{W}$ and $\boldsymbol{\Psi}$, and so their values must be determined by iterative procedures. Note also that, because of the $\mathbf{W}\mathbf{W}^{\mathrm{T}}$ term, the covariance $\mathbf{C}$, and thus the likelihood, is invariant with respect to orthogonal post-multiplication of $\mathbf{W}$: that is, $\mathbf{W}\mathbf{R}$, where $\mathbf{R}$ is an arbitrary $q \times q$ orthogonal matrix, gives an equivalent $\mathbf{C}$.
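A minimal generative sketch of the factor analysis model (2), assuming illustrative parameter values that are not taken from the paper; it simply draws samples and confirms that their covariance approaches $\mathbf{C} = \boldsymbol{\Psi} + \mathbf{W}\mathbf{W}^{\mathrm{T}}$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, q, N = 5, 2, 200_000

W = rng.standard_normal((d, q))            # factor loadings (illustrative values)
mu = rng.standard_normal(d)                # data mean
psi = rng.uniform(0.1, 1.0, size=d)        # diagonal of the noise covariance Psi

x = rng.standard_normal((N, q))                       # x ~ N(0, I)
eps = rng.standard_normal((N, d)) * np.sqrt(psi)      # eps ~ N(0, Psi), Psi diagonal
t = x @ W.T + mu + eps                                # t = W x + mu + eps, equation (2)

C = np.diag(psi) + W @ W.T                            # model covariance C = Psi + W W^T
S_emp = np.cov(t, rowvar=False)
print(np.max(np.abs(S_emp - C)))                      # shrinks towards 0 as N grows
```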
3 A Probability Model for PCA

Because of the diagonal noise model, the factor loadings $\mathbf{W}$ will, in general, differ from the principal axes (even when taking the arbitrary rotation into account). As considered by Anderson (1963), principal components emerge when the data is assumed to comprise a systematic component, plus an independent error term for each variable with common variance $\sigma^2$. This implies that the diagonal elements of the error matrix $\boldsymbol{\Psi}$ in factor analysis above should be identical. Indeed, the similarity between the factor loadings and the principal axes has often been observed in situations in which the elements of $\boldsymbol{\Psi}$ are approximately equal (Rao 1955). Basilevsky (1994) further notes that when the model $\mathbf{W}\mathbf{W}^{\mathrm{T}} + \sigma^2\mathbf{I}$ is exact, and therefore equal to $\mathbf{S}$, the factor loadings are identifiable and can be determined analytically through eigen-decomposition of $\mathbf{S}$, without resort to iteration.

As well as assuming the accuracy of the model, such observations do not consider the maximum-likelihood context. By considering the model given by (2) with an isotropic noise structure, such that $\boldsymbol{\Psi} = \sigma^2\mathbf{I}$, we show in this paper that even when the covariance model is approximate, the maximum-likelihood estimator $\mathbf{W}_{\mathrm{ML}}$ is that matrix whose columns are the scaled and rotated principal eigenvectors of the sample covariance matrix $\mathbf{S}$. An important consequence of this derivation is that PCA may be expressed in terms of a density model, the definition of which now follows.

3.1 The Probability Model

For the isotropic noise model $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$, equation (2) implies a probability distribution over $\mathbf{t}$-space for a given $\mathbf{x}$ given by
\[
p(\mathbf{t} \mid \mathbf{x}) = (2\pi\sigma^2)^{-d/2} \exp\left\{ -\frac{1}{2\sigma^2} \|\mathbf{t} - \mathbf{W}\mathbf{x} - \boldsymbol{\mu}\|^2 \right\}. \tag{3}
\]
With a Gaussian prior over the latent variables defined by
\[
p(\mathbf{x}) = (2\pi)^{-q/2} \exp\left\{ -\tfrac{1}{2}\, \mathbf{x}^{\mathrm{T}}\mathbf{x} \right\}, \tag{4}
\]
we obtain the marginal distribution of $\mathbf{t}$ in the form
\[
p(\mathbf{t}) = \int p(\mathbf{t} \mid \mathbf{x})\, p(\mathbf{x})\, \mathrm{d}\mathbf{x} \tag{5}
\]
\[
= (2\pi)^{-d/2} |\mathbf{C}|^{-1/2} \exp\left\{ -\tfrac{1}{2} (\mathbf{t} - \boldsymbol{\mu})^{\mathrm{T}} \mathbf{C}^{-1} (\mathbf{t} - \boldsymbol{\mu}) \right\}, \tag{6}
\]
where the model covariance is
\[
\mathbf{C} = \sigma^2\mathbf{I} + \mathbf{W}\mathbf{W}^{\mathrm{T}}. \tag{7}
\]
Using Bayes' rule, the posterior distribution of the latent variables $\mathbf{x}$ given the observed $\mathbf{t}$ may be calculated:
\[
p(\mathbf{x} \mid \mathbf{t}) = (2\pi)^{-q/2} |\sigma^{-2}\mathbf{M}|^{1/2} \exp\left[ -\tfrac{1}{2} \left\{ \mathbf{x} - \mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}(\mathbf{t} - \boldsymbol{\mu}) \right\}^{\mathrm{T}} (\sigma^{-2}\mathbf{M}) \left\{ \mathbf{x} - \mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}(\mathbf{t} - \boldsymbol{\mu}) \right\} \right], \tag{8}
\]
where the posterior covariance matrix is given by
\[
\sigma^2 \mathbf{M}^{-1} = \sigma^2 \left( \sigma^2\mathbf{I} + \mathbf{W}^{\mathrm{T}}\mathbf{W} \right)^{-1}. \tag{9}
\]
Note that $\mathbf{M}$ is $q \times q$ while $\mathbf{C}$ is $d \times d$.
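Since the posterior (8) is Gaussian, it is fully specified by its mean $\mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}(\mathbf{t} - \boldsymbol{\mu})$ and covariance $\sigma^2\mathbf{M}^{-1}$. A short illustrative sketch (the parameter values are arbitrary and not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
d, q, sigma2 = 5, 2, 0.3
W = rng.standard_normal((d, q))            # illustrative loading matrix
mu = np.zeros(d)
t = rng.standard_normal(d)                 # a single observation

M = sigma2 * np.eye(q) + W.T @ W                   # M = sigma^2 I + W^T W  (q x q)
post_mean = np.linalg.solve(M, W.T @ (t - mu))     # posterior mean  M^{-1} W^T (t - mu)
post_cov = sigma2 * np.linalg.inv(M)               # posterior covariance  sigma^2 M^{-1}, equation (9)
```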
The log-likelihood of observing the data under this model is
\[
\mathcal{L} = \sum_{n=1}^{N} \ln\{p(\mathbf{t}_n)\} = -\frac{Nd}{2}\ln(2\pi) - \frac{N}{2}\ln|\mathbf{C}| - \frac{N}{2}\mathrm{tr}\left(\mathbf{C}^{-1}\mathbf{S}\right), \tag{10}
\]
where
\[
\mathbf{S} = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{t}_n - \boldsymbol{\mu})(\mathbf{t}_n - \boldsymbol{\mu})^{\mathrm{T}} \tag{11}
\]
is the sample covariance matrix of the observed $\{\mathbf{t}_n\}$. The parameters for this model can thus be estimated by maximising the log-likelihood $\mathcal{L}$, and an EM algorithm to achieve this is given in Appendix B.
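The paper defers the EM algorithm to Appendix B, which is not included in this extract. The sketch below is therefore an illustrative implementation of the log-likelihood (10) together with one EM-style update of $\mathbf{W}$ and $\sigma^2$ of the standard form used for this model; it should not be read as a transcription of the appendix.

```python
import numpy as np

def ppca_loglik(T, mu, W, sigma2):
    """Log-likelihood (10) of the data matrix T (N x d) under the PPCA model."""
    N, d = T.shape
    Tc = T - mu
    S = Tc.T @ Tc / N                                      # sample covariance, equation (11)
    C = sigma2 * np.eye(d) + W @ W.T                       # model covariance, equation (7)
    _, logdetC = np.linalg.slogdet(C)
    return -0.5 * N * (d * np.log(2 * np.pi) + logdetC
                       + np.trace(np.linalg.solve(C, S)))

def ppca_em_step(T, mu, W, sigma2):
    """One EM iteration for W and sigma^2 (illustrative form, not the paper's Appendix B)."""
    N, d = T.shape
    q = W.shape[1]
    Tc = T - mu
    M = sigma2 * np.eye(q) + W.T @ W
    Minv = np.linalg.inv(M)
    Ex = Tc @ W @ Minv                                     # posterior means <x_n>, one per row
    sumExx = N * sigma2 * Minv + Ex.T @ Ex                 # sum_n <x_n x_n^T>
    W_new = np.linalg.solve(sumExx, Ex.T @ Tc).T           # [sum_n (t_n - mu)<x_n>^T][sum_n <x_n x_n^T>]^{-1}
    resid = (np.sum(Tc ** 2)
             - 2.0 * np.sum(Ex * (Tc @ W_new))
             + np.trace(sumExx @ W_new.T @ W_new))
    sigma2_new = resid / (N * d)
    return W_new, sigma2_new
```

Iterating ppca_em_step and monitoring ppca_loglik until it stops increasing gives a local maximum of (10); Section 3.2 characterises what that maximum looks like.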
3.2 Properties of the Maximum-Likelihood Estimators

The log-likelihood (10) is maximised when the columns of $\mathbf{W}$ span the principal subspace of the data. To show this we consider the derivative of (10) with respect to $\mathbf{W}$:
\[
\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = N\left( \mathbf{C}^{-1}\mathbf{S}\mathbf{C}^{-1}\mathbf{W} - \mathbf{C}^{-1}\mathbf{W} \right), \tag{12}
\]
which may be obtained from standard matrix differentiation results [see Krzanowski and Marriott 1994, pp. 133]. In Appendix A it is shown, with $\mathbf{C}$ given by (7), that the only non-zero stationary points of (12) occur for
\[
\mathbf{W} = \mathbf{U}_q \left( \boldsymbol{\Lambda}_q - \sigma^2\mathbf{I} \right)^{1/2} \mathbf{R}, \tag{13}
\]
where the $q$ column vectors in $\mathbf{U}_q$ are eigenvectors of $\mathbf{S}$, with corresponding eigenvalues in the diagonal matrix $\boldsymbol{\Lambda}_q$, and $\mathbf{R}$ is an arbitrary $q \times q$ orthogonal rotation matrix. Furthermore, it is also shown that the stationary point corresponding to the global maximum of the likelihood occurs when $\mathbf{U}_q$ comprises the principal eigenvectors of $\mathbf{S}$, and that all other combinations of eigenvectors represent saddle-points of the likelihood surface. Thus, from (13), the columns of the maximum-likelihood estimator $\mathbf{W}_{\mathrm{ML}}$ contain the principal eigenvectors of $\mathbf{S}$, with a scaling determined by the corresponding eigenvalue and the parameter $\sigma^2$, and with arbitrary rotation.

It may also be shown that, for $\mathbf{W} = \mathbf{W}_{\mathrm{ML}}$, the maximum-likelihood estimator for $\sigma^2$ is given by
\[
\sigma^2_{\mathrm{ML}} = \frac{1}{d - q} \sum_{j=q+1}^{d} \lambda_j, \tag{14}
\]
which has a clear interpretation as the variance `lost' in the projection, averaged over the lost dimensions.
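Equations (13) and (14) give the maximum-likelihood solution in closed form. A sketch of how they might be evaluated (illustrative only; $\mathbf{R}$ is taken to be the identity):

```python
import numpy as np

def ppca_ml(T, q):
    """Closed-form ML estimates via equations (13) (with R = I) and (14)."""
    N, d = T.shape
    mu = T.mean(axis=0)                              # the ML estimator of mu is the sample mean
    Tc = T - mu
    S = Tc.T @ Tc / N                                # sample covariance, equation (11)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]                # sort eigenvalues into descending order
    lam, U = eigvals[order], eigvecs[:, order]
    sigma2_ml = lam[q:].mean()                       # (14): mean of the d - q discarded eigenvalues
    W_ml = U[:, :q] * np.sqrt(lam[:q] - sigma2_ml)   # (13): U_q (Lambda_q - sigma^2 I)^{1/2}, R = I
    return mu, W_ml, sigma2_ml
```

Substituting these estimates into the log-likelihood sketch of the previous section should match, up to numerical error, the value reached by iterating the EM update to convergence.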
It should be noted that the columns of $\mathbf{W}_{\mathrm{ML}}$ are not orthogonal, since
\[
\mathbf{W}_{\mathrm{ML}}^{\mathrm{T}} \mathbf{W}_{\mathrm{ML}} = \mathbf{R}^{\mathrm{T}} \left( \boldsymbol{\Lambda}_q - \sigma^2\mathbf{I} \right) \mathbf{R}, \tag{15}
\]
which is not diagonal for $\mathbf{R} \neq \mathbf{I}$. In common with factor analysis, and indeed many other iterative PCA algorithms, there exists an element of rotational ambiguity. An orthonormal basis for the principal subspace may easily be extracted using standard techniques if required. Furthermore, the actual principal axes may also be determined by noting that equation (15) represents an eigenvector decomposition of $\mathbf{W}_{\mathrm{ML}}^{\mathrm{T}} \mathbf{W}_{\mathrm{ML}}$, where the transposed rotation matrix $\mathbf{R}^{\mathrm{T}}$ is simply the matrix whose columns are the eigenvectors of the $q \times q$ matrix $\mathbf{W}_{\mathrm{ML}}^{\mathrm{T}} \mathbf{W}_{\mathrm{ML}}$.
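For instance, if a loading matrix has been estimated by an iterative procedure and is therefore only determined up to a post-multiplying rotation $\mathbf{R}$, the rotation can be stripped off exactly as described. The sketch below is an illustrative continuation of the earlier code (W_ml is assumed to come from ppca_ml above, and the eigenvalue-distinctness needed for a unique decomposition is assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
q = W_ml.shape[1]                                    # W_ml as returned by ppca_ml above

# Simulate the ambiguity: post-multiply W_ml by an arbitrary q x q orthogonal matrix.
R_arbitrary, _ = np.linalg.qr(rng.standard_normal((q, q)))
W_hat = W_ml @ R_arbitrary

# Equation (15): the eigenvectors of W_hat^T W_hat form the columns of R^T.
_, V = np.linalg.eigh(W_hat.T @ W_hat)
U_q = W_hat @ V                                      # rotation removed; columns lie along the principal axes
U_q /= np.linalg.norm(U_q, axis=0)                   # orthonormal basis, up to column order and sign
```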
However, with reference to the optimal reconstruction property of PCA, further processing of the parameters is not necessary. From (8) it may be seen that the posterior mean projection of $\mathbf{t}_n$ is given by $\langle \mathbf{x}_n \rangle = \mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}(\mathbf{t}_n - \boldsymbol{\mu})$. When $\sigma^2 \to 0$, $\mathbf{M}^{-1} \to (\mathbf{W}^{\mathrm{T}}\mathbf{W})^{-1}$ and $\mathbf{W}\mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}$ then becomes an orthogonal projection, and so PCA is recovered. However, the density model then becomes singular, and thus undefined, while for $\sigma^2 > 0$ the projection onto the manifold becomes skewed towards the origin as a result of the prior over $\mathbf{x}$. Because of this, $\mathbf{W}\langle \mathbf{x}_n \rangle$ is not an orthogonal projection of $\mathbf{t}_n$. However, each data point may still be optimally reconstructed from the latent variable by taking this skewing into account. With $\mathbf{W} = \mathbf{W}_{\mathrm{ML}}$ the required reconstruction is given by
\[
\hat{\mathbf{t}}_n = \mathbf{W}_{\mathrm{ML}} \left\{ \mathbf{W}_{\mathrm{ML}}^{\mathrm{T}} \mathbf{W}_{\mathrm{ML}} \right\}^{-1} \mathbf{M} \langle \mathbf{x}_n \rangle + \boldsymbol{\mu}, \tag{16}
\]
and is derived in Appendix C. Thus the latent variables convey the necessary information to reconstruct the original data vector optimally, even in the case of $\sigma^2 > 0$.
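A sketch of the posterior-mean projection and the reconstruction (16); the function below is illustrative and simply restates the equations above in code:

```python
import numpy as np

def ppca_reconstruct(T, mu, W, sigma2):
    """Posterior means <x_n> and the optimal reconstructions of equation (16)."""
    q = W.shape[1]
    Tc = T - mu
    M = sigma2 * np.eye(q) + W.T @ W
    Ex = Tc @ W @ np.linalg.inv(M)                 # <x_n> = M^{-1} W^T (t_n - mu), one per row
    # Equation (16): t_hat_n = W (W^T W)^{-1} M <x_n> + mu, undoing the shrinkage towards the origin.
    T_hat = Ex @ M @ np.linalg.inv(W.T @ W) @ W.T + mu
    return Ex, T_hat
```

As $\sigma^2 \to 0$, $\mathbf{M} \to \mathbf{W}^{\mathrm{T}}\mathbf{W}$ and this reduces to the ordinary PCA reconstruction $\mathbf{W}\mathbf{x}_n + \boldsymbol{\mu}$.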
4 Discussion

In this paper we have shown how principal component analysis may be viewed as a maximum-likelihood procedure based on a probability density model of the observed data.

In addition, we have given an EM algorithm for determining the necessary model parameters, and although we are not necessarily advocating that standard principal components should be estimated in this way, the EM algorithm plays a crucial rôle when, for example, extending the approach to mixture models. (Even for standard PCA, there may be an advantage in an iterative

References

- The Nature of Statistical Learning Theory (book)
- Support-Vector Networks (journal article)
- Statistical Analysis with Missing Data (book)
- Principal Component Analysis (book)
- A training algorithm for optimal margin classifiers (conference proceedings)