
Fast and robust fixed-point algorithms for independent component analysis

01 May 1999-IEEE Transactions on Neural Networks (IEEE)-Vol. 10, Iss: 3, pp 626-634
TL;DR: Using maximum entropy approximations of differential entropy, the paper introduces a family of new contrast (objective) functions for ICA that enable both estimation of the whole decomposition by minimizing mutual information and estimation of individual independent components as projection pursuit directions.
Abstract: Independent component analysis (ICA) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. We use a combination of two different approaches for linear ICA: Comon's information theoretic approach and the projection pursuit approach. Using maximum entropy approximations of differential entropy, we introduce a family of new contrast functions for ICA. These contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. The statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. Finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions.

Summary (3 min read)

Introduction

  • For computational and conceptual simplicity, the representation is often sought as a linear transformation of the original data.
  • The authors treat in this paper the problem of estimating the transformation given by independent component analysis (ICA) [7], [27].
  • Thus this method is a special case of redundancy reduction [2].
  • Using the concept of differential entropy, one can define the mutual information between the random variables [7], [8].

B. Contrast Functions through Approximations of Negentropy

  • The authors use here the new approximations developed in [19], based on the maximum entropy principle.
  • In the simplest case, these new approximations are of the form (6) where is practically any nonquadratic function, is an irrelevant constant, and is a Gaussian variable of zero mean and unit variance (i.e., standardized).
  • The random variable is assumed to be of zero mean and unit variance.
  • Maximizing the sum of one-unit contrast functions, and taking into account the constraint of decorrelation, one obtains the following optimization problem: maximize wrt. under constraint (8) where at the maximum, every vector gives one of the rows of the matrix , and the ICA transformation is then given by .

A. Behavior Under the ICA Data Model

  • The authors analyze the behavior of the estimators given above when the data follows the ICA data model (2), with a square mixing matrix.
  • For simplicity, the authors consider only the estimation of a single independent component, and neglect the effects of decorrelation.
  • In [18], evaluation of asymptotic variances was addressed using a related family of contrast functions.
  • In fact, it can be seen that the results in [18] are valid even in this case, and thus the authors have the following theorem.
  • In particular, if one chooses a function G that is bounded, g(u)u is also bounded, and the estimator is rather robust against outliers.

B. Practical Choice of Contrast Function

  • 1) Performance in the Exponential Power Family: Now the authors shall treat the question of choosing the contrast function G in practice.
  • For α < 2, one obtains a sparse, super-Gaussian density (i.e., a density of positive kurtosis).
  • Taking also into account the fact that most independent components encountered in practice are super-Gaussian [3], [25], one reaches the conclusion that as a general-purpose contrast function, one should choose a function G that resembles G(u) = |u|^α with α < 2 (13).
  • This point is, however, so application-dependent that the authors cannot say much in general.
  • The authors will show below that the fixed-point algorithms have very appealing convergence properties, making them a very interesting alternative to adaptive learning rules in environments where fast real-time adaptation is not necessary.

B. Fixed-Point Algorithm for One Unit

  • To begin with, the authors shall derive the fixed-point algorithm for one unit, with sphered data.
  • Denoting the function on the left-hand side of (17) by F, the authors obtain its Jacobian matrix as in (18).
  • Due to the approximations used in the derivation of the fixed-point algorithm, one may wonder if it really converges to the right points.
  • Moreover, it is proven that the convergence is quadratic, as usual with Newton methods.
  • If the convergence is not satisfactory, one may then increase the sample size.

C. Fixed-Point Algorithm for Several Units

  • The one-unit algorithm of the preceding section can be used to construct a system of neurons to estimate the whole ICA transformation using the multiunit contrast function in (8).
  • To prevent different neurons from converging to the same maxima, the authors must decorrelate the outputs after every iteration.
  • When the authors have estimated p independent components, or p vectors w_1, ..., w_p, they run the one-unit fixed-point algorithm for w_{p+1}, and after every iteration step subtract from w_{p+1} the “projections” of the previously estimated p vectors, and then renormalize: 1. Let w_{p+1} = w_{p+1} - Σ_{j=1}^{p} (w_{p+1}^T w_j) w_j; 2. Let w_{p+1} = w_{p+1} / sqrt(w_{p+1}^T w_{p+1}) (24). A sketch of this deflation step is given after this list.
  • Finally, let us note that explicit inversion of the matrix C in (22) or (23) can be avoided by using an identity that is valid for any decorrelating W.
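A minimal sketch of that deflation step (a Gram-Schmidt-like orthogonalization against the already estimated vectors; this is our own illustrative code for sphered data, not taken from the paper):

```python
import numpy as np

def deflate(w, W_prev):
    """Subtract from w its projections on previously found vectors and renormalize.

    w      : (m,) current weight vector after a one-unit fixed-point step.
    W_prev : (p, m) array whose rows are the p already estimated vectors w_1 ... w_p.
    """
    if len(W_prev):
        w = w - W_prev.T @ (W_prev @ w)     # w <- w - sum_j (w^T w_j) w_j
    return w / np.linalg.norm(w)            # renormalize to unit length
```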

D. Properties of the Fixed-Point Algorithm

  • The fixed-point algorithm and the underlying contrast functions have a number of desirable properties when compared with existing methods for ICA.
  • This illustrates the fast convergence of the fixed-point algorithm.
  • This resulted in a generalization of the kurtosis-based approach in [7] and [9], and also enabled estimation of the independent components one by one.
  • Next, a new family of algorithms for optimizing the contrast functions was introduced.

A. Proof of Convergence of Algorithm (20)

  • The convergence is proven under the assumptions that first, the data follows the ICA data model (2) and second, that the expectations are evaluated exactly.
  • The authors must also make the technical assumption (27), which can be considered a generalization of the condition, valid when kurtosis is used as contrast, that the kurtosis of the independent components must be nonzero.
  • If (27) is true for a subset of independent components, the authors can estimate just those independent components.
  • This shows clearly that, under assumption (27), the algorithm converges to a vector that corresponds to exactly one of the independent components.
  • In other cases, the convergence is quadratic.

B. Proof of Convergence of (26)

  • Thus, after k iterations, the eigenvalues of W W^T are obtained by applying a fixed scalar function k times to the eigenvalues of W W^T for the original matrix W before the iterations (one way to perform the decorrelation step itself is sketched after this list).
  • Denoting by W the weight matrix whose rows are the weight vectors of the neurons, the authors obtain the learning rule (39), where μ(t) is the learning rate sequence, and the function g is applied separately on every component of its vector argument.
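For the symmetric (non-deflationary) variant referred to here, one standard way to enforce the decorrelation constraint W W^T = I after every iteration is the inverse-square-root orthogonalization below. This is a sketch of ours and uses the closed-form orthogonalization rather than the iterative scheme whose convergence is analyzed in the appendix:

```python
import numpy as np

def symmetric_decorrelation(W):
    """Replace W by (W W^T)^{-1/2} W so that the rows of W become orthonormal."""
    d, E = np.linalg.eigh(W @ W.T)                      # eigendecomposition of W W^T
    return E @ np.diag(1.0 / np.sqrt(d)) @ E.T @ W      # (W W^T)^{-1/2} W
```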


Fast and Robust Fixed-Point Algorithms
for Independent Component Analysis
Aapo Hyvärinen
Abstract: Independent component analysis (ICA) is a statistical
method for transforming an observed multidimensional random
vector into components that are statistically as independent from
each other as possible. In this paper, we use a combination of
two different approaches for linear ICA: Comon’s information-
theoretic approach and the projection pursuit approach. Using
maximum entropy approximations of differential entropy, we
introduce a family of new contrast (objective) functions for ICA.
These contrast functions enable both the estimation of the whole
decomposition by minimizing mutual information, and estima-
tion of individual independent components as projection pursuit
directions. The statistical properties of the estimators based on
such contrast functions are analyzed under the assumption of
the linear mixture model, and it is shown how to choose contrast
functions that are robust and/or of minimum variance. Finally, we
introduce simple fixed-point algorithms for practical optimization
of the contrast functions. These algorithms optimize the contrast
functions very fast and reliably.
I. INTRODUCTION
A central problem in neural-network research, as well as in statistics and signal processing, is finding a suitable representation or transformation of the data. For computational and conceptual simplicity, the representation is often sought as a linear transformation of the original data. Let us denote by x = (x_1, ..., x_m)^T a zero-mean m-dimensional random variable that can be observed, and by y = (y_1, ..., y_n)^T its n-dimensional transform. Then the problem is to determine a constant (weight) matrix W so that the linear transformation of the observed variables

y = Wx    (1)
has some suitable properties. Several principles and methods
have been developed to find such a linear representation,
including principal component analysis [30], factor analysis
[15], projection pursuit [12], [16], independent component
analysis [27], etc. The transformation may be defined using
such criteria as optimal dimension reduction, statistical “interestingness” of the resulting components y_i, simplicity of the transformation, or other criteria, including application-oriented ones.
We treat in this paper the problem of estimating the trans-
formation given by (linear) independent component analysis
(ICA) [7], [27]. As the name implies, the basic goal in
determining the transformation is to find a representation
in which the transformed components
are statistically as
independent from each other as possible. Thus this method is
a special case of redundancy reduction [2].
Two promising applications of ICA are blind source sepa-
ration and feature extraction. In blind source separation [27],
the observed values of x correspond to a realization of an m-dimensional discrete-time signal x(t), t = 1, 2, .... Then the components s_i(t) are called source signals, which are usually original, uncorrupted signals or noise sources. Often such sources are statistically independent from each other, and thus the signals can be recovered from linear mixtures x_i by finding a transformation in which the transformed signals are as independent as possible, as in ICA. In feature extraction [4], [25], s_i is the coefficient of the ith feature in the observed data vector x. The use of ICA for feature extraction is motivated by
results in neurosciences that suggest that the similar principle
of redundancy reduction [2], [32] explains some aspects of
the early processing of sensory data by the brain. ICA has
also applications in exploratory data analysis in the same way
as the closely related method of projection pursuit [12], [16].
In this paper, new objective (contrast) functions and algo-
rithms for ICA are introduced. Starting from an information-
theoretic viewpoint, the ICA problem is formulated as min-
imization of mutual information between the transformed
variables
, and a new family of contrast functions for ICA
is introduced (Section II). These contrast functions can also
be interpreted from the viewpoint of projection pursuit, and
enable the sequential (one-by-one) extraction of independent
components. The behavior of the resulting estimators is then
evaluated in the framework of the linear mixture model,
obtaining guidelines for choosing among the many contrast
functions contained in the introduced family. Practical choice
of the contrast function is discussed as well, based on the
statistical criteria together with some numerical and pragmatic
criteria (Section III). For practical maximization of the contrast
functions, we introduce a novel family of fixed-point algo-
rithms (Section IV). These algorithms are shown to have very
appealing convergence properties. Simulations confirming the
usefulness of the novel contrast functions and algorithms are
reported in Section V, together with references to real-life
experiments using these methods. Some conclusions are drawn
in Section VI.
II. CONTRAST FUNCTIONS FOR ICA
A. ICA Data Model, Minimization of Mutual
Information, and Projection Pursuit
One popular way of formulating the ICA problem is to
consider the estimation of the following generative model for
the data [1], [3], [5], [6], [23], [24], [27], [28], [31]:

x = As    (2)

where x is an observed m-dimensional vector, s is an n-dimensional (latent) random vector whose components are assumed mutually independent, and A is a constant m x n matrix to be estimated. It is usually further assumed that the dimensions of x and s are equal, i.e., m = n; we make this assumption in the rest of the paper. A noise vector may also be present. The matrix W defining the transformation as in (1) is then obtained as the (pseudo)inverse of the estimate of the matrix A. Non-Gaussianity of the independent components is necessary for the identifiability of the model (2), see [7].
Comon [7] showed how to obtain a more general formulation for ICA that does not need to assume an underlying data model. This definition is based on the concept of mutual information. First, we define the differential entropy H of a random vector y = (y_1, ..., y_n)^T with density f(.) as follows [33]:

H(y) = - ∫ f(y) log f(y) dy    (3)

Differential entropy can be normalized to give rise to the definition of negentropy, which has the appealing property of being invariant for linear transformations. The definition of negentropy J is given by

J(y) = H(y_gauss) - H(y)    (4)

where y_gauss is a Gaussian random variable of the same covariance matrix as y. Negentropy can also be interpreted as a measure of nongaussianity [7]. Using the concept of differential entropy, one can define the mutual information I between the n (scalar) random variables y_i, i = 1, ..., n [7], [8]. Mutual information is a natural measure of the dependence between random variables. It is particularly interesting to express mutual information using negentropy, constraining the variables to be uncorrelated. In this case, we have [7]

I(y_1, y_2, ..., y_n) = J(y) - Σ_i J(y_i)    (5)
Since mutual information is the information-theoretic mea-
sure of the independence of random variables, it is natural
to use it as the criterion for finding the ICA transform.
Thus we define in this paper, following [7], the ICA of
a random vector
as an invertible transformation
as in (1) where the matrix is determined so that
the mutual information of the transformed components
is
minimized. Note that mutual information (or the independence
of the components) is not affected by multiplication of the
components by scalar constants. Therefore, this definition only
defines the independent components up to some multiplicative
constants. Moreover, the constraint of uncorrelatedness of the
is adopted in this paper. This constraint is not strictly
necessary, but simplifies the computations considerably.
Because negentropy is invariant for invertible linear trans-
formations [7], it is now obvious from (5) that finding an
invertible transformation W that minimizes the mutual infor-
mation is roughly equivalent to finding directions in which the
negentropy is maximized. This formulation of ICA also shows
explicitly the connection between ICA and projection pursuit
[11], [12], [16], [26]. In fact, finding a single direction that
maximizes negentropy is a form of projection pursuit, and
could also be interpreted as estimation of a single independent
component [24].
B. Contrast Functions through Approximations of Negentropy
To use the definition of ICA given above, a simple estimate
of the negentropy (or of differential entropy) is needed. We use
here the new approximations developed in [19], based on the
maximum entropy principle. In [19] it was shown that these
approximations are often considerably more accurate than the
conventional, cumulant-based approximations in [1], [7], and
[26]. In the simplest case, these new approximations are of the form

J(y) ≈ c [E{G(y)} - E{G(ν)}]^2    (6)

where G is practically any nonquadratic function, c is an irrelevant constant, and ν is a Gaussian variable of zero mean and unit variance (i.e., standardized). The random variable y is assumed to be of zero mean and unit variance. For symmetric variables, this is a generalization of the cumulant-based approximation in [7], which is obtained by taking G(y) = y^4. The choice of the function G is deferred to Section III.
The approximation of negentropy given above in (6) gives
readily a new objective function for estimating the ICA
transform in our framework. First, to find one independent component, or projection pursuit direction, as y = w^T x, we maximize the function J_G given by

J_G(w) = [E{G(w^T x)} - E{G(ν)}]^2    (7)

where w is an m-dimensional (weight) vector constrained so that E{(w^T x)^2} = 1 (we can fix the scale arbitrarily). Several
independent components can then be estimated one-by-one
using a deflation scheme, see Section IV.
Second, using the approach of minimizing mutual infor-
mation, the above one-unit contrast function can be simply
extended to compute the whole matrix
in (1). To do
this, recall from (5) that mutual information is minimized (under the constraint of decorrelation) when the sum of the negentropies of the components is maximized. Maximizing the sum of n one-unit contrast functions, and taking into account the constraint of decorrelation, one obtains the following optimization problem:

maximize  Σ_{i=1}^{n} J_G(w_i)  wrt. w_i, i = 1, ..., n,
under constraint  E{(w_k^T x)(w_j^T x)} = δ_jk    (8)

where at the maximum, every vector w_i gives one of the rows of the matrix W, and the ICA transformation is then given by y = Wx. Thus we have defined our ICA estimator by an optimization problem. Below we analyze the properties of the estimators, giving guidelines for the choice of G, and propose algorithms for solving the optimization problems in practice.
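As a concrete illustration of the one-unit contrast (7), the following NumPy sketch (our own code, not part of the paper; function names are ours) evaluates J_G(w) from a data sample, replacing expectations by sample means and using G(u) = log cosh(u) as an example nonquadratic function. The Gaussian reference term is estimated by Monte Carlo here, although for common choices of G it can be computed analytically.

```python
import numpy as np

def one_unit_contrast(w, X, G=lambda u: np.log(np.cosh(u)), n_gauss=10**6, rng=None):
    """Approximate J_G(w) = [E{G(w^T x)} - E{G(nu)}]^2 from a data sample.

    w : (m,) weight vector (data assumed zero-mean and sphered, E{(w^T x)^2} = 1).
    X : (N, m) array of observations x(1), ..., x(N) stored as rows.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    y = X @ w                                           # projections w^T x(t)
    E_G_y = np.mean(G(y))                               # sample mean of G(w^T x)
    E_G_nu = np.mean(G(rng.standard_normal(n_gauss)))   # E{G(nu)}, nu ~ N(0, 1)
    return (E_G_y - E_G_nu) ** 2
```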
III. ANALYSIS OF ESTIMATORS AND CHOICE OF CONTRAST FUNCTION
A. Behavior Under the ICA Data Model
In this section, we analyze the behavior of the estimators
given above when the data follows the ICA data model (2),
with a square mixing matrix. For simplicity, we consider only
the estimation of a single independent component, and neglect
the effects of decorrelation. Let us denote by w a vector obtained by maximizing J_G in (7). The vector w is thus an estimator of a row of the matrix W.
1) Consistency: First of all, we prove that w is a (locally) consistent estimator for one component in the ICA data model. To prove this, we have the following theorem.
Theorem 1: Assume that the input data follows the ICA data model in (2), and that G is a sufficiently smooth even function. Then the set of local maxima of J_G(w) under the constraint E{(w^T x)^2} = 1 includes the ith row of the inverse of the mixing matrix A such that the corresponding independent component s_i fulfills

E{s_i g(s_i) - g'(s_i)} [E{G(s_i)} - E{G(ν)}] > 0    (9)

where g(.) is the derivative of G(.), and ν is a standardized Gaussian variable.
This theorem can be considered a corollary of the theorem in [24]. The condition in Theorem 1 seems to be true for most reasonable choices of G, and distributions of the s_i. In particular, if G(y) = y^4, the condition is fulfilled for any distribution of nonzero kurtosis. In that case, it can also be proven that there are no spurious optima [9].
2) Asymptotic Variance: Asymptotic variance is one crite-
rion for choosing the function
to be used in the contrast
function. Comparison of, say, the traces of the asymptotic co-
variance matrices of two estimators enables direct comparison
of the mean-square error of the estimators. In [18], evaluation
of asymptotic variances was addressed using a related family
of contrast functions. In fact, it can be seen that the results
in [18] are valid even in this case, and thus we have the
following theorem.
Theorem 2: The trace of the asymptotic (co)variance of w is minimized when G is of the form

G_opt(u) = c_1 log f_i(u) + c_2 u^2 + c_3    (10)

where f_i(.) is the density function of s_i, and c_1, c_2, c_3 are arbitrary constants.
For simplicity, one can choose G_opt(u) = log f_i(u).
Thus the optimal contrast function is the same as the one
obtained by the maximum likelihood approach [34], or the
infomax approach [3]. Almost identical results have also been
obtained in [5] for another algorithm. The theorem above
treats, however, the one-unit case instead of the multiunit case
treated by the other authors.
3) Robustness: Another very attractive property of an es-
timator is robustness against outliers [14]. This means that
single, highly erroneous observations do not have much influ-
ence on the estimator. To obtain a simple form of robustness
called B-robustness, we would like the estimator to have a
bounded influence function [14]. Again, we can adapt the
results in [18]. It turns out to be impossible to have a
completely bounded influence function, but we do have a
simpler form of robustness, as stated in the following theorem.
Theorem 3: Assume that the data x is whitened (sphered) in a robust manner (see Section IV for this form of preprocessing). Then the influence function of the estimator w is never bounded for all x. However, if g(u)u is bounded, the influence function is bounded in sets of a certain restricted form, where g is the derivative of G.
In particular, if one chooses a function G that is bounded, g(u)u is also bounded, and the estimator is rather robust against outliers. If this is not possible, one should at least choose a function G(u) that does not grow very fast when |u| grows.
B. Practical Choice of Contrast Function
1) Performance in the Exponential Power Family: Now we
shall treat the question of choosing the contrast function G in practice. It is useful to analyze the implications of the theoretical results of the preceding sections by considering the following exponential power family of density functions

f_α(s) = k_1 exp(k_2 |s|^α)    (11)

where α is a positive parameter, and k_1, k_2 are normalization constants that ensure that f_α is a probability density of unit variance. For different values of α, the densities in this family exhibit different shapes. For α < 2, one obtains a sparse, super-Gaussian density (i.e., a density of positive kurtosis). For α = 2, one obtains the Gaussian distribution, and for α > 2, a sub-Gaussian density (i.e., a density of negative kurtosis). Thus the densities in this family can be used as examples of different non-Gaussian densities.
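For experimentation, sources from this family can be generated with the short sketch below (ours, not from the paper). It uses the standard gamma transform for densities proportional to exp(-|s|^α) and then standardizes the sample empirically to zero mean and unit variance.

```python
import numpy as np

def sample_exp_power(alpha, size, rng=None):
    """Draw samples with density proportional to exp(-|s|**alpha), standardized.

    alpha < 2 gives a super-Gaussian source, alpha = 2 Gaussian, alpha > 2 sub-Gaussian.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.gamma(shape=1.0 / alpha, scale=1.0, size=size)   # |s|**alpha ~ Gamma(1/alpha, 1)
    s = rng.choice([-1.0, 1.0], size=size) * u ** (1.0 / alpha)
    return (s - s.mean()) / s.std()                          # enforce zero mean, unit variance
```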
Using Theorem 2, one sees that in terms of asymptotic
variance, an optimal contrast function for estimating an independent component whose density function equals f_α is of the form

G_α(u) = |u|^α    (12)

where the arbitrary constants have been dropped for simplicity. This implies roughly that for super-Gaussian (respectively, sub-Gaussian) densities, the optimal contrast function is a function that grows slower than quadratically (respectively, faster than quadratically). Next, recall from Section III-A-3 that if G grows fast with |u|, the estimator becomes highly nonrobust against outliers. Taking also into account the fact that most independent components encountered in practice are super-Gaussian [3], [25], one reaches the conclusion that as a general-purpose contrast function, one should choose a function G that resembles rather

G(u) = |u|^α,  where α < 2    (13)

The problem with such contrast functions is, however, that they are not differentiable at zero for α ≤ 1. Thus it is better to use approximating differentiable functions that have the same kind of qualitative behavior. Considering α = 1, in which case one has a double exponential density, one could use instead the
function G_1(u) = (1/a_1) log cosh(a_1 u), where a_1 ≥ 1 is a constant. Note that the derivative of G_1 is then the familiar tanh function (for a_1 = 1). In the case of α < 1, i.e., highly super-Gaussian independent components, one could approximate the behavior of G_α for large u using a Gaussian function (with a minus sign): G_2(u) = -(1/a_2) exp(-a_2 u^2/2), where a_2 is a constant. The derivative of this function is like a sigmoid for small values, but goes to zero for larger values. Note that this function also fulfills the condition in Theorem 3, thus providing an estimator that is as robust as possible in the framework of estimators of type (8). As regards the constants, we have found experimentally 1 ≤ a_1 ≤ 2 and a_2 = 1 to provide good approximations.
2) Choosing the Contrast Function in Practice: The theo-
retical analysis given above gives some guidelines as for the
choice of
. In practice, however, there are also other criteria
that are important, in particular the following two.
First, we have computational simplicity: The contrast func-
tion should be fast to compute. It must be noted that poly-
nomial functions tend to be faster to compute than, say, the
hyperbolic tangent. However, nonpolynomial contrast func-
tions could be replaced by piecewise linear approximations
without losing the benefits of nonpolynomial functions.
The second point to consider is the order in which the
components are estimated, if one-by-one estimation is used.
We can influence this order because the basins of attraction of
the maxima of the contrast function have different sizes. Any
ordinary method of optimization tends to first find maxima that
have large basins of attraction. Of course, it is not possible
to determine with certainty this order, but a suitable choice
of the contrast function means that independent components
with certain distributions tend to be found first. This point is,
however, so application-dependent that we cannot say much
in general.
Thus, taking into account all these criteria, we reach the
following general conclusion. We have basically the following
choices for the contrast function (for future use, we also give
their derivatives):
G_1(u) = (1/a_1) log cosh(a_1 u),    g_1(u) = tanh(a_1 u)    (14)
G_2(u) = -(1/a_2) exp(-a_2 u^2/2),    g_2(u) = u exp(-a_2 u^2/2)    (15)
G_3(u) = (1/4) u^4,    g_3(u) = u^3    (16)

where 1 ≤ a_1 ≤ 2 and a_2 ≈ 1 are constants, and piecewise linear approximations of (14) and (15) may also be used. The
benefits of the different contrast functions may be summarized as follows:
  • G_1 is a good general-purpose contrast function;
  • when the independent components are highly super-Gaussian, or when robustness is very important, G_2 may be better;
  • if computational overhead must be reduced, piecewise linear approximations of G_1 and G_2 may be used;
  • using kurtosis, or G_3, is justified on statistical grounds only for estimating sub-Gaussian independent components when there are no outliers.
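To make the three choices concrete, here is a small Python sketch (ours, not part of the paper) implementing the nonlinearities g_1, g_2, g_3 of (14)-(16) together with their first derivatives, which are the quantities needed by the fixed-point iterations of Section IV. The default constants a_1 = 1 and a_2 = 1 lie in the ranges recommended in the text.

```python
import numpy as np

def g_logcosh(u, a1=1.0):
    """g_1(u) = tanh(a1*u) and its derivative; G_1(u) = (1/a1) log cosh(a1*u)."""
    t = np.tanh(a1 * u)
    return t, a1 * (1.0 - t ** 2)

def g_gauss(u, a2=1.0):
    """g_2(u) = u*exp(-a2*u^2/2) and its derivative; G_2 is minus a Gaussian bump."""
    e = np.exp(-a2 * u ** 2 / 2.0)
    return u * e, (1.0 - a2 * u ** 2) * e

def g_kurtosis(u):
    """g_3(u) = u^3 and its derivative; G_3(u) = u^4/4 (kurtosis-based contrast)."""
    return u ** 3, 3.0 * u ** 2
```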
Finally, we emphasize that, in contrast to many other ICA
methods, our framework provides estimators that work for
(practically) any distributions of the independent components
and for any choice of the contrast function. The choice of the
contrast function is only important if one wants to optimize
the performance of the method.
IV. FIXED-POINT ALGORITHMS FOR ICA
A. Introduction
In the preceding sections, we introduced new contrast (or
objective) functions for ICA based on minimization of mutual
information (and projection pursuit), analyzed some of their
properties, and gave guidelines for the practical choice of the
function G used in the contrast functions. In practice, one also
needs an algorithm for maximizing the contrast functions in
(7) or (8).
A simple method to maximize the contrast function would
be to use stochastic gradient descent; the constraint could be
taken into account by a bigradient feedback. This leads to
neural (adaptive) algorithms that are closely related to those
introduced in [24]. We show in the Appendix B how to modify
the algorithms in [24] to minimize the contrast functions used
in this paper.
The advantage of neural on-line learning rules is that the
inputs
can be used in the algorithm at once, thus enabling
faster adaptation in a nonstationary environment. A resulting
tradeoff, however, is that the convergence is slow, and depends
on a good choice of the learning rate sequence, i.e., the step
size at each iteration. A bad choice of the learning rate can, in
practice, destroy convergence. Therefore, it would be important
in practice to make the learning faster and more reliable.
This can be achieved by the fixed-point iteration algorithms
that we introduce here. In the fixed-point algorithms, the
computations are made in batch (or block) mode, i.e., a
large number of data points are used in a single step of
the algorithm. In other respects, however, the algorithms
may be considered neural. In particular, they are parallel,
distributed, computationally simple, and require little memory
space. We will show below that the fixed-point algorithms
have very appealing convergence properties, making them
a very interesting alternative to adaptive learning rules in
environments where fast real-time adaptation is not necessary.
Note that our basic ICA algorithms require a preliminary
sphering or whitening of the data x, though also some versions for nonsphered data will be given. Sphering means that the original observed variable, say v, is linearly transformed to a variable x such that the correlation matrix of x equals unity: E{xx^T} = I. This transformation is always possible; indeed, it can be accomplished by classical PCA. For details, see [7] and [12].
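A minimal whitening step along these lines might look as follows. This is a sketch of ours assuming zero-mean data stored row-wise; the eigenvalue-based transform is one standard way to realize the PCA whitening mentioned above, not necessarily the exact procedure used in the paper's experiments.

```python
import numpy as np

def whiten(V):
    """Sphere the data: return X with correlation matrix ~ I, plus the whitening matrix.

    V : (N, m) array of zero-mean observations v(1), ..., v(N) as rows.
    """
    C = np.cov(V, rowvar=False)                            # sample covariance E{v v^T}
    d, E = np.linalg.eigh(C)                               # eigendecomposition C = E diag(d) E^T
    W_whiten = E @ np.diag(1.0 / np.sqrt(d)) @ E.T         # C^{-1/2}
    X = V @ W_whiten.T                                     # x = C^{-1/2} v
    return X, W_whiten
```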
B. Fixed-Point Algorithm for One Unit
To begin with, we shall derive the fixed-point algorithm
for one unit, with sphered data. First note that the maxima
of J_G(w) are obtained at certain optima of E{G(w^T x)}. According to the Kuhn–Tucker conditions [29], the optima of E{G(w^T x)} under the constraint E{(w^T x)^2} = ||w||^2 = 1 are obtained at points where

E{x g(w^T x)} - βw = 0    (17)

where β is a constant that can be easily evaluated to give β = E{w_0^T x g(w_0^T x)}, where w_0 is the value of w at the optimum. Let us try to solve this equation by Newton's method. Denoting the function on the left-hand side of (17) by F, we obtain its Jacobian matrix as

JF(w) = E{xx^T g'(w^T x)} - βI    (18)

To simplify the inversion of this matrix, we decide to approximate the first term in (18). Since the data is sphered, a reasonable approximation seems to be E{xx^T g'(w^T x)} ≈ E{xx^T} E{g'(w^T x)} = E{g'(w^T x)} I. Thus the Jacobian matrix becomes diagonal, and can easily be inverted. We also approximate β using the current value of w instead of w_0. Thus we obtain the following approximative Newton iteration

w+ = w - [E{x g(w^T x)} - βw] / [E{g'(w^T x)} - β],    w* = w+ / ||w+||    (19)

where w+ denotes the new value of w, β = E{w^T x g(w^T x)}, and the normalization has been added to improve the stability. This algorithm can be further simplified by multiplying both sides of the first equation in (19) by β - E{g'(w^T x)}. This gives the following fixed-point algorithm

w+ = E{x g(w^T x)} - E{g'(w^T x)} w,    w* = w+ / ||w+||    (20)
which was introduced in [17] using a more heuristic derivation.
An earlier version (for kurtosis only) was derived as a fixed-
point iteration of a neural learning rule in [23], which is where
its name comes from. We retain this name for the algorithm,
although in the light of the above derivation, it is rather a
Newton method than a fixed-point iteration.
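For illustration, a minimal NumPy sketch of the one-unit iteration (20) for sphered data could read as follows. This is our own code, with the expectations replaced by sample means and the tanh nonlinearity g_1 of (14) (with a_1 = 1) used as an example.

```python
import numpy as np

def fastica_one_unit(X, g=np.tanh, g_prime=lambda u: 1.0 - np.tanh(u) ** 2,
                     max_iter=200, tol=1e-6, rng=None):
    """One-unit fixed-point iteration (20) on sphered data X of shape (N, m)."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        y = X @ w                                                # projections w^T x(t)
        w_new = X.T @ g(y) / len(X) - np.mean(g_prime(y)) * w    # E{x g(w^T x)} - E{g'(w^T x)} w
        w_new /= np.linalg.norm(w_new)                           # renormalize to unit length
        if abs(abs(w_new @ w) - 1.0) < tol:                      # converged (up to sign)
            w = w_new
            break
        w = w_new
    return w
```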
Due to the approximations used in the derivation of the
fixed-point algorithm, one may wonder if it really converges
to the right points. First of all, since only the Jacobian matrix is
approximated, any convergence point of the algorithm must be
a solution of the Kuhn–Tucker condition in (17). In Appendix
A it is further proven that the algorithm does converge to the
right extrema (those corresponding to maxima of the contrast
function), under the assumption of the ICA data model.
Moreover, it is proven that the convergence is quadratic,
as usual with Newton methods. In fact, if the densities of
the
are symmetric, the convergence is even cubic. The
convergence proven in the Appendix is local. However, in
the special case where kurtosis is used as a contrast function,
i.e., if G_3 in (16) is used, the convergence is proven globally.
The above derivation also enables a useful modification of
the fixed-point algorithm. It is well known that the conver-
gence of the Newton method may be rather uncertain. To
ameliorate this, one may add a step size in (19), obtaining the stabilized fixed-point algorithm

w+ = w - μ [E{x g(w^T x)} - βw] / [E{g'(w^T x)} - β],    w* = w+ / ||w+||    (21)

where β = E{w^T x g(w^T x)} as above, and μ is a step size parameter that may change with the iteration count. Taking a μ that is much smaller than unity (say, 0.1 or 0.01), the algorithm (21) converges with much more certainty. In particular, it is often a good strategy to start with μ = 1, in which case the algorithm is equivalent to the original fixed-point algorithm in (20). If convergence seems problematic, μ may then be decreased gradually until convergence is satisfactory. Note that we thus have a continuum between a Newton optimization method, corresponding to μ = 1, and a gradient descent method, corresponding to a very small μ.
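A hedged sketch of the stabilized update (21), again with sample means in place of expectations (our own code; the fixed step size shown is only an example, not a schedule prescribed by the paper):

```python
import numpy as np

def fastica_one_unit_stabilized(X, g, g_prime, mu=1.0, n_iter=200, rng=None):
    """Stabilized fixed-point iteration (21) on sphered data X of shape (N, m)."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = X @ w
        beta = np.mean(y * g(y))                              # beta = E{w^T x g(w^T x)}
        grad = X.T @ g(y) / len(X) - beta * w                 # E{x g(w^T x)} - beta w
        w = w - mu * grad / (np.mean(g_prime(y)) - beta)      # damped Newton step
        w /= np.linalg.norm(w)                                # keep ||w|| = 1
    return w
```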
The fixed-point algorithms may also be simply used for the
original, that is, not sphered data. Transforming the data back
to the nonsphered variables, one sees easily that the following
modification of the algorithm (20) works for nonsphered data:

w+ = C^{-1} E{x g(w^T x)} - E{g'(w^T x)} w,    w* = w+ / sqrt((w+)^T C w+)    (22)

where C = E{xx^T} is the covariance matrix of the data. The stabilized version, algorithm (21), can also be modified in the same way to work with nonsphered data (23). Using these two algorithms, one obtains directly an independent component as the linear combination w^T x, where x need not be sphered (prewhitened). These modifications presuppose,
of course, that the covariance matrix is not singular. If it is
singular or near-singular, the dimension of the data must be
reduced, for example with PCA [7], [28].
In practice, the expectations in the fixed-point algorithms
must be replaced by their estimates. The natural estimates are
of course the corresponding sample means. Ideally, all the
data available should be used, but this is often not a good idea
because the computations may become too demanding. Then
the averages can be estimated using a smaller sample, whose size
may have a considerable effect on the accuracy of the final
estimates. The sample points should be chosen separately at
every iteration. If the convergence is not satisfactory, one may
then increase the sample size. A reduction of the step size
in the stabilized version has a similar effect, as is well-known
in stochastic approximation methods [24], [28].
C. Fixed-Point Algorithm for Several Units
The one-unit algorithm of the preceding section can be used to construct a system of n neurons to estimate the whole ICA transformation using the multiunit contrast function in (8).


References

T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.

A. Papoulis, Probability, Random Variables, and Stochastic Processes. New York: McGraw-Hill, 1965.

A. J. Bell and T. J. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Computation, vol. 7, no. 6, pp. 1129-1159, 1995.

P. Comon, "Independent component analysis, a new concept?" Signal Processing, vol. 36, no. 3, pp. 287-314, 1994.

B. A. Olshausen and D. J. Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, pp. 607-609, 1996.

Frequently Asked Questions (12)
Q1. What are the contributions mentioned in the paper "Fast and robust fixed-point algorithms for independent component analysis" ?

In this paper, the authors use a combination of two different approaches for linear ICA: Comon's information-theoretic approach and the projection pursuit approach. Using maximum entropy approximations of differential entropy, the authors introduce a family of new contrast (objective) functions for ICA. These contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. Finally, the authors introduce simple fixed-point algorithms for practical optimization of the contrast functions.

The advantage of neural on-line learning rules is that the inputs can be used in the algorithm at once, thus enabling faster adaptation in a nonstationary environment. 

When the authors have estimated p independent components, or p vectors w_1, ..., w_p, they run the one-unit fixed-point algorithm for w_{p+1}, and after every iteration step subtract from w_{p+1} the “projections” of the previously estimated p vectors, and then renormalize.

Four independent components of different distributions (two sub-Gaussian and two super-Gaussian) were artificially generated, and the symmetric version of the fixed-point algorithm for sphered data was used.

The main advantage of the fixed-point algorithms is that their convergence can be shown to be very fast (cubic or at least quadratic). 

Using Theorem 2, one sees that in terms of asymptotic variance, an optimal contrast function for estimating an independent component whose density function equals f_α is of the form G_α(u) = |u|^α (12), where the arbitrary constants have been dropped for simplicity.

First of all, since only the Jacobian matrix is approximated, any convergence point of the algorithm must be a solution of the Kuhn–Tucker condition in (17). 

Some extensions of the methods introduced in this paper are presented in [20], in which the problem of noisy data is addressed, and in [22], which deals with the situation where there are more independent components than observed variables. 

This implies roughly that for super-Gaussian (respectively, sub-Gaussian) densities, the optimal contrast function is a function that grows slower than quadratically (respectively, faster than quadratically).

These applications include artifact cancellation in EEG and MEG [36], [37], decomposition of evoked fields in MEG [38], and feature extraction of image data [25], [35].

The authors observed that for all three contrast functions, only three iterations were necessary, on the average, to achieve the maximum accuracy allowed by the data.

Using the approach of minimizing mutual information, the above one-unit contrast function can be simply extended to compute the whole matrix W in (1).