
Fast and robust fixed-point algorithms for independent component analysis

01 May 1999-IEEE Transactions on Neural Networks (IEEE)-Vol. 10, Iss: 3, pp 626-634
TL;DR: Using maximum entropy approximations of differential entropy, the paper introduces a family of new contrast (objective) functions for ICA that enable both estimation of the whole decomposition by minimizing mutual information and estimation of individual independent components as projection pursuit directions.
Abstract: Independent component analysis (ICA) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. We use a combination of two different approaches for linear ICA: Comon's information theoretic approach and the projection pursuit approach. Using maximum entropy approximations of differential entropy, we introduce a family of new contrast functions for ICA. These contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. The statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. Finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions.

Summary (3 min read)

Introduction

  • For computational and conceptual simplicity, the representation is often sought as a linear transformation of the original data.
  • The authors treat in this paper the problem of estimating the transformation given by independent component analysis (ICA) [7], [27].
  • Thus this method is a special case of redundancy reduction [2].
  • Using the concept of differential entropy, one can define the mutual information between the random variables [7], [8].

B. Contrast Functions through Approximations of Negentropy

  • The authors use here the new approximations developed in [19], based on the maximum entropy principle.
  • In the simplest case, these new approximations are of the form (6) where is practically any nonquadratic function, is an irrelevant constant, and is a Gaussian variable of zero mean and unit variance (i.e., standardized).
  • The random variable is assumed to be of zero mean and unit variance.
  • Maximizing the sum of one-unit contrast functions, and taking into account the constraint of decorrelation, one obtains the following optimization problem: maximize wrt. under constraint (8) where at the maximum, every vector gives one of the rows of the matrix , and the ICA transformation is then given by .

A. Behavior Under the ICA Data Model

  • The authors analyze the behavior of the estimators given above when the data follows the ICA data model (2), with a square mixing matrix.
  • For simplicity, the authors consider only the estimation of a single independent component, and neglect the effects of decorrelation.
  • In [18], evaluation of asymptotic variances was addressed using a related family of contrast functions.
  • In fact, it can be seen that the results in [18] are valid even in this case, and thus the authors have the following theorem.
  • In particular, if one chooses a function G that is bounded, g(u)u is also bounded, and the estimator is rather robust against outliers.

B. Practical Choice of Contrast Function

  • 1) Performance in the Exponential Power Family: Now the authors shall treat the question of choosing the contrast function G in practice.
  • For α < 2, one obtains a sparse, super-Gaussian density (i.e., a density of positive kurtosis).
  • Taking also into account the fact that most independent components encountered in practice are super-Gaussian [3], [25], one reaches the conclusion that as a general-purpose contrast function, one should choose a function G that resembles G(u) = |u|^α with α < 2 (13).
  • This point is, however, so application-dependent that the authors cannot say much in general.
  • The authors will show below that the fixed-point algorithms have very appealing convergence properties, making them a very interesting alternative to adaptive learning rules in environments where fast real-time adaptation is not necessary.

B. Fixed-Point Algorithm for One Unit

  • To begin with, the authors shall derive the fixed-point algorithm for one unit, with sphered data.
  • Denoting the function on the left-hand side of (17) by F, the authors obtain its Jacobian matrix as in (18).
  • Due to the approximations used in the derivation of the fixed-point algorithm, one may wonder if it really converges to the right points.
  • Moreover, it is proven that the convergence is quadratic, as usual with Newton methods.
  • If the convergence is not satisfactory, one may then increase the sample size.

C. Fixed-Point Algorithm for Several Units

  • The one-unit algorithm of the preceding section can be used to construct a system of neurons to estimate the whole ICA transformation using the multiunit contrast function in (8).
  • To prevent different neurons from converging to the same maxima, the authors must decorrelate the outputs after every iteration.
  • When the authors have estimated p independent components, or p vectors w_1, ..., w_p, they run the one-unit fixed-point algorithm for w_{p+1}, and after every iteration step subtract from w_{p+1} the “projections” of the previously estimated p vectors, and then renormalize: 1. Let w_{p+1} = w_{p+1} - Σ_{j=1}^{p} (w_{p+1}^T w_j) w_j; 2. Let w_{p+1} = w_{p+1} / sqrt(w_{p+1}^T w_{p+1}) (24). A sketch of this deflation step is given after this list.
  • Finally, let us note that explicit inversion of the matrix C in (22) or (23) can be avoided by using an identity that is valid for any decorrelating W.
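A minimal sketch of that deflation step (a Gram-Schmidt-like orthogonalization against the already estimated vectors; this is our own illustrative code for sphered data, not taken from the paper):

```python
import numpy as np

def deflate(w, W_prev):
    """Subtract from w its projections on previously found vectors and renormalize.

    w      : (m,) current weight vector after a one-unit fixed-point step.
    W_prev : (p, m) array whose rows are the p already estimated vectors w_1 ... w_p.
    """
    if len(W_prev):
        w = w - W_prev.T @ (W_prev @ w)     # w <- w - sum_j (w^T w_j) w_j
    return w / np.linalg.norm(w)            # renormalize to unit length
```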

D. Properties of the Fixed-Point Algorithm

  • The fixed-point algorithm and the underlying contrast functions have a number of desirable properties when compared with existing methods for ICA.
  • This illustrates the fast convergence of the fixed-point algorithm.
  • This resulted in a generalization of the kurtosis-based approach in [7] and [9], and also enabled estimation of the independent components one by one.
  • Next, a new family of algorithms for optimizing the contrast functions was introduced.

A. Proof of Convergence of Algorithm (20)

  • The convergence is proven under the assumptions that first, the data follows the ICA data model (2) and second, that the expectations are evaluated exactly.
  • The authors must also make the technical assumption (27), which can be considered a generalization of the condition, valid when kurtosis is used as contrast, that the kurtosis of the independent components must be nonzero.
  • If (27) is true for a subset of independent components, the authors can estimate just those independent components.
  • This shows clearly that, under assumption (27), the algorithm converges to a vector that corresponds to exactly one of the independent components.
  • In other cases, the convergence is quadratic.

B. Proof of Convergence of (26)

  • Thus, after k iterations, the eigenvalues of W W^T are obtained by applying a fixed scalar function k times to the eigenvalues of W W^T for the original matrix W before the iterations (one way to perform the decorrelation step itself is sketched after this list).
  • Denoting by W the weight matrix whose rows are the weight vectors of the neurons, the authors obtain the learning rule (39), where μ(t) is the learning rate sequence, and the function g is applied separately on every component of its vector argument.
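For the symmetric (non-deflationary) variant referred to here, one standard way to enforce the decorrelation constraint W W^T = I after every iteration is the inverse-square-root orthogonalization below. This is a sketch of ours and uses the closed-form orthogonalization rather than the iterative scheme whose convergence is analyzed in the appendix:

```python
import numpy as np

def symmetric_decorrelation(W):
    """Replace W by (W W^T)^{-1/2} W so that the rows of W become orthonormal."""
    d, E = np.linalg.eigh(W @ W.T)                      # eigendecomposition of W W^T
    return E @ np.diag(1.0 / np.sqrt(d)) @ E.T @ W      # (W W^T)^{-1/2} W
```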


Fast and Robust Fixed-Point Algorithms
for Independent Component Analysis
Aapo Hyvärinen
Abstract: Independent component analysis (ICA) is a statistical
method for transforming an observed multidimensional random
vector into components that are statistically as independent from
each other as possible. In this paper, we use a combination of
two different approaches for linear ICA: Comon’s information-
theoretic approach and the projection pursuit approach. Using
maximum entropy approximations of differential entropy, we
introduce a family of new contrast (objective) functions for ICA.
These contrast functions enable both the estimation of the whole
decomposition by minimizing mutual information, and estima-
tion of individual independent components as projection pursuit
directions. The statistical properties of the estimators based on
such contrast functions are analyzed under the assumption of
the linear mixture model, and it is shown how to choose contrast
functions that are robust and/or of minimum variance. Finally, we
introduce simple fixed-point algorithms for practical optimization
of the contrast functions. These algorithms optimize the contrast
functions very fast and reliably.
I. INTRODUCTION
A central problem in neural-network research, as well as in statistics and signal processing, is finding a suitable representation or transformation of the data. For computational and conceptual simplicity, the representation is often sought as a linear transformation of the original data. Let us denote by x = (x_1, ..., x_m)^T a zero-mean m-dimensional random variable that can be observed, and by y = (y_1, ..., y_n)^T its n-dimensional transform. Then the problem is to determine a constant (weight) matrix W so that the linear transformation of the observed variables

y = Wx    (1)
has some suitable properties. Several principles and methods
have been developed to find such a linear representation,
including principal component analysis [30], factor analysis
[15], projection pursuit [12], [16], independent component
analysis [27], etc. The transformation may be defined using
such criteria as optimal dimension reduction, statistical “interestingness” of the resulting components y_i, simplicity of the transformation, or other criteria, including application-oriented ones.
We treat in this paper the problem of estimating the trans-
formation given by (linear) independent component analysis
(ICA) [7], [27]. As the name implies, the basic goal in
determining the transformation is to find a representation
in which the transformed components
are statistically as
independent from each other as possible. Thus this method is
a special case of redundancy reduction [2].
Two promising applications of ICA are blind source sepa-
ration and feature extraction. In blind source separation [27],
the observed values of x correspond to a realization of an m-dimensional discrete-time signal x(t), t = 1, 2, .... Then the components s_i(t) are called source signals, which are usually original, uncorrupted signals or noise sources. Often such sources are statistically independent from each other, and thus the signals can be recovered from linear mixtures x_i by finding a transformation in which the transformed signals are as independent as possible, as in ICA. In feature extraction [4], [25], s_i is the coefficient of the ith feature in the observed data vector x. The use of ICA for feature extraction is motivated by
results in neurosciences that suggest that the similar principle
of redundancy reduction [2], [32] explains some aspects of
the early processing of sensory data by the brain. ICA has
also applications in exploratory data analysis in the same way
as the closely related method of projection pursuit [12], [16].
In this paper, new objective (contrast) functions and algo-
rithms for ICA are introduced. Starting from an information-
theoretic viewpoint, the ICA problem is formulated as min-
imization of mutual information between the transformed
variables
, and a new family of contrast functions for ICA
is introduced (Section II). These contrast functions can also
be interpreted from the viewpoint of projection pursuit, and
enable the sequential (one-by-one) extraction of independent
components. The behavior of the resulting estimators is then
evaluated in the framework of the linear mixture model,
obtaining guidelines for choosing among the many contrast
functions contained in the introduced family. Practical choice
of the contrast function is discussed as well, based on the
statistical criteria together with some numerical and pragmatic
criteria (Section III). For practical maximization of the contrast
functions, we introduce a novel family of fixed-point algo-
rithms (Section IV). These algorithms are shown to have very
appealing convergence properties. Simulations confirming the
usefulness of the novel contrast functions and algorithms are
reported in Section V, together with references to real-life
experiments using these methods. Some conclusions are drawn
in Section VI.
II. CONTRAST FUNCTIONS FOR ICA
A. ICA Data Model, Minimization of Mutual
Information, and Projection Pursuit
One popular way of formulating the ICA problem is to
consider the estimation of the following generative model for
the data [1], [3], [5], [6], [23], [24], [27], [28], [31]:

x = As    (2)

where x is an observed m-dimensional vector, s is an n-dimensional (latent) random vector whose components are assumed mutually independent, and A is a constant m x n matrix to be estimated. It is usually further assumed that the dimensions of x and s are equal, i.e., m = n; we make this assumption in the rest of the paper. A noise vector may also be present. The matrix W defining the transformation as in (1) is then obtained as the (pseudo)inverse of the estimate of the matrix A. Non-Gaussianity of the independent components is necessary for the identifiability of the model (2), see [7].
Comon [7] showed how to obtain a more general formulation for ICA that does not need to assume an underlying data model. This definition is based on the concept of mutual information. First, we define the differential entropy H of a random vector y = (y_1, ..., y_n)^T with density f(.) as follows [33]:

H(y) = - ∫ f(y) log f(y) dy    (3)

Differential entropy can be normalized to give rise to the definition of negentropy, which has the appealing property of being invariant for linear transformations. The definition of negentropy J is given by

J(y) = H(y_gauss) - H(y)    (4)

where y_gauss is a Gaussian random variable of the same covariance matrix as y. Negentropy can also be interpreted as a measure of nongaussianity [7]. Using the concept of differential entropy, one can define the mutual information I between the n (scalar) random variables y_i, i = 1, ..., n [7], [8]. Mutual information is a natural measure of the dependence between random variables. It is particularly interesting to express mutual information using negentropy, constraining the variables to be uncorrelated. In this case, we have [7]

I(y_1, y_2, ..., y_n) = J(y) - Σ_i J(y_i)    (5)
Since mutual information is the information-theoretic mea-
sure of the independence of random variables, it is natural
to use it as the criterion for finding the ICA transform.
Thus we define in this paper, following [7], the ICA of
a random vector
as an invertible transformation
as in (1) where the matrix is determined so that
the mutual information of the transformed components
is
minimized. Note that mutual information (or the independence
of the components) is not affected by multiplication of the
components by scalar constants. Therefore, this definition only
defines the independent components up to some multiplicative
constants. Moreover, the constraint of uncorrelatedness of the
is adopted in this paper. This constraint is not strictly
necessary, but simplifies the computations considerably.
Because negentropy is invariant for invertible linear trans-
formations [7], it is now obvious from (5) that finding an
invertible transformation W that minimizes the mutual infor-
mation is roughly equivalent to finding directions in which the
negentropy is maximized. This formulation of ICA also shows
explicitly the connection between ICA and projection pursuit
[11], [12], [16], [26]. In fact, finding a single direction that
maximizes negentropy is a form of projection pursuit, and
could also be interpreted as estimation of a single independent
component [24].
B. Contrast Functions through Approximations of Negentropy
To use the definition of ICA given above, a simple estimate
of the negentropy (or of differential entropy) is needed. We use
here the new approximations developed in [19], based on the
maximum entropy principle. In [19] it was shown that these
approximations are often considerably more accurate than the
conventional, cumulant-based approximations in [1], [7], and
[26]. In the simplest case, these new approximations are of the form

J(y) ≈ c [E{G(y)} - E{G(ν)}]^2    (6)

where G is practically any nonquadratic function, c is an irrelevant constant, and ν is a Gaussian variable of zero mean and unit variance (i.e., standardized). The random variable y is assumed to be of zero mean and unit variance. For symmetric variables, this is a generalization of the cumulant-based approximation in [7], which is obtained by taking G(y) = y^4. The choice of the function G is deferred to Section III.
The approximation of negentropy given above in (6) gives
readily a new objective function for estimating the ICA
transform in our framework. First, to find one independent component, or projection pursuit direction, as y = w^T x, we maximize the function J_G given by

J_G(w) = [E{G(w^T x)} - E{G(ν)}]^2    (7)

where w is an m-dimensional (weight) vector constrained so that E{(w^T x)^2} = 1 (we can fix the scale arbitrarily). Several
independent components can then be estimated one-by-one
using a deflation scheme, see Section IV.
Second, using the approach of minimizing mutual infor-
mation, the above one-unit contrast function can be simply
extended to compute the whole matrix
in (1). To do
this, recall from (5) that mutual information is minimized (under the constraint of decorrelation) when the sum of the negentropies of the components is maximized. Maximizing the sum of n one-unit contrast functions, and taking into account the constraint of decorrelation, one obtains the following optimization problem:

maximize  Σ_{i=1}^{n} J_G(w_i)  wrt. w_i, i = 1, ..., n,
under constraint  E{(w_k^T x)(w_j^T x)} = δ_jk    (8)

where at the maximum, every vector w_i gives one of the rows of the matrix W, and the ICA transformation is then given by y = Wx. Thus we have defined our ICA estimator by an optimization problem. Below we analyze the properties of the estimators, giving guidelines for the choice of G, and propose algorithms for solving the optimization problems in practice.
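As a concrete illustration of the one-unit contrast (7), the following NumPy sketch (our own code, not part of the paper; function names are ours) evaluates J_G(w) from a data sample, replacing expectations by sample means and using G(u) = log cosh(u) as an example nonquadratic function. The Gaussian reference term is estimated by Monte Carlo here, although for common choices of G it can be computed analytically.

```python
import numpy as np

def one_unit_contrast(w, X, G=lambda u: np.log(np.cosh(u)), n_gauss=10**6, rng=None):
    """Approximate J_G(w) = [E{G(w^T x)} - E{G(nu)}]^2 from a data sample.

    w : (m,) weight vector (data assumed zero-mean and sphered, E{(w^T x)^2} = 1).
    X : (N, m) array of observations x(1), ..., x(N) stored as rows.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    y = X @ w                                           # projections w^T x(t)
    E_G_y = np.mean(G(y))                               # sample mean of G(w^T x)
    E_G_nu = np.mean(G(rng.standard_normal(n_gauss)))   # E{G(nu)}, nu ~ N(0, 1)
    return (E_G_y - E_G_nu) ** 2
```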
III. ANALYSIS OF ESTIMATORS AND CHOICE OF CONTRAST FUNCTION
A. Behavior Under the ICA Data Model
In this section, we analyze the behavior of the estimators
given above when the data follows the ICA data model (2),
with a square mixing matrix. For simplicity, we consider only
the estimation of a single independent component, and neglect
the effects of decorrelation. Let us denote by w a vector obtained by maximizing J_G in (7). The vector w is thus an estimator of a row of the matrix W.
1) Consistency: First of all, we prove that w is a (locally) consistent estimator for one component in the ICA data model. To prove this, we have the following theorem.
Theorem 1: Assume that the input data follows the ICA data model in (2), and that G is a sufficiently smooth even function. Then the set of local maxima of J_G(w) under the constraint E{(w^T x)^2} = 1 includes the ith row of the inverse of the mixing matrix A such that the corresponding independent component s_i fulfills

E{s_i g(s_i) - g'(s_i)} [E{G(s_i)} - E{G(ν)}] > 0    (9)

where g(.) is the derivative of G(.), and ν is a standardized Gaussian variable.
This theorem can be considered a corollary of the theorem in [24]. The condition in Theorem 1 seems to be true for most reasonable choices of G, and distributions of the s_i. In particular, if G(y) = y^4, the condition is fulfilled for any distribution of nonzero kurtosis. In that case, it can also be proven that there are no spurious optima [9].
2) Asymptotic Variance: Asymptotic variance is one crite-
rion for choosing the function
to be used in the contrast
function. Comparison of, say, the traces of the asymptotic co-
variance matrices of two estimators enables direct comparison
of the mean-square error of the estimators. In [18], evaluation
of asymptotic variances was addressed using a related family
of contrast functions. In fact, it can be seen that the results
in [18] are valid even in this case, and thus we have the
following theorem.
Theorem 2: The trace of the asymptotic (co)variance of w is minimized when G is of the form

G_opt(u) = c_1 log f_i(u) + c_2 u^2 + c_3    (10)

where f_i(.) is the density function of s_i, and c_1, c_2, c_3 are arbitrary constants.
For simplicity, one can choose G_opt(u) = log f_i(u).
Thus the optimal contrast function is the same as the one
obtained by the maximum likelihood approach [34], or the
infomax approach [3]. Almost identical results have also been
obtained in [5] for another algorithm. The theorem above
treats, however, the one-unit case instead of the multiunit case
treated by the other authors.
3) Robustness: Another very attractive property of an es-
timator is robustness against outliers [14]. This means that
single, highly erroneous observations do not have much influ-
ence on the estimator. To obtain a simple form of robustness
called B-robustness, we would like the estimator to have a
bounded influence function [14]. Again, we can adapt the
results in [18]. It turns out to be impossible to have a
completely bounded influence function, but we do have a
simpler form of robustness, as stated in the following theorem.
Theorem 3: Assume that the data x is whitened (sphered) in a robust manner (see Section IV for this form of preprocessing). Then the influence function of the estimator w is never bounded for all x. However, if g(u)u is bounded, the influence function is bounded in sets of a certain restricted form, where g is the derivative of G.
In particular, if one chooses a function G that is bounded, g(u)u is also bounded, and the estimator is rather robust against outliers. If this is not possible, one should at least choose a function G(u) that does not grow very fast when |u| grows.
B. Practical Choice of Contrast Function
1) Performance in the Exponential Power Family: Now we
shall treat the question of choosing the contrast function G in practice. It is useful to analyze the implications of the theoretical results of the preceding sections by considering the following exponential power family of density functions

f_α(s) = k_1 exp(k_2 |s|^α)    (11)

where α is a positive parameter, and k_1, k_2 are normalization constants that ensure that f_α is a probability density of unit variance. For different values of α, the densities in this family exhibit different shapes. For α < 2, one obtains a sparse, super-Gaussian density (i.e., a density of positive kurtosis). For α = 2, one obtains the Gaussian distribution, and for α > 2, a sub-Gaussian density (i.e., a density of negative kurtosis). Thus the densities in this family can be used as examples of different non-Gaussian densities.
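For experimentation, sources from this family can be generated with the short sketch below (ours, not from the paper). It uses the standard gamma transform for densities proportional to exp(-|s|^α) and then standardizes the sample empirically to zero mean and unit variance.

```python
import numpy as np

def sample_exp_power(alpha, size, rng=None):
    """Draw samples with density proportional to exp(-|s|**alpha), standardized.

    alpha < 2 gives a super-Gaussian source, alpha = 2 Gaussian, alpha > 2 sub-Gaussian.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.gamma(shape=1.0 / alpha, scale=1.0, size=size)   # |s|**alpha ~ Gamma(1/alpha, 1)
    s = rng.choice([-1.0, 1.0], size=size) * u ** (1.0 / alpha)
    return (s - s.mean()) / s.std()                          # enforce zero mean, unit variance
```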
Using Theorem 2, one sees that in terms of asymptotic
variance, an optimal contrast function for estimating an independent component whose density function equals f_α is of the form

G_α(u) = |u|^α    (12)

where the arbitrary constants have been dropped for simplicity. This implies roughly that for super-Gaussian (respectively, sub-Gaussian) densities, the optimal contrast function is a function that grows slower than quadratically (respectively, faster than quadratically). Next, recall from Section III-A-3 that if G grows fast with |u|, the estimator becomes highly nonrobust against outliers. Taking also into account the fact that most independent components encountered in practice are super-Gaussian [3], [25], one reaches the conclusion that as a general-purpose contrast function, one should choose a function G that resembles rather

G(u) = |u|^α,  where α < 2    (13)

The problem with such contrast functions is, however, that they are not differentiable at zero for α ≤ 1. Thus it is better to use approximating differentiable functions that have the same kind of qualitative behavior. Considering α = 1, in which case one has a double exponential density, one could use instead the
function G_1(u) = (1/a_1) log cosh(a_1 u), where a_1 ≥ 1 is a constant. Note that the derivative of G_1 is then the familiar tanh function (for a_1 = 1). In the case of α < 1, i.e., highly super-Gaussian independent components, one could approximate the behavior of G_α for large u using a Gaussian function (with a minus sign): G_2(u) = -(1/a_2) exp(-a_2 u^2/2), where a_2 is a constant. The derivative of this function is like a sigmoid for small values, but goes to zero for larger values. Note that this function also fulfills the condition in Theorem 3, thus providing an estimator that is as robust as possible in the framework of estimators of type (8). As regards the constants, we have found experimentally 1 ≤ a_1 ≤ 2 and a_2 = 1 to provide good approximations.
2) Choosing the Contrast Function in Practice: The theo-
retical analysis given above gives some guidelines as for the
choice of
. In practice, however, there are also other criteria
that are important, in particular the following two.
First, we have computational simplicity: The contrast func-
tion should be fast to compute. It must be noted that poly-
nomial functions tend to be faster to compute than, say, the
hyperbolic tangent. However, nonpolynomial contrast func-
tions could be replaced by piecewise linear approximations
without losing the benefits of nonpolynomial functions.
The second point to consider is the order in which the
components are estimated, if one-by-one estimation is used.
We can influence this order because the basins of attraction of
the maxima of the contrast function have different sizes. Any
ordinary method of optimization tends to first find maxima that
have large basins of attraction. Of course, it is not possible
to determine with certainty this order, but a suitable choice
of the contrast function means that independent components
with certain distributions tend to be found first. This point is,
however, so application-dependent that we cannot say much
in general.
Thus, taking into account all these criteria, we reach the
following general conclusion. We have basically the following
choices for the contrast function (for future use, we also give
their derivatives):
G_1(u) = (1/a_1) log cosh(a_1 u),    g_1(u) = tanh(a_1 u)    (14)
G_2(u) = -(1/a_2) exp(-a_2 u^2/2),    g_2(u) = u exp(-a_2 u^2/2)    (15)
G_3(u) = (1/4) u^4,    g_3(u) = u^3    (16)

where 1 ≤ a_1 ≤ 2 and a_2 ≈ 1 are constants, and piecewise linear approximations of (14) and (15) may also be used. The
benefits of the different contrast functions may be summarized as follows:
  • G_1 is a good general-purpose contrast function;
  • when the independent components are highly super-Gaussian, or when robustness is very important, G_2 may be better;
  • if computational overhead must be reduced, piecewise linear approximations of G_1 and G_2 may be used;
  • using kurtosis, or G_3, is justified on statistical grounds only for estimating sub-Gaussian independent components when there are no outliers.
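To make the three choices concrete, here is a small Python sketch (ours, not part of the paper) implementing the nonlinearities g_1, g_2, g_3 of (14)-(16) together with their first derivatives, which are the quantities needed by the fixed-point iterations of Section IV. The default constants a_1 = 1 and a_2 = 1 lie in the ranges recommended in the text.

```python
import numpy as np

def g_logcosh(u, a1=1.0):
    """g_1(u) = tanh(a1*u) and its derivative; G_1(u) = (1/a1) log cosh(a1*u)."""
    t = np.tanh(a1 * u)
    return t, a1 * (1.0 - t ** 2)

def g_gauss(u, a2=1.0):
    """g_2(u) = u*exp(-a2*u^2/2) and its derivative; G_2 is minus a Gaussian bump."""
    e = np.exp(-a2 * u ** 2 / 2.0)
    return u * e, (1.0 - a2 * u ** 2) * e

def g_kurtosis(u):
    """g_3(u) = u^3 and its derivative; G_3(u) = u^4/4 (kurtosis-based contrast)."""
    return u ** 3, 3.0 * u ** 2
```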
Finally, we emphasize that, in contrast to many other ICA
methods, our framework provides estimators that work for
(practically) any distributions of the independent components
and for any choice of the contrast function. The choice of the
contrast function is only important if one wants to optimize
the performance of the method.
IV. FIXED-POINT ALGORITHMS FOR ICA
A. Introduction
In the preceding sections, we introduced new contrast (or
objective) functions for ICA based on minimization of mutual
information (and projection pursuit), analyzed some of their
properties, and gave guidelines for the practical choice of the
function G used in the contrast functions. In practice, one also
needs an algorithm for maximizing the contrast functions in
(7) or (8).
A simple method to maximize the contrast function would
be to use stochastic gradient descent; the constraint could be
taken into account by a bigradient feedback. This leads to
neural (adaptive) algorithms that are closely related to those
introduced in [24]. We show in the Appendix B how to modify
the algorithms in [24] to minimize the contrast functions used
in this paper.
The advantage of neural on-line learning rules is that the
inputs
can be used in the algorithm at once, thus enabling
faster adaptation in a nonstationary environment. A resulting
tradeoff, however, is that the convergence is slow, and depends
on a good choice of the learning rate sequence, i.e., the step
size at each iteration. A bad choice of the learning rate can, in
practice, destroy convergence. Therefore, it would be important
in practice to make the learning faster and more reliable.
This can be achieved by the fixed-point iteration algorithms
that we introduce here. In the fixed-point algorithms, the
computations are made in batch (or block) mode, i.e., a
large number of data points are used in a single step of
the algorithm. In other respects, however, the algorithms
may be considered neural. In particular, they are parallel,
distributed, computationally simple, and require little memory
space. We will show below that the fixed-point algorithms
have very appealing convergence properties, making them
a very interesting alternative to adaptive learning rules in
environments where fast real-time adaptation is not necessary.
Note that our basic ICA algorithms require a preliminary
sphering or whitening of the data x, though also some versions for nonsphered data will be given. Sphering means that the original observed variable, say v, is linearly transformed to a variable x such that the correlation matrix of x equals unity: E{xx^T} = I. This transformation is always possible; indeed, it can be accomplished by classical PCA. For details, see [7] and [12].
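A minimal whitening step along these lines might look as follows. This is a sketch of ours assuming zero-mean data stored row-wise; the eigenvalue-based transform is one standard way to realize the PCA whitening mentioned above, not necessarily the exact procedure used in the paper's experiments.

```python
import numpy as np

def whiten(V):
    """Sphere the data: return X with correlation matrix ~ I, plus the whitening matrix.

    V : (N, m) array of zero-mean observations v(1), ..., v(N) as rows.
    """
    C = np.cov(V, rowvar=False)                            # sample covariance E{v v^T}
    d, E = np.linalg.eigh(C)                               # eigendecomposition C = E diag(d) E^T
    W_whiten = E @ np.diag(1.0 / np.sqrt(d)) @ E.T         # C^{-1/2}
    X = V @ W_whiten.T                                     # x = C^{-1/2} v
    return X, W_whiten
```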
B. Fixed-Point Algorithm for One Unit
To begin with, we shall derive the fixed-point algorithm
for one unit, with sphered data. First note that the maxima
of J_G(w) are obtained at certain optima of E{G(w^T x)}. According to the Kuhn–Tucker conditions [29], the optima of E{G(w^T x)} under the constraint E{(w^T x)^2} = ||w||^2 = 1 are obtained at points where

E{x g(w^T x)} - βw = 0    (17)

where β is a constant that can be easily evaluated to give β = E{w_0^T x g(w_0^T x)}, where w_0 is the value of w at the optimum. Let us try to solve this equation by Newton's method. Denoting the function on the left-hand side of (17) by F, we obtain its Jacobian matrix as

JF(w) = E{xx^T g'(w^T x)} - βI    (18)

To simplify the inversion of this matrix, we decide to approximate the first term in (18). Since the data is sphered, a reasonable approximation seems to be E{xx^T g'(w^T x)} ≈ E{xx^T} E{g'(w^T x)} = E{g'(w^T x)} I. Thus the Jacobian matrix becomes diagonal, and can easily be inverted. We also approximate β using the current value of w instead of w_0. Thus we obtain the following approximative Newton iteration

w+ = w - [E{x g(w^T x)} - βw] / [E{g'(w^T x)} - β],    w* = w+ / ||w+||    (19)

where w+ denotes the new value of w, β = E{w^T x g(w^T x)}, and the normalization has been added to improve the stability. This algorithm can be further simplified by multiplying both sides of the first equation in (19) by β - E{g'(w^T x)}. This gives the following fixed-point algorithm

w+ = E{x g(w^T x)} - E{g'(w^T x)} w,    w* = w+ / ||w+||    (20)
which was introduced in [17] using a more heuristic derivation.
An earlier version (for kurtosis only) was derived as a fixed-
point iteration of a neural learning rule in [23], which is where
its name comes from. We retain this name for the algorithm,
although in the light of the above derivation, it is rather a
Newton method than a fixed-point iteration.
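For illustration, a minimal NumPy sketch of the one-unit iteration (20) for sphered data could read as follows. This is our own code, with the expectations replaced by sample means and the tanh nonlinearity g_1 of (14) (with a_1 = 1) used as an example.

```python
import numpy as np

def fastica_one_unit(X, g=np.tanh, g_prime=lambda u: 1.0 - np.tanh(u) ** 2,
                     max_iter=200, tol=1e-6, rng=None):
    """One-unit fixed-point iteration (20) on sphered data X of shape (N, m)."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        y = X @ w                                                # projections w^T x(t)
        w_new = X.T @ g(y) / len(X) - np.mean(g_prime(y)) * w    # E{x g(w^T x)} - E{g'(w^T x)} w
        w_new /= np.linalg.norm(w_new)                           # renormalize to unit length
        if abs(abs(w_new @ w) - 1.0) < tol:                      # converged (up to sign)
            w = w_new
            break
        w = w_new
    return w
```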
Due to the approximations used in the derivation of the
fixed-point algorithm, one may wonder if it really converges
to the right points. First of all, since only the Jacobian matrix is
approximated, any convergence point of the algorithm must be
a solution of the Kuhn–Tucker condition in (17). In Appendix
A it is further proven that the algorithm does converge to the
right extrema (those corresponding to maxima of the contrast
function), under the assumption of the ICA data model.
Moreover, it is proven that the convergence is quadratic,
as usual with Newton methods. In fact, if the densities of
the
are symmetric, the convergence is even cubic. The
convergence proven in the Appendix is local. However, in
the special case where kurtosis is used as a contrast function,
i.e., if G_3 in (16) is used, the convergence is proven globally.
The above derivation also enables a useful modification of
the fixed-point algorithm. It is well known that the conver-
gence of the Newton method may be rather uncertain. To
ameliorate this, one may add a step size in (19), obtaining the stabilized fixed-point algorithm

w+ = w - μ [E{x g(w^T x)} - βw] / [E{g'(w^T x)} - β],    w* = w+ / ||w+||    (21)

where β = E{w^T x g(w^T x)} as above, and μ is a step size parameter that may change with the iteration count. Taking a μ that is much smaller than unity (say, 0.1 or 0.01), the algorithm (21) converges with much more certainty. In particular, it is often a good strategy to start with μ = 1, in which case the algorithm is equivalent to the original fixed-point algorithm in (20). If convergence seems problematic, μ may then be decreased gradually until convergence is satisfactory. Note that we thus have a continuum between a Newton optimization method, corresponding to μ = 1, and a gradient descent method, corresponding to a very small μ.
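A hedged sketch of the stabilized update (21), again with sample means in place of expectations (our own code; the fixed step size shown is only an example, not a schedule prescribed by the paper):

```python
import numpy as np

def fastica_one_unit_stabilized(X, g, g_prime, mu=1.0, n_iter=200, rng=None):
    """Stabilized fixed-point iteration (21) on sphered data X of shape (N, m)."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = X @ w
        beta = np.mean(y * g(y))                              # beta = E{w^T x g(w^T x)}
        grad = X.T @ g(y) / len(X) - beta * w                 # E{x g(w^T x)} - beta w
        w = w - mu * grad / (np.mean(g_prime(y)) - beta)      # damped Newton step
        w /= np.linalg.norm(w)                                # keep ||w|| = 1
    return w
```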
The fixed-point algorithms may also be simply used for the
original, that is, not sphered data. Transforming the data back
to the nonsphered variables, one sees easily that the following
modification of the algorithm (20) works for nonsphered data:

w+ = C^{-1} E{x g(w^T x)} - E{g'(w^T x)} w,    w* = w+ / sqrt((w+)^T C w+)    (22)

where C = E{xx^T} is the covariance matrix of the data. The stabilized version, algorithm (21), can also be modified in the same way to work with nonsphered data (23). Using these two algorithms, one obtains directly an independent component as the linear combination w^T x, where x need not be sphered (prewhitened). These modifications presuppose,
of course, that the covariance matrix is not singular. If it is
singular or near-singular, the dimension of the data must be
reduced, for example with PCA [7], [28].
In practice, the expectations in the fixed-point algorithms
must be replaced by their estimates. The natural estimates are
of course the corresponding sample means. Ideally, all the
data available should be used, but this is often not a good idea
because the computations may become too demanding. Then
the averages can be estimated using a smaller sample, whose size
may have a considerable effect on the accuracy of the final
estimates. The sample points should be chosen separately at
every iteration. If the convergence is not satisfactory, one may
then increase the sample size. A reduction of the step size
in the stabilized version has a similar effect, as is well-known
in stochastic approximation methods [24], [28].
C. Fixed-Point Algorithm for Several Units
The one-unit algorithm of the preceding section can be used to construct a system of n neurons to estimate the whole ICA transformation using the multiunit contrast function in (8).


References

T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.

A. Papoulis, Probability, Random Variables, and Stochastic Processes. New York: McGraw-Hill, 1965.

A. J. Bell and T. J. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Computation, vol. 7, no. 6, pp. 1129-1159, 1995.

P. Comon, "Independent component analysis, a new concept?" Signal Processing, vol. 36, no. 3, pp. 287-314, 1994.

B. A. Olshausen and D. J. Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, pp. 607-609, 1996.

Frequently Asked Questions (12)
Q1. What are the contributions mentioned in the paper "Fast and robust fixed-point algorithms for independent component analysis" ?

In this paper, the authors use a combination of two different approaches for linear ICA: Comon's information-theoretic approach and the projection pursuit approach. Using maximum entropy approximations of differential entropy, the authors introduce a family of new contrast (objective) functions for ICA. These contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. Finally, the authors introduce simple fixed-point algorithms for practical optimization of the contrast functions.

The advantage of neural on-line learning rules is that the inputs can be used in the algorithm at once, thus enabling faster adaptation in a nonstationary environment. 

When the authors have estimated p independent components, or p vectors w_1, ..., w_p, they run the one-unit fixed-point algorithm for w_{p+1}, and after every iteration step subtract from w_{p+1} the “projections” of the previously estimated p vectors, and then renormalize.

Four independent components of different distributions (two sub-Gaussian and two super-Gaussian) were artificially generated, and the symmetric version of the fixed-point algorithm for sphered data was used.

The main advantage of the fixed-point algorithms is that their convergence can be shown to be very fast (cubic or at least quadratic). 

Using Theorem 2, one sees that in terms of asymptotic variance, an optimal contrast function for estimating an independent component whose density function equals f_α is of the form G_α(u) = |u|^α (12), where the arbitrary constants have been dropped for simplicity.

First of all, since only the Jacobian matrix is approximated, any convergence point of the algorithm must be a solution of the Kuhn–Tucker condition in (17). 

Some extensions of the methods introduced in this paper are presented in [20], in which the problem of noisy data is addressed, and in [22], which deals with the situation where there are more independent components than observed variables. 

This implies roughly that for super-Gaussian (respectively, sub-Gaussian) densities, the optimal contrast function is a function that grows slower than quadratically (respectively, faster than quadratically).

These applications include artifact cancellation in EEG and MEG [36], [37], decomposition of evoked fields in MEG [38], and feature extraction of image data [25], [35].

The authors observed that for all three contrast functions, only three iterations were necessary, on the average, to achieve the maximum accuracy allowed by the data.

Using the approach of minimizing mutual information, the above one-unit contrast function can be simply extended to compute the whole matrix W in (1).