Fast and robust fixed-point algorithms for independent component analysis
Summary (3 min read)
Introduction
- For computational and conceptual simplicity, the representation is often sought as a linear transformation of the original data.
- The authors treat in this paper the problem of estimating the transformation given by independent component analysis (ICA) [7], [27].
- Thus this method is a special case of redundancy reduction [2].
- Using the concept of differential entropy, one can define the mutual information between the random variables [7], [8].
B. Contrast Functions through Approximations of Negentropy
- The authors use here the new approximations developed in [19], based on the maximum entropy principle.
- In the simplest case, these new approximations are of the form $J(y) \approx c\,[E\{G(y)\} - E\{G(\nu)\}]^2$ (6), where $G$ is practically any nonquadratic function, $c$ is an irrelevant constant, and $\nu$ is a Gaussian variable of zero mean and unit variance (i.e., standardized).
- The random variable $y$ is assumed to be of zero mean and unit variance.
- Maximizing the sum of one-unit contrast functions, and taking into account the constraint of decorrelation, one obtains the following optimization problem: maximize $\sum_{i=1}^{n} J_G(\mathbf{w}_i^T \mathbf{x})$ wrt. the $\mathbf{w}_i$, under the constraint $E\{(\mathbf{w}_k^T \mathbf{x})(\mathbf{w}_j^T \mathbf{x})\} = \delta_{jk}$ (8), where at the maximum every vector $\mathbf{w}_i$ gives one of the rows of the matrix $\mathbf{W}$, and the ICA transformation is then given by $\mathbf{s} = \mathbf{W}\mathbf{x}$.
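As a concrete illustration, the following minimal sketch (my own, not the authors' code) estimates the one-unit contrast in (6) from a sample, with the Gaussian term $E\{G(\nu)\}$ obtained by Monte Carlo; the function name and the choice $c = 1$ are assumptions made for illustration.

```python
# A minimal sketch of the negentropy approximation (6):
# J(y) ~ c * (E{G(y)} - E{G(nu)})^2, with G(u) = log cosh(u) and c = 1.
import numpy as np

def negentropy_approx(y, n_mc=100_000, seed=0):
    """Approximate negentropy of a zero-mean, unit-variance sample y."""
    G = lambda u: np.log(np.cosh(u))          # a slowly growing, nonquadratic G
    nu = np.random.default_rng(seed).standard_normal(n_mc)  # standardized Gaussian
    return (G(y).mean() - G(nu).mean()) ** 2  # c taken as 1

# A uniform (sub-Gaussian) sample scores higher than a Gaussian one:
rng = np.random.default_rng(1)
u = rng.uniform(-np.sqrt(3), np.sqrt(3), 50_000)   # zero mean, unit variance
print(negentropy_approx(u), negentropy_approx(rng.standard_normal(50_000)))
```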
A. Behavior Under the ICA Data Model
- The authors analyze the behavior of the estimators given above when the data follows the ICA data model (2), with a square mixing matrix.
- For simplicity, the authors consider only the estimation of a single independent component, and neglect the effects of decorrelation.
- In [18], evaluation of asymptotic variances was addressed using a related family of contrast functions.
- In fact, it can be seen that the results in [18] are valid even in this case, and thus the authors have the following theorem.
- In particular, if one chooses a function $G$ that is bounded, the resulting estimator is also bounded and thus rather robust against outliers, as illustrated below.
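A small numerical illustration (my own, not from the paper) of this robustness point: a handful of gross outliers changes the unbounded kurtosis-based contrast $E\{y^4\}$ drastically, while the expectation of the bounded $G(u) = -\exp(-u^2/2)$ barely moves.

```python
# Robustness of a bounded G versus the kurtosis-based contrast.
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(10_000)
y_out = np.concatenate([y, np.full(10, 25.0)])   # 0.1% gross outliers

for name, G in [("E{y^4} (kurtosis-based)", lambda u: u**4),
                ("E{-exp(-y^2/2)} (bounded)", lambda u: -np.exp(-u**2 / 2))]:
    print(f"{name}: clean {G(y).mean():+.3f}, with outliers {G(y_out).mean():+.3f}")
```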
B. Practical Choice of Contrast Function
- 1) Performance in the Exponential Power Family: Now the authors shall treat the question of choosing the contrast function in practice.
- For $\alpha < 2$ in the exponential power family $p_\alpha(s) = k_1 \exp(k_2 |s|^\alpha)$, one obtains a sparse, super-Gaussian density (i.e., a density of positive kurtosis).
- Taking also into account the fact that most independent components encountered in practice are super-Gaussian [3], [25], one reaches the conclusion that as a general-purpose contrast function, one should choose a function $G$ that resembles rather $G_\alpha(u) = |u|^\alpha$, where $\alpha < 2$ (13); the common practical choices are collected in the sketch after this list.
- This point is, however, so application-dependent that the authors cannot say much in general.
- The authors will show below that the fixed-point algorithms have very appealing convergence properties, making them a very interesting alternative to adaptive learning rules in environments where fast real-time adaptation is not necessary.
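In practice the paper's recommended choices are $G_1(u) = \frac{1}{a_1}\log\cosh(a_1 u)$ with $1 \le a_1 \le 2$ (a good general-purpose contrast), the bounded $G_2(u) = -\exp(-u^2/2)$ (more robust against outliers), and the kurtosis-based $G_3(u) = u^4/4$. The sketch below collects them together with the derivatives $g$ and $g'$ needed by the fixed-point iteration; packaging them as a dictionary is my own convention, not the paper's.

```python
# The three contrast functions as (G, g, g') triples.
import numpy as np

CONTRASTS = {
    # G1(u) = (1/a1) log cosh(a1 u): good general-purpose choice
    "logcosh": (
        lambda u, a1=1.0: np.log(np.cosh(a1 * u)) / a1,
        lambda u, a1=1.0: np.tanh(a1 * u),
        lambda u, a1=1.0: a1 * (1.0 - np.tanh(a1 * u) ** 2),
    ),
    # G2(u) = -exp(-u^2/2): bounded, hence more robust against outliers
    "gauss": (
        lambda u: -np.exp(-u**2 / 2),
        lambda u: u * np.exp(-u**2 / 2),
        lambda u: (1.0 - u**2) * np.exp(-u**2 / 2),
    ),
    # G3(u) = u^4/4: kurtosis-based contrast; fast but non-robust
    "kurtosis": (
        lambda u: u**4 / 4,
        lambda u: u**3,
        lambda u: 3.0 * u**2,
    ),
}
```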
B. Fixed-Point Algorithm for One Unit
- To begin with, the authors shall derive the fixed-point algorithm for one unit, with sphered data.
- Denoting the function on the left-hand side of (17) by $F$, the authors obtain its Jacobian matrix as $JF(\mathbf{w}) = E\{\mathbf{x}\mathbf{x}^T g'(\mathbf{w}^T\mathbf{x})\} - \beta\mathbf{I}$ (18); because the data is sphered, the first term can be approximated by $E\{g'(\mathbf{w}^T\mathbf{x})\}\,\mathbf{I}$, which makes the Jacobian diagonal and easy to invert, yielding the iteration sketched after this list.
- Due to the approximations used in the derivation of the fixed-point algorithm, one may wonder if it really converges to the right points.
- Moreover, it is proven that the convergence is quadratic, as usual with Newton methods.
- If the convergence is not satisfactory, one may then increase the sample size.
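Collecting the above, the one-unit iteration for sphered data takes the form $\mathbf{w} \leftarrow E\{\mathbf{x}\,g(\mathbf{w}^T\mathbf{x})\} - E\{g'(\mathbf{w}^T\mathbf{x})\}\,\mathbf{w}$ followed by renormalization. A minimal sketch, with sample averages in place of expectations and a stopping rule of my own choosing:

```python
# One-unit fixed-point iteration on sphered data X of shape (dim, n_samples).
import numpy as np

def fastica_one_unit(X, g, g_prime, max_iter=200, tol=1e-8, seed=0):
    """Return one weight vector w; g and g' as in the table above."""
    w = np.random.default_rng(seed).standard_normal(X.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        y = w @ X                                        # projections w^T x
        w_new = (X * g(y)).mean(axis=1) - g_prime(y).mean() * w
        w_new /= np.linalg.norm(w_new)
        if 1.0 - abs(w_new @ w) < tol:                   # converged up to sign
            return w_new
        w = w_new
    return w

# e.g.: w = fastica_one_unit(Z, np.tanh, lambda u: 1 - np.tanh(u) ** 2)
```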
C. Fixed-Point Algorithm for Several Units
- The one-unit algorithm of the preceding section can be used to construct a system of neurons to estimate the whole ICA transformation using the multiunit contrast function in (8).
- To prevent different neurons from converging to the same maxima, the authors must decorrelate the outputs after every iteration.
- When the authors have estimated $p$ independent components, or $p$ vectors $\mathbf{w}_1, \dots, \mathbf{w}_p$, they run the one-unit fixed-point algorithm for $\mathbf{w}_{p+1}$, and after every iteration step subtract from $\mathbf{w}_{p+1}$ the “projections” $(\mathbf{w}_{p+1}^T \mathbf{w}_j)\,\mathbf{w}_j$, $j = 1, \dots, p$, of the previously estimated vectors, and then renormalize: 1. Let $\mathbf{w}_{p+1} \leftarrow \mathbf{w}_{p+1} - \sum_{j=1}^{p} (\mathbf{w}_{p+1}^T \mathbf{w}_j)\,\mathbf{w}_j$; 2. Let $\mathbf{w}_{p+1} \leftarrow \mathbf{w}_{p+1} / \sqrt{\mathbf{w}_{p+1}^T \mathbf{w}_{p+1}}$ (24); see the sketch after this list.
- Finally, let us note that explicit inversion of the matrix $\mathbf{C} = E\{\mathbf{x}\mathbf{x}^T\}$ in (22) or (23) can be avoided by using the identity $\mathbf{C}^{-1} = \mathbf{W}^T\mathbf{W}$, which is valid for any decorrelating $\mathbf{W}$.
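Minimal sketches of the two decorrelation schemes for sphered data, where decorrelation reduces to orthogonalization of the weight vectors: the deflation step (24) as reconstructed above, and the symmetric decorrelation $\mathbf{W} \leftarrow (\mathbf{W}\mathbf{W}^T)^{-1/2}\,\mathbf{W}$ computed here by eigendecomposition.

```python
import numpy as np

def deflate(w, W_prev):
    """Subtract from w its projections on previously found vectors, renormalize."""
    for wj in W_prev:                        # step 1 of (24)
        w = w - (w @ wj) * wj
    return w / np.linalg.norm(w)             # step 2 of (24)

def symmetric_decorrelate(W):
    """Return (W W^T)^(-1/2) W, making the rows of W orthonormal."""
    vals, vecs = np.linalg.eigh(W @ W.T)
    return vecs @ np.diag(vals**-0.5) @ vecs.T @ W
```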
D. Properties of the Fixed-Point Algorithm
- The fixed-point algorithm and the underlying contrast functions have a number of desirable properties when compared with existing methods for ICA.
- The simulation results illustrate the fast convergence of the fixed-point algorithm.
- This resulted in a generalization of the kurtosis-based approach in [7] and [9], and also enabled estimation of the independent components one by one.
- Next, a new family of algorithms for optimizing the contrast functions was introduced.
A. Proof of Convergence of Algorithm (20)
- The convergence is proven under the assumptions that first, the data follows the ICA data model (2) and second, that the expectations are evaluated exactly.
- The authors must also make the following technical assumption: $E\{s_i\,g(s_i) - g'(s_i)\} \neq 0$ for any $i$ (27), which can be considered a generalization of the condition, valid when kurtosis is used as contrast, that the kurtosis of the independent components must be nonzero; see the numerical check after this list.
- If (27) is true for a subset of independent components, the authors can estimate just those independent components.
- This shows clearly that under the assumption (27), the algorithm converges, in the transformed coordinates $\mathbf{z} = \mathbf{A}^T\mathbf{w}$, to a vector such that $z_i = \pm 1$ and $z_j = 0$ for $j \neq i$.
- In other cases, the convergence is quadratic.
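A quick numerical check of assumption (27) as reconstructed above, using a unit-variance Laplacian source and $g = \tanh$; note that with the kurtosis contrast, $g(u) = u^3$ and $g'(u) = 3u^2$, the same quantity is $E\{s^4\} - 3$, i.e., the kurtosis, which recovers the classical nonzero-kurtosis condition. The helper name is hypothetical.

```python
# Check E{ s g(s) - g'(s) } != 0 for a super-Gaussian source and g = tanh.
import numpy as np

def condition_27(s, g, g_prime):
    return (s * g(s) - g_prime(s)).mean()

s = np.random.default_rng(0).laplace(scale=1 / np.sqrt(2), size=200_000)  # unit variance
print(condition_27(s, np.tanh, lambda u: 1 - np.tanh(u) ** 2))  # clearly nonzero
```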
B. Proof of Convergence of (26)
- Thus, after $k$ iterations, the eigenvalues of $\mathbf{W}\mathbf{W}^T$ are obtained as $h(h(\cdots h(\lambda_i)\cdots))$, where $h(\lambda) = \lambda\,(3 - \lambda)^2/4$ is applied $k$ times on the $\lambda_i$, which are the eigenvalues of $\mathbf{W}\mathbf{W}^T$ for the original matrix before the iterations; the iteration itself is sketched after this list.
- Denoting by $\mathbf{W}$ the weight matrix whose rows are the weight vectors of the neurons, the authors obtain the adaptive learning rule in (39), where $\mu(t)$ is the learning rate sequence, and the function $g$ is applied separately on every component of the vector $\mathbf{W}\mathbf{x}$.
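A sketch of the inversion-free orthogonalization, assuming the iteration in (26) is $\mathbf{W} \leftarrow \frac{3}{2}\mathbf{W} - \frac{1}{2}\mathbf{W}\mathbf{W}^T\mathbf{W}$ after an initial scaling that brings the eigenvalues of $\mathbf{W}\mathbf{W}^T$ into $(0, 1]$; the map $h$ above then drives every eigenvalue toward one, so $\mathbf{W}\mathbf{W}^T \to \mathbf{I}$.

```python
# Matrix-inversion-free symmetric decorrelation.
import numpy as np

def iterative_decorrelate(W, n_iter=50):
    W = W / np.sqrt(np.linalg.norm(W @ W.T, 2))   # scale by spectral norm
    for _ in range(n_iter):
        W = 1.5 * W - 0.5 * (W @ W.T) @ W         # eigenvalues follow h(.)
    return W

W = iterative_decorrelate(np.random.default_rng(0).standard_normal((4, 4)))
print(np.max(np.abs(W @ W.T - np.eye(4))))        # ~ 0: rows are orthonormal
```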
- J. H. Friedman, “Exploratory projection pursuit,” J. Amer. Statist. Assoc., vol. 82, no. 397, pp. 249–266, 1987.
Frequently Asked Questions (12)
Q2. What is the advantage of neural on-line learning rules?
The advantage of neural on-line learning rules is that the inputs can be used in the algorithm at once, thus enabling faster adaptation in a nonstationary environment.
Q3. How are the independent components estimated one by one?
When the authors have estimated $p$ independent components, or $p$ vectors $\mathbf{w}_1, \dots, \mathbf{w}_p$, they run the one-unit fixed-point algorithm for $\mathbf{w}_{p+1}$, and after every iteration step subtract from $\mathbf{w}_{p+1}$ the “projections” of the previously estimated vectors, and then renormalize.
Q4. Which version of the fixed-point algorithm was used in the simulations?
Four independent components of different distributions (two sub-Gaussian and two super-Gaussian) were artificially generated, and the symmetric version of the fixed-point algorithm for sphered data was used.
Q5. What is the main advantage of the fixed-point algorithms?
The main advantage of the fixed-point algorithms is that their convergence can be shown to be very fast (cubic or at least quadratic).
Q6. What is the contrast function for estimating an independent component?
Using Theorem 2, one sees that in terms of asymptotic variance, an optimal contrast function for estimating an independent component whose density function equals $p_\alpha$ is of the form $G_\alpha(u) = |u|^\alpha$ (12), where the arbitrary constants have been dropped for simplicity.
Q7. What condition must any convergence point of the algorithm satisfy?
First of all, since only the Jacobian matrix is approximated, any convergence point of the algorithm must be a solution of the Kuhn–Tucker condition in (17).
Q8. What are some of the extensions of the contrast functions introduced in this paper?
Some extensions of the methods introduced in this paper are presented in [20], in which the problem of noisy data is addressed, and in [22], which deals with the situation where there are more independent components than observed variables.
Q9. How does the optimal contrast function depend on the density of the independent components?
This implies roughly that for super-Gaussian (respectively, sub-Gaussian) densities, the optimal contrast function is a function that grows slower than quadratically (respectively, faster than quadratically).
Q10. What kinds of applications have been performed using the contrast functions and algorithms introduced in this paper?
These applications include artifact cancellation in EEG and MEG [36], [37], decomposition of evoked fields in MEG [38], and feature extraction of image data [25], [35].
Q11. How many iterations were necessary to achieve the maximum accuracy?
The authors observed that for all three contrast functions, only three iterations were necessary, on average, to achieve the maximum accuracy allowed by the data.
Q12. How can the authors compute the whole matrix in (1)?
Using the approach of minimizing mutual information, the above one-unit contrast function can be simply extended to compute the whole matrix $\mathbf{W}$ in (1).
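To make this extension concrete, here is a self-contained end-to-end sketch (my own illustration, not the paper's code): two sources are mixed, the data is sphered, and the whole matrix is estimated with the symmetric fixed-point iteration and the log cosh nonlinearity.

```python
# End-to-end: mix, sphere, and unmix with symmetric fixed-point ICA.
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
S = np.vstack([rng.laplace(size=n), rng.uniform(-1, 1, size=n)])  # super-/sub-Gaussian
S = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)
X = rng.standard_normal((2, 2)) @ S                # observed mixtures x = As

d, E = np.linalg.eigh(np.cov(X))                   # whitening: E{zz^T} = I
Z = E @ np.diag(d**-0.5) @ E.T @ X

def sym_decorrelate(W):
    vals, vecs = np.linalg.eigh(W @ W.T)
    return vecs @ np.diag(vals**-0.5) @ vecs.T @ W

W = sym_decorrelate(rng.standard_normal((2, 2)))
for _ in range(100):
    gy = np.tanh(W @ Z)                            # g = tanh (log cosh contrast)
    W_new = sym_decorrelate(gy @ Z.T / n - np.diag((1 - gy**2).mean(axis=1)) @ W)
    done = np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1)) < 1e-8
    W = W_new
    if done:
        break

Y = W @ Z   # recovered sources, up to sign and permutation
```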