Journal ArticleDOI

Independent component analysis, a new concept?

01 Apr 1994 - Signal Processing (Elsevier North-Holland, Inc.) - Vol. 36, Iss. 3, pp. 287-314
TL;DR: An efficient algorithm is proposed, which allows the computation of the ICA of a data matrix in polynomial time and may actually be seen as an extension of principal component analysis (PCA).
About: This article is published in Signal Processing. The article was published on 1994-04-01 and is currently open access. It has received 8522 citations to date. The article focuses on the topics: Independent component analysis & FastICA.

Summary (5 min read)

1.1. Problem description

  • This paper attempts to provide a precise definition of ICA within an applicable mathematical framework.
  • It is envisaged that this definition will provide a baseline for further development and application of the ICA concept.
  • Assume the following linear statistical model: y = Mx + v, (1.1a) where x, y and v are random vectors with values in ℝ or ℂ and with zero mean and finite covariance, M is a rectangular matrix with at most as many columns as rows, and the vector x has statistically independent components (a small data-generation sketch follows this list).
  • The qualifiers 'blind' or 'myopic' are often used when only the outputs of the system considered are observed; in this framework, the authors are thus dealing with the problem of blind identification of a linear static system.
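As a concrete illustration of model (1.1a), the sketch below draws zero-mean, non-Gaussian, statistically independent sources and mixes them through a rectangular matrix with additive noise. The dimensions, source distribution and noise level are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N, p, T = 4, 3, 10_000            # sensors, sources, samples (illustrative)
M = rng.standard_normal((N, p))   # unknown mixing matrix (at least as many rows as columns)

# Independent, zero-mean, non-Gaussian sources: uniform on [-sqrt(3), sqrt(3)]
# has unit variance.  v is a small additive noise term.
x = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(p, T))
v = 0.1 * rng.standard_normal((N, T))

y = M @ x + v                     # observations following y = Mx + v (1.1a)
```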

1.2. Organization of the paper

  • Related works, possible applications and preliminary observations regarding the problem statement are surveyed.
  • In Section 2, general results on statistical independence are stated, and are then utilized in Section 3 to derive optimization criteria.
  • The properties of these criteria are also investigated in Section 3.
  • Simulation results are then presented in Section 5.

1.4. Applications

  • The first column of M contains the successive samples of the impulse response of the corresponding causal filter.
  • In antenna array processing, ICA might be utilized in at least two instances: firstly, the estimation of radiating sources from unknown arrays (necessarily without localization).
  • Further, ICA can be utilized in the identification of multichannel ARMA processes when the input is not observed and, in particular, for estimating the first coefficient of the model [12, 15] .
  • On the other hand, ICA can be used as a data preprocessing tool before Bayesian detection and classification.

1.5. Preliminary observations

  • This latter indetermination cannot be reduced further without additional assumptions.
  • The asterisk denotes transposition, and complex conjugation whenever the quantity is complex (mainly the real case will be considered in this paper).
  • The purpose of the next section will be to define such contrast functions.
  • Since multiplying random variables by non-zero scalar factors or changing their order does not affect their statistical independence, ICA actually defines an equivalence class of decompositions rather than a single one, so that the property below holds.
  • The definition of contrast functions will thus have to take this indeterminacy into account.

PROPERTY 2.

  • For the sake of clarity, let us also recall the definition of PCA below.
  • Note that two different pairs can be PCAs of the same random variable y, but they are then related as in Property 2.
  • Thus, PCA has exactly the same inherent indeterminations as ICA, so that the authors may assume the same additional arbitrary constraints [15] .
  • With constraint (c), the matrix F defined in the PCA decomposition (Definition 3) is unitary.

2.1. Mutual information

  • In statistics, the large class of f-divergences is of key importance among the possible distance measures available [4] .
  • In these measures the roles played by both densities are not always symmetric, so that the authors are not dealing with proper distances.
  • The Kullback divergence between two densities p_z and p'_z is defined as K(p_z, p'_z) = ∫ p_z(u) log[p_z(u)/p'_z(u)] du; it is non-negative, with equality if and only if p_z(u) = p'_z(u) almost everywhere.
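As a minimal numerical sketch of this quantity, the Kullback divergence between two densities can be approximated on a regular grid. The grid, the helper name and the Gaussian example below are arbitrary choices for illustration.

```python
import numpy as np

def kullback_divergence(p, q, du):
    """Approximate K(p, q) = integral p(u) log(p(u)/q(u)) du for densities
    sampled on a common grid with spacing du.  The result is non-negative
    and vanishes only when p and q coincide (almost everywhere)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])) * du)

# Example: two unit-variance Gaussians with different means.
u = np.linspace(-8.0, 8.0, 4001)
du = u[1] - u[0]
gauss = lambda m: np.exp(-0.5 * (u - m) ** 2) / np.sqrt(2 * np.pi)
print(kullback_divergence(gauss(0.0), gauss(1.0), du))   # ~0.5, i.e. (delta mean)^2 / 2
```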

2.2. Standardization

  • The standardization procedure proposed here fulfils these two tasks, and may be viewed as a mere PCA.
  • Matrix L could be defined by any square-root decomposition, such as the Cholesky factorization, or a decomposition based on the eigenvalue decomposition (EVD), for instance.
  • But their preference goes to the EVD because it allows one to handle easily the case of singular or ill-conditioned covariance matrices by projection (a small sketch follows this list).
  • Without loss of generality, the authors may thus consider in the remainder that the observed variable is standardized (zero mean and identity covariance); they additionally assume that N > 1, since otherwise the observation is scalar and the problem does not exist.
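A small sketch of this EVD-based standardization, assuming the data are stored as an N × T matrix (one observation per column); the rank threshold is an arbitrary choice used to drop near-null eigenvalues, which implements the 'projection' handling of singular or ill-conditioned covariances.

```python
import numpy as np

def standardize(Y, tol=1e-10):
    """Whiten data Y (N x T) via the EVD of its sample covariance.

    Eigenvalues below tol * (largest eigenvalue) are discarded, projecting
    the data onto the signal subspace.  Returns the standardized data Z and
    the transform L such that Z = L @ (Y - mean) has identity covariance.
    """
    Yc = Y - Y.mean(axis=1, keepdims=True)
    V = Yc @ Yc.T / Yc.shape[1]            # sample covariance matrix
    lam, U = np.linalg.eigh(V)             # EVD, eigenvalues in ascending order
    keep = lam > tol * lam.max()
    L = np.diag(lam[keep] ** -0.5) @ U[:, keep].T
    return L @ Yc, L
```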

2.3. Negentropy, a measure of distance to normality

  • This relation gives a means of approximating the mutual information, provided the authors are able to approximate the negentropy about zero, which amounts to expanding the density p_x in the neighbourhood of the Gaussian density φ_x.
  • In the first step, a transform will cancel the last term of (2.14).
  • It can be shown that this is equivalent to standardizing the data. Among the densities having a given covariance matrix V, the Gaussian density φ_x(u) = (2π)^{-N/2} |V|^{-1/2} exp{-u* V^{-1} u / 2} (2.10) is the one which has the largest entropy.
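For reference, the quantities involved can be written as follows; these are standard identities, with symbols chosen here for readability rather than copied from the paper. The negentropy J is the Kullback divergence to the Gaussian density φ_x with the same covariance, and the mutual information decomposes so that the last (correlation) term is exactly the one cancelled by standardization.

```latex
J(p_x) = K(p_x,\varphi_x), \qquad
\varphi_x(u) = (2\pi)^{-N/2}\,\lvert V\rvert^{-1/2}\exp\{-u^{*}V^{-1}u/2\},

I(p_x) = J(p_x) - \sum_{i=1}^{N} J(p_{x_i})
       + \tfrac{1}{2}\,\log\frac{\prod_{i} V_{ii}}{\det V}.
```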

2.4. Measures of statistical dependence

  • The authors have shown that both the Gaussian feature and the mutual independence can be characterized with the help of negentropy.
  • Yet, these remarks justify only in part the use of (2.14) as an optimization criterion in their problem.
  • In fact, from Property 2, this criterion should meet the requirements given below.

THEOREM 7. The following mapping is a contrast over the set of standardized random vectors: Υ(p_x) = -I(p_x).

  • The criterion proposed in Theorem 7 is consequently admissible for ICA computation.
  • This theoretical criterion, involving a generally unknown density, will be made usable by approximations in Section 3.
  • Regarding computational load, the calculation of ICA may still be very heavy even after approximations, and the authors now turn to a theorem that theoretically explains why the practical algorithm designed in Section 4, which proceeds pairwise, indeed works.
  • The following two lemmas are utilized in the proof of Theorem 10, and are reproduced below because of their interesting content.

THEOREM 10. Let x and z be two random vectors such that z = Bx, B being a given rectangular matrix. Suppose additionally that x has independent components and that z has pairwise independent components. If B has two non-zero entries in the same column j, then x_j is either Gaussian or deterministic.

  • Now the authors are in a position to state a theorem, from which two important corollaries can be deduced (see Appendices A.7-A.9 for proofs).
  • Then the following three properties are equivalent: (i) The components zi are pairwise independent.
  • (ii) The components zi are mutually independent.
  • This last corollary is actually stating identifiability conditions of the noiseless problem.
  • For processes with unit variance, the authors find indeed that M = FDP in the corollary above.

3.1. Approximation of the mutual information

  • In practice, the densities p_x and p_y are not known, so that criterion (3.1) cannot be utilized directly.
  • The aim of this section is to express the contrast in Theorem 7 as a function of the standardized cumulants of orders 3 and 4, which are quantities far easier to estimate from data (a small estimation sketch follows this list).
  • The expression of entropy and negentropy in the scalar case will be first briefly derived.
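For zero-mean, unit-variance (standardized) components, the third- and fourth-order standardized cumulants reduce to the skewness and the excess kurtosis. The sketch below uses plain moment estimates rather than the k-statistics mentioned later in the paper; the function name is hypothetical.

```python
import numpy as np

def marginal_cumulants(z):
    """Standardized cumulants of orders 3 and 4 for each row of z (N x T)."""
    zc = z - z.mean(axis=1, keepdims=True)
    zc = zc / zc.std(axis=1, keepdims=True)
    k3 = (zc ** 3).mean(axis=1)         # skewness (third-order cumulant)
    k4 = (zc ** 4).mean(axis=1) - 3.0   # excess kurtosis (fourth-order cumulant)
    return k3, k4
```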

3.2. Simpler criteria

  • These criteria are justified in their own right and do not rest on the validity of the Edgeworth expansion; in other words, it is not strictly necessary that they approximate the mutual information.
  • These contrast functions are generally less discriminating than (3.1) (i.e. discriminating over a smaller subset of random vectors).
  • Another consequence of this invariance is that if Υ(y) = Υ(x), where x has uncorrelated components at the orders involved in the definition of Υ, then y also has independent components in the same sense.
  • The interest in maximizing the contrast of Theorem 16 rather than minimizing the cross-cumulants lies essentially in the fact that only N marginal cumulants are involved instead of O(N^4) cross-cumulants (a short sketch follows).
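A sketch of one such simpler criterion, taken here to be the sum of squared fourth-order marginal cumulants of the standardized components; this particular choice is an assumption for illustration, representative of the polynomial contrasts discussed in the paper.

```python
import numpy as np

def contrast(z):
    """Sum of squared fourth-order marginal cumulants of standardized z (N x T)."""
    zc = z - z.mean(axis=1, keepdims=True)
    zc = zc / zc.std(axis=1, keepdims=True)
    k4 = (zc ** 4).mean(axis=1) - 3.0
    return float(np.sum(k4 ** 2))
```

Only the N marginal cumulants enter this criterion, whereas a direct check of independence would involve the full set of cross-cumulants.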

4.1. Pairwise processing

  • The proof of Theorem 11 did not involve the contrast expression, but is valid only when the observation y effectively stems linearly from a random variable x with independent components (model (1.1) in the noiseless case).
  • The same statement holds true for the second-order condition d²Υ < 0.
  • The theorem says that pairwise independence is sufficient, provided a contrast of polynomial form is utilized.
  • The proposed algorithm processes each pair in turn, similarly to the Jacobi algorithm for the diagonalization of symmetric real matrices.
  • The maximization of (4.2) with respect to the rotation angle leads to a polynomial rooting which does not raise any difficulty, since the polynomial is of low degree (there even exists an analytical solution, since the degree is strictly smaller than 5); a minimal sketch of this pairwise strategy follows this list.
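A minimal sketch of the pairwise (Jacobi-like) strategy on standardized data. For simplicity, the best rotation angle of each pair is found here by a coarse numerical search over ]-π/4, π/4] instead of the analytical polynomial rooting used in the paper, and the pair contrast is the sum of squared fourth-order cumulants; both choices are illustrative assumptions.

```python
import numpy as np

def pair_contrast(z2):
    """Sum of squared fourth-order marginal cumulants of a 2 x T block."""
    zc = z2 - z2.mean(axis=1, keepdims=True)
    zc = zc / zc.std(axis=1, keepdims=True)
    return float(np.sum(((zc ** 4).mean(axis=1) - 3.0) ** 2))

def pairwise_ica(Z, n_sweeps=5, n_angles=181):
    """Sweep over all pairs of components with plane rotations (Jacobi-like)."""
    Z = Z.copy()
    N = Z.shape[0]
    F = np.eye(N)                                     # accumulated rotation
    angles = np.linspace(-np.pi / 4, np.pi / 4, n_angles)
    for _ in range(n_sweeps):
        for i in range(N - 1):
            for j in range(i + 1, N):
                # pick the angle giving the largest pair contrast
                best = max(angles, key=lambda a: pair_contrast(
                    np.array([[np.cos(a), np.sin(a)],
                              [-np.sin(a), np.cos(a)]]) @ Z[[i, j]]))
                c, s = np.cos(best), np.sin(best)
                R = np.array([[c, s], [-s, c]])
                Z[[i, j]] = R @ Z[[i, j]]             # rotate the two components
                F[[i, j]] = R @ F[[i, j]]             # keep track of the product
    return Z, F
```

With the hypothetical standardize helper sketched earlier, a typical call would be Z_hat, F = pairwise_ica(standardize(y)[0]); the overall separating transform is then the product of F with the whitening matrix.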

4.2. Convergence

  • It is easy to show [15] that Algorithm 18 is such that the global contrast function Υ(Q) increases monotonically as more and more iterations are run (a brief empirical check follows this list).
  • Since this real positive sequence is bounded above, it must converge to a maximum.
  • In fact, when the number of sweeps, k, reaches this value, it has been observed that the angles of all plane rotations within the last sweep were very small and that the contrast function had reached its stationary value.
  • In [52], the deconvolution algorithm proposed has been shown to suffer from spurious maxima in the noisy case.
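This monotone behaviour can be checked empirically with the hypothetical contrast and pairwise_ica helpers sketched above, by running one sweep at a time and recording the global contrast; the tolerance only absorbs floating-point error.

```python
# Reuses the hypothetical contrast() and pairwise_ica() sketches given earlier.
def check_monotone(Z, n_sweeps=5):
    values = [contrast(Z)]
    for _ in range(n_sweeps):
        Z, _ = pairwise_ica(Z, n_sweeps=1)
        values.append(contrast(Z))
    assert all(b >= a - 1e-9 for a, b in zip(values, values[1:]))
    return values
```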

4.3. Computational complexity

  • When evaluating the complexity of Algorithm 18, the authors count the number of floating-point operations (flops).
  • The fastest version requires O(N^5) flops, if all cumulants of the observations are available and if N sources are present with a high signal-to-noise ratio.
  • On the other hand, there are O(N^4/24) different cumulants in this 'supersymmetric' tensor (see the count below).
  • Lastly, one could think of using the multilinearity relation (3.8) to calculate the five cumulants required in step 4(a).
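As a quick check of the N^4/24 order of magnitude: the distinct entries of a symmetric ('supersymmetric') fourth-order cumulant tensor are indexed by multisets of four indices drawn from N, so their number is

```latex
\binom{N+3}{4} \;=\; \frac{N(N+1)(N+2)(N+3)}{24} \;\approx\; \frac{N^{4}}{24}.
```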

5.2. Behaviour in the presence of non-Gaussian noise when N = 2

  • Consider the observations in dimension N for realization t, of the form y(t) = Mx(t) + ρβw(t), where β denotes the largest singular value of M.
  • The scalar ρ is then used to define the signal to noise ratio.
  • The exact value of ρ for which the ICA remains interesting depends on the integration length T.
  • In fact, the shorter T is, the larger the apparent noise, since estimation errors are also seen by the algorithm as extraneous noise.
  • For larger values of ρ, the ICA algorithm considers the vector Mx as noise and the vector ρβw as the source vector.

5.3. Behaviour in the noiseless case when N > 2

  • For this purpose, consider now 10 independent sources.
  • To see the influence of this ordering without modifying the code, the authors have changed the order of the source kurtoses.
  • Observe that the gap ε is not necessarily monotonically decreasing, whereas the contrast always is, by construction of the algorithm.
  • In the latter algorithm, it is well known that the convergence can be speeded up by processing the pair for which non-diagonal terms are the largest.
  • Even if this strategy could also be applied to the diagonalization of the tensor cumulant, the computation of all pairwise cross-cumulants at each iteration is necessary.

5.4. Behaviour in the presence of noise when N > 2

  • The components of x and w are again independent random variables, identically and uniformly distributed in [-√3, √3], as in Section 5.2 (see the variance check below).
  • The continuous bottom solid curve in Fig. 7 shows the performances that would be obtained for infinite integration length.
  • Other simulation results related to ICA are also reported in [15] .
  • Simulation results demonstrate convergence and robustness of the algorithm, even in the presence of non-Gaussian noise (up to 0 dB for a noise having the same statistics as the sources), and with limited integration lengths.
  • It has not been proved that the proposed algorithm cannot get stuck at a local maximum, but this has never been observed to happen.
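The interval [-√3, √3] is chosen so that the uniform sources have unit variance, since for a uniform density on [-a, a]:

```latex
\operatorname{Var} = \frac{(2a)^{2}}{12} = \frac{a^{2}}{3},
\qquad a = \sqrt{3} \;\Rightarrow\; \operatorname{Var} = 1 .
```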

6. Conclusions

  • The definition of ICA given within the framework of this paper depends on a contrast function that serves as a rotation selection criterion.
  • One of the contrasts proposed is built from the mutual information of standardized observations.
  • For practical purposes this contrast is approximated by the Edgeworth expansion of the mutual information, and consists of a combination of third- and fourth-order marginal cumulants.
  • Denote by an asterisk transposition and complex conjugation.
  • Thus, denote by Λ² this covariance matrix, where Λ is diagonal and real positive (since C has full column rank, Λ has exactly N - p null entries).


Citations
Book
18 Nov 2016
TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Journal ArticleDOI
TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, reviews deep supervised learning, unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

14,635 citations


Cites background from "Independent component analysis, a n..."

  • ...…signals (e.g., Andrade, Chacon, Merelo, & Moran, 1993; Bell & Sejnowski, 1995; Belouchrani, Abed-Meraim, Cardoso, & Moulines, 1997; Cardoso, 1994; Comon, 1994; Hyvärinen, Karhunen, & Oja, 2001; Jutten & Herault, 1991; Karhunen & Joutsensalo, 1995; Molgedey & Schuster, 1994; Schuster, 1992; Shan…...


  • ...Many do this to uncover and disentangle hidden underlying sources of signals (e.g., Jutten and Herault, 1991; Schuster, 1992; Andrade et al., 1993; Molgedey and Schuster, 1994; Comon, 1994; Cardoso, 1994; Bell and Sejnowski, 1995; Karhunen and Joutsensalo, 1995; Belouchrani et al., 1997; Hyvärinen et al., 2001; Szabó et al., 2006; Shan et al., 2007; Shan and Cottrell, 2014)....


Journal ArticleDOI
22 Dec 2000-Science
TL;DR: An approach to solving dimensionality reduction problems that uses easily measured local metric information to learn the underlying global geometry of a data set and efficiently computes a globally optimal solution, and is guaranteed to converge asymptotically to the true structure.
Abstract: Scientists working with large volumes of high-dimensional data, such as global climate patterns, stellar spectra, or human gene distributions, regularly confront the problem of dimensionality reduction: finding meaningful low-dimensional structures hidden in their high-dimensional observations. The human brain confronts the same problem in everyday perception, extracting from its high-dimensional sensory inputs-30,000 auditory nerve fibers or 10(6) optic nerve fibers-a manageably small number of perceptually relevant features. Here we describe an approach to solving dimensionality reduction problems that uses easily measured local metric information to learn the underlying global geometry of a data set. Unlike classical techniques such as principal component analysis (PCA) and multidimensional scaling (MDS), our approach is capable of discovering the nonlinear degrees of freedom that underlie complex natural observations, such as human handwriting or images of a face under different viewing conditions. In contrast to previous algorithms for nonlinear dimensionality reduction, ours efficiently computes a globally optimal solution, and, for an important class of data manifolds, is guaranteed to converge asymptotically to the true structure.

13,652 citations

Journal ArticleDOI
TL;DR: Recent work in the area of unsupervised feature learning and deep learning is reviewed, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks.
Abstract: The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.

11,201 citations


Cites background from "Independent component analysis, a n..."

  • ...2012). Another rich family of feature extraction techniques, that this review does not cover in any detail due to space constraints is Independent Component Analysis or ICA (Jutten and Herault, 1991; Comon, 1994; Bell and Sejnowski, 1997). Instead, we refer the reader to Hyvärinen et al. (2001a); Hyvärinen et al. (2009). Note that, while in the simplest case (complete, noise-free) ICA yields linear feature...


Journal ArticleDOI
TL;DR: This work considers the problem of automatically recognizing human faces from frontal views with varying expression and illumination, as well as occlusion and disguise, and proposes a general classification algorithm for (image-based) object recognition based on a sparse representation computed by ℓ1-minimization.
Abstract: We consider the problem of automatically recognizing human faces from frontal views with varying expression and illumination, as well as occlusion and disguise. We cast the recognition problem as one of classifying among multiple linear regression models and argue that new theory from sparse signal representation offers the key to addressing this problem. Based on a sparse representation computed by ℓ1-minimization, we propose a general classification algorithm for (image-based) object recognition. This new framework provides new insights into two crucial issues in face recognition: feature extraction and robustness to occlusion. For feature extraction, we show that if sparsity in the recognition problem is properly harnessed, the choice of features is no longer critical. What is critical, however, is whether the number of features is sufficiently large and whether the sparse representation is correctly computed. Unconventional features such as downsampled images and random projections perform just as well as conventional features such as eigenfaces and Laplacianfaces, as long as the dimension of the feature space surpasses certain threshold, predicted by the theory of sparse representation. This framework can handle errors due to occlusion and corruption uniformly by exploiting the fact that these errors are often sparse with respect to the standard (pixel) basis. The theory of sparse representation helps predict how much occlusion the recognition algorithm can handle and how to choose the training images to maximize robustness to occlusion. We conduct extensive experiments on publicly available databases to verify the efficacy of the proposed algorithm and corroborate the above claims.

9,658 citations


Cites background or methods from "Independent component analysis, a n..."

  • ...This problem should be solved in polynomial time by standard linear programming methods [10]....


  • ...The other algorithm is Independent Component Analysis (ICA) [10]....


References
Journal ArticleDOI
TL;DR: A method is described for the minimization of a function of n variables, which depends on the comparison of function values at the (n + 1) vertices of a general simplex, followed by the replacement of the vertex with the highest value by another point.
Abstract: A method is described for the minimization of a function of n variables, which depends on the comparison of function values at the (n + 1) vertices of a general simplex, followed by the replacement of the vertex with the highest value by another point. The simplex adapts itself to the local landscape, and contracts on to the final minimum. The method is shown to be effective and computationally compact. A procedure is given for the estimation of the Hessian matrix in the neighbourhood of the minimum, needed in statistical estimation problems.

27,271 citations

Book
01 May 1981
TL;DR: This book will be most useful to applied mathematicians, communication engineers, signal processors, statisticians, and time series researchers, both applied and theoretical.
Abstract: This book will be most useful to applied mathematicians, communication engineers, signal processors, statisticians, and time series researchers, both applied and theoretical. Readers should have some background in complex function theory and matrix algebra and should have successfully completed the equivalent of an upper division course in statistics.

3,231 citations

Journal ArticleDOI
TL;DR: A new concept, that of INdependent Components Analysis (INCA), more powerful than the classical Principal components Analysis (in decision tasks) emerges from this work.

2,583 citations


"Independent component analysis, a n..." refers background or result in this paper

  • ...The calculation of ICA was discussed in several recent papers [8, 16, 30, 36, 37, 61], where the problem was given various names....


  • ...Refer to [37] and other papers in the same issue, and to [27]....


Journal ArticleDOI
01 Oct 1943-Nature
TL;DR: The Advanced Theory of Statistics by Maurice G. Kendall as discussed by the authors is a very handsomely produced volume which is one which it will be a pleasure to any mathematical statistician to possess.
Abstract: THIS very handsomely produced volume is one which it will be a pleasure to any mathematical statistician to possess. Mr. Kendall is indeed to be congratulated on the energy and unswerving perseverance needed to complete his heavy task, and encouraged in the still unflagging energy which will be needed for the second volume. So far as he has carried his work, he has certainly done something to sustain the credit of Great Britain in mathematical scholarship. The Advanced Theory of Statistics By Maurice G. Kendall. Vol. 1. Pp. xii + 457. (London: Charles Griffin and Co., Ltd., 1943.) 42s. net.

1,980 citations


"Independent component analysis, a n..." refers background in this paper

  • ...See [39, 40] for general remarks on pdf expansions....


  • For each pair: (a) estimate the required cumulants of the pair (Z_i, Z_j), by resorting to k-statistics for instance [39, 44]; (b) find the angle α maximizing the contrast after the plane rotation Q^(i,j) of angle α, α ∈ ]-π/4, π/4], in the plane defined by components {i, j}; (c) accumulate F := F Q^(i,j)*; (e) update Z := Q^(i,j) Z....
