Journal ArticleDOI

Independent component analysis, a new concept?

01 Apr 1994 - Signal Processing (Elsevier North-Holland, Inc.) - Vol. 36, Iss. 3, pp. 287-314
TL;DR: An efficient algorithm is proposed, which allows the computation of the ICA of a data matrix in polynomial time and may actually be seen as an extension of principal component analysis (PCA).
About: This article is published in Signal Processing. The article was published on 1994-04-01 and is currently open access. It has received 8522 citations to date. The article focuses on the topics: Independent component analysis & FastICA.

Summary (5 min read)

1.1. Problem description

  • This paper attempts to provide a precise definition of ICA within an applicable mathematical framework.
  • It is envisaged that this definition will provide a baseline for further development and application of the ICA concept.
  • Assume the following linear statistical model: y = Mx + v, (1.1a) where x, y and v are random vectors with values in ℝ or ℂ and with zero mean and finite covariance, M is a rectangular matrix with at most as many columns as rows, and the vector x has statistically independent components (a small data-generation sketch follows this list).
  • The qualifiers 'blind' or 'myopic' are often used when only the outputs of the system considered are observed; in this framework, the authors are thus dealing with the problem of blind identification of a linear static system.
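As a concrete illustration of model (1.1a), the sketch below draws zero-mean, non-Gaussian, statistically independent sources and mixes them through a rectangular matrix with additive noise. The dimensions, source distribution and noise level are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N, p, T = 4, 3, 10_000            # sensors, sources, samples (illustrative)
M = rng.standard_normal((N, p))   # unknown mixing matrix (at least as many rows as columns)

# Independent, zero-mean, non-Gaussian sources: uniform on [-sqrt(3), sqrt(3)]
# has unit variance.  v is a small additive noise term.
x = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(p, T))
v = 0.1 * rng.standard_normal((N, T))

y = M @ x + v                     # observations following y = Mx + v (1.1a)
```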

1.2. Organization of the paper

  • Related works, possible applications and preliminary observations regarding the problem statement are surveyed.
  • In Section 2, general results on statistical independence are stated, and are then utilized in Section 3 to derive optimization criteria.
  • The properties of these criteria are also investigated in Section 3.
  • Simulation results are then presented in Section 5.

1.4. Applications

  • The first column of M contains the successive samples of the impulse response of the corresponding causal filter.
  • In antenna array processing, ICA might be utilized in at least two instances: firstly, the estimation of radiating sources from unknown arrays (necessarily without localization).
  • Further, ICA can be utilized in the identification of multichannel ARMA processes when the input is not observed and, in particular, for estimating the first coefficient of the model [12, 15] .
  • On the other hand, ICA can be used as a data preprocessing tool before Bayesian detection and classification.

1.5. Preliminary observations

  • This latter indetermination cannot be reduced further without additional assumptions.
  • The asterisk denotes transposition, and complex conjugation whenever the quantity is complex (mainly the real case will be considered in this paper).
  • The purpose of the next section will be to define such contrast functions.
  • Since multiplying random variables by non-zero scalar factors or changing their order does not affect their statistical independence, ICA actually defines an equivalence class of decompositions rather than a single one, so that the property below holds.
  • The definition of contrast functions will thus have to take this indeterminacy into account.

PROPERTY 2.

  • For the sake of clarity, let us also recall the definition of PCA below.
  • Note that two different pairs can be PCAs of the same random variable y, but they are then related as in Property 2.
  • Thus, PCA has exactly the same inherent indeterminations as ICA, so that the authors may assume the same additional arbitrary constraints [15] .
  • With constraint (c), the matrix F defined in the PCA decomposition (Definition 3) is unitary.

2.1. Mutual information

  • In statistics, the large class of f-divergences is of key importance among the possible distance measures available [4] .
  • In these measures the roles played by both densities are not always symmetric, so that the authors are not dealing with proper distances.
  • The Kullback divergence between two densities p_z and p'_z is defined as K(p_z, p'_z) = ∫ p_z(u) log[p_z(u)/p'_z(u)] du; it is non-negative, with equality if and only if p_z(u) = p'_z(u) almost everywhere.
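As a minimal numerical sketch of this quantity, the Kullback divergence between two densities can be approximated on a regular grid. The grid, the helper name and the Gaussian example below are arbitrary choices for illustration.

```python
import numpy as np

def kullback_divergence(p, q, du):
    """Approximate K(p, q) = integral p(u) log(p(u)/q(u)) du for densities
    sampled on a common grid with spacing du.  The result is non-negative
    and vanishes only when p and q coincide (almost everywhere)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])) * du)

# Example: two unit-variance Gaussians with different means.
u = np.linspace(-8.0, 8.0, 4001)
du = u[1] - u[0]
gauss = lambda m: np.exp(-0.5 * (u - m) ** 2) / np.sqrt(2 * np.pi)
print(kullback_divergence(gauss(0.0), gauss(1.0), du))   # ~0.5, i.e. (delta mean)^2 / 2
```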

2.2. Standardization

  • The standardization procedure proposed here fulfils these two tasks, and may be viewed as a mere PCA.
  • Matrix L could be defined by any square-root decomposition, such as the Cholesky factorization, or a decomposition based on the eigenvalue decomposition (EVD), for instance.
  • But their preference goes to the EVD because it allows one to handle easily the case of singular or ill-conditioned covariance matrices by projection (a small sketch follows this list).
  • Without loss of generality, the authors may thus consider in the remainder that the observed variable is standardized (zero mean and identity covariance); they additionally assume that N > 1, since otherwise the observation is scalar and the problem does not exist.
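A small sketch of this EVD-based standardization, assuming the data are stored as an N × T matrix (one observation per column); the rank threshold is an arbitrary choice used to drop near-null eigenvalues, which implements the 'projection' handling of singular or ill-conditioned covariances.

```python
import numpy as np

def standardize(Y, tol=1e-10):
    """Whiten data Y (N x T) via the EVD of its sample covariance.

    Eigenvalues below tol * (largest eigenvalue) are discarded, projecting
    the data onto the signal subspace.  Returns the standardized data Z and
    the transform L such that Z = L @ (Y - mean) has identity covariance.
    """
    Yc = Y - Y.mean(axis=1, keepdims=True)
    V = Yc @ Yc.T / Yc.shape[1]            # sample covariance matrix
    lam, U = np.linalg.eigh(V)             # EVD, eigenvalues in ascending order
    keep = lam > tol * lam.max()
    L = np.diag(lam[keep] ** -0.5) @ U[:, keep].T
    return L @ Yc, L
```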

2.3. Negentropy, a measure of distance to normality

  • This relation gives a means of approximating the mutual information, provided the authors are able to approximate the negentropy about zero, which amounts to expanding the density p_x in the neighbourhood of the Gaussian density φ_x.
  • In the first step, a transform will cancel the last term of (2.14).
  • It can be shown that this is equivalent to standardizing the data. Among the densities having a given covariance matrix V, the Gaussian density φ_x(u) = (2π)^{-N/2} |V|^{-1/2} exp{-u* V^{-1} u / 2} (2.10) is the one which has the largest entropy.
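For reference, the quantities involved can be written as follows; these are standard identities, with symbols chosen here for readability rather than copied from the paper. The negentropy J is the Kullback divergence to the Gaussian density φ_x with the same covariance, and the mutual information decomposes so that the last (correlation) term is exactly the one cancelled by standardization.

```latex
J(p_x) = K(p_x,\varphi_x), \qquad
\varphi_x(u) = (2\pi)^{-N/2}\,\lvert V\rvert^{-1/2}\exp\{-u^{*}V^{-1}u/2\},

I(p_x) = J(p_x) - \sum_{i=1}^{N} J(p_{x_i})
       + \tfrac{1}{2}\,\log\frac{\prod_{i} V_{ii}}{\det V}.
```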

2.4. Measures of statistical dependence

  • The authors have shown that both the Gaussian feature and the mutual independence can be characterized with the help of negentropy.
  • Yet, these remarks justify only in part the use of (2.14) as an optimization criterion in their problem.
  • In fact, from Property 2, this criterion should meet the requirements given below.

THEOREM 7. The following mapping is a contrast over the set of standardized random vectors: Υ(p_x) = -I(p_x).

  • The criterion proposed in Theorem 7 is consequently admissible for ICA computation.
  • This theoretical criterion, involving a generally unknown density, will be made usable by approximations in Section 3.
  • Regarding computational load, the calculation of ICA may still be very heavy even after approximations, and the authors now turn to a theorem that theoretically explains why the practical algorithm designed in Section 4, which proceeds pairwise, indeed works.
  • The following two lemmas are utilized in the proof of Theorem 10, and are reproduced below because of their interesting content.

THEOREM 10. Let x and z be two random vectors such that z = Bx, B being a given rectangular matrix. Suppose additionally that x has independent components and that z has pairwise independent components. If B has two non-zero entries in the same column j, then x_j is either Gaussian or deterministic.

  • Now the authors are in a position to state a theorem, from which two important corollaries can be deduced (see Appendices A.7-A.9 for proofs).
  • Then the following three properties are equivalent: (i) The components zi are pairwise independent.
  • (ii) The components zi are mutually independent.
  • This last corollary is actually stating identifiability conditions of the noiseless problem.
  • For processes with unit variance, the authors find indeed that M = FDP in the corollary above.

3.1. Approximation of the mutual information

  • In practice, the densities p_x and p_y are not known, so that criterion (3.1) cannot be utilized directly.
  • The aim of this section is to express the contrast in Theorem 7 as a function of the standardized cumulants of orders 3 and 4, which are quantities far easier to estimate from data (a small estimation sketch follows this list).
  • The expression of entropy and negentropy in the scalar case will be first briefly derived.
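For zero-mean, unit-variance (standardized) components, the third- and fourth-order standardized cumulants reduce to the skewness and the excess kurtosis. The sketch below uses plain moment estimates rather than the k-statistics mentioned later in the paper; the function name is hypothetical.

```python
import numpy as np

def marginal_cumulants(z):
    """Standardized cumulants of orders 3 and 4 for each row of z (N x T)."""
    zc = z - z.mean(axis=1, keepdims=True)
    zc = zc / zc.std(axis=1, keepdims=True)
    k3 = (zc ** 3).mean(axis=1)         # skewness (third-order cumulant)
    k4 = (zc ** 4).mean(axis=1) - 3.0   # excess kurtosis (fourth-order cumulant)
    return k3, k4
```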

3.2. Simpler criteria

  • These criteria are justified in their own right and do not rest on the validity of the Edgeworth expansion; in other words, it is not strictly necessary that they approximate the mutual information.
  • These contrast functions are generally less discriminating than (3.1) (i.e. discriminating over a smaller subset of random vectors).
  • Another consequence of this invariance is that if Υ(y) = Υ(x), where x has uncorrelated components at the orders involved in the definition of Υ, then y also has independent components in the same sense.
  • The interest in maximizing the contrast of Theorem 16 rather than minimizing the cross-cumulants lies essentially in the fact that only N marginal cumulants are involved instead of O(N^4) cross-cumulants (a short sketch follows).
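A sketch of one such simpler criterion, taken here to be the sum of squared fourth-order marginal cumulants of the standardized components; this particular choice is an assumption for illustration, representative of the polynomial contrasts discussed in the paper.

```python
import numpy as np

def contrast(z):
    """Sum of squared fourth-order marginal cumulants of standardized z (N x T)."""
    zc = z - z.mean(axis=1, keepdims=True)
    zc = zc / zc.std(axis=1, keepdims=True)
    k4 = (zc ** 4).mean(axis=1) - 3.0
    return float(np.sum(k4 ** 2))
```

Only the N marginal cumulants enter this criterion, whereas a direct check of independence would involve the full set of cross-cumulants.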

4.1. Pairwise processing

  • The proof of Theorem 11 did not involve the contrast expression, but is valid only when the observation y effectively stems linearly from a random variable x with independent components (model (1.1) in the noiseless case).
  • The same statement holds true for the second-order condition d²Υ < 0.
  • The theorem says that pairwise independence is sufficient, provided a contrast of polynomial form is utilized.
  • The proposed algorithm processes each pair in turn, similarly to the Jacobi algorithm for the diagonalization of symmetric real matrices.
  • The maximization of (4.2) with respect to the rotation angle leads to a polynomial rooting which does not raise any difficulty, since the polynomial is of low degree (there even exists an analytical solution, since the degree is strictly smaller than 5); a minimal sketch of this pairwise strategy follows this list.
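A minimal sketch of the pairwise (Jacobi-like) strategy on standardized data. For simplicity, the best rotation angle of each pair is found here by a coarse numerical search over ]-π/4, π/4] instead of the analytical polynomial rooting used in the paper, and the pair contrast is the sum of squared fourth-order cumulants; both choices are illustrative assumptions.

```python
import numpy as np

def pair_contrast(z2):
    """Sum of squared fourth-order marginal cumulants of a 2 x T block."""
    zc = z2 - z2.mean(axis=1, keepdims=True)
    zc = zc / zc.std(axis=1, keepdims=True)
    return float(np.sum(((zc ** 4).mean(axis=1) - 3.0) ** 2))

def pairwise_ica(Z, n_sweeps=5, n_angles=181):
    """Sweep over all pairs of components with plane rotations (Jacobi-like)."""
    Z = Z.copy()
    N = Z.shape[0]
    F = np.eye(N)                                     # accumulated rotation
    angles = np.linspace(-np.pi / 4, np.pi / 4, n_angles)
    for _ in range(n_sweeps):
        for i in range(N - 1):
            for j in range(i + 1, N):
                # pick the angle giving the largest pair contrast
                best = max(angles, key=lambda a: pair_contrast(
                    np.array([[np.cos(a), np.sin(a)],
                              [-np.sin(a), np.cos(a)]]) @ Z[[i, j]]))
                c, s = np.cos(best), np.sin(best)
                R = np.array([[c, s], [-s, c]])
                Z[[i, j]] = R @ Z[[i, j]]             # rotate the two components
                F[[i, j]] = R @ F[[i, j]]             # keep track of the product
    return Z, F
```

With the hypothetical standardize helper sketched earlier, a typical call would be Z_hat, F = pairwise_ica(standardize(y)[0]); the overall separating transform is then the product of F with the whitening matrix.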

4.2. Convergence

  • It is easy to show [15] that Algorithm 18 is such that the global contrast function Υ(Q) increases monotonically as more and more iterations are run (a brief empirical check follows this list).
  • Since this real positive sequence is bounded above, it must converge to a maximum.
  • In fact, when the number of sweeps, k, reaches this value, it has been observed that the angles of all plane rotations within the last sweep were very small and that the contrast function had reached its stationary value.
  • In [52], the deconvolution algorithm proposed has been shown to suffer from spurious maxima in the noisy case.
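This monotone behaviour can be checked empirically with the hypothetical contrast and pairwise_ica helpers sketched above, by running one sweep at a time and recording the global contrast; the tolerance only absorbs floating-point error.

```python
# Reuses the hypothetical contrast() and pairwise_ica() sketches given earlier.
def check_monotone(Z, n_sweeps=5):
    values = [contrast(Z)]
    for _ in range(n_sweeps):
        Z, _ = pairwise_ica(Z, n_sweeps=1)
        values.append(contrast(Z))
    assert all(b >= a - 1e-9 for a, b in zip(values, values[1:]))
    return values
```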

4.3. Computational complexity

  • When evaluating the complexity of Algorithm 18, the authors count the number of floating-point operations (flops).
  • The fastest version requires O(N^5) flops, if all cumulants of the observations are available and if N sources are present with a high signal-to-noise ratio.
  • On the other hand, there are O(N^4/24) different cumulants in this 'supersymmetric' tensor (see the count below).
  • Lastly, one could think of using the multilinearity relation (3.8) to calculate the five cumulants required in step 4(a).
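As a quick check of the N^4/24 order of magnitude: the distinct entries of a symmetric ('supersymmetric') fourth-order cumulant tensor are indexed by multisets of four indices drawn from N, so their number is

```latex
\binom{N+3}{4} \;=\; \frac{N(N+1)(N+2)(N+3)}{24} \;\approx\; \frac{N^{4}}{24}.
```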

5.2. Behaviour in the presence of non-Gaussian noise when N = 2

  • Consider the observations in dimension N for realization t, of the form y(t) = Mx(t) + ρβw(t), where β denotes the largest singular value of M.
  • The scalar ρ is then used to define the signal to noise ratio.
  • The exact value of ρ for which the ICA remains interesting depends on the integration length T.
  • In fact, the shorter T is, the larger the apparent noise, since estimation errors are also seen by the algorithm as extraneous noise.
  • For larger values of ρ, the ICA algorithm considers the vector Mx as noise and the vector ρβw as the source vector.

5.3. Behaviour in the noiseless case when N > 2

  • For this purpose, consider now 10 independent sources.
  • To see the influence of this ordering without modifying the code, the authors have changed the order of the source kurtoses.
  • Observe that the gap ε is not necessarily monotonically decreasing, whereas the contrast always is, by construction of the algorithm.
  • In the latter algorithm, it is well known that the convergence can be speeded up by processing the pair for which non-diagonal terms are the largest.
  • Even if this strategy could also be applied to the diagonalization of the tensor cumulant, the computation of all pairwise cross-cumulants at each iteration is necessary.

5.4. Behaviour in the presence of noise when N > 2

  • The components of x and w are again independent random variables, identically and uniformly distributed in [-√3, √3], as in Section 5.2 (see the variance check below).
  • The continuous bottom solid curve in Fig. 7 shows the performances that would be obtained for infinite integration length.
  • Other simulation results related to ICA are also reported in [15] .
  • Simulation results demonstrate convergence and robustness of the algorithm, even in the presence of non-Gaussian noise (up to 0 dB for a noise having the same statistics as the sources), and with limited integration lengths.
  • It has not been proved that the proposed algorithm cannot get stuck at a local maximum, but this has never been observed to happen.
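The interval [-√3, √3] is chosen so that the uniform sources have unit variance, since for a uniform density on [-a, a]:

```latex
\operatorname{Var} = \frac{(2a)^{2}}{12} = \frac{a^{2}}{3},
\qquad a = \sqrt{3} \;\Rightarrow\; \operatorname{Var} = 1 .
```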

6. Conclusions

  • The definition of ICA given within the framework of this paper depends on a contrast function that serves as a rotation selection criterion.
  • One of the contrasts proposed is built from the mutual information of standardized observations.
  • For practical purposes this contrast is approximated by the Edgeworth expansion of the mutual information, and consists of a combination of third- and fourth-order marginal cumulants.
  • Denote by an asterisk transposition and complex conjugation.
  • Thus, denote by Λ² this covariance matrix, where Λ is diagonal and real positive (since C has full column rank, Λ has exactly N - p null entries).


Citations
Book
18 Nov 2016
TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Journal ArticleDOI
TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, reviews deep supervised learning, unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

14,635 citations


Cites background from "Independent component analysis, a n..."

  • ...…signals (e.g., Andrade, Chacon, Merelo, & Moran, 1993; Bell & Sejnowski, 1995; Belouchrani, Abed-Meraim, Cardoso, & Moulines, 1997; Cardoso, 1994; Comon, 1994; Hyvärinen, Karhunen, & Oja, 2001; Jutten & Herault, 1991; Karhunen & Joutsensalo, 1995; Molgedey & Schuster, 1994; Schuster, 1992; Shan…...


  • ...Many do this to uncover and disentangle hidden underlying sources of signals (e.g., Jutten and Herault, 1991; Schuster, 1992; Andrade et al., 1993; Molgedey and Schuster, 1994; Comon, 1994; Cardoso, 1994; Bell and Sejnowski, 1995; Karhunen and Joutsensalo, 1995; Belouchrani et al., 1997; Hyvärinen et al., 2001; Szabó et al., 2006; Shan et al., 2007; Shan and Cottrell, 2014)....


Journal ArticleDOI
22 Dec 2000-Science
TL;DR: An approach to solving dimensionality reduction problems that uses easily measured local metric information to learn the underlying global geometry of a data set and efficiently computes a globally optimal solution, and is guaranteed to converge asymptotically to the true structure.
Abstract: Scientists working with large volumes of high-dimensional data, such as global climate patterns, stellar spectra, or human gene distributions, regularly confront the problem of dimensionality reduction: finding meaningful low-dimensional structures hidden in their high-dimensional observations. The human brain confronts the same problem in everyday perception, extracting from its high-dimensional sensory inputs-30,000 auditory nerve fibers or 10(6) optic nerve fibers-a manageably small number of perceptually relevant features. Here we describe an approach to solving dimensionality reduction problems that uses easily measured local metric information to learn the underlying global geometry of a data set. Unlike classical techniques such as principal component analysis (PCA) and multidimensional scaling (MDS), our approach is capable of discovering the nonlinear degrees of freedom that underlie complex natural observations, such as human handwriting or images of a face under different viewing conditions. In contrast to previous algorithms for nonlinear dimensionality reduction, ours efficiently computes a globally optimal solution, and, for an important class of data manifolds, is guaranteed to converge asymptotically to the true structure.

13,652 citations

Journal ArticleDOI
TL;DR: Recent work in the area of unsupervised feature learning and deep learning is reviewed, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks.
Abstract: The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.

11,201 citations


Cites background from "Independent component analysis, a n..."

  • ...2012). Another rich family of feature extraction techniques, that this review does not cover in any detail due to space constraints is Independent Component Analysis or ICA (Jutten and Herault, 1991; Comon, 1994; Bell and Sejnowski, 1997). Instead, we refer the reader to Hyvärinen et al. (2001a); Hyvärinen et al. (2009). Note that, while in the simplest case (complete, noise-free) ICA yields linear feature...


Journal ArticleDOI
TL;DR: This work considers the problem of automatically recognizing human faces from frontal views with varying expression and illumination, as well as occlusion and disguise, and proposes a general classification algorithm for (image-based) object recognition based on a sparse representation computed by ℓ1-minimization.
Abstract: We consider the problem of automatically recognizing human faces from frontal views with varying expression and illumination, as well as occlusion and disguise. We cast the recognition problem as one of classifying among multiple linear regression models and argue that new theory from sparse signal representation offers the key to addressing this problem. Based on a sparse representation computed by ℓ1-minimization, we propose a general classification algorithm for (image-based) object recognition. This new framework provides new insights into two crucial issues in face recognition: feature extraction and robustness to occlusion. For feature extraction, we show that if sparsity in the recognition problem is properly harnessed, the choice of features is no longer critical. What is critical, however, is whether the number of features is sufficiently large and whether the sparse representation is correctly computed. Unconventional features such as downsampled images and random projections perform just as well as conventional features such as eigenfaces and Laplacianfaces, as long as the dimension of the feature space surpasses certain threshold, predicted by the theory of sparse representation. This framework can handle errors due to occlusion and corruption uniformly by exploiting the fact that these errors are often sparse with respect to the standard (pixel) basis. The theory of sparse representation helps predict how much occlusion the recognition algorithm can handle and how to choose the training images to maximize robustness to occlusion. We conduct extensive experiments on publicly available databases to verify the efficacy of the proposed algorithm and corroborate the above claims.

9,658 citations


Cites background or methods from "Independent component analysis, a n..."

  • ...This problem should be solved in polynomial time by standard linear programming methods [10]....


  • ...The other algorithm is Independent Component Analysis (ICA) [10]....


References
Journal ArticleDOI
TL;DR: A method is described for the minimization of a function of n variables, which depends on the comparison of function values at the (n + 1) vertices of a general simplex, followed by the replacement of the vertex with the highest value by another point.
Abstract: A method is described for the minimization of a function of n variables, which depends on the comparison of function values at the (n + 1) vertices of a general simplex, followed by the replacement of the vertex with the highest value by another point. The simplex adapts itself to the local landscape, and contracts on to the final minimum. The method is shown to be effective and computationally compact. A procedure is given for the estimation of the Hessian matrix in the neighbourhood of the minimum, needed in statistical estimation problems.

27,271 citations

Book
01 May 1981
TL;DR: This book will be most useful to applied mathematicians, communication engineers, signal processors, statisticians, and time series researchers, both applied and theoretical.
Abstract: This book will be most useful to applied mathematicians, communication engineers, signal processors, statisticians, and time series researchers, both applied and theoretical. Readers should have some background in complex function theory and matrix algebra and should have successfully completed the equivalent of an upper division course in statistics.

3,231 citations

Journal ArticleDOI
TL;DR: A new concept, that of INdependent Components Analysis (INCA), more powerful than the classical Principal components Analysis (in decision tasks) emerges from this work.

2,583 citations


"Independent component analysis, a n..." refers background or result in this paper

  • ...The calculation of ICA was discussed in several recent papers [8, 16, 30, 36, 37, 61], where the problem was given various names....


  • ...Refer to [37] and other papers in the same issue, and to [27]....


Journal ArticleDOI
01 Oct 1943-Nature
TL;DR: The Advanced Theory of Statistics by Maurice G. Kendall as discussed by the authors is a very handsomely produced volume which is one which it will be a pleasure to any mathematical statistician to possess.
Abstract: THIS very handsomely produced volume is one which it will be a pleasure to any mathematical statistician to possess. Mr. Kendall is indeed to be congratulated on the energy and unswerving perseverance needed to complete his heavy task, and encouraged in the still unflagging energy which will be needed for the second volume. So far as he has carried his work, he has certainly done something to sustain the credit of Great Britain in mathematical scholarship. The Advanced Theory of Statistics By Maurice G. Kendall. Vol. 1. Pp. xii + 457. (London: Charles Griffin and Co., Ltd., 1943.) 42s. net.

1,980 citations


"Independent component analysis, a n..." refers background in this paper

  • ...See [39, 40] for general remarks on pdf expansions....


  • For each pair: (a) estimate the required cumulants of the pair (Z_i, Z_j), by resorting to k-statistics for instance [39, 44]; (b) find the angle α maximizing the contrast after the plane rotation Q^(i,j) of angle α, α ∈ ]-π/4, π/4], in the plane defined by components {i, j}; (c) accumulate F := F Q^(i,j)*; (e) update Z := Q^(i,j) Z....
