Journal ArticleDOI

Learning the parts of objects by non-negative matrix factorization

21 Oct 1999-Nature (Nature Publishing Group)-Vol. 401, Iss: 6755, pp 788-791
TL;DR: An algorithm for non-negative matrix factorization is demonstrated that is able to learn parts of faces and semantic features of text and is in contrast to other methods that learn holistic, not parts-based, representations.
Abstract: Is perception of the whole based on perception of its parts? There is psychological and physiological evidence for parts-based representations in the brain, and certain computational theories of object recognition rely on such representations. But little is known about how brains or computers might learn the parts of objects. Here we demonstrate an algorithm for non-negative matrix factorization that is able to learn parts of faces and semantic features of text. This is in contrast to other methods, such as principal components analysis and vector quantization, that learn holistic, not parts-based, representations. Non-negative matrix factorization is distinguished from the other methods by its use of non-negativity constraints. These constraints lead to a parts-based representation because they allow only additive, not subtractive, combinations. When non-negative matrix factorization is implemented as a neural network, parts-based representations emerge by virtue of two properties: the firing rates of neurons are never negative and synaptic strengths do not change sign.
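
As a concrete illustration of the additive, non-negativity-constrained factorization described above, here is a minimal NumPy sketch of NMF with multiplicative updates for the squared Euclidean error. The V ≈ WH naming follows the paper, but the rank, iteration count and toy data are illustrative choices, not values from the paper.

```python
import numpy as np

def nmf(V, r, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (n x m) as V ~= W @ H, with W (n x r) and
    H (r x m) kept non-negative by multiplicative updates (squared-error version)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        # Multiplicative factors are ratios of non-negative terms, so entries
        # never change sign -- only additive combinations of parts are possible.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy usage: 100 columns of 64 non-negative "pixels", factored into 10 parts.
V = np.abs(np.random.default_rng(1).standard_normal((64, 100)))
W, H = nmf(V, r=10)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))   # relative reconstruction error
```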
Citations
Journal ArticleDOI
22 Dec 2000-Science
TL;DR: Locally linear embedding (LLE) is introduced, an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs that learns the global structure of nonlinear manifolds.
Abstract: Many areas of science depend on exploratory data analysis and visualization. The need to analyze large amounts of multivariate data raises the fundamental problem of dimensionality reduction: how to discover compact representations of high-dimensional data. Here, we introduce locally linear embedding (LLE), an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs. Unlike clustering methods for local dimensionality reduction, LLE maps its inputs into a single global coordinate system of lower dimensionality, and its optimizations do not involve local minima. By exploiting the local symmetries of linear reconstructions, LLE is able to learn the global structure of nonlinear manifolds, such as those generated by images of faces or documents of text.
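
For readers who want to try the method, the following sketch runs scikit-learn's LLE implementation on a synthetic swiss-roll manifold; the neighbourhood size and embedding dimension are illustrative choices, not the authors' settings.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# 1500 points sampled from a 2-D manifold ("swiss roll") embedded in 3-D.
X, t = make_swiss_roll(n_samples=1500, random_state=0)

# Reconstruct each point from its neighbours, then find a 2-D embedding that
# preserves those local reconstruction weights.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Y = lle.fit_transform(X)

print(Y.shape)                       # (1500, 2)
print(lle.reconstruction_error_)     # cost of the embedding step
```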

15,106 citations

Journal ArticleDOI
TL;DR: This survey provides an overview of higher-order tensor decompositions, their applications, and available software.
Abstract: This survey provides an overview of higher-order tensor decompositions, their applications, and available software. A tensor is a multidimensional or $N$-way array. Decompositions of higher-order tensors (i.e., $N$-way arrays with $N \geq 3$) have applications in psychometrics, chemometrics, signal processing, numerical linear algebra, computer vision, numerical analysis, data mining, neuroscience, graph analysis, and elsewhere. Two particular tensor decompositions can be considered to be higher-order extensions of the matrix singular value decomposition: CANDECOMP/PARAFAC (CP) decomposes a tensor as a sum of rank-one tensors, and the Tucker decomposition is a higher-order form of principal component analysis. There are many other tensor decompositions, including INDSCAL, PARAFAC2, CANDELINC, DEDICOM, and PARATUCK2 as well as nonnegative variants of all of the above. The N-way Toolbox, Tensor Toolbox, and Multilinear Engine are examples of software packages for working with tensors.
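
The CP decomposition mentioned above can be sketched in a few lines of NumPy using alternating least squares. This is a bare-bones illustration for a 3-way array, not one of the toolboxes named in the survey; the rank and iteration count are arbitrary choices.

```python
import numpy as np

def cp_als(X, rank, n_iter=100, seed=0):
    """CANDECOMP/PARAFAC of a 3-way array X via alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    for _ in range(n_iter):
        # Each factor solve uses a matricized-tensor-times-Khatri-Rao product.
        A = np.einsum('ijk,jr,kr->ir', X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Toy check: recover a rank-3 tensor expressed as a sum of three rank-one tensors.
rng = np.random.default_rng(1)
A0, B0, C0 = rng.random((6, 3)), rng.random((7, 3)), rng.random((8, 3))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(X, rank=3)
print(np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', A, B, C)))
```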

9,227 citations


Cites background or methods from "Learning the parts of objects by non-negative matrix factorization"

  • ...They replaced the least-squares updates with the multiplicative update introduced in [151]....


  • ...Welling and Weber [238] perform multiplicative updates like Lee and Seung [151] for NNCP....


  • ...Paatero and Tapper [181] and Lee and Seung [151] proposed using nonnegative matrix factorizations for analyzing nonnegative data, such as environmental models and grayscale images, because it is desirable for the decompositions to retain the nonnegative characteristics of the original data and thereby facilitate easier interpretation....


Proceedings Article
01 Jan 2000
TL;DR: Two different multiplicative algorithms for non-negative matrix factorization are analyzed and one algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence.
Abstract: Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.
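
The second of the two algorithms, the multiplicative update that decreases the generalized Kullback-Leibler divergence, can be sketched as follows (the Euclidean-error variant is sketched under the main paper above; initialization, rank and iteration count are illustrative).

```python
import numpy as np

def nmf_kl(V, r, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates that decrease the generalized KL divergence D(V || WH)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # H update: weight each entry by W^T (V / WH), normalized by column sums of W.
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        # W update: weight each entry by (V / WH) H^T, normalized by row sums of H.
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

V = np.abs(np.random.default_rng(1).standard_normal((64, 100)))
W, H = nmf_kl(V, r=10)
```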

7,345 citations


Cites background from "Learning the parts of objects by non-negative matrix factorization"

  • ...We have previously shown that nonnegativity is a useful constraint for matrix factorization that can learn a parts representation of the data [4, 5]....


Journal ArticleDOI
TL;DR: This paper presents a new approach to single-image superresolution, based upon sparse signal representation, which generates high-resolution images that are competitive or even superior in quality to images produced by other similar SR methods.
Abstract: This paper presents a new approach to single-image super-resolution, based upon sparse signal representation. Research on image statistics suggests that image patches can be well-represented as a sparse linear combination of elements from an appropriately chosen over-complete dictionary. Inspired by this observation, we seek a sparse representation for each patch of the low-resolution input, and then use the coefficients of this representation to generate the high-resolution output. Theoretical results from compressed sensing suggest that under mild conditions, the sparse representation can be correctly recovered from the downsampled signals. By jointly training two dictionaries for the low- and high-resolution image patches, we can enforce the similarity of sparse representations between the low-resolution and high-resolution image patch pair with respect to their own dictionaries. Therefore, the sparse representation of a low-resolution image patch can be applied with the high-resolution image patch dictionary to generate a high-resolution image patch. The learned dictionary pair is a more compact representation of the patch pairs, compared to previous approaches, which simply sample a large number of image patch pairs, reducing the computational cost substantially. The effectiveness of such a sparsity prior is demonstrated for both general image super-resolution (SR) and the special case of face hallucination. In both cases, our algorithm generates high-resolution images that are competitive or even superior in quality to images produced by other similar SR methods. In addition, the local sparse modeling of our approach is naturally robust to noise, and therefore the proposed algorithm can handle SR with noisy inputs in a more unified framework.
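
The core coupled-dictionary step can be sketched as follows: find a sparse code for a low-resolution patch in the low-resolution dictionary (here via an l1-penalized least-squares fit) and apply the same code to the high-resolution dictionary. The random dictionaries and patch sizes below are placeholders; the paper's joint dictionary training and global reconstruction constraint are omitted.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sr_patch(y_lr, D_lr, D_hr, lam=0.1):
    """Recover one high-resolution patch from its low-resolution counterpart."""
    # An l1-penalized least-squares (Lasso) fit gives a sparse code for the LR patch.
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    lasso.fit(D_lr, y_lr)
    a = lasso.coef_
    # The same sparse code, applied to the HR dictionary, yields the HR patch.
    return D_hr @ a

# Hypothetical coupled dictionaries (in practice jointly trained on patch pairs).
rng = np.random.default_rng(0)
D_hr = rng.standard_normal((64, 256))   # 8x8 HR patches, 256 atoms
D_lr = rng.standard_normal((16, 256))   # 4x4 LR patches, same atoms
y_lr = rng.standard_normal(16)
x_hr = sr_patch(y_lr, D_lr, D_hr)
print(x_hr.shape)                       # (64,)
```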

4,958 citations


Cites background from "Learning the parts of objects by non-negative matrix factorization"

  • ...To be precise, the solution is also approached in two steps: 1) global model: use reconstruction constraint to recover a medium high-resolution face image, but the solution is searched only in the face subspace; and 2) local model: use the local sparse model to recover the image details. a) Nonnegative Matrix Factorization (NMF): In face SR, the most frequently used subspace method for modeling the human face is principal component analysis (PCA), which chooses a low-dimensional subspace that captures as much of the variance as possible....


  • ...Nonnegative Matrix Factorization (NMF) [31] seeks a representation of the given signals as an additive combination of local features....


  • ...Column 4 shows the intermediate results from the NMF global modeling and column 5 demonstrates the results after local sparse modeling....


  • ...NMF [29] seeks a representation of the given signals as an additive combination of local features....


  • ...To find such a part-based subspace, NMF is formulated as the following optimization problem (12): $\min_{W, H \geq 0} \|V - WH\|_F^2$, where $V$ is the data matrix, $W$ is the basis matrix and $H$ is the coefficient matrix....


Proceedings ArticleDOI
27 Jun 2004
TL;DR: This paper examines (and improves upon) the local image descriptor used by SIFT, and demonstrates that the PCA-based local descriptors are more distinctive, more robust to image deformations, and more compact than the standard SIFT representation.
Abstract: Stable local feature detection and representation is a fundamental component of many image registration and object recognition algorithms. Mikolajczyk and Schmid (June 2003) recently evaluated a variety of approaches and identified the SIFT [D. G. Lowe, 1999] algorithm as being the most resistant to common image deformations. This paper examines (and improves upon) the local image descriptor used by SIFT. Like SIFT, our descriptors encode the salient aspects of the image gradient in the feature point's neighborhood; however, instead of using SIFT's smoothed weighted histograms, we apply principal components analysis (PCA) to the normalized gradient patch. Our experiments demonstrate that the PCA-based local descriptors are more distinctive, more robust to image deformations, and more compact than the standard SIFT representation. We also present results showing that using these descriptors in an image retrieval application results in increased accuracy and faster matching.
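
The descriptor construction can be sketched roughly as follows: concatenate the gradients of a local patch, normalize, and project onto a precomputed PCA basis. The patch size, descriptor dimension and the basis itself are stand-ins here, not the authors' trained values.

```python
import numpy as np

def pca_descriptor(patch, mean, eigvecs):
    """Build a compact local descriptor: gradient patch -> normalize -> PCA projection."""
    gx = patch[1:-1, 2:] - patch[1:-1, :-2]       # horizontal gradients (39 x 39)
    gy = patch[2:, 1:-1] - patch[:-2, 1:-1]       # vertical gradients (39 x 39)
    v = np.concatenate([gx.ravel(), gy.ravel()])  # 3042-dim gradient vector
    v /= np.linalg.norm(v) + 1e-12                # normalize against illumination change
    return eigvecs.T @ (v - mean)                 # project onto the learned subspace

# Hypothetical basis: in practice it is estimated offline from many training patches.
d, k = 39 * 39 * 2, 36
rng = np.random.default_rng(0)
mean = np.zeros(d)
eigvecs = np.linalg.qr(rng.standard_normal((d, k)))[0]   # stand-in orthonormal basis
print(pca_descriptor(rng.random((41, 41)), mean, eigvecs).shape)   # (36,)
```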

3,325 citations


Cites background from "Learning the parts of objects by non-negative matrix factorization"

  • ...While PCA suffers from a number of shortcomings [8, 10], such as its implicit assumption of Gaussian distributions and its restriction to orthogonal linear combinations, it remains popular due to its simplicity....


References
Journal ArticleDOI
TL;DR: A near-real-time computer system that can locate and track a subject's head, and then recognize the person by comparing characteristics of the face to those of known individuals, and that is easy to implement using a neural network architecture.
Abstract: We have developed a near-real-time computer system that can locate and track a subject's head, and then recognize the person by comparing characteristics of the face to those of known individuals. The computational approach taken in this system is motivated by both physiology and information theory, as well as by the practical requirements of near-real-time performance and accuracy. Our approach treats the face recognition problem as an intrinsically two-dimensional (2-D) recognition problem rather than requiring recovery of three-dimensional geometry, taking advantage of the fact that faces are normally upright and thus may be described by a small set of 2-D characteristic views. The system functions by projecting face images onto a feature space that spans the significant variations among known face images. The significant features are known as "eigenfaces," because they are the eigenvectors (principal components) of the set of faces; they do not necessarily correspond to features such as eyes, ears, and noses. The projection operation characterizes an individual face by a weighted sum of the eigenface features, and so to recognize a particular face it is necessary only to compare these weights to those of known individuals. Some particular advantages of our approach are that it provides for the ability to learn and later recognize new faces in an unsupervised manner, and that it is easy to implement using a neural network architecture.
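
A compact sketch of the eigenface idea: compute the basis from flattened training faces and recognize a query by nearest neighbour in the weight space. Image sizes, component count and the matching rule below are illustrative simplifications, not the system described in the paper.

```python
import numpy as np

def eigenfaces(train_flat, n_components=20):
    """Learn an eigenface basis from flattened training faces (one image per row)."""
    mean = train_flat.mean(axis=0)
    centered = train_flat - mean
    # Right singular vectors of the centered data are the eigenfaces.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    basis = Vt[:n_components]            # (n_components, n_pixels)
    weights = centered @ basis.T         # each known face as a small weight vector
    return mean, basis, weights

def recognize(face_flat, mean, basis, weights, labels):
    """Project the query face and return the label of the nearest weight vector."""
    w = (face_flat - mean) @ basis.T
    return labels[int(np.argmin(np.linalg.norm(weights - w, axis=1)))]

# Toy usage with random "faces" standing in for real images.
rng = np.random.default_rng(0)
faces = rng.random((50, 32 * 32))                  # 50 training faces, 32x32 pixels
labels = np.arange(50)
mean, basis, weights = eigenfaces(faces)
print(recognize(faces[7] + 0.01 * rng.random(32 * 32), mean, basis, weights, labels))
```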

14,562 citations

Book
01 Jan 1983
Introduction to Modern Information Retrieval.

12,059 citations

Journal ArticleDOI
TL;DR: It is suggested that information maximization provides a unifying framework for problems in "blind" signal processing and dependencies of information transfer on time delays are derived.
Abstract: We derive a new self-organizing learning algorithm that maximizes the information transferred in a network of nonlinear units. The algorithm does not assume any knowledge of the input distributions, and is defined here for the zero-noise limit. Under these conditions, information maximization has extra properties not found in the linear case (Linsker 1989). The nonlinearities in the transfer function are able to pick up higher-order moments of the input distributions and perform something akin to true redundancy reduction between units in the output representation. This enables the network to separate statistically independent components in the inputs: a higher-order generalization of principal components analysis. We apply the network to the source separation (or cocktail party) problem, successfully separating unknown mixtures of up to 10 speakers. We also show that a variant on the network architecture is able to perform blind deconvolution (cancellation of unknown echoes and reverberation in a speech signal). Finally, we derive dependencies of information transfer on time delays. We suggest that information maximization provides a unifying framework for problems in "blind" signal processing.
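
A rough sketch of the learning rule in its natural-gradient form, applied to a toy two-source mixing problem. The logistic nonlinearity, fixed learning rate and centered (unwhitened) mixtures are simplifying assumptions, not the paper's exact setup.

```python
import numpy as np

def infomax_ica(X, lr=0.01, n_iter=500, seed=0):
    """Unmix signals by maximizing information transfer through logistic units
    (natural-gradient form of the infomax rule; a simplified sketch)."""
    rng = np.random.default_rng(seed)
    n, T = X.shape                               # n mixtures, T samples
    W = np.eye(n) + 0.01 * rng.standard_normal((n, n))
    for _ in range(n_iter):
        U = W @ X                                # current source estimates
        Y = 1.0 / (1.0 + np.exp(-U))             # logistic transfer function
        # Natural-gradient update: dW ~ (I + (1 - 2y) u^T) W, averaged over samples.
        W += lr * (np.eye(n) + (1.0 - 2.0 * Y) @ U.T / T) @ W
    return W

# Toy cocktail-party problem: two super-Gaussian sources, two mixtures.
rng = np.random.default_rng(1)
S = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.6], [0.4, 1.0]])           # unknown mixing matrix
X = A @ S
X -= X.mean(axis=1, keepdims=True)
W = infomax_ica(X)
estimated = W @ X                                # recovered sources (up to scale/order)
print(np.round(W @ A, 2))                        # ideally close to a scaled permutation
```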

9,157 citations

Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.
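
A toy LSA pipeline can be sketched with a term-document matrix and a truncated SVD. The three-document corpus, TF-IDF weighting and 2 latent dimensions are illustrative stand-ins for the large corpora and roughly 300 dimensions discussed above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the doctor examined the patient in the clinic",
    "the physician treated the patient at the hospital",
    "the astronomer observed the galaxy through the telescope",
]

# Term-document matrix -> truncated SVD = latent semantic space.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Pairwise document similarities in the reduced space; terms that never co-occur
# directly can still end up related through the shared dimensions.
print(cosine_similarity(Z))
```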

6,014 citations