Nonnegative Matrix Factorization:
A Comprehensive Review
Yu-Xiong Wang, Student Member, IEEE, and Yu-Jin Zhang, Senior Member, IEEE
Abstract—Nonnegative Matrix Factorization (NMF), a relatively novel paradigm for dimensionality reduction, has been in the ascendant since its inception. It incorporates the nonnegativity constraint, thus obtaining the parts-based representation and correspondingly enhancing the interpretability of the issue. This survey paper mainly focuses on the theoretical research into NMF over the last 5 years, where the principles, basic models, properties, and algorithms of NMF along with its various modifications, extensions, and generalizations are summarized systematically. The existing NMF algorithms are divided into four categories: Basic NMF (BNMF), Constrained NMF (CNMF), Structured NMF (SNMF), and Generalized NMF (GNMF), upon which the design principles, characteristics, problems, relationships, and evolution of these algorithms are presented and analyzed comprehensively. Some related work not on NMF itself, which NMF should learn from or has connections with, is also covered. Moreover, some open issues that remain to be solved are discussed. Several relevant application areas of NMF are also briefly described. This survey aims to construct an integrated, state-of-the-art framework for the NMF concept, from which follow-up research may benefit.
Index Terms—Data mining, dimensionality reduction, multivariate data analysis, nonnegative matrix factorization (NMF)
1 INTRODUCTION
One of the basic concepts deeply rooted in science and
engineering is that there must be something simple,
compact, and elegant playing the fundamental roles under
the apparent chaos and complexity. This is also the case in
signal processing, data analysis, data mining, pattern
recognition, and machine learning. With the increasing quantities of available raw data due to developments in sensor and computer technology, how to obtain an effective representation by an appropriate dimensionality reduction technique has become important, necessary, and challenging in multivariate data analysis. Generally
speaking, two basic properties are supposed to be satisfied:
first, the dimension of the original data should be reduced;
second, the principal components, hidden concepts, promi-
nent features, or latent variables of the data, depending on
the application context, should be identified efficaciously.
In many cases, the primitive data sets or observations are
organized as data matrices (or tensors), and described by
linear (or multilinear) combination models; whereupon the
formulation of dimensionality reduction can be regarded as,
from the algebraic perspective, decomposing the original
data matrix into two factor matrices. The canonical methods,
such as Principal Component Analysis (PCA), Linear
Discriminant Analysis (LDA), Independent Component
Analysis (ICA), Vector Quantization (VQ), etc., are the
exemplars of such low-rank approximations. They differ
from one another in the statistical properties attributable to
the different constraints imposed on the component
matrices and their underlying structures; however, they have in common that there is no constraint on the sign of the elements in the factorized matrices. In other
words, the negative component or the subtractive combina-
tion is allowed in the representation. By contrast, a new
paradigm of factorization—Nonnegative Matrix Factoriza-
tion (NMF), which incorporates the nonnegativity constraint
and thus obtains the parts-based representation as well as
enhancing the interpretability of the issue correspondingly,
was initiated by Paatero and Tapper [1], [2] together with
Lee and Seung [3], [4].
As a matter of fact, the notion of NMF has a long history
under the name “self modeling curve resolution” in
chemometrics, where the vectors are continuous curves
rather than discrete vectors [5]. NMF was first introduced
by Paatero and Tapper as the concept of Positive Matrix
Factorization, which concentrated on a specific application
with Byzantine algorithms. These shortcomings limit both
the theoretical analysis, such as the convergence of the
algorithms or the properties of the solutions, and the
generalization of the algorithms in other applications.
Fortunately, NMF was popularized by Lee and Seung due
to their contributing work of a simple yet effective
algorithmic procedure, and more importantly the emphasis
on its potential value of parts-based representation.
Far beyond a mathematical exploration, the philosophy
underlying NMF, which tries to formulate a feasible model
for learning object parts, is closely relevant to the mechanism of perception. While the parts-based representation seems
intuitive, it is indeed on the basis of physiological and
psychological evidence: perception of the whole is based on
perception of its parts [6], one of the core concepts in certain
computational theories of recognition problems. In fact there
are two complementary connotations in nonnegativity—
nonnegative component and purely additive combination.
On the one hand, the negative values of both observations

and latent components are physically meaningless in many kinds of real-world data analysis tasks, such as those involving image, spectral, and gene data. Meanwhile, the discovered prototypes
commonly correspond with certain semantic interpretation.
For instance, in face recognition, the learned basis images are
localized rather than holistic, resembling parts of faces, such
as eyes, nose, mouth, and cheeks [3]. On the other hand,
objects of interest are most naturally characterized by the inventory of their parts, and the exclusively additive combination means that they can be reassembled by adding the required parts together, similar to identikits. NMF thereupon has achieved great success in real-world scenarios and tasks. In
document clustering, NMF surpasses the classic methods,
such as spectral clustering, not only in accuracy improve-
ment but also in latent semantic topic identification [7].
To boot, the nonnegativity constraint naturally leads to a sort of sparseness [3], which has proved to be a highly effective representation distinguished from both the completely distributed and the solely active component descriptions [8]. When NMF is interpreted as a neural network
learning algorithm depicting how the visible variables are
generated from the hidden ones, the parts-based represen-
tation is obtained from the additive model. A positive
number indicates the presence and a zero value represents
the absence of some event or component. This conforms
nicely to the dualistic properties of neural activity and
synaptic strengths in neurophysiology: either excitatory or
inhibitory without changing sign [3].
Because of the enhanced semantic interpretability under
the nonnegativity and the ensuing sparsity, NMF has
become an imperative tool in multivariate data analysis,
and been widely used in the fields of mathematics,
optimization, neural computing, pattern recognition and
machine learning [9], data mining [10], signal processing
[11], image engineering and computer vision [11], spectral
data analysis [12], bioinformatics [13], chemometrics [1],
geophysics [14], finance and economics [15]. More specifi-
cally, such applications include text data mining [16], digital
watermark, image denoising [17], image restoration, image
segmentation [18], image fusion, image classification [19],
image retrieval, face hallucination, face recognition [20],
facial expression recognition [21], audio pattern separation
[22], music genre classification [23], speech recognition,
microarray analysis, blind source separation [24], spectro-
scopy [25], gene expression classification [26], cell analysis,
EEG signal processing [17], pathologic diagnosis, email
surveillance [10], online discussion participation prediction,
network security, automatic personalized summarization,
identification of compounds in atmosphere analysis [14],
earthquake prediction, stock market pricing [15], and so on.
There have been numerous results devoted to NMF
research since its inception. Researchers from various fields,
mathematicians, statisticians, computer scientists, biolo-
gists, and neuroscientists, have explored the NMF concept
from diverse perspectives. So a systematic survey is of
necessity and consequence. Although there have been such survey papers as [27], [28], [12], [13], [10], [11], [29] and one book [9], they fail to reflect either the most recent or the most comprehensive results. This review paper will summarize
the principles, basic models, properties, and algorithms of
NMF systematically over the last 5 years, including its
various modifications, extensions, and generalizations. A
taxonomy is accordingly proposed to logically group them, which has not been presented before. Besides these, some
related work not on NMF that NMF should learn from or
has connections with will also be involved. Furthermore, this survey mainly focuses on the theoretical research rather than specific applications, although practical usage will also be touched upon. It aims to construct an integrated, state-of-the-art framework for the NMF concept, from which follow-up research may benefit.
In conclusion, the theory of NMF has advanced sig-
nificantly by now yet is still a work in progress. To be
specific: 1) the properties of NMF itself have been explored more deeply, whereas a firm statistical underpinning like those of the traditional factorization methods (PCA or LDA) has not been fully developed (partly due to its knottiness); 2) some problems like the ones mentioned in [29] have been solved, especially those with additional constraints; nevertheless, many other questions remain open.
The existing NMF algorithms are divided into four categories here, as given in Fig. 1, following some unified criteria:
1. Basic NMF (BNMF), which only imposes the non-
negativity constraint.
2. Constrained NMF (CNMF), which imposes some
additional constraints as regularization.
3. Structured NMF (SNMF), which modifies the stan-
dard factorization formulations.
4. Generalized NMF (GNMF), which breaks through
the conventional data types or factorization modes
in a broad sense.
The model level from Basic to Generalized NMF
becomes broader. Therein Basic NMF formulates the
fundamental analytical framework upon which all other
NMF models are built. We will present the optimization
tools and computational methods to efficiently and robustly
solve Basic NMF. Moreover, the pragmatic issue of NMF
with respect to large-scale data sets and online processing
will also be discussed.
Constrained NMF is categorized into four subclasses:
1. Sparse NMF (SPNMF), which imposes the sparse-
ness constraint.
2. Orthogonal NMF (ONMF), which imposes the
orthogonality constraint.
3. Discriminant NMF (DNMF), which involves the
information for classification and discrimination.
4. NMF on manifold (MNMF), which preserves the
local topological properties.
We will demonstrate why these morphological constraints
are essentially necessary and how to incorporate them into
the existing solution framework of Basic NMF.
Correspondingly, Structured NMF is categorized into
three subclasses:
1. Weighted NMF (WNMF), which attaches different
weights to different elements regarding their relative
importance.
WANG AND ZHANG: NONNEGATIVE MATRIX FACTORIZATION: A COMPREHENSIVE REVIEW 1337

2. Convolutive NMF (CVNMF), which considers the
time-frequency domain factorization.
3. Nonnegative Matrix Trifactorization (NMTF), which
decomposes the data matrix into three factor
matrices.
Besides, Generalized NMF is categorized into four
subclasses:
1. Semi-NMF, which relaxes the nonnegativity con-
straint only on the specific factor matrix.
2. Nonnegative Tensor Factorization ( NTF), whi ch
generalizes the matrix-form data to higher dimen-
sional tensors.
3. Nonnegative Matrix-Set Factorization (NMSF),
which extends the data sets from matrices to
matrix-sets.
4. Kernel NMF (KNMF), which is the nonlinear model
of NMF.
The remainder of this paper is organized as follows: first, the mathematical formulation of the NMF model is presented, and the unearthed properties of NMF are summarized. Then the algorithmic details of the foregoing categories of NMF are elaborated. Finally, conclusions are drawn, and some open issues that remain to be solved are discussed.
2 CONCEPT AND PROPERTIES OF NMF
Definition. Given an $M$-dimensional random vector $x$ with nonnegative elements, whose $N$ observations are denoted as $x_j, j = 1, 2, \ldots, N$, let the data matrix be $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}_{\geq 0}^{M \times N}$. NMF seeks to decompose $X$ into a nonnegative $M \times L$ basis matrix $U = [u_1, u_2, \ldots, u_L] \in \mathbb{R}_{\geq 0}^{M \times L}$ and a nonnegative $L \times N$ coefficient matrix $V = [v_1, v_2, \ldots, v_N] \in \mathbb{R}_{\geq 0}^{L \times N}$, such that $X \approx UV$, where $\mathbb{R}_{\geq 0}^{M \times N}$ stands for the set of $M \times N$ element-wise nonnegative matrices. This can also be written as the equivalent vector formula $x_j \approx \sum_{i=1}^{L} u_i V_{ij}$.
It is obvious that $v_j$ is the weight coefficient of the observation $x_j$ on the columns of $U$, the basis vectors or the latent feature vectors of $X$. Hence, NMF decomposes each datum into a linear combination of the basis vectors.
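To make the definition concrete, the following minimal sketch (a hypothetical illustration, not code from the paper) factorizes a small nonnegative data matrix with scikit-learn's NMF solver, whose outputs map directly onto the $U$ and $V$ above:

```python
# A minimal sketch of the definition X ~= UV; any NMF solver would do,
# and scikit-learn's is used here purely for illustration.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
M, N, L = 6, 10, 3                      # dimensions with L << min(M, N)
X = rng.random((M, N))                  # element-wise nonnegative data matrix

model = NMF(n_components=L, init="random", random_state=0, max_iter=500)
U = model.fit_transform(X)              # M x L nonnegative basis matrix
V = model.components_                   # L x N nonnegative coefficient matrix

# Each observation x_j is approximated by a nonnegative linear
# combination of the columns of U: x_j ~= sum_i u_i * V[i, j].
x0_approx = U @ V[:, 0]
print(np.linalg.norm(X - U @ V, "fro"))  # approximation error
```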
Because of the initial condition $L \ll \min(M, N)$, the obtained basis vectors are incomplete over the original vector space. In other words, this approach tries to represent the high-dimensional stochastic pattern with far fewer bases, so the perfect approximation can be achieved successfully only if the intrinsic features are identified in $U$.
Here, we discuss the relationship between $L$ and $M$, $N$ a little more. In most cases, NMF is viewed as a dimensionality reduction and feature extraction technique with $L \ll M$, $L \ll N$; that is, the basis set learned from the NMF model is incomplete, and the energy is compacted. In general, however, $L$ can be smaller than, equal to, or larger than $M$. But there are fundamental differences in the decomposition for $L < M$ and $L > M$. It is a sort of sparse coding and compressed sensing with an overcomplete basis when $L > M$. Hence, $L$ need not be limited by the dimensionality of the data, which is useful for some applications, like classification. In this situation, it may benefit from the sparseness due to both nonnegativity and redundant representation. One approach to obtain this NMF model is to perform the decomposition on the residue matrix $E = X - UV$ repeatedly and sequentially [30], as sketched below.
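The following sketch illustrates that sequential idea under stated assumptions; it is not the algorithm of [30] itself, and clipping the residue at zero (so each subproblem stays nonnegative) is an assumption made here for illustration:

```python
# Hypothetical sketch of the sequential scheme: factorize, subtract,
# and repeat on the residue. Clipping at zero keeps each subproblem
# nonnegative (an assumption made for this illustration).
import numpy as np
from sklearn.decomposition import NMF

def sequential_nmf(X, L_per_stage, n_stages):
    residue = X.copy()
    factors = []
    for _ in range(n_stages):
        model = NMF(n_components=L_per_stage, init="random",
                    random_state=0, max_iter=500)
        U = model.fit_transform(residue)
        V = model.components_
        factors.append((U, V))
        residue = np.maximum(residue - U @ V, 0.0)  # E = max(X - UV, 0)
    return factors
```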
As a kind of matrix factorization model, three essential
questions need answering: 1) existence, whether the
nontrivial NMF solutions exist; 2) uniqueness, under what
assumptions NMF is, at least in some sense, unique;
3) effectiveness, under what assumptions NMF is able to
recover the “right answer.” The existence was shown via
the theory of Completely Positive (CP) Factorization for the
first time in [31]. The last two concerns were first mentioned
and discussed from a geometric viewpoint in [32].
Complete NMF $X = UV$ is considered first for the analysis of existence, convexity, and computational complexity. The trivial solution always exists as $U = X$ and $V = I_N$. By relating NMF to CP Factorization, Vasiloglou et al. showed that every nonnegative matrix has a nontrivial complete NMF [31]. As such, CP Factorization is a special case, where a nonnegative matrix $X \in \mathbb{R}_{\geq 0}^{M \times M}$ is CP if it can be factored in the form $X = U U^{\mathsf{T}}$, $U \in \mathbb{R}_{\geq 0}^{M \times L}$. The minimum $L$ is called the CP-rank of $X$.
Fig. 1. The categorization of NMF models and algorithms.

Combining the fact that the set of CP matrices forms a convex cone with the fact that the solution to NMF belongs to a CP cone, solving NMF is a convex optimization problem [31].
Nevertheless, finding a practical description of the CP
cone is still open, and it remains hard to formulate NMF
as a convex optimization problem, despite a convex relaxation to rank reduction with theoretical merit proposed in [31].
Using the bilinear model, complete NMF can be rewritten as a linear combination of rank-one nonnegative matrices expressed by
$$X = \sum_{i=1}^{L} U_{\cdot i} V_{i \cdot} = \sum_{i=1}^{L} U_{\cdot i} \circ (V_{i \cdot})^{\mathsf{T}}, \qquad (1)$$
where $U_{\cdot i}$ is the $i$th column vector of $U$ while $V_{i \cdot}$ is the $i$th row vector of $V$, and $\circ$ denotes the outer product of two vectors. The smallest $L$ making the decomposition possible is called the nonnegative rank of the nonnegative matrix $X$, denoted as $\operatorname{rank}_+(X)$. And it satisfies the following trivial bounds [33]:
$$\operatorname{rank}(X) \leq \operatorname{rank}_+(X) \leq \min(M, N). \qquad (2)$$
While PCA can be solved in polynomial time, the optimization problem of NMF, with respect to determining the nonnegative rank and computing the associated factorization, is more difficult than its unconstrained counterpart. It is in fact NP-hard when both the dimension and the factorization rank of $X$ are required to increase, which was proved by relating it to the NP-hard intermediate simplex problem by Vavasis [34]. This is also a corollary of CP programming, since the CP cone cannot be described in polynomial time despite its convexity. In the special case when $\operatorname{rank}(X) = 1$, complete NMF can be solved in polynomial time. However, the complexity of complete NMF for fixed factorization rank generally remains unknown [35].
Another related work is the so-called Nonnegative Rank Factorization (NRF), focusing on the situation of $\operatorname{rank}(X) = \operatorname{rank}_+(X)$, i.e., selecting $\operatorname{rank}(X)$ as the minimum $L$ [33]. This is not always possible, and only a nonnegative matrix for which a corresponding simplicial cone (a polyhedral cone is simplicial if its vertex rays are linearly independent) exists has an NRF [36].
In most cases, the approximation version of NMF, $X \approx UV$, instead of the complete factorization is widely utilized. An alternative generative model is
$$X = UV + E, \qquad (3)$$
where $E \in \mathbb{R}^{M \times N}$ is the residue or noise matrix representing the approximation error.
These two modes of NMF are essentially coupled with
each other, though much more attention is devoted to the
latter. The theoretical results on complete NMF will be
helpful to design more efficient NMF algorithms [31], [34].
The selection of the factorization rank $L$ of NMF may be more credible if a tighter bound for the nonnegative rank is obtained [37].
In essence, NMF is an ill-posed problem with nonunique
solutions [32], [38]. From the geometric perspective, NMF
can be viewed as finding a simplicial cone involving all the
data points in the positive orthant. Given a simplicial cone
satisfying all these conditions, it is not difficult to construct
another cone containing the former one to meet the same
conditions, so the nesting can continue indefinitely, thus leading to an ill-defined factorization notion. From the
algebraic perspective, if there exists a solution $X \approx U_0 V_0$, let $U = U_0 D$, $V = D^{-1} V_0$; then $X \approx UV$. If a nonsingular matrix and its inverse are both nonnegative, then the matrix is a generalized permutation with the form $PS$, where $P$ and $S$ are permutation and scaling matrices, respectively. So the permutation and scaling ambiguities for NMF are inevitable. For that matter, NMF is called a unique factorization up to a permutation and a scaling transformation when $D = PS$. Unfortunately, there are many ways to select a rotational matrix $D$ which is not necessarily a generalized permutation or even a nonnegative matrix, such that the transformed factor matrices $U$ and $V$ are still nonnegative. In other words, the sole nonnegativity constraint in itself will not suffice to guarantee the uniqueness, let alone the effectiveness. Nevertheless, the uniqueness will be achieved if the original data satisfy a certain generative model. Intuitively, if $U_0$ and $V_0$ are sufficiently sparse, only generalized permutation matrices are possible rotation matrices satisfying the nonnegativity constraint. Strictly speaking, this is called the boundary close condition for the sufficiency and necessity of the uniqueness of the NMF solution [39]. Deep discussions of this issue can be found in [32], [38], [39], [40], [41], and [42]. In practice, incorporating additional constraints such as sparseness in the factor matrices or normalizing the columns of $U$ (respectively, rows of $V$) to unit length is helpful in alleviating the rotational indeterminacy [9].
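The scaling ambiguity is easy to exhibit numerically; in the toy example below, a positive diagonal $D$ (a special case of the generalized permutation $PS$) produces a second, equally valid nonnegative factorization with exactly the same product:

```python
# Illustration of the scaling ambiguity: for any positive diagonal D,
# (U0 D, D^{-1} V0) is another nonnegative factorization with the same
# product, so nonnegativity alone cannot pin down a unique solution.
import numpy as np

rng = np.random.default_rng(2)
U0 = rng.random((4, 2))
V0 = rng.random((2, 5))

D = np.diag([2.0, 0.5])                 # positive scaling, trivially invertible
U, V = U0 @ D, np.linalg.inv(D) @ V0    # both factors stay nonnegative

assert np.allclose(U0 @ V0, U @ V)      # identical approximation of X
```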
It was hoped that NMF would produce an intrinsically
parts-based and sparse representation in unsupervised
mode [3], which is the most inspiring benefit of NMF.
Intuitively, this can be explained by the fact that the stationary points of NMF solutions will typically be located at the boundary of the feasible domain due to the first-order optimality conditions, leading to zero elements [37]. Further
experiments by Li et al. have shown, however, that the pure
additivity does not necessarily mean sparsity and that NMF
will not necessarily learn the localized features [43].
Furthermore, NMF is equivalent to k-means clustering when using the Squared Euclidean Distance (SED) [44], [45],
while tantamount to Probabilistic Latent Semantic Analy-
sis (PLSA) when using Generalized Kullback-Leibler
Divergence (GKLD) as the objective function [46], [47].
So far we may conclude that the merits of NMF, parts-
based representation and sparseness included, come at the
price of more complexity. Besides, SVD or PCA always has a more compact spectrum than NMF [31]. You just cannot
have the best of both worlds.
3 BASIC NMF ALGORITHMS
The cynosure of Basic NMF is trying to find more efficient and effective solutions to the NMF problem under the sole
nonnegativity constraint, which lays the foundation for the
practicability of NMF. Due to its NP-hardness and lack of
appropriate convex formulations, the nonconvex formula-
tions with relatively easy solvability are generally adopted,

and only local minima are achievable in a reasonable
computational time. Hence, the classic and also more
practical approach is to perform alternating minimization
of a suitable cost function serving as the similarity measure between $X$ and the product $UV$. The different optimization models vary from one another mainly in the objective functions and the optimization procedures.
These optimization models, which even suggest some possible directions for the solutions to Constrained, Structured, and Generalized NMF, are the kernel discussions of this section. We will first summarize the objective
functions. Then the details about the classic Basic NMF
framework and the paragon algorithms are presented.
Moreover, some new vision of NMF, such as the geometric
formulation of NMF, and the pragmatic issue of NMF, such
as large-scale data sets, online processing, parallel comput-
ing, and incremental NMF, will be discussed. In the last part
of this section, some other relevant issues are also involved.
3.1 Similarity Measures or Objective Functions
In order to quantify the difference between the original data $X$ and the approximation $UV$, a similarity measure $D(X \| UV)$ needs to be defined first. This is also the objective function of the optimization model. These similarity measures can be either distances or divergences, and the corresponding objective functions can be either a sole cost function or optionally a set of cost functions with the same global minima to be minimized sequentially or simultaneously.
The most commonly used objective functions are SED (i.e., Frobenius norm) (4) and GKLD (i.e., I-divergence) (5) [4]:
$$D_F(X \| UV) = \frac{1}{2} \| X - UV \|_F^2 = \frac{1}{2} \sum_{ij} \left( X_{ij} - [UV]_{ij} \right)^2, \qquad (4)$$
$$D_{KL}(X \| UV) = \sum_{ij} \left( X_{ij} \ln \frac{X_{ij}}{[UV]_{ij}} - X_{ij} + [UV]_{ij} \right). \qquad (5)$$
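For reference, the two objective functions transcribe directly into code; the sketch below assumes strictly positive entries in $X$ and $UV$ so that the logarithm in (5) is well defined:

```python
# Direct transcriptions of (4) and (5).
import numpy as np

def sed(X, U, V):
    """Squared Euclidean distance, D_F(X || UV) in (4)."""
    R = X - U @ V
    return 0.5 * np.sum(R**2)

def gkld(X, U, V):
    """Generalized Kullback-Leibler (I-)divergence, D_KL(X || UV) in (5).
    Assumes X and UV have strictly positive entries."""
    P = U @ V
    return np.sum(X * np.log(X / P) - X + P)
```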
There are some drawbacks of GKLD; in particular, the gradients needed in optimization heavily depend on the scales of the factorizing matrices, leading to many iterations. Thus, the original KLD is renewed for NMF by normalizing the input data in [48]. Other cost functions consist of the Minkowski family of metrics known as the $\ell_p$-norm, the Earth Mover's distance metric [18], $\alpha$-divergence [17], $\beta$-divergence [49], $\gamma$-divergence [50], Csiszár's $\varphi$-divergence [51], Bregman divergence [52], and $\alpha$-$\beta$-divergence [53]. Most of them are element-wise measures. Some similarity measures are more robust with respect to noise and outliers, such as the hypersurface cost function [54], $\gamma$-divergence [50], and $\alpha$-$\beta$-divergence [53].
Statistically, different similarity measures can be determined based on prior knowledge about the probability distribution of the noise, which actually reflects the statistical structure of the signals and the disclosed components. For example, the SED minimization can be seen as a maximum likelihood estimator where the difference is due to additive Gaussian noise, whereas GKLD can be shown to be equivalent to the Expectation Maximization (EM) algorithm and maximum likelihood for Poisson processes [9]. Given that the optimization problem is not jointly convex in both $U$ and $V$ but is separately convex in either $U$ or $V$, alternating minimization is seemingly the feasible direction; a sketch of this scheme follows. A phenomenon worthy of notice is that although the generative model of NMF is linear, the inference computation is nonlinear.
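The sketch below is one possible realization of that alternating scheme (not a specific published algorithm): with one factor fixed, each half-step reduces to convex nonnegative least squares problems, solved here column by column with SciPy's NNLS routine:

```python
# A sketch of alternating nonnegative least squares exploiting the
# separate convexity: with U fixed, each column of V solves a convex
# NNLS problem, and symmetrically for the rows of U.
import numpy as np
from scipy.optimize import nnls

def anls_sweep(X, U, V):
    M, N = X.shape
    for j in range(N):                   # update V column by column
        V[:, j], _ = nnls(U, X[:, j])
    for i in range(M):                   # update U row by row
        U[i, :], _ = nnls(V.T, X[i, :])
    return U, V
```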
3.2 Classic Basic NMF Optimization Framework
The prototypical multiplicative update rules originated by Lee and Seung, the SED-MU and GKLD-MU [4], are still widely used as the baseline. The SED-MU and GKLD-
MU algorithms use SED and GKLD as objective functions,
respectively, and both apply iterative multiplicative updates
as the optimization approach similar to EM algorithms. In
essence, they can be viewed as adaptive rescaled gradient
descent algorithms. Considering the efficiency, they are
relatively simple and parameter free with low cost per
iteration, but they converge slowly due to a first-order
convergence rate [28], [55]. Regarding the quality of the
solutions, Lee and Seung claimed that the multiplicative
update rules converge to a local minimum [4]. Gonzales and Zhang indicated, however, that the gradient and the property of continual nonincrease by no means ensure convergence to a limit point that is also a stationary point, which can be understood under the Karush-Kuhn-Tucker (KKT) optimality conditions [55], [56]. So the accurate
conclusion is that the algorithms converge to a stationary
point which is not necessarily a local minimum when the
limit point is in the interior of the feasible region; its
stationarity cannot even be determined when the limit point
lies on the boundary of the feasible region [10]. However, a
minor modification in their step size of the gradient descent
formula achieves a first-order stationary point [57]. Another
drawback is the strong correlation enforced by the multi-
plication. Once an element in the factor matrices becomes
zero, it must remain zero. This means a gradual shrinkage of the feasible region, which is harmful for obtaining a superior solution. In practice, to reduce numerical difficulties, like numerical instabilities or ill-conditioning, normalization of the $\ell_1$ or $\ell_2$ norm of the columns of $U$ is often needed as an extra procedure, yet this simple trick changes the original optimization problem, thereby making the search for the global minimum more complicated. Besides, to preclude the computational difficulty due to division by zero, an extra positive additive value in the denominator is helpful [56].
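For concreteness, the SED-MU updates of Lee and Seung [4] transcribe into a few lines; the small constant eps in the denominators is the positive additive value mentioned above, and the column normalization is omitted here for brevity:

```python
# A compact transcription of the SED-MU rules of Lee and Seung [4].
import numpy as np

def sed_mu(X, L, n_iter=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, L))
    V = rng.random((L, N))
    for _ in range(n_iter):
        V *= (U.T @ X) / (U.T @ U @ V + eps)   # V <- V .* (U^T X) ./ (U^T U V)
        U *= (X @ V.T) / (U @ V @ V.T + eps)   # U <- U .* (X V^T) ./ (U V V^T)
    return U, V
```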
To accelerate the convergence rate, one popular method is to apply gradient descent algorithms with additive update rules. Other techniques such as conjugate gradient, projected gradient, and more sophisticated second-order schemes like Newton and quasi-Newton methods are also in consideration. They choose an appropriate descent direction, such as the gradient direction, and update the elements of the factor matrices additively at a certain learning rate. They differ from one another in either the descent direction or the learning rate strategy. To satisfy the nonnegativity constraint, the updated matrices are brought back to the feasible region, namely the nonnegative orthant, by an additional projection, like simply setting all negative elements to zero. Usually, under certain mild additional conditions, they can guarantee first-order stationarity. These are the algorithms widely developed in Basic NMF in recent years.
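A minimal sketch of one such additive update follows; the fixed step size is an assumption for illustration, whereas practical methods such as Lin's projected gradient [56] choose it, e.g., by line search:

```python
# One additive projected-gradient step for the SED objective (4):
# move along the negative gradient, then project back onto the
# nonnegative orthant by zeroing negative entries.
import numpy as np

def pg_step(X, U, V, step=1e-3):
    grad_U = (U @ V - X) @ V.T              # gradient of (4) w.r.t. U
    grad_V = U.T @ (U @ V - X)              # gradient of (4) w.r.t. V
    U = np.maximum(U - step * grad_U, 0.0)  # projection onto U >= 0
    V = np.maximum(V - step * grad_V, 0.0)  # projection onto V >= 0
    return U, V
```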
