
SIAM J. OPTIM.                                © 2010 Society for Industrial and Applied Mathematics
Vol. 20, No. 4, pp. 1956–1982
A SINGULAR VALUE THRESHOLDING ALGORITHM FOR MATRIX COMPLETION∗

JIAN-FENG CAI†, EMMANUEL J. CANDÈS‡, AND ZUOWEI SHEN§

∗Received by the editors October 23, 2008; accepted for publication (in revised form) January 5, 2010; published electronically March 3, 2010. http://www.siam.org/journals/siopt/20-4/73897.html
†Department of Mathematics, University of California, Los Angeles, CA 90095 (cai@math.ucla.edu). This author is supported by the Wavelets and Information Processing Programme under a grant from DSTA, Singapore.
‡Applied and Computational Mathematics, Caltech, Pasadena, CA 91125 (candes@stanford.edu). This author is partially supported by the Waterman Award from the National Science Foundation and by ONR grant N00014-08-1-0749.
§Department of Mathematics, National University of Singapore, 117543 Singapore (matzuows@nus.edu.sg). This author is supported in part by grant R-146-000-113-112 from the National University of Singapore.
Abstract. This paper introduces a novel algorithm to approximate the matrix with minimum nuclear norm among all matrices obeying a set of convex constraints. This problem may be understood as the convex relaxation of a rank minimization problem and arises in many important applications as in the task of recovering a large matrix from a small subset of its entries (the famous Netflix problem). Off-the-shelf algorithms such as interior point methods are not directly amenable to large problems of this kind with over a million unknown entries. This paper develops a simple first-order and easy-to-implement algorithm that is extremely efficient at addressing problems in which the optimal solution has low rank. The algorithm is iterative, produces a sequence of matrices $\{X^k, Y^k\}$, and at each step mainly performs a soft-thresholding operation on the singular values of the matrix $Y^k$. There are two remarkable features making this attractive for low-rank matrix completion problems. The first is that the soft-thresholding operation is applied to a sparse matrix; the second is that the rank of the iterates $\{X^k\}$ is empirically nondecreasing. Both these facts allow the algorithm to make use of very minimal storage space and keep the computational cost of each iteration low. On the theoretical side, we provide a convergence analysis showing that the sequence of iterates converges. On the practical side, we provide numerical examples in which $1,000 \times 1,000$ matrices are recovered in less than a minute on a modest desktop computer. We also demonstrate that our approach is amenable to very large scale problems by recovering matrices of rank about 10 with nearly a billion unknowns from just about 0.4% of their sampled entries. Our methods are connected with the recent literature on linearized Bregman iterations for $\ell_1$ minimization, and we develop a framework in which one can understand these algorithms in terms of well-known Lagrange multiplier algorithms.
Key words. nuclear norm minimization, matrix completion, singular value thresholding, Lagrange dual function, Uzawa’s algorithm, linearized Bregman iteration
AMS subject classifications. 90C25, 15A83, 65K05
DOI. 10.1137/080738970
1. Introduction.
1.1. Motivation. There is a rapidly growing interest in the recovery of an un-
known low-rank or approximately low-rank matrix from very limited information.
This problem occurs in many areas of engineering and applied science such as ma-
chine learning [2, 3, 4], control [53], and computer vision; see [61]. As a motivating
example, consider the problem of recovering a data matrix from a sampling of its
entries. This routinely comes up whenever one collects partially filled out surveys,
and one would like to infer the many missing entries. In the area of recommender
systems, users submit ratings on a subset of entries in a database, and the vendor
provides recommendations based on the user’s preferences. Because users only rate a
few items, one would like to infer their preference for unrated items; this is the famous
Netflix problem [1]. Recovering a rectangular matrix from a sampling of its entries is
known as the matrix completion problem. The issue is, of course, that this problem
is extraordinarily ill posed since with fewer samples than entries we have infinitely
many completions. Therefore, it is apparently impossible to identify which of these
candidate solutions is indeed the “correct” one without some additional information.
In many instances, however, the matrix we wish to recover has low rank or ap-
proximately low rank. For instance, the Netflix data matrix of all user ratings may
be approximately low rank because it is commonly believed that only a few factors
contribute to anyone’s taste or preference. In computer vision, inferring scene geom-
etry and camera motion from a sequence of images is a well-studied problem known
as the structure-from-motion problem. This is an ill-conditioned problem for objects
that may be distant with respect to their size, or especially for “missing data” which
occur because of occlusion or tracking failures. However, when properly stacked and
indexed, these images form a matrix which has very low rank (e.g., rank 3 under or-
thography) [24,61]. Other examples of low-rank matrix fitting abound; e.g., in control
(system identification), machine learning (multiclass learning), and so on. Having said
this, the premise that the unknown has (approximately) low rank radically changes
the problem, making the search for solutions feasible since the lowest-rank solution
now tends to be the right one.
In a recent paper [16], Candès and Recht showed that matrix completion is not
as ill posed as people thought. Indeed, they proved that most low-rank matrices can
be recovered exactly from most sets of sampled entries even though these sets have
surprisingly small cardinality, and more importantly, they proved that this can be
done by solving a simple convex optimization problem. To state their results, suppose
to simplify that the unknown matrix $M \in \mathbb{R}^{n \times n}$ is square and that one has available $m$ sampled entries $\{M_{ij} : (i, j) \in \Omega\}$, where $\Omega$ is a random subset of cardinality $m$. Then [16] proves that most matrices $M$ of rank $r$ can be perfectly recovered by solving the optimization problem

(1.1)    minimize    $\|X\|_*$
         subject to  $X_{ij} = M_{ij}$,  $(i, j) \in \Omega$,

provided that the number of samples obeys $m \ge C n^{6/5} r \log n$ for some positive numerical constant $C$.$^1$ In (1.1), the functional $\|X\|_*$ is the nuclear norm of the matrix $X$, which is the sum of its singular values. The optimization problem (1.1) is convex and can be recast as a semidefinite program [34, 35]. In some sense, this is the tightest convex relaxation of the NP-hard rank minimization problem

(1.2)    minimize    $\mathrm{rank}(X)$
         subject to  $X_{ij} = M_{ij}$,  $(i, j) \in \Omega$,

since the nuclear ball $\{X : \|X\|_* \le 1\}$ is the convex hull of the set of rank-one matrices with spectral norm bounded by one. Another interpretation of Candès and Recht’s result is that under suitable conditions, the rank minimization program (1.2) and the convex program (1.1) are formally equivalent in the sense that they have exactly the same unique solution.
$^1$Note that an $n \times n$ matrix of rank $r$ depends upon $r(2n - r)$ degrees of freedom.

1.2. Algorithm outline. Because minimizing the nuclear norm both provably recovers the lowest-rank matrix subject to constraints (see [56] for related results) and
gives generally good empirical results in a variety of situations, it is understandably of
great interest to develop numerical methods for solving (1.1). In [16], this optimization
problem was solved using one of the most advanced semidefinite programming solvers,
namely, SDPT3 [59]. This solver and others like SeDuMi are based on interior-point
methods and are problematic when the size of the matrix is large because they need
to solve huge systems of linear equations to compute the Newton direction. In fact,
SDPT3 can only handle n × n matrices with n ≤ 100. Presumably, one could resort
to iterative solvers such as the method of conjugate gradients to solve for the Newton
step, but this is problematic as well since it is well known that the condition number
of the Newton system increases rapidly as one gets closer to the solution. In addition,
none of these general purpose solvers use the fact that the solution may have low
rank. We refer the reader to [50] for some recent progress on interior-point methods
concerning some special nuclear norm-minimization problems.
This paper develops the singular value thresholding algorithm for approximately
solving the nuclear norm minimization problem (1.1) and, by extension, problems of
the form
(1.3)    minimize    $\|X\|_*$
         subject to  $\mathcal{A}(X) = b$,

where $\mathcal{A}$ is a linear operator acting on the space of $n_1 \times n_2$ matrices and $b \in \mathbb{R}^m$. This
algorithm is a simple first-order method and is especially well suited for problems of
very large sizes in which the solution has low rank. We sketch this algorithm in the
special matrix completion setting and let $\mathcal{P}_\Omega$ be the orthogonal projector onto the span of matrices vanishing outside of $\Omega$ so that the $(i, j)$th component of $\mathcal{P}_\Omega(X)$ is equal to $X_{ij}$ if $(i, j) \in \Omega$ and zero otherwise. Our problem may be expressed as

(1.4)    minimize    $\|X\|_*$
         subject to  $\mathcal{P}_\Omega(X) = \mathcal{P}_\Omega(M)$,

with optimization variable $X \in \mathbb{R}^{n_1 \times n_2}$. Fix $\tau > 0$ and a sequence $\{\delta_k\}_{k \ge 1}$ of scalar step sizes. Then, starting with $Y^0 = 0 \in \mathbb{R}^{n_1 \times n_2}$, the algorithm inductively defines

(1.5)    $X^k = \operatorname{shrink}(Y^{k-1}, \tau)$,
         $Y^k = Y^{k-1} + \delta_k \mathcal{P}_\Omega(M - X^k)$
until a stopping criterion is reached. In (1.5), $\operatorname{shrink}(Y, \tau)$ is a nonlinear function which applies a soft-thresholding rule at level $\tau$ to the singular values of the input matrix; see section 2 for details. The key property here is that for large values of $\tau$, the sequence $\{X^k\}$ converges to a solution which very nearly minimizes (1.4). Hence,
at each step, one only needs to compute at most one singular value decomposition
and perform a few elementary matrix additions. Two important remarks are in order:
1. Sparsity. For each $k \ge 0$, $Y^k$ vanishes outside of $\Omega$ and is, therefore, sparse, a fact which can be used to evaluate the shrink function rapidly.
2. Low-rank property. The matrices $X^k$ turn out to have low rank, and hence the algorithm has a minimum storage requirement since we need to keep only principal factors in memory.
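To make the iteration (1.5) and the two remarks above concrete, here is a minimal NumPy sketch added for illustration; it is not the authors' implementation. The function name svt_complete, the stopping rule, and the default parameters are our own choices, and the sketch computes a full dense SVD of $Y^k$ at every step instead of exploiting the sparsity of $Y^k$ with a partial SVD as the remarks suggest.

```python
import numpy as np

def svt_complete(M, mask, tau, delta, max_iters=500, tol=1e-4):
    """Sketch of the SVT iteration (1.5):
    X^k = shrink(Y^{k-1}, tau),  Y^k = Y^{k-1} + delta * P_Omega(M - X^k).
    `mask` is the 0/1 indicator of the sampled set Omega."""
    Y = np.zeros_like(M)
    X = np.zeros_like(M)
    for _ in range(max_iters):
        # Singular value shrinkage: soft-threshold the singular values of Y at level tau.
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt
        # Step on the sampled entries only; Y therefore stays supported on Omega (remark 1).
        residual = mask * (M - X)
        Y = Y + delta * residual
        # Stop when the observed entries are reproduced to relative accuracy `tol`.
        if np.linalg.norm(residual) <= tol * np.linalg.norm(mask * M):
            break
    return X
```

For an $n \times n$ test matrix with $m$ observed entries, one might call svt_complete(mask * M_true, mask, tau=5 * n, delta=1.2 * n * n / m); these parameter values are indicative only, and the paper's own recommendations appear in its numerical section, which lies outside the excerpt reproduced here.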
Our numerical experiments demonstrate that the proposed algorithm can solve
problems, in Matlab, involving matrices of size 30,000 × 30,000 having close to a
billion unknowns in 17 minutes on a standard desktop computer with a 1.86 GHz
CPU (dual core with Matlab’s multithreading option enabled) and 3 GB of memory.
As a consequence, the singular value thresholding algorithm may become a rather
powerful computational tool for large scale matrix completion.

1.3. General formulation. The singular value thresholding algorithm can be
adapted to deal with other types of convex constraints. For instance, it may address
problems of the form
(1.6)    minimize    $\|X\|_*$
         subject to  $f_i(X) \le 0$,  $i = 1, \ldots, m$,

where each $f_i$ is a Lipschitz convex function (note that one can handle linear equality constraints by considering pairs of affine functionals). In the simpler case where the $f_i$’s are affine functionals, the general algorithm goes through a sequence of iterations
which greatly resemble (1.5). This is useful because this enables the development of
numerical algorithms which are effective for recovering matrices from a small subset
of sampled entries possibly contaminated with noise.
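To spell out the parenthetical remark about linear equality constraints in (1.6): writing the $i$th component of a linear constraint as $\langle A_i, X \rangle = b_i$ (our notation, not the paper's), the equality can be encoded by the pair of affine, hence Lipschitz convex, functionals

$$ f_{2i-1}(X) = \langle A_i, X\rangle - b_i \le 0, \qquad f_{2i}(X) = b_i - \langle A_i, X\rangle \le 0, $$

so that equality-constrained problems such as (1.3) fit the general form (1.6).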
1.4. Contents and notations. The rest of the paper is organized as follows.
In section 2, we derive the singular value thresholding (SVT) algorithm for the ma-
trix completion problem and recast it in terms of a well-known Lagrange multiplier
algorithm. In section 3, we extend the SVT algorithm and formulate a general itera-
tion which is applicable to general convex constraints. In section 4, we establish the
convergence results for the iterations given in sections 2 and 3. We demonstrate the
performance and effectiveness of the algorithm through numerical examples in sec-
tion 5, and review additional implementation details. Finally, we conclude the paper
with a short discussion in section 6.
Before continuing, we provide here a brief summary of the notations used through-
out the paper. Matrices are bold capital, vectors are bold lowercase, and scalars or
entries are not bold. For instance, $X$ is a matrix and $X_{ij}$ its $(i, j)$th entry. Likewise, $x$ is a vector and $x_i$ its $i$th component. The nuclear norm of a matrix is denoted by $\|X\|_*$, the Frobenius norm by $\|X\|_F$, and the spectral norm by $\|X\|_2$; note that these are, respectively, the 1-norm, the 2-norm, and the sup-norm of the vector of singular values. The adjoint of a matrix $X$ is $X^*$ and similarly for vectors. The notation $\mathrm{diag}(x)$, where $x$ is a vector, stands for the diagonal matrix with $\{x_i\}$ as diagonal elements. We denote by $\langle X, Y\rangle = \mathrm{trace}(X^*Y)$ the standard inner product between two matrices ($\|X\|_F^2 = \langle X, X\rangle$). The Cauchy–Schwarz inequality gives $\langle X, Y\rangle \le \|X\|_F\|Y\|_F$, and it is well known that we also have $\langle X, Y\rangle \le \|X\|_*\|Y\|_2$ (the spectral and nuclear norms are dual to one another); see, e.g., [16, 56].
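As a quick sanity check of these norm identities and duality inequalities (added here for illustration; not part of the paper), the following snippet verifies them numerically on small random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
Y = rng.standard_normal((5, 4))

s = np.linalg.svd(X, compute_uv=False)        # singular values of X
nuclear = s.sum()                             # ||X||_* = 1-norm of the singular values
frob = np.linalg.norm(X, 'fro')               # ||X||_F = 2-norm of the singular values
spectral_Y = np.linalg.norm(Y, 2)             # ||Y||_2 = largest singular value of Y
inner = np.trace(X.T @ Y)                     # <X, Y> = trace(X^* Y) for real matrices

assert np.isclose(frob, np.linalg.norm(s))                  # Frobenius vs. singular values
assert inner <= frob * np.linalg.norm(Y, 'fro') + 1e-12     # Cauchy–Schwarz
assert inner <= nuclear * spectral_Y + 1e-12                # nuclear/spectral duality
```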
2. The singular value thresholding algorithm. This section introduces the
singular value thresholding algorithm and discusses some of its basic properties. We
begin with the definition of a key building block, namely, the singular value thresh-
olding operator.
2.1. The singular value shrinkage operator. Consider the singular value
decomposition (SVD) of a matrix $X \in \mathbb{R}^{n_1 \times n_2}$ of rank $r$,

(2.1)   $X = U\Sigma V^*$,   $\Sigma = \mathrm{diag}(\{\sigma_i\}_{1 \le i \le r})$,

where $U$ and $V$ are, respectively, $n_1 \times r$ and $n_2 \times r$ matrices with orthonormal columns, and the singular values $\sigma_i$ are positive (unless specified otherwise, we will always assume that the SVD of a matrix is given in the reduced form above). For each $\tau \ge 0$, we introduce the soft-thresholding operator $\mathcal{D}_\tau$ defined as follows:

(2.2)   $\mathcal{D}_\tau(X) := U\mathcal{D}_\tau(\Sigma)V^*$,   $\mathcal{D}_\tau(\Sigma) = \mathrm{diag}(\{(\sigma_i - \tau)_+\})$,

where $t_+$ is the positive part of $t$, namely, $t_+ = \max(0, t)$. In words, this operator
simply applies a soft-thresholding rule to the singular values of X, effectively shrinking
these toward zero. This is the reason why we will also refer to this transformation
as the singular value shrinkage operator. Even though the SVD may not be unique,
it is easy to see that the singular value shrinkage operator is well defined, and we
do not elaborate further on this issue. In some sense, this shrinkage operator is a
straightforward extension of the soft-thresholding rule for scalars and vectors. In
particular, note that if many of the singular values of X are below the threshold
$\tau$, the rank of $\mathcal{D}_\tau(X)$ may be considerably lower than that of $X$, just like the soft-
thresholding rule applied to vectors leads to sparser outputs whenever some entries
of the input are below threshold.
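The rank-reduction effect just described is easy to observe numerically. The short sketch below (our illustration; the helper name shrink_operator and the test matrix are made up for this example) applies $\mathcal{D}_\tau$ to a noisy low-rank matrix and shows the rank dropping to that of the underlying signal once the small singular values fall below the threshold:

```python
import numpy as np

def shrink_operator(X, tau):
    """Singular value shrinkage D_tau: soft-threshold the singular values of X at level tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(1)
# Rank-3 signal plus small noise: a few large singular values, many tiny ones.
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))
X = A + 0.01 * rng.standard_normal((50, 40))

Xs = shrink_operator(X, tau=1.0)
print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(Xs))   # e.g., 40 vs. 3
```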
The singular value thresholding operator is the proximity operator associated with
the nuclear norm. Details about the proximity operator can be found in, e.g., [42].
Theorem 2.1.$^2$ For each $\tau \ge 0$ and $Y \in \mathbb{R}^{n_1 \times n_2}$, the singular value shrinkage operator (2.2) obeys

(2.3)   $\mathcal{D}_\tau(Y) = \arg\min_{X} \left\{ \tfrac{1}{2}\|X - Y\|_F^2 + \tau\|X\|_* \right\}$.

Proof. Since the function $h_0(X) := \tau\|X\|_* + \tfrac{1}{2}\|X - Y\|_F^2$ is strictly convex, it is easy to see that there exists a unique minimizer, and we thus need to prove that it is equal to $\mathcal{D}_\tau(Y)$. To do this, recall the definition of a subgradient of a convex function $f : \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}$. We say that $Z$ is a subgradient of $f$ at $X_0$, denoted $Z \in \partial f(X_0)$, if

(2.4)   $f(X) \ge f(X_0) + \langle Z, X - X_0 \rangle$

for all $X$. Now $\hat{X}$ minimizes $h_0$ if and only if $0$ is a subgradient of the functional $h_0$ at the point $\hat{X}$, i.e.,

(2.5)   $0 \in \hat{X} - Y + \tau\,\partial\|\hat{X}\|_*$,

where $\partial\|\hat{X}\|_*$ is the set of subgradients of the nuclear norm. Let $X \in \mathbb{R}^{n_1 \times n_2}$ be an arbitrary matrix and $U\Sigma V^*$ be its SVD. It is known [16, 46, 64] that

(2.6)   $\partial\|X\|_* = \left\{ UV^* + W : W \in \mathbb{R}^{n_1 \times n_2},\ U^*W = 0,\ WV = 0,\ \|W\|_2 \le 1 \right\}$.

Set $\hat{X} := \mathcal{D}_\tau(Y)$ for short. In order to show that $\hat{X}$ obeys (2.5), decompose the SVD of $Y$ as $Y = U_0\Sigma_0V_0^* + U_1\Sigma_1V_1^*$, where $U_0$, $V_0$ (resp., $U_1$, $V_1$) are the singular vectors associated with singular values greater than $\tau$ (resp., smaller than or equal to $\tau$). With these notations, we have $\hat{X} = U_0(\Sigma_0 - \tau I)V_0^*$ and, therefore,

$Y - \hat{X} = \tau(U_0V_0^* + W)$,   $W = \tau^{-1}U_1\Sigma_1V_1^*$.

By definition, $U_0^*W = 0$, $WV_0 = 0$, and since the diagonal elements of $\Sigma_1$ have magnitudes bounded by $\tau$, we also have $\|W\|_2 \le 1$. Hence $Y - \hat{X} \in \tau\,\partial\|\hat{X}\|_*$, which concludes the proof.

$^2$One reviewer pointed out that a similar result had been mentioned in a talk given by Donald Goldfarb at the Foundations of Computational Mathematics conference which took place in Hong Kong in June 2008.
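To complement the proof of Theorem 2.1, the optimality condition (2.5) together with the subgradient characterization (2.6) can be verified numerically on a random instance. The snippet below (an added sanity check, not from the paper) forms $\hat{X} = \mathcal{D}_\tau(Y)$, reads off $U_0$, $V_0$ from its reduced SVD, computes the candidate $W$ from $Y - \hat{X} = \tau(U_0V_0^* + W)$, and checks the three defining properties of a nuclear-norm subgradient at $\hat{X}$:

```python
import numpy as np

def shrink_operator(Y, tau):
    """Singular value shrinkage D_tau."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(2)
Y, tau = rng.standard_normal((8, 6)), 1.0
X_hat = shrink_operator(Y, tau)

# Reduced SVD of X_hat: its left/right singular vectors are U_0, V_0 from the proof.
U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
r = int(np.sum(s > 1e-10))
U0, V0 = U[:, :r], Vt[:r, :].T

# Y - X_hat = tau * (U0 V0^* + W)  =>  W = (Y - X_hat) / tau - U0 V0^*.
W = (Y - X_hat) / tau - U0 @ V0.T
print(np.allclose(U0.T @ W, 0))            # U_0^* W = 0
print(np.allclose(W @ V0, 0))              # W V_0 = 0
print(np.linalg.norm(W, 2) <= 1 + 1e-10)   # ||W||_2 <= 1
```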
