APPLIED & COMPUTATIONAL MATHEMATICS
CALIFORNIA INSTITUTE OF TECHNOLOGY
Mail Code 217-50, Pasadena, CA 91125
Technical Report No. 2010-01
April 2010
USER-FRIENDLY TAIL BOUNDS
FOR SUMS OF RANDOM MATRICES
JOEL A. TROPP
Abstract. This work presents probability inequalities for sums of independent, random, self-
adjoint matrices. The results frame simple, easily verifiable hypotheses on the summands, and they
yield strong conclusions about the large-deviation behavior of the maximum eigenvalue of the sum.
Tail bounds for the norm of a sum of rectangular matrices follow as an immediate corollary, and
similar techniques yield information about matrix-valued martingales.
In other words, this paper provides noncommutative generalizations of the classical bounds
associated with the names Azuma, Bennett, Bernstein, Chernoff, Hoeffding, and McDiarmid. The
matrix inequalities promise the same ease of use, diversity of application, and strength of conclusion
that have made the scalar inequalities so valuable.
1. Introduction
Random matrices have come to play a significant role in computational mathematics. This line
of research has advanced by using established methods from random matrix theory, but it has also
generated difficult questions that cannot be addressed without new tools. Let us summarize some
of the challenges that arise.
• For numerical applications, it is important to obtain detailed quantitative information about
random matrices of finite order. Asymptotic theory has limited value.
• Many problems require explicit large deviation bounds for the extreme eigenvalues of a
random matrix. In other cases, we are concerned not with the eigenvalue spectrum but
rather with the action of a random operator on some class of vectors or matrices.
• In numerical analysis, it is essential to compute effective constants to ensure that an algo-
rithm is provably correct in practice.
• We often encounter highly structured matrices that involve a limited amount of randomness.
One important example is the randomized DFT, which consists of a diagonal matrix of signs
multiplied by a discrete Fourier transform matrix.
• Other problems involve a sparse matrix sampled from a fixed matrix or a random submatrix
drawn from a fixed matrix. These applications lead to random matrices whose distribution
varies by coordinate, in contrast to the classical ensembles of random matrices that have
i.i.d. entries or i.i.d. columns.
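For concreteness, the randomized DFT mentioned above can be sketched in a few lines; the normalization and sign convention here are illustrative assumptions, not tied to any particular application:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Diagonal matrix of independent random signs: the only randomness present.
D = np.diag(rng.choice([-1.0, 1.0], size=d))

# Unitary discrete Fourier transform matrix (normalized by sqrt(d)).
F = np.fft.fft(np.eye(d)) / np.sqrt(d)

# The randomized DFT: highly structured, with only d random bits.
A = F @ D

# The product of two unitary matrices is unitary.
assert np.allclose(A.conj().T @ A, np.eye(d))
```

Despite its randomness, every realization of this matrix is unitary; the distribution clearly differs from ensembles with i.i.d. entries or columns.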
We have encountered these issues in a wide range of problems from computational mathemat-
ics: smoothed analysis of Gaussian elimination [SST06]; semidefinite relaxation and rounding of
quadratic maximization problems [Nem07, So09]; construction of maps for dimensionality reduc-
tion [AC09]; matrix approximation by sparsification [AM07] and by sampling submatrices [RV07];
Date: 25 April 2010. Corrected: 29 April 2010.
Key words and phrases. Discrete-time martingale, large deviation, random matrix, sum of independent random
variables.
2010 Mathematics Subject Classification. Primary: 60B20. Secondary: 60F10, 60G50, 60G42.
JAT is with Applied and Computational Mathematics, MC 305-16, California Inst. Technology, Pasadena, CA
91125. E-mail: jtropp@acm.caltech.edu. Research supported by ONR award N00014-08-1-0883, DARPA award
N66001-08-1-2065, and AFOSR award FA9550-09-1-0643.
analysis of sparse approximation [Tro08] and compressive sampling [CR07] problems; random-
ized schemes for low-rank matrix factorization [HMT09]; and analysis of algorithms for comple-
tion [Gro09, Rec09] and decomposition [CSPW09, CLMW09] of low-rank matrices. And this list
is by no means comprehensive!
In these applications, the methods currently invoked to study random matrices are often cum-
bersome, and they require a substantial amount of practice to use effectively. These frustrations
have led us to search for simpler techniques that still yield detailed quantitative information about
finite random matrices.
Inspired by the work of Ahlswede–Winter [AW02] and Rudelson–Vershynin [Rud99, RV07], we
study sums of independent, random, self-adjoint matrices. Our results place simple and easily
verifiable hypotheses on the summands that allow us to reach strong conclusions about the large-
deviation behavior of the maximum eigenvalue of the sum. These bounds can be viewed as matrix
analogs of the probability inequalities associated with the names Azuma, Bennett, Bernstein, Cher-
noff, Hoeffding, and McDiarmid. We hope that these new matrix inequalities will offer researchers
the same ease of use, diversity of application, and strength of conclusion that have made the scalar
inequalities so indispensable.
1.1. Roadmap. The rest of the paper is organized as follows. Section 2 provides an overview of
our main results and a discussion of related work. Section 3 introduces the background required
for our proofs, which ranges from the elementary to the esoteric. Section 4 contains the main
technical innovations. Sections 5–8 complete the proofs of the matrix probability inequalities.
Section 9 describes some complementary results, including the extension to rectangular matrices.
We conclude in Section 10 with some open questions.
2. Main Results and Discussion
Our goal has been to extend the most useful of the classical tail bounds to the matrix case, rather
than to produce a complete catalog of matrix inequalities. This approach allows us to introduce
several different techniques that are useful for making the translation from the scalar to the matrix
setting. This section summarizes the main results for easy reference. Section 2.6 describes some
additional theorems that may be found deeper inside the paper.
2.1. Technical Approach. Consider a finite sequence $\{X_k\}$ of independent, random, self-adjoint matrices. We wish to bound the probability
$$\mathbb{P}\Big\{ \lambda_{\max}\Big( \sum_k X_k \Big) \geq t \Big\}.$$
Here and elsewhere, $\lambda_{\max}$ denotes the algebraically largest eigenvalue of a self-adjoint matrix. This
formulation is more general than it may appear because we can exploit the same ideas to explore
several related problems:
• We can study the smallest eigenvalue of the sum.
• We can bound the largest singular value of a sum of random rectangular matrices.
• We can extend these methods to matrix-valued martingales.
• We can investigate the probability that the sum satisfies other semidefinite relations.
In the matrix setting, the structure of the main argument parallels established proofs of the
classical inequalities. See [McD98, Lug09] for accessible surveys in the scalar setting. First, we
describe a suitable generalization of Bernstein’s argument, which is sometimes known as the Laplace
transform method. In the matrix setting, this approach yields the bound
$$\mathbb{P}\Big\{ \lambda_{\max}\Big( \sum_k X_k \Big) \geq t \Big\} \leq \inf_{\theta > 0} \Big\{ e^{-\theta t} \cdot \operatorname{tr} \exp\Big( \sum_k \log \mathbb{E}\, e^{\theta X_k} \Big) \Big\}.$$
In words, the probability of a large deviation is controlled by the “cumulant generating functions”
of the random matrices. Although this inequality superficially resembles the classical Laplace
transform bound for real random variables, the proof is no longer elementary. Our argument relies
on a deep inequality of Lieb [Lie73, Thm. 6]. This part of the reasoning appears in Section 4.
As in the scalar case, the second stage of the development uses information about each random
matrix to obtain bounds for the “cumulant generating functions.” Certain classical methods extend
directly to the matrix case, but they usually require additional care. Other proofs do not generalize
at all, and we have to identify alternative approaches. Sections 5–8 present these arguments.
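To see where the difficulty enters, it may help to recall the scalar Laplace transform argument, which combines Markov's inequality with independence:

```latex
\mathbb{P}\Big\{ \sum_k X_k \geq t \Big\}
  \leq e^{-\theta t} \, \mathbb{E} \exp\Big( \theta \sum_k X_k \Big)
  = e^{-\theta t} \prod_k \mathbb{E}\, e^{\theta X_k}
  = e^{-\theta t} \exp\Big( \sum_k \log \mathbb{E}\, e^{\theta X_k} \Big).
```

The middle equality uses the identity $e^{a+b} = e^a e^b$, which fails for noncommuting matrices; that is the step Lieb's theorem is needed to replace.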
Let us emphasize that many of the ideas in this work have appeared in the literature. The primary
precedent is the important paper of Ahlswede and Winter [AW02], which develops a matrix analog
of the Laplace transform method; see also [Gro09, Rec09]. We have been influenced strongly by
Rudelson and Vershynin’s approach [Rud99, RV07] to random matrices via the noncommutative
Khintchine inequality [LP86, Buc01]. Finally, the recent work of Oliveira [Oli10b] persuaded us
that it might be possible to combine the best qualities of these two approaches.
2.2. Rademacher and Gaussian Series. For motivation, we begin with the simplest example
of a sum of independent random variables: a series with real coefficients modulated by random
signs. This discussion illustrates some new phenomena that arise when we try to translate scalar
tail bounds to the matrix setting.
Consider a finite sequence $\{a_k\}$ of real numbers and a finite sequence $\{\varepsilon_k\}$ of independent Rademacher variables¹. A classical result, due to Bernstein, shows that
$$\mathbb{P}\Big\{ \sum_k \varepsilon_k a_k \geq t \Big\} \leq e^{-t^2/2\sigma^2} \quad\text{where}\quad \sigma^2 = \sum_k a_k^2. \tag{2.1}$$
In words, a real Rademacher series exhibits normal concentration with variance equal to the sum
of the squared coefficients. The central limit theorem guarantees that there are Rademacher series
where this estimate is essentially sharp.
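As a quick numerical sanity check, the bound (2.1) is easy to verify by Monte Carlo; the coefficients and deviation level below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
a = np.array([1.0, 2.0, 0.5, 1.5])   # arbitrary fixed real coefficients
sigma2 = np.sum(a**2)                # variance parameter from (2.1)
t = 4.0

# Monte Carlo estimate of P{ sum_k eps_k a_k >= t } over random sign patterns.
eps = rng.choice([-1.0, 1.0], size=(200_000, a.size))
empirical = np.mean(eps @ a >= t)

# The bound e^{-t^2 / (2 sigma^2)} should dominate the empirical tail.
bound = np.exp(-t**2 / (2 * sigma2))
assert empirical <= bound
```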
What is the correct generalization of (2.1) to random matrices? The approach of Ahlswede and
Winter [AW02] suggests the bound
$$\mathbb{P}\Big\{ \lambda_{\max}\Big( \sum_k \varepsilon_k A_k \Big) \geq t \Big\} \leq d \cdot e^{-t^2/2\sigma^2} \quad\text{where}\quad \sigma^2 = \sum_k \big\| A_k^2 \big\|. \tag{2.2}$$
The symbol $\|\cdot\|$ denotes the usual norm for operators on a Hilbert space, which returns the largest
singular value of its argument. Although the statement (2.2) identifies a plausible generalization for
the variance, this result can be improved dramatically in most cases. Indeed, a matrix Rademacher
series satisfies a fundamentally stronger tail bound.
Theorem 2.1 (Matrix Rademacher and Gaussian Series). Consider a finite sequence $\{A_k\}$ of fixed self-adjoint matrices with dimension $d$, and let $\{\varepsilon_k\}$ be a sequence of independent Rademacher variables. Compute the norm of the sum of squared coefficient matrices:
$$\sigma^2 = \Big\| \sum_k A_k^2 \Big\|. \tag{2.3}$$
For all $t \geq 0$,
$$\mathbb{P}\Big\{ \lambda_{\max}\Big( \sum_k \varepsilon_k A_k \Big) \geq t \Big\} \leq d \cdot e^{-t^2/2\sigma^2}. \tag{2.4}$$
In particular,
$$\mathbb{P}\Big\{ \Big\| \sum_k \varepsilon_k A_k \Big\| \geq t \Big\} \leq 2d \cdot e^{-t^2/2\sigma^2}. \tag{2.5}$$
The same bounds hold when we replace $\{\varepsilon_k\}$ by a sequence of independent, standard normal random variables.
¹A Rademacher random variable is uniformly distributed on $\{\pm 1\}$.
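A small simulation, sketched below with arbitrary choices of dimension, number of summands, and coefficient matrices, illustrates the bound (2.4); it is a numerical sanity check, not part of any proof:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 5, 20

# Fixed self-adjoint coefficient matrices: drawn once, then held fixed.
A = rng.standard_normal((n, d, d))
A = (A + A.transpose(0, 2, 1)) / 2

# Variance parameter (2.3): spectral norm of the sum of squared coefficients.
sigma2 = np.linalg.norm(np.sum(A @ A, axis=0), 2)
t = 3.5 * np.sqrt(sigma2)

# Empirical tail probability of the maximum eigenvalue over random signs.
trials = 5_000
hits = 0
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=n)
    S = np.einsum('k,kij->ij', eps, A)
    if np.linalg.eigvalsh(S)[-1] >= t:
        hits += 1
empirical = hits / trials

# Right-hand side of (2.4).
bound = d * np.exp(-t**2 / (2 * sigma2))
assert empirical <= bound
```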
When the dimension d = 1, the bound (2.4) reduces to the classical result (2.1). Of course, one
may still wonder whether the formula (2.3) for the variance is sharp and whether the dimensional
dependence is necessary. Remarks 2.2, 2.3, and 2.4 demonstrate that Theorem 2.1 cannot be
improved without changing its form. A casual reader may bypass this discussion without loss of
continuity.
The technology required to prove Theorem 2.1 has been available for some time now. One
argument applies sharp noncommutative Khintchine inequalities, [Buc01, Thm. 5] and [Buc05,
Thm. 5], to bound the moment generating function of the maximum eigenvalue of the random sum.
Very recently, Oliveira has developed a different approach [Oli10b, Lem. 2] using a clever variation
of Ahlswede and Winter’s techniques. We present our proof in Section 7.
Remark 2.2. The matrix variance $\sigma^2$ given by (2.3) is truly the correct quantity for controlling large deviations of a matrix Gaussian series. Indeed, it follows from general principles [LT91, Cor. 3.2] that
$$\lim_{t\to\infty} \frac{1}{t^2} \log \mathbb{P}\Big\{ \Big\| \sum_k \gamma_k A_k \Big\| \geq t \Big\} = -\frac{1}{2\sigma^2},$$
where $\{\gamma_k\}$ is a sequence of independent, standard normal variables. By the (scalar) central limit theorem, we can construct Rademacher series that exhibit essentially the same large-deviation behavior by repeating each matrix $A_k$ multiple times. (Of course, a finite Rademacher series is almost surely bounded!)
In contrast to a Gaussian series, a Rademacher series can have a constant operator norm. Nevertheless, the matrix variance in (2.3) always provides a lower bound for the supremal norm of the series:
$$\sigma \leq \sup_{\varepsilon} \Big\| \sum_k \varepsilon_k A_k \Big\|.$$
This fact follows easily from the statement of the noncommutative Khintchine inequality in [Rud99, Sec. 3]. A simple example shows that the lower bound is sharp. Let $E_{ij}$ be the matrix with a unit entry in the $(i, j)$ position and zeros elsewhere, and consider the Rademacher series with coefficients $A_k = E_{kk}$ for $k = 1, 2, \dots, d$. This example also demonstrates that the bound (2.2) is fundamentally worse than Theorem 2.1.
Remark 2.3. In general, we cannot remove the factor d from the probability bound in Theorem 2.1.
Consider the Gaussian series
$$\Big\| \sum_{k=1}^{d} \gamma_k E_{kk} \Big\| = \max_k |\gamma_k| \geq c \sqrt{\log d} \quad \text{with high probability}.$$
Since the variance parameter $\sigma^2 = 1$, Theorem 2.1 yields
$$\mathbb{P}\Big\{ \Big\| \sum_{k=1}^{d} \gamma_k E_{kk} \Big\| \geq t \Big\} \leq d \cdot e^{-t^2/2}.$$
We need the factor $d$ to ensure that the probability bound does not become effective until $t \geq \sqrt{2 \log d}$. The dimensional factor is also necessary in the tail bound for Rademacher series because of the central limit theorem.
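The claim that $\max_k |\gamma_k| \gtrsim \sqrt{\log d}$ with high probability is easy to check empirically; the dimension and trial count below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
d, trials = 1000, 2000

# The norm of the diagonal series sum_k gamma_k E_kk is exactly max_k |gamma_k|.
norms = np.max(np.abs(rng.standard_normal((trials, d))), axis=1)

# The typical norm exceeds sqrt(log d), so a bound that becomes effective
# only at t >= sqrt(2 log d) cannot be dimension-free.
assert np.median(norms) >= np.sqrt(np.log(d))
```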
Remark 2.4. The dimensional dependence does not appear in standard bounds for Rademacher
series in Banach space because they concern the deviation of the norm of the sum above its mean
value. For example, Ledoux [Led96, Eqn. (1.9)] proves that
$$\mathbb{P}\Big\{ \Big\| \sum_k \varepsilon_k A_k \Big\| \geq \mathbb{E}\, \Big\| \sum_k \varepsilon_k A_k \Big\| + t \Big\} \leq e^{-t^2/8\sigma^2},$$
where $\sigma^2$ is given by (2.3). Unfortunately, this formula provides no information about the size of the expectation. In contrast, we can always bound the expectation by integrating (2.5), although the estimate may not be sharp.
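For instance, one standard way to carry out this integration, sketched here rather than quoted from any source, splits the tail integral at a level $u > 0$:

```latex
\mathbb{E}\, \Big\| \sum_k \varepsilon_k A_k \Big\|
  = \int_0^\infty \mathbb{P}\Big\{ \Big\| \sum_k \varepsilon_k A_k \Big\| \geq t \Big\} \, \mathrm{d}t
  \leq u + \int_u^\infty 2d \cdot e^{-t^2/2\sigma^2} \, \mathrm{d}t
  \leq u + \frac{2d\,\sigma^2}{u}\, e^{-u^2/2\sigma^2}.
```

Choosing $u = \sigma \sqrt{2 \log 2d}$ makes the exponential equal $1/(2d)$ and yields $\mathbb{E}\, \| \sum_k \varepsilon_k A_k \| \leq \sigma \big( \sqrt{2 \log 2d} + (2 \log 2d)^{-1/2} \big)$, which matches the typical size of the maximum up to the lower-order additive term.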