APPLIED & COMPUTATIONAL MATHEMATICS
CALIFORNIA INSTITUTE OF TECHNOLOGY
Mail Code 217-50, Pasadena, CA 91125
Technical Report No. 2010-01
April 2010
USER-FRIENDLY TAIL BOUNDS
FOR SUMS OF RANDOM MATRICES
JOEL A. TROPP
Abstract. This work presents probability inequalities for sums of independent, random, self-
adjoint matrices. The results frame simple, easily verifiable hypotheses on the summands, and they
yield strong conclusions about the large-deviation behavior of the maximum eigenvalue of the sum.
Tail bounds for the norm of a sum of rectangular matrices follow as an immediate corollary, and
similar techniques yield information about matrix-valued martingales.
In other words, this paper provides noncommutative generalizations of the classical bounds
associated with the names Azuma, Bennett, Bernstein, Chernoff, Hoeffding, and McDiarmid. The
matrix inequalities promise the same ease of use, diversity of application, and strength of conclusion
that have made the scalar inequalities so valuable.
1. Introduction
Random matrices have come to play a significant role in computational mathematics. This line
of research has advanced by using established methods from random matrix theory, but it has also
generated difficult questions that cannot be addressed without new tools. Let us summarize some
of the challenges that arise.
• For numerical applications, it is important to obtain detailed quantitative information about
random matrices of finite order. Asymptotic theory has limited value.
• Many problems require explicit large deviation bounds for the extreme eigenvalues of a
random matrix. In other cases, we are concerned not with the eigenvalue spectrum but
rather with the action of a random operator on some class of vectors or matrices.
• In numerical analysis, it is essential to compute effective constants to ensure that an algo-
rithm is provably correct in practice.
• We often encounter highly structured matrices that involve a limited amount of randomness.
One important example is the randomized DFT, which consists of a diagonal matrix of signs
multiplied by a discrete Fourier transform matrix.
• Other problems involve a sparse matrix sampled from a fixed matrix or a random submatrix
drawn from a fixed matrix. These applications lead to random matrices whose distribution
varies by coordinate, in contrast to the classical ensembles of random matrices that have
i.i.d. entries or i.i.d. columns.
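For concreteness, the randomized DFT mentioned above can be sketched in a few lines; the normalization and sign convention here are illustrative assumptions, not tied to any particular application:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Diagonal matrix of independent random signs: the only randomness present.
D = np.diag(rng.choice([-1.0, 1.0], size=d))

# Unitary discrete Fourier transform matrix (normalized by sqrt(d)).
F = np.fft.fft(np.eye(d)) / np.sqrt(d)

# The randomized DFT: highly structured, with only d random bits.
A = F @ D

# The product of two unitary matrices is unitary.
assert np.allclose(A.conj().T @ A, np.eye(d))
```

Despite its randomness, every realization of this matrix is unitary; the distribution clearly differs from ensembles with i.i.d. entries or columns.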
We have encountered these issues in a wide range of problems from computational mathemat-
ics: smoothed analysis of Gaussian elimination [SST06]; semidefinite relaxation and rounding of
quadratic maximization problems [Nem07, So09]; construction of maps for dimensionality reduc-
tion [AC09]; matrix approximation by sparsification [AM07] and by sampling submatrices [RV07];
Date: 25 April 2010. Corrected: 29 April 2010.
Key words and phrases. Discrete-time martingale, large deviation, random matrix, sum of independent random
variables.
2010 Mathematics Subject Classification. Primary: 60B20. Secondary: 60F10, 60G50, 60G42.
JAT is with Applied and Computational Mathematics, MC 305-16, California Inst. Technology, Pasadena, CA
91125. E-mail: jtropp@acm.caltech.edu. Research supported by ONR award N00014-08-1-0883, DARPA award
N66001-08-1-2065, and AFOSR award FA9550-09-1-0643.
analysis of sparse approximation [Tro08] and compressive sampling [CR07] problems; random-
ized schemes for low-rank matrix factorization [HMT09]; and analysis of algorithms for comple-
tion [Gro09, Rec09] and decomposition [CSPW09, CLMW09] of low-rank matrices. And this list
is by no means comprehensive!
In these applications, the methods currently invoked to study random matrices are often cum-
bersome, and they require a substantial amount of practice to use effectively. These frustrations
have led us to search for simpler techniques that still yield detailed quantitative information about
finite random matrices.
Inspired by the work of Ahlswede–Winter [AW02] and Rudelson–Vershynin [Rud99, RV07], we
study sums of independent, random, self-adjoint matrices. Our results place simple and easily
verifiable hypotheses on the summands that allow us to reach strong conclusions about the large-
deviation behavior of the maximum eigenvalue of the sum. These bounds can be viewed as matrix
analogs of the probability inequalities associated with the names Azuma, Bennett, Bernstein, Cher-
noff, Hoeffding, and McDiarmid. We hope that these new matrix inequalities will offer researchers
the same ease of use, diversity of application, and strength of conclusion that have made the scalar
inequalities so indispensable.
1.1. Roadmap. The rest of the paper is organized as follows. Section 2 provides an overview of
our main results and a discussion of related work. Section 3 introduces the background required
for our proofs, which ranges from the elementary to the esoteric. Section 4 contains the main
technical innovations. Sections 5–8 complete the proofs of the matrix probability inequalities.
Section 9 describes some complementary results, including the extension to rectangular matrices.
We conclude in Section 10 with some open questions.
2. Main Results and Discussion
Our goal has been to extend the most useful of the classical tail bounds to the matrix case, rather
than to produce a complete catalog of matrix inequalities. This approach allows us to introduce
several different techniques that are useful for making the translation from the scalar to the matrix
setting. This section summarizes the main results for easy reference. Section 2.6 describes some
additional theorems that may be found deeper inside the paper.
2.1. Technical Approach. Consider a finite sequence $\{X_k\}$ of independent, random, self-adjoint matrices. We wish to bound the probability
$$\mathbb{P}\Big\{ \lambda_{\max}\Big( \sum_k X_k \Big) \geq t \Big\}.$$
Here and elsewhere, $\lambda_{\max}$ denotes the algebraically largest eigenvalue of a self-adjoint matrix. This
formulation is more general than it may appear because we can exploit the same ideas to explore
several related problems:
• We can study the smallest eigenvalue of the sum.
• We can bound the largest singular value of a sum of random rectangular matrices.
• We can extend these methods to matrix-valued martingales.
• We can investigate the probability that the sum satisfies other semidefinite relations.
In the matrix setting, the structure of the main argument parallels established proofs of the
classical inequalities. See [McD98, Lug09] for accessible surveys in the scalar setting. First, we
describe a suitable generalization of Bernstein’s argument, which is sometimes known as the Laplace
transform method. In the matrix setting, this approach yields the bound
$$\mathbb{P}\Big\{ \lambda_{\max}\Big( \sum_k X_k \Big) \geq t \Big\} \leq \inf_{\theta > 0} \Big\{ e^{-\theta t} \cdot \operatorname{tr} \exp\Big( \sum_k \log \mathbb{E}\, e^{\theta X_k} \Big) \Big\}.$$
In words, the probability of a large deviation is controlled by the “cumulant generating functions”
of the random matrices. Although this inequality superficially resembles the classical Laplace
transform bound for real random variables, the proof is no longer elementary. Our argument relies
on a deep inequality of Lieb [Lie73, Thm. 6]. This part of the reasoning appears in Section 4.
As in the scalar case, the second stage of the development uses information about each random
matrix to obtain bounds for the “cumulant generating functions.” Certain classical methods extend
directly to the matrix case, but they usually require additional care. Other proofs do not generalize
at all, and we have to identify alternative approaches. Sections 5–8 present these arguments.
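To see where the difficulty enters, it may help to recall the scalar Laplace transform argument, which combines Markov's inequality with independence:

```latex
\mathbb{P}\Big\{ \sum_k X_k \geq t \Big\}
  \leq e^{-\theta t} \, \mathbb{E} \exp\Big( \theta \sum_k X_k \Big)
  = e^{-\theta t} \prod_k \mathbb{E}\, e^{\theta X_k}
  = e^{-\theta t} \exp\Big( \sum_k \log \mathbb{E}\, e^{\theta X_k} \Big).
```

The middle equality uses the identity $e^{a+b} = e^a e^b$, which fails for noncommuting matrices; that is the step Lieb's theorem is needed to replace.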
Let us emphasize that many of the ideas in this work have appeared in the literature. The primary
precedent is the important paper of Ahlswede and Winter [AW02], which develops a matrix analog
of the Laplace transform method; see also [Gro09, Rec09]. We have been influenced strongly by
Rudelson and Vershynin’s approach [Rud99, RV07] to random matrices via the noncommutative
Khintchine inequality [LP86, Buc01]. Finally, the recent work of Oliveira [Oli10b] persuaded us
that it might be possible to combine the best qualities of these two approaches.
2.2. Rademacher and Gaussian Series. For motivation, we begin with the simplest example
of a sum of independent random variables: a series with real coefficients modulated by random
signs. This discussion illustrates some new phenomena that arise when we try to translate scalar
tail bounds to the matrix setting.
Consider a finite sequence $\{a_k\}$ of real numbers and a finite sequence $\{\varepsilon_k\}$ of independent Rademacher variables¹. A classical result, due to Bernstein, shows that
$$\mathbb{P}\Big\{ \sum_k \varepsilon_k a_k \geq t \Big\} \leq e^{-t^2/2\sigma^2} \quad\text{where}\quad \sigma^2 = \sum_k a_k^2. \tag{2.1}$$
In words, a real Rademacher series exhibits normal concentration with variance equal to the sum
of the squared coefficients. The central limit theorem guarantees that there are Rademacher series
where this estimate is essentially sharp.
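As a quick numerical sanity check, the bound (2.1) is easy to verify by Monte Carlo; the coefficients and deviation level below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
a = np.array([1.0, 2.0, 0.5, 1.5])   # arbitrary fixed real coefficients
sigma2 = np.sum(a**2)                # variance parameter from (2.1)
t = 4.0

# Monte Carlo estimate of P{ sum_k eps_k a_k >= t } over random sign patterns.
eps = rng.choice([-1.0, 1.0], size=(200_000, a.size))
empirical = np.mean(eps @ a >= t)

# The bound e^{-t^2 / (2 sigma^2)} should dominate the empirical tail.
bound = np.exp(-t**2 / (2 * sigma2))
assert empirical <= bound
```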
What is the correct generalization of (2.1) to random matrices? The approach of Ahlswede and
Winter [AW02] suggests the bound
$$\mathbb{P}\Big\{ \lambda_{\max}\Big( \sum_k \varepsilon_k A_k \Big) \geq t \Big\} \leq d \cdot e^{-t^2/2\sigma^2} \quad\text{where}\quad \sigma^2 = \sum_k \big\| A_k^2 \big\|. \tag{2.2}$$
The symbol $\|\cdot\|$ denotes the usual norm for operators on a Hilbert space, which returns the largest
singular value of its argument. Although the statement (2.2) identifies a plausible generalization for
the variance, this result can be improved dramatically in most cases. Indeed, a matrix Rademacher
series satisfies a fundamentally stronger tail bound.
Theorem 2.1 (Matrix Rademacher and Gaussian Series). Consider a finite sequence $\{A_k\}$ of fixed self-adjoint matrices with dimension $d$, and let $\{\varepsilon_k\}$ be a sequence of independent Rademacher variables. Compute the norm of the sum of squared coefficient matrices:
$$\sigma^2 = \Big\| \sum_k A_k^2 \Big\|. \tag{2.3}$$
For all $t \geq 0$,
$$\mathbb{P}\Big\{ \lambda_{\max}\Big( \sum_k \varepsilon_k A_k \Big) \geq t \Big\} \leq d \cdot e^{-t^2/2\sigma^2}. \tag{2.4}$$
In particular,
$$\mathbb{P}\Big\{ \Big\| \sum_k \varepsilon_k A_k \Big\| \geq t \Big\} \leq 2d \cdot e^{-t^2/2\sigma^2}. \tag{2.5}$$
The same bounds hold when we replace $\{\varepsilon_k\}$ by a sequence of independent, standard normal random variables.
¹A Rademacher random variable is uniformly distributed on $\{\pm 1\}$.
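A small simulation, sketched below with arbitrary choices of dimension, number of summands, and coefficient matrices, illustrates the bound (2.4); it is a numerical sanity check, not part of any proof:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 5, 20

# Fixed self-adjoint coefficient matrices: drawn once, then held fixed.
A = rng.standard_normal((n, d, d))
A = (A + A.transpose(0, 2, 1)) / 2

# Variance parameter (2.3): spectral norm of the sum of squared coefficients.
sigma2 = np.linalg.norm(np.sum(A @ A, axis=0), 2)
t = 3.5 * np.sqrt(sigma2)

# Empirical tail probability of the maximum eigenvalue over random signs.
trials = 5_000
hits = 0
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=n)
    S = np.einsum('k,kij->ij', eps, A)
    if np.linalg.eigvalsh(S)[-1] >= t:
        hits += 1
empirical = hits / trials

# Right-hand side of (2.4).
bound = d * np.exp(-t**2 / (2 * sigma2))
assert empirical <= bound
```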
When the dimension d = 1, the bound (2.4) reduces to the classical result (2.1). Of course, one
may still wonder whether the formula (2.3) for the variance is sharp and whether the dimensional
dependence is necessary. Remarks 2.2, 2.3, and 2.4 demonstrate that Theorem 2.1 cannot be
improved without changing its form. A casual reader may bypass this discussion without loss of
continuity.
The technology required to prove Theorem 2.1 has been available for some time now. One
argument applies sharp noncommutative Khintchine inequalities, [Buc01, Thm. 5] and [Buc05,
Thm. 5], to bound the moment generating function of the maximum eigenvalue of the random sum.
Very recently, Oliveira has developed a different approach [Oli10b, Lem. 2] using a clever variation
of Ahlswede and Winter’s techniques. We present our proof in Section 7.
Remark 2.2. The matrix variance $\sigma^2$ given by (2.3) is truly the correct quantity for controlling large deviations of a matrix Gaussian series. Indeed, it follows from general principles [LT91, Cor. 3.2] that
$$\lim_{t\to\infty} \frac{1}{t^2} \log \mathbb{P}\Big\{ \Big\| \sum_k \gamma_k A_k \Big\| \geq t \Big\} = -\frac{1}{2\sigma^2},$$
where $\{\gamma_k\}$ is a sequence of independent, standard normal variables. By the (scalar) central limit theorem, we can construct Rademacher series that exhibit essentially the same large-deviation behavior by repeating each matrix $A_k$ multiple times. (Of course, a finite Rademacher series is almost surely bounded!)
In contrast to a Gaussian series, a Rademacher series can have a constant operator norm. Nevertheless, the matrix variance in (2.3) always provides a lower bound for the supremal norm of the series:
$$\sigma \leq \sup_{\varepsilon} \Big\| \sum_k \varepsilon_k A_k \Big\|.$$
This fact follows easily from the statement of the noncommutative Khintchine inequality in [Rud99, Sec. 3]. A simple example shows that the lower bound is sharp. Let $E_{ij}$ be the matrix with a unit entry in the $(i, j)$ position and zeros elsewhere, and consider the Rademacher series with coefficients $A_k = E_{kk}$ for $k = 1, 2, \dots, d$. This example also demonstrates that the bound (2.2) is fundamentally worse than Theorem 2.1.
Remark 2.3. In general, we cannot remove the factor d from the probability bound in Theorem 2.1.
Consider the Gaussian series
$$\Big\| \sum_{k=1}^{d} \gamma_k E_{kk} \Big\| = \max_k |\gamma_k| \geq c \sqrt{\log d} \quad \text{with high probability}.$$
Since the variance parameter $\sigma^2 = 1$, Theorem 2.1 yields
$$\mathbb{P}\Big\{ \Big\| \sum_{k=1}^{d} \gamma_k E_{kk} \Big\| \geq t \Big\} \leq d \cdot e^{-t^2/2}.$$
We need the factor $d$ to ensure that the probability bound does not become effective until $t \geq \sqrt{2 \log d}$. The dimensional factor is also necessary in the tail bound for Rademacher series because of the central limit theorem.
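The claim that $\max_k |\gamma_k| \gtrsim \sqrt{\log d}$ with high probability is easy to check empirically; the dimension and trial count below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
d, trials = 1000, 2000

# The norm of the diagonal series sum_k gamma_k E_kk is exactly max_k |gamma_k|.
norms = np.max(np.abs(rng.standard_normal((trials, d))), axis=1)

# The typical norm exceeds sqrt(log d), so a bound that becomes effective
# only at t >= sqrt(2 log d) cannot be dimension-free.
assert np.median(norms) >= np.sqrt(np.log(d))
```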
Remark 2.4. The dimensional dependence does not appear in standard bounds for Rademacher
series in Banach space because they concern the deviation of the norm of the sum above its mean
value. For example, Ledoux [Led96, Eqn. (1.9)] proves that
$$\mathbb{P}\Big\{ \Big\| \sum_k \varepsilon_k A_k \Big\| \geq \mathbb{E}\, \Big\| \sum_k \varepsilon_k A_k \Big\| + t \Big\} \leq e^{-t^2/8\sigma^2},$$
where $\sigma^2$ is given by (2.3). Unfortunately, this formula provides no information about the size of the expectation. In contrast, we can always bound the expectation by integrating (2.5), although the estimate may not be sharp.
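For instance, one standard way to carry out this integration, sketched here rather than quoted from any source, splits the tail integral at a level $u > 0$:

```latex
\mathbb{E}\, \Big\| \sum_k \varepsilon_k A_k \Big\|
  = \int_0^\infty \mathbb{P}\Big\{ \Big\| \sum_k \varepsilon_k A_k \Big\| \geq t \Big\} \, \mathrm{d}t
  \leq u + \int_u^\infty 2d \cdot e^{-t^2/2\sigma^2} \, \mathrm{d}t
  \leq u + \frac{2d\,\sigma^2}{u}\, e^{-u^2/2\sigma^2}.
```

Choosing $u = \sigma \sqrt{2 \log 2d}$ makes the exponential equal $1/(2d)$ and yields $\mathbb{E}\, \| \sum_k \varepsilon_k A_k \| \leq \sigma \big( \sqrt{2 \log 2d} + (2 \log 2d)^{-1/2} \big)$, which matches the typical size of the maximum up to the lower-order additive term.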