arXiv:1506.04711v2 [math.PR] 16 Oct 2015
THE EXPECTED NORM OF A SUM OF INDEPENDENT RANDOM MATRICES:
AN ELEMENTARY APPROACH
JOEL A. TROPP
ABSTRACT. In contemporary applied and computational mathematics, a frequent challenge is to bound the
expectation of the spectral norm of a sum of independent random matrices. This quantity is controlled by
the norm of the expected square of the random matrix and the expectation of the maximum squared norm
achieved by one of the summands; there is also a weak dependence on the dimension of the random matrix.
The purpose of this paper is to give a complete, elementary proof of this important, but underappreciated,
inequality.
1. MOTIVATION
Over the last decade, random matrices have become ubiquitous in applied and computational mathematics. As this trend accelerates, more and more researchers must confront random matrices as part of their work. Classical random matrix theory can be difficult to use, and it is often silent about the questions that come up in modern applications. As a consequence, it has become imperative to develop and disseminate new tools that are easy to use and that apply to a wide range of random matrices.
1.1. Matrix Concentration Inequalities. Matrix concentration inequalities are among the most popular of these new methods. For a random matrix Z with appropriate structure, these results use simple parameters associated with the random matrix to provide bounds of the form
$$\mathbb{E}\,\|Z - \mathbb{E} Z\| \le \cdots \quad\text{and}\quad \mathbb{P}\,\bigl\{ \|Z - \mathbb{E} Z\| \ge t \bigr\} \le \cdots$$
where ‖·‖ denotes the spectral norm, also known as the ℓ_2 operator norm. These tools have already found a place in a huge number of mathematical research fields, including numerical linear algebra [Tro11], numerical analysis [MB14], uncertainty quantification [CG14], statistics [Kol11], econometrics [CC13], approximation theory [CDL13], sampling theory [BG13], machine learning [DKC13, LPSS+14], learning theory [FSV12, MKR12], mathematical signal processing [CBSW14], optimization [CSW12], computer graphics and vision [CGH14], quantum information theory [Hol12], the theory of algorithms [HO14, CKM+14], and combinatorics [Oli10].
These references are chosen more or less at random from a long menu of possibilities. See the monograph [Tro15a] for an overview of the main results on matrix concentration, many detailed applications, and additional background references.
Date: 15 June 2015.
2010 Mathematics Subject Classification. Primary: 60B20. Secondary: 60F10, 60G50, 60G42.
Key words and phrases. Probability inequality; random matrix; sum of independent random variables.

1.2. The Expected Norm. The purpose of this paper is to provide a complete proof of the following important, but underappreciated, theorem. This result is adapted from [CGT12, Thm. A.1].
Theorem I (The Expected Norm of an Independent Sum of Random Matrices). Consider an independent family {S_1, ..., S_n} of random d_1 × d_2 complex-valued matrices with E S_i = 0 for each index i, and define
$$Z := \sum_{i=1}^{n} S_i. \tag{1.1}$$
Introduce the matrix variance parameter
$$v(Z) := \max\Bigl\{ \bigl\| \mathbb{E}[Z Z^*] \bigr\|,\ \bigl\| \mathbb{E}[Z^* Z] \bigr\| \Bigr\}
       = \max\Bigl\{ \Bigl\| \sum\nolimits_i \mathbb{E}[S_i S_i^*] \Bigr\|,\ \Bigl\| \sum\nolimits_i \mathbb{E}[S_i^* S_i] \Bigr\| \Bigr\} \tag{1.2}$$
and the large-deviation parameter
$$L := \bigl( \mathbb{E} \max_i \|S_i\|^2 \bigr)^{1/2}. \tag{1.3}$$
Define the dimensional constant
$$C(d) := C(d_1, d_2) := 4 \bigl( 1 + 2 \lceil \log(d_1 + d_2) \rceil \bigr). \tag{1.4}$$
Then we have the matching estimates
$$\sqrt{c \cdot v(Z)} + c \cdot L \ \le\ \bigl( \mathbb{E}\|Z\|^2 \bigr)^{1/2} \ \le\ \sqrt{C(d) \cdot v(Z)} + C(d) \cdot L. \tag{1.5}$$
In the lower inequality, we can take c := 1/4. The symbol ‖·‖ denotes the ℓ_2 operator norm, also known as the spectral norm, and * refers to the conjugate transpose operation. The map ⌈·⌉ returns the smallest integer that exceeds its argument.
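To make the parameters concrete, here is a small numerical sketch in Python/NumPy (the dimensions, the Gaussian model for the summands, and the sample sizes are arbitrary illustrative choices, not part of the theorem). It computes v(Z) and C(d), estimates L by simulation, and checks the two sides of (1.5).
```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, n = 5, 7, 50      # illustrative dimensions and number of summands
sigma = 0.3               # illustrative scale of the Gaussian entries

# Each summand S_i has i.i.d. N(0, sigma^2) entries, so E S_i = 0,
# E[S_i S_i*] = sigma^2 * d2 * I_{d1} and E[S_i* S_i] = sigma^2 * d1 * I_{d2}.
v = n * sigma**2 * max(d1, d2)                      # variance parameter v(Z), exact

C = 4 * (1 + 2 * np.ceil(np.log(d1 + d2)))          # dimensional constant C(d) from (1.4)

# Monte Carlo estimates of L = (E max_i ||S_i||^2)^{1/2} and of (E ||Z||^2)^{1/2}.
trials = 2000
max_sq, norm_sq = np.empty(trials), np.empty(trials)
for t in range(trials):
    S = sigma * rng.standard_normal((n, d1, d2))
    spec = np.linalg.norm(S, ord=2, axis=(1, 2))    # spectral norms of the n summands
    max_sq[t] = spec.max() ** 2
    norm_sq[t] = np.linalg.norm(S.sum(axis=0), ord=2) ** 2
L = np.sqrt(max_sq.mean())
middle = np.sqrt(norm_sq.mean())

lower = np.sqrt(0.25 * v) + 0.25 * L                # left-hand side of (1.5) with c = 1/4
upper = np.sqrt(C * v) + C * L                      # right-hand side of (1.5)
print(f"{lower:.2f} <= {middle:.2f} <= {upper:.2f}")
```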
The proof of this result occupies the bulk of this paper. Most of the page count is attributed to a detailed presentation of the required background material from linear algebra and probability. We have based the argument on the most elementary considerations possible, and we have tried to make the work self-contained. Once the reader has digested these ideas, the related, but more sophisticated, approach in the paper [MJC+14] should be accessible.
1.3. Discussion. Before we continue, some remarks about Theorem I are in order. First, although it may seem restrictive to focus on independent sums, as in (1.1), this model captures an enormous number of useful examples. See the monograph [Tro15a] for justification.
We have chosen the term variance parameter because the quantity (1.2) is a direct generalization of the variance of a scalar random variable. The passage from the first formula to the second formula in (1.2) is an immediate consequence of the assumption that the summands S_i are independent and have zero mean (see Section 5). We use the term large-deviation parameter because the quantity (1.3) reflects the part of the expected norm of the random matrix that is attributable to one of the summands taking an unusually large value. In practice, both parameters are easy to compute using matrix arithmetic and some basic probabilistic considerations.
In applications, it is common that we need high-probability bounds on the norm of a random matrix. Typically, the bigger challenge is to estimate the expectation of the norm, which is what Theorem I achieves. Once we have a bound for the expectation, we can use scalar concentration inequalities, such as [BLM13, Thm. 6.10], to obtain high-probability bounds on the deviation between the norm and its mean value.
We have stated Theorem I as a bound on the second moment of ‖Z‖ because this is the most natural form of the result. Equivalent bounds hold for the first moment:
$$\sqrt{c' \cdot v(Z)} + c' \cdot L \ \le\ \mathbb{E}\|Z\| \ \le\ \sqrt{C(d) \cdot v(Z)} + C(d) \cdot L.$$
We can take c' = 1/8. The upper bound follows easily from (1.5) and Jensen's inequality. The lower bound requires the Khintchine–Kahane inequality [LO94].
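To spell out the upper bound: since the function t ↦ t² is convex, Jensen's inequality gives E‖Z‖ ≤ (E‖Z‖²)^{1/2}, and the upper half of (1.5) then delivers
$$\mathbb{E}\|Z\| \ \le\ \bigl( \mathbb{E}\|Z\|^2 \bigr)^{1/2} \ \le\ \sqrt{C(d) \cdot v(Z)} + C(d) \cdot L.$$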

Observe that the lower and upper estimates in (1.5) differ only by the factor C(d). As a consequence, the lower bound has no explicit dimensional dependence, while the upper bound has only a weak dependence on the dimension. Under the assumptions of the theorem, it is not possible to make substantial improvements to either the lower bound or the upper bound. Section 7 provides examples that support this claim.
In the theory of matrix concentration, one of the major challenges is to understand what properties of the random matrix Z allow us to remove the dimensional factor C(d) from the estimate (1.5). This question is largely open, but the recent papers [Oli13, BH14, Tro15b] make some progress.
1.4. The Uncentered Case. Although Theorem I concerns a centered random matrix, it can also be used to study a general random matrix. The following result is an immediate corollary of Theorem I.
Theorem II. Consider an independent family {S_1, ..., S_n} of random d_1 × d_2 complex-valued matrices, not necessarily centered. Define
$$R := \sum_{i=1}^{n} S_i.$$
Introduce the variance parameter
$$v(R) := \max\Bigl\{ \bigl\| \mathbb{E}\bigl[(R - \mathbb{E} R)(R - \mathbb{E} R)^*\bigr] \bigr\|,\ \bigl\| \mathbb{E}\bigl[(R - \mathbb{E} R)^*(R - \mathbb{E} R)\bigr] \bigr\| \Bigr\}
       = \max\Bigl\{ \Bigl\| \sum\nolimits_{i=1}^{n} \mathbb{E}\bigl[(S_i - \mathbb{E} S_i)(S_i - \mathbb{E} S_i)^*\bigr] \Bigr\|,\ \Bigl\| \sum\nolimits_{i=1}^{n} \mathbb{E}\bigl[(S_i - \mathbb{E} S_i)^*(S_i - \mathbb{E} S_i)\bigr] \Bigr\| \Bigr\}$$
and the large-deviation parameter
$$L^2 := \mathbb{E} \max_i \|S_i - \mathbb{E} S_i\|^2.$$
Then we have the matching estimates
$$\sqrt{c \cdot v(R)} + c \cdot L \ \le\ \bigl( \mathbb{E}\|R - \mathbb{E} R\|^2 \bigr)^{1/2} \ \le\ \sqrt{C(d) \cdot v(R)} + C(d) \cdot L.$$
We can take c = 1/4, and the dimensional constant C(d) is defined in (1.4).
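To see why Theorem II is an immediate corollary of Theorem I, observe that the centered summands S_i − E S_i are independent and have mean zero, and their sum is the centered matrix
$$R - \mathbb{E} R = \sum_{i=1}^{n} (S_i - \mathbb{E} S_i).$$
Applying Theorem I to this sum produces exactly the parameters v(R) and L displayed above.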
Theorem II can also be used to study ‖R‖ by combining it with the estimates
$$\|\mathbb{E} R\| - \bigl( \mathbb{E}\|R - \mathbb{E} R\|^2 \bigr)^{1/2} \ \le\ \bigl( \mathbb{E}\|R\|^2 \bigr)^{1/2} \ \le\ \|\mathbb{E} R\| + \bigl( \mathbb{E}\|R - \mathbb{E} R\|^2 \bigr)^{1/2}.$$
These bounds follow from the triangle inequality for the spectral norm.
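In more detail, the pointwise triangle-inequality bounds ‖R‖ ≤ ‖E R‖ + ‖R − E R‖ and ‖E R‖ ≤ ‖R‖ + ‖R − E R‖, combined with the triangle (Minkowski) inequality for the L² norm on the underlying probability space, give
$$\bigl( \mathbb{E}\|R\|^2 \bigr)^{1/2} \le \|\mathbb{E} R\| + \bigl( \mathbb{E}\|R - \mathbb{E} R\|^2 \bigr)^{1/2}
\quad\text{and}\quad
\|\mathbb{E} R\| \le \bigl( \mathbb{E}\|R\|^2 \bigr)^{1/2} + \bigl( \mathbb{E}\|R - \mathbb{E} R\|^2 \bigr)^{1/2},$$
and rearranging the second bound yields the lower estimate.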
It is productive to interpret Theorem II as a perturbation result because it describes how far the random matrix R deviates from its mean E R. We can derive many useful consequences from a bound of the form
$$\bigl( \mathbb{E}\|R - \mathbb{E} R\|^2 \bigr)^{1/2} \le \cdots$$
This estimate shows that, on average, all of the singular values of R are close to the corresponding singular values of E R. It also implies that, on average, the singular vectors of R are close to the corresponding singular vectors of E R, provided that the associated singular values are isolated. Furthermore, we discover that, on average, each linear functional tr[C R] is uniformly close to E tr[C R] for each fixed matrix C ∈ M_{d_2 × d_1} with bounded Schatten 1-norm ‖C‖_{S_1} ≤ 1.
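The last claim relies on the duality between the Schatten 1-norm and the spectral norm. For each fixed matrix C with ‖C‖_{S_1} ≤ 1, linearity of the trace and of the expectation gives
$$\bigl| \operatorname{tr}[C R] - \mathbb{E} \operatorname{tr}[C R] \bigr|
  = \bigl| \operatorname{tr}\bigl[ C (R - \mathbb{E} R) \bigr] \bigr|
  \le \|C\|_{S_1} \, \|R - \mathbb{E} R\|
  \le \|R - \mathbb{E} R\|,$$
so a single bound on the expected deviation controls all of these linear functionals at once.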
1.5. History. Theorem I is not new. A somewhat weaker version of the upper bound appeared in Rudelson's work [Rud99, Thm. 1]; see also [RV07, Thm. 3.1] and [Tro08, Sec. 9]. The first explicit statement of the upper bound appeared in [CGT12, Thm. A.1]. All of these results depend on the noncommutative Khintchine inequality [LP86, Pis98, Buc01]. In our approach, the main innovation is a particularly easy proof of a Khintchine-type inequality for matrices, patterned after [MJC+14, Cor. 7.3] and [Tro15b, Thm. 8.1].
The ideas behind the proof of the lower bound in Theorem I are older. This estimate depends on generic considerations about the behavior of a sum of independent random variables in a Banach space. These techniques are explained in detail in [LT11, Ch. 6]. Our presentation expands on a proof sketch that appears in the monograph [Tro15a, Secs. 5.1.2 and 6.1.2].

1.6. Target Audience. This paper is intended for students and researchers who want to develop a detailed understanding of the foundations of matrix concentration. The preparation required is modest.
Basic Convexity. Some simple ideas from convexity play a role, notably the concept of a convex function and Jensen's inequality.
Intermediate Linear Algebra. The requirements from linear algebra are more substantial. The reader should be familiar with the spectral theorem for Hermitian (or symmetric) matrices, Rayleigh's variational principle, the trace of a matrix, and the spectral norm. The paper includes reminders about this material. The paper elaborates on some less familiar ideas, including inequalities for the trace and the spectral norm.
Intermediate Probability. The paper demands some comfort with probability. The most important concepts are expectation and the elementary theory of conditional expectation. We develop the other key ideas, including the notion of symmetrization.
Although many readers will find the background material unnecessary, it is hard to locate these ideas in one place, and we prefer to make the paper self-contained. In any case, we provide detailed cross-references so that the reader may dive into the proofs of the main results without wading through the shallower part of the paper.
1.7. Roadmap. Section 2 and Section 3 contain the background material from linear algebra and probability. To prove the upper bound in Theorem I, the key step is to establish the result for the special case of a sum of fixed matrices, each modulated by a random sign. This result appears in Section 4. In Section 5, we exploit this result to obtain the upper bound in (1.5). In Section 6, we present the easier proof of the lower bound in (1.5). Finally, Section 7 shows that it is not possible to improve (1.5) substantially.
2. LINEAR ALGEBRA BACKGROUND
Our aim is to make this paper as accessible as possible. To that end, this section presents some background material from linear algebra. Good references include [Hal74, Bha97, HJ13]. We also assume some familiarity with basic ideas from the theory of convexity, which may be found in the books [Lue69, Roc97, Bar02, BV04].
2.1. Convexity. Let V be a finite-dimensional linear space. A subset E ⊂ V is convex when
x, y ∈ E implies τ·x + (1 − τ)·y ∈ E for each τ ∈ [0, 1].
Let E be a convex subset of a linear space V. A function f : E → R is convex if
$$f\bigl( \tau x + (1 - \tau) y \bigr) \le \tau \cdot f(x) + (1 - \tau) \cdot f(y) \quad\text{for all } \tau \in [0,1] \text{ and all } x, y \in E. \tag{2.1}$$
We say that f is concave when −f is convex.
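For example, every norm ‖·‖ on V is convex: homogeneity and the triangle inequality give
$$\| \tau x + (1 - \tau) y \| \le \tau \|x\| + (1 - \tau) \|y\| \quad\text{for all } \tau \in [0,1] \text{ and all } x, y \in V.$$
In particular, this applies to the ℓ_2 norm introduced in the next subsection.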
2.2. Vector Basics. Let C^d be the complex linear space of d-dimensional complex vectors, equipped with the usual componentwise addition and scalar multiplication. The ℓ_2 norm ‖·‖ is defined on C^d via the expression
$$\|x\|^2 := x^* x \quad\text{for each } x \in \mathbb{C}^d.$$
The symbol * denotes the conjugate transpose of a vector. Recall that the ℓ_2 norm is a convex function.
A family {u_1, ..., u_d} ⊂ C^d is called an orthonormal basis if it satisfies the relations
$$u_i^* u_j = \begin{cases} 1, & i = j \\ 0, & i \ne j. \end{cases}$$
The orthonormal basis also has the property
$$\sum_{i=1}^{d} u_i u_i^* = \mathrm{I}_d$$
where I_d is the d × d identity matrix.
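As a minimal numerical check of these relations (a Python/NumPy sketch; the dimension and the random construction of the basis are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
# The columns of a unitary matrix form an orthonormal basis of C^d.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d)))
u = [Q[:, i] for i in range(d)]

# u_i^* u_j equals 1 when i = j and 0 otherwise.
gram = np.array([[np.vdot(u[i], u[j]) for j in range(d)] for i in range(d)])
assert np.allclose(gram, np.eye(d))

# Resolution of the identity: sum_i u_i u_i^* = I_d.
assert np.allclose(sum(np.outer(ui, ui.conj()) for ui in u), np.eye(d))
```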

2.3. Matrix Basics. A matrix is a rectangular array of complex numbers. Addition and multiplication by a complex scalar are defined componentwise, and we can multiply two matrices with compatible dimensions. We write M_{d_1 × d_2} for the complex linear space of d_1 × d_2 matrices. The symbol * also refers to the conjugate transpose operation on matrices.
A square matrix H is Hermitian when H = H*. Hermitian matrices are sometimes called conjugate symmetric. We introduce the set of d × d Hermitian matrices:
$$\mathbb{H}_d := \bigl\{ H \in \mathbb{M}_{d \times d} : H = H^* \bigr\}.$$
Note that the set H_d is a linear space over the real field.
An Hermitian matrix A ∈ H_d is positive semidefinite when
$$u^* A u \ge 0 \quad\text{for each } u \in \mathbb{C}^d.$$
It is convenient to use the notation A ≼ H to mean that H − A is positive semidefinite. In particular, the relation 0 ≼ H is equivalent to H being positive semidefinite. Observe that
0 ≼ A and 0 ≼ H implies 0 ≼ α·(A + H) for each α ≥ 0.
In other words, addition and nonnegative scaling preserve the positive-semidefinite property.
For every matrix B, both of its squares B B* and B* B are Hermitian and positive semidefinite.
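A quick numerical illustration of the last statement (a Python/NumPy sketch with an arbitrary complex matrix B):
```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 5)) + 1j * rng.standard_normal((3, 5))   # arbitrary 3 x 5 matrix

for A in (B @ B.conj().T, B.conj().T @ B):        # the two squares BB* and B*B
    assert np.allclose(A, A.conj().T)             # Hermitian
    assert np.linalg.eigvalsh(A).min() >= -1e-10  # nonnegative eigenvalues, up to roundoff
```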
2.4. Basic Spectral Theory. Each Hermitian matrix H ∈ H_d can be expressed in the form
$$H = \sum_{i=1}^{d} \lambda_i\, u_i u_i^* \tag{2.2}$$
where the λ_i are uniquely determined real numbers, called eigenvalues, and {u_i} is an orthonormal basis for C^d. The representation (2.2) is called an eigenvalue decomposition.
An Hermitian matrix H is positive semidefinite if and only if its eigenvalues λ_i are all nonnegative. Indeed, using the eigenvalue decomposition (2.2), we see that
$$u^* H u = \sum_{i=1}^{d} \lambda_i \cdot u^* u_i u_i^* u = \sum_{i=1}^{d} \lambda_i \cdot | u^* u_i |^2.$$
To verify the forward direction, select u = u_j for each index j. The reverse direction should be obvious.
We define a monomial function of an Hermitian matrix H ∈ H_d by repeated multiplication:
$$H^0 = \mathrm{I}_d, \quad H^1 = H, \quad H^2 = H \cdot H, \quad H^3 = H \cdot H^2, \quad \text{etc.}$$
For each nonnegative integer r, it is not hard to check that
$$H = \sum_{i=1}^{d} \lambda_i\, u_i u_i^* \quad\text{implies}\quad H^r = \sum_{i=1}^{d} \lambda_i^r\, u_i u_i^*. \tag{2.3}$$
In particular, H^{2p} is positive semidefinite for each nonnegative integer p.
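The identity (2.3) is easy to confirm numerically; the following Python/NumPy sketch compares the spectral formula with repeated multiplication for an arbitrary Hermitian matrix and exponent:
```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 4, 3
X = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
H = (X + X.conj().T) / 2                       # an arbitrary Hermitian matrix

lam, U = np.linalg.eigh(H)                     # eigenvalue decomposition H = sum_i lam_i u_i u_i^*
H_r_spectral = (U * lam**r) @ U.conj().T       # sum_i lam_i^r u_i u_i^*
H_r_product = np.linalg.matrix_power(H, r)     # H . H . H
assert np.allclose(H_r_spectral, H_r_product)
```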
2.5. Rayleigh's Variational Principle. The Rayleigh principle is an attractive expression for the maximum eigenvalue λ_max(H) of an Hermitian matrix H ∈ H_d. This result states that
$$\lambda_{\max}(H) = \max_{\|u\|=1} u^* H u. \tag{2.4}$$
The maximum takes place over all unit-norm vectors u ∈ C^d. The identity (2.4) follows from the Lagrange multiplier theorem and the existence of the eigenvalue decomposition (2.2). Similarly, the minimum eigenvalue λ_min(H) satisfies
$$\lambda_{\min}(H) = \min_{\|u\|=1} u^* H u. \tag{2.5}$$
We can obtain (2.5) by applying (2.4) to −H.
Rayleigh's principle implies that order relations for positive-semidefinite matrices lead to order relations for their eigenvalues.
Fact 2.1 (Monotonicity). Let A, H ∈ H_d be Hermitian matrices. Then
A ≼ H implies λ_max(A) ≤ λ_max(H).
