arXiv:1506.04711v2 [math.PR] 16 Oct 2015
THE EXPECTED NORM OF A SUM OF INDEPENDENT RANDOM MATRICES:
AN ELEMENTARY APPROACH
JOEL A. TROPP
ABSTRACT. In contemporary applied and computational mathematics, a frequent challenge is to bound the
expectation of the spectral norm of a sum of independent random matrices. This quantity is controlled by
the norm of the expected square of the random matrix and the expectation of the maximum squared norm
achieved by one of the summands; there is also a weak dependence on the dimension of the random matrix.
The purpose of this paper is to give a complete, elementary proof of this important, but underappreciated,
inequality.
1. MOTIVATION
Over the last decade, random matrices have become ubiquitous in applied and computational mathematics. As this trend accelerates, more and more researchers must confront random matrices as part of their work. Classical random matrix theory can be difficult to use, and it is often silent about the questions that come up in modern applications. As a consequence, it has become imperative to develop and disseminate new tools that are easy to use and that apply to a wide range of random matrices.
1.1. Matrix Concentration Inequalities. Matrix concentration inequalities are among the most popular of these new methods. For a random matrix Z with appropriate structure, these results use simple parameters associated with the random matrix to provide bounds of the form
$$\mathbb{E}\,\|Z - \mathbb{E} Z\| \le \cdots \quad\text{and}\quad \mathbb{P}\,\bigl\{ \|Z - \mathbb{E} Z\| \ge t \bigr\} \le \cdots$$
where ‖·‖ denotes the spectral norm, also known as the ℓ_2 operator norm. These tools have already found a place in a huge number of mathematical research fields, including numerical linear algebra [Tro11], numerical analysis [MB14], uncertainty quantification [CG14], statistics [Kol11], econometrics [CC13], approximation theory [CDL13], sampling theory [BG13], machine learning [DKC13, LPSS+14], learning theory [FSV12, MKR12], mathematical signal processing [CBSW14], optimization [CSW12], computer graphics and vision [CGH14], quantum information theory [Hol12], the theory of algorithms [HO14, CKM+14], and combinatorics [Oli10].
These references are chosen more or less at random from a long menu of possibilities. See the monograph [Tro15a] for an overview of the main results on matrix concentration, many detailed applications, and additional background references.
Date: 15 June 2015.
2010 Mathematics Subject Classification. Primary: 60B20. Secondary: 60F10, 60G50, 60G42.
Key words and phrases. Probability inequality; random matrix; sum of independent random variables.

1.2. The Expected Norm. The purpose of this paper is to provide a complete proof of the following important, but underappreciated, theorem. This result is adapted from [CGT12, Thm. A.1].
Theorem I (The Expected Norm of an Independent Sum of Random Matrices). Consider an independent family {S_1, ..., S_n} of random d_1 × d_2 complex-valued matrices with E S_i = 0 for each index i, and define
$$Z := \sum_{i=1}^{n} S_i. \tag{1.1}$$
Introduce the matrix variance parameter
$$v(Z) := \max\Bigl\{ \bigl\| \mathbb{E}[Z Z^*] \bigr\|,\ \bigl\| \mathbb{E}[Z^* Z] \bigr\| \Bigr\}
       = \max\Bigl\{ \Bigl\| \sum\nolimits_i \mathbb{E}[S_i S_i^*] \Bigr\|,\ \Bigl\| \sum\nolimits_i \mathbb{E}[S_i^* S_i] \Bigr\| \Bigr\} \tag{1.2}$$
and the large-deviation parameter
$$L := \bigl( \mathbb{E} \max_i \|S_i\|^2 \bigr)^{1/2}. \tag{1.3}$$
Define the dimensional constant
$$C(d) := C(d_1, d_2) := 4 \bigl( 1 + 2 \lceil \log(d_1 + d_2) \rceil \bigr). \tag{1.4}$$
Then we have the matching estimates
$$\sqrt{c \cdot v(Z)} + c \cdot L \ \le\ \bigl( \mathbb{E}\|Z\|^2 \bigr)^{1/2} \ \le\ \sqrt{C(d) \cdot v(Z)} + C(d) \cdot L. \tag{1.5}$$
In the lower inequality, we can take c := 1/4. The symbol ‖·‖ denotes the ℓ_2 operator norm, also known as the spectral norm, and * refers to the conjugate transpose operation. The map ⌈·⌉ returns the smallest integer that exceeds its argument.
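To make the parameters concrete, here is a small numerical sketch in Python/NumPy (the dimensions, the Gaussian model for the summands, and the sample sizes are arbitrary illustrative choices, not part of the theorem). It computes v(Z) and C(d), estimates L by simulation, and checks the two sides of (1.5).
```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, n = 5, 7, 50      # illustrative dimensions and number of summands
sigma = 0.3               # illustrative scale of the Gaussian entries

# Each summand S_i has i.i.d. N(0, sigma^2) entries, so E S_i = 0,
# E[S_i S_i*] = sigma^2 * d2 * I_{d1} and E[S_i* S_i] = sigma^2 * d1 * I_{d2}.
v = n * sigma**2 * max(d1, d2)                      # variance parameter v(Z), exact

C = 4 * (1 + 2 * np.ceil(np.log(d1 + d2)))          # dimensional constant C(d) from (1.4)

# Monte Carlo estimates of L = (E max_i ||S_i||^2)^{1/2} and of (E ||Z||^2)^{1/2}.
trials = 2000
max_sq, norm_sq = np.empty(trials), np.empty(trials)
for t in range(trials):
    S = sigma * rng.standard_normal((n, d1, d2))
    spec = np.linalg.norm(S, ord=2, axis=(1, 2))    # spectral norms of the n summands
    max_sq[t] = spec.max() ** 2
    norm_sq[t] = np.linalg.norm(S.sum(axis=0), ord=2) ** 2
L = np.sqrt(max_sq.mean())
middle = np.sqrt(norm_sq.mean())

lower = np.sqrt(0.25 * v) + 0.25 * L                # left-hand side of (1.5) with c = 1/4
upper = np.sqrt(C * v) + C * L                      # right-hand side of (1.5)
print(f"{lower:.2f} <= {middle:.2f} <= {upper:.2f}")
```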
The proof of this result occupies the bulk of this paper. Most of the page count is attributed to a detailed presentation of the required background material from linear algebra and probability. We have based the argument on the most elementary considerations possible, and we have tried to make the work self-contained. Once the reader has digested these ideas, the related, but more sophisticated, approach in the paper [MJC+14] should be accessible.
1.3. Discussion. Before we continue, some remarks about Theorem I are in order. First, although it may seem restrictive to focus on independent sums, as in (1.1), this model captures an enormous number of useful examples. See the monograph [Tro15a] for justification.
We have chosen the term variance parameter because the quantity (1.2) is a direct generalization of the variance of a scalar random variable. The passage from the first formula to the second formula in (1.2) is an immediate consequence of the assumption that the summands S_i are independent and have zero mean (see Section 5). We use the term large-deviation parameter because the quantity (1.3) reflects the part of the expected norm of the random matrix that is attributable to one of the summands taking an unusually large value. In practice, both parameters are easy to compute using matrix arithmetic and some basic probabilistic considerations.
In applications, it is common that we need high-probability bounds on the norm of a random matrix. Typically, the bigger challenge is to estimate the expectation of the norm, which is what Theorem I achieves. Once we have a bound for the expectation, we can use scalar concentration inequalities, such as [BLM13, Thm. 6.10], to obtain high-probability bounds on the deviation between the norm and its mean value.
We have stated Theorem I as a bound on the second moment of ‖Z‖ because this is the most natural form of the result. Equivalent bounds hold for the first moment:
$$\sqrt{c' \cdot v(Z)} + c' \cdot L \ \le\ \mathbb{E}\|Z\| \ \le\ \sqrt{C(d) \cdot v(Z)} + C(d) \cdot L.$$
We can take c' = 1/8. The upper bound follows easily from (1.5) and Jensen's inequality. The lower bound requires the Khintchine–Kahane inequality [LO94].
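To spell out the upper bound: since the function t ↦ t² is convex, Jensen's inequality gives E‖Z‖ ≤ (E‖Z‖²)^{1/2}, and the upper half of (1.5) then delivers
$$\mathbb{E}\|Z\| \ \le\ \bigl( \mathbb{E}\|Z\|^2 \bigr)^{1/2} \ \le\ \sqrt{C(d) \cdot v(Z)} + C(d) \cdot L.$$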

Observe that the lower and upper estimates in (1.5) differ only by the factor C(d). As a consequence, the lower bound has no explicit dimensional dependence, while the upper bound has only a weak dependence on the dimension. Under the assumptions of the theorem, it is not possible to make substantial improvements to either the lower bound or the upper bound. Section 7 provides examples that support this claim.
In the theory of matrix concentration, one of the major challenges is to understand what properties of the random matrix Z allow us to remove the dimensional factor C(d) from the estimate (1.5). This question is largely open, but the recent papers [Oli13, BH14, Tro15b] make some progress.
1.4. The Uncentered Case. Although Theorem I concerns a centered random matrix, it can also be used to study a general random matrix. The following result is an immediate corollary of Theorem I.
Theorem II. Consider an independent family {S_1, ..., S_n} of random d_1 × d_2 complex-valued matrices, not necessarily centered. Define
$$R := \sum_{i=1}^{n} S_i.$$
Introduce the variance parameter
$$v(R) := \max\Bigl\{ \bigl\| \mathbb{E}\bigl[(R - \mathbb{E} R)(R - \mathbb{E} R)^*\bigr] \bigr\|,\ \bigl\| \mathbb{E}\bigl[(R - \mathbb{E} R)^*(R - \mathbb{E} R)\bigr] \bigr\| \Bigr\}
       = \max\Bigl\{ \Bigl\| \sum\nolimits_{i=1}^{n} \mathbb{E}\bigl[(S_i - \mathbb{E} S_i)(S_i - \mathbb{E} S_i)^*\bigr] \Bigr\|,\ \Bigl\| \sum\nolimits_{i=1}^{n} \mathbb{E}\bigl[(S_i - \mathbb{E} S_i)^*(S_i - \mathbb{E} S_i)\bigr] \Bigr\| \Bigr\}$$
and the large-deviation parameter
$$L^2 := \mathbb{E} \max_i \|S_i - \mathbb{E} S_i\|^2.$$
Then we have the matching estimates
$$\sqrt{c \cdot v(R)} + c \cdot L \ \le\ \bigl( \mathbb{E}\|R - \mathbb{E} R\|^2 \bigr)^{1/2} \ \le\ \sqrt{C(d) \cdot v(R)} + C(d) \cdot L.$$
We can take c = 1/4, and the dimensional constant C(d) is defined in (1.4).
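To see why Theorem II is an immediate corollary of Theorem I, observe that the centered summands S_i − E S_i are independent and have mean zero, and their sum is the centered matrix
$$R - \mathbb{E} R = \sum_{i=1}^{n} (S_i - \mathbb{E} S_i).$$
Applying Theorem I to this sum produces exactly the parameters v(R) and L displayed above.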
Theorem II can also be used to study ‖R‖ by combining it with the estimates
$$\|\mathbb{E} R\| - \bigl( \mathbb{E}\|R - \mathbb{E} R\|^2 \bigr)^{1/2} \ \le\ \bigl( \mathbb{E}\|R\|^2 \bigr)^{1/2} \ \le\ \|\mathbb{E} R\| + \bigl( \mathbb{E}\|R - \mathbb{E} R\|^2 \bigr)^{1/2}.$$
These bounds follow from the triangle inequality for the spectral norm.
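In more detail, the pointwise triangle-inequality bounds ‖R‖ ≤ ‖E R‖ + ‖R − E R‖ and ‖E R‖ ≤ ‖R‖ + ‖R − E R‖, combined with the triangle (Minkowski) inequality for the L² norm on the underlying probability space, give
$$\bigl( \mathbb{E}\|R\|^2 \bigr)^{1/2} \le \|\mathbb{E} R\| + \bigl( \mathbb{E}\|R - \mathbb{E} R\|^2 \bigr)^{1/2}
\quad\text{and}\quad
\|\mathbb{E} R\| \le \bigl( \mathbb{E}\|R\|^2 \bigr)^{1/2} + \bigl( \mathbb{E}\|R - \mathbb{E} R\|^2 \bigr)^{1/2},$$
and rearranging the second bound yields the lower estimate.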
It is productive to interpret Theorem II as a perturbation result because it describes how far the random matrix R deviates from its mean E R. We can derive many useful consequences from a bound of the form
$$\bigl( \mathbb{E}\|R - \mathbb{E} R\|^2 \bigr)^{1/2} \le \cdots$$
This estimate shows that, on average, all of the singular values of R are close to the corresponding singular values of E R. It also implies that, on average, the singular vectors of R are close to the corresponding singular vectors of E R, provided that the associated singular values are isolated. Furthermore, we discover that, on average, each linear functional tr[C R] is uniformly close to E tr[C R] for each fixed matrix C ∈ M_{d_2 × d_1} with bounded Schatten 1-norm ‖C‖_{S_1} ≤ 1.
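The last claim relies on the duality between the Schatten 1-norm and the spectral norm. For each fixed matrix C with ‖C‖_{S_1} ≤ 1, linearity of the trace and of the expectation gives
$$\bigl| \operatorname{tr}[C R] - \mathbb{E} \operatorname{tr}[C R] \bigr|
  = \bigl| \operatorname{tr}\bigl[ C (R - \mathbb{E} R) \bigr] \bigr|
  \le \|C\|_{S_1} \, \|R - \mathbb{E} R\|
  \le \|R - \mathbb{E} R\|,$$
so a single bound on the expected deviation controls all of these linear functionals at once.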
1.5. History. Theorem I is not new. A somewhat weaker version of the upper bound appeared in Rudelson's work [Rud99, Thm. 1]; see also [RV07, Thm. 3.1] and [Tro08, Sec. 9]. The first explicit statement of the upper bound appeared in [CGT12, Thm. A.1]. All of these results depend on the noncommutative Khintchine inequality [LP86, Pis98, Buc01]. In our approach, the main innovation is a particularly easy proof of a Khintchine-type inequality for matrices, patterned after [MJC+14, Cor. 7.3] and [Tro15b, Thm. 8.1].
The ideas behind the proof of the lower bound in Theorem I are older. This estimate depends on generic considerations about the behavior of a sum of independent random variables in a Banach space. These techniques are explained in detail in [LT11, Ch. 6]. Our presentation expands on a proof sketch that appears in the monograph [Tro15a, Secs. 5.1.2 and 6.1.2].

1.6. Target Audience. This paper is intended for students and researchers who want to develop a detailed understanding of the foundations of matrix concentration. The preparation required is modest.
Basic Convexity. Some simple ideas from convexity play a role, notably the concept of a convex function and Jensen's inequality.
Intermediate Linear Algebra. The requirements from linear algebra are more substantial. The reader should be familiar with the spectral theorem for Hermitian (or symmetric) matrices, Rayleigh's variational principle, the trace of a matrix, and the spectral norm. The paper includes reminders about this material. The paper elaborates on some less familiar ideas, including inequalities for the trace and the spectral norm.
Intermediate Probability. The paper demands some comfort with probability. The most important concepts are expectation and the elementary theory of conditional expectation. We develop the other key ideas, including the notion of symmetrization.
Although many readers will find the background material unnecessary, it is hard to locate these ideas in one place, and we prefer to make the paper self-contained. In any case, we provide detailed cross-references so that the reader may dive into the proofs of the main results without wading through the shallower part of the paper.
1.7. Roadmap. Section 2 and Section 3 contain the background material from linear algebra and probability. To prove the upper bound in Theorem I, the key step is to establish the result for the special case of a sum of fixed matrices, each modulated by a random sign. This result appears in Section 4. In Section 5, we exploit this result to obtain the upper bound in (1.5). In Section 6, we present the easier proof of the lower bound in (1.5). Finally, Section 7 shows that it is not possible to improve (1.5) substantially.
2. LINEAR ALGEBRA BACKGROUND
Our aim is to make this paper as accessible as possible. To that end, this section presents some background material from linear algebra. Good references include [Hal74, Bha97, HJ13]. We also assume some familiarity with basic ideas from the theory of convexity, which may be found in the books [Lue69, Roc97, Bar02, BV04].
2.1. Convexity. Let V be a finite-dimensional linear space. A subset E ⊂ V is convex when
x, y ∈ E implies τ·x + (1 − τ)·y ∈ E for each τ ∈ [0, 1].
Let E be a convex subset of a linear space V. A function f : E → R is convex if
$$f\bigl( \tau x + (1 - \tau) y \bigr) \le \tau \cdot f(x) + (1 - \tau) \cdot f(y) \quad\text{for all } \tau \in [0,1] \text{ and all } x, y \in E. \tag{2.1}$$
We say that f is concave when −f is convex.
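For example, every norm ‖·‖ on V is convex: homogeneity and the triangle inequality give
$$\| \tau x + (1 - \tau) y \| \le \tau \|x\| + (1 - \tau) \|y\| \quad\text{for all } \tau \in [0,1] \text{ and all } x, y \in V.$$
In particular, this applies to the ℓ_2 norm introduced in the next subsection.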
2.2. Vector Basics. Let C^d be the complex linear space of d-dimensional complex vectors, equipped with the usual componentwise addition and scalar multiplication. The ℓ_2 norm ‖·‖ is defined on C^d via the expression
$$\|x\|^2 := x^* x \quad\text{for each } x \in \mathbb{C}^d.$$
The symbol * denotes the conjugate transpose of a vector. Recall that the ℓ_2 norm is a convex function.
A family {u_1, ..., u_d} ⊂ C^d is called an orthonormal basis if it satisfies the relations
$$u_i^* u_j = \begin{cases} 1, & i = j \\ 0, & i \ne j. \end{cases}$$
The orthonormal basis also has the property
$$\sum_{i=1}^{d} u_i u_i^* = \mathrm{I}_d$$
where I_d is the d × d identity matrix.
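As a minimal numerical check of these relations (a Python/NumPy sketch; the dimension and the random construction of the basis are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
# The columns of a unitary matrix form an orthonormal basis of C^d.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d)))
u = [Q[:, i] for i in range(d)]

# u_i^* u_j equals 1 when i = j and 0 otherwise.
gram = np.array([[np.vdot(u[i], u[j]) for j in range(d)] for i in range(d)])
assert np.allclose(gram, np.eye(d))

# Resolution of the identity: sum_i u_i u_i^* = I_d.
assert np.allclose(sum(np.outer(ui, ui.conj()) for ui in u), np.eye(d))
```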

2.3. Matrix Basics. A matrix is a rectangular array of complex numbers. Addition and multiplication by a complex scalar are defined componentwise, and we can multiply two matrices with compatible dimensions. We write M_{d_1 × d_2} for the complex linear space of d_1 × d_2 matrices. The symbol * also refers to the conjugate transpose operation on matrices.
A square matrix H is Hermitian when H = H*. Hermitian matrices are sometimes called conjugate symmetric. We introduce the set of d × d Hermitian matrices:
$$\mathbb{H}_d := \bigl\{ H \in \mathbb{M}_{d \times d} : H = H^* \bigr\}.$$
Note that the set H_d is a linear space over the real field.
An Hermitian matrix A ∈ H_d is positive semidefinite when
$$u^* A u \ge 0 \quad\text{for each } u \in \mathbb{C}^d.$$
It is convenient to use the notation A ≼ H to mean that H − A is positive semidefinite. In particular, the relation 0 ≼ H is equivalent to H being positive semidefinite. Observe that
0 ≼ A and 0 ≼ H implies 0 ≼ α·(A + H) for each α ≥ 0.
In other words, addition and nonnegative scaling preserve the positive-semidefinite property.
For every matrix B, both of its squares B B* and B* B are Hermitian and positive semidefinite.
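A quick numerical illustration of the last statement (a Python/NumPy sketch with an arbitrary complex matrix B):
```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 5)) + 1j * rng.standard_normal((3, 5))   # arbitrary 3 x 5 matrix

for A in (B @ B.conj().T, B.conj().T @ B):        # the two squares BB* and B*B
    assert np.allclose(A, A.conj().T)             # Hermitian
    assert np.linalg.eigvalsh(A).min() >= -1e-10  # nonnegative eigenvalues, up to roundoff
```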
2.4. Basic Spectral Theory. Each Hermitian matrix H ∈ H_d can be expressed in the form
$$H = \sum_{i=1}^{d} \lambda_i\, u_i u_i^* \tag{2.2}$$
where the λ_i are uniquely determined real numbers, called eigenvalues, and {u_i} is an orthonormal basis for C^d. The representation (2.2) is called an eigenvalue decomposition.
An Hermitian matrix H is positive semidefinite if and only if its eigenvalues λ_i are all nonnegative. Indeed, using the eigenvalue decomposition (2.2), we see that
$$u^* H u = \sum_{i=1}^{d} \lambda_i \cdot u^* u_i u_i^* u = \sum_{i=1}^{d} \lambda_i \cdot | u^* u_i |^2.$$
To verify the forward direction, select u = u_j for each index j. The reverse direction should be obvious.
We define a monomial function of an Hermitian matrix H ∈ H_d by repeated multiplication:
$$H^0 = \mathrm{I}_d, \quad H^1 = H, \quad H^2 = H \cdot H, \quad H^3 = H \cdot H^2, \quad \text{etc.}$$
For each nonnegative integer r, it is not hard to check that
$$H = \sum_{i=1}^{d} \lambda_i\, u_i u_i^* \quad\text{implies}\quad H^r = \sum_{i=1}^{d} \lambda_i^r\, u_i u_i^*. \tag{2.3}$$
In particular, H^{2p} is positive semidefinite for each nonnegative integer p.
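The identity (2.3) is easy to confirm numerically; the following Python/NumPy sketch compares the spectral formula with repeated multiplication for an arbitrary Hermitian matrix and exponent:
```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 4, 3
X = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
H = (X + X.conj().T) / 2                       # an arbitrary Hermitian matrix

lam, U = np.linalg.eigh(H)                     # eigenvalue decomposition H = sum_i lam_i u_i u_i^*
H_r_spectral = (U * lam**r) @ U.conj().T       # sum_i lam_i^r u_i u_i^*
H_r_product = np.linalg.matrix_power(H, r)     # H . H . H
assert np.allclose(H_r_spectral, H_r_product)
```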
2.5. Rayleigh's Variational Principle. The Rayleigh principle is an attractive expression for the maximum eigenvalue λ_max(H) of an Hermitian matrix H ∈ H_d. This result states that
$$\lambda_{\max}(H) = \max_{\|u\|=1} u^* H u. \tag{2.4}$$
The maximum takes place over all unit-norm vectors u ∈ C^d. The identity (2.4) follows from the Lagrange multiplier theorem and the existence of the eigenvalue decomposition (2.2). Similarly, the minimum eigenvalue λ_min(H) satisfies
$$\lambda_{\min}(H) = \min_{\|u\|=1} u^* H u. \tag{2.5}$$
We can obtain (2.5) by applying (2.4) to −H.
Rayleigh's principle implies that order relations for positive-semidefinite matrices lead to order relations for their eigenvalues.
Fact 2.1 (Monotonicity). Let A, H ∈ H_d be Hermitian matrices. Then
A ≼ H implies λ_max(A) ≤ λ_max(H).
