Proceedings ArticleDOI

Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression

01 Jun 2013 - pp 91-100
TL;DR: In this article, a low-distortion embedding matrix Π ∈ R^{O(poly(d))×n} that embeds A_p, the ℓ_p subspace spanned by A's columns, into (R^{O(poly(d))}, ‖·‖_p), is constructed in O(nnz(A)) time.
Abstract: Low-distortion embeddings are critical building blocks for developing random sampling and random projection algorithms for common linear algebra problems. We show that, given a matrix A ∈ R^{n×d} with n ≫ d and a p ∈ [1, 2), with a constant probability, we can construct a low-distortion embedding matrix Π ∈ R^{O(poly(d))×n} that embeds A_p, the ℓ_p subspace spanned by A's columns, into (R^{O(poly(d))}, ‖·‖_p); the distortion of our embeddings is only O(poly(d)), and we can compute ΠA in O(nnz(A)) time, i.e., input-sparsity time. Our result generalizes the input-sparsity time ℓ_2 subspace embedding by Clarkson and Woodruff [STOC'13]; and for completeness, we present a simpler and improved analysis of their construction for ℓ_2. These input-sparsity time ℓ_p embeddings are optimal, up to constants, in terms of their running time; and the improved running time propagates to applications such as (1 ± ε)-distortion ℓ_p subspace embedding and relative-error ℓ_p regression. For ℓ_2, we show that a (1 + ε)-approximate solution to the ℓ_2 regression problem specified by the matrix A and a vector b ∈ R^n can be computed in O(nnz(A) + d³ log(d/ε)/ε²) time; and for ℓ_p, via a subspace-preserving sampling procedure, we show that a (1 ± ε)-distortion embedding of A_p into R^{O(poly(d))} can be computed in O(nnz(A) · log n) time, and we also show that a (1 + ε)-approximate solution to the ℓ_p regression problem min_{x∈R^d} ‖Ax − b‖_p can be computed in O(nnz(A) · log n + poly(d) log(1/ε)/ε²) time. Moreover, we can also improve the embedding dimension, or equivalently the sample size, to O(d^{3+p/2} log(1/ε)/ε²) without increasing the complexity.

Summary (2 min read)

1. INTRODUCTION

  • Regression problems are ubiquitous, and the fast computation of their solutions is of interest in many large-scale data applications.
  • The authors' analysis is direct and does not rely on splitting the high-dimensional space into a set of heavy-hitters consisting of the high-leverage components and the complement of that heavy-hitting set.
  • In general, the authors prove that there exists an order among the Cauchy distribution, a p-stable distribution with p ∈ (1, 2), and the Gaussian distribution such that for all p ∈ (1, 2) one can use the upper bound from the Cauchy distribution and the lower bound from the Gaussian distribution.
  • The (1 ± ε)-distortion subspace embedding (for ℓ_p, p ∈ [1, 2), that the authors construct from the input-sparsity time embedding and the fast subspace-preserving sampling) has embedding dimension s = O(poly(d) log(1/ε)/ε²), where the somewhat large poly(d) term directly multiplies the log(1/ε)/ε² term.

Conditioning.

  • The ℓ_p subspace embedding and ℓ_p regression problems are closely related to the concept of conditioning.
  • The authors state here two related notions of ℓ_p-norm conditioning and then a lemma that characterizes the relationship between them.

Lemma 1 ([10])

  • This procedure is called conditioning, and there exist two approaches for conditioning: via low-distortion ℓ_p subspace embedding and via ellipsoidal rounding.
  • The authors simply cite the following lemma, which is based on ellipsoidal rounding.

Stable distributions.

  • The authors use properties of p-stable distributions for analyzing input-sparsity time low-distortion ℓ_p subspace embeddings.
  • By Lévy [19], it is known that p-stable distributions exist for p ∈ (0, 2]; and from Chambers et al. [7], it is known that p-stable random variables can be generated efficiently, thus allowing their practical use.

Tail inequalities.

  • The authors note two inequalities from Clarkson et al. [10] regarding the tails of the Cauchy distribution.
  • The following result about Gaussian variables is a direct consequence of Maurer's inequality ([22]), and the authors will use it to derive lower tail inequalities for p-stable distributions.

3. MAIN RESULTS FOR ℓ_2 EMBEDDING

  • Here is their result for input-sparsity time low-distortion subspace embeddings for ℓ_2.
  • See also Nelson and Nguyen [26] for a similar result with a slightly better constant.
  • The O(nnz(A)) running time is indeed optimal, up to constant factors, for general inputs.
  • The results of Theorem 1 propagate to related applications, e.g., to the ℓ_2 regression problem, the low-rank matrix approximation problem, and the problem of computing approximations to the ℓ_2 leverage scores.
  • The technique used in the proof of Clarkson and Woodruff [11], which splits coordinates into "heavy" and "light" sets based on the leverage scores, highlights an important structural property of the ℓ_2 subspace: only a small subset of coordinates can have large ℓ_2 leverage scores.

4. MAIN RESULTS FOR ℓ_1 EMBEDDING

  • Here is their result for input-sparsity time low-distortion subspace embeddings for ℓ_1.
  • As mentioned above, the O(nnz(A)) running time is optimal.
  • For the same construction of Π, the authors provide a "bad" case that yields a lower bound on the distortion.
  • The authors' input-sparsity time ℓ_1 subspace embedding of Theorem 2 improves the O(nnz(A) · d log d)-time embedding by Sohler and Woodruff [29] and the O(nd log n)-time embedding of Clarkson et al. [10].
  • The authors' improvements in Theorems 2 and 3 also propagate to related ℓ_1-based applications, including the ℓ_1 regression and the ℓ_1 subspace approximation problem considered in [29, 10].

5. MAIN RESULTS FOR ℓ_p EMBEDDING

  • Generally, D_p does not have an explicit PDF/CDF, which increases the difficulty of theoretical analysis.
  • Lemma 8 suggests that the authors can use Lemma 5 (regarding Cauchy random variables) to derive upper tail inequalities for general p-stable distributions and that they can use Lemma 7 (regarding Gaussian variables) to derive lower tail inequalities for general p-stable distributions.
  • Given these results, here is their main result for input-sparsity time low-distortion subspace embeddings for ℓ_p.
  • In particular, the authors can establish an improved algorithm for solving the ℓ_p regression problem in nearly input-sparsity time.

6. IMPROVED EMBEDDING DIMENSION

  • (See the remark below for comments on the precise value of the poly(d) term.)
  • This is not ideal for the subspace embedding and the ℓ_p regression, because the authors want to have a small embedding dimension and a small subsampled problem, respectively.
  • Here, the authors show that it is possible to decouple the large polynomial of d and the log(1/ε)/ε² term via another round of sampling and conditioning, without increasing the complexity.
  • See Algorithm 2 for details on this procedure.

Algorithm 2 Improving the Embedding Dimension

  • Then, by applying Theorem 7 to the ℓ_p regression problem, the authors can improve the size of the subsampled problem and hence the overall running time.
  • The authors have stated their results in the previous sections as poly(d) without stating the value of the polynomial because there are numerous trade-offs between the conditioning quality and the running time.


Low-distortion Subspace Embeddings in Input-sparsity Time and Applications to Robust Linear Regression

Xiangrui Meng*
LinkedIn Corporation
2029 Stierlin Ct, Mountain View, CA 94043
ximeng@linkedin.com

Michael W. Mahoney
Dept. of Mathematics, Stanford University
Stanford, CA 94305
mmahoney@cs.stanford.edu

* Most of this work was done while the author was at ICME, Stanford University, supported by NSF DMS-1009005.

ABSTRACT
Low-distortion embeddings are critical building blocks for developing random sampling and random projection algorithms for common linear algebra problems. We show that, given a matrix A ∈ R^{n×d} with n ≫ d and a p ∈ [1, 2), with a constant probability, we can construct a low-distortion embedding matrix Π ∈ R^{O(poly(d))×n} that embeds A_p, the ℓ_p subspace spanned by A's columns, into (R^{O(poly(d))}, ‖·‖_p); the distortion of our embeddings is only O(poly(d)), and we can compute ΠA in O(nnz(A)) time, i.e., input-sparsity time. Our result generalizes the input-sparsity time ℓ_2 subspace embedding by Clarkson and Woodruff [STOC'13]; and for completeness, we present a simpler and improved analysis of their construction for ℓ_2. These input-sparsity time ℓ_p embeddings are optimal, up to constants, in terms of their running time; and the improved running time propagates to applications such as (1 ± ε)-distortion ℓ_p subspace embedding and relative-error ℓ_p regression. For ℓ_2, we show that a (1 + ε)-approximate solution to the ℓ_2 regression problem specified by the matrix A and a vector b ∈ R^n can be computed in O(nnz(A) + d³ log(d/ε)/ε²) time; and for ℓ_p, via a subspace-preserving sampling procedure, we show that a (1 ± ε)-distortion embedding of A_p into R^{O(poly(d))} can be computed in O(nnz(A) · log n) time, and we also show that a (1 + ε)-approximate solution to the ℓ_p regression problem min_{x∈R^d} ‖Ax − b‖_p can be computed in O(nnz(A) · log n + poly(d) log(1/ε)/ε²) time. Moreover, we can also improve the embedding dimension, or equivalently the sample size, to O(d^{3+p/2} log(1/ε)/ε²) without increasing the complexity.

Categories and Subject Descriptors
F.2 [ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY]: Numerical Algorithms and Problems

Keywords
subspace embedding; input-sparsity time; low-distortion embedding; linear regression; robust regression; ℓ_p regression
1. INTRODUCTION
Regression problems are ubiquitous, and the fast computation of their solutions is of interest in many large-scale data applications. A parameterized family of regression problems that is of particular interest is the overconstrained ℓ_p regression problem: given a matrix A ∈ R^{n×d}, with n > d, a vector b ∈ R^n, a norm ‖·‖_p parameterized by p ∈ [1, ∞], and an error parameter ε > 0, find a (1 + ε)-approximate solution x̂ ∈ R^d to
    f* = min_{x∈R^d} ‖Ax − b‖_p,
i.e., find a vector x̂ such that ‖Ax̂ − b‖_p ≤ (1 + ε) f*, where the ℓ_p norm of a vector x is ‖x‖_p = (Σ_i |x_i|^p)^{1/p}, defined to be max_i |x_i| for p = ∞. Special cases include the ℓ_2 regression problem, also known as the Least Squares problem, and the ℓ_1 regression problem, also known as the Least Absolute Deviations or Least Absolute Errors problem. The latter is of particular interest as a robust estimation or robust regression technique, in that it is less sensitive to the presence of outliers than the former. We are most interested in this paper in the ℓ_1 regression problem due to its robustness properties, but our methods hold for general p ∈ [1, 2], and thus we formulate our results in ℓ_p.
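To make the overconstrained problem concrete, here is a small, self-contained sketch (ours, not from the paper) that builds a noisy instance with a few gross outliers and solves it for p = 2 by least squares and for p = 1 via the standard linear-programming reformulation with slack variables; NumPy/SciPy, the variable names, and the outlier setup are our own choices.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 200, 3
A = rng.standard_normal((n, d))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.1 * rng.standard_normal(n)
b[:5] += 50.0                                   # a few gross outliers

# p = 2: ordinary least squares, min_x ||Ax - b||_2
x_l2, *_ = np.linalg.lstsq(A, b, rcond=None)

# p = 1: min_x ||Ax - b||_1 as an LP with slacks t_i >= |(Ax - b)_i|
c = np.concatenate([np.zeros(d), np.ones(n)])
A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])
b_ub = np.concatenate([b, -b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + n))
x_l1 = res.x[:d]

print("l2 error:", np.linalg.norm(x_l2 - x_true))
print("l1 error:", np.linalg.norm(x_l1 - x_true))  # typically much closer to x_true
```

On instances like this, the ℓ_1 fit is usually far less affected by the corrupted rows, which is the robustness property the paper emphasizes.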
It is well-known that for p ≥ 1, the overconstrained ℓ_p regression problem is a convex optimization problem; for p = 1 and p = ∞, it is an instance of linear programming; and for p = 2, it can be solved with eigenvector-based methods such as with the QR decomposition or the Singular Value Decomposition of A. In spite of their low-degree polynomial-time solvability, ℓ_p regression problems have been the focus in recent years of a wide range of random sampling and random projection algorithms, largely due to a desire to develop improved algorithms for large-scale data applications [3, 24, 10]. For example, Clarkson [9] uses subgradient and sampling methods to compute an approximate solution to the overconstrained ℓ_1 regression problem in roughly O(nd^5 log n) time; and Dasgupta et al. [12] use well-conditioned bases and subspace-preserving sampling algorithms to solve general ℓ_p regression problems, for p ∈ [1, ∞), in roughly O(nd^5 log n) time. A similar subspace-preserving sampling algorithm was developed by Drineas, Mahoney, and Muthukrishnan [16] to compute an approximate solution to the ℓ_2 regression problem. The algorithm of [16] relies on the estimation of the ℓ_2 leverage scores of A to be used as an importance sampling distribution (recall that for an n × d matrix A, with n ≫ d, the ℓ_2 leverage scores of the rows of A are equal to the diagonal elements of the projection matrix onto the span of A; see [20, 15] for details, and note that they can be generalized to ℓ_1 and other ℓ_p norms [10] as well as to arbitrary n × d matrices, with both n and d large [21, 15]); but when combined with the results of Sarlós [28] and Drineas et al. [17] (that quickly preprocess A to uniformize those scores) or Drineas et al. [15] (that quickly computes approximations to those scores), this leads to a random projection or random sampling (respectively) algorithm for the ℓ_2 regression problem that runs in roughly O(nd log d) time [17, 20]. More recently, Sohler and Woodruff [29] introduced the Cauchy Transform to obtain improved ℓ_1 embeddings, thereby leading to an algorithm for the ℓ_1 regression problem that runs in O(nd^{1.376+}) time; and Clarkson et al. [10] use the Fast Cauchy Transform and ellipsoidal rounding methods to compute an approximation to the solution of general ℓ_p regression problems in roughly O(nd log n) time.
These algorithms, and in particular the algorithms for p = 2, form the basis for much of the large body of recent work in randomized algorithms for low-rank matrix approximation, and thus optimizing their properties can have immediate practical benefits. See, e.g., the recent monograph of Mahoney [20] and references therein for details. Although some of these algorithms are near-optimal for dense inputs, they all require Ω(nd log d) time, which can be large if the input matrix is very sparse. Thus, it was a significant result when Clarkson and Woodruff [11] developed an algorithm for the ℓ_2 regression problem (as well as the related problems of low-rank matrix approximation and ℓ_2 leverage score approximation) that runs in input-sparsity time, i.e., in O(nnz(A) + poly(d/ε)) time, where nnz(A) is the number of non-zero elements in A and ε is an error parameter. This result depends on the construction of a sparse embedding matrix Π for ℓ_2. By this, we mean the following: for an n × d matrix A, an s × n matrix Π such that
    (1 − ε)‖Ax‖_2 ≤ ‖ΠAx‖_2 ≤ (1 + ε)‖Ax‖_2,
for all x ∈ R^d. That is, Π embeds the column space of A into R^s, while approximately preserving the ℓ_2 norms of all vectors in that subspace. Clarkson and Woodruff achieve their improved results for ℓ_2-based problems by showing how to construct such a Π with s = poly(d/ε) and showing that it can be applied to an arbitrary A in O(nnz(A)) time [11]. (In particular, this embedding result improves the result of Meng, Saunders, and Mahoney [24], who in their development of the parallel least-squares solver LSRN use a result from Davidson and Szarek [14] to construct a constant-distortion embedding for ℓ_2 that runs in O(nnz(A) · d) time.)
Interestingly, the analysis of Clarkson and Woodruff coupled ideas from the data streaming literature with the structural fact that there cannot be too many high-leverage constraints/rows in A. In particular, they showed that the high-leverage parts of the subspace may be viewed as heavy-hitters that are "perfectly hashed," and thus contribute no distortion, and that the distortion of the rest of the subspace as well as the "cross terms" may be bounded with a result of Dasgupta, Kumar, and Sarlós [13].
In this paper, we provide improved low-distortion subspace embeddings for ℓ_p, for all p ∈ [1, 2], in input-sparsity time. We also show that, by coupling with recent work on fast subspace-preserving sampling from [10], these embeddings can be used to provide (1 + ε)-approximate solutions to ℓ_p regression problems, for p ∈ [1, 2], in nearly input-sparsity time. In more detail, our main results are the following.
First, for ℓ_2, we obtain an improved result for the input-sparsity time (1 ± ε)-distortion embedding of [11]. In particular, for the same embedding procedure, we obtain improved bounds for the embedding dimension with a much simpler analysis than [11]. See Theorem 1 of Section 3 for a precise statement of this result. Our analysis is direct and does not rely on splitting the high-dimensional space into a set of heavy-hitters consisting of the high-leverage components and the complement of that heavy-hitting set. Since our result directly improves the ℓ_2 embedding result of Clarkson and Woodruff [11], it immediately leads to improvements for the ℓ_2 regression, low-rank matrix approximation, and ℓ_2 leverage score estimation problems that they consider.
Second, for ℓ_1, we obtain a low-distortion sparse embedding matrix Π such that ΠA can be computed in input-sparsity time. That is, we construct an embedding matrix Π ∈ R^{O(poly(d))×n} such that, for all x ∈ R^d,
    1/O(poly(d)) · ‖Ax‖_1 ≤ ‖ΠAx‖_1 ≤ O(poly(d)) · ‖Ax‖_1,
with a constant probability, and ΠA can be computed in O(nnz(A)) time. See Theorem 2 of Section 4 for a precise statement of this result. Here, our proof involves splitting the set Y = {Ux | ‖x‖_∞ = 1, x ∈ R^d}, where U is an ℓ_1 well-conditioned basis for the span of A, into two parts, informally a subset where coordinates of high ℓ_1 leverage dominate ‖y‖_1 and the complement of that subset. This ℓ_1 result leads to immediate improvements in ℓ_1-based problems. For example, by taking advantage of the fast version of subspace-preserving sampling from [10], we can construct and apply a (1 ± ε)-distortion sparse embedding matrix for ℓ_1 in O(nnz(A) · log n + poly(d/ε)) time. In addition, we can use it to compute a (1 + ε)-approximation to the ℓ_1 regression problem in O(nnz(A) · log n + poly(d/ε)) time, which in turn leads to immediate improvements in ℓ_1-based matrix approximation objectives, e.g., for the ℓ_1 subspace approximation problem [6, 29, 10].
Third, for ℓ_p, for all p ∈ (1, 2), we obtain a low-distortion sparse embedding matrix Π such that ΠA can be computed in input-sparsity time. That is, we construct an embedding matrix Π ∈ R^{O(poly(d))×n} such that, for all x ∈ R^d,
    1/O(poly(d)) · ‖Ax‖_p ≤ ‖ΠAx‖_p ≤ O(poly(d)) · ‖Ax‖_p,
with a constant probability, and ΠA can be computed in O(nnz(A)) time. See Theorem 4 of Section 5 for a precise statement of this result. Here, our proof generalizes the ℓ_1 result, but we need to prove upper and lower tail bound inequalities for sampling from general p-stable distributions that are of independent interest. Although these distributions don't have closed forms for p ∈ (1, 2) in general, we prove that there exists an order among the Cauchy distribution, a p-stable distribution with p ∈ (1, 2), and the Gaussian distribution such that for all p ∈ (1, 2) we can use the upper bound from the Cauchy distribution and the lower bound from the Gaussian distribution. As with our ℓ_1 result, this ℓ_p result has several extensions: in O(nnz(A) · log n + poly(d/ε)) time, we can construct and apply a (1 ± ε)-distortion sparse embedding matrix for ℓ_p; in O(nnz(A) · log n + poly(d/ε)) time, we can compute a (1 + ε)-approximation to the ℓ_p regression problem; and in O(nnz(A) · d log d) time, we can construct and apply a near-optimal (in terms of embedding dimension and distortion factor) embedding matrix.
The (1 ± ε)-distortion subspace embedding (for ℓ_p, p ∈ [1, 2), that we construct from the input-sparsity time embedding and the fast subspace-preserving sampling) has embedding dimension s = O(poly(d) log(1/ε)/ε²), where the somewhat large poly(d) term directly multiplies the log(1/ε)/ε² term. We can also improve this, showing that it is possible, without increasing the overall complexity, to decouple the large poly(d) and log(1/ε)/ε² via another round of sampling and conditioning, thereby obtaining an embedding dimension that is a small poly(d) times log(1/ε)/ε². See Theorem 7 of Section 6 for a precise statement of this result.
Remark. Subsequent to our posting the first version of this paper on arXiv [23], Clarkson and Woodruff let us know that, independently of us, they used a result from [10] to extend their ℓ_2 subspace embedding from [11] to provide a nearly input-sparsity time algorithm for ℓ_p regression, for all p ∈ [1, ∞). This is now posted as Version 2 of [11]. Their approach requires solving a rounding problem of size O(n/poly(d)) × d, which depends on n (possibly very large). Our approach via input-sparsity time oblivious low-distortion ℓ_p subspace embeddings does not contain this intermediate step, and it only needs O(poly(d)) storage.
Remark. In the first version of this paper, the embedding dimension for ℓ_2 in Theorem 1 was O(d⁴/ε²). Subsequent to the dissemination of this version, Drineas pointed out to us that our result could very easily be improved to O(d²/ε²). Nelson and Nguyen also let us know that, at about the same time and using the same technique, but independent of us, they first published the O(d²/ε²) embedding result [26].
2. BACKGROUND
We use ‖·‖_p to denote the ℓ_p norm of a vector, ‖·‖_2 the spectral norm of a matrix, ‖·‖_F the Frobenius norm of a matrix, and |·|_p the element-wise ℓ_p norm of a matrix. Given A ∈ R^{n×d} with full column rank and p ∈ [1, 2], we use A_p to denote the ℓ_p subspace spanned by A's columns. We are interested in fast embedding of A_p into a d-dimensional subspace of (R^{poly(d)}, ‖·‖_p), with distortion either poly(d) or (1 ± ε), for some ε > 0, as well as applications of this embedding to problems such as ℓ_p regression. We assume that n ≫ poly(d) ≫ d log n. To state our results, we assume that we are capable of computing a (1 + ε)-approximate solution to an ℓ_p regression problem of size n′ × d for some ε > 0, as long as n′ is independent of n. Denote the running time needed to solve this smaller problem by T_p(ε; n′, d). In theory, we have T_2(ε; n′, d) = O(n′ d log(d/ε) + d³) (see Drineas et al. [17]), and T_p(ε; n′, d) = O((n′ d² + poly(d)) log(n′/ε)) for general p (see Mitchell [25]).
Conditioning.
The ℓ_p subspace embedding and ℓ_p regression problems are closely related to the concept of conditioning. We state here two related notions of ℓ_p-norm conditioning and then a lemma that characterizes the relationship between them.

Definition 1 ([10]). Given an n × d matrix A and p ∈ [1, ∞], let σ_p^max(A) = max_{‖x‖_2 ≤ 1} ‖Ax‖_p and let σ_p^min(A) = min_{‖x‖_2 ≥ 1} ‖Ax‖_p. Then, we denote by κ_p(A) the ℓ_p-norm condition number of A: κ_p(A) = σ_p^max(A)/σ_p^min(A). For simplicity, we will use κ_p, σ_p^min, and σ_p^max when the underlying matrix is clear.
Definition 2 ([12]). Given an n × d matrix A and p ∈ [1, ∞], let q be the dual norm of p. Then A is (α, β, p)-conditioned if (1) |A|_p ≤ α, and (2) for all z ∈ R^d, ‖z‖_q ≤ β‖Az‖_p. Define κ̄_p(A) as the minimum value of αβ such that A is (α, β, p)-conditioned.

Lemma 1 ([10]). Given an n × d matrix A and p ∈ [1, ∞]: d^{−|1/2−1/p|} κ_p(A) ≤ κ̄_p(A) ≤ d^{max{1/2, 1/p}} κ_p(A).
Remark. Given the equivalence established by Lemma 1, we will say that A is well-conditioned in the ℓ_p norm if κ_p(A) or κ̄_p(A) = O(poly(d)), independent of n.
Although for an arbitrary matrix A ∈ R^{n×d} the condition numbers κ_p(A) and κ̄_p(A) can be arbitrarily large, we can find a matrix R ∈ R^{d×d} such that AR^{−1} is well-conditioned. This procedure is called conditioning, and there exist two approaches for conditioning: via low-distortion ℓ_p subspace embedding and via ellipsoidal rounding.
Definition 3. Given an n × d matrix A and a number p ∈ [1, ∞], Π ∈ R^{s×n} is a low-distortion embedding of A_p if s = O(poly(d)) and, for all x ∈ R^d,
    1/O(poly(d)) · ‖Ax‖_p ≤ ‖ΠAx‖_p ≤ O(poly(d)) · ‖Ax‖_p.
Remark. Given a low-distortion embedding matrix Π of A_p, let R be the "R" matrix from the QR decomposition of ΠA. Then, AR^{−1} is well-conditioned in the ℓ_p norm.
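A minimal sketch of the conditioning recipe in the remark above (our own code, not the paper's), assuming a sketch ΠA has already been computed by some low-distortion embedding of A_p; the helper name conditioning_matrix is ours.

```python
import numpy as np

def conditioning_matrix(PA):
    """Given a sketch PA = Pi @ A of a tall matrix A, return the R factor of
    the QR decomposition of PA; per the remark above, A @ inv(R) is then
    well-conditioned in the l_p norm."""
    _, R = np.linalg.qr(PA)        # PA is small: O(poly(d)) x d
    return R

# Usage sketch:
#   R = conditioning_matrix(PA)
#   For huge sparse A, never materialize A @ inv(R); instead apply inv(R)
#   (or solve against R) on the fly whenever a row of the basis is needed.
```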
For a discussion of ellipsoidal rounding, we refer readers to Clarkson et al. [10]. In this paper, we simply cite the following lemma, which is based on ellipsoidal rounding.

Lemma 2 ([10]). Given an n × d matrix A and p ∈ [1, ∞], it takes at most O(nd³ log n) time to find a matrix R ∈ R^{d×d} such that κ_p(AR^{−1}) ≤ 2d.
Subspace-preserving sampling and ℓ_p regression.
Given R ∈ R^{d×d} such that AR^{−1} is well-conditioned in the ℓ_p norm, we can construct a (1 ± ε)-distortion embedding, specifically a subspace-preserving sampling, of A_p in O(nnz(A) · log n) additional time and with a constant probability. This result from Clarkson et al. [10, Theorem 5.4] improves the subspace-preserving sampling algorithm proposed by Dasgupta et al. [12] by estimating the row norms of AR^{−1} (instead of computing them exactly) to define importance sampling probabilities.

Lemma 3 ([10]). Given a matrix A ∈ R^{n×d}, p ∈ [1, ∞), ε > 0, and a matrix R ∈ R^{d×d} such that AR^{−1} is well-conditioned, it takes O(nnz(A) · log n) time to compute a sampling matrix S ∈ R^{s×n} (with only one nonzero element per row) with s = O(κ̄_p^p(AR^{−1}) d^{|p/2−1|+1} log(1/ε)/ε²) such that, with a constant probability,
    (1 − ε)‖Ax‖_p ≤ ‖SAx‖_p ≤ (1 + ε)‖Ax‖_p, ∀x ∈ R^d.
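The following simplified sketch (ours) imitates the sampling scheme behind Lemma 3. Unlike [10], it computes the row norms of AR^{−1} exactly and densely rather than estimating them in O(nnz(A) · log n) time, keeps each row independently with probability proportional to its ℓ_p importance, and rescales the kept rows; the function name and parameters are our own.

```python
import numpy as np

def lp_sampling_sketch(A, R, p, s, rng):
    """Row-sampling sketch in the spirit of Lemma 3 (simplified: exact row
    norms of A @ inv(R) instead of the fast estimates used in [10]).
    Row i is kept with probability q_i = min(1, s * t_i / sum(t)), where
    t_i = ||(A R^{-1})_i||_p^p, and is rescaled by q_i**(-1/p)."""
    U = A @ np.linalg.inv(R)                  # well-conditioned basis (dense here)
    t = np.sum(np.abs(U) ** p, axis=1)        # l_p "importance" of each row
    q = np.minimum(1.0, s * t / t.sum())
    keep = rng.random(A.shape[0]) < q
    SA = A[keep] * (q[keep] ** (-1.0 / p))[:, None]
    return SA, keep

# With s chosen as in Lemma 3, ||SA x||_p stays within (1 +/- eps) of
# ||A x||_p for all x, with a constant probability.
```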
Given a subspace-preserving sampling algorithm, Clarkson et al. [10, Theorem 5.4] show it is straightforward to compute a (1+ε)/(1−ε)-approximate solution to an ℓ_p regression problem.

Lemma 4 ([10]). Given an ℓ_p regression problem specified by A ∈ R^{n×d}, b ∈ R^n, and p ∈ [1, ∞), let S be a (1 ± ε)-distortion embedding matrix of the subspace spanned by A's columns and b from Lemma 3, and let x̂ be an optimal solution to the subsampled problem min_{x∈R^d} ‖SAx − Sb‖_p. Then x̂ is a (1+ε)/(1−ε)-approximate solution to the original problem.

Remark. Collecting these results, we see that low-distortion ℓ_p subspace embedding is a fundamental building block (and very likely a bottleneck) for (1 ± ε)-distortion ℓ_p subspace embeddings, as well as for a (1 + ε)-approximation to an ℓ_p regression problem. This motivates our work and its emphasis on finding low-distortion subspace embeddings.

Stable distributions.
We use properties of p-stable distributions for analyzing input-sparsity time low-distortion ℓ_p subspace embeddings.

Definition 4. A distribution D over R is called p-stable if, for any m real numbers a_1, ..., a_m, we have
    Σ_{i=1}^m a_i X_i ≃ (Σ_{i=1}^m |a_i|^p)^{1/p} X,
where the X_i are i.i.d. draws from D and X ∼ D. By X ≃ Y, we mean X and Y have the same distribution.

By Lévy [19], it is known that p-stable distributions exist for p ∈ (0, 2]; and from Chambers et al. [7], it is known that p-stable random variables can be generated efficiently, thus allowing their practical use. Let us use D_p to denote the "standard" p-stable distribution, for p ∈ [1, 2], specified by its characteristic function ψ(t) = e^{−|t|^p}. It is known that D_1 is the standard Cauchy distribution, and that D_2 is the Gaussian distribution with mean 0 and variance 2.
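For concreteness, here is a hedged implementation (ours) of the Chambers-Mallows-Stuck transform cited as [7], for the symmetric "standard" p-stable distribution D_p with characteristic function exp(−|t|^p); the function name and the sanity-check comment are our own. For p = 1 the formula reduces to tan(θ), i.e., the standard Cauchy, and for p = 2 to a Gaussian with mean 0 and variance 2, matching the text above.

```python
import numpy as np

def standard_p_stable(p, size, rng):
    """Draw symmetric p-stable variables with characteristic function
    exp(-|t|^p) via the Chambers-Mallows-Stuck transform [7].
    p = 1 gives the standard Cauchy; p = 2 gives N(0, 2)."""
    theta = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    w = rng.exponential(1.0, size)                     # Exp(1) variable
    return (np.sin(p * theta) / np.cos(theta) ** (1.0 / p)
            * (np.cos((1.0 - p) * theta) / w) ** ((1.0 - p) / p))

rng = np.random.default_rng(0)
x = standard_p_stable(1.5, 100_000, rng)
# Stability check (Definition 4): a1*X1 + a2*X2 should be distributed like
# (|a1|**1.5 + |a2|**1.5) ** (1 / 1.5) * X for independent draws X1, X2, X.
```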
Tail inequalities.
We note two inequalities from Clarkson et al. [10] regarding the tails of the Cauchy distribution.

Lemma 5. For i = 1, ..., m, let C_i be m (not necessarily independent) standard Cauchy variables, and γ_i > 0 with γ = Σ_i γ_i. Let X = Σ_i γ_i |C_i|. For any t > 1,
    Pr[X > γt] ≤ (1/(πt)) · ( log(1 + (2mt)²) / (1 − 1/(πt)) + 1 ).
For simplicity, if we assume that m ≥ 3 and t ≥ 1, then we have Pr[X > γt] ≤ 2 log(mt)/t.

Lemma 6. For i = 1, ..., m, let C_i be independent standard Cauchy random variables, and γ_i ≥ 0 with γ = Σ_i γ_i. Let X = Σ_i γ_i |C_i|. Then, for any t > 0,
    log Pr[X ≤ (1 − t)γ] ≤ −γt² / (3 max_i γ_i).

The following result about Gaussian variables is a direct consequence of Maurer's inequality ([22]), and we will use it to derive lower tail inequalities for p-stable distributions.

Lemma 7. For i = 1, ..., m, let G_i be independent standard Gaussian random variables, and γ_i ≥ 0 with γ = Σ_i γ_i. Let X = Σ_i γ_i |G_i|². Then, for any t > 0,
    log Pr[X ≤ (1 − t)γ] ≤ −γt² / (6 max_i γ_i).
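As a quick numerical illustration (ours, not part of the paper), the simplified bound of Lemma 5, Pr[X > γt] ≤ 2 log(mt)/t, can be compared against a Monte Carlo estimate for a weighted sum of Cauchy magnitudes; for simplicity the draws below are independent, although the lemma also allows dependent variables.

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials, t = 50, 200_000, 20.0
gamma_i = rng.random(m)                  # arbitrary positive weights
gamma = gamma_i.sum()

C = rng.standard_cauchy((trials, m))     # independent standard Cauchy draws
X = np.abs(C) @ gamma_i                  # X = sum_i gamma_i * |C_i|

empirical = np.mean(X > t * gamma)
bound = 2 * np.log(m * t) / t            # Lemma 5 (m >= 3, t >= 1)
print(empirical, bound)                  # the empirical tail sits below the bound
```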
3. MAIN RESULTS FOR ℓ_2 EMBEDDING
Here is our result for input-sparsity time low-distortion subspace embeddings for ℓ_2. See also Nelson and Nguyen [26] for a similar result with a slightly better constant.

Theorem 1. Given a matrix A ∈ R^{n×d} and ε, δ ∈ (0, 1), let Π = SD, where S ∈ R^{s×n} has each column chosen independently and uniformly from the s standard basis vectors of R^s and D ∈ R^{n×n} is a diagonal matrix with diagonal entries chosen independently and uniformly from ±1. Let s = (d² + d)/(ε²δ). Then with probability at least 1 − δ,
    (1 − ε)‖Ax‖_2 ≤ ‖ΠAx‖_2 ≤ (1 + ε)‖Ax‖_2, ∀x ∈ R^d.
In addition, ΠA can be computed in O(nnz(A)) time.
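To make the construction in Theorem 1 concrete, here is a small sketch (our own code, not the authors') that applies Π = SD to A in a single pass over its nonzeros without forming Π explicitly: each row i of A is hashed to a uniformly random one of the s output rows (the nonzero position in column i of S) and multiplied by a random sign (the i-th diagonal of D). The helper name and all parameter values are ours.

```python
import numpy as np
import scipy.sparse as sp

def sparse_l2_embed(A, s, rng):
    """Apply the Theorem-1 sketch Pi = S D to A in O(nnz(A)) time.

    Row i of A is multiplied by a random sign and added to a uniformly
    random one of the s output rows. Returns the s-by-d matrix Pi @ A."""
    n, _ = A.shape
    h = rng.integers(0, s, size=n)           # bucket (column of S) for each row
    sign = rng.choice([-1.0, 1.0], size=n)   # diagonal of D
    S = sp.csr_matrix((sign, (h, np.arange(n))), shape=(s, n))
    return S @ A                             # one pass over the nonzeros of A

rng = np.random.default_rng(0)
n, d, eps, delta = 100_000, 10, 0.5, 0.1
A = sp.random(n, d, density=1e-3, random_state=0, format="csr")
s = int((d ** 2 + d) / (eps ** 2 * delta))   # the bound from Theorem 1
PA = sparse_l2_embed(A, s, rng)
x = rng.standard_normal(d)
print(np.linalg.norm(PA @ x), np.linalg.norm(A @ x))
# The two norms agree up to the (1 +/- eps) factor, with prob. >= 1 - delta.
```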
The construction of Π in this theorem is the same as in Clarkson and Woodruff [11]. There, s = O((d/ε)⁴ log²(d/ε)) in order to achieve (1 ± ε) distortion with a constant probability. Theorem 1 shows that it actually suffices to set s = O((d² + d)/ε²). Surprisingly, the proof is rather simple. Let X = U^T Π^T Π U, where U is an orthonormal basis for A_2. Compute E[‖X − I‖_F²] and apply Markov's inequality to ‖X − I‖_F² ≤ ε², which implies ‖X − I‖_2 ≤ ε and hence the embedding result. See Appendix A.1 for a complete proof.
Remark. The O(nnz(A)) running time is indeed optimal, up to constant factors, for general inputs. Consider the case when A has an important row a_j such that A becomes rank-deficient without it. Thus, we have to observe a_j in order to compute a low-distortion embedding. However, without any prior knowledge, we have to scan at least a constant portion of the input to guarantee that a_j is observed with a constant probability, which takes O(nnz(A)) time. Note that this optimality result applies to general p.
The results of Theorem 1 propagate to related applications, e.g., to the ℓ_2 regression problem, the low-rank matrix approximation problem, and the problem of computing approximations to the ℓ_2 leverage scores. Since it underlies the other applications, only the ℓ_2 regression improvement is stated here explicitly; its proof is basically combining our Theorem 1 with Theorem 19 of [11].
Corollary 1. With a constant probability, a (1 + ε)-approximate solution to an ℓ_2 regression problem can be computed in O(nnz(A) + T_2(ε; d²/ε², d)) time.
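A minimal sketch-and-solve illustration of Corollary 1 (ours): sketch [A b] with the Theorem 1 embedding and solve the small least-squares problem directly. The corollary's stated running time instead plugs the sketched problem into a solver with cost T_2(ε; d²/ε², d); the function name and parameters below are our own.

```python
import numpy as np

def sketch_and_solve_l2(A, b, eps, delta, rng):
    """Minimal sketch-and-solve illustration of Corollary 1 (our own sketch):
    apply the Theorem-1 embedding to [A b], then solve the small problem."""
    n, d = A.shape
    s = int(((d + 1) ** 2 + (d + 1)) / (eps ** 2 * delta))  # sketch size for [A b]
    h = rng.integers(0, s, size=n)                          # bucket per row
    sign = rng.choice([-1.0, 1.0], size=n)                  # random signs
    SA = np.zeros((s, d)); Sb = np.zeros(s)
    np.add.at(SA, h, sign[:, None] * A)                     # SA = Pi @ A in one pass
    np.add.at(Sb, h, sign * b)                              # Sb = Pi @ b
    x_hat, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x_hat

rng = np.random.default_rng(0)
A = rng.standard_normal((50_000, 8)); b = rng.standard_normal(50_000)
x_hat = sketch_and_solve_l2(A, b, eps=0.5, delta=0.1, rng=rng)
```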
Remark. Although our simpler direct proof leads to a better result for ℓ_2 subspace embedding, the technique used in the proof of Clarkson and Woodruff [11], which splits coordinates into "heavy" and "light" sets based on the leverage scores, highlights an important structural property of the ℓ_2 subspace: that only a small subset of coordinates can have large ℓ_2 leverage scores. (We note that the technique of splitting coordinates is also used by Ailon and Liberty [1] to get an unrestricted fast Johnson-Lindenstrauss transform; and that the difficulty in finding and approximating the large-leverage directions was, until recently [20, 15], responsible for difficulties in obtaining fast relative-error random sampling algorithms for ℓ_2 regression and low-rank matrix approximation.) An analogous structural fact holds for ℓ_1 and other ℓ_p spaces. Using this property, we can construct novel input-sparsity time ℓ_p subspace embeddings for general p ∈ [1, 2), as we discuss in the next two sections.
4. MAIN RESULTS FOR ℓ_1 EMBEDDING
Here is our result for input-sparsity time low-distortion subspace embeddings for ℓ_1.

Theorem 2. Given A ∈ R^{n×d}, let Π = SC ∈ R^{s×n}, where S ∈ R^{s×n} has each column chosen independently and uniformly from the s standard basis vectors of R^s, and where C ∈ R^{n×n} is a diagonal matrix with diagonals chosen independently from the standard Cauchy distribution. Set s = ωd^5 log^5 d with ω sufficiently large. Then with a constant probability, we have, for all x ∈ R^d,
    1/O(d² log² d) · ‖Ax‖_1 ≤ ‖ΠAx‖_1 ≤ O(d log d) · ‖Ax‖_1.
In addition, ΠA can be computed in O(nnz(A)) time.
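Relative to the ℓ_2 sketch after Theorem 1, the only change is the diagonal: the snippet below (ours) reuses the same bucket structure for S but draws the diagonal of C from the standard Cauchy distribution, as Theorem 2 prescribes.

```python
import numpy as np
import scipy.sparse as sp

def sparse_l1_embed(A, s, rng):
    """Theorem-2 style sketch Pi = S C: one random bucket per row of A and a
    standard Cauchy diagonal (instead of the +/-1 diagonal used for l2)."""
    n, _ = A.shape
    h = rng.integers(0, s, size=n)
    c = rng.standard_cauchy(n)                        # diagonal of C
    S = sp.csr_matrix((c, (h, np.arange(n))), shape=(s, n))
    return S @ A                                      # O(nnz(A)) work

# Per Theorem 2, s = omega * d**5 * log(d)**5 for a sufficiently large constant
# omega; the resulting distortion is O(poly(d)) rather than 1 +/- eps.
```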
The construction of the ℓ_1 subspace embedding matrix differs from its ℓ_2 norm counterpart only in the diagonal elements of D (or C): whereas we use ±1 for the ℓ_2 norm, we use Cauchy variables for the ℓ_1 norm. The proof of Theorem 2 uses the technique of splitting coordinates, the fact that the Cauchy distribution is 1-stable, and the upper and lower tail inequalities from Lemmas 5 and 6. See Appendix A.2 for a complete proof.
Remark. As mentioned above, the O(nnz(A)) running time is optimal. Whether the distortion O(d³ log³ d) is optimal is still an open question. However, for the same construction of Π, we can provide a "bad" case that provides a lower bound. Choose A = (I_d 0)^T. Suppose that s is sufficiently large such that, with an overwhelming probability, the top d rows of A are perfectly hashed, i.e., ‖ΠAx‖_1 = Σ_{k=1}^d |c_k||x_k| for all x ∈ R^d, where c_k is the k-th diagonal of C. Then, the distortion of Π is max_{k≤d} |c_k| / min_{k≤d} |c_k|, which is on the order of d². Therefore, at most an O(d log³ d) factor of the distortion is due to artifacts in our analysis.
Our input-sparsity time ℓ_1 subspace embedding of Theorem 2 improves the O(nnz(A) · d log d)-time embedding by Sohler and Woodruff [29] and the O(nd log n)-time embedding of Clarkson et al. [10]. In addition, by combining Theorem 2 and Lemma 3, we can compute a (1 ± ε)-distortion embedding in nearly input-sparsity time.

Theorem 3. Given A ∈ R^{n×d}, it takes O(nnz(A) · log n) time to compute a sampling matrix S ∈ R^{s×n} with s = O(poly(d) log(1/ε)/ε²) such that, with a constant probability, S embeds A_1 into (R^s, ‖·‖_1) with distortion 1 ± ε.
Our improvements in Theorems 2 and 3 also propagate to related ℓ_1-based applications, including the ℓ_1 regression and the ℓ_1 subspace approximation problem considered in [29, 10]. As before, only the regression improvement is stated here explicitly. For completeness, we present in Algorithm 1 our algorithm for solving ℓ_1 regression problems in nearly input-sparsity time. The brief proof of Corollary 2, our main quality-of-approximation result for Algorithm 1, may be found in Appendix A.3.
Algorithm 1 Fast ℓ_1 Regression Approximation in O(nnz(A) · log n + poly(d) log(1/ε)/ε²) Time
Input: A ∈ R^{n×d}, b ∈ R^n, and ε ∈ (0, 1/2).
Output: A (1 + ε)-approximate solution x̂ to min_{x∈R^d} ‖Ax − b‖_1, with a constant probability.
1: Let Ā = [A b] and denote by Ā_1 the ℓ_1 subspace spanned by A's columns and b.
2: Compute a low-distortion embedding Π ∈ R^{O(poly(d))×n} of Ā_1 (Theorem 2).
3: Compute R̄ ∈ R^{(d+1)×(d+1)} from ΠĀ such that ĀR̄^{−1} is well-conditioned (QR or Lemma 2).
4: Compute a (1 ± ε/4)-distortion embedding matrix S ∈ R^{O(poly(d) log(1/ε)/ε²)×n} of Ā_1 (Lemma 3).
5: Compute a (1 + ε/4)-approximate solution x̂ to min_{x∈R^d} ‖SAx − Sb‖_1.
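A simplified end-to-end rendering of Algorithm 1 (our own sketch, not the authors' code): the sketch sizes s_embed and s_sample are taken as inputs rather than the poly(d) prescriptions of Theorems 2 and 3, the row norms in the sampling step are computed exactly instead of estimated as in [10], and the small subsampled problem is solved exactly with an off-the-shelf LP; solve_l1, fast_l1_regression, and all parameter names are ours.

```python
import numpy as np
import scipy.sparse as sp
from scipy.optimize import linprog

def solve_l1(M, y):
    """Exact l1 regression on a small problem via linear programming."""
    m, d = M.shape
    c = np.concatenate([np.zeros(d), np.ones(m)])
    A_ub = np.block([[M, -np.eye(m)], [-M, -np.eye(m)]])
    b_ub = np.concatenate([y, -y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + m))
    return res.x[:d]

def fast_l1_regression(A, b, rng, s_embed, s_sample):
    """Simplified sketch of Algorithm 1: embed, condition, sample, solve."""
    n, d = A.shape
    Abar = np.column_stack([A, b])                     # step 1: Abar = [A b]
    # Step 2: low-distortion l1 embedding (Theorem 2): buckets + Cauchy diagonal.
    h = rng.integers(0, s_embed, size=n)
    S1 = sp.csr_matrix((rng.standard_cauchy(n), (h, np.arange(n))),
                       shape=(s_embed, n))
    PiAbar = S1 @ Abar
    # Step 3: conditioning matrix from the QR factorization of Pi @ Abar.
    _, Rbar = np.linalg.qr(PiAbar)
    # Step 4: subspace-preserving sampling (Lemma 3, simplified to exact norms).
    U = Abar @ np.linalg.inv(Rbar)
    t = np.abs(U).sum(axis=1)                          # l1 importance per row
    q = np.minimum(1.0, s_sample * t / t.sum())
    keep = rng.random(n) < q
    w = 1.0 / q[keep]                                  # rescaling for p = 1
    # Step 5: solve the small weighted l1 problem.
    return solve_l1(w[:, None] * A[keep], w * b[keep])
```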
Corollary 2. With a constant probability, Algorithm 1 computes a (1 + ε)-approximation to an ℓ_1 regression problem in O(nnz(A) · log n + T_1(ε; poly(d) log(1/ε)/ε², d)) time.

Remark. For readers familiar with the impossibility results for dimension reduction in ℓ_1 [8, 18, 5], note that those results apply to arbitrary point sets of size n and are interested in embeddings that are "oblivious," in that they do not depend on the input data. In this paper, we only consider points in a subspace, and the subspace-preserving sampling procedure of [12] that we use is data-dependent.
5. MAIN RESULTS FOR ℓ_p EMBEDDING
In this section, we use the properties of p-stable distributions to generalize the input-sparsity time ℓ_1 subspace embedding to ℓ_p norms, for p ∈ (1, 2). Generally, D_p does not have an explicit PDF/CDF, which increases the difficulty of theoretical analysis. Indeed, the main technical difficulty here is that we are not aware of ℓ_p analogues of Lemmas 5 and 6 that would provide upper and lower tail inequalities for p-stable distributions. (Indeed, even Lemmas 5 and 6 were established only recently [10].)
Instead of analyzing D_p directly, for any p ∈ (1, 2), we establish an order among the Cauchy distribution, the p-stable distribution, and the Gaussian distribution, and then we derive upper and lower tail inequalities for the p-stable distribution similar to the ones we used to prove Theorem 2. We state these technical results here since they are of independent interest. We start with the following lemma, which is proved in Appendix A.4 and which establishes this order.

Lemma 8. For any p ∈ (1, 2), there exist constants α_p > 0 and β_p > 0 such that
    α_p |C| ⪰ |X_p|^p ⪰ β_p |G|²,
where C is a standard Cauchy variable, X_p ∼ D_p, and G is a standard Gaussian variable. By X ⪰ Y we mean Pr[X ≥ t] ≥ Pr[Y ≥ t], ∀t ∈ R.

Our numerical results suggest that the constants α_p and β_p are not too far away from 1. See [23] for more details.
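As an informal numerical illustration of Lemma 8 (ours, with the constants α_p and β_p suppressed, consistent with the remark that they are not far from 1), one can compare empirical upper tails of |C|, |X_p|^p, and |G|² for, say, p = 1.5, generating the p-stable draws with the Chambers-Mallows-Stuck transform sketched earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 1.5, 1_000_000

# Draw |C|, |X_p|^p, and |G|^2 (Chambers-Mallows-Stuck for the p-stable draw).
theta = rng.uniform(-np.pi / 2, np.pi / 2, N)
w = rng.exponential(1.0, N)
Xp = (np.sin(p * theta) / np.cos(theta) ** (1 / p)
      * (np.cos((1 - p) * theta) / w) ** ((1 - p) / p))
C, G = rng.standard_cauchy(N), rng.standard_normal(N)

for t in (1.0, 2.0, 5.0, 10.0):
    print(t,
          np.mean(np.abs(C) >= t),        # Cauchy tail (upper curve)
          np.mean(np.abs(Xp) ** p >= t),  # p-stable tail (middle curve)
          np.mean(G ** 2 >= t))           # squared-Gaussian tail (lower curve)
```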
Lemma 8 suggests that we can use Lemma 5 (regarding Cauchy random variables) to derive upper tail inequalities for general p-stable distributions and that we can use Lemma 7 (regarding Gaussian variables) to derive lower tail inequalities for general p-stable distributions. The following two lemmas establish these results; the proofs of these lemmas are provided in Appendixes A.5 and A.6, respectively.

Lemma 9. Given p ∈ (1, 2), for i = 1, ..., m, let X_i be m (not necessarily independent) random variables sampled from D_p, and γ_i > 0 with γ = Σ_i γ_i. Let X = Σ_i γ_i |X_i|^p. Assume that m ≥ 3. Then for any t ≥ 1,
    Pr[X ≥ α_p t γ] ≤ 2 log(mt)/t.

Lemma 10. For i = 1, ..., m, let X_i be independent random variables sampled from D_p, and γ_i ≥ 0 with γ = Σ_i γ_i. Let X = Σ_i γ_i |X_i|^p. Then,
    log Pr[X ≤ (1 − t)β_p γ] ≤ −γt² / (6 max_i γ_i).
Given these results, here is our main result for input-sparsity time low-distortion subspace embeddings for ℓ_p. The proof of this theorem is similar to the proof of Theorem 2, except that we replace the ℓ_1 norm ‖·‖_1 by ‖·‖_p^p and use Lemmas 9 and 10 (rather than Lemmas 5 and 6).

Theorem 4. Given A ∈ R^{n×d} and p ∈ (1, 2), let Π = SD ∈ R^{s×n}, where S ∈ R^{s×n} has each column chosen independently and uniformly from the s standard basis vectors of R^s, and where D ∈ R^{n×n} is a diagonal matrix with diagonals chosen independently from D_p. Set s = ωd^5 log^5 d …

Citations
Book
David P. Woodruff
14 Nov 2014
TL;DR: A survey of linear sketching algorithms for numerical linear algebra can be found in this paper, where the authors consider least squares as well as robust regression problems, low rank approximation, and graph sparsification.
Abstract: This survey highlights the recent advances in algorithms for numerical linear algebra that have come from the technique of linear sketching, whereby given a matrix, one first compresses it to a much smaller matrix by multiplying it by a (usually) random matrix with certain properties. Much of the expensive computation can then be performed on the smaller matrix, thereby accelerating the solution for the original problem. In this survey we consider least squares as well as robust regression problems, low rank approximation, and graph sparsification. We also discuss a number of variants of these problems. Finally, we discuss the limitations of sketching methods.

584 citations

Proceedings ArticleDOI
01 Jun 2013
TL;DR: The fastest known algorithms for overconstrained least-squares regression, low-rank approximation, approximating all leverage scores, and ℓ_p-regression are obtained.
Abstract: We design a new distribution over poly(rε^{-1}) × n matrices S so that for any fixed n × d matrix A of rank r, with probability at least 9/10, ‖SAx‖_2 = (1 ± ε)‖Ax‖_2 simultaneously for all x ∈ R^d. Such a matrix S is called a subspace embedding. Furthermore, SA can be computed in O(nnz(A)) + Õ(r²ε^{-2}) time, where nnz(A) is the number of non-zero entries of A. This improves over all previous subspace embeddings, which required at least Ω(nd log d) time to achieve this property. We call our matrices S sparse embedding matrices. Using our sparse embedding matrices, we obtain the fastest known algorithms for overconstrained least-squares regression, low-rank approximation, approximating all leverage scores, and ℓ_p-regression: to output an x' for which ‖Ax' − b‖_2 ≤ (1+ε) min_x ‖Ax − b‖_2 for an n × d matrix A and an n × 1 column vector b, we obtain an algorithm running in O(nnz(A)) + Õ(d³ε^{-2}) time, and another in O(nnz(A) log(1/ε)) + Õ(d³ log(1/ε)) time. (Here Õ(f) = f · log^{O(1)}(f).) To obtain a decomposition of an n × n matrix A into a product of an n × k matrix L, a k × k diagonal matrix D, and an n × k matrix W, for which ‖A − LDW‖_F ≤ (1+ε)‖A − A_k‖_F, where A_k is the best rank-k approximation, our algorithm runs in O(nnz(A)) + Õ(nk²ε^{-4} log n + k³ε^{-5} log² n) time. To output an approximation to all leverage scores of an n × d input matrix A simultaneously, with constant relative error, our algorithms run in O(nnz(A) log n) + Õ(r³) time. To output an x' for which ‖Ax' − b‖_p ≤ (1+ε) min_x ‖Ax − b‖_p for an n × d matrix A and an n × 1 column vector b, we obtain an algorithm running in O(nnz(A) log n) + poly(rε^{-1}) time, for any constant 1 ≤ p

443 citations

Journal ArticleDOI
TL;DR: A new distribution over m × n matrices S is designed so that, for any fixed n × d matrix A of rank r, with probability at least 9/10, ‖SAx‖_2 = (1 ± ε)‖Ax‖_2 simultaneously for all x ∈ R^d.
Abstract: We design a new distribution over m × n matrices S so that, for any fixed n × d matrix A of rank r, with probability at least 9/10, ‖SAx‖_2 = (1 ± ε)‖Ax‖_2 simultaneously for all x ∈ R^d. Here, m is bounded by a polynomial in rε^{-1}, and the parameter ε ∈ (0, 1]. Such a matrix S is called a subspace embedding. Furthermore, SA can be computed in O(nnz(A)) time, where nnz(A) is the number of nonzero entries of A. This improves over all previous subspace embeddings, for which computing SA required at least Ω(nd log d) time. We call these S sparse embedding matrices. Using our sparse embedding matrices, we obtain the fastest known algorithms for overconstrained least-squares regression, low-rank approximation, approximating all leverage scores, and ℓ_p regression. More specifically, let b be an n × 1 vector, ε > 0 a small enough value, and integers k, p ≥ 1. Our results include the following. Regression: the regression problem is to find a d × 1 vector x' for which ‖Ax' − b‖_p ≤ (1 + ε) min_x ‖Ax − b‖_p. For the Euclidean case p = 2, we obtain an algorithm running in O(nnz(A)) + Õ(d³ε^{-2}) time, and another in O(nnz(A) log(1/ε)) + Õ(d³ log(1/ε)) time. (Here, Õ(f) = f · log^{O(1)}(f).) For p ∈ [1, ∞), more generally, we obtain an algorithm running in O(nnz(A) log n) + O(rε^{-1})^C time, for a fixed C. Low-rank approximation: we give an algorithm to obtain a rank-k matrix Â_k such that ‖A − Â_k‖_F ≤ (1 + ε)‖A − A_k‖_F, where A_k is the best rank-k approximation to A. (That is, A_k is the output of principal components analysis, produced by a truncated singular value decomposition, useful for latent semantic indexing and many other statistical problems.) Our algorithm runs in O(nnz(A)) + Õ(nk²ε^{-4} + k³ε^{-5}) time. Leverage scores: we give an algorithm to estimate the leverage scores of A, up to a constant factor, in O(nnz(A) log n) + Õ(r³) time.

381 citations

Journal ArticleDOI
David P. Woodruff
TL;DR: This survey highlights the recent advances in algorithms for numerical linear algebra that have come from the technique of linear sketching, and considers least squares as well as robust regression problems, low rank approximation, and graph sparsification.
Abstract: This survey highlights the recent advances in algorithms for numerical linear algebra that have come from the technique of linear sketching, whereby given a matrix, one first compresses it to a much smaller matrix by multiplying it by a (usually) random matrix with certain properties. Much of the expensive computation can then be performed on the smaller matrix, thereby accelerating the solution for the original problem. In this survey we consider least squares as well as robust regression problems, low rank approximation, and graph sparsification. We also discuss a number of variants of these problems. Finally, we discuss the limitations of sketching methods.

335 citations

References
Journal ArticleDOI
TL;DR: In this article, a nonlinear transformation of two independent uniform random variables into one stable random variable is presented, which is a continuous function of each of the uniform random variables, and of α and a modified skewness parameter β', throughout their respective permissible ranges.
Abstract: A new algorithm is presented for simulating stable random variables on a digital computer for arbitrary characteristic exponent α(0 < α ≤ 2) and skewness parameter β(-1 ≤ β ≤ 1). The algorithm involves a nonlinear transformation of two independent uniform random variables into one stable random variable. This stable random variable is a continuous function of each of the uniform random variables, and of α and a modified skewness parameter β' throughout their respective permissible ranges.

1,124 citations

Proceedings ArticleDOI
21 Oct 2006
TL;DR: In this paper, the authors present a (1 + ε)-approximation algorithm for the singular value decomposition of an m × n matrix A with M non-zero entries that requires 2 passes over the data.
Abstract: Recently several results appeared that show significant reduction in time for matrix multiplication, singular value decomposition as well as linear (ℓ_2) regression, all based on data dependent random sampling. Our key idea is that low dimensional embeddings can be used to eliminate data dependence and provide more versatile, linear time pass efficient matrix computation. Our main contribution is summarized as follows. Independent of the recent results of Har-Peled and of Deshpande and Vempala, one of the first, and to the best of our knowledge the most efficient, relative error (1 + ε)‖A − A_k‖_F approximation algorithms for the singular value decomposition of an m × n matrix A with M non-zero entries that requires 2 passes over the data and runs in time O((M(k/ε + k log k) + (n + m)(k/ε + k log k)²) log(1/δ)). The first o(nd²) time (1 + ε) relative error approximation algorithm for n × d linear (ℓ_2) regression. A matrix multiplication and norm approximation algorithm that easily applies to implicitly given matrices and can be used as a black box probability boosting tool.

852 citations

Journal ArticleDOI
TL;DR: An algorithm is presented that preferentially chooses columns and rows that exhibit high “statistical leverage” and exert a disproportionately large “influence” on the best low-rank fit of the data matrix, obtaining improved relative-error and constant-factor approximation guarantees in worst-case analysis, as opposed to the much coarser additive-error guarantees of prior work.
Abstract: Principal components analysis and, more generally, the Singular Value Decomposition are fundamental data analysis tools that express a data matrix in terms of a sequence of orthogonal or uncorrelated vectors of decreasing importance. Unfortunately, being linear combinations of up to all the data points, these vectors are notoriously difficult to interpret in terms of the data and processes generating the data. In this article, we develop CUR matrix decompositions for improved data analysis. CUR decompositions are low-rank matrix decompositions that are explicitly expressed in terms of a small number of actual columns and/or actual rows of the data matrix. Because they are constructed from actual data elements, CUR decompositions are interpretable by practitioners of the field from which the data are drawn (to the extent that the original data are). We present an algorithm that preferentially chooses columns and rows that exhibit high “statistical leverage” and, thus, in a very precise statistical sense, exert a disproportionately large “influence” on the best low-rank fit of the data matrix. By selecting columns and rows in this manner, we obtain improved relative-error and constant-factor approximation guarantees in worst-case analysis, as opposed to the much coarser additive-error guarantees of prior work. In addition, since the construction involves computing quantities with a natural and widely studied statistical interpretation, we can leverage ideas from diagnostic regression analysis to employ these matrix decompositions for exploratory data analysis.

815 citations

Posted Content
TL;DR: This monograph will provide a detailed overview of recent work on the theory of randomized matrix algorithms as well as the application of those ideas to the solution of practical problems in large-scale data analysis.
Abstract: Randomized algorithms for very large matrix problems have received a great deal of attention in recent years. Much of this work was motivated by problems in large-scale data analysis, and this work was performed by individuals from many different research communities. This monograph will provide a detailed overview of recent work on the theory of randomized matrix algorithms as well as the application of those ideas to the solution of practical problems in large-scale data analysis. An emphasis will be placed on a few simple core ideas that underlie not only recent theoretical advances but also the usefulness of these tools in large-scale data applications. Crucial in this context is the connection with the concept of statistical leverage. This concept has long been used in statistical regression diagnostics to identify outliers; and it has recently proved crucial in the development of improved worst-case matrix algorithms that are also amenable to high-quality numerical implementation and that are useful to domain scientists. Randomized methods solve problems such as the linear least-squares problem and the low-rank matrix approximation problem by constructing and operating on a randomized sketch of the input matrix. Depending on the specifics of the situation, when compared with the best previously-existing deterministic algorithms, the resulting randomized algorithms have worst-case running time that is asymptotically faster; their numerical implementations are faster in terms of clock-time; or they can be implemented in parallel computing environments where existing numerical algorithms fail to run at all. Numerous examples illustrating these observations will be described in detail.

639 citations


Frequently Asked Questions (16)
Q1. What are the contributions in "Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression"?

The authors show that, given a matrix A ∈ R^{n×d} with n ≫ d and a p ∈ [1, 2), with a constant probability, they can construct a low-distortion embedding matrix Π ∈ R^{O(poly(d))×n} that embeds A_p, the ℓ_p subspace spanned by A's columns, into (R^{O(poly(d))}, ‖·‖_p); the distortion of their embeddings is only O(poly(d)), and they can compute ΠA in O(nnz(A)) time, i.e., input-sparsity time. Their result generalizes the input-sparsity time ℓ_2 subspace embedding by Clarkson and Woodruff [STOC'13]; and for completeness, the authors present a simpler and improved analysis of their construction for ℓ_2. For ℓ_2, the authors show that a (1 + ε)-approximate solution to the ℓ_2 regression problem specified by the matrix A and a vector b ∈ R^n can be computed in O(nnz(A) + d³ log(d/ε)/ε²) time; and for ℓ_p, via a subspace-preserving sampling procedure, they show that a (1 ± ε)-distortion embedding of A_p into R^{O(poly(d))} can be computed in O(nnz(A) · log n) time, and they also show that a (1 + ε)-approximate solution to the ℓ_p regression problem min_{x∈R^d} ‖Ax − b‖_p can be computed in O(nnz(A) · log n + poly(d) log(1/ε)/ε²) time. Moreover, they can also improve the embedding dimension, or equivalently the sample size, to O(d^{3+p/2} log(1/ε)/ε²) without increasing the complexity.

A parameterized family of regression problems that is of particular interest is the overconstrained ℓ_p regression problem: given a matrix A ∈ R^{n×d}, with n > d, a vector b ∈ R^n, a norm ‖·‖_p parameterized by p ∈ [1, ∞], and an error parameter ε > 0, find a (1 + ε)-approximate solution x̂ ∈ R^d to f* = min_{x∈R^d} ‖Ax − b‖_p.

Given R ∈ R^{d×d} such that AR^{−1} is well-conditioned in the ℓ_p norm, the authors can construct a (1 ± ε)-distortion embedding, specifically a subspace-preserving sampling, of A_p in O(nnz(A) · log n) additional time and with a constant probability.

Given an ℓ_p regression problem specified by A ∈ R^{n×d}, b ∈ R^n, and p ∈ [1, ∞), let S be a (1 ± ε)-distortion embedding matrix of the subspace spanned by A's columns and b from Lemma 3, and let x̂ be an optimal solution to the subsampled problem min_{x∈R^d} ‖SAx − Sb‖_p.

The authors are interested in fast embedding of A_p into a d-dimensional subspace of (R^{poly(d)}, ‖·‖_p), with distortion either poly(d) or (1 ± ε), for some ε > 0, as well as applications of this embedding to problems such as ℓ_p regression.

The (1 ± ε)-distortion subspace embedding (for ℓ_p, p ∈ [1, 2), that the authors construct from the input-sparsity time embedding and the fast subspace-preserving sampling) has embedding dimension s = O(poly(d) log(1/ε)/ε²), where the somewhat large poly(d) term directly multiplies the log(1/ε)/ε² term.

Given a matrix A ∈ R^{n×d}, p ∈ [1, ∞), ε > 0, and a matrix R ∈ R^{d×d} such that AR^{−1} is well-conditioned, it takes O(nnz(A) · log n) time to compute a sampling matrix S ∈ R^{s×n} (with only one nonzero element per row) with s = O(κ̄_p^p(AR^{−1}) d^{|p/2−1|+1} log(1/ε)/ε²).

In Theorem 2 and Theorem 4, the embedding dimension is s = O(poly(d) log(1/ε)/ε²), where the poly(d) term is a somewhat large polynomial of d that directly multiplies the log(1/ε)/ε² term.

In addition, the authors can use it to compute a (1 + ε)-approximation to the ℓ_1 regression problem in O(nnz(A) · log n + poly(d/ε)) time, which in turn leads to immediate improvements in ℓ_1-based matrix approximation objectives, e.g., for the ℓ_1 subspace approximation problem [6, 29, 10].

The authors also show that, by coupling with recent work on fast subspace-preserving sampling from [10], these embeddings can be used to provide (1 + ε)-approximate solutions to ℓ_p regression problems, for p ∈ [1, 2], in nearly input-sparsity time.

Clarkson and Woodruff achieve their improved results for ℓ_2-based problems by showing how to construct such a Π with s = poly(d/ε) and showing that it can be applied to an arbitrary A in O(nnz(A)) time [11].

Without any prior knowledge, the authors have to scan at least a constant portion of the input to guarantee that a_j is observed with a constant probability, which takes O(nnz(A)) time.

Definition 4. A distribution D over R is called p-stable if, for any m real numbers a_1, ..., a_m, we have Σ_{i=1}^m a_i X_i ≃ (Σ_{i=1}^m |a_i|^p)^{1/p} X, where the X_i are i.i.d. draws from D and X ∼ D. By "X ≃ Y", the authors mean X and Y have the same distribution.

The authors want to thank P. Drineas for reading the first version of this paper and pointing out that the embedding dimension in Theorem 1 can be easily improved from O(d⁴/ε²) to O(d²/ε²) using the same technique.

Given a subspace-preserving sampling algorithm, Clarkson et al. [10, Theorem 5.4] show it is straightforward to compute a (1+ε)/(1−ε)-approximate solution to an ℓ_p regression problem.

Although their simpler direct proof leads to a better result for the ℓ_2 subspace embedding, the technique used in the proof of Clarkson and Woodruff [11], which splits coordinates into "heavy" and "light" sets based on the leverage scores, highlights an important structural property of the ℓ_2 subspace: only a small subset of coordinates can have large ℓ_2 leverage scores.