Proceedings ArticleDOI

Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression

01 Jun 2013 - pp 91-100
TL;DR: In this article, a low-distortion embedding matrix Π ∈ R^{O(poly(d))×n} that embeds A_p, the ℓ_p subspace spanned by A's columns, into (R^{O(poly(d))}, ‖·‖_p), is constructed in O(nnz(A)) time.
Abstract: Low-distortion embeddings are critical building blocks for developing random sampling and random projection algorithms for common linear algebra problems. We show that, given a matrix A ∈ R^{n×d} with n ≫ d and a p ∈ [1, 2), with a constant probability, we can construct a low-distortion embedding matrix Π ∈ R^{O(poly(d))×n} that embeds A_p, the ℓ_p subspace spanned by A's columns, into (R^{O(poly(d))}, ‖·‖_p); the distortion of our embeddings is only O(poly(d)), and we can compute ΠA in O(nnz(A)) time, i.e., input-sparsity time. Our result generalizes the input-sparsity time ℓ_2 subspace embedding by Clarkson and Woodruff [STOC'13]; and for completeness, we present a simpler and improved analysis of their construction for ℓ_2. These input-sparsity time ℓ_p embeddings are optimal, up to constants, in terms of their running time; and the improved running time propagates to applications such as (1 ± ε)-distortion ℓ_p subspace embedding and relative-error ℓ_p regression. For ℓ_2, we show that a (1 + ε)-approximate solution to the ℓ_2 regression problem specified by the matrix A and a vector b ∈ R^n can be computed in O(nnz(A) + d³ log(d/ε)/ε²) time; and for ℓ_p, via a subspace-preserving sampling procedure, we show that a (1 ± ε)-distortion embedding of A_p into R^{O(poly(d))} can be computed in O(nnz(A) · log n) time, and we also show that a (1 + ε)-approximate solution to the ℓ_p regression problem min_{x∈R^d} ‖Ax − b‖_p can be computed in O(nnz(A) · log n + poly(d) log(1/ε)/ε²) time. Moreover, we can also improve the embedding dimension, or equivalently the sample size, to O(d^{3+p/2} log(1/ε)/ε²) without increasing the complexity.

Summary (2 min read)

1. INTRODUCTION

  • Regression problems are ubiquitous, and the fast computation of their solutions is of interest in many large-scale data applications.
  • The authors' analysis is direct and does not rely on splitting the high-dimensional space into a set of heavy-hitters consisting of the high-leverage components and the complement of that heavy-hitting set.
  • In general, the authors prove that there exists an order among the Cauchy distribution, a p-stable distribution with p ∈ (1, 2), and the Gaussian distribution such that for all p ∈ (1, 2) one can use the upper bound from the Cauchy distribution and the lower bound from the Gaussian distribution.
  • The (1 ± ε)-distortion subspace embedding (for ℓ_p, p ∈ [1, 2), that the authors construct from the input-sparsity time embedding and the fast subspace-preserving sampling) has embedding dimension s = O(poly(d) log(1/ε)/ε²), where the somewhat large poly(d) term directly multiplies the log(1/ε)/ε² term.

Conditioning.

  • The ℓ_p subspace embedding and ℓ_p regression problems are closely related to the concept of conditioning.
  • The authors state here two related notions of ℓ_p-norm conditioning and then a lemma that characterizes the relationship between them.

Lemma 1 ([10])

  • This procedure is called conditioning, and there exist two approaches for conditioning: via low-distortion ℓ_p subspace embedding and via ellipsoidal rounding.
  • The authors simply cite the following lemma, which is based on ellipsoidal rounding.

Stable distributions.

  • The authors use properties of p-stable distributions for analyzing input-sparsity time low-distortion ℓ_p subspace embeddings.
  • By Lévy [19], it is known that p-stable distributions exist for p ∈ (0, 2]; and from Chambers et al. [7], it is known that p-stable random variables can be generated efficiently, thus allowing their practical use.

Tail inequalities.

  • The authors note two inequalities from Clarkson et al. [10] regarding the tails of the Cauchy distribution.
  • The following result about Gaussian variables is a direct consequence of Maurer's inequality ([22]), and the authors will use it to derive lower tail inequalities for p-stable distributions.

3. MAIN RESULTS FOR ℓ_2 EMBEDDING

  • Here is their result for input-sparsity time low-distortion subspace embeddings for ℓ_2.
  • See also Nelson and Nguyen [26] for a similar result with a slightly better constant.
  • The O(nnz(A)) running time is indeed optimal, up to constant factors, for general inputs.
  • The results of Theorem 1 propagate to related applications, e.g., to the ℓ_2 regression problem, the low-rank matrix approximation problem, and the problem of computing approximations to the ℓ_2 leverage scores.
  • The technique used in the proof of Clarkson and Woodruff [11], which splits coordinates into "heavy" and "light" sets based on the leverage scores, highlights an important structural property of the ℓ_2 subspace: only a small subset of coordinates can have large ℓ_2 leverage scores.

4. MAIN RESULTS FOR ℓ_1 EMBEDDING

  • Here is their result for input-sparsity time low-distortion subspace embeddings for ℓ_1.
  • As mentioned above, the O(nnz(A)) running time is optimal.
  • For the same construction of Π, the authors provide a "bad" case that yields a lower bound on the distortion.
  • The authors' input-sparsity time ℓ_1 subspace embedding of Theorem 2 improves the O(nnz(A) · d log d)-time embedding by Sohler and Woodruff [29] and the O(nd log n)-time embedding of Clarkson et al. [10].
  • The authors' improvements in Theorems 2 and 3 also propagate to related ℓ_1-based applications, including the ℓ_1 regression and the ℓ_1 subspace approximation problem considered in [29, 10].

5. MAIN RESULTS FOR ℓ_p EMBEDDING

  • Generally, D_p does not have an explicit PDF/CDF, which increases the difficulty of theoretical analysis.
  • Lemma 8 suggests that the authors can use Lemma 5 (regarding Cauchy random variables) to derive upper tail inequalities for general p-stable distributions and that they can use Lemma 7 (regarding Gaussian variables) to derive lower tail inequalities for general p-stable distributions.
  • Given these results, here is their main result for input-sparsity time low-distortion subspace embeddings for ℓ_p.
  • In particular, the authors can establish an improved algorithm for solving the ℓ_p regression problem in nearly input-sparsity time.

6. IMPROVED EMBEDDING DIMENSION

  • (See the remark below for comments on the precise value of the poly(d) term.)
  • This is not ideal for the subspace embedding and the ℓ_p regression, because the authors want to have a small embedding dimension and a small subsampled problem, respectively.
  • Here, the authors show that it is possible to decouple the large polynomial of d and the log(1/ε)/ε² term via another round of sampling and conditioning, without increasing the complexity.
  • See Algorithm 2 for details on this procedure.

Algorithm 2 Improving the Embedding Dimension

  • Then, by applying Theorem 7 to the ℓ_p regression problem, the authors can improve the size of the subsampled problem and hence the overall running time.
  • The authors have stated their results in the previous sections as poly(d) without stating the value of the polynomial because there are numerous trade-offs between the conditioning quality and the running time.


Low-distortion Subspace Embeddings in Input-sparsity Time and Applications to Robust Linear Regression

Xiangrui Meng*
LinkedIn Corporation
2029 Stierlin Ct, Mountain View, CA 94043
ximeng@linkedin.com

Michael W. Mahoney
Dept. of Mathematics, Stanford University
Stanford, CA 94305
mmahoney@cs.stanford.edu

* Most of this work was done while the author was at ICME, Stanford University, supported by NSF DMS-1009005.

ABSTRACT
Low-distortion embeddings are critical building blocks for developing random sampling and random projection algorithms for common linear algebra problems. We show that, given a matrix A ∈ R^{n×d} with n ≫ d and a p ∈ [1, 2), with a constant probability, we can construct a low-distortion embedding matrix Π ∈ R^{O(poly(d))×n} that embeds A_p, the ℓ_p subspace spanned by A's columns, into (R^{O(poly(d))}, ‖·‖_p); the distortion of our embeddings is only O(poly(d)), and we can compute ΠA in O(nnz(A)) time, i.e., input-sparsity time. Our result generalizes the input-sparsity time ℓ_2 subspace embedding by Clarkson and Woodruff [STOC'13]; and for completeness, we present a simpler and improved analysis of their construction for ℓ_2. These input-sparsity time ℓ_p embeddings are optimal, up to constants, in terms of their running time; and the improved running time propagates to applications such as (1 ± ε)-distortion ℓ_p subspace embedding and relative-error ℓ_p regression. For ℓ_2, we show that a (1 + ε)-approximate solution to the ℓ_2 regression problem specified by the matrix A and a vector b ∈ R^n can be computed in O(nnz(A) + d³ log(d/ε)/ε²) time; and for ℓ_p, via a subspace-preserving sampling procedure, we show that a (1 ± ε)-distortion embedding of A_p into R^{O(poly(d))} can be computed in O(nnz(A) · log n) time, and we also show that a (1 + ε)-approximate solution to the ℓ_p regression problem min_{x∈R^d} ‖Ax − b‖_p can be computed in O(nnz(A) · log n + poly(d) log(1/ε)/ε²) time. Moreover, we can also improve the embedding dimension, or equivalently the sample size, to O(d^{3+p/2} log(1/ε)/ε²) without increasing the complexity.

Categories and Subject Descriptors
F.2 [ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY]: Numerical Algorithms and Problems

Keywords
subspace embedding; input-sparsity time; low-distortion embedding; linear regression; robust regression; ℓ_p regression
1. INTRODUCTION
Regression problems are ubiquitous, and the fast computation of their solutions is of interest in many large-scale data applications. A parameterized family of regression problems that is of particular interest is the overconstrained ℓ_p regression problem: given a matrix A ∈ R^{n×d}, with n > d, a vector b ∈ R^n, a norm ‖·‖_p parameterized by p ∈ [1, ∞], and an error parameter ε > 0, find a (1 + ε)-approximate solution x̂ ∈ R^d to
    f* = min_{x∈R^d} ‖Ax − b‖_p,
i.e., find a vector x̂ such that ‖Ax̂ − b‖_p ≤ (1 + ε) f*, where the ℓ_p norm of a vector x is ‖x‖_p = (Σ_i |x_i|^p)^{1/p}, defined to be max_i |x_i| for p = ∞. Special cases include the ℓ_2 regression problem, also known as the Least Squares problem, and the ℓ_1 regression problem, also known as the Least Absolute Deviations or Least Absolute Errors problem. The latter is of particular interest as a robust estimation or robust regression technique, in that it is less sensitive to the presence of outliers than the former. We are most interested in this paper in the ℓ_1 regression problem due to its robustness properties, but our methods hold for general p ∈ [1, 2], and thus we formulate our results in ℓ_p.
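To make the overconstrained problem concrete, here is a small, self-contained sketch (ours, not from the paper) that builds a noisy instance with a few gross outliers and solves it for p = 2 by least squares and for p = 1 via the standard linear-programming reformulation with slack variables; NumPy/SciPy, the variable names, and the outlier setup are our own choices.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 200, 3
A = rng.standard_normal((n, d))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.1 * rng.standard_normal(n)
b[:5] += 50.0                                   # a few gross outliers

# p = 2: ordinary least squares, min_x ||Ax - b||_2
x_l2, *_ = np.linalg.lstsq(A, b, rcond=None)

# p = 1: min_x ||Ax - b||_1 as an LP with slacks t_i >= |(Ax - b)_i|
c = np.concatenate([np.zeros(d), np.ones(n)])
A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])
b_ub = np.concatenate([b, -b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + n))
x_l1 = res.x[:d]

print("l2 error:", np.linalg.norm(x_l2 - x_true))
print("l1 error:", np.linalg.norm(x_l1 - x_true))  # typically much closer to x_true
```

On instances like this, the ℓ_1 fit is usually far less affected by the corrupted rows, which is the robustness property the paper emphasizes.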
It is well-known that for p ≥ 1, the overconstrained ℓ_p regression problem is a convex optimization problem; for p = 1 and p = ∞, it is an instance of linear programming; and for p = 2, it can be solved with eigenvector-based methods such as with the QR decomposition or the Singular Value Decomposition of A. In spite of their low-degree polynomial-time solvability, ℓ_p regression problems have been the focus in recent years of a wide range of random sampling and random projection algorithms, largely due to a desire to develop improved algorithms for large-scale data applications [3, 24, 10]. For example, Clarkson [9] uses subgradient and sampling methods to compute an approximate solution to the overconstrained ℓ_1 regression problem in roughly O(nd^5 log n) time; and Dasgupta et al. [12] use well-conditioned bases and subspace-preserving sampling algorithms to solve general ℓ_p regression problems, for p ∈ [1, ∞), in roughly O(nd^5 log n) time. A similar subspace-preserving sampling algorithm was developed by Drineas, Mahoney, and Muthukrishnan [16] to compute an approximate solution to the ℓ_2 regression problem. The algorithm of [16] relies on the estimation of the ℓ_2 leverage scores of A to be used as an importance sampling distribution (recall that for an n × d matrix A, with n ≫ d, the ℓ_2 leverage scores of the rows of A are equal to the diagonal elements of the projection matrix onto the span of A; see [20, 15] for details, and note that they can be generalized to ℓ_1 and other ℓ_p norms [10] as well as to arbitrary n × d matrices, with both n and d large [21, 15]); but when combined with the results of Sarlós [28] and Drineas et al. [17] (that quickly preprocess A to uniformize those scores) or Drineas et al. [15] (that quickly computes approximations to those scores), this leads to a random projection or random sampling (respectively) algorithm for the ℓ_2 regression problem that runs in roughly O(nd log d) time [17, 20]. More recently, Sohler and Woodruff [29] introduced the Cauchy Transform to obtain improved ℓ_1 embeddings, thereby leading to an algorithm for the ℓ_1 regression problem that runs in O(nd^{1.376+}) time; and Clarkson et al. [10] use the Fast Cauchy Transform and ellipsoidal rounding methods to compute an approximation to the solution of general ℓ_p regression problems in roughly O(nd log n) time.
These algorithms, and in particular the algorithms for p = 2, form the basis for much of the large body of recent work in randomized algorithms for low-rank matrix approximation, and thus optimizing their properties can have immediate practical benefits. See, e.g., the recent monograph of Mahoney [20] and references therein for details. Although some of these algorithms are near-optimal for dense inputs, they all require Ω(nd log d) time, which can be large if the input matrix is very sparse. Thus, it was a significant result when Clarkson and Woodruff [11] developed an algorithm for the ℓ_2 regression problem (as well as the related problems of low-rank matrix approximation and ℓ_2 leverage score approximation) that runs in input-sparsity time, i.e., in O(nnz(A) + poly(d/ε)) time, where nnz(A) is the number of non-zero elements in A and ε is an error parameter. This result depends on the construction of a sparse embedding matrix Π for ℓ_2. By this, we mean the following: for an n × d matrix A, an s × n matrix Π such that
    (1 − ε)‖Ax‖_2 ≤ ‖ΠAx‖_2 ≤ (1 + ε)‖Ax‖_2,
for all x ∈ R^d. That is, Π embeds the column space of A into R^s, while approximately preserving the ℓ_2 norms of all vectors in that subspace. Clarkson and Woodruff achieve their improved results for ℓ_2-based problems by showing how to construct such a Π with s = poly(d/ε) and showing that it can be applied to an arbitrary A in O(nnz(A)) time [11]. (In particular, this embedding result improves the result of Meng, Saunders, and Mahoney [24], who in their development of the parallel least-squares solver LSRN use a result from Davidson and Szarek [14] to construct a constant-distortion embedding for ℓ_2 that runs in O(nnz(A) · d) time.)
Interestingly, the analysis of Clarkson and Woodruff coupled ideas from the data streaming literature with the structural fact that there cannot be too many high-leverage constraints/rows in A. In particular, they showed that the high-leverage parts of the subspace may be viewed as heavy-hitters that are "perfectly hashed," and thus contribute no distortion, and that the distortion of the rest of the subspace as well as the "cross terms" may be bounded with a result of Dasgupta, Kumar, and Sarlós [13].
In this paper, we provide improved low-distortion subspace embeddings for ℓ_p, for all p ∈ [1, 2], in input-sparsity time. We also show that, by coupling with recent work on fast subspace-preserving sampling from [10], these embeddings can be used to provide (1 + ε)-approximate solutions to ℓ_p regression problems, for p ∈ [1, 2], in nearly input-sparsity time. In more detail, our main results are the following.
First, for ℓ_2, we obtain an improved result for the input-sparsity time (1 ± ε)-distortion embedding of [11]. In particular, for the same embedding procedure, we obtain improved bounds for the embedding dimension with a much simpler analysis than [11]. See Theorem 1 of Section 3 for a precise statement of this result. Our analysis is direct and does not rely on splitting the high-dimensional space into a set of heavy-hitters consisting of the high-leverage components and the complement of that heavy-hitting set. Since our result directly improves the ℓ_2 embedding result of Clarkson and Woodruff [11], it immediately leads to improvements for the ℓ_2 regression, low-rank matrix approximation, and ℓ_2 leverage score estimation problems that they consider.
Second, for ℓ_1, we obtain a low-distortion sparse embedding matrix Π such that ΠA can be computed in input-sparsity time. That is, we construct an embedding matrix Π ∈ R^{O(poly(d))×n} such that, for all x ∈ R^d,
    1/O(poly(d)) · ‖Ax‖_1 ≤ ‖ΠAx‖_1 ≤ O(poly(d)) · ‖Ax‖_1,
with a constant probability, and ΠA can be computed in O(nnz(A)) time. See Theorem 2 of Section 4 for a precise statement of this result. Here, our proof involves splitting the set Y = {Ux | ‖x‖_∞ = 1, x ∈ R^d}, where U is an ℓ_1 well-conditioned basis for the span of A, into two parts, informally a subset where coordinates of high ℓ_1 leverage dominate ‖y‖_1 and the complement of that subset. This ℓ_1 result leads to immediate improvements in ℓ_1-based problems. For example, by taking advantage of the fast version of subspace-preserving sampling from [10], we can construct and apply a (1 ± ε)-distortion sparse embedding matrix for ℓ_1 in O(nnz(A) · log n + poly(d/ε)) time. In addition, we can use it to compute a (1 + ε)-approximation to the ℓ_1 regression problem in O(nnz(A) · log n + poly(d/ε)) time, which in turn leads to immediate improvements in ℓ_1-based matrix approximation objectives, e.g., for the ℓ_1 subspace approximation problem [6, 29, 10].
Third, for ℓ_p, for all p ∈ (1, 2), we obtain a low-distortion sparse embedding matrix Π such that ΠA can be computed in input-sparsity time. That is, we construct an embedding matrix Π ∈ R^{O(poly(d))×n} such that, for all x ∈ R^d,
    1/O(poly(d)) · ‖Ax‖_p ≤ ‖ΠAx‖_p ≤ O(poly(d)) · ‖Ax‖_p,
with a constant probability, and ΠA can be computed in O(nnz(A)) time. See Theorem 4 of Section 5 for a precise statement of this result. Here, our proof generalizes the ℓ_1 result, but we need to prove upper and lower tail bound inequalities for sampling from general p-stable distributions that are of independent interest. Although these distributions don't have closed forms for p ∈ (1, 2) in general, we prove that there exists an order among the Cauchy distribution, a p-stable distribution with p ∈ (1, 2), and the Gaussian distribution such that for all p ∈ (1, 2) we can use the upper bound from the Cauchy distribution and the lower bound from the Gaussian distribution. As with our ℓ_1 result, this ℓ_p result has several extensions: in O(nnz(A) · log n + poly(d/ε)) time, we can construct and apply a (1 ± ε)-distortion sparse embedding matrix for ℓ_p; in O(nnz(A) · log n + poly(d/ε)) time, we can compute a (1 + ε)-approximation to the ℓ_p regression problem; and in O(nnz(A) · d log d) time, we can construct and apply a near-optimal (in terms of embedding dimension and distortion factor) embedding matrix.
The (1 ± ε)-distortion subspace embedding (for ℓ_p, p ∈ [1, 2), that we construct from the input-sparsity time embedding and the fast subspace-preserving sampling) has embedding dimension s = O(poly(d) log(1/ε)/ε²), where the somewhat large poly(d) term directly multiplies the log(1/ε)/ε² term. We can also improve this, showing that it is possible, without increasing the overall complexity, to decouple the large poly(d) and log(1/ε)/ε² via another round of sampling and conditioning, thereby obtaining an embedding dimension that is a small poly(d) times log(1/ε)/ε². See Theorem 7 of Section 6 for a precise statement of this result.
Remark. Subsequent to our posting the first version of this paper on arXiv [23], Clarkson and Woodruff let us know that, independently of us, they used a result from [10] to extend their ℓ_2 subspace embedding from [11] to provide a nearly input-sparsity time algorithm for ℓ_p regression, for all p ∈ [1, ∞). This is now posted as Version 2 of [11]. Their approach requires solving a rounding problem of size O(n/poly(d)) × d, which depends on n (possibly very large). Our approach via input-sparsity time oblivious low-distortion ℓ_p subspace embeddings does not contain this intermediate step, and it only needs O(poly(d)) storage.
Remark. In the first version of this paper, the embedding dimension for ℓ_2 in Theorem 1 was O(d⁴/ε²). Subsequent to the dissemination of this version, Drineas pointed out to us that our result could very easily be improved to O(d²/ε²). Nelson and Nguyen also let us know that, at about the same time and using the same technique, but independent of us, they first published the O(d²/ε²) embedding result [26].
2. BACKGROUND
We use ‖·‖_p to denote the ℓ_p norm of a vector, ‖·‖_2 the spectral norm of a matrix, ‖·‖_F the Frobenius norm of a matrix, and |·|_p the element-wise ℓ_p norm of a matrix. Given A ∈ R^{n×d} with full column rank and p ∈ [1, 2], we use A_p to denote the ℓ_p subspace spanned by A's columns. We are interested in fast embedding of A_p into a d-dimensional subspace of (R^{poly(d)}, ‖·‖_p), with distortion either poly(d) or (1 ± ε), for some ε > 0, as well as applications of this embedding to problems such as ℓ_p regression. We assume that n ≫ poly(d) ≫ d log n. To state our results, we assume that we are capable of computing a (1 + ε)-approximate solution to an ℓ_p regression problem of size n′ × d for some ε > 0, as long as n′ is independent of n. Denote the running time needed to solve this smaller problem by T_p(ε; n′, d). In theory, we have T_2(ε; n′, d) = O(n′ d log(d/ε) + d³) (see Drineas et al. [17]), and T_p(ε; n′, d) = O((n′ d² + poly(d)) log(n′/ε)) for general p (see Mitchell [25]).
Conditioning.
The ℓ_p subspace embedding and ℓ_p regression problems are closely related to the concept of conditioning. We state here two related notions of ℓ_p-norm conditioning and then a lemma that characterizes the relationship between them.

Definition 1 ([10]). Given an n × d matrix A and p ∈ [1, ∞], let σ_p^max(A) = max_{‖x‖_2 ≤ 1} ‖Ax‖_p and let σ_p^min(A) = min_{‖x‖_2 ≥ 1} ‖Ax‖_p. Then, we denote by κ_p(A) the ℓ_p-norm condition number of A: κ_p(A) = σ_p^max(A)/σ_p^min(A). For simplicity, we will use κ_p, σ_p^min, and σ_p^max when the underlying matrix is clear.
Definition 2 ([12]). Given an n × d matrix A and p ∈ [1, ∞], let q be the dual norm of p. Then A is (α, β, p)-conditioned if (1) |A|_p ≤ α, and (2) for all z ∈ R^d, ‖z‖_q ≤ β‖Az‖_p. Define κ̄_p(A) as the minimum value of αβ such that A is (α, β, p)-conditioned.

Lemma 1 ([10]). Given an n × d matrix A and p ∈ [1, ∞]: d^{−|1/2−1/p|} κ_p(A) ≤ κ̄_p(A) ≤ d^{max{1/2, 1/p}} κ_p(A).
Remark. Given the equivalence established by Lemma 1, we will say that A is well-conditioned in the ℓ_p norm if κ_p(A) or κ̄_p(A) = O(poly(d)), independent of n.
Although for an arbitrary matrix A ∈ R^{n×d} the condition numbers κ_p(A) and κ̄_p(A) can be arbitrarily large, we can find a matrix R ∈ R^{d×d} such that AR^{−1} is well-conditioned. This procedure is called conditioning, and there exist two approaches for conditioning: via low-distortion ℓ_p subspace embedding and via ellipsoidal rounding.
Definition 3. Given an n × d matrix A and a number p ∈ [1, ∞], Π ∈ R^{s×n} is a low-distortion embedding of A_p if s = O(poly(d)) and, for all x ∈ R^d,
    1/O(poly(d)) · ‖Ax‖_p ≤ ‖ΠAx‖_p ≤ O(poly(d)) · ‖Ax‖_p.
Remark. Given a low-distortion embedding matrix Π of A_p, let R be the "R" matrix from the QR decomposition of ΠA. Then, AR^{−1} is well-conditioned in the ℓ_p norm.
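A minimal sketch of the conditioning recipe in the remark above (our own code, not the paper's), assuming a sketch ΠA has already been computed by some low-distortion embedding of A_p; the helper name conditioning_matrix is ours.

```python
import numpy as np

def conditioning_matrix(PA):
    """Given a sketch PA = Pi @ A of a tall matrix A, return the R factor of
    the QR decomposition of PA; per the remark above, A @ inv(R) is then
    well-conditioned in the l_p norm."""
    _, R = np.linalg.qr(PA)        # PA is small: O(poly(d)) x d
    return R

# Usage sketch:
#   R = conditioning_matrix(PA)
#   For huge sparse A, never materialize A @ inv(R); instead apply inv(R)
#   (or solve against R) on the fly whenever a row of the basis is needed.
```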
For a discussion of ellipsoidal rounding, we refer readers to Clarkson et al. [10]. In this paper, we simply cite the following lemma, which is based on ellipsoidal rounding.

Lemma 2 ([10]). Given an n × d matrix A and p ∈ [1, ∞], it takes at most O(nd³ log n) time to find a matrix R ∈ R^{d×d} such that κ_p(AR^{−1}) ≤ 2d.
Subspace-preserving sampling and ℓ_p regression.
Given R ∈ R^{d×d} such that AR^{−1} is well-conditioned in the ℓ_p norm, we can construct a (1 ± ε)-distortion embedding, specifically a subspace-preserving sampling, of A_p in O(nnz(A) · log n) additional time and with a constant probability. This result from Clarkson et al. [10, Theorem 5.4] improves the subspace-preserving sampling algorithm proposed by Dasgupta et al. [12] by estimating the row norms of AR^{−1} (instead of computing them exactly) to define importance sampling probabilities.

Lemma 3 ([10]). Given a matrix A ∈ R^{n×d}, p ∈ [1, ∞), ε > 0, and a matrix R ∈ R^{d×d} such that AR^{−1} is well-conditioned, it takes O(nnz(A) · log n) time to compute a sampling matrix S ∈ R^{s×n} (with only one nonzero element per row) with s = O(κ̄_p^p(AR^{−1}) d^{|p/2−1|+1} log(1/ε)/ε²) such that, with a constant probability,
    (1 − ε)‖Ax‖_p ≤ ‖SAx‖_p ≤ (1 + ε)‖Ax‖_p, ∀x ∈ R^d.
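The following simplified sketch (ours) imitates the sampling scheme behind Lemma 3. Unlike [10], it computes the row norms of AR^{−1} exactly and densely rather than estimating them in O(nnz(A) · log n) time, keeps each row independently with probability proportional to its ℓ_p importance, and rescales the kept rows; the function name and parameters are our own.

```python
import numpy as np

def lp_sampling_sketch(A, R, p, s, rng):
    """Row-sampling sketch in the spirit of Lemma 3 (simplified: exact row
    norms of A @ inv(R) instead of the fast estimates used in [10]).
    Row i is kept with probability q_i = min(1, s * t_i / sum(t)), where
    t_i = ||(A R^{-1})_i||_p^p, and is rescaled by q_i**(-1/p)."""
    U = A @ np.linalg.inv(R)                  # well-conditioned basis (dense here)
    t = np.sum(np.abs(U) ** p, axis=1)        # l_p "importance" of each row
    q = np.minimum(1.0, s * t / t.sum())
    keep = rng.random(A.shape[0]) < q
    SA = A[keep] * (q[keep] ** (-1.0 / p))[:, None]
    return SA, keep

# With s chosen as in Lemma 3, ||SA x||_p stays within (1 +/- eps) of
# ||A x||_p for all x, with a constant probability.
```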
Given a subspace-preserving sampling algorithm, Clarkson et al. [10, Theorem 5.4] show it is straightforward to compute a (1+ε)/(1−ε)-approximate solution to an ℓ_p regression problem.

Lemma 4 ([10]). Given an ℓ_p regression problem specified by A ∈ R^{n×d}, b ∈ R^n, and p ∈ [1, ∞), let S be a (1 ± ε)-distortion embedding matrix of the subspace spanned by A's columns and b from Lemma 3, and let x̂ be an optimal solution to the subsampled problem min_{x∈R^d} ‖SAx − Sb‖_p. Then x̂ is a (1+ε)/(1−ε)-approximate solution to the original problem.

Remark. Collecting these results, we see that low-distortion ℓ_p subspace embedding is a fundamental building block (and very likely a bottleneck) for (1 ± ε)-distortion ℓ_p subspace embeddings, as well as for a (1 + ε)-approximation to an ℓ_p regression problem. This motivates our work and its emphasis on finding low-distortion subspace embeddings.

Stable distributions.
We use properties of p-stable distributions for analyzing input-sparsity time low-distortion ℓ_p subspace embeddings.

Definition 4. A distribution D over R is called p-stable if, for any m real numbers a_1, ..., a_m, we have
    Σ_{i=1}^m a_i X_i ≃ (Σ_{i=1}^m |a_i|^p)^{1/p} X,
where the X_i are i.i.d. draws from D and X ∼ D. By X ≃ Y, we mean X and Y have the same distribution.

By Lévy [19], it is known that p-stable distributions exist for p ∈ (0, 2]; and from Chambers et al. [7], it is known that p-stable random variables can be generated efficiently, thus allowing their practical use. Let us use D_p to denote the "standard" p-stable distribution, for p ∈ [1, 2], specified by its characteristic function ψ(t) = e^{−|t|^p}. It is known that D_1 is the standard Cauchy distribution, and that D_2 is the Gaussian distribution with mean 0 and variance 2.
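For concreteness, here is a hedged implementation (ours) of the Chambers-Mallows-Stuck transform cited as [7], for the symmetric "standard" p-stable distribution D_p with characteristic function exp(−|t|^p); the function name and the sanity-check comment are our own. For p = 1 the formula reduces to tan(θ), i.e., the standard Cauchy, and for p = 2 to a Gaussian with mean 0 and variance 2, matching the text above.

```python
import numpy as np

def standard_p_stable(p, size, rng):
    """Draw symmetric p-stable variables with characteristic function
    exp(-|t|^p) via the Chambers-Mallows-Stuck transform [7].
    p = 1 gives the standard Cauchy; p = 2 gives N(0, 2)."""
    theta = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    w = rng.exponential(1.0, size)                     # Exp(1) variable
    return (np.sin(p * theta) / np.cos(theta) ** (1.0 / p)
            * (np.cos((1.0 - p) * theta) / w) ** ((1.0 - p) / p))

rng = np.random.default_rng(0)
x = standard_p_stable(1.5, 100_000, rng)
# Stability check (Definition 4): a1*X1 + a2*X2 should be distributed like
# (|a1|**1.5 + |a2|**1.5) ** (1 / 1.5) * X for independent draws X1, X2, X.
```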
Tail inequalities.
We note two inequalities from Clarkson et al. [10] regarding the tails of the Cauchy distribution.

Lemma 5. For i = 1, ..., m, let C_i be m (not necessarily independent) standard Cauchy variables, and γ_i > 0 with γ = Σ_i γ_i. Let X = Σ_i γ_i |C_i|. For any t > 1,
    Pr[X > γt] ≤ (1/(πt)) · ( log(1 + (2mt)²) / (1 − 1/(πt)) + 1 ).
For simplicity, if we assume that m ≥ 3 and t ≥ 1, then we have Pr[X > γt] ≤ 2 log(mt)/t.

Lemma 6. For i = 1, ..., m, let C_i be independent standard Cauchy random variables, and γ_i ≥ 0 with γ = Σ_i γ_i. Let X = Σ_i γ_i |C_i|. Then, for any t > 0,
    log Pr[X ≤ (1 − t)γ] ≤ −γt² / (3 max_i γ_i).

The following result about Gaussian variables is a direct consequence of Maurer's inequality ([22]), and we will use it to derive lower tail inequalities for p-stable distributions.

Lemma 7. For i = 1, ..., m, let G_i be independent standard Gaussian random variables, and γ_i ≥ 0 with γ = Σ_i γ_i. Let X = Σ_i γ_i |G_i|². Then, for any t > 0,
    log Pr[X ≤ (1 − t)γ] ≤ −γt² / (6 max_i γ_i).
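As a quick numerical illustration (ours, not part of the paper), the simplified bound of Lemma 5, Pr[X > γt] ≤ 2 log(mt)/t, can be compared against a Monte Carlo estimate for a weighted sum of Cauchy magnitudes; for simplicity the draws below are independent, although the lemma also allows dependent variables.

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials, t = 50, 200_000, 20.0
gamma_i = rng.random(m)                  # arbitrary positive weights
gamma = gamma_i.sum()

C = rng.standard_cauchy((trials, m))     # independent standard Cauchy draws
X = np.abs(C) @ gamma_i                  # X = sum_i gamma_i * |C_i|

empirical = np.mean(X > t * gamma)
bound = 2 * np.log(m * t) / t            # Lemma 5 (m >= 3, t >= 1)
print(empirical, bound)                  # the empirical tail sits below the bound
```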
3. MAIN RESULTS FOR ℓ_2 EMBEDDING
Here is our result for input-sparsity time low-distortion subspace embeddings for ℓ_2. See also Nelson and Nguyen [26] for a similar result with a slightly better constant.

Theorem 1. Given a matrix A ∈ R^{n×d} and ε, δ ∈ (0, 1), let Π = SD, where S ∈ R^{s×n} has each column chosen independently and uniformly from the s standard basis vectors of R^s and D ∈ R^{n×n} is a diagonal matrix with diagonal entries chosen independently and uniformly from ±1. Let s = (d² + d)/(ε²δ). Then with probability at least 1 − δ,
    (1 − ε)‖Ax‖_2 ≤ ‖ΠAx‖_2 ≤ (1 + ε)‖Ax‖_2, ∀x ∈ R^d.
In addition, ΠA can be computed in O(nnz(A)) time.
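To make the construction in Theorem 1 concrete, here is a small sketch (our own code, not the authors') that applies Π = SD to A in a single pass over its nonzeros without forming Π explicitly: each row i of A is hashed to a uniformly random one of the s output rows (the nonzero position in column i of S) and multiplied by a random sign (the i-th diagonal of D). The helper name and all parameter values are ours.

```python
import numpy as np
import scipy.sparse as sp

def sparse_l2_embed(A, s, rng):
    """Apply the Theorem-1 sketch Pi = S D to A in O(nnz(A)) time.

    Row i of A is multiplied by a random sign and added to a uniformly
    random one of the s output rows. Returns the s-by-d matrix Pi @ A."""
    n, _ = A.shape
    h = rng.integers(0, s, size=n)           # bucket (column of S) for each row
    sign = rng.choice([-1.0, 1.0], size=n)   # diagonal of D
    S = sp.csr_matrix((sign, (h, np.arange(n))), shape=(s, n))
    return S @ A                             # one pass over the nonzeros of A

rng = np.random.default_rng(0)
n, d, eps, delta = 100_000, 10, 0.5, 0.1
A = sp.random(n, d, density=1e-3, random_state=0, format="csr")
s = int((d ** 2 + d) / (eps ** 2 * delta))   # the bound from Theorem 1
PA = sparse_l2_embed(A, s, rng)
x = rng.standard_normal(d)
print(np.linalg.norm(PA @ x), np.linalg.norm(A @ x))
# The two norms agree up to the (1 +/- eps) factor, with prob. >= 1 - delta.
```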
The construction of Π in this theorem is the same as in Clarkson and Woodruff [11]. There, s = O((d/ε)⁴ log²(d/ε)) in order to achieve (1 ± ε) distortion with a constant probability. Theorem 1 shows that it actually suffices to set s = O((d² + d)/ε²). Surprisingly, the proof is rather simple. Let X = U^T Π^T Π U, where U is an orthonormal basis for A_2. Compute E[‖X − I‖_F²] and apply Markov's inequality to ‖X − I‖_F² ≤ ε², which implies ‖X − I‖_2 ≤ ε and hence the embedding result. See Appendix A.1 for a complete proof.
Remark. The O(nnz(A)) running time is indeed optimal, up to constant factors, for general inputs. Consider the case when A has an important row a_j such that A becomes rank-deficient without it. Thus, we have to observe a_j in order to compute a low-distortion embedding. However, without any prior knowledge, we have to scan at least a constant portion of the input to guarantee that a_j is observed with a constant probability, which takes O(nnz(A)) time. Note that this optimality result applies to general p.
The results of Theorem 1 propagate to related applications, e.g., to the ℓ_2 regression problem, the low-rank matrix approximation problem, and the problem of computing approximations to the ℓ_2 leverage scores. Since it underlies the other applications, only the ℓ_2 regression improvement is stated here explicitly; its proof is basically combining our Theorem 1 with Theorem 19 of [11].
Corollary 1. With a constant probability, a (1 + ε)-approximate solution to an ℓ_2 regression problem can be computed in O(nnz(A) + T_2(ε; d²/ε², d)) time.
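A minimal sketch-and-solve illustration of Corollary 1 (ours): sketch [A b] with the Theorem 1 embedding and solve the small least-squares problem directly. The corollary's stated running time instead plugs the sketched problem into a solver with cost T_2(ε; d²/ε², d); the function name and parameters below are our own.

```python
import numpy as np

def sketch_and_solve_l2(A, b, eps, delta, rng):
    """Minimal sketch-and-solve illustration of Corollary 1 (our own sketch):
    apply the Theorem-1 embedding to [A b], then solve the small problem."""
    n, d = A.shape
    s = int(((d + 1) ** 2 + (d + 1)) / (eps ** 2 * delta))  # sketch size for [A b]
    h = rng.integers(0, s, size=n)                          # bucket per row
    sign = rng.choice([-1.0, 1.0], size=n)                  # random signs
    SA = np.zeros((s, d)); Sb = np.zeros(s)
    np.add.at(SA, h, sign[:, None] * A)                     # SA = Pi @ A in one pass
    np.add.at(Sb, h, sign * b)                              # Sb = Pi @ b
    x_hat, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x_hat

rng = np.random.default_rng(0)
A = rng.standard_normal((50_000, 8)); b = rng.standard_normal(50_000)
x_hat = sketch_and_solve_l2(A, b, eps=0.5, delta=0.1, rng=rng)
```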
Remark. Although our simpler direct proof leads to a better result for ℓ_2 subspace embedding, the technique used in the proof of Clarkson and Woodruff [11], which splits coordinates into "heavy" and "light" sets based on the leverage scores, highlights an important structural property of the ℓ_2 subspace: that only a small subset of coordinates can have large ℓ_2 leverage scores. (We note that the technique of splitting coordinates is also used by Ailon and Liberty [1] to get an unrestricted fast Johnson-Lindenstrauss transform; and that the difficulty in finding and approximating the large-leverage directions was, until recently [20, 15], responsible for difficulties in obtaining fast relative-error random sampling algorithms for ℓ_2 regression and low-rank matrix approximation.) An analogous structural fact holds for ℓ_1 and other ℓ_p spaces. Using this property, we can construct novel input-sparsity time ℓ_p subspace embeddings for general p ∈ [1, 2), as we discuss in the next two sections.
4. MAIN RESULTS FOR ℓ_1 EMBEDDING
Here is our result for input-sparsity time low-distortion subspace embeddings for ℓ_1.

Theorem 2. Given A ∈ R^{n×d}, let Π = SC ∈ R^{s×n}, where S ∈ R^{s×n} has each column chosen independently and uniformly from the s standard basis vectors of R^s, and where C ∈ R^{n×n} is a diagonal matrix with diagonals chosen independently from the standard Cauchy distribution. Set s = ωd^5 log^5 d with ω sufficiently large. Then with a constant probability, we have, for all x ∈ R^d,
    1/O(d² log² d) · ‖Ax‖_1 ≤ ‖ΠAx‖_1 ≤ O(d log d) · ‖Ax‖_1.
In addition, ΠA can be computed in O(nnz(A)) time.
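Relative to the ℓ_2 sketch after Theorem 1, the only change is the diagonal: the snippet below (ours) reuses the same bucket structure for S but draws the diagonal of C from the standard Cauchy distribution, as Theorem 2 prescribes.

```python
import numpy as np
import scipy.sparse as sp

def sparse_l1_embed(A, s, rng):
    """Theorem-2 style sketch Pi = S C: one random bucket per row of A and a
    standard Cauchy diagonal (instead of the +/-1 diagonal used for l2)."""
    n, _ = A.shape
    h = rng.integers(0, s, size=n)
    c = rng.standard_cauchy(n)                        # diagonal of C
    S = sp.csr_matrix((c, (h, np.arange(n))), shape=(s, n))
    return S @ A                                      # O(nnz(A)) work

# Per Theorem 2, s = omega * d**5 * log(d)**5 for a sufficiently large constant
# omega; the resulting distortion is O(poly(d)) rather than 1 +/- eps.
```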
The construction of the ℓ_1 subspace embedding matrix differs from its ℓ_2 norm counterpart only in the diagonal elements of D (or C): whereas we use ±1 for the ℓ_2 norm, we use Cauchy variables for the ℓ_1 norm. The proof of Theorem 2 uses the technique of splitting coordinates, the fact that the Cauchy distribution is 1-stable, and the upper and lower tail inequalities from Lemmas 5 and 6. See Appendix A.2 for a complete proof.
Remark. As mentioned above, the O(nnz(A)) running time is optimal. Whether the distortion O(d³ log³ d) is optimal is still an open question. However, for the same construction of Π, we can provide a "bad" case that provides a lower bound. Choose A = (I_d 0)^T. Suppose that s is sufficiently large such that, with an overwhelming probability, the top d rows of A are perfectly hashed, i.e., ‖ΠAx‖_1 = Σ_{k=1}^d |c_k||x_k| for all x ∈ R^d, where c_k is the k-th diagonal of C. Then, the distortion of Π is max_{k≤d} |c_k| / min_{k≤d} |c_k|, which is on the order of d². Therefore, at most an O(d log³ d) factor of the distortion is due to artifacts in our analysis.
Our input-sparsity time ℓ_1 subspace embedding of Theorem 2 improves the O(nnz(A) · d log d)-time embedding by Sohler and Woodruff [29] and the O(nd log n)-time embedding of Clarkson et al. [10]. In addition, by combining Theorem 2 and Lemma 3, we can compute a (1 ± ε)-distortion embedding in nearly input-sparsity time.

Theorem 3. Given A ∈ R^{n×d}, it takes O(nnz(A) · log n) time to compute a sampling matrix S ∈ R^{s×n} with s = O(poly(d) log(1/ε)/ε²) such that, with a constant probability, S embeds A_1 into (R^s, ‖·‖_1) with distortion 1 ± ε.
Our improvements in Theorems 2 and 3 also propagate to related ℓ_1-based applications, including the ℓ_1 regression and the ℓ_1 subspace approximation problem considered in [29, 10]. As before, only the regression improvement is stated here explicitly. For completeness, we present in Algorithm 1 our algorithm for solving ℓ_1 regression problems in nearly input-sparsity time. The brief proof of Corollary 2, our main quality-of-approximation result for Algorithm 1, may be found in Appendix A.3.
Algorithm 1 Fast ℓ_1 Regression Approximation in O(nnz(A) · log n + poly(d) log(1/ε)/ε²) Time
Input: A ∈ R^{n×d}, b ∈ R^n, and ε ∈ (0, 1/2).
Output: A (1 + ε)-approximate solution x̂ to min_{x∈R^d} ‖Ax − b‖_1, with a constant probability.
1: Let Ā = [A b] and denote by Ā_1 the ℓ_1 subspace spanned by A's columns and b.
2: Compute a low-distortion embedding Π ∈ R^{O(poly(d))×n} of Ā_1 (Theorem 2).
3: Compute R̄ ∈ R^{(d+1)×(d+1)} from ΠĀ such that ĀR̄^{−1} is well-conditioned (QR or Lemma 2).
4: Compute a (1 ± ε/4)-distortion embedding matrix S ∈ R^{O(poly(d) log(1/ε)/ε²)×n} of Ā_1 (Lemma 3).
5: Compute a (1 + ε/4)-approximate solution x̂ to min_{x∈R^d} ‖SAx − Sb‖_1.
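A simplified end-to-end rendering of Algorithm 1 (our own sketch, not the authors' code): the sketch sizes s_embed and s_sample are taken as inputs rather than the poly(d) prescriptions of Theorems 2 and 3, the row norms in the sampling step are computed exactly instead of estimated as in [10], and the small subsampled problem is solved exactly with an off-the-shelf LP; solve_l1, fast_l1_regression, and all parameter names are ours.

```python
import numpy as np
import scipy.sparse as sp
from scipy.optimize import linprog

def solve_l1(M, y):
    """Exact l1 regression on a small problem via linear programming."""
    m, d = M.shape
    c = np.concatenate([np.zeros(d), np.ones(m)])
    A_ub = np.block([[M, -np.eye(m)], [-M, -np.eye(m)]])
    b_ub = np.concatenate([y, -y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + m))
    return res.x[:d]

def fast_l1_regression(A, b, rng, s_embed, s_sample):
    """Simplified sketch of Algorithm 1: embed, condition, sample, solve."""
    n, d = A.shape
    Abar = np.column_stack([A, b])                     # step 1: Abar = [A b]
    # Step 2: low-distortion l1 embedding (Theorem 2): buckets + Cauchy diagonal.
    h = rng.integers(0, s_embed, size=n)
    S1 = sp.csr_matrix((rng.standard_cauchy(n), (h, np.arange(n))),
                       shape=(s_embed, n))
    PiAbar = S1 @ Abar
    # Step 3: conditioning matrix from the QR factorization of Pi @ Abar.
    _, Rbar = np.linalg.qr(PiAbar)
    # Step 4: subspace-preserving sampling (Lemma 3, simplified to exact norms).
    U = Abar @ np.linalg.inv(Rbar)
    t = np.abs(U).sum(axis=1)                          # l1 importance per row
    q = np.minimum(1.0, s_sample * t / t.sum())
    keep = rng.random(n) < q
    w = 1.0 / q[keep]                                  # rescaling for p = 1
    # Step 5: solve the small weighted l1 problem.
    return solve_l1(w[:, None] * A[keep], w * b[keep])
```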
Corollary 2. With a constant probability, Algorithm 1 computes a (1 + ε)-approximation to an ℓ_1 regression problem in O(nnz(A) · log n + T_1(ε; poly(d) log(1/ε)/ε², d)) time.

Remark. For readers familiar with the impossibility results for dimension reduction in ℓ_1 [8, 18, 5], note that those results apply to arbitrary point sets of size n and are interested in embeddings that are "oblivious," in that they do not depend on the input data. In this paper, we only consider points in a subspace, and the subspace-preserving sampling procedure of [12] that we use is data-dependent.
5. MAIN RESULTS FOR ℓ_p EMBEDDING
In this section, we use the properties of p-stable distributions to generalize the input-sparsity time ℓ_1 subspace embedding to ℓ_p norms, for p ∈ (1, 2). Generally, D_p does not have an explicit PDF/CDF, which increases the difficulty of theoretical analysis. Indeed, the main technical difficulty here is that we are not aware of ℓ_p analogues of Lemmas 5 and 6 that would provide upper and lower tail inequalities for p-stable distributions. (Indeed, even Lemmas 5 and 6 were established only recently [10].)
Instead of analyzing D_p directly, for any p ∈ (1, 2), we establish an order among the Cauchy distribution, the p-stable distribution, and the Gaussian distribution, and then we derive upper and lower tail inequalities for the p-stable distribution similar to the ones we used to prove Theorem 2. We state these technical results here since they are of independent interest. We start with the following lemma, which is proved in Appendix A.4 and which establishes this order.

Lemma 8. For any p ∈ (1, 2), there exist constants α_p > 0 and β_p > 0 such that
    α_p |C| ⪰ |X_p|^p ⪰ β_p |G|²,
where C is a standard Cauchy variable, X_p ∼ D_p, and G is a standard Gaussian variable. By X ⪰ Y we mean Pr[X ≥ t] ≥ Pr[Y ≥ t], ∀t ∈ R.

Our numerical results suggest that the constants α_p and β_p are not too far away from 1. See [23] for more details.
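As an informal numerical illustration of Lemma 8 (ours, with the constants α_p and β_p suppressed, consistent with the remark that they are not far from 1), one can compare empirical upper tails of |C|, |X_p|^p, and |G|² for, say, p = 1.5, generating the p-stable draws with the Chambers-Mallows-Stuck transform sketched earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 1.5, 1_000_000

# Draw |C|, |X_p|^p, and |G|^2 (Chambers-Mallows-Stuck for the p-stable draw).
theta = rng.uniform(-np.pi / 2, np.pi / 2, N)
w = rng.exponential(1.0, N)
Xp = (np.sin(p * theta) / np.cos(theta) ** (1 / p)
      * (np.cos((1 - p) * theta) / w) ** ((1 - p) / p))
C, G = rng.standard_cauchy(N), rng.standard_normal(N)

for t in (1.0, 2.0, 5.0, 10.0):
    print(t,
          np.mean(np.abs(C) >= t),        # Cauchy tail (upper curve)
          np.mean(np.abs(Xp) ** p >= t),  # p-stable tail (middle curve)
          np.mean(G ** 2 >= t))           # squared-Gaussian tail (lower curve)
```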
Lemma 8 suggests that we can use Lemma 5 (regarding Cauchy random variables) to derive upper tail inequalities for general p-stable distributions and that we can use Lemma 7 (regarding Gaussian variables) to derive lower tail inequalities for general p-stable distributions. The following two lemmas establish these results; the proofs of these lemmas are provided in Appendixes A.5 and A.6, respectively.

Lemma 9. Given p ∈ (1, 2), for i = 1, ..., m, let X_i be m (not necessarily independent) random variables sampled from D_p, and γ_i > 0 with γ = Σ_i γ_i. Let X = Σ_i γ_i |X_i|^p. Assume that m ≥ 3. Then for any t ≥ 1,
    Pr[X ≥ α_p t γ] ≤ 2 log(mt)/t.

Lemma 10. For i = 1, ..., m, let X_i be independent random variables sampled from D_p, and γ_i ≥ 0 with γ = Σ_i γ_i. Let X = Σ_i γ_i |X_i|^p. Then,
    log Pr[X ≤ (1 − t)β_p γ] ≤ −γt² / (6 max_i γ_i).
Given these results, here is our main result for input-sparsity time low-distortion subspace embeddings for ℓ_p. The proof of this theorem is similar to the proof of Theorem 2, except that we replace the ℓ_1 norm ‖·‖_1 by ‖·‖_p^p and use Lemmas 9 and 10 (rather than Lemmas 5 and 6).

Theorem 4. Given A ∈ R^{n×d} and p ∈ (1, 2), let Π = SD ∈ R^{s×n}, where S ∈ R^{s×n} has each column chosen independently and uniformly from the s standard basis vectors of R^s, and where D ∈ R^{n×n} is a diagonal matrix with diagonals chosen independently from D_p. Set s = ωd^5 log^5 d …

Citations
Book
David P. Woodruff
14 Nov 2014
TL;DR: A survey of linear sketching algorithms for numerical linear algebra can be found in this paper, where the authors consider least squares as well as robust regression problems, low rank approximation, and graph sparsification.
Abstract: This survey highlights the recent advances in algorithms for numerical linear algebra that have come from the technique of linear sketching, whereby given a matrix, one first compresses it to a much smaller matrix by multiplying it by a (usually) random matrix with certain properties. Much of the expensive computation can then be performed on the smaller matrix, thereby accelerating the solution for the original problem. In this survey we consider least squares as well as robust regression problems, low rank approximation, and graph sparsification. We also discuss a number of variants of these problems. Finally, we discuss the limitations of sketching methods.

584 citations

Proceedings ArticleDOI
01 Jun 2013
TL;DR: The fastest known algorithms for overconstrained least-squares regression, low-rank approximation, approximating all leverage scores, and ℓ_p-regression are obtained.
Abstract: We design a new distribution over poly(rε^{-1}) × n matrices S so that for any fixed n × d matrix A of rank r, with probability at least 9/10, ‖SAx‖_2 = (1 ± ε)‖Ax‖_2 simultaneously for all x ∈ R^d. Such a matrix S is called a subspace embedding. Furthermore, SA can be computed in O(nnz(A)) + Õ(r²ε^{-2}) time, where nnz(A) is the number of non-zero entries of A. This improves over all previous subspace embeddings, which required at least Ω(nd log d) time to achieve this property. We call our matrices S sparse embedding matrices. Using our sparse embedding matrices, we obtain the fastest known algorithms for overconstrained least-squares regression, low-rank approximation, approximating all leverage scores, and ℓ_p-regression: to output an x' for which ‖Ax' − b‖_2 ≤ (1+ε) min_x ‖Ax − b‖_2 for an n × d matrix A and an n × 1 column vector b, we obtain an algorithm running in O(nnz(A)) + Õ(d³ε^{-2}) time, and another in O(nnz(A) log(1/ε)) + Õ(d³ log(1/ε)) time. (Here Õ(f) = f · log^{O(1)}(f).) To obtain a decomposition of an n × n matrix A into a product of an n × k matrix L, a k × k diagonal matrix D, and an n × k matrix W, for which ‖A − LDW‖_F ≤ (1+ε)‖A − A_k‖_F, where A_k is the best rank-k approximation, our algorithm runs in O(nnz(A)) + Õ(nk²ε^{-4} log n + k³ε^{-5} log² n) time. To output an approximation to all leverage scores of an n × d input matrix A simultaneously, with constant relative error, our algorithms run in O(nnz(A) log n) + Õ(r³) time. To output an x' for which ‖Ax' − b‖_p ≤ (1+ε) min_x ‖Ax − b‖_p for an n × d matrix A and an n × 1 column vector b, we obtain an algorithm running in O(nnz(A) log n) + poly(rε^{-1}) time, for any constant 1 ≤ p

443 citations

Journal ArticleDOI
TL;DR: A new distribution over m × n matrices S is designed so that, for any fixed n × d matrix A of rank r, with probability at least 9/10, ‖SAx‖_2 = (1 ± ε)‖Ax‖_2 simultaneously for all x ∈ R^d.
Abstract: We design a new distribution over m × n matrices S so that, for any fixed n × d matrix A of rank r, with probability at least 9/10, ‖SAx‖_2 = (1 ± ε)‖Ax‖_2 simultaneously for all x ∈ R^d. Here, m is bounded by a polynomial in rε^{-1}, and the parameter ε ∈ (0, 1]. Such a matrix S is called a subspace embedding. Furthermore, SA can be computed in O(nnz(A)) time, where nnz(A) is the number of nonzero entries of A. This improves over all previous subspace embeddings, for which computing SA required at least Ω(nd log d) time. We call these S sparse embedding matrices. Using our sparse embedding matrices, we obtain the fastest known algorithms for overconstrained least-squares regression, low-rank approximation, approximating all leverage scores, and ℓ_p regression. More specifically, let b be an n × 1 vector, ε > 0 a small enough value, and integers k, p ≥ 1. Our results include the following. Regression: the regression problem is to find a d × 1 vector x' for which ‖Ax' − b‖_p ≤ (1 + ε) min_x ‖Ax − b‖_p. For the Euclidean case p = 2, we obtain an algorithm running in O(nnz(A)) + Õ(d³ε^{-2}) time, and another in O(nnz(A) log(1/ε)) + Õ(d³ log(1/ε)) time. (Here, Õ(f) = f · log^{O(1)}(f).) For p ∈ [1, ∞), more generally, we obtain an algorithm running in O(nnz(A) log n) + O(rε^{-1})^C time, for a fixed C. Low-rank approximation: we give an algorithm to obtain a rank-k matrix Â_k such that ‖A − Â_k‖_F ≤ (1 + ε)‖A − A_k‖_F, where A_k is the best rank-k approximation to A. (That is, A_k is the output of principal components analysis, produced by a truncated singular value decomposition, useful for latent semantic indexing and many other statistical problems.) Our algorithm runs in O(nnz(A)) + Õ(nk²ε^{-4} + k³ε^{-5}) time. Leverage scores: we give an algorithm to estimate the leverage scores of A, up to a constant factor, in O(nnz(A) log n) + Õ(r³) time.

381 citations

Journal ArticleDOI
David P. Woodruff
TL;DR: This survey highlights the recent advances in algorithms for numerical linear algebra that have come from the technique of linear sketching, and considers least squares as well as robust regression problems, low rank approximation, and graph sparsification.
Abstract: This survey highlights the recent advances in algorithms for numerical linear algebra that have come from the technique of linear sketching, whereby given a matrix, one first compresses it to a much smaller matrix by multiplying it by a (usually) random matrix with certain properties. Much of the expensive computation can then be performed on the smaller matrix, thereby accelerating the solution for the original problem. In this survey we consider least squares as well as robust regression problems, low rank approximation, and graph sparsification. We also discuss a number of variants of these problems. Finally, we discuss the limitations of sketching methods.

335 citations

References
Journal ArticleDOI
TL;DR: In this article, a nonlinear transformation of two independent uniform random variables into one stable random variable is presented, which is a continuous function of each of the uniform random variables, and of α and a modified skewness parameter β', throughout their respective permissible ranges.
Abstract: A new algorithm is presented for simulating stable random variables on a digital computer for arbitrary characteristic exponent α(0 < α ≤ 2) and skewness parameter β(-1 ≤ β ≤ 1). The algorithm involves a nonlinear transformation of two independent uniform random variables into one stable random variable. This stable random variable is a continuous function of each of the uniform random variables, and of α and a modified skewness parameter β' throughout their respective permissible ranges.

1,124 citations

Proceedings ArticleDOI
21 Oct 2006
TL;DR: In this paper, the authors present a (1 + ε)-approximation algorithm for the singular value decomposition of an m × n matrix A with M non-zero entries that requires 2 passes over the data.
Abstract: Recently several results appeared that show significant reduction in time for matrix multiplication, singular value decomposition as well as linear (ℓ_2) regression, all based on data dependent random sampling. Our key idea is that low dimensional embeddings can be used to eliminate data dependence and provide more versatile, linear time pass efficient matrix computation. Our main contribution is summarized as follows. Independent of the recent results of Har-Peled and of Deshpande and Vempala, one of the first, and to the best of our knowledge the most efficient, relative error (1 + ε)‖A − A_k‖_F approximation algorithms for the singular value decomposition of an m × n matrix A with M non-zero entries that requires 2 passes over the data and runs in time O((M(k/ε + k log k) + (n + m)(k/ε + k log k)²) log(1/δ)). The first o(nd²) time (1 + ε) relative error approximation algorithm for n × d linear (ℓ_2) regression. A matrix multiplication and norm approximation algorithm that easily applies to implicitly given matrices and can be used as a black box probability boosting tool.

852 citations

Journal ArticleDOI
TL;DR: An algorithm is presented that preferentially chooses columns and rows that exhibit high “statistical leverage” and exert a disproportionately large “influence” on the best low-rank fit of the data matrix, obtaining improved relative-error and constant-factor approximation guarantees in worst-case analysis, as opposed to the much coarser additive-error guarantees of prior work.
Abstract: Principal components analysis and, more generally, the Singular Value Decomposition are fundamental data analysis tools that express a data matrix in terms of a sequence of orthogonal or uncorrelated vectors of decreasing importance. Unfortunately, being linear combinations of up to all the data points, these vectors are notoriously difficult to interpret in terms of the data and processes generating the data. In this article, we develop CUR matrix decompositions for improved data analysis. CUR decompositions are low-rank matrix decompositions that are explicitly expressed in terms of a small number of actual columns and/or actual rows of the data matrix. Because they are constructed from actual data elements, CUR decompositions are interpretable by practitioners of the field from which the data are drawn (to the extent that the original data are). We present an algorithm that preferentially chooses columns and rows that exhibit high “statistical leverage” and, thus, in a very precise statistical sense, exert a disproportionately large “influence” on the best low-rank fit of the data matrix. By selecting columns and rows in this manner, we obtain improved relative-error and constant-factor approximation guarantees in worst-case analysis, as opposed to the much coarser additive-error guarantees of prior work. In addition, since the construction involves computing quantities with a natural and widely studied statistical interpretation, we can leverage ideas from diagnostic regression analysis to employ these matrix decompositions for exploratory data analysis.

815 citations

Posted Content
TL;DR: This monograph will provide a detailed overview of recent work on the theory of randomized matrix algorithms as well as the application of those ideas to the solution of practical problems in large-scale data analysis.
Abstract: Randomized algorithms for very large matrix problems have received a great deal of attention in recent years. Much of this work was motivated by problems in large-scale data analysis, and this work was performed by individuals from many different research communities. This monograph will provide a detailed overview of recent work on the theory of randomized matrix algorithms as well as the application of those ideas to the solution of practical problems in large-scale data analysis. An emphasis will be placed on a few simple core ideas that underlie not only recent theoretical advances but also the usefulness of these tools in large-scale data applications. Crucial in this context is the connection with the concept of statistical leverage. This concept has long been used in statistical regression diagnostics to identify outliers; and it has recently proved crucial in the development of improved worst-case matrix algorithms that are also amenable to high-quality numerical implementation and that are useful to domain scientists. Randomized methods solve problems such as the linear least-squares problem and the low-rank matrix approximation problem by constructing and operating on a randomized sketch of the input matrix. Depending on the specifics of the situation, when compared with the best previously-existing deterministic algorithms, the resulting randomized algorithms have worst-case running time that is asymptotically faster; their numerical implementations are faster in terms of clock-time; or they can be implemented in parallel computing environments where existing numerical algorithms fail to run at all. Numerous examples illustrating these observations will be described in detail.

639 citations


Frequently Asked Questions (16)
Q1. What are the contributions in "Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression"?

The authors show that, given a matrix A ∈ R^{n×d} with n ≫ d and a p ∈ [1, 2), with a constant probability, they can construct a low-distortion embedding matrix Π ∈ R^{O(poly(d))×n} that embeds A_p, the ℓ_p subspace spanned by A's columns, into (R^{O(poly(d))}, ‖·‖_p); the distortion of their embeddings is only O(poly(d)), and they can compute ΠA in O(nnz(A)) time, i.e., input-sparsity time. Their result generalizes the input-sparsity time ℓ_2 subspace embedding by Clarkson and Woodruff [STOC'13]; and for completeness, the authors present a simpler and improved analysis of their construction for ℓ_2. For ℓ_2, the authors show that a (1 + ε)-approximate solution to the ℓ_2 regression problem specified by the matrix A and a vector b ∈ R^n can be computed in O(nnz(A) + d³ log(d/ε)/ε²) time; and for ℓ_p, via a subspace-preserving sampling procedure, they show that a (1 ± ε)-distortion embedding of A_p into R^{O(poly(d))} can be computed in O(nnz(A) · log n) time, and they also show that a (1 + ε)-approximate solution to the ℓ_p regression problem min_{x∈R^d} ‖Ax − b‖_p can be computed in O(nnz(A) · log n + poly(d) log(1/ε)/ε²) time. Moreover, they can also improve the embedding dimension, or equivalently the sample size, to O(d^{3+p/2} log(1/ε)/ε²) without increasing the complexity.

A parameterized family of regression problems that is of particular interest is the overconstrained ℓ_p regression problem: given a matrix A ∈ R^{n×d}, with n > d, a vector b ∈ R^n, a norm ‖·‖_p parameterized by p ∈ [1, ∞], and an error parameter ε > 0, find a (1 + ε)-approximate solution x̂ ∈ R^d to f* = min_{x∈R^d} ‖Ax − b‖_p.

Given R ∈ R^{d×d} such that AR^{−1} is well-conditioned in the ℓ_p norm, the authors can construct a (1 ± ε)-distortion embedding, specifically a subspace-preserving sampling, of A_p in O(nnz(A) · log n) additional time and with a constant probability.

Given an ℓ_p regression problem specified by A ∈ R^{n×d}, b ∈ R^n, and p ∈ [1, ∞), let S be a (1 ± ε)-distortion embedding matrix of the subspace spanned by A's columns and b from Lemma 3, and let x̂ be an optimal solution to the subsampled problem min_{x∈R^d} ‖SAx − Sb‖_p.

The authors are interested in fast embedding of A_p into a d-dimensional subspace of (R^{poly(d)}, ‖·‖_p), with distortion either poly(d) or (1 ± ε), for some ε > 0, as well as applications of this embedding to problems such as ℓ_p regression.

The (1 ± ε)-distortion subspace embedding (for ℓ_p, p ∈ [1, 2), that the authors construct from the input-sparsity time embedding and the fast subspace-preserving sampling) has embedding dimension s = O(poly(d) log(1/ε)/ε²), where the somewhat large poly(d) term directly multiplies the log(1/ε)/ε² term.

Given a matrix A ∈ R^{n×d}, p ∈ [1, ∞), ε > 0, and a matrix R ∈ R^{d×d} such that AR^{−1} is well-conditioned, it takes O(nnz(A) · log n) time to compute a sampling matrix S ∈ R^{s×n} (with only one nonzero element per row) with s = O(κ̄_p^p(AR^{−1}) d^{|p/2−1|+1} log(1/ε)/ε²).

In Theorem 2 and Theorem 4, the embedding dimension is s = O(poly(d) log(1/ε)/ε²), where the poly(d) term is a somewhat large polynomial of d that directly multiplies the log(1/ε)/ε² term.

In addition, the authors can use it to compute a (1 + ε)-approximation to the ℓ_1 regression problem in O(nnz(A) · log n + poly(d/ε)) time, which in turn leads to immediate improvements in ℓ_1-based matrix approximation objectives, e.g., for the ℓ_1 subspace approximation problem [6, 29, 10].

The authors also show that, by coupling with recent work on fast subspace-preserving sampling from [10], these embeddings can be used to provide (1 + ε)-approximate solutions to ℓ_p regression problems, for p ∈ [1, 2], in nearly input-sparsity time.

Clarkson and Woodruff achieve their improved results for ℓ_2-based problems by showing how to construct such a Π with s = poly(d/ε) and showing that it can be applied to an arbitrary A in O(nnz(A)) time [11].

Without any prior knowledge, the authors have to scan at least a constant portion of the input to guarantee that a_j is observed with a constant probability, which takes O(nnz(A)) time.

Definition 4. A distribution D over R is called p-stable if, for any m real numbers a_1, ..., a_m, we have Σ_{i=1}^m a_i X_i ≃ (Σ_{i=1}^m |a_i|^p)^{1/p} X, where the X_i are i.i.d. draws from D and X ∼ D. By "X ≃ Y", the authors mean X and Y have the same distribution.

The authors want to thank P. Drineas for reading the first version of this paper and pointing out that the embedding dimension in Theorem 1 can be easily improved from O(d⁴/ε²) to O(d²/ε²) using the same technique.

Given a subspace-preserving sampling algorithm, Clarkson et al. [10, Theorem 5.4] show it is straightforward to compute a (1+ε)/(1−ε)-approximate solution to an ℓ_p regression problem.

Although their simpler direct proof leads to a better result for the ℓ_2 subspace embedding, the technique used in the proof of Clarkson and Woodruff [11], which splits coordinates into "heavy" and "light" sets based on the leverage scores, highlights an important structural property of the ℓ_2 subspace: only a small subset of coordinates can have large ℓ_2 leverage scores.