Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression
Summary
1. INTRODUCTION
- Regression problems are ubiquitous, and the fast computation of their solutions is of interest in many large-scale data applications.
- The authors' analysis is direct and does not rely on splitting the high-dimensional space into a set of heavy-hitters consisting of the high-leverage components and the complement of that heavy-hitting set.
- In general, the authors prove that there is an ordering among the Cauchy distribution, a p-stable distribution with p ∈ (1, 2), and the Gaussian distribution, such that for all p ∈ (1, 2) one can use the upper tail bound from the Cauchy distribution and the lower tail bound from the Gaussian distribution.
- The (1 ± ε)-distortion subspace embedding (for ℓp, p ∈ [1, 2)) that the authors construct from the input-sparsity time embedding and the fast subspace-preserving sampling has embedding dimension s = O(poly(d) log(1/ε)/ε^2), where the somewhat large poly(d) term directly multiplies the log(1/ε)/ε^2 term.
Conditioning.
- The ℓp subspace embedding and ℓp regression problems are closely related to the concept of conditioning.
- The authors state here two related notions of ℓp-norm conditioning and then a lemma that characterizes the relationship between them.
Lemma 1 ([10]). Given an n
- This procedure is called conditioning, and there exist two approaches for conditioning: via low-distortion ℓp subspace embedding and via ellipsoidal rounding.
- The authors simply cite the following lemma, which is based on ellipsoidal rounding.
Stable distributions.
- The authors use properties of p-stable distributions for analyzing input-sparsity time low-distortion ℓp subspace embeddings.
- By Lévy [19], it is known that p-stable distributions exist for p ∈ (0, 2]; and from Chambers et al. [7], it is known that p-stable random variables can be generated efficiently, thus allowing their practical use.
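A minimal numpy sketch of the Chambers-Mallows-Stuck recipe from [7] for the symmetric case (the function name and interface are ours, not the paper's); for p = 1 the formula collapses to tan(θ), i.e., a standard Cauchy draw, and for p = 2 it yields a scaled Gaussian:

```python
import numpy as np

def sample_p_stable(p, size, seed=None):
    """Symmetric p-stable variates via Chambers-Mallows-Stuck [7].

    theta ~ Uniform(-pi/2, pi/2), w ~ Exp(1). For p = 1 the exponent
    (1 - p)/p vanishes and the expression reduces to tan(theta).
    """
    rng = np.random.default_rng(seed)
    theta = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(p * theta) / np.cos(theta) ** (1.0 / p)
            * (np.cos((1.0 - p) * theta) / w) ** ((1.0 - p) / p))
```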
Tail inequalities.
- The authors note two inequalities from Clarkson et al. [10] regarding the tails of the Cauchy distribution.
- The following result about Gaussian variables is a direct consequence of Maurer's inequality [22], and the authors will use it to derive lower tail inequalities for p-stable distributions.
3. MAIN RESULTS FOR ℓ2 EMBEDDING
- Here is their result for input-sparsity time low-distortion subspace embeddings for ℓ2.
- See also Nelson and Nguyen [26] for a similar result with a slightly better constant.
- The O(nnz(A)) running time is indeed optimal, up to constant factors, for general inputs.
- The results of Theorem 1 propagate to related applications, e.g., to the ℓ2 regression problem, the low-rank matrix approximation problem, and the problem of computing approximations to the ℓ2 leverage scores.
- The technique used in the proof of Clarkson and Woodruff [11], which splits coordinates into "heavy" and "light" sets based on the leverage scores, highlights an important structural property of ℓ2 subspaces: only a small subset of coordinates can have large ℓ2 leverage scores.
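The Π of Theorem 1 is of the generalized-CountSketch shape: each column carries a single ±1 entry in a uniformly random row, so Π can be applied to A by touching each nonzero once. A minimal numpy/scipy sketch (assuming s is set per Theorem 1, e.g., s = O(d^2/ε^2) as noted in the acknowledgments below):

```python
import numpy as np
import scipy.sparse as sp

def sparse_l2_embedding(n, s, seed=None):
    """Pi in R^{s x n} with one +/-1 per column, placed in a random row.
    Pi @ A visits each nonzero of A exactly once, hence O(nnz(A)) time."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, s, n)                # h(j): target row of column j
    signs = rng.choice([-1.0, 1.0], size=n)     # sigma(j): random sign
    return sp.csr_matrix((signs, (rows, np.arange(n))), shape=(s, n))
```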
4. MAIN RESULTS FOR ℓ1 EMBEDDING
- Here is their result for input-sparsity time low-distortion subspace embeddings for ℓ1.
- As mentioned above, the O(nnz(A)) running time is optimal.
- For the same construction of Π, the authors provide a "bad" case that yields a matching lower bound.
- The authors' input-sparsity time ℓ1 subspace embedding of Theorem 2 improves on the O(nnz(A) · d log d)-time embedding of Sohler and Woodruff [29] and the O(nd log n)-time embedding of Clarkson et al. [10].
- The authors' improvements in Theorems 2 and 3 also propagate to related ℓ1-based applications, including the ℓ1 regression and the ℓ1 subspace approximation problems considered in [29, 10].
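The ℓ1 construction keeps the same one-nonzero-per-column sparsity pattern but fills the entries with independent Cauchy variates. A hedged sketch of this general shape only; the exact rescaling constants of Theorem 2 are omitted, and the random sign is absorbed by the symmetry of the Cauchy distribution:

```python
import numpy as np
import scipy.sparse as sp

def sparse_l1_embedding(n, s, seed=None):
    """Pi in R^{s x n}: column j holds a single standard Cauchy entry in a
    random row. General form only; Theorem 2's scaling is omitted."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, s, n)
    entries = rng.standard_cauchy(n)            # symmetric, so no extra sign
    return sp.csr_matrix((entries, (rows, np.arange(n))), shape=(s, n))
```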
5. MAIN RESULTS FOR ℓp EMBEDDING
- Generally, the p-stable distribution D_p does not have an explicit PDF/CDF, which makes theoretical analysis more difficult.
- Lemma 8 suggests that the authors can use Lemma 5 (regarding Cauchy random variables) to derive upper tail inequalities for general p-stable distributions and that they can use Lemma 7 (regarding Gaussian variables) to derive lower tail inequalities for general p-stable distributions.
- Given these results, here is their main result for input-sparsity time low-distortion subspace embeddings for ℓp; a sketch of the general form of the construction follows this list.
- In particular, the authors can establish an improved algorithm for solving the ℓp regression problem in nearly input-sparsity time.
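For p ∈ (1, 2) the same sparsity pattern can be filled with symmetric p-stable variates drawn by the Chambers-Mallows-Stuck recipe sketched earlier. Again a hedged sketch of the general shape under that assumption, not the theorem's exact construction:

```python
import numpy as np
import scipy.sparse as sp

def sparse_lp_embedding(n, s, p, seed=None):
    """One symmetric p-stable entry per column, in a random row (cf. the
    l1 and l2 sketches above); Theorem 4's constants are omitted."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(-np.pi / 2, np.pi / 2, n)
    w = rng.exponential(1.0, n)
    stable = (np.sin(p * theta) / np.cos(theta) ** (1.0 / p)
              * (np.cos((1.0 - p) * theta) / w) ** ((1.0 - p) / p))
    rows = rng.integers(0, s, n)
    return sp.csr_matrix((stable, (rows, np.arange(n))), shape=(s, n))
```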
6. IMPROVED EMBEDDING DIMENSION
- (See the remark below for comments on the precise value of the poly(d) term.)
- This is not ideal for the subspace embedding or the ℓp regression, because the authors want a small embedding dimension and a small subsampled problem, respectively.
- Here, the authors show that it is possible to decouple the large poly(d) term from the log(1/ε)/ε^2 term via another round of sampling and conditioning, without increasing the complexity.
- See Algorithm 2 for details on this procedure.
Algorithm 2 Improving the Embedding Dimension
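The body of Algorithm 2 is not reproduced in this summary; below is a hedged numpy sketch of one conditioning-plus-sampling round of the kind described above (the function name, sampling rate, and constants are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def resample_round(A, Pi, p, s2, seed=None):
    """One round of conditioning + l_p row sampling (illustrative sketch).

    Pi is a low-distortion embedding as in Theorems 1/2/4. R comes from a QR
    factorization of the small sketch Pi @ A, so A @ inv(R) is approximately
    well-conditioned; rows are then sampled with probability proportional to
    the p-th power of their l_p norms and rescaled to keep the estimate of
    ||Ax||_p^p unbiased.
    """
    rng = np.random.default_rng(seed)
    _, R = np.linalg.qr(Pi @ A)                     # conditioning step
    U = A @ np.linalg.inv(R)
    q = np.linalg.norm(U, ord=p, axis=1) ** p       # row importance weights
    q /= q.sum()
    idx = rng.choice(A.shape[0], size=s2, p=q)      # subsample s2 rows
    scale = (1.0 / (s2 * q[idx])) ** (1.0 / p)      # unbiasedness rescaling
    return scale[:, None] * A[idx], idx
```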
- Then, by applying Theorem 7 to the ℓp regression problem, the authors can improve the size of the subsampled problem and hence the overall running time.
- The authors have stated their results in the previous sections as poly(d) without stating the value of the polynomial because there are numerous trade-offs between the conditioning quality and the running time.
Frequently Asked Questions (16)
Q2. What is the common parameterized family of regression problems?
A parameterized family of regression problems that is of particular interest is the overconstrained ℓp regression problem: given a matrix A ∈ R^{n×d}, with n > d, a vector b ∈
Q3. What is the simplest way to construct a subspace-preserving sampling?
Given R ∈ R^{d×d} such that AR^{-1} is well-conditioned in the ℓp norm, the authors can construct a (1 ± ε)-distortion embedding, specifically a subspace-preserving sampling, of A_p in O(nnz(A) · log n) additional time and with constant probability.
Q4. What is the way to solve the ℓp regression problem?
Given an ℓp regression problem specified by A ∈ R^{n×d}, b ∈ R^n, and p ∈ [1, ∞), let S be a (1 ± ε)-distortion embedding matrix of the subspace spanned by A's columns and b from Lemma 3, and let x̂ be an optimal solution to the subsampled problem min_{x∈R^d} ‖SAx − Sb‖_p.
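As a concrete, hypothetical illustration for p = 2: build a CountSketch-style S as in Section 3 and solve the small sketched least-squares problem; the sizes below are illustrative only, not dictated by the theory.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, d, s = 100_000, 10, 2_000                     # illustrative sizes only
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

# One +/-1 per column in a random row (the shape of Section 3's construction).
S = sp.csr_matrix((rng.choice([-1.0, 1.0], n),
                   (rng.integers(0, s, n), np.arange(n))), shape=(s, n))

x_hat, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)   # small sketched solve
x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)           # exact, for comparison
print(np.linalg.norm(A @ x_hat - b) / np.linalg.norm(A @ x_opt - b))
```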
Q5. What is the simplest way to embed a subspace?
The authors are interested in fast embeddings of A_p into a d-dimensional subspace of (R^{poly(d)}, ‖·‖_p), with distortion either poly(d) or (1 ± ε) for some ε > 0, as well as applications of such embeddings to problems such as ℓp regression.
Q6. What is the embedding dimension for ℓp?
The (1 ± ε)-distortion subspace embedding (for ℓp, p ∈ [1, 2)) that the authors construct from the input-sparsity time embedding and the fast subspace-preserving sampling has embedding dimension s = O(poly(d) log(1/ε)/ε^2), where the somewhat large poly(d) term directly multiplies the log(1/ε)/ε^2 term.
Q7. What is the simplest way to compute a subspace-preserving sampling?
Given a matrix A ∈ R^{n×d}, p ∈ [1, ∞), ε > 0, and a matrix R ∈ R^{d×d} such that AR^{-1} is well-conditioned, it takes O(nnz(A) · log n) time to compute a sampling matrix S ∈ R^{s×n} (with only one nonzero element per row) with s = O(κ̄_p^p(AR^{-1})
Q8. What is the embedding dimension in the two theorems?
In Theorem 2 and Theorem 4, the embedding dimension is s = O(poly(d) log(1/ε)/ε^2), where the poly(d) term is a somewhat large polynomial in d that directly multiplies the log(1/ε)/ε^2 term.
Q9. What is the way to solve the ℓ1 regression problem?
In addition, the authors can use it to compute a (1 + ε)-approximation to the ℓ1 regression problem in O(nnz(A) · log n + poly(d/ε)) time, which in turn leads to immediate improvements in ℓ1-based matrix approximation objectives, e.g., for the ℓ1 subspace approximation problem [6, 29, 10].
Q10. How can the authors use sparse embeddings to solve ℓ2 regression problems?
The authors also show that, by coupling with recent work on fast subspace-preserving sampling from [10], these embeddings can be used to provide (1 + ε)-approximate solutions to ℓp regression problems, for p ∈ [1, 2], in nearly input-sparsity time. (The ℓ2 leverage scores are the diagonal elements of the projection matrix onto the span of A; see [20, 15] for details, and note that they can be generalized to ℓ1 and other ℓp norms [10] as well as to arbitrary n×d matrices, with both n and d large [21, 15].)
Q11. How did Clarkson and Woodruff achieve their improved results for ℓ2-based problems?
Clarkson and Woodruff achieve their improved results for ℓ2-based problems by showing how to construct such a Π with s = poly(d/ε) and showing that it can be applied to an arbitrary A in O(nnz(A)) time [11].
Q12. How long does it take to compute a_j?
Without any prior knowledge, the authors have to scan at least a constant fraction of the input to guarantee that a_j is observed with constant probability, which takes O(nnz(A)) time.
Q13. What is the definition of a distribution D over R?
Definition 4. A distribution D over R is called p-stable if, for any m real numbers a1, . . . , am, the authors have ∑_{i=1}^m a_i X_i ≃ (∑_{i=1}^m |a_i|^p)^{1/p} X, where X_i iid∼ D and X ∼ D. By "X ≃ Y", the authors mean that X and Y have the same distribution.
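A quick empirical check of this definition for p = 1 (the Cauchy case), comparing central quantiles since a Cauchy has no mean or variance to compare; the coefficients and sample size below are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([0.5, -2.0, 1.5])
# Left side: sum_i a_i X_i with X_i iid standard Cauchy (p = 1).
lhs = rng.standard_cauchy((100_000, a.size)) @ a
# Right side: (sum_i |a_i|^1)^{1/1} times a single standard Cauchy X.
rhs = np.abs(a).sum() * rng.standard_cauchy(100_000)
qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print(np.quantile(lhs, qs).round(2))   # the two quantile rows should agree
print(np.quantile(rhs, qs).round(2))
```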
Q14. What is the author's reaction to the first version of this paper?
The authors want to thank P. Drineas for reading the first version of this paper and pointing out that the embedding dimension in Theorem 1 can be easily improved from O(d^4/ε^2) to O(d^2/ε^2) using the same technique.
Q15. What is the simplest way to compute a (1 + ε)/(1 − ε)-approximate solution to an ℓp regression problem?
Given a subspace-preserving sampling algorithm, Clarkson et al. [10, Theorem 5.4] show that it is straightforward to compute a (1 + ε)/(1 − ε)-approximate solution to an ℓp regression problem.
Q16. What is the proof for the ℓ2 subspace embedding?
Although their simpler direct proof leads to a better result for ℓ2 subspace embedding, the technique used in the proof of Clarkson and Woodruff [11], which splits coordinates into "heavy" and "light" sets based on the leverage scores, highlights an important structural property of ℓ2 subspaces: only a small subset of coordinates can have large ℓ2 leverage scores.