Open Access Journal Article
A statistical perspective on algorithmic leveraging
TL;DR: In this article, the authors provide a simple yet effective framework to evaluate the statistical properties of leverage-based sampling in the context of estimating parameters in a linear regression model with a fixed number of predictors.
Abstract:
One popular method for dealing with large-scale data sets is sampling. For example, by using the empirical statistical leverage scores as an importance sampling distribution, the method of algorithmic leveraging samples and rescales rows/columns of data matrices to reduce the data size before performing computations on the subproblem. This method has been successful in improving computational efficiency of algorithms for matrix problems such as least-squares approximation, least absolute deviations approximation, and low-rank matrix approximation. Existing work has focused on algorithmic issues such as worst-case running times and numerical issues associated with providing high-quality implementations, but none of it addresses statistical aspects of this method.
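The sampling-and-rescaling step described above can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions, not the paper's implementation: the function names are ours, and the leverage scores are computed exactly via a thin QR factorization.

```python
import numpy as np

def leverage_scores(X):
    # The leverage score of row i is the squared norm of row i of any
    # orthonormal basis for the column space of X (thin QR used here).
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

def leveraging_ls(X, y, r, seed=0):
    # Algorithmic leveraging for least squares: sample r rows with
    # probabilities proportional to their leverage scores, rescale the
    # sampled rows by 1/sqrt(r * p_i), and solve the small subproblem.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    p = leverage_scores(X)
    p = p / p.sum()
    idx = rng.choice(n, size=r, replace=True, p=p)
    w = 1.0 / np.sqrt(r * p[idx])      # importance-sampling rescaling
    beta, *_ = np.linalg.lstsq(w[:, None] * X[idx], w * y[idx], rcond=None)
    return beta
```

Only the r sampled rows enter the final solve, although computing the exact scores still requires a pass over the full matrix.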
In this paper, we provide a simple yet effective framework to evaluate the statistical properties of algorithmic leveraging in the context of estimating parameters in a linear regression model with a fixed number of predictors. In particular, for several versions of leverage-based sampling, we derive results for the bias and variance, both conditional and unconditional on the observed data. We show that from the statistical perspective of bias and variance, neither leverage-based sampling nor uniform sampling dominates the other. This result is particularly striking, given the well-known result that, from the algorithmic perspective of worst-case analysis, leverage-based sampling provides uniformly superior worst-case algorithmic results, when compared with uniform sampling.
Based on these theoretical results, we propose and analyze two new leveraging algorithms: one constructs a smaller least-squares problem with "shrinkage" leverage scores (SLEV), and the other solves a smaller and unweighted (or biased) least-squares problem (LEVUNW). A detailed empirical evaluation of existing leverage-based methods as well as these two new methods is carried out on both synthetic and real data sets. The empirical results indicate that our theory is a good predictor of practical performance of existing and new leverage-based algorithms and that the new algorithms achieve improved performance. For example, with the same computation reduction as in the original algorithmic leveraging approach, our proposed SLEV typically leads to improved biases and variances both unconditionally and conditionally (on the observed data), and our proposed LEVUNW typically yields improved unconditional biases and variances.
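A hedged sketch of the two variants, as we read them from the abstract: SLEV shrinks the leverage-based sampling probabilities toward the uniform distribution, and LEVUNW samples by leverage-based probabilities but solves the subproblem without the usual importance-sampling rescaling. The function names and the shrinkage parameter `alpha` are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def slev_probs(X, alpha=0.9):
    # SLEV-style probabilities: a convex combination of the normalized
    # leverage scores and the uniform distribution. Shrinking toward
    # uniform keeps any single p_i from becoming tiny, which limits the
    # variance inflation caused by the 1/p_i rescaling.
    n, d = X.shape
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)        # leverage scores; they sum to d
    return alpha * h / d + (1.0 - alpha) / n

def levunw_ls(X, y, r, seed=0):
    # LEVUNW-style estimator: rows are sampled with leverage-based
    # probabilities, but the subproblem is solved unweighted (no
    # 1/sqrt(r * p_i) rescaling), trading a bias for lower variance.
    rng = np.random.default_rng(seed)
    p = slev_probs(X, alpha=1.0)    # raw leverage probabilities here
    idx = rng.choice(X.shape[0], size=r, replace=True, p=p)
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return beta
```

Feeding `slev_probs` into the rescaled solver from the previous sketch gives the SLEV estimator; `levunw_ls` gives the unweighted variant.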
Citations
Journal Article
RandNLA: randomized numerical linear algebra
TL;DR: RandNLA is an interdisciplinary research area that exploits randomization as a computational resource to develop improved algorithms for large-scale linear algebra problems and promises a sound algorithmic and statistical foundation for modern large-scale data analysis.
Random design analysis of ridge regression
TL;DR: This work gives a simultaneous analysis of both the ordinary least squares estimator and the ridge regression estimator in the random design setting under mild assumptions on the covariate/response distributions.
Journal Article
Optimal Subsampling for Large Sample Logistic Regression.
HaiYing Wang, Rong Zhu, Ping Ma, et al.
TL;DR: A two-step algorithm is developed to efficiently approximate the maximum likelihood estimate in logistic regression, and optimal subsampling probabilities are derived that minimize the asymptotic mean squared error of the resulting estimator.
Journal Article
Information-Based Optimal Subdata Selection for Big Data Linear Regression
TL;DR: In many branches of science, large amounts of data are being produced, and many proven statistical methods are no longer applicable to extraordinarily large datasets due to computational limitations.
References
Journal Article
Bootstrap Methods: Another Look at the Jackknife
TL;DR: In this article, the authors discuss the problem of estimating the sampling distribution of a pre-specified random variable R(X, F) on the basis of the observed data x.
Book
Statistics for spatial data
Noel A. Cressie
TL;DR: This book surveys statistics for spatial data, covering geostatistics, lattice data and spatial models on lattices, and spatial point patterns.
Book
Applied Linear Regression
TL;DR: This book presents least-squares estimation for simple linear regression models, using scatterplot matrices to examine the data and the mean function.
Journal Article
A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation
Bradley Efron, Gail Gong
TL;DR: This paper reviews the nonparametric estimation of statistical error, mainly the bias and standard error of an estimator or the error rate of a prediction rule, at a relaxed mathematical level, omitting most proofs, regularity conditions, and technical details.