Sparse PCA: Optimal rates and adaptive estimation

doi:10.1214/13-AOS1178

Home
/
Papers
/
Sparse PCA: Optimal rates and adaptive estimation

Journal Article•DOI•

Sparse PCA: Optimal rates and adaptive estimation

T. Tony Cai¹, Zongming Ma¹, Yihong Wu•Institutions (1)

University of Pennsylvania¹

01 Dec 2013-Annals of Statistics (Institute of Mathematical Statistics)-Vol. 41, Iss: 6, pp 3074-3110

TL;DR: In this paper, the authors considered both minimax and adaptive estimation of the principal subspace in the high dimensional setting and established the optimal rates of convergence for estimating the subspace which are sharp with respect to all the parameters, thus providing a complete characterization of the difficulty of the estimation problem in terms of the convergence rate.

read less

Abstract: Principal component analysis (PCA) is one of the most commonly used statistical procedures with a wide range of applications. This paper considers both minimax and adaptive estimation of the principal subspace in the high dimensional setting. Under mild technical conditions, we first establish the optimal rates of convergence for estimating the principal subspace which are sharp with respect to all the parameters, thus providing a complete characterization of the difficulty of the estimation problem in term of the convergence rate. The lower bound is obtained by calculating the local metric entropy and an application of Fano’s lemma. The rate optimal estimator is constructed using aggregation, which, however, might not be computationally feasible. We then introduce an adaptive procedure for estimating the principal subspace which is fully data driven and can be computed efficiently. It is shown that the estimator attains the optimal rates of convergence simultaneously over a large collection of the parameter spaces. A key idea in our construction is a reduction scheme which reduces the sparse PCA problem to a high-dimensional multivariate regression problem. This method is potentially also useful for other related problems.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Journal Article•DOI•

A useful variant of the Davis--Kahan theorem for statisticians

[...]

Yi Yu¹, Tengyao Wang¹, Richard J. Samworth¹•Institutions (1)

University of Cambridge¹

01 Jun 2015-Biometrika

TL;DR: In this paper, the authors present a variant of the Davis-Kahan theorem that relies only on a population eigenvalue separation condition, making it more natural and convenient for direct application in statistical contexts, and provide an improvement in many cases to the usual bound.

...read moreread less

Abstract: SUMMARY The Davis–Kahan theorem is used in the analysis of many statistical procedures to bound the distance between subspaces spanned by population eigenvectors and their sample versions. It relies on an eigenvalue separation condition between certain population and sample eigenvalues. We present a variant of this result that depends only on a population eigenvalue separation condition, making it more natural and convenient for direct application in statistical contexts, and provide an improvement in many cases to the usual bound in the statistical literature. We also give an extension to situations where the matrices under study may be asymmetric or even non-square, and where interest is in the distance between subspaces spanned by corresponding singular vectors.

...read moreread less

436 citations

Posted Content•

Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees

[...]

Yudong Chen, Martin J. Wainwright

10 Sep 2015-arXiv: Statistics Theory

TL;DR: This work provides a simple set of conditions under which projected gradient descent, when given a suitable initialization, converges geometrically to a statistically useful solution to the factorized optimization problem with rank constraints.

...read moreread less

Abstract: Optimization problems with rank constraints arise in many applications, including matrix regression, structured PCA, matrix completion and matrix decomposition problems. An attractive heuristic for solving such problems is to factorize the low-rank matrix, and to run projected gradient descent on the nonconvex factorized optimization problem. The goal of this problem is to provide a general theoretical framework for understanding when such methods work well, and to characterize the nature of the resulting fixed point. We provide a simple set of conditions under which projected gradient descent, when given a suitable initialization, converges geometrically to a statistically useful solution. Our results are applicable even when the initial solution is outside any region of local convexity, and even when the problem is globally concave. Working in a non-asymptotic framework, we show that our conditions are satisfied for a wide range of concrete models, including matrix regression, structured PCA, matrix completion with real and quantized observations, matrix decomposition, and graph clustering problems. Simulation results show excellent agreement with the theoretical predictions.

...read moreread less

311 citations

Journal Article•

Truncated power method for sparse eigenvalue problems

[...]

Xiao-Tong Yuan¹, Tong Zhang¹•Institutions (1)

Rutgers University¹

01 Jan 2013-Journal of Machine Learning Research

TL;DR: In this paper, the authors proposed a truncated power method that can approximately solve the underlying nonconvex optimization problem of sparse eigenvalue problem, which is to extract dominant (largest) sparse Eigenvectors with at most k non-zero components.

...read moreread less

Abstract: This paper considers the sparse eigenvalue problem, which is to extract dominant (largest) sparse eigenvectors with at most k non-zero components. We propose a simple yet effective solution called truncated power method that can approximately solve the underlying nonconvex optimization problem. A strong sparse recovery result is proved for the truncated power method, and this theory is our key motivation for developing the new algorithm. The proposed method is tested on applications such as sparse principal component analysis and the densest k-subgraph problem. Extensive experiments on several synthetic and real-world data sets demonstrate the competitive empirical performance of our method.

...read moreread less

309 citations

Journal Article•DOI•

An overview of the estimation of large covariance and precision matrices

[...]

Jianqing Fan¹, Yuan Liao², Han Liu¹•Institutions (2)

Princeton University¹, University of Maryland, College Park²

01 Feb 2016-Econometrics Journal

TL;DR: In this article, the authors provide a selective review of several recent developments on the estimation of large covariance and precision matrices, focusing on two general approaches: a rank-based method and a factor-model based method.

...read moreread less

Abstract: Summary The estimation of large covariance and precision matrices is fundamental in modern multivariate analysis. However, problems arise from the statistical analysis of large panel economic and financial data. The covariance matrix reveals marginal correlations between variables, while the precision matrix encodes conditional correlations between pairs of variables given the remaining variables. In this paper, we provide a selective review of several recent developments on the estimation of large covariance and precision matrices. We focus on two general approaches: a rank-based method and a factor-model-based method. Theories and applications of both approaches are presented. These methods are expected to be widely applicable to the analysis of economic and financial data.

...read moreread less

291 citations

Proceedings Article•

Complexity Theoretic Lower Bounds for Sparse Principal Component Detection

[...]

Quentin Berthet¹, Philippe Rigollet¹•Institutions (1)

Princeton University¹

13 Jun 2013

TL;DR: The performance of a test is measured by the smallest signal strength that it can detect and a computationally efficient method based on semidefinite programming is proposed and it is proved that the statistical performance of this test cannot be strictly improved by any computationallyefficient method.

...read moreread less

Abstract: In the context of sparse principal component detection, we bring evidence towards the existence of a statistical price to pay for computational efficiency. We measure the performance of a test by the smallest signal strength that it can detect and we propose a computationally efficient method based on semidefinite programming.We also prove that the statistical performance of this test cannot be strictly improved by any computationally efficient method. Our results can be viewed as complexity theoretic lower bounds conditionally on the assumptions that some instances of the planted clique problem cannot be solved in randomized polynomial time.

...read moreread less

275 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61

Collapse

References

PDF

Open Access

More filters

Book•

Elements of information theory

[...]

Thomas M. Cover¹, Joy A. Thomas²•Institutions (2)

Stanford University¹, IBM²

01 Jan 1991

TL;DR: The author examines the role of entropy, inequality, and randomness in the design of codes and the construction of codes in the rapidly changing environment.

...read moreread less

Abstract: Preface to the Second Edition. Preface to the First Edition. Acknowledgments for the Second Edition. Acknowledgments for the First Edition. 1. Introduction and Preview. 1.1 Preview of the Book. 2. Entropy, Relative Entropy, and Mutual Information. 2.1 Entropy. 2.2 Joint Entropy and Conditional Entropy. 2.3 Relative Entropy and Mutual Information. 2.4 Relationship Between Entropy and Mutual Information. 2.5 Chain Rules for Entropy, Relative Entropy, and Mutual Information. 2.6 Jensen's Inequality and Its Consequences. 2.7 Log Sum Inequality and Its Applications. 2.8 Data-Processing Inequality. 2.9 Sufficient Statistics. 2.10 Fano's Inequality. Summary. Problems. Historical Notes. 3. Asymptotic Equipartition Property. 3.1 Asymptotic Equipartition Property Theorem. 3.2 Consequences of the AEP: Data Compression. 3.3 High-Probability Sets and the Typical Set. Summary. Problems. Historical Notes. 4. Entropy Rates of a Stochastic Process. 4.1 Markov Chains. 4.2 Entropy Rate. 4.3 Example: Entropy Rate of a Random Walk on a Weighted Graph. 4.4 Second Law of Thermodynamics. 4.5 Functions of Markov Chains. Summary. Problems. Historical Notes. 5. Data Compression. 5.1 Examples of Codes. 5.2 Kraft Inequality. 5.3 Optimal Codes. 5.4 Bounds on the Optimal Code Length. 5.5 Kraft Inequality for Uniquely Decodable Codes. 5.6 Huffman Codes. 5.7 Some Comments on Huffman Codes. 5.8 Optimality of Huffman Codes. 5.9 Shannon-Fano-Elias Coding. 5.10 Competitive Optimality of the Shannon Code. 5.11 Generation of Discrete Distributions from Fair Coins. Summary. Problems. Historical Notes. 6. Gambling and Data Compression. 6.1 The Horse Race. 6.2 Gambling and Side Information. 6.3 Dependent Horse Races and Entropy Rate. 6.4 The Entropy of English. 6.5 Data Compression and Gambling. 6.6 Gambling Estimate of the Entropy of English. Summary. Problems. Historical Notes. 7. Channel Capacity. 7.1 Examples of Channel Capacity. 7.2 Symmetric Channels. 7.3 Properties of Channel Capacity. 7.4 Preview of the Channel Coding Theorem. 7.5 Definitions. 7.6 Jointly Typical Sequences. 7.7 Channel Coding Theorem. 7.8 Zero-Error Codes. 7.9 Fano's Inequality and the Converse to the Coding Theorem. 7.10 Equality in the Converse to the Channel Coding Theorem. 7.11 Hamming Codes. 7.12 Feedback Capacity. 7.13 Source-Channel Separation Theorem. Summary. Problems. Historical Notes. 8. Differential Entropy. 8.1 Definitions. 8.2 AEP for Continuous Random Variables. 8.3 Relation of Differential Entropy to Discrete Entropy. 8.4 Joint and Conditional Differential Entropy. 8.5 Relative Entropy and Mutual Information. 8.6 Properties of Differential Entropy, Relative Entropy, and Mutual Information. Summary. Problems. Historical Notes. 9. Gaussian Channel. 9.1 Gaussian Channel: Definitions. 9.2 Converse to the Coding Theorem for Gaussian Channels. 9.3 Bandlimited Channels. 9.4 Parallel Gaussian Channels. 9.5 Channels with Colored Gaussian Noise. 9.6 Gaussian Channels with Feedback. Summary. Problems. Historical Notes. 10. Rate Distortion Theory. 10.1 Quantization. 10.2 Definitions. 10.3 Calculation of the Rate Distortion Function. 10.4 Converse to the Rate Distortion Theorem. 10.5 Achievability of the Rate Distortion Function. 10.6 Strongly Typical Sequences and Rate Distortion. 10.7 Characterization of the Rate Distortion Function. 10.8 Computation of Channel Capacity and the Rate Distortion Function. Summary. Problems. Historical Notes. 11. Information Theory and Statistics. 11.1 Method of Types. 11.2 Law of Large Numbers. 11.3 Universal Source Coding. 11.4 Large Deviation Theory. 11.5 Examples of Sanov's Theorem. 11.6 Conditional Limit Theorem. 11.7 Hypothesis Testing. 11.8 Chernoff-Stein Lemma. 11.9 Chernoff Information. 11.10 Fisher Information and the Cram-er-Rao Inequality. Summary. Problems. Historical Notes. 12. Maximum Entropy. 12.1 Maximum Entropy Distributions. 12.2 Examples. 12.3 Anomalous Maximum Entropy Problem. 12.4 Spectrum Estimation. 12.5 Entropy Rates of a Gaussian Process. 12.6 Burg's Maximum Entropy Theorem. Summary. Problems. Historical Notes. 13. Universal Source Coding. 13.1 Universal Codes and Channel Capacity. 13.2 Universal Coding for Binary Sequences. 13.3 Arithmetic Coding. 13.4 Lempel-Ziv Coding. 13.5 Optimality of Lempel-Ziv Algorithms. Compression. Summary. Problems. Historical Notes. 14. Kolmogorov Complexity. 14.1 Models of Computation. 14.2 Kolmogorov Complexity: Definitions and Examples. 14.3 Kolmogorov Complexity and Entropy. 14.4 Kolmogorov Complexity of Integers. 14.5 Algorithmically Random and Incompressible Sequences. 14.6 Universal Probability. 14.7 Kolmogorov complexity. 14.9 Universal Gambling. 14.10 Occam's Razor. 14.11 Kolmogorov Complexity and Universal Probability. 14.12 Kolmogorov Sufficient Statistic. 14.13 Minimum Description Length Principle. Summary. Problems. Historical Notes. 15. Network Information Theory. 15.1 Gaussian Multiple-User Channels. 15.2 Jointly Typical Sequences. 15.3 Multiple-Access Channel. 15.4 Encoding of Correlated Sources. 15.5 Duality Between Slepian-Wolf Encoding and Multiple-Access Channels. 15.6 Broadcast Channel. 15.7 Relay Channel. 15.8 Source Coding with Side Information. 15.9 Rate Distortion with Side Information. 15.10 General Multiterminal Networks. Summary. Problems. Historical Notes. 16. Information Theory and Portfolio Theory. 16.1 The Stock Market: Some Definitions. 16.2 Kuhn-Tucker Characterization of the Log-Optimal Portfolio. 16.3 Asymptotic Optimality of the Log-Optimal Portfolio. 16.4 Side Information and the Growth Rate. 16.5 Investment in Stationary Markets. 16.6 Competitive Optimality of the Log-Optimal Portfolio. 16.7 Universal Portfolios. 16.8 Shannon-McMillan-Breiman Theorem (General AEP). Summary. Problems. Historical Notes. 17. Inequalities in Information Theory. 17.1 Basic Inequalities of Information Theory. 17.2 Differential Entropy. 17.3 Bounds on Entropy and Relative Entropy. 17.4 Inequalities for Types. 17.5 Combinatorial Bounds on Entropy. 17.6 Entropy Rates of Subsets. 17.7 Entropy and Fisher Information. 17.8 Entropy Power Inequality and Brunn-Minkowski Inequality. 17.9 Inequalities for Determinants. 17.10 Inequalities for Ratios of Determinants. Summary. Problems. Historical Notes. Bibliography. List of Symbols. Index.

...read moreread less

45,034 citations

Book•

Matrix Analysis

[...]

Roger A. Horn¹, Charles R. Johnson²•Institutions (2)

Johns Hopkins University¹, Clemson University²

01 Jan 1985

TL;DR: In this article, the authors present results of both classic and recent matrix analyses using canonical forms as a unifying theme, and demonstrate their importance in a variety of applications, such as linear algebra and matrix theory.

...read moreread less

Abstract: Linear algebra and matrix theory are fundamental tools in mathematical and physical science, as well as fertile fields for research. This new edition of the acclaimed text presents results of both classic and recent matrix analyses using canonical forms as a unifying theme, and demonstrates their importance in a variety of applications. The authors have thoroughly revised, updated, and expanded on the first edition. The book opens with an extended summary of useful concepts and facts and includes numerous new topics and features, such as: - New sections on the singular value and CS decompositions - New applications of the Jordan canonical form - A new section on the Weyr canonical form - Expanded treatments of inverse problems and of block matrices - A central role for the Von Neumann trace theorem - A new appendix with a modern list of canonical forms for a pair of Hermitian matrices and for a symmetric-skew symmetric pair - Expanded index with more than 3,500 entries for easy reference - More than 1,100 problems and exercises, many with hints, to reinforce understanding and develop auxiliary themes such as finite-dimensional quantum systems, the compound and adjugate matrices, and the Loewner ellipsoid - A new appendix provides a collection of problem-solving hints.

...read moreread less

23,986 citations

Book•

An Introduction to Multivariate Statistical Analysis

[...]

T. W. Anderson

14 Sep 1984

TL;DR: In this article, the distribution of the Mean Vector and the Covariance Matrix and the Generalized T2-Statistic is analyzed. But the distribution is not shown to be independent of sets of Variates.

...read moreread less

Abstract: Preface to the Third Edition.Preface to the Second Edition.Preface to the First Edition.1. Introduction.2. The Multivariate Normal Distribution.3. Estimation of the Mean Vector and the Covariance Matrix.4. The Distributions and Uses of Sample Correlation Coefficients.5. The Generalized T2-Statistic.6. Classification of Observations.7. The Distribution of the Sample Covariance Matrix and the Sample Generalized Variance.8. Testing the General Linear Hypothesis: Multivariate Analysis of Variance9. Testing Independence of Sets of Variates.10. Testing Hypotheses of Equality of Covariance Matrices and Equality of Mean Vectors and Covariance Matrices.11. Principal Components.12. Cononical Correlations and Cononical Variables.13. The Distributions of Characteristic Roots and Vectors.14. Factor Analysis.15. Pattern of Dependence Graphical Models.Appendix A: Matrix Theory.Appendix B: Tables.References.Index.

...read moreread less

9,693 citations

Journal Article•DOI•

Introduction to Multivariate Statistical Analysis.

[...]

William G. Madow, T. W. Anderson

01 May 1959-American Mathematical Monthly

7,408 citations

Journal Article•DOI•

Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization

[...]

Benjamin Recht¹, Maryam Fazel, Pablo A. Parrilo•Institutions (1)

University of Wisconsin-Madison¹

01 Aug 2010-Siam Review

TL;DR: It is shown that if a certain restricted isometry property holds for the linear transformation defining the constraints, the minimum-rank solution can be recovered by solving a convex optimization problem, namely, the minimization of the nuclear norm over the given affine space.

...read moreread less

Abstract: The affine rank minimization problem consists of finding a matrix of minimum rank that satisfies a given system of linear equality constraints. Such problems have appeared in the literature of a diverse set of fields including system identification and control, Euclidean embedding, and collaborative filtering. Although specific instances can often be solved with specialized algorithms, the general affine rank minimization problem is NP-hard because it contains vector cardinality minimization as a special case. In this paper, we show that if a certain restricted isometry property holds for the linear transformation defining the constraints, the minimum-rank solution can be recovered by solving a convex optimization problem, namely, the minimization of the nuclear norm over the given affine space. We present several random ensembles of equations where the restricted isometry property holds with overwhelming probability, provided the codimension of the subspace is sufficiently large. The techniques used in our analysis have strong parallels in the compressed sensing framework. We discuss how affine rank minimization generalizes this preexisting concept and outline a dictionary relating concepts from cardinality minimization to those of rank minimization. We also discuss several algorithmic approaches to minimizing the nuclear norm and illustrate our results with numerical examples.

...read moreread less

3,432 citations