
Journal ArticleDOI

Penalized Composite Quasi-Likelihood for Ultrahigh-Dimensional Variable Selection

01 Jun 2011 - Journal of the Royal Statistical Society: Series B (Statistical Methodology) (NIH Public Access) - Vol. 73, Iss: 3, pp 325-349

TL;DR: A data-driven weighted linear combination of convex loss functions, together with a weighted L1-penalty, is proposed, and a strong oracle property of the proposed method is established: it has both model selection consistency and estimation efficiency for the true non-zero coefficients.

Abstract: In high-dimensional model selection problems, penalized least-squares approaches have been extensively used. This paper addresses the question of both robustness and efficiency of penalized model selection methods and proposes a data-driven weighted linear combination of convex loss functions, together with a weighted L1-penalty. It is completely data-adaptive and does not require prior knowledge of the error distribution. The weighted L1-penalty is used both to ensure the convexity of the penalty term and to ameliorate the bias caused by the L1-penalty. In the setting with dimensionality much larger than the sample size, we establish a strong oracle property of the proposed method, which possesses both model selection consistency and estimation efficiency for the true non-zero coefficients. As specific examples, we introduce a robust composite L1-L2 method and an optimal composite quantile method, and evaluate their performance in both simulated and real data examples.
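The sketch below is a minimal illustration of the kind of objective the abstract describes: a weighted combination of L1 and L2 losses plus a weighted L1 penalty. The loss weights w1, w2 and the penalty weights gamma are fixed placeholders here (in the paper they are chosen in a data-driven way), and the problem is handed to the generic convex solver CVXPY rather than the coordinate-descent or LARS algorithms discussed in the paper.

```python
# Minimal sketch (not the authors' implementation) of a composite L1-L2
# penalized objective with a weighted L1 penalty, solved as a convex program.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p = 100, 200                          # p >> n to mimic the high-dimensional setting
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.standard_t(df=3, size=n)   # heavy-tailed errors

w1, w2 = 0.5, 0.5                        # loss-combination weights (placeholders)
gamma = np.full(p, 0.5)                  # per-coefficient penalty weights (placeholders)

beta = cp.Variable(p)
resid = y - X @ beta
objective = (w1 * cp.sum(cp.abs(resid))                    # L1 (robust) loss component
             + w2 * cp.sum_squares(resid)                  # L2 loss component
             + n * cp.sum(cp.multiply(gamma, cp.abs(beta))))  # weighted L1 penalty
cp.Problem(cp.Minimize(objective)).solve()

print("largest estimated coefficients:", np.argsort(-np.abs(beta.value))[:5])
```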

Topics: Model selection (57%), Weight function (55%), Penalty method (54%), Feature selection (53%), Robustness (computer science) (52%)



Citations

Journal Article
Abstract: High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discoveries. The traditional idea of best subset selection methods, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality. They have been widely applied for simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of the recent developments of theory, methods, and implementations for high dimensional variable selection. Questions of what limits of dimensionality such methods can handle, what the role of penalty functions is, and what their statistical properties are rapidly drive the advances of the field. The properties of non-concave penalized likelihood and its roles in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods.

813 citations


Journal ArticleDOI
TL;DR: This work proposes adaptive penalization methods for variable selection in the semiparametric varying-coefficient partially linear model and proves that the methods possess the oracle property.
Abstract: The complexity of semiparametric models poses new challenges to statistical inference and model selection that frequently arise from real applications. In this work, we propose new estimation and variable selection procedures for the semiparametric varying-coefficient partially linear model. We first study quantile regression estimates for the nonparametric varying-coefficient functions and the parametric regression coefficients. To achieve nice efficiency properties, we further develop a semiparametric composite quantile regression procedure. We establish the asymptotic normality of the proposed estimators for both the parametric and nonparametric parts and show that the estimators achieve the best convergence rate. Moreover, we show that the proposed method is much more efficient than the least-squares-based method for many non-normal errors and that it only loses a small amount of efficiency for normal errors. In addition, it is shown that the loss in efficiency is at most 11.1% for estimating varying coefficient functions and is no greater than 13.6% for estimating parametric components. To achieve sparsity with high-dimensional covariates, we propose adaptive penalization methods for variable selection in the semiparametric varying-coefficient partially linear model and prove that the methods possess the oracle property. Extensive Monte Carlo simulation studies are conducted to examine the finite-sample performance of the proposed procedures. Finally, we apply the new methods to analyze the plasma beta-carotene level data.
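As a rough illustration of the composite quantile regression (CQR) idea used above, the sketch below fits a plain linear model by minimizing the check loss summed over several quantile levels, with one intercept per level. It deliberately drops the varying-coefficient and penalization parts; the quantile levels, design, and error distribution are illustrative assumptions.

```python
# Minimal sketch of a composite quantile regression objective for a linear
# model -- a simplification of the semiparametric setting described above.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n, p, K = 200, 5, 9
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 0.0])
y = X @ beta_true + rng.standard_t(df=3, size=n)

taus = np.arange(1, K + 1) / (K + 1)     # equally spaced quantile levels

beta = cp.Variable(p)
b = cp.Variable(K)                        # one intercept per quantile level
loss = 0
for k, tau in enumerate(taus):
    u = y - X @ beta - b[k]
    # check loss rho_tau(u) = 0.5*|u| + (tau - 0.5)*u, summed over observations
    loss += cp.sum(0.5 * cp.abs(u) + (tau - 0.5) * u)

cp.Problem(cp.Minimize(loss)).solve()
print("CQR slope estimate:", np.round(beta.value, 3))
```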

245 citations


Cites methods from "Penalized Composite Quasi-Likelihoo..."

  • ...Moreover, we show that the proposed method is much more efficient than the least-squares-based method for many non-normal errors and that it only loses a small amount of efficiency for normal errors....

    [...]


Journal ArticleDOI
TL;DR: This paper reviews the literature on sparse high-dimensional models and discusses some applications in economics and finance, highlighting variable selection methods that have proved effective in high-dimensional sparse modeling.
Abstract: This article reviews the literature on sparse high-dimensional models and discusses some applications in economics and finance. Recent developments in theory, methods, and implementations in penalized least-squares and penalized likelihood methods are highlighted. These variable selection methods are effective in sparse high-dimensional modeling. The limits of dimensionality that regularization methods can handle, the role of penalty functions, and their statistical properties are detailed. Some recent advances in sparse ultra-high-dimensional modeling are also briefly discussed.

201 citations


Journal ArticleDOI
Abstract: Multiple hypothesis testing is a fundamental problem in high-dimensional inference, with wide applications in many scientific fields. In genome-wide association studies, tens of thousands of tests are performed simultaneously to find if any single-nucleotide polymorphisms (SNPs) are associated with some traits and those tests are correlated. When test statistics are correlated, false discovery control becomes very challenging under arbitrary dependence. In this article, we propose a novel method—based on principal factor approximation—that successfully subtracts the common dependence and weakens significantly the correlation structure, to deal with an arbitrary dependence structure. We derive an approximate expression for false discovery proportion (FDP) in large-scale multiple testing when a common threshold is used and provide a consistent estimate of realized FDP. This result has important applications in controlling false discovery rate and FDP. Our estimate of realized FDP compares favorably with Efron (2007)'s approach.

171 citations


Posted Content
TL;DR: An approximate expression for false discovery proportion (FDP) in large-scale multiple testing when a common threshold is used and a consistent estimate of realized FDP is provided, which has important applications in controlling false discovery rate and FDP.
Abstract: Multiple hypothesis testing is a fundamental problem in high dimensional inference, with wide applications in many scientific fields. In genome-wide association studies, tens of thousands of tests are performed simultaneously to find if any SNPs are associated with some traits and those tests are correlated. When test statistics are correlated, false discovery control becomes very challenging under arbitrary dependence. In the current paper, we propose a novel method based on principal factor approximation, which successfully subtracts the common dependence and weakens significantly the correlation structure, to deal with an arbitrary dependence structure. We derive an approximate expression for false discovery proportion (FDP) in large scale multiple testing when a common threshold is used and provide a consistent estimate of realized FDP. This result has important applications in controlling FDR and FDP. Our estimate of realized FDP compares favorably with Efron (2007)'s approach, as demonstrated in the simulated examples. Our approach is further illustrated by some real data applications. We also propose a dependence-adjusted procedure, which is more powerful than the fixed threshold procedure.
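The toy simulation below is not the authors' estimator; it only illustrates the intuition behind principal factor approximation under an assumed one-factor, equi-correlated null model. For equi-correlation the leading factor direction is the constant vector, so the common factor can be estimated by averaging the z-statistics and then subtracted, which weakens the dependence among the adjusted statistics.

```python
# Toy illustration (not the authors' method) of subtracting a common factor
# from correlated null z-statistics.
import numpy as np

rng = np.random.default_rng(2)
m, rho = 2000, 0.7                       # number of tests, equi-correlation
W = rng.standard_normal()                # one common latent factor
eps = rng.standard_normal(m)
Z = np.sqrt(rho) * W + np.sqrt(1 - rho) * eps    # correlated null z-statistics

# Under equi-correlation the top eigenvector of the correlation matrix is the
# constant vector, so the factor score is estimated by the mean z-statistic.
W_hat = Z.mean() / np.sqrt(rho)
Z_adj = (Z - np.sqrt(rho) * W_hat) / np.sqrt(1 - rho)   # approximately independent

# All hypotheses are null here, so every rejection is a false discovery.
# Across repeated realizations of W, the count before adjustment is highly
# variable, while after adjustment it concentrates near the nominal level.
thresh = 2.58                            # two-sided level ~0.01
print("rejections before adjustment:", int(np.sum(np.abs(Z) > thresh)))
print("rejections after adjustment: ", int(np.sum(np.abs(Z_adj) > thresh)))
```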

152 citations


References

Journal ArticleDOI
TL;DR: A new method for estimation in linear models called the lasso, which minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, is proposed.
Abstract: We propose a new method for estimation in linear models. The 'lasso' minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.
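A quick usage sketch of the lasso with scikit-learn (an off-the-shelf implementation, not software from the paper). The regularization strength `alpha` plays the role of the constraint on the sum of absolute coefficients; the data here are simulated placeholders.

```python
# Lasso fit with a fixed penalty and with a cross-validated penalty.
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(3)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3, -2, 1.5, 1, -1]
y = X @ beta_true + rng.standard_normal(n)

fit = Lasso(alpha=0.1).fit(X, y)          # fixed regularization strength
fit_cv = LassoCV(cv=5).fit(X, y)          # strength chosen by cross-validation
print("nonzero coefficients (fixed alpha):", np.flatnonzero(fit.coef_).size)
print("alpha selected by CV:", fit_cv.alpha_)
```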

36,018 citations


Journal ArticleDOI
TL;DR: A publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates is described.
Abstract: The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived: (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps us understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm. (3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a Cp estimate of prediction error; this allows a principled choice among the range of possible LARS estimates. LARS and its variants are computationally efficient: the paper describes a publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates.
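The sketch below computes the full coefficient path with scikit-learn's `lars_path`, used here as a stand-in implementation of the LARS/Lasso algorithm family the abstract describes; the dataset and settings are illustrative.

```python
# LARS coefficient path; method="lasso" gives the Lasso modification of LARS
# mentioned in the abstract.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(4)
n, p = 100, 20
X = rng.standard_normal((n, p))
y = 2 * X[:, 0] - X[:, 1] + rng.standard_normal(n)

alphas, active, coefs = lars_path(X, y, method="lasso")
print("number of breakpoints along the path:", len(alphas))
print("order in which variables enter:", active[:5])
```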

7,274 citations


"Penalized Composite Quasi-Likelihoo..." refers background or methods in this paper

  • ...…(16) can be recast as a penalized weighted least-squares regression
    $$\arg\min_{\beta}\ \sum_{i=1}^{n}\Bigl\{ w_1\bigl|Y_i - X_i^{T}\hat\beta^{(0)}\bigr| + w_2\bigl(Y_i - X_i^{T}\beta\bigr)^{2}\Bigr\} + n\sum_{j=1}^{p}\gamma_\lambda\bigl(|\beta_j^{(0)}|\bigr)\,|\beta_j|,$$
    which can be efficiently solved by pathwise coordinate optimization (Friedman et al., 2008) or least angle regression (Efron et al., 2004)....

    [...]

  • ...) are all nonnegative. This class of problems can be solved with fast and efficient computational algorithms such as pathwise coordinate optimization (Friedman et al., 2008) and least angle regression (Efron et al., 2004). One particular example is the combination of $L_1$ and $L_2$ regressions, in which $K = 2$, $\rho_1(t) = |t - b_0|$ and $\rho_2(t) = t^2$. Here $b_0$ denotes the median of the error distribution $\varepsilon$. If the error distribution is sym...

    [...]

  • ...$\sum_{i=1}^{n} w_1\bigl|Y_i - X_i^{T}\hat\beta^{(0)}\bigr| + w_2\bigl(Y_i - X_i^{T}\beta\bigr)^{2} + n\sum_{j=1}^{p}\gamma_\lambda\bigl(|\beta_j^{(0)}|\bigr)|\beta_j|$, which can be efficiently solved by pathwise coordinate optimization (Friedman et al., 2008) or least angle regression (Efron et al., 2004). If $b_0 \neq 0$, the penalized least-squares problem (16) is somewhat different from (5) since we have an additional parameter $b_0$. Using the same arguments, and treating $b_0$ as an additional parameter ...

    [...]

  • ...This class of problems can be solved with fast and efficient computational algorithms such as pathwise coordinate optimization (Friedman et al., 2008) and least angle regression (Efron et al., 2004)....

    [...]


Journal ArticleDOI
TL;DR: In this article, penalized likelihood approaches are proposed to handle variable selection problems, and it is shown that the newly proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well as if the correct submodel were known.
Abstract: Variable selection is fundamental to high-dimensional statistical modeling, including nonparametric regression. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. In this article, penalized likelihood approaches are proposed to handle these kinds of problems. The proposed methods select variables and estimate coefficients simultaneously. Hence they enable us to construct confidence intervals for estimated parameters. The proposed approaches are distinguished from others in that the penalty functions are symmetric, nonconcave on (0, ∞), and have singularities at the origin to produce sparse solutions. Furthermore, the penalty functions should be bounded by a constant to reduce bias and satisfy certain conditions to yield continuous solutions. A new algorithm is proposed for optimizing penalized likelihood functions. The proposed ideas are widely applicable. They are readily applied to a variety of ...
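As an illustration of the nonconcave penalty family described above, the sketch below implements the smoothly clipped absolute deviation (SCAD) penalty and its derivative; treating SCAD as the concrete example for this reference, and the constant a = 3.7 as the commonly used default, are assumptions made here for illustration.

```python
# SCAD penalty: behaves like the L1 penalty near zero, then tapers off and is
# bounded by a constant, which reduces bias for large coefficients.
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lambda(|t|), applied elementwise."""
    t = np.abs(np.asarray(t, dtype=float))
    small = t <= lam
    mid = (t > lam) & (t <= a * lam)
    out = np.empty_like(t)
    out[small] = lam * t[small]
    out[mid] = (2 * a * lam * t[mid] - t[mid] ** 2 - lam ** 2) / (2 * (a - 1))
    out[~small & ~mid] = lam ** 2 * (a + 1) / 2          # constant for |t| > a*lam
    return out

def scad_derivative(t, lam, a=3.7):
    """p'_lambda(|t|): equal to lam near zero, decreasing to 0 beyond a*lam."""
    t = np.abs(np.asarray(t, dtype=float))
    return lam * np.where(t <= lam, 1.0, np.maximum(a * lam - t, 0) / ((a - 1) * lam))

print(scad_penalty([0.5, 2.0, 10.0], lam=1.0))   # grows, then flattens at lam^2*(a+1)/2
```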

7,149 citations


Journal ArticleDOI
Hui Zou1
TL;DR: A new version of the lasso is proposed, called the adaptive lasso, where adaptive weights are used for penalizing different coefficients in the ℓ1 penalty, and the nonnegative garotte is shown to be consistent for variable selection.
Abstract: The lasso is a popular technique for simultaneous estimation and variable selection. Lasso variable selection has been shown to be consistent under certain conditions. In this work we derive a necessary condition for the lasso variable selection to be consistent. Consequently, there exist certain scenarios where the lasso is inconsistent for variable selection. We then propose a new version of the lasso, called the adaptive lasso, where adaptive weights are used for penalizing different coefficients in the l1 penalty. We show that the adaptive lasso enjoys the oracle properties; namely, it performs as well as if the true underlying model were given in advance. Similar to the lasso, the adaptive lasso is shown to be near-minimax optimal. Furthermore, the adaptive lasso can be solved by the same efficient algorithm for solving the lasso. We also discuss the extension of the adaptive lasso in generalized linear models and show that the oracle properties still hold under mild regularity conditions. As a byproduct of our theory, the nonnegative garotte is shown to be consistent for variable selection.
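A common way to fit the adaptive lasso with ordinary lasso software is to rescale each column of the design by its adaptive weight and then undo the scaling on the estimated coefficients. The sketch below uses a ridge fit to build the initial weights, which is one reasonable choice rather than a prescription from the article, and scikit-learn as the solver.

```python
# Adaptive lasso via the column-rescaling trick: minimizing
# ||y - X beta||^2 + lam * sum_j w_j |beta_j| is equivalent to an ordinary
# lasso in theta after substituting beta_j = theta_j / w_j.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
n, p = 100, 30
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:4] = [2, -1.5, 1, 0.5]
y = X @ beta_true + rng.standard_normal(n)

beta_init = Ridge(alpha=1.0).fit(X, y).coef_          # initial estimate (assumed choice)
w = 1.0 / (np.abs(beta_init) + 1e-6)                  # adaptive weights ~ 1/|beta_init|
X_scaled = X / w                                      # column j divided by w_j

fit = Lasso(alpha=0.05).fit(X_scaled, y)
beta_adaptive = fit.coef_ / w                         # map back to the original scale
print("selected variables:", np.flatnonzero(np.abs(beta_adaptive) > 1e-8))
```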

5,810 citations


Book
14 Mar 1996
Abstract: 1.1. Introduction.- 1.2. Outer Integrals and Measurable Majorants.- 1.3. Weak Convergence.- 1.4. Product Spaces.- 1.5. Spaces of Bounded Functions.- 1.6. Spaces of Locally Bounded Functions.- 1.7. The Ball Sigma-Field and Measurability of Suprema.- 1.8. Hilbert Spaces.- 1.9. Convergence: Almost Surely and in Probability.- 1.10. Convergence: Weak, Almost Uniform, and in Probability.- 1.11. Refinements.- 1.12. Uniformity and Metrization.- 2.1. Introduction.- 2.2. Maximal Inequalities and Covering Numbers.- 2.3. Symmetrization and Measurability.- 2.4. Glivenko-Cantelli Theorems.- 2.5. Donsker Theorems.- 2.6. Uniform Entropy Numbers.- 2.7. Bracketing Numbers.- 2.8. Uniformity in the Underlying Distribution.- 2.9. Multiplier Central Limit Theorems.- 2.10. Permanence of the Donsker Property.- 2.11. The Central Limit Theorem for Processes.- 2.12. Partial-Sum Processes.- 2.13. Other Donsker Classes.- 2.14. Tail Bounds.- 3.1. Introduction.- 3.2. M-Estimators.- 3.3. Z-Estimators.- 3.4. Rates of Convergence.- 3.5. Random Sample Size, Poissonization and Kac Processes.- 3.6. The Bootstrap.- 3.7. The Two-Sample Problem.- 3.8. Independence Empirical Processes.- 3.9. The Delta-Method.- 3.10. Contiguity.- 3.11. Convolution and Minimax Theorems.- A. Appendix.- A.1. Inequalities.- A.2. Gaussian Processes.- A.2.1. Inequalities and Gaussian Comparison.- A.2.2. Exponential Bounds.- A.2.3. Majorizing Measures.- A.2.4. Further Results.- A.3. Rademacher Processes.- A.4. Isoperimetric Inequalities for Product Measures.- A.5. Some Limit Theorems.- A.6. More Inequalities.- A.6.1. Binomial Random Variables.- A.6.2. Multinomial Random Vectors.- A.6.3. Rademacher Sums.- Notes.- References.- Author Index.- List of Symbols.

5,229 citations