Journal ArticleDOI

Penalized Composite Quasi-Likelihood for Ultrahigh-Dimensional Variable Selection

TL;DR: A data-driven weighted linear combination of convex loss functions, together with a weighted L1-penalty, is proposed, and a strong oracle property is established showing that the proposed method achieves both model selection consistency and estimation efficiency for the true non-zero coefficients.
Abstract: In high-dimensional model selection problems, penalized least-squares approaches have been extensively used. This paper addresses the question of both robustness and efficiency of penalized model selection methods, and proposes a data-driven weighted linear combination of convex loss functions, together with a weighted L1-penalty. It is completely data-adaptive and does not require prior knowledge of the error distribution. The weighted L1-penalty is used both to ensure the convexity of the penalty term and to ameliorate the bias caused by the L1-penalty. In the setting with dimensionality much larger than the sample size, we establish a strong oracle property of the proposed method that possesses both model selection consistency and estimation efficiency for the true non-zero coefficients. As specific examples, we introduce a robust composite L1-L2 method and an optimal composite quantile method, and evaluate their performance in both simulated and real data examples.
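
To make the composite-loss idea concrete, here is a minimal sketch (not the authors' implementation) of a composite L1-L2 loss with a weighted L1 penalty, written with cvxpy; the loss weights w1 and w2, the regularization level lam, and the per-coefficient penalty weights gamma are illustrative placeholders, whereas in the paper they are chosen in a data-driven way.

```python
# Minimal sketch of a penalized composite L1-L2 regression, assuming cvxpy is
# available. w1, w2, lam and gamma are placeholder values, not the paper's
# data-driven choices.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p = 100, 300
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3, -2, 1.5, 1, -1]
y = X @ beta_true + rng.standard_t(df=3, size=n)   # heavy-tailed errors

w1, w2 = 0.5, 0.5          # weights on the L1 and L2 loss components
lam = 0.1                  # regularization level
gamma = np.ones(p)         # adaptive penalty weights (placeholder: all ones)

beta = cp.Variable(p)
resid = y - X @ beta
loss = w1 * cp.sum(cp.abs(resid)) + w2 * cp.sum_squares(resid)
penalty = n * lam * cp.sum(cp.multiply(gamma, cp.abs(beta)))
cp.Problem(cp.Minimize(loss + penalty)).solve()
print("number of selected coefficients:", int(np.sum(np.abs(beta.value) > 1e-6)))
```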


Citations
Journal ArticleDOI
TL;DR: In this paper, the authors consider a model-averaged quantile estimator as a computationally cheaper alternative to the least-squares estimator in linear models and derive its asymptotic properties in high-dimensional linear models.

8 citations


Cites background or methods from "Penalized Composite Quasi-Likelihoo..."

  • ...Bradic et al. (2011) consider a weighted penalized CQR estimator and its oracle properties when the error distribution is unknown....

    [...]

  • ...In Bradic et al. (2011), the optimal value of the weights is given by ν = A⁻¹f to achieve the lower bound for the variance, (fᵀA⁻¹f)⁻¹. While such weights may be negative and thus lead to a nonconvex objective function that is hard to optimize, an alternative weight vector ν⁺ is obtained by minimizing νᵀAν subject to having all weights nonnegative and fᵀν = 1. There is no explicit expression for the nonnegative optimal weights ν⁺. The authors show by simulations that both types of optimal weights outperform the equally-weighted estimator. Noteworthy, Bradic et al. (2011) comment upon the computational complexity of the composite quantile estimation method with a large number of quantiles, but report that usually k ≤ 10 suffices....

    [...]

  • ...Under the above-mentioned assumptions and the assumptions of Theorem 2 of Bradic et al. (2011), for the model-averaged penalized quantile predictions it holds that $\sqrt{n}\,\bigl(\tfrac{1}{n} U_a (X_a^{\top} X_a)^{-1} U_a^{\top}\bigr)^{-1/2}\bigl\{\omega_1 U_a(\hat{\beta}_{a,\tau_1,\mathrm{pen}} - \beta_a) + \ldots + \omega_k U_a(\hat{\beta}_{a,\tau_k,\mathrm{pen}} - \beta_a)\bigr\} \to_d N_r\bigl(0, (\omega^{\top}\Omega\,\omega) I_r\bigr)$, where $\hat{\beta}_{a,\tau_l,\mathrm{pen}}$ is the…

    [...]

  • ...For high-dimensional models, only the composite estimator has been considered (Bradic et al., 2011)....

    [...]

  • ...Consider now a sparse high-dimensional linear model as in Bradic et al. (2011), Y = Xβ + ε (5), with independent and identically distributed mean-zero errors ε and with p, the number of columns of X, large relative to the sample size n, allowing for an exponential order such that log(p) = O(n^δ) with…

    [...]
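
The nonnegative weight optimization quoted above, minimize νᵀAν subject to ν ≥ 0 and fᵀν = 1, is a small quadratic program. Below is a hedged sketch of solving it with cvxpy; the matrix A and vector f are illustrative placeholders rather than the quantities built from the error density in Bradic et al. (2011).

```python
# Sketch of the constrained weight optimization: minimize v'Av subject to
# v >= 0 and f'v = 1. A and f are made-up placeholders for illustration.
import numpy as np
import cvxpy as cp

K = 5
A = np.eye(K) + 0.2 * np.ones((K, K))   # placeholder symmetric positive-definite matrix
f = np.linspace(0.5, 1.5, K)            # placeholder vector (density values in the paper)

v = cp.Variable(K)
problem = cp.Problem(cp.Minimize(cp.quad_form(v, A)), [v >= 0, f @ v == 1])
problem.solve()
print("nonnegative optimal weights:", np.round(v.value, 3))
```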

Journal Article
TL;DR: In this paper, the authors consider the effects of model selection on the estimation efficiency of penalized methods and derive the asymptotic mean squared error for regularized M-estimators.
Abstract: Understanding efficiency in high-dimensional linear models is a longstanding problem of interest. Classical work with smaller-dimensional problems dating back to Huber and Bickel has illustrated the benefits of efficient loss functions. When the number of parameters $p$ is of the same order as the sample size $n$, $p \approx n$, an efficiency pattern different from the one of Huber was recently established. In this work, we consider the effects of model selection on the estimation efficiency of penalized methods. In particular, we explore whether sparsity results in new efficiency patterns when $p > n$. In the interest of deriving the asymptotic mean squared error for regularized M-estimators, we use the powerful framework of approximate message passing. We propose a novel, robust and sparse approximate message passing algorithm (RAMP) that is adaptive to the error distribution. Our algorithm includes many non-quadratic and non-differentiable loss functions. We derive its asymptotic mean squared error and show its convergence, while allowing $p, n, s \to \infty$, with $n/p \in (0,1)$ and $n/s \in (1,\infty)$. We identify new patterns of relative efficiency regarding a number of penalized $M$-estimators when $p$ is much larger than $n$. We show that the classical information bound is no longer reachable, even for light-tailed error distributions. We show that the penalized least absolute deviation estimator dominates the penalized least squares estimator in cases of heavy-tailed distributions. We observe this pattern for all choices of the number of non-zero parameters $s$, both $s \leq n$ and $s \approx n$. In non-penalized problems where $s = p \approx n$, the opposite regime holds. Therefore, we discover that the presence of model selection significantly changes the efficiency patterns.

8 citations
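
To illustrate the efficiency comparison described above, here is a small simulation sketch (not the RAMP algorithm) contrasting penalized least squares with penalized least absolute deviation under heavy-tailed errors, using scikit-learn's Lasso and its L1-penalized median regression; the tuning values are illustrative and the two estimators use different penalty scalings.

```python
# Simulation sketch: penalized LS (Lasso) versus penalized LAD (L1-penalized
# median regression) under heavy-tailed errors. Assumes scikit-learn >= 1.0
# (QuantileRegressor) and scipy >= 1.6 (the "highs" solver).
import numpy as np
from sklearn.linear_model import Lasso, QuantileRegressor

rng = np.random.default_rng(1)
n, p, s = 200, 400, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 2.0
y = X @ beta + rng.standard_cauchy(n)            # very heavy-tailed noise

ls = Lasso(alpha=0.3).fit(X, y)
lad = QuantileRegressor(quantile=0.5, alpha=0.3, solver="highs").fit(X, y)

for name, est in [("penalized LS ", ls), ("penalized LAD", lad)]:
    err = float(np.sum((est.coef_ - beta) ** 2))
    print(f"{name}: squared estimation error = {err:.2f}")
```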

Journal ArticleDOI
TL;DR: This paper develops a novel composite quantile regression and a weighted quantile average estimation procedure for parameter estimation in linear regression models when some responses are missing at random and proposes adaptive penalization methods to simultaneously select significant variables and estimate unknown parameters.
Abstract: Coefficient estimation in linear regression models with missing data is routinely done in the mean regression framework. However, the mean regression theory breaks down if the error variance is infinite. In addition, correct specification of the likelihood function for existing imputation approaches is often challenging in practice, especially for skewed data. In this paper, we develop a novel composite quantile regression and a weighted quantile average estimation procedure for parameter estimation in linear regression models when some responses are missing at random. Instead of imputing the missing response by randomly drawing from its conditional distribution, we propose to impute both missing and observed responses by their estimated conditional quantiles given the observed data and to use the parametrically estimated propensity scores to weight the check functions that define a regression parameter. Both estimation procedures are resistant to heavy-tailed errors or outliers in the response and can achieve good robustness and efficiency. Moreover, we propose adaptive penalization methods to simultaneously select significant variables and estimate unknown parameters. Asymptotic properties of the proposed estimators are carefully investigated. An efficient algorithm is developed for fast implementation of the proposed methodologies. We also discuss a model selection criterion, based on an IC_Q-type statistic, to select the penalty parameters. The performance of the proposed methods is illustrated via simulated and real data sets.

7 citations


Cites methods from "Penalized Composite Quasi-Likelihoo..."

  • ...Following Bradic et al. (2011), we set the number of quantiles to be K = 9 and the quantile vector T = ....

    [...]

  • ...Following Bradic et al. (2011), we set the number of quantiles to be K = 9 and the quantile vector T = (0.1, 0.2, ..., 0.9).

    [...]
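
The quoted setting, K = 9 equally spaced quantile levels τ = (0.1, ..., 0.9), corresponds to a composite quantile regression with one common slope vector and quantile-specific intercepts. The sketch below fits such a model with an L1 penalty via cvxpy; it is only an illustration of the composite check loss, not the cited authors' imputation or weighting procedure, and the penalty level is arbitrary.

```python
# Composite quantile regression sketch: common slopes, one intercept per
# quantile level, K = 9 levels tau = 0.1, ..., 0.9, plus an L1 penalty.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
n, p = 150, 20
X = rng.standard_normal((n, p))
beta_true = np.concatenate([[2.0, -1.5, 1.0], np.zeros(p - 3)])
y = X @ beta_true + rng.standard_t(df=3, size=n)

taus = np.arange(0.1, 1.0, 0.1)          # the 9 quantile levels
beta = cp.Variable(p)
b = cp.Variable(len(taus))               # quantile-specific intercepts

def check_loss(u, tau):
    # rho_tau(u) = max(tau * u, (tau - 1) * u), the quantile check function
    return cp.sum(cp.maximum(tau * u, (tau - 1) * u))

loss = sum(check_loss(y - X @ beta - b[k], tau) for k, tau in enumerate(taus))
lam = 1.0                                # arbitrary penalty level
cp.Problem(cp.Minimize(loss + lam * cp.norm1(beta))).solve()
print("estimated nonzero slopes:", np.flatnonzero(np.abs(beta.value) > 1e-4))
```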

Journal ArticleDOI
TL;DR: In this paper, a two-level hierarchical Bayesian model for coefficient estimation and feature selection is proposed, which assumes a prior distribution that favors sparseness; the proposed approach performs quite well in comparison to the other approaches.
Abstract: This paper introduces a Bayesian approach for composite quantile regression employing the skewed Laplace distribution for the error distribution. We use a two-level hierarchical Bayesian model for coefficient estimation and feature selection, which assumes a prior distribution that favors sparseness. An efficient Gibbs sampling algorithm is developed to update the unknown quantities from the posteriors. The proposed approach is illustrated via simulation studies and two real datasets. Results indicate that the proposed approach performs quite well in comparison to the other approaches.

7 citations
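
For context, the sketch below writes out one common parameterization of the skewed (asymmetric) Laplace log-density that underlies Bayesian quantile regression: maximizing this working likelihood in the location parameter is equivalent to minimizing the quantile check loss. This is a standard textbook form, not necessarily the exact parameterization or prior structure used in the cited paper.

```python
# Log-density of the asymmetric (skewed) Laplace distribution,
# f(y | mu, sigma, tau) = tau * (1 - tau) / sigma * exp(-rho_tau((y - mu) / sigma)),
# a common working likelihood for Bayesian quantile regression.
import numpy as np

def asym_laplace_logpdf(y, mu, sigma, tau):
    u = (y - mu) / sigma
    rho = u * (tau - (u < 0))                  # check function rho_tau(u)
    return np.log(tau * (1 - tau) / sigma) - rho

# Example: log-density of one observation at the 0.3 quantile level
print(asym_laplace_logpdf(np.array([1.2]), mu=0.0, sigma=1.0, tau=0.3))
```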

Journal ArticleDOI
TL;DR: Adaptive composite M-estimation (ACME) is proposed for partially overlapping models using a composite loss function, which is a linear combination of loss functions defining the individual models.
Abstract: In many problems, one has several models of interest that capture key parameters describing the distribution of the data. Partially overlapping models are taken as models in which at least one covariate effect is common to the models. A priori knowledge of such structure enables efficient estimation of all model parameters. However, in practice, this structure may be unknown. We propose adaptive composite M-estimation (ACME) for partially overlapping models using a composite loss function, which is a linear combination of loss functions defining the individual models. Penalization is applied to pairwise differences of parameters across models, resulting in data-driven identification of the overlap structure. Further penalization is imposed on the individual parameters, enabling sparse estimation in the regression setting. The recovery of the overlap structure enables more efficient parameter estimation. An oracle result is established. Simulation studies illustrate the advantages of ACME over existing methods that fit individual models separately or make strong a priori assumptions about the overlap structure.

7 citations


Cites background or methods from "Penalized Composite Quasi-Likelihoo..."

  • ...The criterion for the choice of weights is to maximize the efficiency of the estimator Bradic, Fan, and Wang (2011)....

    [...]

  • ...Bradic, Fan, and Wang (2011) chooses the weight vector by minimizing the scalar function....

    [...]

  • ...They considered the composite loss function as an approximation to the unknown log-likelihood function of the error distribution Bradic, Fan, and Wang (2011) while ACME considers each loss component as a model targeting different profiles of the conditional distribution....

    [...]

  • ...For completely overlapping models, Bradic, Fan, and Wang (2011) and Zou and Yuan (2008) used composite loss functions with the goal of improving efficiency of the regression parameter estimators....

    [...]

  • ...We also compared with penalized composite quasi-likelihood (PCQ) in Bradic, Fan, and Wang (2011), which was developed for a classical linear model....

    [...]
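
As a rough illustration of the idea described in this abstract, the sketch below fits two partially overlapping models on the same covariates, an LAD model and a least-squares model, with a penalty on the pairwise difference of their coefficient vectors plus individual L1 penalties. It is only a schematic cvxpy sketch with made-up tuning constants, not the ACME algorithm or its adaptive weighting.

```python
# Schematic version of a composite loss with a fused penalty on pairwise
# coefficient differences (overlap recovery) and L1 penalties (sparsity).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
n, p = 120, 15
X = rng.standard_normal((n, p))
beta_shared = np.concatenate([[2.0, -1.0], np.zeros(p - 2)])
y = X @ beta_shared + rng.standard_normal(n)

b1 = cp.Variable(p)        # coefficients of model 1 (LAD loss)
b2 = cp.Variable(p)        # coefficients of model 2 (least-squares loss)
loss = cp.sum(cp.abs(y - X @ b1)) + cp.sum_squares(y - X @ b2)
lam_fuse, lam_sparse = 2.0, 1.0                     # made-up tuning constants
penalty = lam_fuse * cp.norm1(b1 - b2) + lam_sparse * (cp.norm1(b1) + cp.norm1(b2))
cp.Problem(cp.Minimize(loss + penalty)).solve()
print("largest difference between the two models:",
      float(np.max(np.abs(b1.value - b2.value))))
```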

References
Journal ArticleDOI
TL;DR: A new method for estimation in linear models called the lasso, which minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, is proposed.
Abstract: We propose a new method for estimation in linear models. The 'lasso' minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.

40,785 citations

Journal ArticleDOI
TL;DR: In this article, penalized likelihood approaches are proposed to handle variable selection problems, and it is shown that the newly proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well as if the correct submodel were known.
Abstract: Variable selection is fundamental to high-dimensional statistical modeling, including nonparametric regression. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. In this article, penalized likelihood approaches are proposed to handle these kinds of problems. The proposed methods select variables and estimate coefficients simultaneously. Hence they enable us to construct confidence intervals for estimated parameters. The proposed approaches are distinguished from others in that the penalty functions are symmetric, nonconcave on (0, ∞), and have singularities at the origin to produce sparse solutions. Furthermore, the penalty functions should be bounded by a constant to reduce bias and satisfy certain conditions to yield continuous solutions. A new algorithm is proposed for optimizing penalized likelihood functions. The proposed ideas are widely applicable. They are readily applied to a variety of ...

8,314 citations
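
The canonical example of the penalty family described in this abstract is the SCAD (smoothly clipped absolute deviation) penalty; the small numpy sketch below writes out its standard piecewise form, with the commonly recommended value a = 3.7, to make the "bounded by a constant, singular at the origin" description concrete.

```python
# Standard piecewise form of the SCAD penalty p_lambda(t); a = 3.7 is the
# conventional choice. Linear near zero, quadratic transition, then constant.
import numpy as np

def scad_penalty(t, lam, a=3.7):
    t = np.abs(np.asarray(t, dtype=float))
    small = t <= lam
    mid = (t > lam) & (t <= a * lam)
    return np.where(small, lam * t,
           np.where(mid, (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                    lam**2 * (a + 1) / 2))

print(scad_penalty([0.1, 1.0, 5.0], lam=0.5))
```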

Journal ArticleDOI
TL;DR: A publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates is described.
Abstract: The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived: (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps us understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm. (3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a Cp estimate of prediction error; this allows a principled choice among the range of possible LARS estimates. LARS and its variants are computationally efficient: the paper describes a publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates.

7,828 citations


"Penalized Composite Quasi-Likelihoo..." refers background or methods in this paper

  • ...(16) can be recast as a penalized weighted least squares regression $\arg\min_{\beta} \sum_{i=1}^{n}\bigl[ w_1 |Y_i - X_i^{T}\hat{\beta}^{(0)}| + w_2 (Y_i - X_i^{T}\beta)^2 \bigr] + n \sum_{j=1}^{p} \gamma_{\lambda}(|\beta_j^{(0)}|)\,|\beta_j|$, which can be efficiently solved by pathwise coordinate optimization (Friedman et al., 2008) or least angle regression (Efron et al., 2004)....

    [...]

  • ...) are all nonnegative. This class of problems can be solved with fast and efficient computational algorithms such as pathwise coordinate optimization (Friedman et al., 2008) and least angle regression (Efron et al., 2004). One particular example is the combination of L1 and L2 regressions, in which K = 2, ρ1(t) = |t − b0| and ρ2(t) = t². Here b0 denotes the median of the error distribution ε. If the error distribution is sym...

    [...]

  • ...$\sum_{i=1}^{n}\bigl[ w_1 |Y_i - X_i^{T}\hat{\beta}^{(0)}| + w_2 (Y_i - X_i^{T}\beta)^2 \bigr] + n \sum_{j=1}^{p} \gamma_{\lambda}(|\beta_j^{(0)}|)\,|\beta_j|$, which can be efficiently solved by pathwise coordinate optimization (Friedman et al., 2008) or least angle regression (Efron et al., 2004). If b0 ≠ 0, the penalized least-squares problem (16) is somewhat different from (5) since we have an additional parameter b0. Using the same arguments, and treating b0 as an additional parameter ...

    [...]

  • ...This class of problems can be solved with fast and efficient computational algorithms such as pathwise coordinate optimization (Friedman et al., 2008) and least angle regression (Efron et al., 2004)....

    [...]
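
Property (1) in this abstract, that a modification of LARS traces out the entire lasso path at roughly the cost of one least-squares fit, can be tried with scikit-learn's lars_path; the snippet below is a minimal usage sketch on simulated data.

```python
# Minimal sketch: full lasso solution path via the LARS modification
# (scikit-learn's lars_path with method="lasso").
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(6)
n, p = 100, 30
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [4, -3, 2]
y = X @ beta + rng.standard_normal(n)

alphas, active, coefs = lars_path(X, y, method="lasso")
print("number of breakpoints on the path:", len(alphas))
print("first variables to enter the model:", active[:5])
```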

Journal ArticleDOI
Hui Zou1
TL;DR: A new version of the lasso is proposed, called the adaptive lasso, where adaptive weights are used for penalizing different coefficients in the ℓ1 penalty, and the nonnegative garotte is shown to be consistent for variable selection.
Abstract: The lasso is a popular technique for simultaneous estimation and variable selection. Lasso variable selection has been shown to be consistent under certain conditions. In this work we derive a necessary condition for the lasso variable selection to be consistent. Consequently, there exist certain scenarios where the lasso is inconsistent for variable selection. We then propose a new version of the lasso, called the adaptive lasso, where adaptive weights are used for penalizing different coefficients in the l1 penalty. We show that the adaptive lasso enjoys the oracle properties; namely, it performs as well as if the true underlying model were given in advance. Similar to the lasso, the adaptive lasso is shown to be near-minimax optimal. Furthermore, the adaptive lasso can be solved by the same efficient algorithm for solving the lasso. We also discuss the extension of the adaptive lasso in generalized linear models and show that the oracle properties still hold under mild regularity conditions. As a bypro...

6,765 citations
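
A common two-step recipe for the adaptive lasso is sketched below: a pilot ridge fit supplies the data-dependent weights 1/|β̂_pilot|^γ, which are then absorbed into the design by column rescaling so that a plain lasso solver can be reused. The tuning constants are illustrative and this is a generic sketch, not code from the paper.

```python
# Two-step adaptive lasso sketch: pilot ridge fit -> adaptive weights ->
# weighted L1 problem solved by rescaling the columns and running Lasso.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:4] = [3, -2, 1.5, 1]
y = X @ beta + rng.standard_normal(n)

pilot = Ridge(alpha=1.0).fit(X, y)
gamma = 1.0
w = 1.0 / (np.abs(pilot.coef_) ** gamma + 1e-8)   # adaptive weights

lasso = Lasso(alpha=0.05).fit(X / w, y)           # weighted L1 via column rescaling
beta_hat = lasso.coef_ / w
print("selected variables:", np.flatnonzero(np.abs(beta_hat) > 1e-8))
```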

Journal ArticleDOI
TL;DR: In this article, a new approach toward a theory of robust estimation is presented, which treats in detail the asymptotic theory of estimating a location parameter for contaminated normal distributions, and exhibits estimators that are asymptotically most robust (in a sense to be specified) among all translation invariant estimators.
Abstract: This paper contains a new approach toward a theory of robust estimation; it treats in detail the asymptotic theory of estimating a location parameter for contaminated normal distributions, and exhibits estimators—intermediaries between sample mean and sample median—that are asymptotically most robust (in a sense to be specified) among all translation invariant estimators. For the general background, see Tukey (1960) (p. 448 ff.)

5,628 citations
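
To illustrate the "intermediary between the sample mean and the sample median" idea, here is a minimal sketch of a Huber location M-estimator on a contaminated normal sample; the threshold c = 1.345 is a conventional default, not a value taken from the paper.

```python
# Huber location M-estimate of a contaminated normal sample, compared with
# the sample mean and median. c = 1.345 is a conventional tuning constant.
import numpy as np
from scipy.optimize import minimize_scalar

def huber_loss(u, c=1.345):
    a = np.abs(u)
    return np.where(a <= c, 0.5 * u**2, c * a - 0.5 * c**2)

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0, 1, 95), rng.normal(0, 10, 5)])  # contaminated sample

result = minimize_scalar(lambda m: huber_loss(x - m).sum())
print("mean:", round(x.mean(), 3), "median:", round(float(np.median(x)), 3),
      "Huber:", round(result.x, 3))
```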