Penalized Composite Quasi-Likelihood for Ultrahigh-Dimensional Variable Selection

doi:10.1111/J.1467-9868.2010.00764.X

Home
/
Papers
/
Penalized Composite Quasi-Likelihood for Ultrahigh-Dimensional Variable Selection

Journal Article•DOI•

Penalized Composite Quasi-Likelihood for Ultrahigh-Dimensional Variable Selection

Jelena Bradic¹, Jianqing Fan¹, Weiwei Wang²•Institutions (2)

Princeton University¹, University of Texas Health Science Center at Houston²

01 Jun 2011-Journal of The Royal Statistical Society Series B-statistical Methodology (NIH Public Access)-Vol. 73, Iss: 3, pp 325-349

TL;DR: A data‐driven weighted linear combination of convex loss functions, together with weighted L1‐penalty is proposed and established a strong oracle property of the method proposed that has both the model selection consistency and estimation efficiency for the true non‐zero coefficients.

read less

Abstract: In high-dimensional model selection problems, penalized least-square approaches have been extensively used. This paper addresses the question of both robustness and efficiency of penalized model selection methods, and proposes a data-driven weighted linear combination of convex loss functions, together with weighted L1-penalty. It is completely data-adaptive and does not require prior knowledge of the error distribution. The weighted L1-penalty is used both to ensure the convexity of the penalty term and to ameliorate the bias caused by the L1-penalty. In the setting with dimensionality much larger than the sample size, we establish a strong oracle property of the proposed method that possesses both the model selection consistency and estimation efficiency for the true non-zero coefficients. As specific examples, we introduce a robust method of composite L1-L2, and optimal composite quantile method and evaluate their performance in both simulated and real data examples.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Posted Content•

Detangling robustness in high dimensions: composite versus model-averaged estimation

[...]

Jing Zhou, Gerda Claeskens, Jelena Bradic

12 Jun 2020-arXiv: Statistics Theory

TL;DR: It is found that model-averaged and composite quantile estimators often outperform least-squares methods, even in the case of Gaussian model noise, and a toolbox to further study robustness in these settings is provided.

...read moreread less

Abstract: Robust methods, though ubiquitous in practice, are yet to be fully understood in the context of regularized estimation and high dimensions. Even simple questions become challenging very quickly. For example, classical statistical theory identifies equivalence between model-averaged and composite quantile estimation. However, little to nothing is known about such equivalence between methods that encourage sparsity. This paper provides a toolbox to further study robustness in these settings and focuses on prediction. In particular, we study optimally weighted model-averaged as well as composite $l_1$-regularized estimation. Optimal weights are determined by minimizing the asymptotic mean squared error. This approach incorporates the effects of regularization, without the assumption of perfect selection, as is often used in practice. Such weights are then optimal for prediction quality. Through an extensive simulation study, we show that no single method systematically outperforms others. We find, however, that model-averaged and composite quantile estimators often outperform least-squares methods, even in the case of Gaussian model noise. Real data application witnesses the method's practical use through the reconstruction of compressed audio signals.

...read moreread less

4 citations

Cites background or methods from "Penalized Composite Quasi-Likelihoo..."

...…on minimising the asymptotic variance of the estimators of only the active set of coefficients, denoted by wMA,2 (Bloznelis et al., 2019) and wC,2 (Bradic et al., 2011) where, with the (k1, k2)th component of A equal to Ak1,k2 = min(τk1 , τk2){1 − max(τk1 , τk2)}, Aε = diag(fε(uτ1), . . . ,…...
[...]
..., 2019) and wC,2 (Bradic et al., 2011) where, with the (k1, k2)th component of A equal to Ak1,k2 = min(τk1 , τk2){1 − max(τk1 , τk2)}, Aε = diag(fε(uτ1), ....
[...]
...This is where our approach differs from Bloznelis et al. (2019) or Bradic et al. (2011), where an irrepresentable condition (needed for consistent model or asymptotically perfect selection) has been used to specify weights and analyze robustness....
[...]
...3 end 4 return wC,1, β̂(αopt, wC,1), and ÂMSE(β̂C(αopt;wcand);β) A possible initial weight vector wC,init is the vector of equal weights or the weight proposed in Bradic et al. (2011); β̂C is estimated by Algorithm 2, and AMSE(β̂C;β) is estimated by (29)....
[...]
...…of the sparse coefficient vector β without assuming that the nonzero entries are selected perfectly; whereas another type of weight choice derived in Bradic et al. (2011); Bloznelis et al. (2019) aims at the lower bound of the variance of the nonzero part of β by imposing the perfect selection…...
[...]

Journal Article•DOI•

Local weighted composite quantile estimation and smoothing parameter selection for nonparametric derivative function

[...]

Qichang Xie, Qiankun Sun, Junxian Liu

15 Mar 2020-Econometric Reviews

Abstract: Estimating derivatives is of primary interest as it quantitatively measures the rate of change of the relationship between response and explanatory variables. We propose a local weighted composite ...

...read moreread less

4 citations

Cites background or methods from "Penalized Composite Quasi-Likelihoo..."

...Generalizing the method of CQR, the weighted composite quantile regression (WCQR) was suggested by Bradic et al. (2011) for linear models and Jiang et al. (2012, 2014) for nonlinear models. In terms of nonparametric models, Sun et al. (2013) has shown that the WCQR inherits good properties of CQR to symmetric errors and is applicable to asymmetric errors. However, their works mainly focus on the estimation of nonparametric itself rather than its derivatives. The performance of nonparametric estimation relies on the values chosen for smoothing parameter or bandwidth. For recovering mean response function, much attention has been given to smoothing parameter selection, see Li and Racine (2007) and the references therein. Comparatively, few works have addressed smoothing parameter selection for derivative estimation. In the earlier literature, the studies of choosing smoothing parameter were mainly carried out under the setting of the LS. Rice (1986) introduced a nearly unbiased estimator of the integrated mean square error to select the optimal smoothing parameter for nonparametric derivative estimation. By kernel smoothing, M€ uller et al. (1987) proposed a generalized version of the crossvalidation (CV) technique to estimate the first-order derivative. Fan and Gijbels (1995) developed an integrated residual squares criterion for estimating derivatives employing the local polynomial fitting. Richard et al. (2011) suggested a generalized Cp bandwidth selection criterion to estimate derivative. Through the local LS regression, Henderson et al. (2015) put forward a minimum CV approach for choosing bandwidth of gradient estimation. Whereafter, this work is further promoted by Lin et al. (2015) to discuss the selection of bandwidth for quantile derivative estimation....
[...]
...Generalizing the method of CQR, the weighted composite quantile regression (WCQR) was suggested by Bradic et al. (2011) for linear models and Jiang et al. (2012, 2014) for nonlinear models. In terms of nonparametric models, Sun et al. (2013) has shown that the WCQR inherits good properties of CQR to symmetric errors and is applicable to asymmetric errors. However, their works mainly focus on the estimation of nonparametric itself rather than its derivatives. The performance of nonparametric estimation relies on the values chosen for smoothing parameter or bandwidth. For recovering mean response function, much attention has been given to smoothing parameter selection, see Li and Racine (2007) and the references therein....
[...]
...Generalizing the method of CQR, the weighted composite quantile regression (WCQR) was suggested by Bradic et al. (2011) for linear models and Jiang et al. (2012, 2014) for nonlinear models....
[...]
...Generalizing the method of CQR, the weighted composite quantile regression (WCQR) was suggested by Bradic et al. (2011) for linear models and Jiang et al. (2012, 2014) for nonlinear models. In terms of nonparametric models, Sun et al. (2013) has shown that the WCQR inherits good properties of CQR to symmetric errors and is applicable to asymmetric errors. However, their works mainly focus on the estimation of nonparametric itself rather than its derivatives. The performance of nonparametric estimation relies on the values chosen for smoothing parameter or bandwidth. For recovering mean response function, much attention has been given to smoothing parameter selection, see Li and Racine (2007) and the references therein. Comparatively, few works have addressed smoothing parameter selection for derivative estimation. In the earlier literature, the studies of choosing smoothing parameter were mainly carried out under the setting of the LS. Rice (1986) introduced a nearly unbiased estimator of the integrated mean square error to select the optimal smoothing parameter for nonparametric derivative estimation....
[...]
...Generalizing the method of CQR, the weighted composite quantile regression (WCQR) was suggested by Bradic et al. (2011) for linear models and Jiang et al. (2012, 2014) for nonlinear models. In terms of nonparametric models, Sun et al. (2013) has shown that the WCQR inherits good properties of CQR to symmetric errors and is applicable to asymmetric errors. However, their works mainly focus on the estimation of nonparametric itself rather than its derivatives. The performance of nonparametric estimation relies on the values chosen for smoothing parameter or bandwidth. For recovering mean response function, much attention has been given to smoothing parameter selection, see Li and Racine (2007) and the references therein. Comparatively, few works have addressed smoothing parameter selection for derivative estimation. In the earlier literature, the studies of choosing smoothing parameter were mainly carried out under the setting of the LS. Rice (1986) introduced a nearly unbiased estimator of the integrated mean square error to select the optimal smoothing parameter for nonparametric derivative estimation. By kernel smoothing, M€ uller et al. (1987) proposed a generalized version of the crossvalidation (CV) technique to estimate the first-order derivative. Fan and Gijbels (1995) developed an integrated residual squares criterion for estimating derivatives employing the local polynomial fitting. Richard et al. (2011) suggested a generalized Cp bandwidth selection criterion to estimate derivative. Through the local LS regression, Henderson et al. (2015) put forward a minimum CV approach for choosing bandwidth of gradient estimation....
[...]

Journal Article•DOI•

An improvement on the efficiency of complete-case-analysis with nonignorable missing covariate data

[...]

Jing Sun¹•Institutions (1)

Ludong University¹

01 Dec 2020-Computational Statistics

TL;DR: Results show that the proposed estimator is more efficient than the CCA one if the probability of missingness on the fully observed variables is correctly specified and the proposed algorithm is computationally simple and easy to implement.

...read moreread less

Abstract: This paper develops a weighted composite quantile regression method for linear models where some covariates are missing not at random but the missingness is conditionally independent of the response variable. It is known that complete case analysis (CCA) is valid under these missingness assumptions. By fully utilizing the information from incomplete data, empirical likelihood-based weights are obtained to conduct the weighted composite quantile regression. Theoretical results show that the proposed estimator is more efficient than the CCA one if the probability of missingness on the fully observed variables is correctly specified. Besides, the proposed algorithm is computationally simple and easy to implement. The methodology is illustrated on simulated data and a real data set.

...read moreread less

4 citations

Journal Article•DOI•

Finite Mixture of Generalized Semiparametric Models: Variable Selection via Penalized Estimation

[...]

Farzad Eskandari¹, Ehsan Ormoz²•Institutions (2)

Allameh Tabataba'i University¹, Islamic Azad University²

25 Nov 2016-Communications in Statistics - Simulation and Computation

TL;DR: To overcome computational burden, a class of variable selection procedures for finite mixture of generalized semiparametric models using penalized approach for variable selection is introduced.

...read moreread less

Abstract: Selection of the important variables is one of the most important model selection problems in statistical applications. In this article, we address variable selection in finite mixture of generalized semiparametric models. To overcome computational burden, we introduce a class of variable selection procedures for finite mixture of generalized semiparametric models using penalized approach for variable selection. Estimation of nonparametric component will be done via multivariate kernel regression. It is shown that the new method is consistent for variable selection and the performance of proposed method will be assessed via simulation.

...read moreread less

4 citations

Cites methods from "Penalized Composite Quasi-Likelihoo..."

...Bradic et al. (2011) addressed robustness and efficiency of penalized model selection method and proposed a data-driven weighted linear combination of convex loss functions, along with weighted L1 penalty....
[...]

Journal Article•DOI•

Unified Noncrossing Multiple Quantile Regressions Tree

[...]

Jaeoh Kim¹, HyungJun Cho¹, Sungwan Bang²•Institutions (2)

Korea University¹, Korea Military Academy²

09 Mar 2019-Journal of Computational and Graphical Statistics

TL;DR: The unified noncrossing multiple quantile regressions tree (UNQRT) method is proposed, which constructs a common tree structure across all interesting quantile levels for better data visualization and model interpretation and the improvement of prediction accuracy.

...read moreread less

Abstract: In this article, we consider the estimation problem of a tree model for multiple conditional quantile functions of the response. Using the generalized, unbiased interaction detection and estimation...

...read moreread less

4 citations

Cites methods from "Penalized Composite Quasi-Likelihoo..."

...Koenker (1984) discussed the optimal weight in composite quantile regression, and Bradic, Fan, and Wang (2011) proposed a data-driven method to estimate the optimal weights....
[...]

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
…
19
20
21
22
23
24
25
…
26
27
28
29
30
31
32
33
34
35
36
37

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

Regression Shrinkage and Selection via the Lasso

[...]

Robert Tibshirani

01 Jan 1996-Journal of the royal statistical society series b-methodological

TL;DR: A new method for estimation in linear models called the lasso, which minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, is proposed.

...read moreread less

Abstract: SUMMARY We propose a new method for estimation in linear models. The 'lasso' minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.

...read moreread less

40,785 citations

Journal Article•DOI•

Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties

[...]

Jianqing Fan¹, Runze Li¹•Institutions (1)

University of California, Los Angeles¹

01 Dec 2001-Journal of the American Statistical Association

TL;DR: In this article, penalized likelihood approaches are proposed to handle variable selection problems, and it is shown that the newly proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well if the correct submodel were known.

...read moreread less

Abstract: Variable selection is fundamental to high-dimensional statistical modeling, including nonparametric regression. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. In this article, penalized likelihood approaches are proposed to handle these kinds of problems. The proposed methods select variables and estimate coefficients simultaneously. Hence they enable us to construct confidence intervals for estimated parameters. The proposed approaches are distinguished from others in that the penalty functions are symmetric, nonconcave on (0, ∞), and have singularities at the origin to produce sparse solutions. Furthermore, the penalty functions should be bounded by a constant to reduce bias and satisfy certain conditions to yield continuous solutions. A new algorithm is proposed for optimizing penalized likelihood functions. The proposed ideas are widely applicable. They are readily applied to a variety of ...

...read moreread less

8,314 citations

Journal Article•DOI•

Least angle regression

[...]

Bradley Efron¹, Trevor Hastie¹, Iain M. Johnstone¹, Robert Tibshirani¹, Hemant Ishwaran², Keith Knight³, Jean-Michel Loubes⁴, Jean-Michel Loubes⁵, Pascal Massart⁶, Pascal Massart⁵, David Madigan⁷, David Madigan⁸, Greg Ridgeway⁸, Greg Ridgeway⁹, Saharon Rosset¹⁰, Saharon Rosset¹, Ji Zhu, Robert A. Stine¹¹, Berwin A. Turlach¹², Sanford Weisberg¹³ - Show less +16 more•Institutions (13)

Stanford University¹, Cleveland Clinic², University of Toronto³, Centre national de la recherche scientifique⁴, Université Paris-Saclay⁵, University of Paris-Sud⁶, Avaya⁷, Rutgers University⁸, RAND Corporation⁹, IBM¹⁰, University of Pennsylvania¹¹, University of Western Australia¹², University of Minnesota¹³

01 Apr 2004-Annals of Statistics

TL;DR: A publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates is described.

...read moreread less

Abstract: The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived: (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps us understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm. (3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a Cp estimate of prediction error; this allows a principled choice among the range of possible LARS estimates. LARS and its variants are computationally efficient: the paper describes a publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates.

...read moreread less

7,828 citations

"Penalized Composite Quasi-Likelihoo..." refers background or methods in this paper

...…(16) can be recast as a penalized weighted least square regression argmin β n∑ i=1 w1∣∣∣Yi −XTi β̂ (0) ∣∣∣ + w2 ( Yi −XTi β )2 + n p∑ j=1 γλ(|β(0)j |)|βj | which can be efficiently solved by pathwise coordinate optimization (Friedman et al., 2008) or least angle regression (Efron et al., 2004)....
[...]
...) are all nonnegative. This class of problems can be solved with fast and eﬃcient computational algorithms such as pathwise coordinate optimization (Friedman et al., 2008) and least angle regression (Efron et al., 2004). One particular example is the combination of L 1 and L 2 regressions, in which K= 2, ρ 1(t) = |t−b 0|andρ 2(t) = t2. Here b 0 denotes themedian of error distributionε. Iftheerror distribution is sym...
[...]
...i=1 w 1 Yi −XT i βˆ (0) +w 2 Yi −XT i β 2 +n Xp j=1 γλ(|β (0) j |)|βj| which can be eﬃciently solved by pathwise coordinate optimization (Friedman et al., 2008) or least angle regression (Efron et al., 2004). If b 0 6= 0, the penalized least-squares problem ( 16) is somewhat diﬀerent from (5) since we have an additional parameter b 0. Using the same arguments, and treating b 0 as an additional parameter ...
[...]
...This class of problems can be solved with fast and efficient computational algorithms such as pathwise coordinate optimization (Friedman et al., 2008) and least angle regression (Efron et al., 2004)....
[...]

Journal Article•DOI•

The adaptive lasso and its oracle properties

[...]

Hui Zou¹•Institutions (1)

University of Minnesota¹

01 Dec 2006-Journal of the American Statistical Association

TL;DR: A new version of the lasso is proposed, called the adaptive lasso, where adaptive weights are used for penalizing different coefficients in the ℓ1 penalty, and the nonnegative garotte is shown to be consistent for variable selection.

...read moreread less

Abstract: The lasso is a popular technique for simultaneous estimation and variable selection. Lasso variable selection has been shown to be consistent under certain conditions. In this work we derive a necessary condition for the lasso variable selection to be consistent. Consequently, there exist certain scenarios where the lasso is inconsistent for variable selection. We then propose a new version of the lasso, called the adaptive lasso, where adaptive weights are used for penalizing different coefficients in the l1 penalty. We show that the adaptive lasso enjoys the oracle properties; namely, it performs as well as if the true underlying model were given in advance. Similar to the lasso, the adaptive lasso is shown to be near-minimax optimal. Furthermore, the adaptive lasso can be solved by the same efficient algorithm for solving the lasso. We also discuss the extension of the adaptive lasso in generalized linear models and show that the oracle properties still hold under mild regularity conditions. As a bypro...

...read moreread less

6,765 citations

Journal Article•DOI•

Robust Estimation of a Location Parameter

[...]

Peter J. Huber¹•Institutions (1)

University of California¹

01 Mar 1964-Annals of Mathematical Statistics

TL;DR: In this article, a new approach toward a theory of robust estimation is presented, which treats in detail the asymptotic theory of estimating a location parameter for contaminated normal distributions, and exhibits estimators that are asyptotically most robust (in a sense to be specified) among all translation invariant estimators.

...read moreread less

Abstract: This paper contains a new approach toward a theory of robust estimation; it treats in detail the asymptotic theory of estimating a location parameter for contaminated normal distributions, and exhibits estimators—intermediaries between sample mean and sample median—that are asymptotically most robust (in a sense to be specified) among all translation invariant estimators. For the general background, see Tukey (1960) (p. 448 ff.)

...read moreread less

5,628 citations