Nonconcave Penalized Likelihood With NP-Dimensionality
Summary
Introduction
- The penalty functions considered are not arbitrary nonconvex functions, but rather folded-concave functions.
- These constitute the main theoretical contributions of the paper.
A. Penalty Function
- For simplicity, the authors drop the dependence of the penalty function $p_\lambda(\cdot)$ on $\lambda$ and write it as $p(\cdot)$ when there is no confusion.
- Many penalty functions have been proposed in the literature for regularization.
- The $L_q$ penalty for $0 < q < 1$ bridges these two cases (Frank and Friedman, 1993).
- Hereafter the authors consider penalty functions that satisfy the following condition. Condition 1: $p_\lambda(t)$ is increasing and concave in $t \in [0, \infty)$, and has a continuous derivative $p'_\lambda(t)$ with $p'_\lambda(0+) > 0$.
- Clearly the $L_1$ penalty is a convex function that falls at the boundary of the class of penalty functions satisfying Condition 1. Fan and Li (2001) advocate penalty functions that give estimators with three desired properties: unbiasedness, sparsity and continuity, and provide insights into them (see also Antoniadis and Fan, 2001); a numerical sketch of representative penalties in this class follows.
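The following is a minimal numerical sketch (not from the paper) of the derivative $p'_\lambda(t)$ for three representative penalties in this class; the SCAD and MCP derivative formulas follow Fan and Li (2001) and Zhang (2010), and the default values of `a` are conventional choices rather than prescriptions from this paper.

```python
import numpy as np

def l1_deriv(t, lam):
    """Derivative of the L1 penalty p(t) = lam * t on t >= 0."""
    return np.full_like(np.asarray(t, dtype=float), lam)

def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty (Fan and Li, 2001) on t >= 0:
    equals lam on [0, lam], decays linearly to 0 on (lam, a*lam], and
    vanishes beyond a*lam (whence unbiasedness for large signals)."""
    t = np.asarray(t, dtype=float)
    return lam * ((t <= lam)
                  + (t > lam) * np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam))

def mcp_deriv(t, lam, a=2.0):
    """Derivative of the MCP (Zhang, 2010) on t >= 0: (lam - t/a)_+."""
    t = np.asarray(t, dtype=float)
    return np.maximum(lam - t / a, 0.0)
```

All three have $p'_\lambda(0+) = \lambda > 0$ and are derivatives of increasing, concave functions on $[0, \infty)$, so they satisfy Condition 1.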
C. Global Optimality
- A natural question is when the nonconcave penalized maximum likelihood estimator is a global maximizer of the penalized likelihood in (3) (recalled schematically after this list).
- Since $b''(\theta)$ is always positive, it is easy to show that the Hessian matrix of $-\ell_n(\beta)$ is always positive definite when the design matrix has full column rank, which entails that the log-likelihood function $\ell_n(\beta)$ is strictly concave in $\beta$.
- The proposition below gives a condition under which the penalty term in (3) does not change the global maximizer.
- Of particular interest is to derive the conditions under which the PMLE is also an oracle estimator, in addition to possessing the above restricted global optimality on the union of coordinate subspaces.
- Assume that the conditions of Proposition 2 are satisfied for the submatrix of the design matrix formed by the columns in the true model, and that the true model is $\delta$-identifiable for some $\delta > 0$.
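For reference, the penalized likelihood objective for a GLM with canonical link takes the following generic form; the $n$-scaling of the penalty term is a common convention in this literature and may differ from the paper's exact display (3).

```latex
% Penalized GLM log-likelihood with canonical link; b is the cumulant
% function of the exponential family, and the dispersion is suppressed.
\[
  Q_n(\beta) \;=\; \ell_n(\beta) \;-\; n \sum_{j=1}^{p} p_\lambda(|\beta_j|),
  \qquad
  \ell_n(\beta) \;=\; \sum_{i=1}^{n} \bigl\{ y_i\, x_i^\top \beta
                      - b(x_i^\top \beta) \bigr\}.
\]
```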
III. NONASYMPTOTIC WEAK ORACLE PROPERTIES
- The authors study a nonasymptotic property of the nonconcave penalized likelihood estimator $\hat{\beta}$, called the weak oracle property, introduced by Lv and Fan (2009) in the setting of penalized least squares.
- The weak oracle property means sparsity, in the sense that $\hat{\beta}_2 = \mathbf{0}$ with probability tending to 1 as $n \to \infty$, and consistency under the $L_\infty$ loss (stated schematically after this list), where $\hat{\beta} = (\hat{\beta}_1^T, \hat{\beta}_2^T)^T$ and $\hat{\beta}_1$ is the subvector of $\hat{\beta}$ formed by components in $\mathrm{supp}(\beta_0)$.
- This property is weaker than the oracle property introduced by Fan and Li (2001).
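Schematically, the weak oracle property can be stated as follows; the $n^{-\gamma}\log n$ rate is the form given by Lv and Fan (2009), with the exponent $\gamma$ as specified in Theorem 2.

```latex
% Weak oracle property: sparsity plus L-infinity consistency.
\[
  \Pr\bigl(\hat{\beta}_2 = \mathbf{0}\bigr) \;\to\; 1
  \quad (n \to \infty),
  \qquad
  \bigl\|\hat{\beta}_1 - \beta_{0,1}\bigr\|_\infty
  \;=\; O\bigl(n^{-\gamma} \log n\bigr).
\]
```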
A. Regularity Conditions
- As mentioned before, the authors condition on the design matrix $X$ and use a penalty in the class satisfying Condition 1.
- To simplify the presentation, the authors assume without loss of generality that each covariate has been standardized so that $\|x_j\|_2 = \sqrt{n}$ for every column $x_j$ of the design matrix (see the sketch at the end of this list).
- Given the growth condition assumed on the dimensionality, condition (15) usually holds.
- For the Gaussian linear regression model, condition (17) holds automatically.
- For the case of unbounded responses satisfying the moment condition (20), the authors define an analogous quantity to control the tail behavior of the responses.
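A two-line sketch of the column standardization assumed above (one reading of the convention; the helper name is illustrative):

```python
import numpy as np

def standardize_columns(X):
    """Rescale each column x_j of the design matrix so that
    ||x_j||_2 = sqrt(n); one reading of the standardization above."""
    n = X.shape[0]
    return X * (np.sqrt(n) / np.linalg.norm(X, axis=0))
```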
B. Weak Oracle Properties
- Theorem 2 (Weak Oracle Property): Assume that Conditions 1–3 and the probability bound (22) are satisfied, together with the rate conditions on the regularization parameter and the minimum signal strength stated in the theorem.
- Then there exists a nonconcave penalized likelihood estimator $\hat{\beta}$ such that, for sufficiently large $n$, with probability at least $1 - \epsilon_n$ for the nonasymptotic sequence $\epsilon_n$ given in the theorem, $\hat{\beta}$ satisfies: a) (sparsity) $\hat{\beta}_2 = \mathbf{0}$; b) ($L_\infty$ loss) $\|\hat{\beta}_1 - \beta_{0,1}\|_\infty = O(n^{-\gamma}\log n)$, where $\hat{\beta}_1$ and $\beta_{0,1}$ are respectively the subvectors of $\hat{\beta}$ and $\beta_0$ formed by components in $\mathrm{supp}(\beta_0)$.
- It also enters the nonasymptotic probability bound.
- The value of $\gamma$ can be taken larger for concave penalties.
- A large value of $\gamma$, however, puts a more stringent condition on the design matrix.
C. Sampling Properties of L1-Based PMLE
- When the $L_1$-penalty is applied, the penalized likelihood in (3) is concave (see the sketch at the end of this subsection).
- The local maximizer in Theorems 1 and 2 becomes the global maximizer.
- Due to its popularity, the authors now examine the implications of Theorem 2 in the context of $L_1$-penalized least squares and $L_1$-penalized likelihood.
- As a corollary of Theorem 2, the authors obtain Corollary 1 ($L_1$ Penalized Estimator): under Conditions 2 and 3 and the probability bound (22), together with the stated rate conditions, the $L_1$-penalized likelihood estimator achieves model selection consistency at the stated rate.
- For the $L_1$-penalized least squares, Corollary 1 continues to hold without the normality assumption, as long as the probability bound (22) holds.
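As a concrete illustration of $L_1$-penalized likelihood, here is a minimal sketch using scikit-learn's $L_1$-penalized logistic regression; this is a stand-in solver, not the paper's implementation, and the data-generating settings are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p, s = 200, 500, 5                    # n samples, p covariates, s signals
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s] = 1.0                          # sparse true coefficient vector
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta0)))

# C is the inverse regularization strength: smaller C = heavier L1 penalty.
fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("selected predictors:", np.flatnonzero(fit.coef_))
```

Since the $L_1$-penalized objective is concave as a maximization problem, the solver's output is a global maximizer, matching the discussion above.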
IV. ORACLE PROPERTIES
- The authors study the oracle property (Fan and Li, 2001) of the nonconcave penalized likelihood estimator $\hat{\beta}$.
- Thus, Condition 5 is less restrictive for SCAD-like penalties, since $p'_\lambda(t) = 0$ for sufficiently large $t$.
- Theorem 3 can be viewed as answering, through Conditions 4 and 5, the question of how strong the minimum signal must be, given the dimensionality, for the penalized likelihood estimator to enjoy these properties.
- Specifically, for each coordinate within each iteration, ICA (iterative coordinate ascent) uses the second-order approximation of the log-likelihood at the $p$-vector from the previous step along that coordinate and maximizes the resulting univariate penalized quadratic approximation (see the sketch after this list).
- When the penalty keeps the objective concave, a suitable choice of its parameters ensures that the computed maximizer is the global maximizer of (3).
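Below is a minimal sketch of this coordinate optimization idea for penalized least squares with columns standardized so that $\|x_j\|_2^2 = n$. The SCAD thresholding formula is the standard closed form for the univariate problem with unit curvature and $a > 2$ (Fan and Li, 2001); the paper's ICA algorithm applies the same univariate update to a second-order approximation of the GLM log-likelihood rather than to least squares directly.

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimizer of 0.5*(z - t)^2 + lam*|t| (the L1 coordinate update)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    """Closed-form minimizer of 0.5*(z - t)^2 + SCAD(|t|; lam, a),
    valid for a > 2 (Fan and Li, 2001)."""
    if abs(z) <= 2.0 * lam:
        return soft_threshold(z, lam)
    if abs(z) <= a * lam:
        return ((a - 1.0) * z - np.sign(z) * a * lam) / (a - 2.0)
    return z

def coordinate_optimize(X, y, lam, threshold=scad_threshold, n_iter=100):
    """Cyclic coordinate optimization for penalized least squares,
    assuming ||x_j||_2^2 = n for every column; a sketch of the idea
    behind ICA, not the paper's exact algorithm."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta                       # current residual
    for _ in range(n_iter):
        for j in range(p):
            z = beta[j] + X[:, j] @ r / n  # center of the univariate quadratic
            new = threshold(z, lam)
            r += X[:, j] * (beta[j] - new) # incremental residual update
            beta[j] = new
    return beta
```

Passing `threshold=soft_threshold` recovers Lasso coordinate descent; the SCAD update differs only through the thresholding rule.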
A. Logistic Regression
- The authors demonstrate the performance of nonconcave penalized likelihood methods in logistic regression.
- The authors used five-fold cross-validation (CV) based on prediction error to select the tuning parameter (a selection sketch follows this list).
- Table II and Fig. 2 report the comparison results in terms of PE (prediction error), $L_1$ loss, $L_2$ loss, deviance, #S (number of selected variables), and FN (number of false negatives).
- It is clear from Table II that Lasso selects a far larger model than SCAD and MCP.
- Since the coefficients of the sixth through tenth covariates are significantly smaller than the other nonzero coefficients and the covariates are independent, the distribution of the response can be well approximated by the sparse model with the five small nonzero coefficients set to zero.
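A sketch of the five-fold CV tuning described above, reusing `X` and `y` from the earlier logistic snippet; accuracy is one minus the misclassification prediction error, and the grid of penalty levels is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

grid = np.logspace(-2, 1, 20)  # candidate inverse penalty strengths C
scores = [cross_val_score(
              LogisticRegression(penalty="l1", solver="liblinear", C=C),
              X, y, cv=5, scoring="accuracy").mean()
          for C in grid]
best_C = grid[int(np.argmax(scores))]
print("five-fold CV choice of C:", best_C)
```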
B. Poisson Regression
- The authors demonstrate the performance of nonconcave penalized likelihood methods in Poisson regression (a fitting sketch follows this list).
- The authors set the sample size and dimensionality (the dimensionality being 1000) and chose the true regression coefficient vector by specifying its nonzero components.
- [Table V caption: medians and robust standard deviations (in parentheses) of PE, $L_1$ loss, $L_2$ loss, deviance, #S, and FN over 100 simulations for all methods in Poisson regression by BIC and CV.]
- Lasso, SCAD, and MCP were compared over 100 simulations.
- The BIC and five-fold CV were used to select the regularization parameter.
- Table VI presents the comparison results in terms of PE, $L_1$ loss, $L_2$ loss, deviance, #S, and FN.
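For a runnable Poisson analogue, here is a sketch using statsmodels' elastic-net solver with pure $L_1$ weight as a stand-in for the paper's concave-penalty algorithm; the data-generating settings and the penalty level are made up for the example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:3] = 0.5                          # three true signals
y = rng.poisson(np.exp(X @ beta0))       # Poisson responses, log link

# L1-penalized Poisson likelihood: elastic net with L1_wt=1 is pure L1.
model = sm.GLM(y, X, family=sm.families.Poisson())
fit = model.fit_regularized(method="elastic_net", alpha=0.05, L1_wt=1.0)
print("selected:", np.flatnonzero(np.abs(fit.params) > 1e-8))
```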
C. Real Data Analysis
- The authors apply nonconcave penalized likelihood methods to the neuroblastoma data set, which was studied by Oberthuer et al. (2006).
- The patients at diagnosis were aged from 0 to 296 months with a median age of 15 months.
- The study aimed to develop a gene expression-based classifier for neuroblastoma patients that can reliably predict the course of the disease.
- The authors applied Lasso, SCAD and MCP using the logistic regression model.
- For the 3-year EFS classification, the authors randomly selected 125 subjects (25 positives and 100 negatives) as the training set and the rest as the test set (a split sketch follows this list).
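One way to reproduce this kind of class-stratified split (the helper and its defaults are illustrative, not taken from the paper):

```python
import numpy as np

def split_by_class(y, n_pos=25, n_neg=100, seed=0):
    """Randomly pick fixed numbers of positive and negative subjects for
    the training set, mirroring the 125-subject split described above;
    returns (train_idx, test_idx)."""
    rng = np.random.default_rng(seed)
    pos = rng.permutation(np.flatnonzero(y == 1))
    neg = rng.permutation(np.flatnonzero(y == 0))
    train = np.concatenate([pos[:n_pos], neg[:n_neg]])
    test = np.setdiff1d(np.arange(len(y)), train)
    return train, test
```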
VII. DISCUSSIONS
- The authors have studied penalized likelihood methods for ultrahigh dimensional variable selection.
- In the context of GLMs, the authors have shown that such methods have model selection consistency with oracle properties even for NP-dimensionality, for a class of nonconcave penalized likelihood approaches.
- The authors' results are consistent with the known fact in the literature that concave penalties can reduce the bias problems of convex penalties.
- The authors have exploited the coordinate optimization with the ICA algorithm to find the solution paths and illustrated the performance of nonconcave penalized likelihood methods with numerical studies.
- The authors' results show that coordinate optimization works equally well and efficiently for producing the entire solution paths for concave penalties.
A. Proof of Theorem 1
- The authors will first derive the necessary condition.
- It follows from classical optimization theory that if $\hat{\beta}$ is a local maximizer of the penalized likelihood (3), it satisfies the Karush-Kuhn-Tucker (KKT) conditions: there exists some $v$ such that (31) holds, where, in the standard subgradient form, $v_j = p'_\lambda(|\hat{\beta}_j|)\,\mathrm{sgn}(\hat{\beta}_j)$ for indices with $\hat{\beta}_j \neq 0$, and $|v_j| \le p'_\lambda(0+)$ for indices with $\hat{\beta}_j = 0$.
- Consider the projection onto the coordinate subspace corresponding to the support of $\hat{\beta}$.
- Note that the components of this projection are zero for indices outside the support, and for each index in the support its sign agrees with that of the corresponding component of $\hat{\beta}$.
- By condition (8) and the continuity of the functions involved, there exists some radius such that the required inequality (37) holds for any point in a ball in the subspace centered at the projection with that radius.
B. Proof of Proposition 1
- By concavity, the authors can easily show that the relevant level set is a closed convex set, with the specified points as interior points and the corresponding level surface as its boundary.
- The authors now show that the global maximizer of the penalized likelihood belongs to this set.
- This follows easily from the definitions of the quantities involved.
C. Proof of Proposition 2
- From the proof of Proposition 1, the authors know that the global maximizer of the penalized likelihood belongs to the same set.
- Note that, by assumption, the SCAD penalized likelihood estimator satisfies the conditions required by the proposition.
- The key idea is to use a first-order Taylor expansion around the estimator and retain the Lagrange remainder term.
- This can easily be shown from the analytical solution to (38).
- Thus, it suffices to prove the claimed inequality on the relevant interval.
D. Proof of Proposition 3
- Let $\mathcal{S}$ be any $s$-dimensional coordinate subspace different from the one corresponding to the true model.
- Clearly, the span of the union of the two subspaces is a coordinate subspace of dimension at most $2s$.
- Then part a) follows easily from the assumptions and Proposition 1.
- Part b) is an easy consequence of Proposition 2, in view of the assumptions and a defining property of the SCAD penalty given by (4).
E. Proof of Proposition 4
- Part a) follows easily from a simple application of Hoeffding's inequality (Hoeffding, 1963), since $a_1 Y_1, \dots, a_n Y_n$ are $n$ independent bounded random variables, where $a = (a_1, \dots, a_n)^T$.
- In view of condition (20), the summands are independent random variables with mean zero and satisfy the required moment bounds; thus, an application of Bernstein's inequality (see, e.g., Bennett, 1962, or van der Vaart and Wellner, 1996) yields the desired exponential bound, which concludes the proof (the standard forms of both inequalities are recalled below).
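For reference, the standard textbook forms of the two inequalities invoked above; the paper's displays specialize them to the weighted responses $a_iY_i$.

```latex
% Hoeffding: Z_1, ..., Z_n independent with Z_i in [a_i, b_i].
\[
  \Pr\Bigl(\Bigl|\sum_{i=1}^{n} (Z_i - \mathbb{E} Z_i)\Bigr| \ge t\Bigr)
  \;\le\; 2\exp\Bigl(-\frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\Bigr).
\]
% Bernstein: Z_1, ..., Z_n independent, mean zero, |Z_i| <= M.
\[
  \Pr\Bigl(\Bigl|\sum_{i=1}^{n} Z_i\Bigr| \ge t\Bigr)
  \;\le\; 2\exp\Bigl(-\frac{t^2/2}{\sum_{i=1}^{n} \mathbb{E} Z_i^2 + Mt/3}\Bigr).
\]
```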
F. Proof of Theorem 2
- The authors break the whole proof into several steps.
- It follows from Bonferroni's inequality and (22) that the bound (39) holds, with the extra term appearing for unbounded responses; the needed condition is guaranteed for sufficiently large $n$ by Condition 3.
- To this end, the authors represent the relevant quantity by a componentwise second-order Taylor expansion around the true coefficient subvector with the Lagrange remainder term, obtaining (42), where each remainder is evaluated at some vector lying on the line segment joining the estimator and the true subvector.
- Thus, the authors have shown that (7) indeed has a solution in the specified neighborhood.
- It remains to bound the second term of (48).
G. Proof of Theorem 3
- To prove the conclusions, it suffices to show that under the given regularity conditions, there exists a strict local maximizer $\hat{\beta}$ of the penalized likelihood in (3) such that 1) $\hat{\beta}_2 = \mathbf{0}$ with probability tending to 1 as $n \to \infty$ (i.e., sparsity), and 2) $\|\hat{\beta}_1 - \beta_{0,1}\|_2 = O_P(\sqrt{s/n})$ (i.e., $L_2$-consistency, with the rate in the usual form of such results).
- Step 1 (Consistency in the $s$-dimensional subspace):
- The authors now show that there exists a strict local maximizer of the restricted problem within the stated neighborhood.
- To this end, the authors define an event in terms of the boundary of the closed neighborhood under consideration.
- Then, for sufficiently large $n$, by (26) and Conditions 4 and 5, a chain of bounds follows; combining it with (53) and Markov's inequality yields the probability bound, and the final step follows from Conditions 4 and 5 together with the monotonicity of the function involved.
H. Proof of Theorem 4
- On the event defined in the proof of Theorem 3, it has been shown that $\hat{\beta}$ is a strict local maximizer of the penalized likelihood with the stated properties.
- This, along with the first part of (26) in Condition 4, entails (57), where the small-order term is understood under the $L_2$ norm.
- The authors are now ready to show the asymptotic normality of $\hat{\beta}_1$.
- The methods are implemented for the Gaussian linear regression model, the logistic regression model, and the Poisson regression model.
- For the Gaussian linear model, maximizing the penalized likelihood becomes the penalized least-squares problem.
Frequently Asked Questions
Q2. What is the condition for the local maximizer of the nonconcave penalized likelihood?
Then $\hat{\beta}$ is a strict local maximizer of the nonconcave penalized likelihood defined by (3) if conditions (7), (8), and (9) hold, where the two submatrices of the design matrix are formed by the columns in the support of $\hat{\beta}$ and in its complement, respectively, and the subvector of $\hat{\beta}$ is formed by all of its nonzero components.
Q3. What is the dimensionality of the penalized least squares?
In this case, the dimensionality that the penalized least-squares can handle grows at a certain rate in the sample size, which is usually smaller than that for the other case considered.
Q4. What is the condition of the Gaussian linear regression model?
Condition (16) controls the uniform growth rate of the norm of these multiple-regression coefficients, a notion of weak correlation between the unimportant covariates and the important ones.
Q5. What is the definition of a coordinate subspace?
A subspace of $\mathbb{R}^p$ is called a coordinate subspace if it is spanned by a subset of the natural basis $\{e_1, \dots, e_p\}$, where each $e_j$ is the $p$-vector with $j$th component 1 and 0 elsewhere.
Q6. What is the maximizer of the penalized likelihood?
Then there exists a strict local maximizer $\hat{\beta}$ of the penalized likelihood such that $\hat{\beta}_2 = \mathbf{0}$ with probability tending to 1 as $n \to \infty$, where $\hat{\beta}_1$ is the subvector of $\hat{\beta}$ formed by components in the support of $\beta_0$.
Q7. What is the simplest way to show that the second derivative of the penalty function does not exist?
More generally, when the second derivative of the penalty function does not necessarily exist, it is easy to show that the second part of the matrix can be replaced by a diagonal matrix whose maximum absolute element is bounded by the local concavity of the penalty.
Q8. How is the convexity of the relevant level set established?
By concavity, the authors can easily show that the relevant level set is a closed convex set, with the specified points as interior points and the corresponding level surface as its boundary.
Q9. What is the second order approximation in ICA?
When the log-likelihood is quadratic in $\beta$, e.g., for the Gaussian linear regression model, the second-order approximation in ICA is exact at each step.
Q10. Why do the authors examine the implications of Theorem 2?
Due to its popularity, the authors examine the implications of Theorem 2 in the context of $L_1$-penalized least squares and $L_1$-penalized likelihood.