Nonconcave Penalized Likelihood With NP-Dimensionality
Summary
Introduction
- The penalty functions considered are not arbitrary nonconvex functions, but rather folded-concave functions.
- These constitute the main theoretical contributions of the paper.
A. Penalty Function
- For simplicity, the authors drop the dependence of the penalty function $p_\lambda(\cdot)$ on $\lambda$ and write it as $p(\cdot)$ when there is no confusion.
- Many penalty functions have been proposed in the literature for regularization.
- The $L_q$ penalty for $0 < q < 1$ bridges these two cases (Frank and Friedman, 1993).
- Hereafter the authors consider penalty functions that satisfy the following condition. Condition 1: $p_\lambda(t)$ is increasing and concave in $t \in [0, \infty)$, and has a continuous derivative $p'_\lambda(t)$ with $p'_\lambda(0+) > 0$.
- Clearly the $L_1$ penalty is a convex function that falls at the boundary of the class of penalty functions satisfying Condition 1. Fan and Li (2001) advocate penalty functions that give estimators with three desired properties: unbiasedness, sparsity and continuity, and provide insights into them (see also Antoniadis and Fan, 2001); a numerical sketch of representative penalties in this class follows.
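The following is a minimal numerical sketch (not from the paper) of the derivative $p'_\lambda(t)$ for three representative penalties in this class; the SCAD and MCP derivative formulas follow Fan and Li (2001) and Zhang (2010), and the default values of `a` are conventional choices rather than prescriptions from this paper.

```python
import numpy as np

def l1_deriv(t, lam):
    """Derivative of the L1 penalty p(t) = lam * t on t >= 0."""
    return np.full_like(np.asarray(t, dtype=float), lam)

def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty (Fan and Li, 2001) on t >= 0:
    equals lam on [0, lam], decays linearly to 0 on (lam, a*lam], and
    vanishes beyond a*lam (whence unbiasedness for large signals)."""
    t = np.asarray(t, dtype=float)
    return lam * ((t <= lam)
                  + (t > lam) * np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam))

def mcp_deriv(t, lam, a=2.0):
    """Derivative of the MCP (Zhang, 2010) on t >= 0: (lam - t/a)_+."""
    t = np.asarray(t, dtype=float)
    return np.maximum(lam - t / a, 0.0)
```

All three have $p'_\lambda(0+) = \lambda > 0$ and are derivatives of increasing, concave functions on $[0, \infty)$, so they satisfy Condition 1.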
C. Global Optimality
- A natural question is when the nonconcave penalized maximum likelihood estimator is a global maximizer of the penalized likelihood in (3) (recalled schematically after this list).
- Since $b''(\theta)$ is always positive, it is easy to show that the Hessian matrix of $-\ell_n(\beta)$ is always positive definite when the design matrix has full column rank, which entails that the log-likelihood function $\ell_n(\beta)$ is strictly concave in $\beta$.
- The proposition below gives a condition under which the penalty term in (3) does not change the global maximizer.
- Of particular interest is to derive the conditions under which the PMLE is also an oracle estimator, in addition to possessing the above restricted global optimality on the union of coordinate subspaces.
- Assume that the conditions of Proposition 2 are satisfied for the submatrix of the design matrix formed by the columns in the true model, and that the true model is $\delta$-identifiable for some $\delta > 0$.
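For reference, the penalized likelihood objective for a GLM with canonical link takes the following generic form; the $n$-scaling of the penalty term is a common convention in this literature and may differ from the paper's exact display (3).

```latex
% Penalized GLM log-likelihood with canonical link; b is the cumulant
% function of the exponential family, and the dispersion is suppressed.
\[
  Q_n(\beta) \;=\; \ell_n(\beta) \;-\; n \sum_{j=1}^{p} p_\lambda(|\beta_j|),
  \qquad
  \ell_n(\beta) \;=\; \sum_{i=1}^{n} \bigl\{ y_i\, x_i^\top \beta
                      - b(x_i^\top \beta) \bigr\}.
\]
```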
III. NONASYMPTOTIC WEAK ORACLE PROPERTIES
- The authors study a nonasymptotic property of the nonconcave penalized likelihood estimator $\hat{\beta}$, called the weak oracle property, introduced by Lv and Fan (2009) in the setting of penalized least squares.
- The weak oracle property means sparsity, in the sense that $\hat{\beta}_2 = \mathbf{0}$ with probability tending to 1 as $n \to \infty$, and consistency under the $L_\infty$ loss (stated schematically after this list), where $\hat{\beta} = (\hat{\beta}_1^T, \hat{\beta}_2^T)^T$ and $\hat{\beta}_1$ is the subvector of $\hat{\beta}$ formed by components in $\mathrm{supp}(\beta_0)$.
- This property is weaker than the oracle property introduced by Fan and Li (2001).
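Schematically, the weak oracle property can be stated as follows; the $n^{-\gamma}\log n$ rate is the form given by Lv and Fan (2009), with the exponent $\gamma$ as specified in Theorem 2.

```latex
% Weak oracle property: sparsity plus L-infinity consistency.
\[
  \Pr\bigl(\hat{\beta}_2 = \mathbf{0}\bigr) \;\to\; 1
  \quad (n \to \infty),
  \qquad
  \bigl\|\hat{\beta}_1 - \beta_{0,1}\bigr\|_\infty
  \;=\; O\bigl(n^{-\gamma} \log n\bigr).
\]
```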
A. Regularity Conditions
- As mentioned before, the authors condition on the design matrix $X$ and use a penalty in the class satisfying Condition 1.
- To simplify the presentation, the authors assume without loss of generality that each covariate has been standardized so that $\|x_j\|_2 = \sqrt{n}$ for every column $x_j$ of the design matrix (see the sketch at the end of this list).
- Given the growth condition assumed on the dimensionality, condition (15) usually holds.
- For the Gaussian linear regression model, condition (17) holds automatically.
- For the case of unbounded responses satisfying the moment condition (20), the authors define an analogous quantity to control the tail behavior of the responses.
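A two-line sketch of the column standardization assumed above (one reading of the convention; the helper name is illustrative):

```python
import numpy as np

def standardize_columns(X):
    """Rescale each column x_j of the design matrix so that
    ||x_j||_2 = sqrt(n); one reading of the standardization above."""
    n = X.shape[0]
    return X * (np.sqrt(n) / np.linalg.norm(X, axis=0))
```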
B. Weak Oracle Properties
- Theorem 2 (Weak Oracle Property): Assume that Conditions 1–3 and the probability bound (22) are satisfied, together with the rate conditions on the regularization parameter and the minimum signal strength stated in the theorem.
- Then there exists a nonconcave penalized likelihood estimator $\hat{\beta}$ such that, for sufficiently large $n$, with probability at least $1 - \epsilon_n$ for the nonasymptotic sequence $\epsilon_n$ given in the theorem, $\hat{\beta}$ satisfies: a) (sparsity) $\hat{\beta}_2 = \mathbf{0}$; b) ($L_\infty$ loss) $\|\hat{\beta}_1 - \beta_{0,1}\|_\infty = O(n^{-\gamma}\log n)$, where $\hat{\beta}_1$ and $\beta_{0,1}$ are respectively the subvectors of $\hat{\beta}$ and $\beta_0$ formed by components in $\mathrm{supp}(\beta_0)$.
- It also enters the nonasymptotic probability bound.
- The value of $\gamma$ can be taken larger for concave penalties.
- A large value of $\gamma$, however, puts a more stringent condition on the design matrix.
C. Sampling Properties of L1-Based PMLE
- When the $L_1$-penalty is applied, the penalized likelihood in (3) is concave (see the sketch at the end of this subsection).
- The local maximizer in Theorems 1 and 2 becomes the global maximizer.
- Due to its popularity, the authors now examine the implications of Theorem 2 in the context of $L_1$-penalized least squares and $L_1$-penalized likelihood.
- As a corollary of Theorem 2, the authors obtain Corollary 1 ($L_1$ Penalized Estimator): under Conditions 2 and 3 and the probability bound (22), together with the stated rate conditions, the $L_1$-penalized likelihood estimator achieves model selection consistency at the stated rate.
- For the $L_1$-penalized least squares, Corollary 1 continues to hold without the normality assumption, as long as the probability bound (22) holds.
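As a concrete illustration of $L_1$-penalized likelihood, here is a minimal sketch using scikit-learn's $L_1$-penalized logistic regression; this is a stand-in solver, not the paper's implementation, and the data-generating settings are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p, s = 200, 500, 5                    # n samples, p covariates, s signals
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s] = 1.0                          # sparse true coefficient vector
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta0)))

# C is the inverse regularization strength: smaller C = heavier L1 penalty.
fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("selected predictors:", np.flatnonzero(fit.coef_))
```

Since the $L_1$-penalized objective is concave as a maximization problem, the solver's output is a global maximizer, matching the discussion above.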
IV. ORACLE PROPERTIES
- The authors study the oracle property (Fan and Li, 2001) of the nonconcave penalized likelihood estimator $\hat{\beta}$.
- Thus, Condition 5 is less restrictive for SCAD-like penalties, since $p'_\lambda(t) = 0$ for sufficiently large $t$.
- Theorem 3 can be viewed as answering, through Conditions 4 and 5, the question of how strong the minimum signal must be, given the dimensionality, for the penalized likelihood estimator to enjoy these properties.
- Specifically, for each coordinate within each iteration, ICA (iterative coordinate ascent) uses the second-order approximation of the log-likelihood at the $p$-vector from the previous step along that coordinate and maximizes the resulting univariate penalized quadratic approximation (see the sketch after this list).
- When the penalty keeps the objective concave, a suitable choice of its parameters ensures that the computed maximizer is the global maximizer of (3).
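Below is a minimal sketch of this coordinate optimization idea for penalized least squares with columns standardized so that $\|x_j\|_2^2 = n$. The SCAD thresholding formula is the standard closed form for the univariate problem with unit curvature and $a > 2$ (Fan and Li, 2001); the paper's ICA algorithm applies the same univariate update to a second-order approximation of the GLM log-likelihood rather than to least squares directly.

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimizer of 0.5*(z - t)^2 + lam*|t| (the L1 coordinate update)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    """Closed-form minimizer of 0.5*(z - t)^2 + SCAD(|t|; lam, a),
    valid for a > 2 (Fan and Li, 2001)."""
    if abs(z) <= 2.0 * lam:
        return soft_threshold(z, lam)
    if abs(z) <= a * lam:
        return ((a - 1.0) * z - np.sign(z) * a * lam) / (a - 2.0)
    return z

def coordinate_optimize(X, y, lam, threshold=scad_threshold, n_iter=100):
    """Cyclic coordinate optimization for penalized least squares,
    assuming ||x_j||_2^2 = n for every column; a sketch of the idea
    behind ICA, not the paper's exact algorithm."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta                       # current residual
    for _ in range(n_iter):
        for j in range(p):
            z = beta[j] + X[:, j] @ r / n  # center of the univariate quadratic
            new = threshold(z, lam)
            r += X[:, j] * (beta[j] - new) # incremental residual update
            beta[j] = new
    return beta
```

Passing `threshold=soft_threshold` recovers Lasso coordinate descent; the SCAD update differs only through the thresholding rule.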
A. Logistic Regression
- The authors demonstrate the performance of nonconcave penalized likelihood methods in logistic regression.
- The authors used five-fold cross-validation (CV) based on prediction error to select the tuning parameter (a selection sketch follows this list).
- Table II and Fig. 2 report the comparison results in terms of PE (prediction error), $L_1$ loss, $L_2$ loss, deviance, #S (number of selected variables), and FN (number of false negatives).
- It is clear from Table II that Lasso selects a far larger model than SCAD and MCP.
- Since the coefficients of the sixth through tenth covariates are significantly smaller than the other nonzero coefficients and the covariates are independent, the distribution of the response can be well approximated by the sparse model with the five small nonzero coefficients set to zero.
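A sketch of the five-fold CV tuning described above, reusing `X` and `y` from the earlier logistic snippet; accuracy is one minus the misclassification prediction error, and the grid of penalty levels is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

grid = np.logspace(-2, 1, 20)  # candidate inverse penalty strengths C
scores = [cross_val_score(
              LogisticRegression(penalty="l1", solver="liblinear", C=C),
              X, y, cv=5, scoring="accuracy").mean()
          for C in grid]
best_C = grid[int(np.argmax(scores))]
print("five-fold CV choice of C:", best_C)
```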
B. Poisson Regression
- The authors demonstrate the performance of nonconcave penalized likelihood methods in Poisson regression (a fitting sketch follows this list).
- The authors set the sample size and dimensionality (the dimensionality being 1000) and chose the true regression coefficient vector by specifying its nonzero components.
- [Table V caption: medians and robust standard deviations (in parentheses) of PE, $L_1$ loss, $L_2$ loss, deviance, #S, and FN over 100 simulations for all methods in Poisson regression by BIC and CV.]
- Lasso, SCAD, and MCP were compared over 100 simulations.
- The BIC and five-fold CV were used to select the regularization parameter.
- Table VI presents the comparison results in terms of PE, $L_1$ loss, $L_2$ loss, deviance, #S, and FN.
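For a runnable Poisson analogue, here is a sketch using statsmodels' elastic-net solver with pure $L_1$ weight as a stand-in for the paper's concave-penalty algorithm; the data-generating settings and the penalty level are made up for the example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:3] = 0.5                          # three true signals
y = rng.poisson(np.exp(X @ beta0))       # Poisson responses, log link

# L1-penalized Poisson likelihood: elastic net with L1_wt=1 is pure L1.
model = sm.GLM(y, X, family=sm.families.Poisson())
fit = model.fit_regularized(method="elastic_net", alpha=0.05, L1_wt=1.0)
print("selected:", np.flatnonzero(np.abs(fit.params) > 1e-8))
```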
C. Real Data Analysis
- The authors apply nonconcave penalized likelihood methods to the neuroblastoma data set, which was studied by Oberthuer et al. (2006).
- The patients at diagnosis were aged from 0 to 296 months with a median age of 15 months.
- The study aimed to develop a gene expression-based classifier for neuroblastoma patients that can reliably predict the course of the disease.
- The authors applied Lasso, SCAD and MCP using the logistic regression model.
- For the 3-year EFS classification, the authors randomly selected 125 subjects (25 positives and 100 negatives) as the training set and the rest as the test set (a split sketch follows this list).
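One way to reproduce this kind of class-stratified split (the helper and its defaults are illustrative, not taken from the paper):

```python
import numpy as np

def split_by_class(y, n_pos=25, n_neg=100, seed=0):
    """Randomly pick fixed numbers of positive and negative subjects for
    the training set, mirroring the 125-subject split described above;
    returns (train_idx, test_idx)."""
    rng = np.random.default_rng(seed)
    pos = rng.permutation(np.flatnonzero(y == 1))
    neg = rng.permutation(np.flatnonzero(y == 0))
    train = np.concatenate([pos[:n_pos], neg[:n_neg]])
    test = np.setdiff1d(np.arange(len(y)), train)
    return train, test
```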
VII. DISCUSSIONS
- The authors have studied penalized likelihood methods for ultrahigh dimensional variable selection.
- In the context of GLMs, the authors have shown that such methods have model selection consistency with oracle properties even for NP-dimensionality, for a class of nonconcave penalized likelihood approaches.
- The authors' results are consistent with the known fact in the literature that concave penalties can reduce the bias problems of convex penalties.
- The authors have exploited the coordinate optimization with the ICA algorithm to find the solution paths and illustrated the performance of nonconcave penalized likelihood methods with numerical studies.
- The authors' results show that coordinate optimization works equally well and efficiently for producing the entire solution paths for concave penalties.
A. Proof of Theorem 1
- The authors will first derive the necessary condition.
- It follows from classical optimization theory that if $\hat{\beta}$ is a local maximizer of the penalized likelihood (3), it satisfies the Karush-Kuhn-Tucker (KKT) conditions: there exists some $v$ such that (31) holds, where, in the standard subgradient form, $v_j = p'_\lambda(|\hat{\beta}_j|)\,\mathrm{sgn}(\hat{\beta}_j)$ for indices with $\hat{\beta}_j \neq 0$, and $|v_j| \le p'_\lambda(0+)$ for indices with $\hat{\beta}_j = 0$.
- Consider the projection onto the coordinate subspace corresponding to the support of $\hat{\beta}$.
- Note that the components of this projection are zero for indices outside the support, and for each index in the support its sign agrees with that of the corresponding component of $\hat{\beta}$.
- By condition (8) and the continuity of the functions involved, there exists some radius such that the required inequality (37) holds for any point in a ball in the subspace centered at the projection with that radius.
B. Proof of Proposition 1
- By concavity, the authors can easily show that the relevant level set is a closed convex set, with the specified points as interior points and the corresponding level surface as its boundary.
- The authors now show that the global maximizer of the penalized likelihood belongs to this set.
- This follows easily from the definitions of the quantities involved.
C. Proof of Proposition 2
- From the proof of Proposition 1, the authors know that the global maximizer of the penalized likelihood belongs to the same set.
- Note that, by assumption, the SCAD penalized likelihood estimator satisfies the conditions required by the proposition.
- The key idea is to use a first-order Taylor expansion around the estimator and retain the Lagrange remainder term.
- This can easily be shown from the analytical solution to (38).
- Thus, it suffices to prove the claimed inequality on the relevant interval.
D. Proof of Proposition 3
- Let $\mathcal{S}$ be any $s$-dimensional coordinate subspace different from the one corresponding to the true model.
- Clearly, the span of the union of the two subspaces is a coordinate subspace of dimension at most $2s$.
- Then part a) follows easily from the assumptions and Proposition 1.
- Part b) is an easy consequence of Proposition 2, in view of the assumptions and a defining property of the SCAD penalty given by (4).
E. Proof of Proposition 4
- Part a) follows easily from a simple application of Hoeffding's inequality (Hoeffding, 1963), since $a_1 Y_1, \dots, a_n Y_n$ are $n$ independent bounded random variables, where $a = (a_1, \dots, a_n)^T$.
- In view of condition (20), the summands are independent random variables with mean zero and satisfy the required moment bounds; thus, an application of Bernstein's inequality (see, e.g., Bennett, 1962, or van der Vaart and Wellner, 1996) yields the desired exponential bound, which concludes the proof (the standard forms of both inequalities are recalled below).
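For reference, the standard textbook forms of the two inequalities invoked above; the paper's displays specialize them to the weighted responses $a_iY_i$.

```latex
% Hoeffding: Z_1, ..., Z_n independent with Z_i in [a_i, b_i].
\[
  \Pr\Bigl(\Bigl|\sum_{i=1}^{n} (Z_i - \mathbb{E} Z_i)\Bigr| \ge t\Bigr)
  \;\le\; 2\exp\Bigl(-\frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\Bigr).
\]
% Bernstein: Z_1, ..., Z_n independent, mean zero, |Z_i| <= M.
\[
  \Pr\Bigl(\Bigl|\sum_{i=1}^{n} Z_i\Bigr| \ge t\Bigr)
  \;\le\; 2\exp\Bigl(-\frac{t^2/2}{\sum_{i=1}^{n} \mathbb{E} Z_i^2 + Mt/3}\Bigr).
\]
```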
F. Proof of Theorem 2
- The authors break the whole proof into several steps.
- It follows from Bonferroni's inequality and (22) that the bound (39) holds, with the extra term appearing for unbounded responses; the needed condition is guaranteed for sufficiently large $n$ by Condition 3.
- To this end, the authors represent the relevant quantity by a componentwise second-order Taylor expansion around the true coefficient subvector with the Lagrange remainder term, obtaining (42), where each remainder is evaluated at some vector lying on the line segment joining the estimator and the true subvector.
- Thus, the authors have shown that (7) indeed has a solution in the specified neighborhood.
- It remains to bound the second term of (48).
G. Proof of Theorem 3
- To prove the conclusions, it suffices to show that under the given regularity conditions, there exists a strict local maximizer $\hat{\beta}$ of the penalized likelihood in (3) such that 1) $\hat{\beta}_2 = \mathbf{0}$ with probability tending to 1 as $n \to \infty$ (i.e., sparsity), and 2) $\|\hat{\beta}_1 - \beta_{0,1}\|_2 = O_P(\sqrt{s/n})$ (i.e., $L_2$-consistency, with the rate in the usual form of such results).
- Step 1 (Consistency in the $s$-dimensional subspace):
- The authors now show that there exists a strict local maximizer of the restricted problem within the stated neighborhood.
- To this end, the authors define an event in terms of the boundary of the closed neighborhood under consideration.
- Then, for sufficiently large $n$, by (26) and Conditions 4 and 5, a chain of bounds follows; combining it with (53) and Markov's inequality yields the probability bound, and the final step follows from Conditions 4 and 5 together with the monotonicity of the function involved.
H. Proof of Theorem 4
- On the event defined in the proof of Theorem 3, it has been shown that $\hat{\beta}$ is a strict local maximizer of the penalized likelihood with the stated properties.
- This, along with the first part of (26) in Condition 4, entails (57), where the small-order term is understood under the $L_2$ norm.
- The authors are now ready to show the asymptotic normality of $\hat{\beta}_1$.
- The methods are implemented for the Gaussian linear regression model, the logistic regression model, and the Poisson regression model.
- For the Gaussian linear model, maximizing the penalized likelihood becomes the penalized least-squares problem.
Frequently Asked Questions
Q2. What is the condition for the local maximizer of the nonconcave penalized likelihood?
Then $\hat{\beta}$ is a strict local maximizer of the nonconcave penalized likelihood defined by (3) if conditions (7), (8), and (9) hold, where the two submatrices of the design matrix are formed by the columns in the support of $\hat{\beta}$ and in its complement, respectively, and the subvector of $\hat{\beta}$ is formed by all of its nonzero components.
Q3. What is the dimensionality of the penalized least squares?
In this case, the dimensionality that the penalized least-squares can handle grows at a certain rate in the sample size, which is usually smaller than that for the other case considered.
Q4. What is the condition of the Gaussian linear regression model?
Condition (16) controls the uniform growth rate of the norm of these multiple-regression coefficients, a notion of weak correlation between the unimportant covariates and the important ones.
Q5. What is the definition of a coordinate subspace?
A subspace of $\mathbb{R}^p$ is called a coordinate subspace if it is spanned by a subset of the natural basis $\{e_1, \dots, e_p\}$, where each $e_j$ is the $p$-vector with $j$th component 1 and 0 elsewhere.
Q6. What is the maximizer of the penalized likelihood?
Then there exists a strict local maximizer $\hat{\beta}$ of the penalized likelihood such that $\hat{\beta}_2 = \mathbf{0}$ with probability tending to 1 as $n \to \infty$, where $\hat{\beta}_1$ is the subvector of $\hat{\beta}$ formed by components in the support of $\beta_0$.
Q7. What is the simplest way to show that the second derivative of the penalty function does not exist?
More generally, when the second derivative of the penalty function does not necessarily exist, it is easy to show that the second part of the matrix can be replaced by a diagonal matrix whose maximum absolute element is bounded by the local concavity of the penalty.
Q8. How is the convexity of the relevant level set established?
By concavity, the authors can easily show that the relevant level set is a closed convex set, with the specified points as interior points and the corresponding level surface as its boundary.
Q9. What is the second order approximation in ICA?
When the log-likelihood is quadratic in $\beta$, e.g., for the Gaussian linear regression model, the second-order approximation in ICA is exact at each step.
Q10. Why do the authors examine the implications of Theorem 2?
Due to its popularity, the authors examine the implications of Theorem 2 in the context of $L_1$-penalized least squares and $L_1$-penalized likelihood.