What is the dimensionality of the penalized least squares?

In this case, the dimensionality that the penalized least-squares can handle is as high as when, which is usually smaller than that for the case of .

What is the condition of the Gaussian linear regression model?

Condition (16) controls the uniform growth rate of the -norm of these multiple regression coefficients, a notion of weak correlation between and .

what is the maxi-mizer of the penalized likelihood?

Then there exists a strict local maxi-mizer of the penalized likelihood such that with probability tending to 1 as and, where is a subvector of formed by components in .

What is the simplest way to show that the second derivative of the penalty function does not exist?

More generally, when the second derivative of the penalty function does not necessarily exist, it is easy to show that the second part of the matrix can be replaced by a diagonal matrix with maximum absolute element bounded by .

What is the concavity of the convex set?

By the concavity of , the authors can easily show that for , is a closed convex set with and being its interior points and the level set is its boundary.

What is the second order approximation in ICA?

When is quadratic in , e.g., for the Gaussian linear regression model, the second order approximation in ICA is exact at each step.

Why do the authors examine the implications of Theorem 2?

Due to its popularity, the authors now examine the implications of Theorem 2 in the context of penalized least-squares and penalized likelihood.

(Open Access) Nonconcave Penalized Likelihood With NP-Dimensionality (2011) | Jianqing Fan

Q: What are the contributions mentioned in the paper "Nonconcave penalized likelihood with np-dimensionality" ?

In this paper, the authors show that in the context of generalized linear models, such methods possess model selection consistency with oracle properties even for dimensionality of nonpolynomial ( NP ) order of sample size, for a class of penalized likelihood approaches using folded-concave penalty functions, which were introduced to ameliorate the bias problems of convex penalty functions.

Q: What is the definition of a coordinate subspace?

A subspace of is called coordinate subspace if it is spanned by a subset of the natural basis , where each is the -vector with th component 1 and 0 elsewhere.

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 8, AUGUST 2011 5467

Nonconcave Penalized Likelihood

With NP-Dimensionality

Jianqing Fan and Jinchi Lv

Abstract—Penalized likelihood methods are fundamental to

ultrahigh dimensional variable selection. How high dimension-

ality such methods can handle remains largely unknown. In this

paper, we show that in the context of generalized linear models,

such methods possess model selection consistency with oracle

properties even for dimensionality of nonpolynomial (NP) order

of sample size, for a class of penalized likelihood approaches

using folded-concave penalty functions, which were introduced to

ameliorate the bias problems of convex penalty functions. This

ﬁlls a long-standing gap in the literature where the dimensionality

is allowed to grow slowly with the sample size. Our results are also

applicable to penalized likelihood with the

-penalty, which is

a convex function at the boundary of the class of folded-concave

penalty functions under consideration. The coordinate opti-

mization is implemented for ﬁnding the solution paths, whose

performance is evaluated by a few simulation examples and the

real data analysis.

Index Terms—Coordinate optimization, folded-concave penalty,

high dimensionality, Lasso, nonconcave penalized likelihood, or-

acle property, SCAD, variable selection, weak oracle property.

I. INTRODUCTION

HE analysis of data sets with the number of variables

comparable to or much larger than the sample size fre-

quently arises nowadays in many ﬁelds ranging from genomics

and health sciences to economics and machine learning. The

data that we collect is usually of the type

where the

’s are independent observations of the response

variable

given its covariates, or explanatory variables,

. Generalized linear models (GLMs) provide a

ﬂexible parametric approach to estimating the covariate effects

(McCullagh and Nelder, 1989). In this paper we consider the

variable selection problem of nonpolynomial (NP) dimension-

ality in the context of GLMs. By NP-dimensionality we mean

that

for some . See Fan and Lv (2010)

Manuscript received January 13, 2010; revised February 23, 2011; accepted

March 02, 2011. Date of current version July 29, 2011. J. Fan was supported

in part by NSF Grants DMS-0704337 and DMS-0714554 and in part by NIH

Grant R01-GM072611 from the National Institute of General Medical Sciences.

J. Lv was supported in part by NSF CAREER Award DMS-0955316, in part by

NSF Grant DMS-0806030, and in part by the 2008 Zumberge Individual Award

from USC’s James H. Zumberge Faculty Research and Innovation Fund.

J. Fan is with the Department of Operations Research and Financial

Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail:

jqfan@princeton.edu).

J. Lv is with the Information and Operations Management Department, Mar-

shall School of Business, University of Southern California, Los Angeles, CA

90089 USA (e-mail: jinchilv@marshall.usc.edu).

Communicated by A. Krzyzak, Associate Editor for Pattern Recognition, Sta-

tistical Learning and Inference.

Color versions of one or more of the ﬁgures in this paper are available online

at http://ieeexplore.ieee.org.

Digital Object Identiﬁer 10.1109/TIT.2011.2158486

for an overview of recent developments in high dimensional

variable selection.

We denote by

the design matrix with

, and

the -dimensional response vector. Throughout the paper we

consider deterministic design matrix. With a canonical link, the

conditional distribution of

given belongs to the canonical

exponential family, having the following density function with

respect to some ﬁxed measure

(1)

where

is an unknown -dimensional vector

of regression coefﬁcients,

is a family of dis-

tributions in the regular exponential family with dispersion pa-

rameter

, and . As is common

in GLM, the function

is implicitly assumed to be twice con-

tinuously differentiable with

always positive. In the sparse

modeling, we assume that majority of the true regression coefﬁ-

cients

are exactly zero. Without loss of

generality, assume that

with each component

nonzero and . Hereafter we refer to the support

as the true underlying sparse model of

the indices. Variable selection aims at locating those predictors

with nonzero and giving an efﬁcient estimate of .

In view of (1), the log-likelihood

of the

sample is given, up to an afﬁne transformation, by

(2)

where

for .We

consider the following penalized likelihood

(3)

where

is a penalty function and is a regularization

parameter.

In a pioneering paper, Fan and Li (2001) build the theoret-

ical foundation of nonconcave penalized likelihood for vari-

able selection. The penalty functions that they used are not any

nonconvex functions, but really the folded-concave functions.

For this reason, we will call them more precisely folded-con-

cave penalties. The paper also introduces the oracle property for

model selection. An estimator

is said to have

the oracle property (Fan and Li, 2001) if it enjoys the model

selection consistency in the sense of

with probability

tending to 1 as

, and it attains an information bound

5468 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 8, AUGUST 2011

mimicking that of the oracle estimator, where is a subvector

formed by its ﬁrst components and the oracle knew the

true model

ahead of time. Fan and Li

(2001) study the oracle properties of nonconcave penalized like-

lihood estimators in the ﬁnite-dimensional setting. Their results

were extended later by Fan and Peng (2004) to the setting of

or in a general likelihood framework.

How large can the dimensionality

be, compared with

the sample size

, such that the oracle property continues to

hold in penalized likelihood estimation? What role does the

penalty function play? In this paper, we provide an answer to

these long-standing questions for a class of penalized likeli-

hood methods using folded-concave penalties in the context

of GLMs with NP-dimensionality. We also characterize the

nonasymptotic weak oracle property and the global optimality

of the nonconcave penalized maximum likelihood estimator.

Our theory applies to the

-penalty as well, but its conditions

are far more stringent than those for other members of the class.

These constitute the main theoretical contributions of the paper.

Numerous efforts have lately been devoted to studying the

properties of variable selection with ultrahigh dimensionality

and signiﬁcant progress has been made. Meinshausen and

Bühlmann (2006), Zhao and Yu (2006), and Zhang and Huang

(2008) investigate the issue of model selection consistency for

LASSO under different setups when the number of variables is

of a greater order than the sample size. Candes and Tao (2007)

introduce the Dantzig selector to handle the NP-dimensional

variable selection problem, which was shown to behave simi-

larly to Lasso by Bickel

et al. (2009). Zhang (2010) is among

the ﬁrst to study the nonconvex penalized least-squares esti-

mator with NP-dimensionality and demonstrates its advantages

over LASSO. He also developed the PLUS algorithm to ﬁnd the

solution path that has the desired sampling properties. Fan and

Lv (2008) and Huang et al. (2008) introduce the independence

screening procedure to reduce the dimensionality in the context

of least-squares. The former establishes the sure screening

property with NP-dimensionality and the latter also studies the

bridge regression, a folded-concave penalty approach. Hall and

Miller (2009) introduce feature ranking using a generalized cor-

relation, and Hall et al. (2009) propose independence screening

using tilting methods and empirical likelihood. Fan and Fan

(2008) investigate the impact of dimensionality on ultrahigh

dimensional classiﬁcation and establish an oracle property

for features annealed independence rules. Lv and Fan (2009)

make important connections between model selection and

sparse recovery using folded-concave penalties and establish

a nonasymptotic weak oracle property for the penalized least

squares estimator with NP-dimensionality. There are also a

number of important papers on establishing the oracle inequal-

ities for penalized empirical risk minimization. For example,

Bunea et al. (2007) establish sparsity oracle inequalities for the

Lasso under quadratic loss in the context of least-squares; van

de Geer (2008) obtains a nonasymptotic oracle inequality for

the empirical risk minimizer with the

-penalty in the context

of GLMs; Koltchinskii (2008) proves oracle inequalities for

penalized least squares with entropy penalization.

The rest of the paper is organized as follows. In Section II,

we discuss the choice of penalty functions and characterize the

nonconcave penalized likelihood estimator and its global opti-

mality. We study the nonasymptotic weak oracle properties and

oracle properties of nonconcave penalized likelihood estimator

in Sections III and IV, respectively. Section V introduces a co-

ordinate optimization algorithm, the iterative coordinate ascent

(ICA) algorithm, to solve regularization problems with concave

penalties. In Section VI, we present three numerical examples

using both simulated and real data sets. We provide some discus-

sions of our results and their implications in Section VII. Proofs

are presented in Section VIII. Technical details are relegated to

the Appendix.

II. N

ONCONCAVE

PENALIZED

LIKELIHOOD ESTIMATION

In this section, we discuss the choice of penalty functions in

regularization methods and characterize the nonconcave penal-

ized likelihood estimator as well as its global optimality.

A. Penalty Function

For any penalty function

, let .For

simplicity, we will drop its dependence on

and write as

when there is no confusion. Many penalty functions have

been proposed in the literature for regularization. For example,

the best subset selection amounts to using the

penalty. The

ridge regression uses the

penalty. The penalty

for bridges these two cases (Frank and Friedman,

1993). Breiman (1995) introduces the non-negative garrote for

shrinkage estimation and variable selection. Lasso (Tibshirani,

1996) uses the

-penalized least squares. The SCAD penalty

(Fan, 1997; Fan and Li, 2001) is the function whose derivative

is given by

(4)

where often is used, and MCP (Zhang, 2010) is deﬁned

through

. Clearly the SCAD penalty takes

off at the origin as the

penalty and then levels off, and MCP

translates the ﬂat part of the derivative of SCAD to the origin.

A family of folded concave penalties that bridge the

and

penalties were studied by Lv and Fan (2009).

Hereafter we consider penalty functions

that satisfy the

following condition:

Condition 1:

is increasing and concave in ,

and has a continuous derivative

with .In

addition,

is increasing in and is

independent of

The above class of penalty functions has been considered by

Lv and Fan (2009). Clearly the

penalty is a convex function

that falls at the boundary of the class of penalty functions satis-

fying Condition 1. Fan and Li (2001) advocate penalty functions

that give estimators with three desired properties: unbiasedness,

sparsity and continuity, and provide insights into them (see also

Antoniadis and Fan, 2001). SCAD satisﬁes Condition 1 and the

above three properties simultaneously. The

penalty and MCP

also satisfy Condition 1, but

does not enjoy the unbiased-

ness due to its constant rate of penalty and MCP violates the

continuity property. However, our results are applicable to the

-penalized and MCP regression. Condition 1 is needed for

FAN AND LV: NONCONCAVE PENALIZED LIKELIHOOD WITH NP-DIMENSIONALITY 5469

establishing the oracle properties of nonconcave penalized like-

lihood estimator.

B. Nonconcave Penalized Likelihood Estimator

It is generally difﬁcult to study the global maximizer of

the penalized likelihood analytically without concavity. As

is common in the literature, we study the behavior of local

maximizers.

We introduce some notation to simplify our presentation. For

any

, deﬁne

and

(5)

It is known that the

-dimensional response vector following

the distribution in (1) has mean vector

and covariance ma-

trix

, where . Let , and

, , where de-

notes the sign function. We denote by

the norm of a

vector or matrix for

. Following Lv and Fan (2009)

and Zhang (2010), deﬁne the local concavity of the penalty

with as

(6)

By the concavity of

in Condition 1, we have .

It is easy to show by the mean-value theorem that

provided that the second derivative of

is continuous. For the SCAD penalty, unless some

component of

takes values in . In the latter case,

Throughout the paper, we use

and to repre-

sent the smallest and largest eigenvalues of a symmetric matrix,

respectively.

The following theorem gives a sufﬁcient condition on the

strict local maximizer of the penalized likelihood

in (3).

Theorem 1 (Characterization of PMLE): Assume that

sat-

isﬁes Condition 1. Then

is a strict local maximizer of

the nonconcave penalized likelihood

deﬁned by (3) if

(7)

(8)

(9)

where

and respectively denote the submatrices of

formed by columns in and its complement, ,

is a subvector of formed by all nonzero components, and

. On the other hand, if is a local

maximizer of

, then it must satisfy (7) – (9) with strict

inequalities replaced by nonstrict inequalities.

There is only a tiny gap (nonstrict versus strict inequalities)

between the necessary condition for local maximizer and sufﬁ-

cient condition for strict local maximizer. Conditions (7) and (9)

ensure that

is a strict local maximizer of (3) when constrained

on the

-dimensional subspace of ,

where

denotes the subvector of formed by components in

the complement of

. Condition (8) makes sure that the

sparse vector

is indeed a strict local maximizer of (3) on the

whole space

When

is the penalty, the penalized likelihood function

in (3) is concave in . Then the classical convex op-

timization theory applies to show that

a global maximizer if and only if there exists a subgradient

such that

(10)

that is, it satisﬁes the Karush-Kuhn-Tucker (KKT) condi-

tions, where the subdifferential of the

penalty is given

for and . Thus, condition

(10) reduces to (7) and (8) with strict inequality replaced by

nonstrict inequality. Since

for the -penalty,

condition (9) holds provided that

is nonsingular.

However, to ensure that

is the strict maximizer we need the

strict inequality in (8).

C. Global Optimality

A natural question is when the nonconcave penalized max-

imum likelihood estimator (NCPMLE)

is a global maximizer

of the penalized likelihood

. We characterize such a prop-

erty from two perspectives.

1) Global Optimality: Assume that the

design matrix

has a full column rank . This implies that . Since

is always positive, it is easy to show that the Hessian matrix of

is always positive deﬁnite, which entails that the log-

likelihood function

is strictly concave in . Thus, there

exists a unique maximizer

of . Let

be a sublevel set of for some and

be the maximum concavity of the penalty function . For the

penalty, SCAD and MCP, we have , , and

, respectively. The following proposition gives a sufﬁcient

condition on the global optimality of NCPMLE.

Proposition 1 (Global Optimality): Assume that

has rank

and satisﬁes

(11)

Then the NCPMLE

is a global maximizer of the penalized

likelihood

if .

Note that for penalized least-squares, (11) reduces to

(12)

This condition holds for sufﬁciently large

in SCAD and MCP,

when the correlation between covariates is not too strong. The

latter holds for design matrices constructed by using spline

bases to approximate a nonparametric function. According

to Proposition 1, under (12), the penalized least-squares with

folded-concave penalty is a global minimum.

The proposition below gives a condition under which the

penalty term in (3) does not change the global maximizer. It

5470 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 8, AUGUST 2011

will be used to derive the condition under which the PMLE is

the same as the oracle estimator in Proposition 3(b). Here for

simplicity we consider the SCAD penalty

given by (4), and

the technical arguments are applicable to other folded-concave

penalties as well.

Proposition 2 (Robustness): Assume that

has rank

with and there exists some such that

for some . Then

the SCAD penalized likelihood estimator

is the global maxi-

mizer and equals

if and ,

where

2) Restricted Global Optimality: When

, it is hard

to show the global optimality of a local maximizer. However,

we can study the global optimality of the NCPMLE

on the

union of coordinate subspaces. A subspace of

is called co-

ordinate subspace if it is spanned by a subset of the natural basis

, where each is the -vector with th component

1 and 0 elsewhere. Here each

corresponds to the th predictor

. We will investigate the global optimality of on the union

of all -dimensional coordinate subspaces of in Proposi-

tion 3(a).

Of particular interest is to derive the conditions under which

the PMLE is also an oracle estimator, in addition to possessing

the above restricted global optimal estimator on

. To this

end, we introduce an identiﬁability condition on the true model

. The true model is called -identiﬁable for some

(13)

where

. In other words, is the best subset of size

, with a margin at least . The following proposition is an easy

consequence of Propositions 1 and 2.

Proposition 3 (Global Optimality on

a) If the conditions of Proposition 1 are satisﬁed for each

submatrix of , then the NCPMLE is a global

maximizer of

on .

b) Assume that the conditions of Proposition 2 are satis-

ﬁed for the

submatrix of formed by columns

, the true model is -identiﬁable for some

, and . Then the

SCAD penalized likelihood estimator

is the global max-

imizer on

and equals to the oracle maximum likelihood

estimator

On the event that the PMLE estimator is the same as the oracle

estimator, it possesses of course the oracle property.

III. N

ONASYMPTOTIC WEAK ORACLE PROPERTIES

In this section, we study a nonasymptotic property of the non-

concave penalized likelihood estimator

, called the weak or-

acle property introduced by Lv and Fan (2009) in the setting

of penalized least squares. The weak oracle property means

sparsity in the sense of

with probability tending to 1

, and consistency under the loss, where

and is a subvector of formed by components in

. This property is weaker than the oracle

property introduced by Fan and Li (2001).

A. Regularity Conditions

As mentioned before, we condition on the design matrix

and use the penalty in the class satisfying Condition 1. Let

and respectively be the submatrices of the design

matrix

formed by columns in and

its complement, and

. To simplify the presentation,

we assume without loss of generality that each covariate

has

been standardized so that

. If the covariates have

not been standardized, the results still hold with

assumed

to be in the order of

. Let

(14)

be half of the minimum signal. We make the following assump-

tions on the design matrix and the distribution of the response.

Let

be a diverging sequence of positive numbers that de-

pends on the nonsparsity size

and hence depends on . Recall

that

is the nonvanishing components of the true parameter

Condition 2: The design matrix

satisﬁes

(15)

(16)

(17)

where the

norm of a matrix is the maximum of the norm

of each row,

, ,

, the derivative is taken componentwise, and

denotes the Hadamard (componentwise) product.

Here and below,

is associated with regularization param-

eter

satisfying (18) unless speciﬁed otherwise. For the clas-

sical Gaussian linear regression model, we have

and

. In this case, since we will assume that , condi-

tion (15) usually holds with

. In fact, Wainwright (2009)

shows that

if the rows of are

i.i.d. Gaussian vectors with

.In

general, since

we can take if . More

generally, (15) can be bounded as

and the above remark for the multiple regression model applies

to the submatrix

, which consists of rows of the samples

with

for some .

The left hand side of (16) is the multiple regression co-

efﬁcients of each unimportant variable in

on , using

the weighted least squares with weights

. The order

FAN AND LV: NONCONCAVE PENALIZED LIKELIHOOD WITH NP-DIMENSIONALITY 5471

is mainly technical and can be relaxed, whereas

the condition

is genuine. When the

penalty is used, the upper bound in (16) is more restric-

tive, requiring uniformly less than 1. This condition is the

same as the strong irrepresentable condition of Zhao and Yu

(2006) for the consistency of the LASSO estimator, namely

. It is a drawback of the

penalty. In constrast, when a folded-concave penalty is used,

the upper bound on the right hand side of (16) can grow to

at rate .

Condition (16) controls the uniform growth rate of the

-norm of these multiple regression coefﬁcients, a notion of

weak correlation between

and . If each element of the

multiple regression coefﬁcients is of order

, then the

norm is of order . Hence, we can handle the nonsparse

dimensionality

, by (16), as long as the ﬁrst term in

(16) dominates, which occurs for SCAD type of penalty with

. Of course, the actual dimensionality can be higher

or lower, depending on the correlation between

and ,but

for ﬁnite nonsparse dimensionality

, (16) is usually

satisﬁed.

For the Gaussian linear regression model, condition (17)

holds automatically.

We now choose the regularization parameter

and intro-

duce Condition 3. We will assume that half of the minimum

signal

for some . Take satis-

fying

and (18)

where

and is associated with the

nonsparsity size

Condition 3: Assume that

and

In addition, assume that

satisﬁes (18) and

, where and

, and that

if the responses are

unbounded.

The condition

is needed to ensure condition

(9). The condition always holds when

and is satisﬁed

for the SCAD type of penalty when

In view of (7) and (8), to study the nonconcave penalized like-

lihood estimator

we need to analyze the deviation of the -di-

mensional random vector

from its mean , where

denotes the -dimensional random re-

sponse vector in the GLM (1). The following proposition, whose

proof is given in Section VIII.E, characterizes such deviation

for the case of bounded responses and the case of unbounded

responses satisfying a moment condition, respectively.

Proposition 4 (Deviation): Let

be the

-dimensional independent random response vector and

. Then

a) If

are bounded in for some ,

then for any

(19)

b) If

are unbounded and there exist some

such that

(20)

with

, then for any

(21)

In light of (1), it is known that for the exponential family, the

moment-generating function of

is given by

where is in the domain of . Thus, the moment

condition (20) is reasonable. It is easy to show that condition

(20) holds for the Gaussian linear regression model and for the

Poisson regression model with bounded mean responses. Sim-

ilar probability bounds also hold for sub-Gaussian errors.

We now express the results in Proposition 4 in a uniﬁed form.

For the case of bounded responses, we deﬁne

for , where . For the case of un-

bounded responses satisfying the moment condition (20), we

deﬁne

, where . Then the

exponential bounds in (19) and (21) can be expressed as

(22)

where

if the responses are bounded and

if the responses are unbounded.

B. Weak Oracle Properties

Theorem 2 (Weak Oracle Property): Assume that Conditions

1 – 3 and the probability bound (22) are satisﬁed,

, and

. Then there exists a nonconcave penalized

likelihood estimator

such that for sufﬁciently large , with

probability at least

satisﬁes:

a) (Sparsity).

;

b) (

loss). ,

where

and are respectively the subvectors of and

formed by components in .

Under the given regularity conditions, the dimensionality

allowed to grow up to exponentially fast with the sample size

The growth rate of

is controlled by . It also enters

the nonasymptotic probability bound. This probability tends to

1 under our technical assumptions. From the proof of Theorem

2, we see that with asymptotic probability one, the

estima-

tion loss of the nonconcave penalized likelihood estimator

bounded from above by three terms (see (45)), where the second

term

is associated with

the penalty function

. For the penalty, the ratio

is equal to one, and for other concave penalties,

it can be (much) smaller than one. This is in line with the

Nonconcave Penalized Likelihood With NP-Dimensionality

Figures

Citations

Regularization Methods for High-Dimensional Instrumental Variables Regression With an Application to Genetical Genomics

Oracle model selection for nonlinear models based on weighted composite quantile regression

Variable Selection With Prior Information for Generalized Linear Models via the Prior LASSO Method

Nonconcave penalized composite conditional likelihood estimation of sparse Ising models

On Time Varying Undirected Graphs

References

Regression Shrinkage and Selection via the Lasso

Generalized Linear Models

Generalized Linear Models

Generalized Linear Models

Probability Inequalities for sums of Bounded Random Variables

Related Papers (5)

Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties

Regression Shrinkage and Selection via the Lasso

The adaptive lasso and its oracle properties

On Model Selection Consistency of Lasso

Regularization and variable selection via the elastic net

Frequently Asked Questions (10)

Q1. What are the contributions mentioned in the paper "Nonconcave penalized likelihood with np-dimensionality" ?

Q2. What is the condition for the local maximizer of the nonconcave penalized likelihood?

Q3. What is the dimensionality of the penalized least squares?

Q4. What is the condition of the Gaussian linear regression model?

Q5. What is the definition of a coordinate subspace?

Q6. what is the maxi-mizer of the penalized likelihood?

Q7. What is the simplest way to show that the second derivative of the penalty function does not exist?

Q8. What is the concavity of the convex set?

Q9. What is the second order approximation in ICA?

Q10. Why do the authors examine the implications of Theorem 2?