
Nonconcave Penalized Likelihood With NP-Dimensionality

01 Aug 2011 - IEEE Transactions on Information Theory (NIH Public Access) - Vol. 57, Iss. 8, pp. 5467-5484
TL;DR: In the context of generalized linear models, penalized likelihood methods based on folded-concave penalty functions, which were introduced to ameliorate the bias problems of convex penalty functions, are shown to possess model selection consistency with oracle properties even when the dimensionality is of nonpolynomial order of the sample size.
Abstract: Penalized likelihood methods are fundamental to ultrahigh dimensional variable selection. How high dimensionality such methods can handle remains largely unknown. In this paper, we show that in the context of generalized linear models, such methods possess model selection consistency with oracle properties even for dimensionality of nonpolynomial (NP) order of sample size, for a class of penalized likelihood approaches using folded-concave penalty functions, which were introduced to ameliorate the bias problems of convex penalty functions. This fills a long-standing gap in the literature where the dimensionality is allowed to grow slowly with the sample size. Our results are also applicable to penalized likelihood with the L1-penalty, which is a convex function at the boundary of the class of folded-concave penalty functions under consideration. The coordinate optimization is implemented for finding the solution paths, whose performance is evaluated by a few simulation examples and the real data analysis.

Summary (5 min read)

Introduction

  • The penalty functions used are not arbitrary nonconvex functions, but folded-concave functions.
  • These constitute the main theoretical contributions of the paper.

A. Penalty Function

  • For any penalty function, the authors work with its rescaled version ρ(t; λ); for simplicity, they drop its dependence on λ and write ρ(t) when there is no confusion.
  • Many penalty functions have been proposed in the literature for regularization.
  • The Lq penalty, 0 < q < 2, bridges the extremes of best subset selection (L0) and ridge regression (L2) (Frank and Friedman, 1993).
  • Hereafter the authors consider penalty functions that satisfy the following condition. Condition 1: ρ(t; λ) is increasing and concave in t ∈ [0, ∞) and has a continuous derivative ρ'(t; λ) with ρ'(0+; λ) > 0.
  • Clearly the L1 penalty is a convex function that falls at the boundary of the class of penalty functions satisfying Condition 1. Fan and Li (2001) advocate penalty functions that give estimators with three desired properties, unbiasedness, sparsity, and continuity, and provide insights into them (see also Antoniadis and Fan, 2001). Two folded-concave members of this class, SCAD and MCP, are sketched numerically right after this list.
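
The SCAD and MCP penalties mentioned above have simple closed-form derivatives (SCAD is given in (4) of the full text below; MCP's derivative is (aλ − t)₊/a). The following minimal NumPy sketch, not taken from the authors' code, evaluates both; the default a = 3.7 for SCAD is the value the paper and the literature commonly use.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """Derivative p'_lambda(t) of the SCAD penalty for t >= 0 (Fan and Li, 2001):
    equal to lam on [0, lam], linearly decreasing on (lam, a*lam], zero beyond a*lam."""
    t = np.asarray(t, dtype=float)
    return lam * (t <= lam) + np.maximum(a * lam - t, 0.0) / (a - 1.0) * (t > lam)

def mcp_deriv(t, lam, a=2.0):
    """Derivative of the MCP penalty (Zhang, 2010): (a*lam - t)_+ / a for t >= 0."""
    t = np.asarray(t, dtype=float)
    return np.maximum(a * lam - t, 0.0) / a

if __name__ == "__main__":
    grid = np.linspace(0.0, 5.0, 6)
    print(scad_deriv(grid, lam=1.0))  # takes off like the L1 penalty near zero, then levels off to 0
    print(mcp_deriv(grid, lam=1.0))   # decreases linearly from lam at the origin to 0 at a*lam
```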

C. Global Optimality

  • A natural question is when the nonconcave penalized maximum likelihood estimator is a global maximizer of the penalized likelihood.
  • Since b'' is always positive, it is easy to show that the Hessian matrix of the negative log-likelihood is positive definite whenever the design matrix has full column rank, which entails that the log-likelihood function is strictly concave.
  • The proposition below gives a condition under which the penalty term in (3) does not change the global maximizer.
  • Of particular interest is to derive the conditions under which the PMLE is also an oracle estimator, in addition to being a restricted global maximizer on the union of coordinate subspaces.
  • Assume that the conditions of Proposition 2 are satisfied for the submatrix of the design matrix formed by the columns in the true model, and that the true model is δ-identifiable with a sufficiently large margin δ.

III. NONASYMPTOTIC WEAK ORACLE PROPERTIES

  • The authors study a nonasymptotic property of the nonconcave penalized likelihood estimator, called the weak oracle property, introduced by Lv and Fan (2009) in the setting of penalized least squares.
  • The weak oracle property means sparsity, in the sense that all components outside the true support are estimated as exactly zero with probability tending to 1 as n → ∞, together with consistency under the L∞ loss for the subvector formed by the components in the true support (written out after this list).
  • This property is weaker than the oracle property introduced by Fan and Li (2001).
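
Written out, with supp(β₀) denoting the true support, the two requirements above read as follows (a paraphrase; the explicit rate γ_n is fixed by the theorem's conditions and is left abstract here).

```latex
% Weak oracle property (Lv and Fan, 2009): sparsity plus L_infinity consistency.
\[
\Pr\bigl(\hat\beta_j = 0 \ \text{for all } j \notin \mathrm{supp}(\beta_0)\bigr) \longrightarrow 1
\quad (n \to \infty),
\qquad
\bigl\|\hat\beta_{\mathrm{supp}} - \beta_{0,\mathrm{supp}}\bigr\|_\infty \le \gamma_n \to 0 .
\]
```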

A. Regularity Conditions

  • As mentioned before, the authors condition on the design matrix and use a penalty in the class satisfying Condition 1.
  • To simplify the presentation, the authors assume without loss of generality that each covariate has been standardized so that its L2-norm equals n^{1/2}.
  • Under the assumed bound on the sparsity level relative to the sample size, condition (15) usually holds.
  • For the Gaussian linear regression model, condition (17) holds automatically.
  • For the case of unbounded responses satisfying the moment condition (20), the authors define an analogous rate quantity to cover that case.

B. Weak Oracle Properties

  • Theorem 2 (Weak Oracle Property): Assume that Conditions 1-3 and the probability bound (22) are satisfied, together with the stated growth conditions on the dimensionality and the minimum signal.
  • Then there exists a nonconcave penalized likelihood estimator such that, for sufficiently large n and with probability at least the stated nonasymptotic bound, it satisfies: a) sparsity, with all components outside the true support equal to zero; and b) L∞-loss consistency, where the loss is measured between the subvectors of the estimator and of the true coefficient vector formed by the components in the true support.
  • The quantity that controls the growth rate of the dimensionality also enters the nonasymptotic probability bound.
  • For concave penalties this quantity can be taken much larger.
  • A large value, however, puts a more stringent condition on the design matrix.

C. Sampling Properties of L1-Based PMLE

  • When the L1-penalty is applied, the penalized likelihood in (3) is concave.
  • The local maximizer in Theorems 1 and 2 then becomes the global maximizer.
  • Due to its popularity, the authors now examine the implications of Theorem 2 in the context of L1-penalized least squares and L1-penalized likelihood.
  • As a corollary of Theorem 2, the authors have Corollary 1 (L1-Penalized Estimator): under Conditions 2 and 3 and the probability bound (22), and under the stated rate conditions, the L1-penalized likelihood estimator has model selection consistency at the stated rate.
  • For L1-penalized least squares, Corollary 1 continues to hold without the normality assumption, as long as the probability bound (22) holds.

IV. ORACLE PROPERTIES

  • The authors study the oracle property (Fan and Li, 2001) of the nonconcave penalized likelihood estimator.
  • Thus, Condition 5 is less restrictive for SCAD-like penalties, since their penalty derivative vanishes for sufficiently large arguments.
  • Theorem 3 can be viewed as answering, through Conditions 4 and 5, the question of how strong the minimum signal needs to be, for a given dimensionality, for the penalized likelihood estimator to enjoy these properties.
  • Specifically, for each coordinate within each iteration, ICA uses the second-order approximation of the log-likelihood at the p-vector from the previous step along that coordinate and maximizes the univariate penalized quadratic approximation (a sketch of one such coordinate update follows this list).
  • When this penalty is used, it is known that an appropriate choice of its tuning parameter ensures that the resulting maximizer is the global maximizer of (3).
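
The coordinate update just described has a particularly simple form for penalized least squares, where the loss is exactly quadratic along each coordinate. The sketch below is a schematic illustration of one such update with the SCAD penalty, using a crude grid search for the univariate maximization rather than the closed-form thresholding rule; it is not the authors' ICA implementation.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty value p_lambda(t) for t >= 0 (Fan and Li, 2001)."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= lam, lam * t,
                    np.where(t <= a * lam,
                             (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
                             (a + 1) * lam ** 2 / 2))

def ica_coordinate_update(X, y, beta, j, lam):
    """One coordinate step for SCAD-penalized least squares: maximize the exact
    univariate penalized quadratic along coordinate j by a simple grid search."""
    n = len(y)
    xj = X[:, j]
    r = y - X @ beta + xj * beta[j]      # partial residual excluding coordinate j
    z = xj @ r / (xj @ xj)               # unpenalized optimum along coordinate j
    c = xj @ xj / n                      # curvature of the averaged squared-error loss
    grid = np.linspace(0.0, z, 401)      # the penalized optimum shrinks z toward zero
    obj = -0.5 * c * (grid - z) ** 2 - scad_penalty(grid, lam)
    beta = beta.copy()
    beta[j] = grid[np.argmax(obj)]
    return beta
```

Cycling this update over all coordinates until convergence produces one point of the solution path for a given regularization parameter.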

A. Logistic Regression

  • The authors demonstrate the performance of nonconcave penalized likelihood methods in logistic regression.
  • Thus, the authors used five-fold cross-validation (CV) based on prediction error to select the tuning parameter (an illustrative CV sketch follows this list).
  • Table II and Fig. 2 report the comparison results in terms of PE, L1 loss, L2 loss, deviance, #S, and FN.
  • It is clear from Table II that LASSO selects a far larger model size than SCAD and MCP.
  • Since the coefficients of the sixth through tenth covariates are significantly smaller than other nonzero coefficients and the covariates are independent, the distribution of the response can be well approximated by the sparse model with the five small nonzero coefficients set to be zero.
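
SCAD and MCP solvers are not part of scikit-learn, so the sketch below illustrates only the tuning step mentioned above: five-fold cross-validation of the regularization parameter by prediction error, for an L1-penalized logistic fit on synthetic data. All names and settings are illustrative, not the authors' simulation design.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [2.0, -1.5, 1.0, -1.0, 0.8]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

lambdas = np.logspace(-2, 0, 10)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_error = []
for lam in lambdas:
    fold_errors = []
    for tr, te in cv.split(X, y):
        # scikit-learn's C plays the role of 1 / (n * lambda) in the L1 logistic objective
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / (len(tr) * lam))
        clf.fit(X[tr], y[tr])
        fold_errors.append(np.mean(clf.predict(X[te]) != y[te]))  # prediction error on the fold
    cv_error.append(np.mean(fold_errors))

print("lambda selected by five-fold CV:", lambdas[int(np.argmin(cv_error))])
```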

B. Poisson Regression

  • The authors demonstrate the performance of nonconcave penalized likelihood methods in Poisson regression.
  • The authors set the simulation design, with dimensionality as large as 1000, and chose the true regression coefficient vector by fixing its nonzero components. Table V reports the medians and robust standard deviations (in parentheses) of PE, L1 loss, L2 loss, deviance, #S, and FN over 100 simulations for all methods in Poisson regression, with tuning by BIC and by CV.
  • Lasso, SCAD, and MCP were compared over 100 simulations.
  • The BIC and five-fold CV were used to select the regularization parameter (a BIC computation sketch follows this list).
  • Table VI presents the comparison results in terms of PE, L1 loss, L2 loss, deviance, #S, and FN.
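
The BIC criterion used for tuning compares candidate fits through -2 log-likelihood plus a log(n) penalty on the model size. The sketch below computes such a BIC value for Poisson submodels refitted with statsmodels; it illustrates the computation only, on a hypothetical design rather than the paper's simulation settings.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 200, 8
X = rng.standard_normal((n, p))
beta_true = np.array([0.5, -0.5, 0.3] + [0.0] * (p - 3))
y = rng.poisson(np.exp(X @ beta_true))

def bic_of_support(support):
    """BIC = -2 * loglik + df * log(n) for a Poisson GLM refitted on a candidate support."""
    Xs = sm.add_constant(X[:, list(support)])
    res = sm.GLM(y, Xs, family=sm.families.Poisson()).fit()
    return -2.0 * res.llf + Xs.shape[1] * np.log(n)

print(bic_of_support([0, 1, 2]))        # true support
print(bic_of_support([0, 1, 2, 3, 4]))  # an overfitted candidate, typically larger BIC
```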

C. Real Data Analysis

  • The authors apply nonconcave penalized likelihood methods to the neuroblastoma data set, which was studied by Oberthuer et al. (2006).
  • The patients at diagnosis were aged from 0 to 296 months with a median age of 15 months.
  • The study aimed to develop a gene expression-based classifier for neuroblastoma patients that can reliably predict courses of the disease.
  • The authors applied Lasso, SCAD and MCP using the logistic regression model.
  • For the 3-year EFS classification, the authors randomly selected 125 subjects (25 positives and 100 negatives) as the training set and the rest as the test set.
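
The split described in the last bullet fixes the numbers of positive and negative subjects in the training set. A plain NumPy sketch of such a stratified random split is given below; the label vector is hypothetical.

```python
import numpy as np

def stratified_split(labels, n_pos_train=25, n_neg_train=100, seed=0):
    """Randomly choose a fixed number of positive and negative subjects for training;
    all remaining subjects form the test set."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    train = np.concatenate([rng.choice(pos, n_pos_train, replace=False),
                            rng.choice(neg, n_neg_train, replace=False)])
    test = np.setdiff1d(np.arange(len(labels)), train)
    return train, test

labels = np.r_[np.ones(60, dtype=int), np.zeros(190, dtype=int)]  # hypothetical outcome labels
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 125 training subjects, the rest form the test set
```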

VII. DISCUSSIONS

  • The authors have studied penalized likelihood methods for ultrahigh dimensional variable selection.
  • In the context of GLMs, the authors have shown that such methods have model selection consistency with oracle properties even for NP-dimensionality, for a class of nonconcave penalized likelihood approaches.
  • The authors' results are consistent with a known fact in the literature that concave penalties can reduce the bias problems of convex penalties.
  • The authors have exploited the coordinate optimization with the ICA algorithm to find the solution paths and illustrated the performance of nonconcave penalized likelihood methods with numerical studies.
  • The authors' results show that coordinate optimization works equally well and efficiently for producing the entire solution paths for concave penalties.

A. Proof of Theorem 1

  • The authors first derive the necessary condition.
  • It follows from classical optimization theory that if a vector is a local maximizer of the penalized likelihood (3), then it satisfies the Karush-Kuhn-Tucker (KKT) conditions; that is, there exists a subgradient vector such that (31) holds.
  • The maximizer is then projected onto the relevant coordinate subspace.
  • Note that the components of the projection are zero for indices outside the support, and the signs of its remaining components agree with those of the corresponding components of the original vector.
  • By condition (8) and continuity of the quantities involved, there exists a ball centered at the projection with a small radius on which (37) holds.

B. Proof of Proposition 1

  • By the concavity of the penalty, the authors show that the relevant sublevel set is a closed convex set, with the two points of interest as interior points and the corresponding level set as its boundary.
  • The authors then show that the global maximizer of the penalized likelihood belongs to this set.
  • This follows easily from the definitions of the quantities involved.

C. Proof of Proposition 2

  • From the proof of Proposition 1, the authors know that the global maximizer of the penalized likelihood belongs to the sublevel set under consideration.
  • Note that, by assumption, the SCAD penalized likelihood estimator satisfies the corresponding constraints as well.
  • The key idea is to use a first-order Taylor expansion of the log-likelihood around the unpenalized maximum likelihood estimator and retain the Lagrange remainder term.
  • This can easily be shown from the analytical solution to (38).
  • Thus, it suffices to prove the claim on the interval in question.

D. Proof of Proposition 3

  • Let S be any coordinate subspace of the given dimension different from the one spanned by the true support.
  • Clearly, the span of S together with the true-support subspace is again a coordinate subspace of larger, but still controlled, dimension.
  • Then part a) follows easily from the assumptions and Proposition 1.
  • Part b) is an easy consequence of Proposition 2, in view of the assumptions and the corresponding property of the SCAD penalty given by (4).

E. Proof of Proposition 4

  • Part a) follows easily from a simple application of Hoeffding's inequality (Hoeffding, 1963), since a_1 Y_1, ..., a_n Y_n are independent bounded random variables, where a = (a_1, ..., a_n)^T.
  • In view of condition (20), the centered summands are independent random variables with mean zero and satisfy the required moment bounds; thus, an application of Bernstein's inequality (see, e.g., Bennett, 1962, or van der Vaart and Wellner, 1996) yields the stated bound, which concludes the proof.
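
For reference, the two classical tail bounds invoked in this proof take the following standard forms (stated generically, not in the paper's notation).

```latex
% Hoeffding (1963): independent summands with a_i Y_i \in [l_i, u_i] and S = \sum_i a_i Y_i.
\[
\Pr\bigl(|S - \mathbb{E}S| \ge \varepsilon\bigr)
  \le 2\exp\!\Bigl(-\frac{2\varepsilon^2}{\sum_{i=1}^n (u_i - l_i)^2}\Bigr).
\]
% Bernstein's inequality under a moment condition (van der Vaart and Wellner, 1996):
% independent mean-zero Z_i with E|Z_i|^m \le (m!/2) v_i M^{m-2} for all m \ge 2, and V = \sum_i v_i.
\[
\Pr\Bigl(\bigl|\textstyle\sum_{i=1}^n Z_i\bigr| \ge \varepsilon\Bigr)
  \le 2\exp\!\Bigl(-\frac{\varepsilon^2}{2(V + M\varepsilon)}\Bigr).
\]
```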

F. Proof of Theorem 2

  • The authors break the whole proof into several steps.
  • It follows from Bonferroni's inequality and (22) that the bound (39) holds, where the additional requirement for unbounded responses is guaranteed for sufficiently large n by Condition 3.
  • To this end, the authors use a componentwise second-order Taylor expansion, with the Lagrange remainder term, around the reference point and obtain (42), where the remainder of each component is evaluated at some p-vector lying on the line segment joining the two points.
  • Thus, the authors have shown that (7) indeed has a solution in the prescribed neighborhood.
  • It remains to bound the second term of (48).

G. Proof of Theorem 3

  • To prove the conclusions, it suffices to show that under the given regularity conditions there exists a strict local maximizer of the penalized likelihood in (3) such that 1) its components outside the true support are zero with probability tending to 1 as n → ∞ (sparsity), and 2) it is consistent at the stated rate.
  • Step 1: consistency in the lower-dimensional subspace spanned by the true support.
  • The authors now show that there exists a strict local maximizer of the restricted problem lying in a shrinking neighborhood of the true coefficient subvector.
  • To this end, the authors define an event in terms of the boundary of a closed neighborhood of the true coefficient subvector.
  • Then, for sufficiently large n, by (26) in Conditions 4 and 5 and by (53) combined with Markov's inequality, the probability of the complementary event vanishes; the final step uses Conditions 4 and 5 together with a monotonicity argument.

H. Proof of Theorem 4

  • On the event defined in the proof of Theorem 3, it has been shown that the estimator is a strict local maximizer with the claimed sparsity and consistency.
  • This, along with the first part of (26) in Condition 4, entails (57), where the small-order term is understood under the appropriate norm.
  • The authors are now ready to show the asymptotic normality of the estimator on the true support.
  • The argument is then specialized to concrete cases, including the logistic regression model and the Poisson regression model.
  • In the Gaussian linear regression model, maximizing the penalized likelihood becomes the penalized least-squares problem.


Nonconcave Penalized Likelihood
With NP-Dimensionality
Jianqing Fan and Jinchi Lv
Abstract—Penalized likelihood methods are fundamental to
ultrahigh dimensional variable selection. How high dimension-
ality such methods can handle remains largely unknown. In this
paper, we show that in the context of generalized linear models,
such methods possess model selection consistency with oracle
properties even for dimensionality of nonpolynomial (NP) order
of sample size, for a class of penalized likelihood approaches
using folded-concave penalty functions, which were introduced to
ameliorate the bias problems of convex penalty functions. This
fills a long-standing gap in the literature where the dimensionality
is allowed to grow slowly with the sample size. Our results are also
applicable to penalized likelihood with the
L1-penalty, which is
a convex function at the boundary of the class of folded-concave
penalty functions under consideration. The coordinate opti-
mization is implemented for finding the solution paths, whose
performance is evaluated by a few simulation examples and the
real data analysis.
Index Terms—Coordinate optimization, folded-concave penalty,
high dimensionality, Lasso, nonconcave penalized likelihood, or-
acle property, SCAD, variable selection, weak oracle property.
I. INTRODUCTION
The analysis of data sets with the number of variables
comparable to or much larger than the sample size fre-
quently arises nowadays in many fields ranging from genomics
and health sciences to economics and machine learning. The
data that we collect are usually of the form (x_i, y_i), i = 1, ..., n,
where the y_i's are independent observations of the response
variable given its covariates, or explanatory variables,
x_i. Generalized linear models (GLMs) provide a
flexible parametric approach to estimating the covariate effects
(McCullagh and Nelder, 1989). In this paper we consider the
variable selection problem of nonpolynomial (NP) dimension-
ality in the context of GLMs. By NP-dimensionality we mean
that log p_n = O(n^a) for some a > 0. See Fan and Lv (2010)
Manuscript received January 13, 2010; revised February 23, 2011; accepted
March 02, 2011. Date of current version July 29, 2011. J. Fan was supported
in part by NSF Grants DMS-0704337 and DMS-0714554 and in part by NIH
Grant R01-GM072611 from the National Institute of General Medical Sciences.
J. Lv was supported in part by NSF CAREER Award DMS-0955316, in part by
NSF Grant DMS-0806030, and in part by the 2008 Zumberge Individual Award
from USC’s James H. Zumberge Faculty Research and Innovation Fund.
J. Fan is with the Department of Operations Research and Financial
Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail:
jqfan@princeton.edu).
J. Lv is with the Information and Operations Management Department, Mar-
shall School of Business, University of Southern California, Los Angeles, CA
90089 USA (e-mail: jinchilv@marshall.usc.edu).
Communicated by A. Krzyzak, Associate Editor for Pattern Recognition, Sta-
tistical Learning and Inference.
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIT.2011.2158486
for an overview of recent developments in high dimensional
variable selection.
We denote by X the n × p design matrix with rows given by the covariate vectors, and by y the n-dimensional response vector. Throughout the paper we consider a deterministic design matrix. With a canonical link, the conditional distribution of y given X belongs to the canonical exponential family, having density function (1) with respect to some fixed measure, where β = (β_1, ..., β_p)^T is an unknown p-dimensional vector of regression coefficients, the response distributions form a family in the regular exponential family with dispersion parameter φ, and the canonical parameter vector is θ = Xβ. As is common in GLM, the function b is implicitly assumed to be twice continuously differentiable with b'' always positive. In sparse modeling, we assume that the majority of the true regression coefficients are exactly zero. Without loss of generality, assume that the first s components of the true coefficient vector are nonzero and the remaining components are zero. Hereafter we refer to the support, the set of indices of the nonzero coefficients, as the true underlying sparse model. Variable selection aims at locating those predictors with nonzero coefficients and giving an efficient estimate of the corresponding coefficient subvector.
In view of (1), the log-likelihood ℓ_n(β) of the sample is given, up to an affine transformation, by (2), where θ_i = x_i^T β for i = 1, ..., n. We consider the following penalized likelihood (3), where p_λ(·) is a penalty function and λ is a regularization parameter.
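
The displayed equations (1)-(3) did not survive extraction. In standard notation for a canonical-link GLM, consistent with the description above, they take the following form; this is a reconstruction of the standard expressions, not a verbatim copy of the paper's displays (in particular, the scaling of the log-likelihood may differ by a constant factor).

```latex
% (1) Canonical exponential family density of y given the design X, with theta = X beta:
\[
f_n(\mathbf{y}; X, \boldsymbol\beta)
  = \prod_{i=1}^n c(y_i)\,
    \exp\!\Bigl\{\frac{y_i\theta_i - b(\theta_i)}{\phi}\Bigr\},
\qquad \boldsymbol\theta = (\theta_1,\dots,\theta_n)^T = X\boldsymbol\beta .
\]
% (2) Log-likelihood, up to an affine transformation, with b applied componentwise:
\[
\ell_n(\boldsymbol\beta)
  = n^{-1}\bigl\{\mathbf{y}^T X\boldsymbol\beta - \mathbf{1}^T b(X\boldsymbol\beta)\bigr\}.
\]
% (3) Penalized likelihood with penalty p_lambda and regularization parameter lambda:
\[
Q_n(\boldsymbol\beta)
  = \ell_n(\boldsymbol\beta) - \sum_{j=1}^{p} p_\lambda\bigl(|\beta_j|\bigr).
\]
```
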
In a pioneering paper, Fan and Li (2001) build the theoret-
ical foundation of nonconcave penalized likelihood for vari-
able selection. The penalty functions that they used are not any
nonconvex functions, but really the folded-concave functions.
For this reason, we will call them more precisely folded-con-
cave penalties. The paper also introduces the oracle property for
model selection. An estimator is said to have the oracle property (Fan and Li, 2001) if it enjoys model selection consistency, in the sense that its support equals the true support with probability tending to 1 as n → ∞, and it attains an information bound
mimicking that of the oracle estimator, where the relevant subvector is formed by the first s components of the estimator and the oracle knew the true model ahead of time. Fan and Li
(2001) study the oracle properties of nonconcave penalized like-
lihood estimators in the finite-dimensional setting. Their results
were extended later by Fan and Peng (2004) to the setting of a diverging number of parameters in a general likelihood framework.
How large can the dimensionality p_n be, compared with the sample size n, such that the oracle property continues to
hold in penalized likelihood estimation? What role does the
penalty function play? In this paper, we provide an answer to
these long-standing questions for a class of penalized likeli-
hood methods using folded-concave penalties in the context
of GLMs with NP-dimensionality. We also characterize the
nonasymptotic weak oracle property and the global optimality
of the nonconcave penalized maximum likelihood estimator.
Our theory applies to the L1-penalty as well, but its conditions
are far more stringent than those for other members of the class.
These constitute the main theoretical contributions of the paper.
Numerous efforts have lately been devoted to studying the
properties of variable selection with ultrahigh dimensionality
and significant progress has been made. Meinshausen and
Bühlmann (2006), Zhao and Yu (2006), and Zhang and Huang
(2008) investigate the issue of model selection consistency for
LASSO under different setups when the number of variables is
of a greater order than the sample size. Candes and Tao (2007)
introduce the Dantzig selector to handle the NP-dimensional
variable selection problem, which was shown to behave simi-
larly to Lasso by Bickel
et al. (2009). Zhang (2010) is among
the first to study the nonconvex penalized least-squares esti-
mator with NP-dimensionality and demonstrates its advantages
over LASSO. He also developed the PLUS algorithm to find the
solution path that has the desired sampling properties. Fan and
Lv (2008) and Huang et al. (2008) introduce the independence
screening procedure to reduce the dimensionality in the context
of least-squares. The former establishes the sure screening
property with NP-dimensionality and the latter also studies the
bridge regression, a folded-concave penalty approach. Hall and
Miller (2009) introduce feature ranking using a generalized cor-
relation, and Hall et al. (2009) propose independence screening
using tilting methods and empirical likelihood. Fan and Fan
(2008) investigate the impact of dimensionality on ultrahigh
dimensional classification and establish an oracle property
for features annealed independence rules. Lv and Fan (2009)
make important connections between model selection and
sparse recovery using folded-concave penalties and establish
a nonasymptotic weak oracle property for the penalized least
squares estimator with NP-dimensionality. There are also a
number of important papers on establishing the oracle inequal-
ities for penalized empirical risk minimization. For example,
Bunea et al. (2007) establish sparsity oracle inequalities for the
Lasso under quadratic loss in the context of least-squares; van
de Geer (2008) obtains a nonasymptotic oracle inequality for
the empirical risk minimizer with the L1-penalty in the context
of GLMs; Koltchinskii (2008) proves oracle inequalities for
penalized least squares with entropy penalization.
The rest of the paper is organized as follows. In Section II,
we discuss the choice of penalty functions and characterize the
nonconcave penalized likelihood estimator and its global opti-
mality. We study the nonasymptotic weak oracle properties and
oracle properties of nonconcave penalized likelihood estimator
in Sections III and IV, respectively. Section V introduces a co-
ordinate optimization algorithm, the iterative coordinate ascent
(ICA) algorithm, to solve regularization problems with concave
penalties. In Section VI, we present three numerical examples
using both simulated and real data sets. We provide some discus-
sions of our results and their implications in Section VII. Proofs
are presented in Section VIII. Technical details are relegated to
the Appendix.
II. NONCONCAVE PENALIZED LIKELIHOOD ESTIMATION
In this section, we discuss the choice of penalty functions in
regularization methods and characterize the nonconcave penal-
ized likelihood estimator as well as its global optimality.
A. Penalty Function
For any penalty function p_λ(t), t ≥ 0, define its rescaled version ρ(t; λ). For simplicity, we will drop the dependence on λ and write ρ(t) when there is no confusion. Many penalty functions have been proposed in the literature for regularization. For example, the best subset selection amounts to using the L0 penalty. The ridge regression uses the L2 penalty. The Lq penalty with 0 < q < 2 bridges these two cases (Frank and Friedman, 1993). Breiman (1995) introduces the non-negative garrote for shrinkage estimation and variable selection. Lasso (Tibshirani, 1996) uses the L1-penalized least squares. The SCAD penalty (Fan, 1997; Fan and Li, 2001) is the function whose derivative is given by
p'_λ(t) = λ { I(t ≤ λ) + (aλ − t)_+ / [(a − 1)λ] · I(t > λ) },  t ≥ 0,  a > 2,   (4)
where often a = 3.7 is used, and MCP (Zhang, 2010) is defined through the derivative p'_λ(t) = (aλ − t)_+ / a. Clearly the SCAD penalty takes off at the origin as the L1 penalty and then levels off, and MCP translates the flat part of the derivative of SCAD to the origin. A family of folded-concave penalties that bridge the L0 and L1 penalties was studied by Lv and Fan (2009).
Hereafter we consider penalty functions p_λ that satisfy the following condition:
Condition 1: ρ(t; λ) is increasing and concave in t ∈ [0, ∞), and has a continuous derivative ρ'(t; λ) with ρ'(0+; λ) > 0. In addition, ρ'(t; λ) is increasing in λ and ρ'(0+; λ) is independent of λ.
The above class of penalty functions has been considered by Lv and Fan (2009). Clearly the L1 penalty is a convex function that falls at the boundary of the class of penalty functions satisfying Condition 1. Fan and Li (2001) advocate penalty functions that give estimators with three desired properties: unbiasedness, sparsity and continuity, and provide insights into them (see also Antoniadis and Fan, 2001). SCAD satisfies Condition 1 and the above three properties simultaneously. The L1 penalty and MCP also satisfy Condition 1, but the L1 penalty does not enjoy the unbiasedness due to its constant rate of penalty, and MCP violates the continuity property. However, our results are applicable to the L1-penalized and MCP regression. Condition 1 is needed for establishing the oracle properties of the nonconcave penalized likelihood estimator.
B. Nonconcave Penalized Likelihood Estimator
It is generally difficult to study the global maximizer of
the penalized likelihood analytically without concavity. As
is common in the literature, we study the behavior of local
maximizers.
We introduce some notation to simplify our presentation. For any β, define the mean vector μ(β) and the diagonal variance matrix Σ(β) as in (5). It is known that the n-dimensional response vector following the distribution in (1) has mean vector μ(β) and covariance matrix proportional to Σ(β). We also write the sign vector of β componentwise, where sgn denotes the sign function, and we denote by ‖·‖_q the Lq norm of a vector or matrix for 1 ≤ q ≤ ∞. Following Lv and Fan (2009) and Zhang (2010), define the local concavity of the penalty at a point v as in (6). By the concavity of ρ in Condition 1, the local concavity is nonnegative. It is easy to show by the mean-value theorem that it is attained by the negative second derivative of the penalty near the components of v, provided that the second derivative is continuous. For the SCAD penalty, the local concavity is zero unless some component of v takes values in the middle range of the penalty; in the latter case, it is a positive constant determined by a and λ. Throughout the paper, we use λ_min(·) and λ_max(·) to represent the smallest and largest eigenvalues of a symmetric matrix, respectively.
The following theorem gives a sufficient condition on the
strict local maximizer of the penalized likelihood
in (3).
Theorem 1 (Characterization of PMLE): Assume that p_λ satisfies Condition 1. Then a sparse vector β̂ is a strict local maximizer of the nonconcave penalized likelihood Q_n(β) defined by (3) if conditions (7), (8) and (9) hold, where the two submatrices of X are formed by the columns with indices in the support of β̂ and in its complement, respectively, and the relevant subvector of β̂ is formed by all of its nonzero components. On the other hand, if β̂ is a local maximizer of Q_n(β), then it must satisfy (7)-(9) with strict inequalities replaced by nonstrict inequalities.
There is only a tiny gap (nonstrict versus strict inequalities)
between the necessary condition for a local maximizer and the sufficient condition for a strict local maximizer. Conditions (7) and (9) ensure that β̂ is a strict local maximizer of (3) when constrained to the coordinate subspace on which all components outside the support of β̂ are zero. Condition (8) makes sure that the sparse vector β̂ is indeed a strict local maximizer of (3) on the whole space R^p.
When p_λ is the L1 penalty, the penalized likelihood function Q_n(β) in (3) is concave in β. Then the classical convex optimization theory applies to show that β̂ is a global maximizer if and only if there exists a subgradient of the L1 norm at β̂ such that condition (10) holds; that is, β̂ satisfies the Karush-Kuhn-Tucker (KKT) conditions, where the subdifferential of the absolute value penalty is {sgn(t)} for t ≠ 0 and the interval [−1, 1] at t = 0. Thus, condition (10) reduces to (7) and (8) with strict inequality replaced by nonstrict inequality. Since the local concavity vanishes for the L1-penalty, condition (9) holds provided that the corresponding Hessian submatrix is nonsingular. However, to ensure that β̂ is the strict maximizer we need the strict inequality in (8).
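
The reduced KKT condition for the L1 case can be checked numerically. The sketch below does so for a logistic model with the n^{-1}-scaled log-likelihood of the form given after (3); the fitted vector and variable names are hypothetical, and the check is an illustration rather than part of the paper's methodology.

```python
import numpy as np

def l1_kkt_violation(X, y, beta_hat, lam):
    """Largest violation of the Lasso-type KKT conditions for an L1-penalized
    logistic log-likelihood l_n(beta) = n^{-1} sum_i [y_i x_i'beta - log(1 + exp(x_i'beta))]:
    on the support  grad_j should equal lam * sign(beta_j);
    off the support |grad_j| should be at most lam."""
    n = len(y)
    mu = 1.0 / (1.0 + np.exp(-X @ beta_hat))   # mean response b'(theta) for the logistic model
    grad = X.T @ (y - mu) / n                  # gradient of l_n at beta_hat
    on = beta_hat != 0
    viol_on = np.max(np.abs(grad[on] - lam * np.sign(beta_hat[on]))) if on.any() else 0.0
    viol_off = max(np.max(np.abs(grad[~on])) - lam, 0.0) if (~on).any() else 0.0
    return max(viol_on, viol_off)
```
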
C. Global Optimality
A natural question is when the nonconcave penalized max-
imum likelihood estimator (NCPMLE)
is a global maximizer
of the penalized likelihood
. We characterize such a prop-
erty from two perspectives.
1) Global Optimality: Assume that the n × p design matrix X has full column rank p. This implies that p ≤ n. Since b'' is always positive, it is easy to show that the Hessian matrix of the negative log-likelihood is always positive definite, which entails that the log-likelihood function ℓ_n(β) is strictly concave in β. Thus, there exists a unique maximizer of ℓ_n(β). Consider a sublevel set of the negative log-likelihood at a given level, and let the maximum concavity of the penalty function p_λ be defined as usual; for the L1 penalty, SCAD and MCP, it equals 0, 1/(a − 1) and 1/a, respectively. The following proposition gives a sufficient condition for the global optimality of the NCPMLE.
Proposition 1 (Global Optimality): Assume that X has rank p and condition (11) is satisfied. Then the NCPMLE β̂ is a global maximizer of the penalized likelihood Q_n(β) if it lies in the sublevel set under consideration.
Note that for penalized least-squares, (11) reduces to (12). This condition holds for sufficiently large a in SCAD and MCP, when the correlation between covariates is not too strong. The latter holds for design matrices constructed by using spline bases to approximate a nonparametric function. According to Proposition 1, under (12), the penalized least-squares estimator with a folded-concave penalty is a global minimizer of its objective.
The proposition below gives a condition under which the
penalty term in (3) does not change the global maximizer. It

will be used to derive the condition under which the PMLE is
the same as the oracle estimator in Proposition 3(b). Here for
simplicity we consider the SCAD penalty
given by (4), and
the technical arguments are applicable to other folded-concave
penalties as well.
Proposition 2 (Robustness): Assume that the relevant design submatrix has full column rank and that the maximum likelihood estimate satisfies the stated magnitude bound. Then the SCAD penalized likelihood estimator β̂ is the global maximizer and equals the unpenalized maximum likelihood estimator, provided that the regularization parameter λ and the SCAD parameter a satisfy the stated inequalities.
2) Restricted Global Optimality: When p > n, it is hard to show the global optimality of a local maximizer. However, we can study the global optimality of the NCPMLE β̂ on a union of coordinate subspaces. A subspace of R^p is called a coordinate subspace if it is spanned by a subset of the natural basis {e_1, ..., e_p}, where each e_j is the p-vector with jth component 1 and 0 elsewhere. Here each e_j corresponds to the jth predictor. We will investigate the global optimality of β̂ on the union of all coordinate subspaces of the given dimension in Proposition 3(a).
Of particular interest is to derive the conditions under which the PMLE is also an oracle estimator, in addition to being a restricted global maximizer on this union of coordinate subspaces. To this end, we introduce an identifiability condition on the true model. The true model is called δ-identifiable for some δ > 0 if (13) holds; in other words, the true support is the best subset of its size, with a margin of at least δ. The following proposition is an easy consequence of Propositions 1 and 2.
Proposition 3 (Global Optimality on the Union of Coordinate Subspaces):
a) If the conditions of Proposition 1 are satisfied for each submatrix of X corresponding to such a coordinate subspace, then the NCPMLE β̂ is a global maximizer of Q_n(β) on this union.
b) Assume that the conditions of Proposition 2 are satisfied for the submatrix of X formed by the columns in the true model, that the true model is δ-identifiable for a sufficiently large δ, and that the remaining stated condition holds. Then the SCAD penalized likelihood estimator β̂ is the global maximizer on this union and equals the oracle maximum likelihood estimator.
On the event that the PMLE estimator is the same as the oracle
estimator, it possesses of course the oracle property.
III. NONASYMPTOTIC WEAK ORACLE PROPERTIES
In this section, we study a nonasymptotic property of the nonconcave penalized likelihood estimator β̂, called the weak oracle property, introduced by Lv and Fan (2009) in the setting of penalized least squares. The weak oracle property means sparsity, in the sense that the components outside the true support are estimated as exactly zero with probability tending to 1 as n → ∞, and consistency under the L∞ loss for the subvector formed by the components in the true support. This property is weaker than the oracle property introduced by Fan and Li (2001).
A. Regularity Conditions
As mentioned before, we condition on the design matrix and use a penalty in the class satisfying Condition 1. Let the two submatrices of the design matrix X be formed by the columns with indices in the true support and in its complement, respectively. To simplify the presentation, we assume without loss of generality that each covariate has been standardized so that its L2-norm equals n^{1/2}; if the covariates have not been standardized, the results still hold with the column norms assumed to be of order n^{1/2}. Let the quantity in (14) be half of the minimum signal. We make the following assumptions on the design matrix and the distribution of the response. Let a diverging sequence of positive numbers be given that depends on the nonsparsity size s and hence on n. Recall that the first block of the true parameter vector collects its nonvanishing components.
Condition 2: The design matrix satisfies conditions (15), (16) and (17), where the ‖·‖∞ norm of a matrix is the maximum of the L1 norm of its rows, the derivative is taken componentwise, and ∘ denotes the Hadamard (componentwise) product.
Here and below, this diverging sequence is associated with the regularization parameter λ satisfying (18) unless specified otherwise. For the classical Gaussian linear regression model, the variance weight b'' is constant, so the weight matrix in these conditions is proportional to the identity. In this case, under the assumed bound on the nonsparsity size, condition (15) usually holds. In fact, Wainwright (2009) gives a corresponding bound when the rows of the relevant design submatrix are i.i.d. Gaussian vectors. More generally, (15) can be bounded through the analogous quantity computed on a submatrix of the samples whose weights are bounded away from zero, and the above remark for the multiple regression model applies to that submatrix.
The left-hand side of (16) collects the multiple regression coefficients of each unimportant variable on the important ones, computed by weighted least squares with the model-based weights. The order requirement is mainly technical and can be relaxed, whereas the boundedness condition itself is genuine. When the L1 penalty is used, the upper bound in (16) is more restrictive, requiring the quantity to be uniformly less than 1. This condition is the same as the strong irrepresentable condition of Zhao and Yu (2006) for the consistency of the LASSO estimator. It is a drawback of the L1 penalty. In contrast, when a folded-concave penalty is used, the upper bound on the right-hand side of (16) can grow to infinity at the rate the penalty allows.
Condition (16) controls the uniform growth rate of the L1-norm of these multiple regression coefficients, a notion of weak correlation between the unimportant and the important covariates. If each element of the multiple regression coefficients is bounded, then this norm grows at most proportionally to the number of important covariates. Hence, by (16), we can handle a growing nonsparse dimensionality as long as the first term in (16) dominates, which occurs for the SCAD type of penalty. Of course, the actual dimensionality can be higher or lower, depending on the correlation between the two groups of covariates, but for finite nonsparse dimensionality, (16) is usually satisfied.
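
In the unweighted least-squares case, the quantity discussed above reduces to the classical strong irrepresentable quantity of Zhao and Yu (2006). The NumPy sketch below computes its uniform version for a given split into important and unimportant columns; it is an illustration of that special case only, since condition (16) in the paper uses weighted least squares.

```python
import numpy as np

def irrepresentable_norm(X, support):
    """Max over unimportant columns x_j of the L1 norm of the regression coefficients
    of x_j on the important columns X_1, i.e. the matrix infinity-norm of
    X_2' X_1 (X_1' X_1)^{-1} (unweighted analogue of the quantity in (16))."""
    mask = np.zeros(X.shape[1], dtype=bool)
    mask[np.asarray(support)] = True
    X1, X2 = X[:, mask], X[:, ~mask]
    coef = np.linalg.solve(X1.T @ X1, X1.T @ X2)   # columns: coefficients of each x_j on X_1
    return np.max(np.abs(coef).sum(axis=0))

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 30))
print(irrepresentable_norm(X, support=[0, 1, 2, 3]))  # uniformly below 1 is the Lasso-style requirement
```
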
For the Gaussian linear regression model, condition (17)
holds automatically.
We now choose the regularization parameter λ and introduce Condition 3. We will assume that half of the minimum signal decays no faster than a polynomial rate in n, and we take λ satisfying the rate constraint (18), where the associated quantity depends on the nonsparsity size.
Condition 3: Assume that the dimensionality, the nonsparsity size and λ satisfy the stated growth conditions, with λ satisfying (18), and that an additional rate condition holds if the responses are unbounded.
The condition involving the local concavity of the penalty is needed to ensure condition (9). It always holds when the local concavity vanishes, and it is satisfied for the SCAD type of penalty when the minimum signal is sufficiently large relative to λ.
In view of (7) and (8), to study the nonconcave penalized likelihood estimator β̂ we need to analyze the deviation of the p-dimensional random vector X^T y from its mean, where y denotes the n-dimensional random response vector in the GLM (1). The following proposition, whose proof is given in Section VIII.E, characterizes such deviation for the case of bounded responses and the case of unbounded responses satisfying a moment condition, respectively.
Proposition 4 (Deviation): Let y be the n-dimensional random response vector with independent components and let a be a fixed n-vector. Then:
a) if the responses are bounded in a fixed interval, then for any ε > 0 the exponential tail bound (19) holds;
b) if the responses are unbounded and there exist constants such that the moment condition (20) holds, then for any ε > 0 the tail bound (21) holds.
In light of (1), it is known that for the exponential family the moment-generating function of each response is available in closed form through b(·), for arguments in the domain of b. Thus, the moment condition (20) is reasonable. It is easy to show that condition (20) holds for the Gaussian linear regression model and for the Poisson regression model with bounded mean responses. Similar probability bounds also hold for sub-Gaussian errors.
We now express the results in Proposition 4 in a unified form. For the case of bounded responses, we define the rate function from the Hoeffding-type bound (19); for the case of unbounded responses satisfying the moment condition (20), we define it from the Bernstein-type bound (21). Then the exponential bounds in (19) and (21) can be expressed in the common form (22), with the rate function chosen according to whether the responses are bounded or unbounded.
B. Weak Oracle Properties
Theorem 2 (Weak Oracle Property): Assume that Conditions 1-3 and the probability bound (22) are satisfied, along with the stated growth conditions. Then there exists a nonconcave penalized likelihood estimator β̂ such that, for sufficiently large n and with probability at least the stated nonasymptotic bound, β̂ satisfies:
a) (Sparsity) the components of β̂ outside the true support are exactly zero;
b) (L∞ loss) the subvector of β̂ formed by the components in the true support is within the stated L∞ distance of the corresponding subvector of the true coefficient vector.
Under the given regularity conditions, the dimensionality is allowed to grow up to exponentially fast with the sample size n. The quantity that controls this growth rate also enters the nonasymptotic probability bound. This probability tends to 1 under our technical assumptions. From the proof of Theorem 2, we see that with asymptotic probability one, the estimation loss of the nonconcave penalized likelihood estimator β̂ is bounded from above by three terms (see (45)), where the second term is associated with the penalty function. For the L1 penalty, the relevant ratio is equal to one, and for other concave penalties it can be (much) smaller than one. This is in line with the
Citations
Journal Article
TL;DR: In this paper, a brief account of the recent developments of theory, methods, and implementations for high-dimensional variable selection is presented, with emphasis on independence screening and two-scale methods.
Abstract: High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discoveries. The traditional idea of best subset selection methods, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality. They have been widely applied for simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of the recent developments of theory, methods, and implementations for high dimensional variable selection. What limits of the dimensionality such methods can handle, what the role of penalty functions is, and what the statistical properties are rapidly drive the advances of the field. The properties of non-concave penalized likelihood and its roles in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods.

892 citations

Book
11 Apr 2019
TL;DR: This book provides a self-contained introduction to the area of high-dimensional statistics, aimed at the first-year graduate level, and includes chapters that are focused on core methodology and theory - including tail bounds, concentration inequalities, uniform laws and empirical process, and random matrices.
Abstract: Recent years have witnessed an explosion in the volume and variety of data collected in all scientific disciplines and industrial settings. Such massive data sets present a number of challenges to researchers in statistics and machine learning. This book provides a self-contained introduction to the area of high-dimensional statistics, aimed at the first-year graduate level. It includes chapters that are focused on core methodology and theory - including tail bounds, concentration inequalities, uniform laws and empirical process, and random matrices - as well as chapters devoted to in-depth exploration of particular model classes - including sparse linear models, matrix models with rank constraints, graphical models, and various types of non-parametric models. With hundreds of worked examples and exercises, this text is intended both for courses and for self-study by graduate students and researchers in statistics, machine learning, and related fields who must understand, apply, and adapt modern statistical methods suited to large-scale data.

748 citations

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed a unified framework named detecting contiguous outliers in the LOw-rank representation (DECOLOR), which integrates object detection and background learning into a single process of optimization, which can be solved by an alternating algorithm.
Abstract: Object detection is a fundamental step for automated video analysis in many vision applications. Object detection in a video is usually performed by object detectors or background subtraction techniques. Often, an object detector requires manually labeled examples to train a binary classifier, while background subtraction needs a training sequence that contains no objects to build a background model. To automate the analysis, object detection without a separate training phase becomes a critical task. People have tried to tackle this task by using motion information. But existing motion-based methods are usually limited when coping with complex scenarios such as nonrigid motion and dynamic background. In this paper, we show that the above challenges can be addressed in a unified framework named DEtecting Contiguous Outliers in the LOw-rank Representation (DECOLOR). This formulation integrates object detection and background learning into a single process of optimization, which can be solved by an alternating algorithm efficiently. We explain the relations between DECOLOR and other sparsity-based methods. Experiments on both simulated data and real sequences demonstrate that DECOLOR outperforms the state-of-the-art approaches and it can work effectively on a wide range of complex scenarios.

579 citations


Journal ArticleDOI
TL;DR: In this article, a discrete extension of modern first-order continuous optimization methods is proposed to find high quality feasible solutions that are used as warm starts to a MIO solver that finds provably optimal solutions.
Abstract: In the period 1991–2015, algorithmic advances in Mixed Integer Optimization (MIO) coupled with hardware improvements have resulted in an astonishing 450 billion factor speedup in solving MIO problems. We present a MIO approach for solving the classical best subset selection problem of choosing $k$ out of $p$ features in linear regression given $n$ observations. We develop a discrete extension of modern first-order continuous optimization methods to find high quality feasible solutions that we use as warm starts to a MIO solver that finds provably optimal solutions. The resulting algorithm (a) provides a solution with a guarantee on its suboptimality even if we terminate the algorithm early, (b) can accommodate side constraints on the coefficients of the linear regression and (c) extends to finding best subset solutions for the least absolute deviation loss function. Using a wide variety of synthetic and real datasets, we demonstrate that our approach solves problems with $n$ in the 1000s and $p$ in the 100s in minutes to provable optimality, and finds near optimal solutions for $n$ in the 100s and $p$ in the 1000s in minutes. We also establish via numerical experiments that the MIO approach performs better than Lasso and other popularly used sparse learning procedures, in terms of achieving sparse solutions with good predictive power.

441 citations

References
Journal ArticleDOI
TL;DR: A new method for estimation in linear models called the lasso, which minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, is proposed.
Abstract: SUMMARY We propose a new method for estimation in linear models. The 'lasso' minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.

40,785 citations


"Nonconcave Penalized Likelihood Wit..." refers methods in this paper

  • ...In this section we discuss the choice of penalty functions in regularization methods and characterize the non-concave penalized likelihood estimator as well as its global optimality....


  • ...Lasso (Tibshirani, 1996) uses the L1-penalized least squares....


Book
01 Jan 1983
TL;DR: In this paper, a generalization of the analysis of variance is given for these models using log- likelihoods, illustrated by examples relating to four distributions; the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables), and gamma (variance components).
Abstract: The technique of iterative weighted linear regression can be used to obtain maximum likelihood estimates of the parameters with observations distributed according to some exponential family and systematic effects that can be made linear by a suitable transformation. A generalization of the analysis of variance is given for these models using log- likelihoods. These generalized linear models are illustrated by examples relating to four distributions; the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables) and gamma (variance components).

23,215 citations

Journal ArticleDOI
TL;DR: This is the first book on generalized linear models written by authors not mostly associated with the biological sciences, and it is thoroughly enjoyable to read.

10,520 citations


"Nonconcave Penalized Likelihood Wit..." refers background in this paper

  • ...Generalized linear models (GLMs) provide a flexible parametric approach to estimating the covariate effects (McCullagh and Nelder, 1989)....


  • ...Jinchi Lv is Assistant Professor of Statistics, Information and Operations Management Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089, USA (e-mail: jinchilv@marshall.usc.edu)....


Book ChapterDOI
TL;DR: In this article, upper bounds for the probability that the sum S of n independent random variables exceeds its mean ES by a positive number nt are derived for certain sums of dependent random variables such as U statistics.
Abstract: Upper bounds are derived for the probability that the sum S of n independent random variables exceeds its mean ES by a positive number nt. It is assumed that the range of each summand of S is bounded or bounded above. The bounds for Pr {S – ES ≥ nt} depend only on the endpoints of the ranges of the summands and the mean, or the mean and the variance of S. These results are then used to obtain analogous inequalities for certain sums of dependent random variables such as U statistics and the sum of a random sample without replacement from a finite population.

8,655 citations


"Nonconcave Penalized Likelihood Wit..." refers background or methods in this paper

  • ...Then the exponential bounds in (19) and (21) can be expressed as...


  • ...Part a) follows easily from a simple application of Hoeffding’s inequality (Hoeffding, 1963), since a1Y1, · · · , anYn are n independent bounded random variables, where a = (a1, · · · , an)T ....


Frequently Asked Questions (10)
Q1. What are the contributions mentioned in the paper "Nonconcave penalized likelihood with np-dimensionality" ?

In this paper, the authors show that in the context of generalized linear models, such methods possess model selection consistency with oracle properties even for dimensionality of nonpolynomial ( NP ) order of sample size, for a class of penalized likelihood approaches using folded-concave penalty functions, which were introduced to ameliorate the bias problems of convex penalty functions. 

A sparse vector is a strict local maximizer of the nonconcave penalized likelihood defined by (3) if conditions (7)-(9) hold, where the two submatrices of the design matrix are formed by the columns with indices in its support and in the complement, and the relevant subvector is formed by all of its nonzero components.

In this case, the dimensionality that the penalized least-squares estimator can handle is correspondingly lower, and is usually smaller than in the other case considered.

Condition (16) controls the uniform growth rate of the L1-norm of these multiple regression coefficients, a notion of weak correlation between the unimportant and the important covariates.

A subspace of R^p is called a coordinate subspace if it is spanned by a subset of the natural basis {e_1, ..., e_p}, where each e_j is the p-vector with jth component 1 and 0 elsewhere.

Then there exists a strict local maximizer of the penalized likelihood such that, with probability tending to 1 as n → ∞, its components outside the true support are zero, and the subvector formed by the components in the true support is consistent.

More generally, when the second derivative of the penalty function does not necessarily exist, it is easy to show that the second part of the matrix can be replaced by a diagonal matrix whose maximum absolute element is suitably bounded.

By the concavity of the penalty, the authors can easily show that the relevant sublevel set is a closed convex set, with the two points of interest as interior points and the corresponding level set as its boundary.

When the log-likelihood is quadratic in β, e.g., for the Gaussian linear regression model, the second-order approximation in ICA is exact at each step.

Due to its popularity, the authors now examine the implications of Theorem 2 in the context of penalized least-squares and penalized likelihood.