Stat
The ISI's Journal for the Rapid Dissemination of Statistics Research
(wileyonlinelibrary.com) DOI: 10.100X/sta.0000
Variable Selection in Generalized Functional Linear Models

J. Gertheiss^a, A. Maity^b, A.-M. Staicu^b

Received 00 Month 2013; Accepted 00 Month ????
Modern research data, where a large number of functional predictors is collected on few subjects, are becoming increasingly common. In this paper we propose a variable selection technique for settings where the predictors are functional and the response is scalar. Our approach is based on adopting a generalized functional linear model framework and using a penalized likelihood method that simultaneously controls the sparsity of the model and the smoothness of the corresponding coefficient functions by adequate penalization. The methodology is characterized by high predictive accuracy, yields interpretable models, and retains computational efficiency. The proposed method is investigated numerically in finite samples, and applied to a diffusion tensor imaging tractography data set and a chemometric data set. Copyright © 2013 John Wiley & Sons, Ltd.

Keywords: group lasso; multiple functional predictors; penalized estimation; variable selection
1. Introduction
Functional linear regression (Ramsay & Dalzell, 1991) is widely used to explore the relationship between a scalar
response and functional predictors. In particular, (generalized) functional linear models are popular in areas including
biomedical studies, chemometrics and commerce. In such models, the effect of each functional predictor on the response
is modeled through the inner product of the functional covariate and an unknown smooth coefficient function. In this
article, we consider the situation where multiple functional predictors are observed, but only a few of these predictors
are actually useful in predicting the response. We develop a variable selection procedure to select the important
functional predictors and estimate the corresponding coefficient functions simultaneously. Our procedure controls both
the sparseness of the regression model and the smoothness of the coefficient functions. Furthermore, the methods
can accommodate functional predictors that are measured with error or are observed at only a sparse set of points. We investigate the finite-sample performance of our procedure via a simulation study and illustrate our method by applying it to a diffusion tensor imaging data set with 30 covariates and a chemometric data set with seven covariates.
Let $Y_i$ denote the scalar response for subject $i$, and $X_{i1}, \ldots, X_{ip}$ denote independent realizations of the square integrable random curves $X_1, \ldots, X_p$, respectively, where $X_j : \mathcal{D}_j \subset \mathbb{R} \to \mathbb{R}$, for $i = 1, \ldots, n$. Without loss of generality,
^a Department of Animal Sciences, Georg-August-Universität Göttingen, Germany
^b Department of Statistics, North Carolina State University, USA
Email: jgerthe@uni-goettingen.de
assume that the underlying trajectories $X_j$ have mean function equal to zero. For simplicity of exposition, consider first the case where the $X_{ij}$ are observed at a dense grid of time points $\{t_{j1}, \ldots, t_{jN_j}\}$ and do not contain measurement error; this limitation is relaxed in Section 3. In full generality, it is assumed that, given $X_{i1}, \ldots, X_{ip}$, the distribution of the outcome $Y_i$ is in the exponential family with linear predictor $\eta_i$ and dispersion parameter $\phi$, denoted here by $EF(\eta_i)$. The linear predictor is set to have the following form

$$\eta_i = \alpha + \sum_{j=1}^{p} \int_{\mathcal{D}_j} X_{ij}(t)\,\beta_j(t)\,dt, \qquad (1)$$
with $E[Y_i \mid X_{i1}, \ldots, X_{ip}] = \mu_i = h^{-1}(\eta_i)$, where $h(\cdot)$ is a known link function. The coefficient functions $\beta_j(t)$ are assumed to be smooth and square integrable, and represent the main object of interest. The function $\beta_j(t)$ quantifies the effect of the functional predictor $X_{ij}(t)$ on the mean response $\mu_i$. For convenience, throughout the paper the domain of integration $\mathcal{D}_j$ in the integral $\int_{\mathcal{D}_j} X_{ij}(t)\,\beta_j(t)\,dt$ is often suppressed. A special case of model (1) corresponds to the identity link function $h(\mu_i) = \mu_i$. The resulting model is known as the functional linear model, and can be written alternatively as

$$Y_i = \alpha + \sum_{j=1}^{p} \int X_{ij}(t)\,\beta_j(t)\,dt + \epsilon_i, \qquad (2)$$

where $Y_i$ is the scalar response for observation $i$, $i = 1, \ldots, n$, and the $\epsilon_i$ are independent random errors with mean 0 and variance $\sigma^2$, typically assumed to be normally distributed. For simplicity of exposition we present the main ideas for the continuous response case (2) first, and discuss the modifications required by a generalized response thereafter.
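To make the setup concrete, the following minimal R sketch (our illustration, not code from the paper; all sample sizes, grids and functions are arbitrary choices) simulates data from model (2) with p = 3 functional predictors, of which only the first is associated with the response:

set.seed(1)
n <- 100; p <- 3; N <- 50
tgrid <- seq(0, 1, length.out = N)        # common, equally spaced observation grid
Delta <- tgrid[2] - tgrid[1]              # grid spacing

## p random curves per subject: simple Fourier-type trajectories with mean zero
X <- lapply(1:p, function(j) {
  a <- matrix(rnorm(2 * n), n, 2)
  a[, 1] %o% sin(2 * pi * tgrid) + a[, 2] %o% cos(2 * pi * tgrid)
})

beta1 <- function(t) sin(pi * t)          # the only non-zero coefficient function
eta <- 1 + Delta * (X[[1]] %*% beta1(tgrid))   # alpha + Riemann sum for the integral
Y <- as.numeric(eta) + rnorm(n, sd = 0.5)      # model (2) with normal errors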
The current literature on generalized functional linear models focuses primarily on the estimation of the model components, the coefficient functions $\beta_j(\cdot)$. For example, Goldsmith et al. (2011a), James (2002), James et al. (2009), Marx & Eilers (1999), Müller & Stadtmüller (2005) and Tutz & Gertheiss (2010) discuss estimation and/or inference for the smooth effect functions $\beta_j(\cdot)$, in the case of one or multiple functional predictors. More recently, Kong et al. (2013) and Swihart et al. (2013) discuss hypothesis testing procedures when the number of functional predictors is assumed small. These methods perform well when the response variable is indeed related to most of the functional covariates. However, there is limited literature on the situation where the number of available functional predictors may be unnecessarily large and many of them are unrelated to the response; see for example the data illustrations in Section 5. In these cases, typical association models that relate the response to all the measured predictors lead to unnecessarily complex models that do not accurately describe the data generating process, and have low predictive capability. The focus of this paper is to develop procedures that perform variable selection when the predictors are curves, as well as estimation of the corresponding smooth effects of the selected functional variables.
Variable selection, especially in multivariate statistics, has attracted considerable attention in recent years. In this context, penalized likelihood methods have emerged as highly successful selection techniques in the presence of high dimensional vector-valued predictors; see for example the LASSO (Tibshirani, 1996), the SCAD (Fan & Li, 2001), the adaptive LASSO (Zou, 2006) and OSCAR (Bondell & Reich, 2008). The general idea of the penalized approaches is to impose various constraints on the coefficients, such as an $L_1$ norm or pairwise $L_\infty$ norm penalty, which result in sets of coefficients becoming identically zero. However, extending these ideas to the setting of functional predictors is not straightforward: the complexity of variable selection problems in functional regression is usually much greater than that of standard multivariate variable selection problems. To fix ideas, consider the diffusion tensor imaging tractography data that we use as an illustrative example in Section 5.1. In this data set, there are five tracts and six different measurements recorded for each tract, resulting in 30 functional predictors. The complexity arises because (i) each functional covariate is measured with error, (ii) some of the covariates are measured on irregularly spaced grid points, and (iii) not all the covariates have the same domain. Thus any variable selection procedure intended for this type of data needs to account for all of the above issues.
A direct generalization of multivariate variable selection ideas to setting (2) would require two steps. First, represent the integrals $\int X_{ij}(t)\,\beta_j(t)\,dt \approx \sum_l X_{ij}(t_l)\,\beta_j(t_l)$ using Riemann integration techniques and reformulate the problem in the
typical linear form. The second step is to apply classical variable selection procedures to the high dimensional "artificial" predictors $\{X_{j,l} = X_j(t_l) : j, l\}$. The main difficulty is the expected high correlation among predictors obtained by evaluating a single functional predictor, say $X_j$, at different time points $t_l$, say $\{X_{j,1}, X_{j,2}, \ldots\}$. This challenge can in turn be bypassed by using methods developed for group variable selection, including group LARS and group LASSO (see, for example, Yuan & Lin, 2006), which target highly correlated scalar predictors. These penalties, however, do not impose smoothness on the coefficients $\beta_{j,l} = \beta_j(t_l)$, viewed as functions of $t$. Without smoothness, interpretation of the functional predictors' influence on the response is meaningless.
In this article, we consider a penalized likelihood approach that combines selection of the functional predictors and estimation of the smooth effects for the chosen subset of predictors. Specifically, for the functional regression model (2), the estimates of the coefficient functions $\beta_j(\cdot)$ are the minimizers of

$$\sum_{i=1}^{n} \Big\{Y_i - \alpha - \sum_{j=1}^{p} \int X_{ij}(t)\,\beta_j(t)\,dt\Big\}^2 + \sum_{j=1}^{p} P_{\lambda,\varphi}(\beta_j),$$

where $P_{\lambda,\varphi}$ is a penalty on the coefficient function $\beta_j$, and $\lambda$ and $\varphi$ are global non-negative penalization parameters that control the sparseness and the smoothness of the solution, respectively. More generally, when the response is generalized, the penalized criterion is similar to the above, with the difference that the first term is replaced by minus twice the corresponding log-likelihood function. We propose a penalty function that is inspired by the penalization proposed for high-dimensional additive models (Meier et al., 2009). Thus the proposed approach combines functional data analysis and variable selection techniques for high-dimensional additive models. Recently, Fan & James (2012) (unpublished manuscript) investigated variable selection in linear and non-linear functional regression for a continuous scalar response. They use a 'basis approach' to modeling the coefficient functions $\beta_j(\cdot)$, where the number of basis functions determines the smoothness of $\beta_j(\cdot)$ and is a tuning parameter, and a penalty term constructed on the norm of the entire effect, $\int X_j(t)\,\beta_j(t)\,dt$. In contrast, in this article we propose a 'penalization approach': we model the coefficient functions using a rich function basis, and use a penalty term that explicitly controls the smoothness of $\beta_j(t)$ and the sparseness of the model. Moreover, our procedure is also developed for generalized functional linear models.
The remainder of the paper is organized as follows. Section 2 presents the methodology for simultaneous variable selection and estimation of smooth effects when the predictors are functions observed without noise at a dense grid of points. Section 3 presents the extension to functional predictors corrupted by noise and measured at a dense or sparse grid of points. The proposed methods are illustrated numerically in simulation experiments in Section 4 and on two real data examples in Section 5: the sugar spectra and tractography data sets. Section 6 concludes with a brief discussion.
2. Methodology
2.1. Parameter Modeling
Our approach to estimating the coefficient functions $\beta_j$ is to use a pre-set basis function expansion, such as a B-spline basis or a truncated polynomial basis. Specifically, let $b_j(t) = \{b_{j1}(t), \ldots, b_{jq}(t)\}$ be such a finite basis, and consider the approximation $\beta_j(t) = \sum_{r=1}^{q} \gamma_{jr}\,b_{jr}(t)$, where the $\gamma_{jr}$ are the corresponding basis coefficients. The choice of the basis functions is related to characteristics of the coefficient functions $\beta_j$, such as differentiability, while the number of basis functions is related to the coefficients' smoothness. In particular, a small number of basis functions leads to very smooth solutions, while a large number of basis functions results in very wiggly solutions. In this paper we allow the number of basis functions to be large enough to capture the complexity of the function, and we control the smoothness of the fit using an additional parameter that penalizes the roughness of the fit and is chosen data-adaptively.
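As an illustration of such a pre-set expansion (continuing the simulated example above; the basis dimension q = 15 is an arbitrary 'rich' choice), a cubic B-spline basis can be set up with the base R splines package:

library(splines)
q <- 15                                    # rich basis; roughness is penalized below
B <- bs(tgrid, df = q, intercept = TRUE)   # N x q matrix of basis values b_r(t_l)
## on the grid, beta_j(t) is approximated by B %*% gamma_j for coefficients gamma_j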
When the functional predictors $X_{ij}(\cdot)$ are observed without measurement error and at an equally spaced dense grid of points $\{t_{j1}, \ldots, t_{jN_j}\}$, the integral in (1) or (2) can be approximated by the Riemann sum

$$\int X_{ij}(t)\,\beta_j(t)\,dt \approx \sum_{r} \Big\{\Delta_j \sum_{l} X_{ij}(t_{jl})\,b_{jr}(t_{jl})\Big\}\,\gamma_{jr} = Z_{ij}^{\top}\gamma_j,$$

where $\gamma_j = (\gamma_{j1}, \ldots, \gamma_{jq})^{\top}$, $Z_{ij} = (Z_{ij1}, \ldots, Z_{ijq})^{\top}$, $Z_{ijr} = \Delta_j \sum_{l} X_{ij}(t_{jl})\,b_{jr}(t_{jl})$, and $\Delta_j = t_{jl} - t_{j,l-1}$ denotes the
distance between two adjacent measurement points. Thus, model (2) can be approximated by a typical linear regression model $Y_i = \alpha + \sum_{j=1}^{p} Z_{ij}^{\top}\gamma_j + \epsilon_i$, where the $Z_{ij}$ are known quantities, and $\alpha$ and the $\gamma_j$ are unknown regression coefficients. Nevertheless, using a classical penalty function on the model parameters $\gamma_{jr}$, in particular a sparseness-inducing penalty, is not directly applicable in this setting because the parameters $\gamma_{jr}$ have a preset grouping structure. Group variable selection, on the other hand, is an attractive method to impose sparseness. However, it implicitly assumes that the smoothness of the coefficient functions $\beta_j$ is specifically controlled by the number of basis functions. For example, Matsui & Konishi (2011) and Lian (2013) discussed a groupwise SCAD penalty, employing Gaussian basis functions or the truncated eigenbasis of the functional covariates, respectively, and controlling the smoothness of the corresponding functional coefficient by using only a small number of these basis functions; Zhu & Cox (2009) used a simple group lasso penalty and a very small number of orthonormal basis functions. When using only a small number of basis functions, however, the shape of the fitted functions is strongly influenced by the specific number, type and placement of the basis functions. Our paper relaxes this limitation and allows the coefficient functions to be approximated using a large number of basis functions. This makes the approach more flexible, but the basis expansion alone does not enforce the functions' inherent smoothness. Therefore a different type of penalty is required.
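Continuing the sketch above, the vectors $Z_{ij}$ can be collected into $n \times q$ matrices and stacked column-wise (again our illustration, using a common grid and basis for all $j$):

Zj <- lapply(X, function(Xj) Delta * (Xj %*% B))   # each Z_j is n x q, entries Z_ijr
Z  <- do.call(cbind, Zj)                           # n x (p*q) stacked design matrix
## model (2) is then approximately Y = alpha + Z %*% c(gamma_1, ..., gamma_p) + error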
2.2. Penalized Estimation
We consider a so-called sparsity-smoothness penalty, which results in simultaneous estimation of the parameter functions and sparseness of the solution when the covariates are curves. The proposed penalty function was introduced by Meier et al. (2009) for variable selection in high-dimensional additive modeling. Specifically, let

$$P_{\lambda,\varphi}(\beta_j) = \lambda\big(\|\beta_j\|^2 + \varphi\,\|\beta_j''\|^2\big)^{1/2}, \qquad (3)$$
where $\|\beta_j\|^2 = \int_{\mathcal{D}_j} \{\beta_j(t)\}^2\,dt$ is the squared $L_2$ norm of $\beta_j$ and $\beta_j''(t) = \partial^2 \beta_j(t)/\partial t^2$. Penalties such as (3) ensure that estimates of the quantity appearing inside the square root may be set to zero (see, e.g., Yuan & Lin (2006), Meier et al. (2008)). As $\varphi \geq 0$, this is equivalent to setting $\beta_j(t) = 0$ for all $t$, which implies that predictor $X_j$ is excluded from model (1). If $\varphi > 0$, the second derivative of $\beta_j(t)$ is also penalized, which ensures smoothing of non-zero coefficient functions. Although $\beta_j(t)$ is assumed to be a smooth function, fitted non-zero curves may become rough if the penalty $\|\beta_j''\|^2$ on the second derivative is not included in (3). As rough coefficient functions are hard to interpret, $\varphi > 0$ is strongly recommended. The extent of smoothing is controlled by the value of $\varphi$.
Using a rich pre-set basis function expansion for $\beta_j$, the corresponding penalty can be represented as

$$P_{\lambda,\varphi}(\beta_j) = \lambda\big(\gamma_j^{\top}(\Psi_j + \varphi\,\Omega_j)\,\gamma_j\big)^{1/2},$$

where $\Psi_j$ is the $q \times q$ matrix with $(r,k)$ element $(\Psi_j)_{rk} = \int b_{jr}(t)\,b_{jk}(t)\,dt$, $r,k = 1,\ldots,q$, and $\Omega_j$ is the $q \times q$ matrix with $(r,k)$ element $(\Omega_j)_{rk} = \int b_{jr}''(t)\,b_{jk}''(t)\,dt$, $r,k = 1,\ldots,q$. Furthermore, the penalty can be re-written in a more convenient way as $P_{\lambda,\varphi}(\beta_j) = \lambda\big(\gamma_j^{\top} K_{\varphi,j}\,\gamma_j\big)^{1/2}$, which is a general group lasso-type penalty (Yuan & Lin, 2006) on the vector $\gamma_j$, where $K_{\varphi,j} = \Psi_j + \varphi\,\Omega_j$ is a symmetric and positive definite $q \times q$ matrix.
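The matrices $\Psi_j$ and $\Omega_j$ can be approximated numerically. The sketch below (ours; it rebuilds an equally spaced cubic B-spline basis of dimension q, whose knot placement may differ slightly from that chosen internally by bs()) evaluates the basis and its second derivatives on a fine grid with splines::splineDesign and applies a Riemann approximation to the integrals:

tt  <- seq(0, 1, length.out = 1000); dtt <- tt[2] - tt[1]
inner <- seq(0, 1, length.out = q - 2)[-c(1, q - 2)]    # q - 4 interior knots
knots <- c(rep(0, 4), inner, rep(1, 4))                 # order-4 (cubic) B-splines
B0 <- splineDesign(knots, tt, ord = 4)                               # b_r(t)
B2 <- splineDesign(knots, tt, ord = 4, derivs = rep(2, length(tt)))  # b_r''(t)
Psi   <- crossprod(B0) * dtt      # (Psi)_rk   ~ int b_r(t) b_k(t) dt
Omega <- crossprod(B2) * dtt      # (Omega)_rk ~ int b_r''(t) b_k''(t) dt
phi <- 1                          # illustrative value of the smoothing parameter
K   <- Psi + phi * Omega          # K_{phi,j}, here identical for all j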
Consequently, when the functional linear model (2) is assumed, the estimates of the intercept $\alpha$ and the coefficient functions $\beta_j$ are obtained as $\hat\alpha$ and $\hat\beta_j(t) = \sum_{r=1}^{q} b_{jr}(t)\,\hat\gamma_{jr}$, for $j = 1,\ldots,p$, where $\hat\alpha$ and the $\hat\gamma_j$ are the minimizers of

$$\sum_{i=1}^{n}\Big(Y_i - \alpha - \sum_{j=1}^{p} Z_{ij}^{\top}\gamma_j\Big)^2 + \lambda\sum_{j=1}^{p}\big(\gamma_j^{\top} K_{\varphi,j}\,\gamma_j\big)^{1/2}, \qquad (4)$$

where $\lambda$ is a sparseness tuning parameter and $\varphi$ is a smoothing tuning parameter. If $K_{\varphi,j} = I_q$ for all $j = 1,\ldots,p$, where $I_q$ denotes the $q \times q$ identity matrix, then the estimation in (4) resembles that of the ordinary group lasso, for which built-in software already exists. More generally, by using an appropriate re-parametrization one can still employ existing software, as we show next; the ideas were also pointed out by Meier et al. (2009) in a related context.
Let $K_{\varphi,j} = R_{\varphi,j}^{\top} R_{\varphi,j}$ be the Cholesky decomposition, where $R_{\varphi,j}$ is a non-singular triangular matrix, and define $\tilde\gamma_j = R_{\varphi,j}\,\gamma_j$ and $\tilde Z_{ij} = (R_{\varphi,j}^{-1})^{\top} Z_{ij}$. Then, the penalized criterion (4) reduces to

$$\sum_{i=1}^{n}\Big(Y_i - \alpha - \sum_{j=1}^{p} \tilde Z_{ij}^{\top}\tilde\gamma_j\Big)^2 + \lambda\sum_{j=1}^{p} \|\tilde\gamma_j\|,$$

where $\|\tilde\gamma_j\|$ is the Euclidean norm in $\mathbb{R}^q$. Thus, for a given value of the smoothness parameter $\varphi$, the minimizers of (4) can be formulated as the parameter estimates in an appropriate linear model, via a penalized likelihood criterion using the ordinary group lasso penalty (Meier et al., 2008; Meier, 2009). As a result, for fixed $\varphi$ and $\lambda$, the estimates $\hat{\tilde\gamma}_j$ can be computed using any existing software that can accommodate a group lasso penalty, for example, the R package grplasso. In practice, of course, the appropriate sparseness and smoothing parameters $\lambda$ and $\varphi$ are unknown and need to be estimated from the data, for example by cross-validation. This is discussed in detail in Section 2.4.

When the response is generalized and a generalized functional linear model as in (1) is assumed, the estimation procedure is similar to the one above, with the exception that the first quadratic term of (4) is replaced by minus twice the log-likelihood function corresponding to $Y_i$ as specified by (1).
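A sketch of this route (untested; the grplasso interface is assumed from the package documentation, and the matrix K from the previous snippet is reused for every j): R's chol() returns the upper triangular factor R with K = R'R, so the re-parametrization takes the form below.

Rch <- chol(K)                        # K = t(Rch) %*% Rch
Zt  <- do.call(cbind, lapply(Zj, function(Zjj) Zjj %*% solve(Rch)))  # Z_j %*% R^{-1}

library(grplasso)
index <- c(NA, rep(1:p, each = q))    # one group per predictor; NA = unpenalized
fit <- grplasso(x = cbind(1, Zt), y = Y, index = index,
                lambda = 10, model = LinReg())
## back-transform each selected group, gamma_j = solve(Rch, gamma_tilde_j),
## and recover beta_j(t) as B %*% gamma_j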
2.3. Adaptive Penalized Estimation
Next we discuss an alternative penalty function based on adaptive weights for the penalized estimation criterion. We define the adaptive penalization scheme, similar to the adaptive lasso (Zou, 2006), by introducing weights $w_j$ and $v_j$ in the penalty function (3). Specifically, the adaptive penalization approach uses the penalty

$$P_{\lambda,\varphi}(\beta_j) = \lambda\big(w_j\|\beta_j\|^2 + \varphi\,v_j\|\beta_j''\|^2\big)^{1/2}, \qquad (5)$$
where the weights $w_j$ and $v_j$ are chosen in a data-adaptive way (Meier et al., 2009). The choice of weights is meant to reflect prior beliefs about the true parameter functions and to allow for different shrinkage and smoothness for the different covariates. One possibility for choosing the weights is to use initial parameter estimates based on smoothing alone, without sparseness assumptions. Consider a generalized functional linear model with multiple functional covariates, and denote by $\breve\beta_j$ the estimated coefficient functions $\beta_j$, obtained, for example, by using the approach described in Goldsmith et al. (2011a) and implemented in the R package refund (Crainiceanu et al., 2012). Then, the adaptive weights can be defined as $w_j = 1/\|\breve\beta_j\|$ and $v_j = 1/\|\breve\beta_j''\|$. Adaptive estimation has been shown to reduce the number of false positives considerably in penalty-based variable selection; see Meier et al. (2009) or Gertheiss & Tutz (2012), to name a few. As illustrated by the simulation studies in Section 4, adaptive estimation with the choice of weights above typically leads to improved selection performance in the functional model, too. Given that the computation of the initial estimates $\breve\beta_j$ is not time-consuming, the computational burden for the adaptive penalty (5) does not change much compared with the standard (non-adaptive) penalty (3).
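A sketch of the weight computation (ours): if gamma0 denotes a hypothetical $q$-vector of initial basis coefficients for predictor $j$ from such a smoothness-only fit, the required norms follow from the matrices Psi and Omega above, since $\|\breve\beta_j\|^2 = \breve\gamma_j^{\top}\Psi_j\breve\gamma_j$ and $\|\breve\beta_j''\|^2 = \breve\gamma_j^{\top}\Omega_j\breve\gamma_j$:

## gamma0: hypothetical initial basis coefficients for predictor j (q-vector)
wj <- 1 / sqrt(drop(t(gamma0) %*% Psi   %*% gamma0))   # 1 / ||breve beta_j||
vj <- 1 / sqrt(drop(t(gamma0) %*% Omega %*% gamma0))   # 1 / ||breve beta_j''||
K_ada <- wj * Psi + phi * vj * Omega   # K_{phi,j} implied by the adaptive penalty (5)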
2.4. Choosing the Tuning Parameters
So far, the sparseness parameter $\lambda$ and the smoothness parameter $\varphi$ involved in the penalties (3) and (5) were assumed known. In practice, however, these parameters need to be selected. We consider $K$-fold cross-validation to select $\lambda$ and $\varphi$. Specifically, the original sample is randomly divided into $K$ roughly equal-sized subsamples. For each $k = 1,\ldots,K$, the $k$th subsample is retained as validation data for evaluating the prediction error of the model, and the remaining $K - 1$ subsamples are used as training data. The $K$ estimates of prediction error are averaged; the criterion selects the values of the tuning parameters that minimize the overall prediction error. Although the optimal value of $K$ is still an open problem, typical values of $K$ used in the literature are 5 and 10 (see also Hastie et al. (2009)).
As a measure of prediction accuracy, we use the predictive deviance $D = -2\phi \sum_{i=1}^{n}\{l_i(\hat\mu_i) - l_i(Y_i)\}$, with $l_i(\hat\mu_i)$ denoting the individual log-likelihood (for observation $i$) at the fitted mean value $\hat\mu_i$, and $l_i(Y_i)$ being the corresponding log-likelihood with $\mu_i$ replaced by the observed value $Y_i$ (i.e., the maximum likelihood achievable). For the functional linear model (2) with normal $\epsilon_i$, the predictive deviance simplifies to the sum of squared errors $\sum_i (Y_i - \hat\mu_i)^2$.
References

Basser, P. J., Mattiello, J. & Le Bihan, D. (1994), 'MR diffusion tensor spectroscopy and imaging', Biophysical Journal, 66, 259–267.

Fan, J. & Li, R. (2001), 'Variable selection via nonconcave penalized likelihood and its oracle properties', Journal of the American Statistical Association, 96, 1348–1360.

Tibshirani, R. (1996), 'Regression shrinkage and selection via the lasso', Journal of the Royal Statistical Society, Series B, 58, 267–288.

Yuan, M. & Lin, Y. (2006), 'Model selection and estimation in regression with grouped variables', Journal of the Royal Statistical Society, Series B, 68, 49–67.

Zou, H. (2006), 'The adaptive lasso and its oracle properties', Journal of the American Statistical Association, 101, 1418–1429.