Stat
The ISI's Journal for the Rapid Dissemination of Statistics Research
(wileyonlinelibrary.com) DOI: 10.100X/sta.0000
Variable Selection in Generalized Functional Linear Models

J. Gertheiss^a, A. Maity^b, A.-M. Staicu^b

Received 00 Month 2013; Accepted 00 Month ????
Modern research data, where a large number of functional predictors is collected on few subjects, are becoming increasingly common. In this paper we propose a variable selection technique for settings where the predictors are functional and the response is scalar. Our approach is based on adopting a generalized functional linear model framework and using a penalized likelihood method that simultaneously controls the sparsity of the model and the smoothness of the corresponding coefficient functions by adequate penalization. The methodology is characterized by high predictive accuracy, yields interpretable models, and retains computational efficiency. The proposed method is investigated numerically in finite samples, and applied to a diffusion tensor imaging tractography data set and a chemometric data set. Copyright © 2013 John Wiley & Sons, Ltd.

Keywords: group lasso; multiple functional predictors; penalized estimation; variable selection
1. Introduction
Functional linear regression (Ramsay & Dalzell, 1991) is widely used to explore the relationship between a scalar
response and functional predictors. In particular, (generalized) functional linear models are popular in areas including
biomedical studies, chemometrics and commerce. In such models, the effect of each functional predictor on the response
is modeled through the inner product of the functional covariate and an unknown smooth coefficient function. In this
article, we consider the situation where multiple functional predictors are observed, but only a few of these predictors
are actually useful in predicting the response. We develop a variable selection procedure to select the important
functional predictors and estimate the corresponding coefficient functions simultaneously. Our procedure controls both
the sparseness of the regression model and the smoothness of the coefficient functions. Furthermore, the methods
can accommodate functional predictors that are measured with error or are observed at only a sparse set of points. We investigate the finite-sample performance of our procedure via a simulation study and illustrate our method by applying it to a diffusion tensor imaging data set with 30 covariates and a chemometric data set with seven covariates.
Let $Y_i$ denote the scalar response for subject $i$, and $X_{i1}, \ldots, X_{ip}$ denote independent realizations of the square integrable random curves $X_1, \ldots, X_p$, respectively, where $X_j : \mathcal{D}_j \subset \mathbb{R} \to \mathbb{R}$, for $i = 1, \ldots, n$. Without loss of generality,
^a Department of Animal Sciences, Georg-August-Universität Göttingen, Germany
^b Department of Statistics, North Carolina State University, USA
Email: jgerthe@uni-goettingen.de
assume that the underlying trajectories $X_j$ have mean function equal to zero. For simplicity of exposition, consider first the case where the $X_{ij}$ are observed at a dense grid of time points $\{t_{j1}, \ldots, t_{jN_j}\}$ and do not contain measurement error; this limitation is relaxed in Section 3. In full generality, it is assumed that, given $X_{i1}, \ldots, X_{ip}$, the distribution of the outcome $Y_i$ is in the exponential family with linear predictor $\eta_i$ and dispersion parameter $\phi$, denoted here by $EF(\eta_i)$. The linear predictor is set to have the following form

$$\eta_i = \alpha + \sum_{j=1}^{p} \int_{\mathcal{D}_j} X_{ij}(t)\,\beta_j(t)\,dt, \qquad (1)$$
with $E[Y_i \mid X_{i1}, \ldots, X_{ip}] = \mu_i = h^{-1}(\eta_i)$, where $h(\cdot)$ is a known link function. The coefficient functions $\beta_j(t)$ are assumed to be smooth and square integrable, and represent the main object of interest. The function $\beta_j(t)$ quantifies the effect of the functional predictor $X_{ij}(t)$ on the mean response $\mu_i$. For convenience, throughout the paper the domain of integration $\mathcal{D}_j$ in the integral $\int_{\mathcal{D}_j} X_{ij}(t)\,\beta_j(t)\,dt$ is often suppressed. A special case of model (1) corresponds to the identity link function $h(\mu_i) = \mu_i$. The resulting model is known as the functional linear model, and can be written alternatively as

$$Y_i = \alpha + \sum_{j=1}^{p} \int X_{ij}(t)\,\beta_j(t)\,dt + \epsilon_i, \qquad (2)$$

where $Y_i$ is the scalar response for observation $i$, $i = 1, \ldots, n$, and the $\epsilon_i$ are independent random errors with mean 0 and variance $\sigma^2$, typically assumed to be normally distributed. For simplicity of exposition we present the main ideas for the continuous response case (2) first, and discuss the modifications required by a generalized response thereafter.
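To make the setup concrete, the following minimal R sketch (our illustration, not code from the paper; all sample sizes, grids and functions are arbitrary choices) simulates data from model (2) with p = 3 functional predictors, of which only the first is associated with the response:

set.seed(1)
n <- 100; p <- 3; N <- 50
tgrid <- seq(0, 1, length.out = N)        # common, equally spaced observation grid
Delta <- tgrid[2] - tgrid[1]              # grid spacing

## p random curves per subject: simple Fourier-type trajectories with mean zero
X <- lapply(1:p, function(j) {
  a <- matrix(rnorm(2 * n), n, 2)
  a[, 1] %o% sin(2 * pi * tgrid) + a[, 2] %o% cos(2 * pi * tgrid)
})

beta1 <- function(t) sin(pi * t)          # the only non-zero coefficient function
eta <- 1 + Delta * (X[[1]] %*% beta1(tgrid))   # alpha + Riemann sum for the integral
Y <- as.numeric(eta) + rnorm(n, sd = 0.5)      # model (2) with normal errors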
The current literature on generalized functional linear models focuses primarily on the estimation of the model components, the coefficient functions $\beta_j(\cdot)$. For example, Goldsmith et al. (2011a), James (2002), James et al. (2009), Marx & Eilers (1999), Müller & Stadtmüller (2005) and Tutz & Gertheiss (2010) discuss estimation and/or inference for the smooth effect functions $\beta_j(\cdot)$, in the case of one or multiple functional predictors. More recently, Kong et al. (2013) and Swihart et al. (2013) discuss hypothesis testing procedures when the number of functional predictors is assumed small. These methods perform well when the response variable is indeed related to most of the functional covariates. However, there is limited literature on the situation where the number of available functional predictors may be unnecessarily large and many of them are unrelated to the response; see for example the data illustrations in Section 5. In these cases, typical association models that relate the response to all the measured predictors lead to unnecessarily complex models that do not accurately describe the data generating process, and have low predictive capability. The focus of this paper is to develop procedures that perform variable selection when the predictors are curves, as well as estimation of the corresponding smooth effects of the selected functional variables.
Variable selection, especially in multivariate statistics, has attracted considerable attention in recent years. In this context, penalized likelihood methods have emerged as highly successful selection techniques in the presence of high dimensional vector-valued predictors; see for example the LASSO (Tibshirani, 1996), the SCAD (Fan & Li, 2001), the adaptive LASSO (Zou, 2006) and OSCAR (Bondell & Reich, 2008). The general idea of the penalized approaches is to impose various constraints on the coefficients, such as an $L_1$ norm or pairwise $L_\infty$ norm penalty, which result in sets of coefficients becoming identically zero. However, extending these ideas to the setting of functional predictors is not straightforward: the complexity of variable selection problems in functional regression is usually much greater than that of standard multivariate variable selection problems. To fix ideas, consider the diffusion tensor imaging tractography data that we use as an illustrative example in Section 5.1. In this data set, there are five tracts and six different measurements recorded for each tract, resulting in 30 functional predictors. The complexity arises because (i) each functional covariate is measured with error, (ii) some of the covariates are measured on irregularly spaced grid points, and (iii) not all the covariates have the same domain. Thus any variable selection procedure intended for this type of data needs to account for all of the above issues.
A direct generalization of multivariate variable selection ideas to setting (2) would require two steps. First, represent the integrals $\int X_{ij}(t)\,\beta_j(t)\,dt \approx \sum_l X_{ij}(t_l)\,\beta_j(t_l)$ using Riemann integration techniques and reformulate the problem in the
typical linear form. The second step is to apply classical variable selection procedures to the high dimensional "artificial" predictors $\{X_{j,l} = X_j(t_l) : j, l\}$. The main difficulty is the expected high correlation among predictors obtained by evaluating a single functional predictor, say $X_j$, at different time points $t_l$, say $\{X_{j,1}, X_{j,2}, \ldots\}$. This challenge can in turn be bypassed by using methods developed for group variable selection, including group LARS and group LASSO (see, for example, Yuan & Lin, 2006), which target highly correlated scalar predictors. These penalties, however, do not impose smoothness on the coefficients $\beta_{j,l} = \beta_j(t_l)$, viewed as functions of $t$. Without smoothness, interpretation of the functional predictors' influence on the response is meaningless.
In this article, we consider a penalized likelihood approach that combines selection of the functional predictors and estimation of the smooth effects for the chosen subset of predictors. Specifically, for the functional regression model (2), the estimates of the coefficient functions $\beta_j(\cdot)$ are the minimizers of

$$\sum_{i=1}^{n} \Big\{Y_i - \alpha - \sum_{j=1}^{p} \int X_{ij}(t)\,\beta_j(t)\,dt\Big\}^2 + \sum_{j=1}^{p} P_{\lambda,\varphi}(\beta_j),$$

where $P_{\lambda,\varphi}$ is a penalty on the coefficient function $\beta_j$, and $\lambda$ and $\varphi$ are global non-negative penalization parameters that control the sparseness and the smoothness of the solution, respectively. More generally, when the response is generalized, the penalized criterion is similar to the above, with the difference that the first term is replaced by minus twice the corresponding log-likelihood function. We propose a penalty function that is inspired by the penalization proposed for high-dimensional additive models (Meier et al., 2009). Thus the proposed approach combines functional data analysis and variable selection techniques for high-dimensional additive models. Recently, Fan & James (2012) (unpublished manuscript) investigated variable selection in linear and non-linear functional regression for a continuous scalar response. They use a 'basis approach' to modeling the coefficient functions $\beta_j(\cdot)$, where the number of basis functions determines the smoothness of $\beta_j(\cdot)$ and is a tuning parameter, and a penalty term constructed on the norm of the entire effect, $\int X_j(t)\,\beta_j(t)\,dt$. In contrast, in this article we propose a 'penalization approach': we model the coefficient functions using a rich function basis, and use a penalty term that explicitly controls the smoothness of $\beta_j(t)$ and the sparseness of the model. Moreover, our procedure is also developed for generalized functional linear models.
The remainder of the paper is organized as follows. Section 2 presents the methodology for simultaneous variable selection and estimation of smooth effects when the predictors are functions observed without noise at a dense grid of points. Section 3 presents the extension to functional predictors corrupted by noise and measured at a dense or sparse grid of points. The proposed methods are illustrated numerically in simulation experiments in Section 4 and on two real data examples in Section 5: the sugar spectra and tractography data sets. Section 6 concludes with a brief discussion.
2. Methodology
2.1. Parameter Modeling
Our approach to estimating the coefficient functions $\beta_j$ is to use a pre-set basis function expansion, such as a B-spline basis or a truncated polynomial basis. Specifically, let $b_j(t) = \{b_{j1}(t), \ldots, b_{jq}(t)\}$ be such a finite basis, and consider the approximation $\beta_j(t) = \sum_{r=1}^{q} \gamma_{jr}\,b_{jr}(t)$, where the $\gamma_{jr}$ are the corresponding basis coefficients. The choice of the basis functions is related to characteristics of the coefficient functions $\beta_j$, such as differentiability, while the number of basis functions is related to the coefficients' smoothness. In particular, a small number of basis functions leads to very smooth solutions, while a large number of basis functions results in very wiggly solutions. In this paper we allow the number of basis functions to be large enough to capture the complexity of the function, and we control the smoothness of the fit using an additional parameter that penalizes the roughness of the fit and is chosen data-adaptively.
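As an illustration of such a pre-set expansion (continuing the simulated example above; the basis dimension q = 15 is an arbitrary 'rich' choice), a cubic B-spline basis can be set up with the base R splines package:

library(splines)
q <- 15                                    # rich basis; roughness is penalized below
B <- bs(tgrid, df = q, intercept = TRUE)   # N x q matrix of basis values b_r(t_l)
## on the grid, beta_j(t) is approximated by B %*% gamma_j for coefficients gamma_j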
When the functional predictors $X_{ij}(\cdot)$ are observed without measurement error and at an equally spaced dense grid of points $\{t_{j1}, \ldots, t_{jN_j}\}$, the integral in (1) or (2) can be approximated by the Riemann sum

$$\int X_{ij}(t)\,\beta_j(t)\,dt \approx \sum_{r} \Big\{\Delta_j \sum_{l} X_{ij}(t_{jl})\,b_{jr}(t_{jl})\Big\}\,\gamma_{jr} = Z_{ij}^{\top}\gamma_j,$$

where $\gamma_j = (\gamma_{j1}, \ldots, \gamma_{jq})^{\top}$, $Z_{ij} = (Z_{ij1}, \ldots, Z_{ijq})^{\top}$, $Z_{ijr} = \Delta_j \sum_{l} X_{ij}(t_{jl})\,b_{jr}(t_{jl})$, and $\Delta_j = t_{jl} - t_{j,l-1}$ denotes the
distance between two adjacent measurement points. Thus, model (2) can be approximated by a typical linear regression model $Y_i = \alpha + \sum_{j=1}^{p} Z_{ij}^{\top}\gamma_j + \epsilon_i$, where the $Z_{ij}$ are known quantities, and $\alpha$ and the $\gamma_j$ are unknown regression coefficients. Nevertheless, using a classical penalty function on the model parameters $\gamma_{jr}$, in particular a sparseness-inducing penalty, is not directly applicable in this setting because the parameters $\gamma_{jr}$ have a preset grouping structure. Group variable selection, on the other hand, is an attractive method to impose sparseness. However, it implicitly assumes that the smoothness of the coefficient functions $\beta_j$ is specifically controlled by the number of basis functions. For example, Matsui & Konishi (2011) and Lian (2013) discussed a groupwise SCAD penalty, employing Gaussian basis functions or the truncated eigenbasis of the functional covariates, respectively, and controlling the smoothness of the corresponding functional coefficient by using only a small number of these basis functions; Zhu & Cox (2009) used a simple group lasso penalty and a very small number of orthonormal basis functions. When using only a small number of basis functions, however, the shape of the fitted functions is strongly influenced by the specific number, type and placement of the basis functions. Our paper relaxes this limitation and allows the coefficient functions to be approximated using a large number of basis functions. This makes the approach more flexible, but the basis expansion alone does not enforce the functions' inherent smoothness. Therefore a different type of penalty is required.
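Continuing the sketch above, the vectors $Z_{ij}$ can be collected into $n \times q$ matrices and stacked column-wise (again our illustration, using a common grid and basis for all $j$):

Zj <- lapply(X, function(Xj) Delta * (Xj %*% B))   # each Z_j is n x q, entries Z_ijr
Z  <- do.call(cbind, Zj)                           # n x (p*q) stacked design matrix
## model (2) is then approximately Y = alpha + Z %*% c(gamma_1, ..., gamma_p) + error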
2.2. Penalized Estimation
We consider a so-called sparsity-smoothness penalty, which results in simultaneous estimation of the parameter functions and sparseness of the solution when the covariates are curves. The proposed penalty function was introduced by Meier et al. (2009) for variable selection in high-dimensional additive modeling. Specifically, let

$$P_{\lambda,\varphi}(\beta_j) = \lambda\big(\|\beta_j\|^2 + \varphi\,\|\beta_j''\|^2\big)^{1/2}, \qquad (3)$$
where $\|\beta_j\|^2 = \int_{\mathcal{D}_j} \{\beta_j(t)\}^2\,dt$ is the squared $L_2$ norm of $\beta_j$ and $\beta_j''(t) = \partial^2 \beta_j(t)/\partial t^2$. Penalties such as (3) ensure that estimates of the quantity appearing inside the square root may be set to zero (see, e.g., Yuan & Lin (2006), Meier et al. (2008)). As $\varphi \geq 0$, this is equivalent to setting $\beta_j(t) = 0$ for all $t$, which implies that predictor $X_j$ is excluded from model (1). If $\varphi > 0$, the second derivative of $\beta_j(t)$ is also penalized, which ensures smoothing of non-zero coefficient functions. Although $\beta_j(t)$ is assumed to be a smooth function, fitted non-zero curves may become rough if the penalty $\|\beta_j''\|^2$ on the second derivative is not included in (3). As rough coefficient functions are hard to interpret, $\varphi > 0$ is strongly recommended. The extent of smoothing is controlled by the value of $\varphi$.
Using a rich pre-set basis function expansion for $\beta_j$, the corresponding penalty can be represented as

$$P_{\lambda,\varphi}(\beta_j) = \lambda\big(\gamma_j^{\top}(\Psi_j + \varphi\,\Omega_j)\,\gamma_j\big)^{1/2},$$

where $\Psi_j$ is the $q \times q$ matrix with $(r,k)$ element $(\Psi_j)_{rk} = \int b_{jr}(t)\,b_{jk}(t)\,dt$, $r,k = 1,\ldots,q$, and $\Omega_j$ is the $q \times q$ matrix with $(r,k)$ element $(\Omega_j)_{rk} = \int b_{jr}''(t)\,b_{jk}''(t)\,dt$, $r,k = 1,\ldots,q$. Furthermore, the penalty can be re-written in a more convenient way as $P_{\lambda,\varphi}(\beta_j) = \lambda\big(\gamma_j^{\top} K_{\varphi,j}\,\gamma_j\big)^{1/2}$, which is a general group lasso-type penalty (Yuan & Lin, 2006) on the vector $\gamma_j$, where $K_{\varphi,j} = \Psi_j + \varphi\,\Omega_j$ is a symmetric and positive definite $q \times q$ matrix.
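The matrices $\Psi_j$ and $\Omega_j$ can be approximated numerically. The sketch below (ours; it rebuilds an equally spaced cubic B-spline basis of dimension q, whose knot placement may differ slightly from that chosen internally by bs()) evaluates the basis and its second derivatives on a fine grid with splines::splineDesign and applies a Riemann approximation to the integrals:

tt  <- seq(0, 1, length.out = 1000); dtt <- tt[2] - tt[1]
inner <- seq(0, 1, length.out = q - 2)[-c(1, q - 2)]    # q - 4 interior knots
knots <- c(rep(0, 4), inner, rep(1, 4))                 # order-4 (cubic) B-splines
B0 <- splineDesign(knots, tt, ord = 4)                               # b_r(t)
B2 <- splineDesign(knots, tt, ord = 4, derivs = rep(2, length(tt)))  # b_r''(t)
Psi   <- crossprod(B0) * dtt      # (Psi)_rk   ~ int b_r(t) b_k(t) dt
Omega <- crossprod(B2) * dtt      # (Omega)_rk ~ int b_r''(t) b_k''(t) dt
phi <- 1                          # illustrative value of the smoothing parameter
K   <- Psi + phi * Omega          # K_{phi,j}, here identical for all j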
Consequently, when the functional linear model (2) is assumed, the estimates of the intercept $\alpha$ and the coefficient functions $\beta_j$ are obtained as $\hat\alpha$ and $\hat\beta_j(t) = \sum_{r=1}^{q} b_{jr}(t)\,\hat\gamma_{jr}$, for $j = 1,\ldots,p$, where $\hat\alpha$ and the $\hat\gamma_j$ are the minimizers of

$$\sum_{i=1}^{n}\Big(Y_i - \alpha - \sum_{j=1}^{p} Z_{ij}^{\top}\gamma_j\Big)^2 + \lambda\sum_{j=1}^{p}\big(\gamma_j^{\top} K_{\varphi,j}\,\gamma_j\big)^{1/2}, \qquad (4)$$

where $\lambda$ is a sparseness tuning parameter and $\varphi$ is a smoothing tuning parameter. If $K_{\varphi,j} = I_q$ for all $j = 1,\ldots,p$, where $I_q$ denotes the $q \times q$ identity matrix, then the estimation in (4) resembles that of the ordinary group lasso, for which built-in software already exists. More generally, by using an appropriate re-parametrization one can still employ existing software, as we show next; the ideas were also pointed out by Meier et al. (2009) in a related context.
Let $K_{\varphi,j} = R_{\varphi,j}^{\top} R_{\varphi,j}$ be the Cholesky decomposition, where $R_{\varphi,j}$ is a non-singular triangular matrix, and define $\tilde\gamma_j = R_{\varphi,j}\,\gamma_j$ and $\tilde Z_{ij} = (R_{\varphi,j}^{-1})^{\top} Z_{ij}$. Then, the penalized criterion (4) reduces to

$$\sum_{i=1}^{n}\Big(Y_i - \alpha - \sum_{j=1}^{p} \tilde Z_{ij}^{\top}\tilde\gamma_j\Big)^2 + \lambda\sum_{j=1}^{p} \|\tilde\gamma_j\|,$$

where $\|\tilde\gamma_j\|$ is the Euclidean norm in $\mathbb{R}^q$. Thus, for a given value of the smoothness parameter $\varphi$, the minimizers of (4) can be formulated as the parameter estimates in an appropriate linear model, via a penalized likelihood criterion using the ordinary group lasso penalty (Meier et al., 2008; Meier, 2009). As a result, for fixed $\varphi$ and $\lambda$, the estimates $\hat{\tilde\gamma}_j$ can be computed using any existing software that can accommodate a group lasso penalty, for example, the R package grplasso. In practice, of course, the appropriate sparseness and smoothing parameters $\lambda$ and $\varphi$ are unknown and need to be estimated from the data, for example by cross-validation. This is discussed in detail in Section 2.4.

When the response is generalized and a generalized functional linear model as in (1) is assumed, the estimation procedure is similar to the one above, with the exception that the first quadratic term of (4) is replaced by minus twice the log-likelihood function corresponding to $Y_i$ as specified by (1).
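A sketch of this route (untested; the grplasso interface is assumed from the package documentation, and the matrix K from the previous snippet is reused for every j): R's chol() returns the upper triangular factor R with K = R'R, so the re-parametrization takes the form below.

Rch <- chol(K)                        # K = t(Rch) %*% Rch
Zt  <- do.call(cbind, lapply(Zj, function(Zjj) Zjj %*% solve(Rch)))  # Z_j %*% R^{-1}

library(grplasso)
index <- c(NA, rep(1:p, each = q))    # one group per predictor; NA = unpenalized
fit <- grplasso(x = cbind(1, Zt), y = Y, index = index,
                lambda = 10, model = LinReg())
## back-transform each selected group, gamma_j = solve(Rch, gamma_tilde_j),
## and recover beta_j(t) as B %*% gamma_j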
2.3. Adaptive Penalized Estimation
Next we discuss an alternative penalty function based on adaptive weights for the penalized estimation criterion. We define the adaptive penalization scheme, similar to the adaptive lasso (Zou, 2006), by introducing weights $w_j$ and $v_j$ in the penalty function (3). Specifically, the adaptive penalization approach uses the penalty

$$P_{\lambda,\varphi}(\beta_j) = \lambda\big(w_j\|\beta_j\|^2 + \varphi\,v_j\|\beta_j''\|^2\big)^{1/2}, \qquad (5)$$
where the weights $w_j$ and $v_j$ are chosen in a data-adaptive way (Meier et al., 2009). The choice of weights is meant to reflect prior beliefs about the true parameter functions and to allow for different shrinkage and smoothness for the different covariates. One possibility for choosing the weights is to use initial parameter estimates based on smoothing alone, without sparseness assumptions. Consider a generalized functional linear model with multiple functional covariates, and denote by $\breve\beta_j$ the estimated coefficient functions $\beta_j$, obtained, for example, by using the approach described in Goldsmith et al. (2011a) and implemented in the R package refund (Crainiceanu et al., 2012). Then, the adaptive weights can be defined as $w_j = 1/\|\breve\beta_j\|$ and $v_j = 1/\|\breve\beta_j''\|$. Adaptive estimation has been shown to reduce the number of false positives considerably in penalty-based variable selection; see Meier et al. (2009) or Gertheiss & Tutz (2012), to name a few. As illustrated by the simulation studies in Section 4, adaptive estimation with the choice of weights above typically leads to improved selection performance in the functional model, too. Given that the computation of the initial estimates $\breve\beta_j$ is not time-consuming, the computational burden for the adaptive penalty (5) does not change much compared with the standard (non-adaptive) penalty (3).
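A sketch of the weight computation (ours): if gamma0 denotes a hypothetical $q$-vector of initial basis coefficients for predictor $j$ from such a smoothness-only fit, the required norms follow from the matrices Psi and Omega above, since $\|\breve\beta_j\|^2 = \breve\gamma_j^{\top}\Psi_j\breve\gamma_j$ and $\|\breve\beta_j''\|^2 = \breve\gamma_j^{\top}\Omega_j\breve\gamma_j$:

## gamma0: hypothetical initial basis coefficients for predictor j (q-vector)
wj <- 1 / sqrt(drop(t(gamma0) %*% Psi   %*% gamma0))   # 1 / ||breve beta_j||
vj <- 1 / sqrt(drop(t(gamma0) %*% Omega %*% gamma0))   # 1 / ||breve beta_j''||
K_ada <- wj * Psi + phi * vj * Omega   # K_{phi,j} implied by the adaptive penalty (5)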
2.4. Choosing the Tuning Parameters
So far, the sparseness parameter $\lambda$ and the smoothness parameter $\varphi$ involved in the penalties (3) and (5) were assumed known. In practice, however, these parameters need to be selected. We consider $K$-fold cross-validation to select $\lambda$ and $\varphi$. Specifically, the original sample is randomly divided into $K$ roughly equal-sized subsamples. For each $k = 1,\ldots,K$, the $k$th subsample is retained as validation data for evaluating the prediction error of the model, and the remaining $K - 1$ subsamples are used as training data. The $K$ estimates of prediction error are averaged; the criterion selects the values of the tuning parameters that minimize the overall prediction error. Although the optimal value of $K$ is still an open problem, typical values of $K$ used in the literature are 5 and 10 (see also Hastie et al. (2009)).
As a measure of prediction accuracy, we use the predictive deviance $D = -2\phi \sum_{i=1}^{n}\{l_i(\hat\mu_i) - l_i(Y_i)\}$, with $l_i(\hat\mu_i)$ denoting the individual log-likelihood (for observation $i$) at the fitted mean value $\hat\mu_i$, and $l_i(Y_i)$ being the corresponding log-likelihood with $\mu_i$ replaced by the observed value $Y_i$ (i.e., the maximum likelihood achievable). For the functional linear model (2) with normal $\epsilon_i$, the predictive deviance simplifies to the sum of squared errors $\sum_i (Y_i - \hat\mu_i)^2$.
References

Basser, P. J., Mattiello, J. & Le Bihan, D. (1994), 'MR diffusion tensor spectroscopy and imaging', Biophysical Journal, 66, 259–267.

Fan, J. & Li, R. (2001), 'Variable selection via nonconcave penalized likelihood and its oracle properties', Journal of the American Statistical Association, 96, 1348–1360.

Tibshirani, R. (1996), 'Regression shrinkage and selection via the lasso', Journal of the Royal Statistical Society, Series B, 58, 267–288.

Yuan, M. & Lin, Y. (2006), 'Model selection and estimation in regression with grouped variables', Journal of the Royal Statistical Society, Series B, 68, 49–67.

Zou, H. (2006), 'The adaptive lasso and its oracle properties', Journal of the American Statistical Association, 101, 1418–1429.