
A Modi ed Principal Component Technique
Based on the LASSO
Ian T. JOLLIFFE , Nickolay T. T RENDAFILOV , and Mudassir UDDIN
In many multivariate statistical techniques, a set of linear functions of the original p variables is produced. One of the more difficult aspects of these techniques is the interpretation of the linear functions, as these functions usually have nonzero coefficients on all p variables. A common approach is to effectively ignore (treat as zero) any coefficients less than some threshold value, so that the function becomes simple and the interpretation becomes easier for the users. Such a procedure can be misleading. There are alternatives to principal component analysis which restrict the coefficients to a smaller number of possible values in the derivation of the linear functions, or replace the principal components by "principal variables." This article introduces a new technique, borrowing an idea proposed by Tibshirani in the context of multiple regression where similar problems arise in interpreting regression equations. This approach is the so-called LASSO, the "least absolute shrinkage and selection operator," in which a bound is introduced on the sum of the absolute values of the coefficients, and in which some coefficients consequently become zero. We explore some of the properties of the new technique, both theoretically and using simulation studies, and apply it to an example.

Key Words: Interpretation; Principal component analysis; Simplification.
1. INTRODUCTION
Principal component analysis (PCA), like several other multivariate statistical techniques, replaces a set of p measured variables by a small set of derived variables. The derived variables, the principal components, are linear combinations of the p variables. The dimension reduction achieved by PCA is especially useful if the components can be readily interpreted, and this is sometimes the case; see, for example, Jolliffe (2002, chap. 4). In other examples, particularly where a component has nontrivial loadings on a substantial proportion of the p variables, interpretation can be difficult, detracting from the value of the analysis.

Ian T. Jolliffe is Professor, Department of Mathematical Sciences, University of Aberdeen, Meston Building, King's College, Aberdeen AB24 3UE, Scotland, UK (E-mail: itj@maths.abdn.ac.uk). Nickolay T. Trendafilov is Senior Lecturer, Faculty of Computing, Engineering and Mathematical Sciences, University of the West of England, Bristol, BS16 1QY, UK (E-mail: Nickolay.Trendafilov@uwe.ac.uk). Mudassir Uddin is Associate Professor, Department of Statistics, University of Karachi, Karachi-75270, Pakistan (E-mail: mudassir2000@hotmail.com).

© 2003 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America
Journal of Computational and Graphical Statistics, Volume 12, Number 3, Pages 531–547
DOI: 10.1198/1061860032148
A number of methods are available to aid interpretation. Rotation, which is commonplace in factor analysis, can be applied to PCA, but has its drawbacks (Jolliffe 1989, 1995). A frequently used informal approach is to ignore all loadings smaller than some threshold absolute value, effectively treating them as zero. This can be misleading (Cadima and Jolliffe 1995). A more formal way of making some of the loadings zero is to restrict the allowable loadings to a small set of values, for example, −1, 0, 1 (Hausman 1982). Vines (2000) described a variation on this theme. One further strategy is to select a subset of the variables themselves which satisfy similar optimality criteria to the principal components, as in McCabe's (1984) principal variables.
This article introduces a new technique which shares an idea central to both Hausman's (1982) and Vines's (2000) work. This idea is that we choose linear combinations of the measured variables which successively maximize variance, as in PCA, but we impose extra constraints which sacrifice some variance in order to improve interpretability. In our technique the extra constraint is in the form of a bound on the sum of the absolute values of the loadings in that component. This type of bound has been used in regression (Tibshirani 1996), where similar problems of interpretation occur, and is known there as the LASSO (least absolute shrinkage and selection operator). As with the methods of Hausman (1982) and Vines (2000), and unlike rotation, our technique usually produces some exactly zero loadings in the components. In contrast to Hausman (1982) and Vines (2000), it does not restrict the nonzero loadings to a discrete set of values. This article shows, through simulations and an example, that the new technique is a valuable additional tool for exploring the structure of multivariate data.
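To make the constrained problem concrete (in the notation established in Section 2, where $R$ is the sample correlation matrix and $a_k$ the vector of loadings of the $k$th component), the technique, later called SCoTLASS, seeks for each $k$

$$
\max_{a_k}\; a_k' R a_k
\quad\text{subject to}\quad
a_k' a_k = 1,\qquad
a_h' a_k = 0 \;\;(h < k),\qquad
\sum_{j=1}^{p} \lvert a_{kj}\rvert \le t,
$$

where $t$ is a tuning constant: $t = \sqrt{p}$ leaves the LASSO bound inactive and recovers ordinary PCA, while $t = 1$ forces exactly one nonzero loading, since $\lVert a\rVert_1 \ge \lVert a\rVert_2 = 1$ with equality only for a single nonzero entry.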
Section 2 establishes the notation and terminology of PCA, and introduces an example in which interpretation of principal components is not straightforward. The most usual approach to simplifying interpretation, the rotation of PCs, is shown to have drawbacks. Section 3 introduces the new technique and describes some of its properties. Section 4 revisits the example of Section 2, and demonstrates the practical usefulness of the technique. A simulation study, which investigates the ability of the technique to recover known underlying structures in a dataset, is summarized in Section 5. The article ends with further discussion in Section 6, including some modifications, complications, and open questions.
2. A MOTIVATING EXAMPLE
Consider the classic example, first introduced by Jeffers (1967), in which a PCA was done on the correlation matrix of 13 physical measurements, listed in Table 1, made on a sample of 180 pitprops cut from Corsican pine timber.
Let $x_i$ be the vector of 13 variables for the $i$th pitprop, where each variable has been standardized to have unit variance. What PCA does, when based on the correlation matrix, is to find linear functions $a_1'x, a_2'x, \ldots, a_p'x$ which successively have maximum sample variance, subject to $a_h'a_k = 0$ for $k \ge 2$ and $h < k$. In addition, a normalization constraint

Table 1. Definitions of Variables in Jeffers' Pitprop Data

Variable    Definition
$x_1$       Top diameter in inches
$x_2$       Length in inches
$x_3$       Moisture content, % of dry weight
$x_4$       Specific gravity at time of test
$x_5$       Oven-dry specific gravity
$x_6$       Number of annual rings at top
$x_7$       Number of annual rings at bottom
$x_8$       Maximum bow in inches
$x_9$       Distance of point of maximum bow from top in inches
$x_{10}$    Number of knot whorls
$x_{11}$    Length of clear prop from top in inches
$x_{12}$    Average number of knots per whorl
$x_{13}$    Average diameter of the knots in inches
$a_k'a_k = 1$ is necessary to get a bounded solution. The derived variable $a_k'x$ is the $k$th principal component (PC). It turns out that $a_k$, the vector of coefficients or loadings for the $k$th PC, is the eigenvector of the sample correlation matrix $R$ corresponding to the $k$th largest eigenvalue $l_k$. In addition, the sample variance of $a_k'x$ is equal to $l_k$. Because of the successive maximization property, the first few PCs will often account for most of the sample variation in all the standardized measured variables. In the pitprop example, Jeffers (1967) was interested in the first six PCs, which together account for 87% of the total variance. The loadings in each of these six components are given in Table 2, together with the individual and cumulative percentage of variance in all 13 variables accounted for by $1, 2, \ldots, 6$ PCs.
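As an illustrative sketch (not the authors' code; the function and variable names are our own), the correlation-based PCA just described amounts to an eigendecomposition of $R$:

```python
import numpy as np

def correlation_pca(X):
    """Correlation-matrix PCA: returns eigenvalues l_k, loading vectors a_k
    (columns of A), component scores a_k' x, and percentages of variance."""
    # standardize each variable to zero mean and unit variance
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(Z, rowvar=False)        # sample correlation matrix R
    eigvals, eigvecs = np.linalg.eigh(R)    # eigendecomposition (ascending order)
    order = np.argsort(eigvals)[::-1]       # reorder by decreasing eigenvalue l_k
    l, A = eigvals[order], eigvecs[:, order]
    scores = Z @ A                          # kth column holds the kth PC, a_k' x
    pct = 100 * l / l.sum()                 # variance (%), as in Table 2
    return l, A, scores, pct
```

For Jeffers' data, X would be the 180 × 13 matrix of pitprop measurements, and the first six entries of pct should cumulate to about 87%.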
Table 2. Loadings for Correlation PCA for Jeffers' Pitprop Data

                              Component
Variable       (1)      (2)      (3)      (4)      (5)      (6)
$x_1$        0.404    0.212   −0.219   −0.027   −0.141   −0.086
$x_2$        0.406    0.180   −0.245   −0.025   −0.188   −0.111
$x_3$        0.125    0.546    0.114    0.015    0.433    0.120
$x_4$        0.173    0.468    0.328    0.010    0.361   −0.090
$x_5$        0.057   −0.138    0.493    0.254   −0.122   −0.560
$x_6$        0.284   −0.002    0.476   −0.153   −0.269    0.032
$x_7$        0.400   −0.185    0.261   −0.125   −0.176    0.030
$x_8$        0.294   −0.198   −0.222    0.294    0.203    0.103
$x_9$        0.357    0.010   −0.202    0.132   −0.117    0.103
$x_{10}$     0.379   −0.252   −0.120   −0.201    0.173   −0.019
$x_{11}$    −0.008    0.187    0.021    0.805   −0.302    0.178
$x_{12}$    −0.115    0.348    0.066   −0.303   −0.537    0.371
$x_{13}$    −0.112    0.304   −0.352   −0.098   −0.209   −0.671
Simplicity factor (varimax)
             0.059    0.103    0.082    0.397    0.086    0.266
Variance (%)
              32.4     18.2     14.4      8.9      7.0      6.3
Cumulative variance (%)
              32.4     50.7     65.0     74.0     80.9     87.2

Table 3. Loadings for Rotated Correlation PCA, Using the Varimax Criterion, for Jeffers' Pitprop Data

                              Component
Variable       (1)      (2)      (3)      (4)      (5)      (6)
$x_1$       −0.019    0.074    0.043   −0.027   −0.519   −0.077
$x_2$       −0.018    0.015    0.048   −0.024   −0.540   −0.102
$x_3$       −0.024    0.705   −0.128    0.003   −0.059    0.107
$x_4$        0.029    0.689    0.112    0.001    0.014   −0.087
$x_5$        0.258    0.009    0.477    0.218    0.205   −0.524
$x_6$       −0.185    0.061    0.604   −0.005   −0.032    0.012
$x_7$        0.031   −0.069    0.512   −0.102   −0.151    0.092
$x_8$        0.440   −0.042   −0.072    0.083   −0.221    0.239
$x_9$        0.097   −0.058    0.045    0.094   −0.408    0.141
$x_{10}$     0.271   −0.054    0.129   −0.367   −0.216    0.135
$x_{11}$     0.057   −0.022   −0.029    0.882   −0.137    0.075
$x_{12}$    −0.776   −0.056    0.091    0.079   −0.123    0.145
$x_{13}$    −0.120   −0.049   −0.280   −0.077   −0.269   −0.748
Simplicity factor (varimax)
             0.362    0.428    0.199    0.595    0.131    0.343
Variance (%)
              13.0     14.6     18.4      9.7     23.9      7.6
Cumulative variance (%)
              13.0     27.6     46.0     55.7     79.6     87.2
PCs are easiest to interpret if the pattern of loadings is clear-cut, with a few large (absolute) values and many small loadings in each PC. Although Jeffers (1967) makes an attempt to interpret all six components, some are, to say the least, messy, and he ignores some intermediate loadings. For example, PC2 has its largest loadings on $x_3$ and $x_4$, with small loadings on $x_6$ and $x_9$, but a whole range of intermediate values on other variables.
A traditional way to simplify loadings is by rotation. If $A$ is the $(13 \times 6)$ matrix whose $k$th column is $a_k$, then $A$ is post-multiplied by a matrix $T$ to give rotated loadings $B = AT$. If $b_k$ is the $k$th column of $B$, then $b_k'x$ is the $k$th rotated component. The matrix $T$ is chosen so as to optimize some simplicity criterion. Various criteria have been proposed, all of which attempt to create vectors of loadings whose elements are close to zero or far from zero, with few intermediate values. The idea is that each variable should be either clearly important or clearly unimportant in a rotated component, with as few cases as possible of borderline importance. Varimax is the most widely used rotation criterion and, like most other such criteria, it tends to drive at least some of the loadings in each component towards zero. This is not the only possible type of simplicity. A component whose loadings are all roughly equal is easy to interpret but will be avoided by most standard rotation criteria. It is difficult to envisage any criterion which could encompass all possible types of simplicity, and we concentrate here on simplicity as defined by varimax.
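For reference, a standard per-component form of the varimax measure, which reproduces the simplicity factors reported in Tables 2 and 3 (with $b_{jk}$ the loading of variable $j$ on component $k$), is

$$ s_k \;=\; \frac{p\sum_{j=1}^{p} b_{jk}^{4} \;-\; \Big(\sum_{j=1}^{p} b_{jk}^{2}\Big)^{2}}{p-1}. $$

This equals 0 when all loadings are equal in absolute value and 1 when exactly one loading is nonzero; up to a constant factor, varimax chooses $T$ to maximize the sum of these quantities over components.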
Table 3 gives the rotated loadings for six components in the correlation PCA of the pitprop data, together with the percentage of total variance accounted for by each rotated PC (RPC). The rotation criterion used in Table 3 is varimax (Krzanowski and Marriott 1995, p. 138), which is the most frequent choice (often the default in software), but other criteria give similar results. Varimax rotation aims to maximize the sum, over rotated components, of a criterion which takes values between zero and one. A value of zero occurs when all loadings in the component are equal, whereas a component with only one nonzero loading produces a value of unity. This criterion, or "simplicity factor," is given for each component in Tables 2 and 3, and it can be seen that its values are larger for most of the rotated components than for their unrotated counterparts.
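As a quick numerical check of the formula above against the tables (a sketch with our own function name):

```python
import numpy as np

def varimax_simplicity(b):
    """Simplicity factor of one component: 0 if all loadings are equal
    in absolute value, 1 if exactly one loading is nonzero."""
    b = np.asarray(b, dtype=float)
    p = b.size
    return (p * np.sum(b**4) - np.sum(b**2)**2) / (p - 1)

# loadings of component (4) from Table 2 (unrotated)
a4 = [-0.027, -0.025, 0.015, 0.010, 0.254, -0.153, -0.125,
      0.294, 0.132, -0.201, 0.805, -0.303, -0.098]
print(round(varimax_simplicity(a4), 3))  # 0.397, as reported in Table 2

# loadings of component (4) from Table 3 (varimax-rotated)
b4 = [-0.027, -0.024, 0.003, 0.001, 0.218, -0.005, -0.102,
      0.083, 0.094, -0.367, 0.882, 0.079, -0.077]
print(round(varimax_simplicity(b4), 3))  # 0.595, as reported in Table 3
```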

References

Breiman, L. (1995), "Better Subset Regression Using the Nonnegative Garrote," Technometrics, 37, 373–384.

Fu, W. J. (1998), "Penalized Regressions: The Bridge Versus the Lasso," Journal of Computational and Graphical Statistics, 7, 397–416.

Goffe, W. L., Ferrier, G. D., and Rogers, J. (1994), "Global Optimization of Statistical Functions with Simulated Annealing," Journal of Econometrics, 60, 65–99.

Jolliffe, I. T. (2002), Principal Component Analysis (2nd ed.), New York: Springer-Verlag.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.
Frequently Asked Questions (14)
Q1. What are the contributions mentioned in the paper "A modified principal component technique based on the lasso" ?

In this paper, the authors propose a new technique for principal component analysis (PCA) in which a bound is imposed on the sum of the absolute values of the loadings in a component; this type of bound is known as the LASSO (least absolute shrinkage and selection operator).

The explanation of this anomaly lies in the projected gradient method used for the numerical solution of the problem, which approximates the LASSO constraint with a certain smooth function, and thus the zero loadings produced may also be approximate.

In their technique the extra constraint is in the form of a bound on the sum of the absolute values of the loadings in that component. 

Given a vector $l$ of positive real numbers and an orthogonal matrix $A$, the authors can attempt to find a covariance matrix or correlation matrix whose eigenvalues are the elements of $l$ and whose eigenvectors are the columns of $A$.

The solution of this modified maximization problem is then found as an ascent gradient vector flow onto the $p$-dimensional unit sphere, following the standard projected gradient formalism (Chu and Trendafilov 2001; Helmke and Moore 1994).

Other rotation criteria, such as quartimax, can in theory find uniform vectors of loadings, but they were tried and also found to be unsuccessful in their simulations.

To achieve an appropriate solution, a number of parameters of the projected gradient method (e.g., starting points, absolute and relative tolerances) also need to be defined.

An equivalent way of deriving LASSO estimates is to minimize the residual sum of squares with the addition of a penalty function based on $\sum_{j=1}^{p} \lvert\beta_j\rvert$.
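Written out in the standard penalized form (our notation, following Tibshirani 1996), with responses $y_i$, predictors $x_{ij}$, and a penalty parameter $\lambda \ge 0$ corresponding to the bound $t$:

$$ \hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^{2} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}. $$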

This is because SCoTLASS is implemented subject to an extra restriction on PCA, so the authors lose the advantage of calculation via the singular value decomposition, which is what makes the ordinary PCA algorithm fast.

One can see that the solution with $t = 1.50$ contains a total of 56 loadings with magnitude less than 0.005, compared to 42 in the case $t = 1.75$.

It is preferred in many respects to rotated principal components, as a means of simplifying interpretation compared to principal component analysis. 

An alternative approach to the simulation study would be to replace the near-zero loadings by exact zeros and the nearly equal loadings by exact equalities.

These problems may occur due to the instability of the regression coefficients in the presence of collinearity, or simply because of the large number of variables included in the regression equation.

SCoTLASS looks for simple sources of variation and, like PCA, aims for high variance, but because of simplicity considerations the simplified components can, in theory, be moderately different from the PCs.
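To illustrate the kind of computation involved, the following is a heavily simplified sketch under our own assumptions, not the authors' projected gradient algorithm: it approximates the first SCoTLASS component by maximizing $a'Ra$ on the unit sphere, enforcing the $L_1$ bound softly through a penalty, and using the smooth surrogate $\sqrt{a_j^2 + \epsilon}$ for $|a_j|$ in the spirit of the smooth approximation mentioned above.

```python
import numpy as np
from scipy.optimize import minimize

def scotlass_first_component(R, t, eps=1e-6, rho=100.0, n_starts=10, seed=0):
    """Heuristic sketch: maximize a'Ra subject to a'a = 1 and sum|a_j| <= t.
    The L1 bound is enforced softly via a quadratic penalty on its violation,
    with sqrt(a_j^2 + eps) as a smooth stand-in for |a_j|."""
    p = R.shape[0]
    rng = np.random.default_rng(seed)

    def neg_objective(a):
        a = a / np.linalg.norm(a)            # stay on the unit sphere
        l1 = np.sum(np.sqrt(a**2 + eps))     # smooth surrogate for sum |a_j|
        return -(a @ R @ a) + rho * max(0.0, l1 - t) ** 2

    best_a, best_var = None, -np.inf
    for _ in range(n_starts):                # multistart: the problem is nonconvex
        a0 = rng.standard_normal(p)
        res = minimize(neg_objective, a0 / np.linalg.norm(a0), method="BFGS")
        a = res.x / np.linalg.norm(res.x)
        var = a @ R @ a
        if np.sum(np.abs(a)) <= t + 1e-3 and var > best_var:
            best_a, best_var = a, var
    return best_a, best_var
```

With $t = \sqrt{p}$ this should recover the first PC; shrinking $t$ towards 1 trades variance for loadings that are driven closer to zero.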