
A Modi ed Principal Component Technique
Based on the LASSO
Ian T. JOLLIFFE , Nickolay T. T RENDAFILOV , and Mudassir UDDIN
In many multivariate statistical techniques, a set of linear functions of the original p variables is produced. One of the more difficult aspects of these techniques is the interpretation of the linear functions, as these functions usually have nonzero coefficients on all p variables. A common approach is to effectively ignore (treat as zero) any coefficients less than some threshold value, so that the function becomes simple and the interpretation becomes easier for the users. Such a procedure can be misleading. There are alternatives to principal component analysis which restrict the coefficients to a smaller number of possible values in the derivation of the linear functions, or replace the principal components by "principal variables." This article introduces a new technique, borrowing an idea proposed by Tibshirani in the context of multiple regression where similar problems arise in interpreting regression equations. This approach is the so-called LASSO, the "least absolute shrinkage and selection operator," in which a bound is introduced on the sum of the absolute values of the coefficients, and in which some coefficients consequently become zero. We explore some of the properties of the new technique, both theoretically and using simulation studies, and apply it to an example.

Key Words: Interpretation; Principal component analysis; Simplification.
1. INTRODUCTION
Principal component analysis (PCA), like several other multivariate statistical techniques, replaces a set of p measured variables by a small set of derived variables. The derived variables, the principal components, are linear combinations of the p variables. The dimension reduction achieved by PCA is especially useful if the components can be readily interpreted, and this is sometimes the case; see, for example, Jolliffe (2002, chap. 4). In other examples, particularly where a component has nontrivial loadings on a substantial proportion of the p variables, interpretation can be difficult, detracting from the value of the analysis.

Ian T. Jolliffe is Professor, Department of Mathematical Sciences, University of Aberdeen, Meston Building, King's College, Aberdeen AB24 3UE, Scotland, UK (E-mail: itj@maths.abdn.ac.uk). Nickolay T. Trendafilov is Senior Lecturer, Faculty of Computing, Engineering and Mathematical Sciences, University of the West of England, Bristol, BS16 1QY, UK (E-mail: Nickolay.Trendafilov@uwe.ac.uk). Mudassir Uddin is Associate Professor, Department of Statistics, University of Karachi, Karachi-75270, Pakistan (E-mail: mudassir2000@hotmail.com).

© 2003 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America
Journal of Computational and Graphical Statistics, Volume 12, Number 3, Pages 531–547
DOI: 10.1198/1061860032148
A number of methods are available to aid interpretation. Rotation, which is commonplace in factor analysis, can be applied to PCA, but has its drawbacks (Jolliffe 1989, 1995). A frequently used informal approach is to ignore all loadings smaller than some threshold absolute value, effectively treating them as zero. This can be misleading (Cadima and Jolliffe 1995). A more formal way of making some of the loadings zero is to restrict the allowable loadings to a small set of values, for example, −1, 0, 1 (Hausman 1982). Vines (2000) described a variation on this theme. One further strategy is to select a subset of the variables themselves which satisfy similar optimality criteria to the principal components, as in McCabe's (1984) principal variables.
This article introduces a new technique which shares an idea central to both Hausman's (1982) and Vines's (2000) work. This idea is that we choose linear combinations of the measured variables which successively maximize variance, as in PCA, but we impose extra constraints which sacrifice some variance in order to improve interpretability. In our technique the extra constraint is in the form of a bound on the sum of the absolute values of the loadings in that component. This type of bound has been used in regression (Tibshirani 1996), where similar problems of interpretation occur, and is known there as the LASSO (least absolute shrinkage and selection operator). As with the methods of Hausman (1982) and Vines (2000), and unlike rotation, our technique usually produces some exactly zero loadings in the components. In contrast to Hausman (1982) and Vines (2000), it does not restrict the nonzero loadings to a discrete set of values. This article shows, through simulations and an example, that the new technique is a valuable additional tool for exploring the structure of multivariate data.
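To make the constrained problem concrete (in the notation established in Section 2, where $R$ is the sample correlation matrix and $a_k$ the vector of loadings of the $k$th component), the technique, later called SCoTLASS, seeks for each $k$

$$
\max_{a_k}\; a_k' R a_k
\quad\text{subject to}\quad
a_k' a_k = 1,\qquad
a_h' a_k = 0 \;\;(h < k),\qquad
\sum_{j=1}^{p} \lvert a_{kj}\rvert \le t,
$$

where $t$ is a tuning constant: $t = \sqrt{p}$ leaves the LASSO bound inactive and recovers ordinary PCA, while $t = 1$ forces exactly one nonzero loading, since $\lVert a\rVert_1 \ge \lVert a\rVert_2 = 1$ with equality only for a single nonzero entry.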
Section 2 establishes the notation and terminology of PCA, and introduces an example in which interpretation of principal components is not straightforward. The most usual approach to simplifying interpretation, the rotation of PCs, is shown to have drawbacks. Section 3 introduces the new technique and describes some of its properties. Section 4 revisits the example of Section 2, and demonstrates the practical usefulness of the technique. A simulation study, which investigates the ability of the technique to recover known underlying structures in a dataset, is summarized in Section 5. The article ends with further discussion in Section 6, including some modifications, complications, and open questions.
2. A MOTIVATING EXAMPLE
Consider the classic example, first introduced by Jeffers (1967), in which a PCA was done on the correlation matrix of 13 physical measurements, listed in Table 1, made on a sample of 180 pitprops cut from Corsican pine timber.
Let $x_i$ be the vector of 13 variables for the $i$th pitprop, where each variable has been standardized to have unit variance. What PCA does, when based on the correlation matrix, is to find linear functions $a_1'x, a_2'x, \ldots, a_p'x$ which successively have maximum sample variance, subject to $a_h'a_k = 0$ for $k \ge 2$ and $h < k$. In addition, a normalization constraint

Table 1. Definitions of Variables in Jeffers' Pitprop Data

Variable    Definition
$x_1$       Top diameter in inches
$x_2$       Length in inches
$x_3$       Moisture content, % of dry weight
$x_4$       Specific gravity at time of test
$x_5$       Oven-dry specific gravity
$x_6$       Number of annual rings at top
$x_7$       Number of annual rings at bottom
$x_8$       Maximum bow in inches
$x_9$       Distance of point of maximum bow from top in inches
$x_{10}$    Number of knot whorls
$x_{11}$    Length of clear prop from top in inches
$x_{12}$    Average number of knots per whorl
$x_{13}$    Average diameter of the knots in inches
$a_k'a_k = 1$ is necessary to get a bounded solution. The derived variable $a_k'x$ is the $k$th principal component (PC). It turns out that $a_k$, the vector of coefficients or loadings for the $k$th PC, is the eigenvector of the sample correlation matrix $R$ corresponding to the $k$th largest eigenvalue $l_k$. In addition, the sample variance of $a_k'x$ is equal to $l_k$. Because of the successive maximization property, the first few PCs will often account for most of the sample variation in all the standardized measured variables. In the pitprop example, Jeffers (1967) was interested in the first six PCs, which together account for 87% of the total variance. The loadings in each of these six components are given in Table 2, together with the individual and cumulative percentage of variance in all 13 variables accounted for by $1, 2, \ldots, 6$ PCs.
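As an illustrative sketch (not the authors' code; the function and variable names are our own), the correlation-based PCA just described amounts to an eigendecomposition of $R$:

```python
import numpy as np

def correlation_pca(X):
    """Correlation-matrix PCA: returns eigenvalues l_k, loading vectors a_k
    (columns of A), component scores a_k' x, and percentages of variance."""
    # standardize each variable to zero mean and unit variance
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(Z, rowvar=False)        # sample correlation matrix R
    eigvals, eigvecs = np.linalg.eigh(R)    # eigendecomposition (ascending order)
    order = np.argsort(eigvals)[::-1]       # reorder by decreasing eigenvalue l_k
    l, A = eigvals[order], eigvecs[:, order]
    scores = Z @ A                          # kth column holds the kth PC, a_k' x
    pct = 100 * l / l.sum()                 # variance (%), as in Table 2
    return l, A, scores, pct
```

For Jeffers' data, X would be the 180 × 13 matrix of pitprop measurements, and the first six entries of pct should cumulate to about 87%.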
Table 2. Loadings for Correlation PCA for Jeffers' Pitprop Data

                              Component
Variable       (1)      (2)      (3)      (4)      (5)      (6)
$x_1$        0.404    0.212   −0.219   −0.027   −0.141   −0.086
$x_2$        0.406    0.180   −0.245   −0.025   −0.188   −0.111
$x_3$        0.125    0.546    0.114    0.015    0.433    0.120
$x_4$        0.173    0.468    0.328    0.010    0.361   −0.090
$x_5$        0.057   −0.138    0.493    0.254   −0.122   −0.560
$x_6$        0.284   −0.002    0.476   −0.153   −0.269    0.032
$x_7$        0.400   −0.185    0.261   −0.125   −0.176    0.030
$x_8$        0.294   −0.198   −0.222    0.294    0.203    0.103
$x_9$        0.357    0.010   −0.202    0.132   −0.117    0.103
$x_{10}$     0.379   −0.252   −0.120   −0.201    0.173   −0.019
$x_{11}$    −0.008    0.187    0.021    0.805   −0.302    0.178
$x_{12}$    −0.115    0.348    0.066   −0.303   −0.537    0.371
$x_{13}$    −0.112    0.304   −0.352   −0.098   −0.209   −0.671
Simplicity factor (varimax)
             0.059    0.103    0.082    0.397    0.086    0.266
Variance (%)
              32.4     18.2     14.4      8.9      7.0      6.3
Cumulative variance (%)
              32.4     50.7     65.0     74.0     80.9     87.2

Table 3. Loadings for Rotated Correlation PCA, Using the Varimax Criterion, for Jeffers' Pitprop Data

                              Component
Variable       (1)      (2)      (3)      (4)      (5)      (6)
$x_1$       −0.019    0.074    0.043   −0.027   −0.519   −0.077
$x_2$       −0.018    0.015    0.048   −0.024   −0.540   −0.102
$x_3$       −0.024    0.705   −0.128    0.003   −0.059    0.107
$x_4$        0.029    0.689    0.112    0.001    0.014   −0.087
$x_5$        0.258    0.009    0.477    0.218    0.205   −0.524
$x_6$       −0.185    0.061    0.604   −0.005   −0.032    0.012
$x_7$        0.031   −0.069    0.512   −0.102   −0.151    0.092
$x_8$        0.440   −0.042   −0.072    0.083   −0.221    0.239
$x_9$        0.097   −0.058    0.045    0.094   −0.408    0.141
$x_{10}$     0.271   −0.054    0.129   −0.367   −0.216    0.135
$x_{11}$     0.057   −0.022   −0.029    0.882   −0.137    0.075
$x_{12}$    −0.776   −0.056    0.091    0.079   −0.123    0.145
$x_{13}$    −0.120   −0.049   −0.280   −0.077   −0.269   −0.748
Simplicity factor (varimax)
             0.362    0.428    0.199    0.595    0.131    0.343
Variance (%)
              13.0     14.6     18.4      9.7     23.9      7.6
Cumulative variance (%)
              13.0     27.6     46.0     55.7     79.6     87.2
PCs are easiest to interpret if the pattern of loadings is clear-cut, with a few large (absolute) values and many small loadings in each PC. Although Jeffers (1967) makes an attempt to interpret all six components, some are, to say the least, messy, and he ignores some intermediate loadings. For example, PC2 has its largest loadings on $x_3$ and $x_4$, with small loadings on $x_6$ and $x_9$, but a whole range of intermediate values on other variables.
A traditional way to simplify loadings is by rotation. If $A$ is the $(13 \times 6)$ matrix whose $k$th column is $a_k$, then $A$ is post-multiplied by a matrix $T$ to give rotated loadings $B = AT$. If $b_k$ is the $k$th column of $B$, then $b_k'x$ is the $k$th rotated component. The matrix $T$ is chosen so as to optimize some simplicity criterion. Various criteria have been proposed, all of which attempt to create vectors of loadings whose elements are close to zero or far from zero, with few intermediate values. The idea is that each variable should be either clearly important or clearly unimportant in a rotated component, with as few cases as possible of borderline importance. Varimax is the most widely used rotation criterion and, like most other such criteria, it tends to drive at least some of the loadings in each component towards zero. This is not the only possible type of simplicity. A component whose loadings are all roughly equal is easy to interpret but will be avoided by most standard rotation criteria. It is difficult to envisage any criterion which could encompass all possible types of simplicity, and we concentrate here on simplicity as defined by varimax.
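For reference, a standard per-component form of the varimax measure, which reproduces the simplicity factors reported in Tables 2 and 3 (with $b_{jk}$ the loading of variable $j$ on component $k$), is

$$ s_k \;=\; \frac{p\sum_{j=1}^{p} b_{jk}^{4} \;-\; \Big(\sum_{j=1}^{p} b_{jk}^{2}\Big)^{2}}{p-1}. $$

This equals 0 when all loadings are equal in absolute value and 1 when exactly one loading is nonzero; up to a constant factor, varimax chooses $T$ to maximize the sum of these quantities over components.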
Table 3 gives the rotated loadings for six components in the correlation PCA of the pitprop data, together with the percentage of total variance accounted for by each rotated PC (RPC). The rotation criterion used in Table 3 is varimax (Krzanowski and Marriott 1995, p. 138), which is the most frequent choice (often the default in software), but other criteria give similar results. Varimax rotation aims to maximize the sum, over rotated components, of a criterion which takes values between zero and one. A value of zero occurs when all loadings in the component are equal, whereas a component with only one nonzero loading produces a value of unity. This criterion, or "simplicity factor," is given for each component in Tables 2 and 3, and it can be seen that its values are larger for most of the rotated components than for their unrotated counterparts.
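As a quick numerical check of the formula above against the tables (a sketch with our own function name):

```python
import numpy as np

def varimax_simplicity(b):
    """Simplicity factor of one component: 0 if all loadings are equal
    in absolute value, 1 if exactly one loading is nonzero."""
    b = np.asarray(b, dtype=float)
    p = b.size
    return (p * np.sum(b**4) - np.sum(b**2)**2) / (p - 1)

# loadings of component (4) from Table 2 (unrotated)
a4 = [-0.027, -0.025, 0.015, 0.010, 0.254, -0.153, -0.125,
      0.294, 0.132, -0.201, 0.805, -0.303, -0.098]
print(round(varimax_simplicity(a4), 3))  # 0.397, as reported in Table 2

# loadings of component (4) from Table 3 (varimax-rotated)
b4 = [-0.027, -0.024, 0.003, 0.001, 0.218, -0.005, -0.102,
      0.083, 0.094, -0.367, 0.882, 0.079, -0.077]
print(round(varimax_simplicity(b4), 3))  # 0.595, as reported in Table 3
```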

References

Breiman, L. (1995), "Better Subset Regression Using the Nonnegative Garrote," Technometrics, 37, 373–384.

Fu, W. J. (1998), "Penalized Regressions: The Bridge Versus the Lasso," Journal of Computational and Graphical Statistics, 7, 397–416.

Goffe, W. L., Ferrier, G. D., and Rogers, J. (1994), "Global Optimization of Statistical Functions with Simulated Annealing," Journal of Econometrics, 60, 65–99.

Jolliffe, I. T. (2002), Principal Component Analysis (2nd ed.), New York: Springer-Verlag.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.
Frequently Asked Questions (14)
Q1. What are the contributions mentioned in the paper "A modified principal component technique based on the lasso" ?

In this paper, the authors propose a new technique for principal component analysis (PCA) in which a bound is imposed on the sum of the absolute values of the loadings in a component; this type of bound is known as the LASSO (least absolute shrinkage and selection operator).

The explanation of this anomaly lies in the projected gradient method used for the numerical solution of the problem, which approximates the LASSO constraint with a certain smooth function, and thus the zero loadings produced may also be approximate.

In their technique the extra constraint is in the form of a bound on the sum of the absolute values of the loadings in that component. 

Given a vector $l$ of positive real numbers and an orthogonal matrix $A$, the authors can attempt to find a covariance matrix or correlation matrix whose eigenvalues are the elements of $l$ and whose eigenvectors are the columns of $A$.

The solution of this modified maximization problem is then found as an ascent gradient vector flow onto the $p$-dimensional unit sphere, following the standard projected gradient formalism (Chu and Trendafilov 2001; Helmke and Moore 1994).

Other rotation criteria, such as quartimax, can in theory find uniform vectors of loadings, but they were tried and also found to be unsuccessful in their simulations.

To achieve an appropriate solution, a number of parameters of the projected gradient method (e.g., starting points, absolute and relative tolerances) also need to be defined.

An equivalent way of deriving LASSO estimates is to minimize the residual sum of squares with the addition of a penalty function based on $\sum_{j=1}^{p} \lvert\beta_j\rvert$.
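Written out in the standard penalized form (our notation, following Tibshirani 1996), with responses $y_i$, predictors $x_{ij}$, and a penalty parameter $\lambda \ge 0$ corresponding to the bound $t$:

$$ \hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^{2} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}. $$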

This is because SCoTLASS is implemented subject to an extra restriction on PCA, so the authors lose the advantage of calculation via the singular value decomposition, which is what makes the ordinary PCA algorithm fast.

One can see that the solution with $t = 1.50$ contains a total of 56 loadings with magnitude less than 0.005, compared to 42 in the case $t = 1.75$.

It is preferred in many respects to rotated principal components, as a means of simplifying interpretation compared to principal component analysis. 

An alternative approach to the simulation study would be to replace the near-zero loadings by exact zeros and the nearly equal loadings by exact equalities.

These problems may occur due to the instability of the regression coefficients in the presence of collinearity, or simply because of the large number of variables included in the regression equation.

SCoTLASS looks for simple sources of variation and, like PCA, aims for high variance, but because of simplicity considerations the simplified components can, in theory, be moderately different from the PCs.
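To illustrate the kind of computation involved, the following is a heavily simplified sketch under our own assumptions, not the authors' projected gradient algorithm: it approximates the first SCoTLASS component by maximizing $a'Ra$ on the unit sphere, enforcing the $L_1$ bound softly through a penalty, and using the smooth surrogate $\sqrt{a_j^2 + \epsilon}$ for $|a_j|$ in the spirit of the smooth approximation mentioned above.

```python
import numpy as np
from scipy.optimize import minimize

def scotlass_first_component(R, t, eps=1e-6, rho=100.0, n_starts=10, seed=0):
    """Heuristic sketch: maximize a'Ra subject to a'a = 1 and sum|a_j| <= t.
    The L1 bound is enforced softly via a quadratic penalty on its violation,
    with sqrt(a_j^2 + eps) as a smooth stand-in for |a_j|."""
    p = R.shape[0]
    rng = np.random.default_rng(seed)

    def neg_objective(a):
        a = a / np.linalg.norm(a)            # stay on the unit sphere
        l1 = np.sum(np.sqrt(a**2 + eps))     # smooth surrogate for sum |a_j|
        return -(a @ R @ a) + rho * max(0.0, l1 - t) ** 2

    best_a, best_var = None, -np.inf
    for _ in range(n_starts):                # multistart: the problem is nonconvex
        a0 = rng.standard_normal(p)
        res = minimize(neg_objective, a0 / np.linalg.norm(a0), method="BFGS")
        a = res.x / np.linalg.norm(res.x)
        var = a @ R @ a
        if np.sum(np.abs(a)) <= t + 1e-3 and var > best_var:
            best_a, best_var = a, var
    return best_a, best_var
```

With $t = \sqrt{p}$ this should recover the first PC; shrinking $t$ towards 1 trades variance for loadings that are driven closer to zero.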