Prediction by Supervised Principal Components
Eric BAIR, Trevor HASTIE, Debashis PAUL, and Robert TIBSHIRANI
In regression problems where the number of predictors greatly exceeds the number of observations, conventional regression techniques may produce unsatisfactory results. We describe a technique called supervised principal components that can be applied to this type of problem. Supervised principal components is similar to conventional principal components analysis except that it uses a subset of the predictors selected based on their association with the outcome. Supervised principal components can be applied to regression and generalized regression problems, such as survival analysis. It compares favorably to other techniques for this type of problem, and can also account for the effects of other covariates and help identify which predictor variables are most important. We also provide asymptotic consistency results to help support our empirical findings. These methods could become important tools for DNA microarray data, where they may be used to more accurately diagnose and treat cancer.
KEY WORDS: Gene expression; Microarray; Regression; Survival analysis.
1. INTRODUCTION
In this article we study a method for predicting an outcome variable $Y$ from a set of predictor variables $X_1, X_2, \ldots, X_p$, measured on each of $N$ individuals. In the typical scenario that we have in mind, the number of measurements $p$ is much larger than $N$. In the example that motivated our work, $X_1, X_2, \ldots, X_p$ are gene expression measurements from DNA microarrays. The outcome $Y$ might be a quantitative variable that we might assume to be normally distributed. More commonly in microarray studies, $Y$ is a survival time, subject to censoring.
One approach to this kind of problem would be a supervised
prediction method. For example, we could use a form of re-
gression applicable when p > N; partial least squares (Wold
1975) would be one reasonable choice, as would ridge regres-
sion (Hoerl and Kennard 1970). However, Figure 1 illustrates
why a semisupervised approach may be more effective.
We imagine that there are two cell types, and that patients with the good cell type live longer on the average. However,
there is considerable overlap in the two sets of survival times.
We might think of survival time as a “noisy surrogate” for cell
type. A fully supervised approach would give the most weight
to those genes having the strongest relationship with survival.
These genes are partially, but not perfectly, related to cell type.
If we could instead discover the underlying cell types of the
patients (often reflected by a sizeable signature of genes acting
together in pathways), then we would do a better job of predict-
ing patient survival.
Now we can extract information about important cell types from both the relationship between $Y$ and $X_1, X_2, \ldots, X_p$ and the correlation among the predictors themselves. Principal components analysis (PCA) is a standard method for modeling correlation. Regression on the first few principal components would seem like a natural approach, but this might not always work well.

Eric Bair is Post-Doctoral Fellow, Department of Statistics, Stanford University, Stanford, CA 94305, and Department of Neurology, UCSF (E-mail: ebair@stat.stanford.edu). Trevor Hastie is Professor, Departments of Statistics and Health Research & Policy (E-mail: hastie@stat.stanford.edu), Debashis Paul is a Graduate Student, Department of Statistics (E-mail: debashis@stat.stanford.edu), and Robert Tibshirani is Professor, Departments of Statistics and Health Research & Policy (E-mail: tibs@stat.stanford.edu), Stanford University, Stanford, CA 94305. The authors thank the editor, two associate editors, and referees for suggestions that substantially improved this article. Bair was supported in part by a National Science Foundation graduate research fellowship. Tibshirani was supported in part by National Science Foundation grant DMS-99-71405 and National Institutes of Health contract N01-HV-28183. Hastie was supported in part by National Science Foundation grant DMS-02-04612 and National Institutes of Health grant 2R01 CA 72028-07.

The fictitious data given in Figure 2 illustrate the problem (if we were to use only the largest principal component). This is a heatmap display with each gene represented by a row and each column containing data from one patient on one microarray. Gene expression is coded from blue (low) to yellow (high). In this example, the largest variation is seen in the genes marked A, with the second set of 10 patients having higher expression in these genes than the first 10. The set of genes marked B show different variation, with the second and fourth blocks of patients having higher expression in these genes. The remainder of the genes show no systematic variation. At the bottom of the display, the red points are the first two singular vectors $u_1$ and $u_2$ (principal components) of the matrix of expression values. In microarray studies these are sometimes called "eigengenes" (Alter, Brown, and Botstein 2000). (The broken lines represent the "true" grouping mechanism that generated the data in the two groups.) Now if the genes in A are strongly related to the outcome $Y$, then $Y$ will be highly correlated with the first principal component. In this instance we would expect a model that uses $u_1$ to predict $Y$ to be very effective. However, the variation in genes A might reflect some biological process that is unrelated to the outcome $Y$. In that case, $Y$ might be more highly correlated with $u_2$ or some higher-order principal component.
The supervised principal components technique that we describe in this article is designed to uncover such structure automatically. This technique was described in a biological setting by Bair and Tibshirani (2004) in the context of a related method known as "supervised clustering." The supervised principal component idea is simple: Rather than performing principal component analysis using all of the genes in a dataset, we use only those genes with the strongest estimated correlation with $Y$. In the scenario of Figure 2, if $Y$ were highly correlated with the second principal component $u_2$, then the genes in block B would have the highest correlation with $Y$. Hence we would compute the first principal component using just these genes, and this would yield $u_2$.
As this example shows, using principal components helps un-
cover groups of genes that express together. Biologically, one or
more cellular processes, accompanied by their cadre of express-
ing genes, determine the survival outcome. This same model
underlies other approaches to supervised learning in microarray
studies, including supervised gene shaving (Hastie et al. 2000)
and tree harvesting (Hastie, Tibshirani, Botstein, and Brown 2001). The supervised principal components procedure can be viewed as a simple way to identify the clusters of relevant predictors: selection based on scores removes the irrelevant sources of variation, and principal components applied to the retained features identify the groups of coexpressing genes.

Figure 1. Underlying Conceptual Model. There are two cell types; patients with the good cell type live longer on average. However, there is considerable overlap in the two sets of survival times. Hence it could be advantageous to try to uncover the cell types and use these to predict survival time, rather than to predict survival time directly.
As far as we know, Bair and Tibshirani (2004) were the first
to discuss the idea of supervised principal components in detail.
But other authors have presented related ideas. Ghosh (2002)
prescreened genes before extracting principal components, but
seemed to do so for computational reasons. Jiang et al. (2004)
used a similar idea in the context of merging the results from
two different datasets. Nguyen and Rocke (2002) and Li and Gui (2004) discussed partial least squares (PLS) approaches to
survival prediction from microarray data. As we discuss in this
article, this is a related but different method, and PLS did not
perform as well as supervised principal components in our tests.
PLS does not do an initial thresholding of features, and this is
the key aspect of our procedure that underlies its good perfor-
mance.
In the next section we define the supervised principal com-
ponents procedure. Section 3 gives a brief summary of our con-
sistency results, and Section 4 discusses an importance measure
for individual features and a reduced model. Section 5 gives an
example from a lymphoma study, Section 6 discusses alterna-
tive approaches to semisupervised prediction, including "gene shaving," and Section 7 presents a simulation study comparing
the various methods. Section 8 summarizes the results of super-
vised principal components on some survival studies. Section 9
gives details of the theoretical results. The article concludes
with some generalizations, including covariate adjustment and
the use of unlabeled data in Section 10 and a discussion of lim-
itations and future work in Section 11. The Appendix contains
details of some proofs for Section 9.
2. SUPERVISED PRINCIPAL COMPONENTS
2.1 Description
We assume that there are $p$ features measured on $N$ observations (e.g., patients). Let $X$ be an $N \times p$ matrix of feature measurements (e.g., genes), and let $y$ be the $N$-vector of outcome measurements. For now we assume that the outcome is a quantitative variable; other types of outcomes, such as censored survival times, are discussed later. Here, in a nutshell, is the supervised principal component proposal:
1. Compute (univariate) standard regression coefficients for each feature.
2. Form a reduced data matrix consisting of only those features whose univariate coefficient exceeds a threshold $\theta$ in absolute value ($\theta$ is estimated by cross-validation).
3. Compute the first (or first few) principal components of the reduced data matrix.
4. Use these principal component(s) in a regression model to predict the outcome.
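A minimal sketch of these four steps in NumPy follows; it is an illustrative helper rather than the implementation used in the article, it keeps only the first component, and it assumes a quantitative outcome.

```python
import numpy as np

def spca_fit(X, y, theta):
    """Supervised principal components, first component only (sketch)."""
    xbar = X.mean(axis=0)
    Xc = X - xbar                                   # center the features

    # Step 1: standardized univariate coefficients s_j = x_j'y / ||x_j||
    s = (Xc.T @ y) / np.sqrt((Xc ** 2).sum(axis=0))

    # Step 2: reduced matrix of features with |s_j| > theta
    keep = np.abs(s) > theta
    if not keep.any():
        raise ValueError("no features pass the threshold theta")

    # Step 3: first principal component of the reduced matrix via the SVD
    U, d, Vt = np.linalg.svd(Xc[:, keep], full_matrices=False)
    u1 = U[:, 0]                                    # first supervised PC (mean 0, unit norm)
    w1 = Vt[0, :] / d[0]                            # weights so that Xc[:, keep] @ w1 == u1

    # Step 4: regress y on u1; gamma-hat is just an inner product
    gamma = u1 @ y
    return {"xbar": xbar, "keep": keep, "w1": w1, "gamma": gamma, "ybar": y.mean()}
```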
We now give details of the method. Assume that the columns of $X$ (variables) have been centered to have mean 0. Write the singular value decomposition (SVD) of $X$ as
$$X = U D V^T, \qquad (1)$$
where $U$, $D$, and $V$ are $N \times m$, $m \times m$, and $p \times m$, and $m = \min(N-1, p)$ is the rank of $X$. Here $D$ is a diagonal matrix containing the singular values $d_j$, and the columns of $U$ are the principal components $u_1, u_2, \ldots, u_m$; these are assumed to be ordered, so that $d_1 \ge d_2 \ge \cdots \ge d_m \ge 0$.
Let $s$ be the $p$-vector of standardized regression coefficients for measuring the univariate effect of each gene separately on $y$,
$$s_j = \frac{x_j^T y}{\|x_j\|}, \qquad (2)$$
with $\|x_j\| = \sqrt{x_j^T x_j}$. Actually, a scale estimate $\hat\sigma$ is missing in each of the $s_j$'s, but because it is common to all, we can omit it. Let $C_\theta$ be the collection of indices such that $|s_j| > \theta$. We denote by $X_\theta$ the matrix consisting of the columns of $X$ corresponding to $C_\theta$. The SVD of $X_\theta$ is
$$X_\theta = U_\theta D_\theta V_\theta^T. \qquad (3)$$
Letting $U_\theta = (u_{\theta,1}, u_{\theta,2}, \ldots, u_{\theta,m})$, we call $u_{\theta,1}$ the first supervised principal component of $X$, and so on. We now fit a univariate linear regression model with response $y$ and predictor $u_{\theta,1}$,
$$\hat{y}^{\mathrm{spc}} = \bar{y} + \hat\gamma \cdot u_{\theta,1}. \qquad (4)$$
Note that because $u_{\theta,1}$ is a left singular vector of $X_\theta$, it has mean 0 and unit norm. Hence $\hat\gamma = u_{\theta,1}^T y$, and the intercept is $\bar{y}$, the mean of $y$ (expanded here as a vector of such means).

We use cross-validation of the log-likelihood (or log partial-likelihood) ratio statistic to estimate the best value of $\theta$. In most examples in this article we consider only the first supervised principal component; in the examples of Section 8, we allow the possibility of using more than one component.
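For a quantitative outcome, that cross-validation can be sketched as follows; this reuses the hypothetical `spca_fit` helper above and, for simplicity, scores the held-out folds by squared prediction error rather than the likelihood-ratio statistic.

```python
def spca_cv_theta(X, y, theta_grid, K=5, seed=0):
    """Choose the screening threshold theta by K-fold cross-validation (sketch)."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % K            # roughly balanced fold labels
    cv_err = np.full(len(theta_grid), np.inf)
    for t, theta in enumerate(theta_grid):
        err = 0.0
        try:
            for k in range(K):
                tr, te = folds != k, folds == k
                fit = spca_fit(X[tr], y[tr], theta)
                u = (X[te] - fit["xbar"])[:, fit["keep"]] @ fit["w1"]
                yhat = fit["ybar"] + fit["gamma"] * u
                err += ((y[te] - yhat) ** 2).sum()
            cv_err[t] = err
        except ValueError:                         # no features survive this threshold
            pass
    return theta_grid[int(np.argmin(cv_err))]
```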
Note that, from (3),
$$U_\theta = X_\theta V_\theta D_\theta^{-1} = X_\theta W_\theta. \qquad (5)$$
So, for example, $u_{\theta,1}$ is a linear combination of the columns of $X_\theta$: $u_{\theta,1} = X_\theta w_{\theta,1}$. Hence our linear regression model estimate can be viewed as a restricted linear model estimate using all of the predictors in $X_\theta$,
$$\hat{y}^{\mathrm{spc}} = \bar{y} + \hat\gamma \cdot X_\theta w_{\theta,1} \qquad (6)$$
$$\phantom{\hat{y}^{\mathrm{spc}}} = \bar{y} + X_\theta \hat\beta_\theta, \qquad (7)$$
where $\hat\beta_\theta \equiv \hat\gamma\, w_{\theta,1}$. In fact, by padding $w_{\theta,1}$ with 0's (corresponding to the genes excluded by $C_\theta$), our estimate is linear in all $p$ genes.

Figure 2. Fictitious Microarray Data for Illustration. A heatmap display with each gene represented by a row, and each column giving the data from one patient on one microarray. Gene expression is coded from blue (low) to yellow (high). The largest variation is seen in the genes marked A, with the second set of 10 patients having higher expression in these genes. The set of genes marked B show different variation, with the second and fourth blocks of patients having higher expression in these genes. At the bottom of the display are shown the first two singular vectors (principal components) of the matrix of expression values (red points), and the actual grouping generators for the data (dashed lines). If the outcome is highly correlated with either principal component, then the supervised principal component technique will discover this.
Given a test feature vector $x^*$, we can make predictions from our regression model as follows:

1. Center each component of $x^*$ using the means we derived on the training data: $x_j^* \leftarrow x_j^* - \bar{x}_j$.
2. Compute $\hat{y}^* = \bar{y} + \hat\gamma \cdot x_\theta^{*T} w_{\theta,1} = \bar{y} + x_\theta^{*T} \hat\beta_\theta$, where $x_\theta^*$ is the appropriate subvector of $x^*$.
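In code, the same two steps (again continuing the hypothetical `spca_fit` sketch above) might look like:

```python
def spca_predict(fit, Xnew):
    """Predict the outcome for new observations from a fitted SPCA sketch."""
    Xc = Xnew - fit["xbar"]                   # 1. center with the training means
    u = Xc[:, fit["keep"]] @ fit["w1"]        # supervised PC score x*_theta' w_theta,1
    return fit["ybar"] + fit["gamma"] * u     # 2. y-hat = y-bar + gamma-hat * score
```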
In the case of uncorrelated predictors, it is easy to verify that
the supervised principal components procedure has the desired
behavior. It yields all predictors whose standardized univariate
coefficients exceed θ in absolute value.
Our proposal is also applicable to generalized regression settings, for example, survival data, classification problems, or data typically analyzed by a generalized linear model. In these cases we use a score statistic in place of the standardized regression coefficients in (2) and use a proportional hazards or appropriate generalized regression in (4). Let $\ell_j(\beta)$ be the log-likelihood or partial likelihood relating the data for a single predictor $X_j$ and the outcome $y$, and let $U_j(\beta_0) = d\ell_j/d\beta\,\big|_{\beta=\beta_0}$ and $I_j(\beta_0) = -d^2\ell_j/d\beta^2\,\big|_{\beta=\beta_0}$. Then the score statistic for predictor $j$ has the form
$$s_j = \frac{U_j(0)^2}{I_j(0)}. \qquad (8)$$
Of course, for the Gaussian log-likelihood, this quantity is equivalent to the standardized regression coefficient (2).
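A quick numerical illustration of that equivalence (for the Gaussian case the score statistic (8) is simply the square of the standardized coefficient (2), so thresholding either quantity ranks the features identically; the data here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 5))
X -= X.mean(axis=0)                                  # centered features
y = rng.standard_normal(40)

s = (X.T @ y) / np.sqrt((X ** 2).sum(axis=0))        # standardized coefficients, as in (2)
score = (X.T @ y) ** 2 / (X ** 2).sum(axis=0)        # U_j(0)^2 / I_j(0) for a Gaussian model

print(np.allclose(score, s ** 2))                    # True
```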
One could consider iterating the supervised principal com-
ponents procedure. Thus we would find features whose inner
product with the current supervised principal components was
largest, use those features to compute the new principal com-
ponents, and so on. But this procedure will tend to converge to
the usual (unsupervised) principal components, because there
is nothing to keep it close to the outcome after the first step. An
iterative procedure would make sense only if it were based on
a criterion involving both the variance of the features and the
goodness of fit to the outcome. We consider such a criterion in
the next section, although we ultimately do not pursue it (for
reasons given there).
2.2 An Underlying Model
We now consider a model to support the supervised princi-
pal components method. Suppose that we have a response vari-
able Y that is related to an underlying latent variable U by a
linear model,
Y = β
0
+β
1
U +ε. (9)
In addition, we have expression measurements on a set of
genes X
j
indexed by j P,forwhich
X
j
=α
0j
+α
1j
U +
j
, j P. (10)
The errors ε and
j
are assumed to have mean 0 and are inde-
pendent of all other random variables in their respective models.
We also have many additional genes X
k
, k /P, which are in-
dependent of U. We can think of U as a discrete or continuous
aspect of a cell type, which we do not measure directly. P repre-
sents a set of genes comprising a pathway or process associated
with this cell type, and the X
j
s are noisy measurements of their
gene expression. We would like to identify P, estimate U, and
hence fit the prediction model (9). This is a special case of a la-
tent structure model or single-component factor analysis model
(Mardia, Kent, and Bibby 1979).
The supervised principal components algorithm (SPCA) can be seen as a method for fitting this model:

1. The screening step estimates the set $P$ by $\hat{P} = C_\theta$.
2. Given $\hat{P}$, the SVD of $X_\theta$ estimates $U$ in (10) by the largest principal component $u_{\theta,1}$.
3. Finally, the regression fit (4) estimates (9).

Step 1 is natural, because on average the regression coefficient $S_j = X_j^T Y / \|X_j\|$ is non-0 only if $\alpha_{1j}$ is non-0 (assuming that the genes have been centered). Hence this step should select the genes $j \in P$. Step 2 is natural if we assume that the errors $\epsilon_j$ have a Gaussian distribution, with the same variance. In this case the SVD provides the maximum likelihood estimates for the single-factor model (Mardia et al. 1979). The regression in step 3 is an obvious final step.
In fact, given $P$, the model defined by (9) and (10) is a special structured case of an errors-in-variables model (Miller 1986; Huffel and Lemmerling 2002). One could set up a joint optimization criterion,
$$\min_{\beta_0,\beta_1,\{\alpha_{0j},\alpha_{1j}\},u_1,\ldots,u_N}\;
\sum_{i=1}^{N}\frac{(y_i-\beta_0-\beta_1 u_i)^2}{\sigma_Y^2}
+\sum_{j\in P}\sum_{i=1}^{N}\frac{(x_{ij}-\alpha_{0j}-\alpha_{1j}u_i)^2}{\sigma_X^2}. \qquad (11)$$
Then it is easy to show that (11) can be solved by an augmented and weighted SVD problem. In detail, we form the augmented data matrix
$$X_a = (\,y : X\,), \qquad (12)$$
and assign weight $\omega_1 = \sigma_X^2/\sigma_Y^2$ to the first column and weight $\omega_j = 1$ to the remaining columns. Then, with
$$v_0 = \begin{pmatrix}\beta_0\\ \alpha_{0j_1}\\ \vdots\\ \alpha_{0j_q}\end{pmatrix},\qquad
v_1 = \begin{pmatrix}\beta_1\\ \alpha_{1j_1}\\ \vdots\\ \alpha_{1j_q}\end{pmatrix}, \qquad (13)$$
(with $q = |P|$) the rank-1 weighted SVD $X_a \approx \mathbf{1}v_0^T + u v_1^T$ solves the optimization problem in (11). Although this approach might seem more principled than our two-step procedure, SPCA has a distinct advantage. Here $\hat{u}_{\theta,1} = X_\theta w_{\theta,1}$, and hence it can be defined for future $x^*$ data and used for predictions. In the errors-in-variables approach, $\hat{u}^{EV} = X_a w^{EV}$, which involves $y$ as well and leaves no obvious estimate for future data. We return to this model in Section 6.
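The weighted rank-1 fit itself is easy to sketch: scaling the columns of the augmented matrix by the square roots of their weights turns (11) into an ordinary rank-1 SVD problem. The function below is an illustration under that reduction, not code from the article; note that the latent score it returns depends on $y$, which is exactly the drawback discussed above.

```python
import numpy as np

def errors_in_variables_fit(X_P, y, sigma2_X=1.0, sigma2_Y=1.0):
    """Rank-1 weighted SVD fit of criterion (11) for the genes in P (sketch)."""
    Xa = np.column_stack([y, X_P])             # augmented matrix (y : X), as in (12)
    w = np.ones(Xa.shape[1])
    w[0] = sigma2_X / sigma2_Y                 # weight omega_1 on the y column
    Xs = Xa * np.sqrt(w)                       # column scaling absorbs the weights

    v0_scaled = Xs.mean(axis=0)                # column offsets (in scaled coordinates)
    U, d, Vt = np.linalg.svd(Xs - v0_scaled, full_matrices=False)
    u_hat = U[:, 0] * d[0]                     # latent variable estimates u_1, ..., u_N
    v1 = Vt[0, :] / np.sqrt(w)                 # un-scale loadings; v1[0] plays the role of beta_1
    return u_hat, v1
```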
This latent-variable model can be easily extended to accommodate multiple components $U_1, \ldots, U_M$. One way of doing this is to assume that
$$Y = \beta_0 + \sum_{m=1}^{M}\beta_m U_m + \varepsilon \qquad (14)$$
and
$$X_j = \alpha_{0j} + \sum_{m=1}^{M}\alpha_{1jm} U_m + \epsilon_j, \qquad j \in P. \qquad (15)$$
Fitting this model proceeds as before, except now we extract $M$ rather than one principal component from $X_\theta$. We study this model more deeply in Section 9.
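A sketch of that extension (again illustrative, paralleling the earlier single-component helper): keep the first $M$ columns of $U_\theta$ and regress $y$ on them jointly.

```python
import numpy as np

def spca_fit_multi(X, y, theta, M=2):
    """Supervised principal components with M components (sketch)."""
    xbar = X.mean(axis=0)
    Xc = X - xbar
    s = (Xc.T @ y) / np.sqrt((Xc ** 2).sum(axis=0))
    keep = np.abs(s) > theta

    U, d, Vt = np.linalg.svd(Xc[:, keep], full_matrices=False)
    Um = U[:, :M]                              # first M supervised principal components
    W = Vt[:M, :].T / d[:M]                    # Xc[:, keep] @ W reproduces Um
    beta = Um.T @ (y - y.mean())               # regression coefficients (Um has orthonormal columns)
    return {"xbar": xbar, "keep": keep, "W": W, "beta": beta, "ybar": y.mean()}
```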
2.3 An Example
The SPCA model anticipates other sources of variation in
the data, unrelated to the response. In fact these sources can be
even stronger than those driving the response, to the extent that
principal components would identify them first. By guiding the
principal components, SPCA extracts the desired components.
We simulated data from a scenario like that of Figure 2. We used 1,000 genes and 40 samples, all with base error model being Gaussian with unit variance. We then defined the mean vectors $\mu_1$ and $\mu_2$ as follows. We divide the samples into consecutive blocks of 10, denoted by the sets $(a, b, c, d)$. Then
$$\mu_{1i} = \begin{cases} -2 & \text{if } i \in a \cup b \\ +2 & \text{otherwise} \end{cases} \qquad (16)$$
and
$$\mu_{2i} = \begin{cases} -1 & \text{if } i \in a \cup c \\ +1 & \text{otherwise.} \end{cases} \qquad (17)$$
The first 200 genes have mean structure $\mu_1$,
$$x_{ij} = \mu_{1i} + \epsilon_{ij}, \qquad j = 1, \ldots, 200,\; i = 1, \ldots, 40. \qquad (18)$$
The next 50 genes have mean structure $\mu_2$,
$$x_{ij} = \mu_{2i} + \epsilon_{ij}, \qquad j = 201, \ldots, 250,\; i = 1, \ldots, 40. \qquad (19)$$
In all cases $\epsilon_{ij} \sim N(0, 1)$, which is also how the remaining 750 genes are defined. Finally, the outcome is generated as $y_i = \alpha \cdot \mu_{1i} + (1 - \alpha) \cdot \mu_{2i} + \varepsilon_i$, where $\varepsilon_i$ is $N(0, 1)$. The first two principal components of $X$ are approximately $\mu_1$ and $\mu_2$ (see Fig. 2).
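A data-generation sketch for this simulation follows; the $-2/+2$ and $-1/+1$ signs follow the reconstruction of (16) and (17) above, and the helper name and seed are illustrative choices.

```python
import numpy as np

def simulate_example(alpha, N=40, p=1000, seed=0):
    """Generate one dataset from the Section 2.3 simulation (sketch)."""
    rng = np.random.default_rng(seed)
    block = np.repeat([0, 1, 2, 3], N // 4)              # consecutive blocks a, b, c, d of 10
    mu1 = np.where(np.isin(block, [0, 1]), -2.0, 2.0)    # -2 on a u b, +2 otherwise
    mu2 = np.where(np.isin(block, [0, 2]), -1.0, 1.0)    # -1 on a u c, +1 otherwise

    X = rng.standard_normal((N, p))                      # base N(0, 1) noise for all genes
    X[:, :200] += mu1[:, None]                           # genes 1-200: mean structure mu1
    X[:, 200:250] += mu2[:, None]                        # genes 201-250: mean structure mu2
    y = alpha * mu1 + (1 - alpha) * mu2 + rng.standard_normal(N)
    return X, y
```

Fitting the supervised procedure on one draw and correlating its predictions with $y$ from an independent draw, over a grid of thresholds, reproduces the kind of comparison summarized in Figure 3.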
We tried various values of $\alpha \in [0, 1]$, as shown in Figure 3. Plotted is the correlation of the supervised principal components predictor with an independent (test set) realization of $y$, as the threshold $\theta$ in the screening process $|s_j| > \theta$ is varied. The number of genes surviving the screening is shown on the horizontal axis. The extreme right end of each plot represents standard principal components regression. When $\alpha = 0$, so that the outcome is correlated with the second principal component, supervised PC easily improves on principal components regression. When $\alpha$ reaches .5, the advantage disappears, but supervised PC does no worse than principal components regression.
3. CONSISTENCY OF SUPERVISED PRINCIPAL COMPONENTS

In Section 9 we show that the standard principal components regression is not consistent as the sample size and number of features grow, whereas supervised principal components is consistent under appropriate assumptions. Because the details are lengthy, we give a summary first and defer the full discussion until Section 9.

We consider a latent variable model of the form (9) and (10) for data with $N$ samples and $p$ features. We denote the full $N \times p$ feature matrix by $X$, and the $N \times p_1$ block of $X$ by $X_1$, corresponding to the features $j \in P$. We assume that as $N \to \infty$, $p/N \to \gamma \in (0, \infty)$ and $p_1/N \to 0$ "fast." Note that $p$ and $p_1$ may be fixed or may approach $\infty$. Given this setup, we prove the following:

Let $\tilde{U}$ be the leading principal component of $X$, and let $\tilde{\beta}$ be the regression coefficient of $Y$ on $\tilde{U}$. Then $\tilde{U}$ is not generally consistent for $U$, and likewise $\tilde{\beta}$ is not generally consistent for $\beta_1$.
Figure 3. Correlation Between the First Supervised Principal Component $u_{\theta,1}$ and a Test Outcome $y$, as the Weight $\alpha$ Given to the First Principal Component in the Data Generation Is Varied [panels (a)-(d)]. The number of genes used by the procedure is shown on the horizontal axis in each panel. The sharp switch in (a) and (b) corresponds to the point at which the order of the principal components is reversed.
