Prediction by Supervised Principal Components
Eric BAIR, Trevor HASTIE, Debashis PAUL, and Robert TIBSHIRANI
In regression problems where the number of predictors greatly exceeds the number of observations, conventional regression techniques may produce unsatisfactory results. We describe a technique called supervised principal components that can be applied to this type of problem. Supervised principal components is similar to conventional principal components analysis except that it uses a subset of the predictors selected based on their association with the outcome. Supervised principal components can be applied to regression and generalized regression problems, such as survival analysis. It compares favorably to other techniques for this type of problem, and can also account for the effects of other covariates and help identify which predictor variables are most important. We also provide asymptotic consistency results to help support our empirical findings. These methods could become important tools for DNA microarray data, where they may be used to more accurately diagnose and treat cancer.
KEY WORDS: Gene expression; Microarray; Regression; Survival analysis.
1. INTRODUCTION
In this article we study a method for predicting an outcome variable $Y$ from a set of predictor variables $X_1, X_2, \ldots, X_p$, measured on each of $N$ individuals. In the typical scenario that we have in mind, the number of measurements $p$ is much larger than $N$. In the example that motivated our work, $X_1, X_2, \ldots, X_p$ are gene expression measurements from DNA microarrays. The outcome $Y$ might be a quantitative variable that we might assume to be normally distributed. More commonly in microarray studies, $Y$ is a survival time, subject to censoring.
One approach to this kind of problem would be a supervised
prediction method. For example, we could use a form of re-
gression applicable when p > N; partial least squares (Wold
1975) would be one reasonable choice, as would ridge regres-
sion (Hoerl and Kennard 1970). However, Figure 1 illustrates
why a semisupervised approach may be more effective.
We imagine that there are two cell types, and that patients with the good cell type live longer on the average. However,
there is considerable overlap in the two sets of survival times.
We might think of survival time as a “noisy surrogate” for cell
type. A fully supervised approach would give the most weight
to those genes having the strongest relationship with survival.
These genes are partially, but not perfectly, related to cell type.
If we could instead discover the underlying cell types of the
patients (often reflected by a sizeable signature of genes acting
together in pathways), then we would do a better job of predict-
ing patient survival.
Now we can extract information about important cell types from both the relationship between $Y$ and $X_1, X_2, \ldots, X_p$ and the correlation among the predictors themselves. Principal components analysis (PCA) is a standard method for modeling correlation. Regression on the first few principal components would seem like a natural approach, but this might not always work well.

Eric Bair is Post-Doctoral Fellow, Department of Statistics, Stanford University, Stanford, CA 94305, and Department of Neurology, UCSF (E-mail: ebair@stat.stanford.edu). Trevor Hastie is Professor, Departments of Statistics and Health Research & Policy (E-mail: hastie@stat.stanford.edu), Debashis Paul is a Graduate Student, Department of Statistics (E-mail: debashis@stat.stanford.edu), and Robert Tibshirani is Professor, Departments of Statistics and Health Research & Policy (E-mail: tibs@stat.stanford.edu), Stanford University, Stanford, CA 94305. The authors thank the editor, two associate editors, and referees for suggestions that substantially improved this article. Bair was supported in part by a National Science Foundation graduate research fellowship. Tibshirani was supported in part by National Science Foundation grant DMS-99-71405 and National Institutes of Health contract N01-HV-28183. Hastie was supported in part by National Science Foundation grant DMS-02-04612 and National Institutes of Health grant 2R01 CA 72028-07.

The fictitious data given in Figure 2 illustrate the problem (if we were to use only the largest principal component). This is a heatmap display with each gene represented by a row and each column containing data from one patient on one microarray. Gene expression is coded from blue (low) to yellow (high). In this example, the largest variation is seen in the genes marked A, with the second set of 10 patients having higher expression in these genes than the first 10. The set of genes marked B show different variation, with the second and fourth blocks of patients having higher expression in these genes. The remainder of the genes show no systematic variation. At the bottom of the display, the red points are the first two singular vectors $u_1$ and $u_2$ (principal components) of the matrix of expression values. In microarray studies these are sometimes called "eigengenes" (Alter, Brown, and Botstein 2000). (The broken lines represent the "true" grouping mechanism that generated the data in the two groups.) Now if the genes in A are strongly related to the outcome $Y$, then $Y$ will be highly correlated with the first principal component. In this instance we would expect a model that uses $u_1$ to predict $Y$ to be very effective. However, the variation in genes A might reflect some biological process that is unrelated to the outcome $Y$. In that case, $Y$ might be more highly correlated with $u_2$ or some higher-order principal component.
The supervised principal components technique that we describe in this article is designed to uncover such structure automatically. This technique was described in a biological setting by Bair and Tibshirani (2004) in the context of a related method known as "supervised clustering." The supervised principal component idea is simple: Rather than performing principal component analysis using all of the genes in a dataset, we use only those genes with the strongest estimated correlation with $Y$. In the scenario of Figure 2, if $Y$ were highly correlated with the second principal component $u_2$, then the genes in block B would have the highest correlation with $Y$. Hence we would compute the first principal component using just these genes, and this would yield $u_2$.
As this example shows, using principal components helps un-
cover groups of genes that express together. Biologically, one or
more cellular processes, accompanied by their cadre of express-
ing genes, determine the survival outcome. This same model
underlies other approaches to supervised learning in microarray
studies, including supervised gene shaving (Hastie et al. 2000)
and tree harvesting (Hastie, Tibshirani, Botstein, and Brown 2001). The supervised principal components procedure can be viewed as a simple way to identify the clusters of relevant predictors: selection based on scores removes the irrelevant sources of variation, and principal components applied to the retained features identify the groups of coexpressing genes.

Figure 1. Underlying Conceptual Model. There are two cell types; patients with the good cell type live longer on average. However, there is considerable overlap in the two sets of survival times. Hence it could be advantageous to try to uncover the cell types and use these to predict survival time, rather than to predict survival time directly.
As far as we know, Bair and Tibshirani (2004) were the first
to discuss the idea of supervised principal components in detail.
But other authors have presented related ideas. Ghosh (2002)
prescreened genes before extracting principal components, but
seemed to do so for computational reasons. Jiang et al. (2004)
used a similar idea in the context of merging the results from
two different datasets. Nguyen and Rocke (2002) and Li and Gui (2004) discussed partial least squares (PLS) approaches to
survival prediction from microarray data. As we discuss in this
article, this is a related but different method, and PLS did not
perform as well as supervised principal components in our tests.
PLS does not do an initial thresholding of features, and this is
the key aspect of our procedure that underlies its good perfor-
mance.
In the next section we define the supervised principal com-
ponents procedure. Section 3 gives a brief summary of our con-
sistency results, and Section 4 discusses an importance measure
for individual features and a reduced model. Section 5 gives an
example from a lymphoma study, Section 6 discusses alterna-
tive approaches to semisupervised prediction, including "gene shaving," and Section 7 presents a simulation study comparing
the various methods. Section 8 summarizes the results of super-
vised principal components on some survival studies. Section 9
gives details of the theoretical results. The article concludes
with some generalizations, including covariate adjustment and
the use of unlabeled data in Section 10 and a discussion of lim-
itations and future work in Section 11. The Appendix contains
details of some proofs for Section 9.
2. SUPERVISED PRINCIPAL COMPONENTS
2.1 Description
We assume that there are $p$ features measured on $N$ observations (e.g., patients). Let $X$ be an $N \times p$ matrix of feature measurements (e.g., genes), and let $y$ be the $N$-vector of outcome measurements. For now we assume that the outcome is a quantitative variable; other types of outcomes, such as censored survival times, are discussed later. Here, in a nutshell, is the supervised principal component proposal:
1. Compute (univariate) standard regression coefficients for each feature.
2. Form a reduced data matrix consisting of only those features whose univariate coefficient exceeds a threshold $\theta$ in absolute value ($\theta$ is estimated by cross-validation).
3. Compute the first (or first few) principal components of the reduced data matrix.
4. Use these principal component(s) in a regression model to predict the outcome.
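A minimal sketch of these four steps in NumPy follows; it is an illustrative helper rather than the implementation used in the article, it keeps only the first component, and it assumes a quantitative outcome.

```python
import numpy as np

def spca_fit(X, y, theta):
    """Supervised principal components, first component only (sketch)."""
    xbar = X.mean(axis=0)
    Xc = X - xbar                                   # center the features

    # Step 1: standardized univariate coefficients s_j = x_j'y / ||x_j||
    s = (Xc.T @ y) / np.sqrt((Xc ** 2).sum(axis=0))

    # Step 2: reduced matrix of features with |s_j| > theta
    keep = np.abs(s) > theta
    if not keep.any():
        raise ValueError("no features pass the threshold theta")

    # Step 3: first principal component of the reduced matrix via the SVD
    U, d, Vt = np.linalg.svd(Xc[:, keep], full_matrices=False)
    u1 = U[:, 0]                                    # first supervised PC (mean 0, unit norm)
    w1 = Vt[0, :] / d[0]                            # weights so that Xc[:, keep] @ w1 == u1

    # Step 4: regress y on u1; gamma-hat is just an inner product
    gamma = u1 @ y
    return {"xbar": xbar, "keep": keep, "w1": w1, "gamma": gamma, "ybar": y.mean()}
```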
We now give details of the method. Assume that the columns of $X$ (variables) have been centered to have mean 0. Write the singular value decomposition (SVD) of $X$ as
$$X = U D V^T, \qquad (1)$$
where $U$, $D$, and $V$ are $N \times m$, $m \times m$, and $p \times m$, and $m = \min(N-1, p)$ is the rank of $X$. Here $D$ is a diagonal matrix containing the singular values $d_j$, and the columns of $U$ are the principal components $u_1, u_2, \ldots, u_m$; these are assumed to be ordered, so that $d_1 \ge d_2 \ge \cdots \ge d_m \ge 0$.
Let $s$ be the $p$-vector of standardized regression coefficients for measuring the univariate effect of each gene separately on $y$,
$$s_j = \frac{x_j^T y}{\|x_j\|}, \qquad (2)$$
with $\|x_j\| = \sqrt{x_j^T x_j}$. Actually, a scale estimate $\hat\sigma$ is missing in each of the $s_j$'s, but because it is common to all, we can omit it. Let $C_\theta$ be the collection of indices such that $|s_j| > \theta$. We denote by $X_\theta$ the matrix consisting of the columns of $X$ corresponding to $C_\theta$. The SVD of $X_\theta$ is
$$X_\theta = U_\theta D_\theta V_\theta^T. \qquad (3)$$
Letting $U_\theta = (u_{\theta,1}, u_{\theta,2}, \ldots, u_{\theta,m})$, we call $u_{\theta,1}$ the first supervised principal component of $X$, and so on. We now fit a univariate linear regression model with response $y$ and predictor $u_{\theta,1}$,
$$\hat{y}^{\mathrm{spc}} = \bar{y} + \hat\gamma \cdot u_{\theta,1}. \qquad (4)$$
Note that because $u_{\theta,1}$ is a left singular vector of $X_\theta$, it has mean 0 and unit norm. Hence $\hat\gamma = u_{\theta,1}^T y$, and the intercept is $\bar{y}$, the mean of $y$ (expanded here as a vector of such means).

We use cross-validation of the log-likelihood (or log partial-likelihood) ratio statistic to estimate the best value of $\theta$. In most examples in this article we consider only the first supervised principal component; in the examples of Section 8, we allow the possibility of using more than one component.
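For a quantitative outcome, that cross-validation can be sketched as follows; this reuses the hypothetical `spca_fit` helper above and, for simplicity, scores the held-out folds by squared prediction error rather than the likelihood-ratio statistic.

```python
def spca_cv_theta(X, y, theta_grid, K=5, seed=0):
    """Choose the screening threshold theta by K-fold cross-validation (sketch)."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % K            # roughly balanced fold labels
    cv_err = np.full(len(theta_grid), np.inf)
    for t, theta in enumerate(theta_grid):
        err = 0.0
        try:
            for k in range(K):
                tr, te = folds != k, folds == k
                fit = spca_fit(X[tr], y[tr], theta)
                u = (X[te] - fit["xbar"])[:, fit["keep"]] @ fit["w1"]
                yhat = fit["ybar"] + fit["gamma"] * u
                err += ((y[te] - yhat) ** 2).sum()
            cv_err[t] = err
        except ValueError:                         # no features survive this threshold
            pass
    return theta_grid[int(np.argmin(cv_err))]
```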
Note that, from (3),
$$U_\theta = X_\theta V_\theta D_\theta^{-1} = X_\theta W_\theta. \qquad (5)$$
So, for example, $u_{\theta,1}$ is a linear combination of the columns of $X_\theta$: $u_{\theta,1} = X_\theta w_{\theta,1}$. Hence our linear regression model estimate can be viewed as a restricted linear model estimate using all of the predictors in $X_\theta$,
$$\hat{y}^{\mathrm{spc}} = \bar{y} + \hat\gamma \cdot X_\theta w_{\theta,1} \qquad (6)$$
$$\phantom{\hat{y}^{\mathrm{spc}}} = \bar{y} + X_\theta \hat\beta_\theta, \qquad (7)$$
where $\hat\beta_\theta \equiv \hat\gamma\, w_{\theta,1}$. In fact, by padding $w_{\theta,1}$ with 0's (corresponding to the genes excluded by $C_\theta$), our estimate is linear in all $p$ genes.

Figure 2. Fictitious Microarray Data for Illustration. A heatmap display with each gene represented by a row, and each column giving the data from one patient on one microarray. Gene expression is coded from blue (low) to yellow (high). The largest variation is seen in the genes marked A, with the second set of 10 patients having higher expression in these genes. The set of genes marked B show different variation, with the second and fourth blocks of patients having higher expression in these genes. At the bottom of the display are shown the first two singular vectors (principal components) of the matrix of expression values (red points), and the actual grouping generators for the data (dashed lines). If the outcome is highly correlated with either principal component, then the supervised principal component technique will discover this.
Given a test feature vector $x^*$, we can make predictions from our regression model as follows:

1. Center each component of $x^*$ using the means we derived on the training data: $x_j^* \leftarrow x_j^* - \bar{x}_j$.
2. Compute $\hat{y}^* = \bar{y} + \hat\gamma \cdot x_\theta^{*T} w_{\theta,1} = \bar{y} + x_\theta^{*T} \hat\beta_\theta$, where $x_\theta^*$ is the appropriate subvector of $x^*$.
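In code, the same two steps (again continuing the hypothetical `spca_fit` sketch above) might look like:

```python
def spca_predict(fit, Xnew):
    """Predict the outcome for new observations from a fitted SPCA sketch."""
    Xc = Xnew - fit["xbar"]                   # 1. center with the training means
    u = Xc[:, fit["keep"]] @ fit["w1"]        # supervised PC score x*_theta' w_theta,1
    return fit["ybar"] + fit["gamma"] * u     # 2. y-hat = y-bar + gamma-hat * score
```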
In the case of uncorrelated predictors, it is easy to verify that
the supervised principal components procedure has the desired
behavior. It yields all predictors whose standardized univariate
coefficients exceed θ in absolute value.
Our proposal is also applicable to generalized regression settings, for example, survival data, classification problems, or data typically analyzed by a generalized linear model. In these cases we use a score statistic in place of the standardized regression coefficients in (2) and use a proportional hazards or appropriate generalized regression in (4). Let $\ell_j(\beta)$ be the log-likelihood or partial likelihood relating the data for a single predictor $X_j$ and the outcome $y$, and let $U_j(\beta_0) = d\ell_j/d\beta\,\big|_{\beta=\beta_0}$ and $I_j(\beta_0) = -d^2\ell_j/d\beta^2\,\big|_{\beta=\beta_0}$. Then the score statistic for predictor $j$ has the form
$$s_j = \frac{U_j(0)^2}{I_j(0)}. \qquad (8)$$
Of course, for the Gaussian log-likelihood, this quantity is equivalent to the standardized regression coefficient (2).
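A quick numerical illustration of that equivalence (for the Gaussian case the score statistic (8) is simply the square of the standardized coefficient (2), so thresholding either quantity ranks the features identically; the data here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 5))
X -= X.mean(axis=0)                                  # centered features
y = rng.standard_normal(40)

s = (X.T @ y) / np.sqrt((X ** 2).sum(axis=0))        # standardized coefficients, as in (2)
score = (X.T @ y) ** 2 / (X ** 2).sum(axis=0)        # U_j(0)^2 / I_j(0) for a Gaussian model

print(np.allclose(score, s ** 2))                    # True
```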
One could consider iterating the supervised principal com-
ponents procedure. Thus we would find features whose inner
product with the current supervised principal components was
largest, use those features to compute the new principal com-
ponents, and so on. But this procedure will tend to converge to
the usual (unsupervised) principal components, because there
is nothing to keep it close to the outcome after the first step. An
iterative procedure would make sense only if it were based on
a criterion involving both the variance of the features and the
goodness of fit to the outcome. We consider such a criterion in
the next section, although we ultimately do not pursue it (for
reasons given there).
2.2 An Underlying Model
We now consider a model to support the supervised princi-
pal components method. Suppose that we have a response vari-
able Y that is related to an underlying latent variable U by a
linear model,
Y = β
0
+β
1
U +ε. (9)
In addition, we have expression measurements on a set of
genes X
j
indexed by j P,forwhich
X
j
=α
0j
+α
1j
U +
j
, j P. (10)
The errors ε and
j
are assumed to have mean 0 and are inde-
pendent of all other random variables in their respective models.
We also have many additional genes X
k
, k /P, which are in-
dependent of U. We can think of U as a discrete or continuous
aspect of a cell type, which we do not measure directly. P repre-
sents a set of genes comprising a pathway or process associated
with this cell type, and the X
j
s are noisy measurements of their
gene expression. We would like to identify P, estimate U, and
hence fit the prediction model (9). This is a special case of a la-
tent structure model or single-component factor analysis model
(Mardia, Kent, and Bibby 1979).
The supervised principal components algorithm (SPCA) can be seen as a method for fitting this model:

1. The screening step estimates the set $P$ by $\hat{P} = C_\theta$.
2. Given $\hat{P}$, the SVD of $X_\theta$ estimates $U$ in (10) by the largest principal component $u_{\theta,1}$.
3. Finally, the regression fit (4) estimates (9).

Step 1 is natural, because on average the regression coefficient $S_j = X_j^T Y / \|X_j\|$ is non-0 only if $\alpha_{1j}$ is non-0 (assuming that the genes have been centered). Hence this step should select the genes $j \in P$. Step 2 is natural if we assume that the errors $\epsilon_j$ have a Gaussian distribution, with the same variance. In this case the SVD provides the maximum likelihood estimates for the single-factor model (Mardia et al. 1979). The regression in step 3 is an obvious final step.
In fact, given $P$, the model defined by (9) and (10) is a special structured case of an errors-in-variables model (Miller 1986; Huffel and Lemmerling 2002). One could set up a joint optimization criterion,
$$\min_{\beta_0,\beta_1,\{\alpha_{0j},\alpha_{1j}\},u_1,\ldots,u_N}\;
\sum_{i=1}^{N}\frac{(y_i-\beta_0-\beta_1 u_i)^2}{\sigma_Y^2}
+\sum_{j\in P}\sum_{i=1}^{N}\frac{(x_{ij}-\alpha_{0j}-\alpha_{1j}u_i)^2}{\sigma_X^2}. \qquad (11)$$
Then it is easy to show that (11) can be solved by an augmented and weighted SVD problem. In detail, we form the augmented data matrix
$$X_a = (\,y : X\,), \qquad (12)$$
and assign weight $\omega_1 = \sigma_X^2/\sigma_Y^2$ to the first column and weight $\omega_j = 1$ to the remaining columns. Then, with
$$v_0 = \begin{pmatrix}\beta_0\\ \alpha_{0j_1}\\ \vdots\\ \alpha_{0j_q}\end{pmatrix},\qquad
v_1 = \begin{pmatrix}\beta_1\\ \alpha_{1j_1}\\ \vdots\\ \alpha_{1j_q}\end{pmatrix}, \qquad (13)$$
(with $q = |P|$) the rank-1 weighted SVD $X_a \approx \mathbf{1}v_0^T + u v_1^T$ solves the optimization problem in (11). Although this approach might seem more principled than our two-step procedure, SPCA has a distinct advantage. Here $\hat{u}_{\theta,1} = X_\theta w_{\theta,1}$, and hence it can be defined for future $x^*$ data and used for predictions. In the errors-in-variables approach, $\hat{u}^{EV} = X_a w^{EV}$, which involves $y$ as well and leaves no obvious estimate for future data. We return to this model in Section 6.
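The weighted rank-1 fit itself is easy to sketch: scaling the columns of the augmented matrix by the square roots of their weights turns (11) into an ordinary rank-1 SVD problem. The function below is an illustration under that reduction, not code from the article; note that the latent score it returns depends on $y$, which is exactly the drawback discussed above.

```python
import numpy as np

def errors_in_variables_fit(X_P, y, sigma2_X=1.0, sigma2_Y=1.0):
    """Rank-1 weighted SVD fit of criterion (11) for the genes in P (sketch)."""
    Xa = np.column_stack([y, X_P])             # augmented matrix (y : X), as in (12)
    w = np.ones(Xa.shape[1])
    w[0] = sigma2_X / sigma2_Y                 # weight omega_1 on the y column
    Xs = Xa * np.sqrt(w)                       # column scaling absorbs the weights

    v0_scaled = Xs.mean(axis=0)                # column offsets (in scaled coordinates)
    U, d, Vt = np.linalg.svd(Xs - v0_scaled, full_matrices=False)
    u_hat = U[:, 0] * d[0]                     # latent variable estimates u_1, ..., u_N
    v1 = Vt[0, :] / np.sqrt(w)                 # un-scale loadings; v1[0] plays the role of beta_1
    return u_hat, v1
```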
This latent-variable model can be easily extended to accommodate multiple components $U_1, \ldots, U_M$. One way of doing this is to assume that
$$Y = \beta_0 + \sum_{m=1}^{M}\beta_m U_m + \varepsilon \qquad (14)$$
and
$$X_j = \alpha_{0j} + \sum_{m=1}^{M}\alpha_{1jm} U_m + \epsilon_j, \qquad j \in P. \qquad (15)$$
Fitting this model proceeds as before, except now we extract $M$ rather than one principal component from $X_\theta$. We study this model more deeply in Section 9.
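A sketch of that extension (again illustrative, paralleling the earlier single-component helper): keep the first $M$ columns of $U_\theta$ and regress $y$ on them jointly.

```python
import numpy as np

def spca_fit_multi(X, y, theta, M=2):
    """Supervised principal components with M components (sketch)."""
    xbar = X.mean(axis=0)
    Xc = X - xbar
    s = (Xc.T @ y) / np.sqrt((Xc ** 2).sum(axis=0))
    keep = np.abs(s) > theta

    U, d, Vt = np.linalg.svd(Xc[:, keep], full_matrices=False)
    Um = U[:, :M]                              # first M supervised principal components
    W = Vt[:M, :].T / d[:M]                    # Xc[:, keep] @ W reproduces Um
    beta = Um.T @ (y - y.mean())               # regression coefficients (Um has orthonormal columns)
    return {"xbar": xbar, "keep": keep, "W": W, "beta": beta, "ybar": y.mean()}
```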
2.3 An Example
The SPCA model anticipates other sources of variation in
the data, unrelated to the response. In fact these sources can be
even stronger than those driving the response, to the extent that
principal components would identify them first. By guiding the
principal components, SPCA extracts the desired components.
We simulated data from a scenario like that of Figure 2. We used 1,000 genes and 40 samples, all with base error model being Gaussian with unit variance. We then defined the mean vectors $\mu_1$ and $\mu_2$ as follows. We divide the samples into consecutive blocks of 10, denoted by the sets $(a, b, c, d)$. Then
$$\mu_{1i} = \begin{cases} -2 & \text{if } i \in a \cup b \\ +2 & \text{otherwise} \end{cases} \qquad (16)$$
and
$$\mu_{2i} = \begin{cases} -1 & \text{if } i \in a \cup c \\ +1 & \text{otherwise.} \end{cases} \qquad (17)$$
The first 200 genes have mean structure $\mu_1$,
$$x_{ij} = \mu_{1i} + \epsilon_{ij}, \qquad j = 1, \ldots, 200,\; i = 1, \ldots, 40. \qquad (18)$$
The next 50 genes have mean structure $\mu_2$,
$$x_{ij} = \mu_{2i} + \epsilon_{ij}, \qquad j = 201, \ldots, 250,\; i = 1, \ldots, 40. \qquad (19)$$
In all cases $\epsilon_{ij} \sim N(0, 1)$, which is also how the remaining 750 genes are defined. Finally, the outcome is generated as $y_i = \alpha \cdot \mu_{1i} + (1 - \alpha) \cdot \mu_{2i} + \varepsilon_i$, where $\varepsilon_i$ is $N(0, 1)$. The first two principal components of $X$ are approximately $\mu_1$ and $\mu_2$ (see Fig. 2).
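A data-generation sketch for this simulation follows; the $-2/+2$ and $-1/+1$ signs follow the reconstruction of (16) and (17) above, and the helper name and seed are illustrative choices.

```python
import numpy as np

def simulate_example(alpha, N=40, p=1000, seed=0):
    """Generate one dataset from the Section 2.3 simulation (sketch)."""
    rng = np.random.default_rng(seed)
    block = np.repeat([0, 1, 2, 3], N // 4)              # consecutive blocks a, b, c, d of 10
    mu1 = np.where(np.isin(block, [0, 1]), -2.0, 2.0)    # -2 on a u b, +2 otherwise
    mu2 = np.where(np.isin(block, [0, 2]), -1.0, 1.0)    # -1 on a u c, +1 otherwise

    X = rng.standard_normal((N, p))                      # base N(0, 1) noise for all genes
    X[:, :200] += mu1[:, None]                           # genes 1-200: mean structure mu1
    X[:, 200:250] += mu2[:, None]                        # genes 201-250: mean structure mu2
    y = alpha * mu1 + (1 - alpha) * mu2 + rng.standard_normal(N)
    return X, y
```

Fitting the supervised procedure on one draw and correlating its predictions with $y$ from an independent draw, over a grid of thresholds, reproduces the kind of comparison summarized in Figure 3.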
We tried various values of $\alpha \in [0, 1]$, as shown in Figure 3. Plotted is the correlation of the supervised principal components predictor with an independent (test set) realization of $y$, as the threshold $\theta$ in the screening process $|s_j| > \theta$ is varied. The number of genes surviving the screening is shown on the horizontal axis. The extreme right end of each plot represents standard principal components regression. When $\alpha = 0$, so that the outcome is correlated with the second principal component, supervised PC easily improves on principal components regression. When $\alpha$ reaches .5, the advantage disappears, but supervised PC does no worse than principal components regression.
3. CONSISTENCY OF SUPERVISED PRINCIPAL COMPONENTS

In Section 9 we show that the standard principal components regression is not consistent as the sample size and number of features grow, whereas supervised principal components is consistent under appropriate assumptions. Because the details are lengthy, we give a summary first and defer the full discussion until Section 9.

We consider a latent variable model of the form (9) and (10) for data with $N$ samples and $p$ features. We denote the full $N \times p$ feature matrix by $X$, and the $N \times p_1$ block of $X$ by $X_1$, corresponding to the features $j \in P$. We assume that as $N \to \infty$, $p/N \to \gamma \in (0, \infty)$ and $p_1/N \to 0$ "fast." Note that $p$ and $p_1$ may be fixed or may approach $\infty$. Given this setup, we prove the following:

Let $\tilde{U}$ be the leading principal component of $X$, and let $\tilde{\beta}$ be the regression coefficient of $Y$ on $\tilde{U}$. Then $\tilde{U}$ is not generally consistent for $U$, and likewise $\tilde{\beta}$ is not generally consistent for $\beta_1$.
Figure 3. Correlation Between the First Supervised Principal Component $u_{\theta,1}$ and a Test Outcome $y$, as the Weight $\alpha$ Given to the First Principal Component in the Data Generation Is Varied [panels (a)-(d)]. The number of genes used by the procedure is shown on the horizontal axis in each panel. The sharp switch in (a) and (b) corresponds to the point at which the order of the principal components is reversed.
