
Showing papers by "Trevor Hastie published in 2006"


Journal ArticleDOI
TL;DR: This work introduces a new method called sparse principal component analysis (SPCA) using the lasso (elastic net) to produce modified principal components with sparse loadings and shows that PCA can be formulated as a regression-type optimization problem.
Abstract: Principal component analysis (PCA) is widely used in data processing and dimensionality reduction. However, PCA suffers from the fact that each principal component is a linear combination of all the original variables, thus it is often difficult to interpret the results. We introduce a new method called sparse principal component analysis (SPCA) using the lasso (elastic net) to produce modified principal components with sparse loadings. We first show that PCA can be formulated as a regression-type optimization problem; sparse loadings are then obtained by imposing the lasso (elastic net) constraint on the regression coefficients. Efficient algorithms are proposed to fit our SPCA models for both regular multivariate data and gene expression arrays. We also give a new formula to compute the total variance of modified principal components. As illustrations, SPCA is applied to real and simulated data with encouraging results.
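To make the regression view concrete, here is a minimal sketch of the simplified one-shot variant: regress the first principal component's scores on the original variables with an elastic-net penalty, which yields a sparse loading vector. The data, `alpha`, and `l1_ratio` are illustrative, and the full SPCA algorithm alternates this regression step with an SVD update rather than stopping after one pass.

```python
# A minimal sketch of sparse PCA via the regression formulation,
# assuming the simplified one-shot variant (not the full alternating
# algorithm of the paper).
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))           # toy data, hypothetical
X -= X.mean(axis=0)                          # center columns

# Ordinary first principal component via SVD.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
z = X @ Vt[0]                                # first PC scores

# Elastic-net regression of the PC scores on X gives sparse loadings.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False)
enet.fit(X, z)
v_sparse = enet.coef_ / np.linalg.norm(enet.coef_)   # normalize
print("nonzero loadings:", np.flatnonzero(v_sparse))
```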

3,102 citations


Journal ArticleDOI
TL;DR: Supervised principal components as mentioned in this paper is similar to conventional principal components analysis except that it uses a subset of the predictors selected based on their association with the outcome, and can be applied to regression and generalized regression problems, such as survival analysis.
Abstract: In regression problems where the number of predictors greatly exceeds the number of observations, conventional regression techniques may produce unsatisfactory results. We describe a technique called supervised principal components that can be applied to this type of problem. Supervised principal components is similar to conventional principal components analysis except that it uses a subset of the predictors selected based on their association with the outcome. Supervised principal components can be applied to regression and generalized regression problems, such as survival analysis. It compares favorably to other techniques for this type of problem, and can also account for the effects of other covariates and help identify which predictor variables are most important. We also provide asymptotic consistency results to help support our empirical findings. These methods could become important tools for DNA microarray data, where they may be used to more accurately diagnose and treat cancer.
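A minimal sketch of the idea, assuming a continuous outcome: score each predictor by its univariate association with y, keep those above a threshold, and use the leading principal component of the retained block as a derived feature. The data and cutoff below are hypothetical; the paper chooses the threshold by cross-validation.

```python
# A minimal sketch of supervised principal components; the cutoff 0.25
# and the synthetic data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, p = 80, 500
X = rng.standard_normal((n, p))
y = X[:, :5].sum(axis=1) + rng.standard_normal(n)   # 5 true signals

X = (X - X.mean(0)) / X.std(0)
scores = np.abs(X.T @ (y - y.mean())) / n            # univariate assoc.
keep = scores > 0.25                                 # hypothetical cutoff

U, s, Vt = np.linalg.svd(X[:, keep], full_matrices=False)
z = U[:, 0] * s[0]                                   # supervised PC score
beta = (z @ y) / (z @ z)                             # simple regression on z
print(f"kept {keep.sum()} predictors; fitted slope {beta:.2f}")
```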

773 citations


Proceedings ArticleDOI
20 Aug 2006
TL;DR: This paper proposes sparse random projections, an approximate algorithm for estimating distances between pairs of points in a high-dimensional vector space, which multiplies A by a random matrix R in R^(D x k), reducing the D dimensions down to just k to speed up the computation.
Abstract: There has been considerable interest in random projections, an approximate algorithm for estimating distances between pairs of points in a high-dimensional vector space. Let A in R^(n x D) be our n points in D dimensions. The method multiplies A by a random matrix R in R^(D x k), reducing the D dimensions down to just k to speed up the computation. R typically consists of entries of standard normal N(0,1). It is well known that random projections preserve pairwise distances in expectation. Achlioptas proposed sparse random projections by replacing the N(0,1) entries in R with entries in {-1, 0, 1} with probabilities {1/6, 2/3, 1/6}, achieving a threefold speedup in processing time. We recommend using R with entries in {-1, 0, 1} with probabilities {1/(2√D), 1 - 1/√D, 1/(2√D)}, achieving a significant √D-fold speedup with little loss in accuracy.
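A minimal sketch of the recommended projection, with illustrative sizes: entries of R are drawn from {+1, 0, -1} with probabilities 1/(2√D), 1 - 1/√D, 1/(2√D), and the projected data are rescaled so that pairwise distances are preserved in expectation.

```python
# A minimal sketch of very sparse random projections as described
# above; n, D, and k are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, D, k = 50, 10_000, 200
A = rng.standard_normal((n, D))

s = np.sqrt(D)                                     # sparsity parameter
R = rng.choice([1.0, 0.0, -1.0], size=(D, k),
               p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
B = (A @ R) * np.sqrt(s / k)                       # variance-one scaling

# Compare one pairwise distance before and after projection.
d_true = np.linalg.norm(A[0] - A[1])
d_proj = np.linalg.norm(B[0] - B[1])
print(f"true {d_true:.2f} vs projected {d_proj:.2f}")
```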

668 citations


Journal ArticleDOI
TL;DR: A gene-expression signature of the hypoxia response, derived from studies of cultured mammary and renal tubular epithelial cells, showed coordinated variation in several human cancers, and was a strong predictor of clinical outcomes in breast and ovarian cancers.
Abstract: Background Inadequate oxygen (hypoxia) triggers a multifaceted cellular response that has important roles in normal physiology and in many human diseases. A transcription factor, hypoxia-inducible factor (HIF), plays a central role in the hypoxia response; its activity is regulated by the oxygen-dependent degradation of the HIF-1α protein. Despite the ubiquity and importance of hypoxia responses, little is known about the variation in the global transcriptional response to hypoxia among different cell types or how this variation might relate to tissue- and cell-specific diseases.

618 citations


Journal ArticleDOI
TL;DR: In this paper, the authors analysed relationships between demersal fish species richness, environment and trawl characteristics using an extensive collection of trawl data from the oceans around New Zealand.
Abstract: We analysed relationships between demersal fish species richness, environment and trawl characteristics using an extensive collection of trawl data from the oceans around New Zealand. Analyses were carried out using both generalised additive models and boosted regression trees (sometimes referred to as 'stochastic gradient boosting'). Depth was the single most important environmental predictor of variation in species richness, with highest richness occurring at depths of 900 to 1000 m, and with a broad plateau of moderately high richness between 400 and 1100 m. Richness was higher both in waters with high surface concentrations of chlorophyll a and in zones of mixing of water bodies of contrasting origins. Local variation in temperature was also important, with lower richness occurring in waters that were cooler than expected given their depth. Variables describing trawl length, trawl speed, and cod-end mesh size made a substantial contribution to analysis outcomes, even though functions fitted for trawl distance and cod-end mesh size were constrained to reflect the known performance of trawl gear. Species richness declined with increasing cod-end mesh size and increasing trawl speed, but increased with increasing trawl distance, reaching a plateau once trawl distances exceeded about 3 nautical miles. Boosted regression trees provided a powerful analysis tool, giving substantially superior predictive performance to generalised additive models, despite the fitting of interaction terms in the latter.
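As a rough illustration of the boosted-regression-tree side of this comparison, the sketch below uses scikit-learn's GradientBoostingRegressor (with subsampling, i.e. stochastic gradient boosting) on synthetic stand-in data; the predictors, response surface, and tuning values are all hypothetical, not the study's.

```python
# A minimal sketch of a boosted regression tree fit; the fish-trawl
# variables and settings here are invented for illustration only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
n = 1000
depth = rng.uniform(50, 1500, n)                   # metres, hypothetical
chl_a = rng.lognormal(0.0, 0.5, n)                 # surface chlorophyll a
richness = (20 * np.exp(-((depth - 950) / 500) ** 2)   # peak near 950 m
            + 3 * np.log(chl_a) + rng.normal(0, 2, n))

X = np.column_stack([depth, chl_a])
brt = GradientBoostingRegressor(subsample=0.5,     # "stochastic" boosting
                                max_depth=3, n_estimators=500,
                                learning_rate=0.01)
brt.fit(X, richness)
print("relative influence (depth, chl a):", brt.feature_importances_)
```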

531 citations


Journal ArticleDOI
TL;DR: Six issues are discussed in a methodological framework for generalized regression: links with ecological theory; optimal use of existing data and artificially generated data; incorporating spatial context; integrating ecological and environmental interactions; assessing prediction errors and uncertainties; and predicting distributions of communities or collective properties of biodiversity.
Abstract: Summary 1. Biogeographical models of species’ distributions are essential tools for assessing impacts of changing environmental conditions on natural communities and ecosystems. Practitioners need more reliable predictions to integrate into conservation planning (e.g. reserve design and management). 2. Most models still largely ignore or inappropriately take into account important features of species’ distributions, such as spatial autocorrelation, dispersal and migration, biotic and environmental interactions. Whether distributions of natural communities or ecosystems are better modelled by assembling individual species’ predictions in a bottom-up approach or modelled as collective entities is another important issue. An international workshop was organized to address these issues. 3. We discuss more specifically six issues in a methodological framework for generalized regression: (i) links with ecological theory; (ii) optimal use of existing data and artificially generated data; (iii) incorporating spatial context; (iv) integrating ecological and environmental interactions; (v) assessing prediction errors and uncertainties; and (vi) predicting distributions of communities or collective properties of biodiversity. 4. Synthesis and applications. Better predictions of the effects of impacts on biological communities and ecosystems can emerge only from more robust species’ distribution models and better documentation of the uncertainty associated with these models. An improved understanding of causes of species’ distributions, especially at their range limits, as well as of ecological assembly rules and ecosystem functioning, is necessary if further progress is to be made. A better collaborative effort between theoretical and functional ecologists, ecological modellers and statisticians is required to reach these goals.

506 citations


Journal ArticleDOI
TL;DR: In this paper, two statistical modelling techniques, generalized additive models (GAM) and multivariate adaptive regression splines (MARS), were used to analyse relationships between the distributions of 15 freshwater fish species and their environment.

392 citations


Book ChapterDOI
22 Jun 2006
TL;DR: An improved version of random projections that takes advantage of marginal norms; using a maximum likelihood estimator (MLE), margin-constrained random projections can improve estimation accuracy considerably.
Abstract: We present an improved version of random projections that takes advantage of marginal norms. Using a maximum likelihood estimator (MLE), margin-constrained random projections can improve estimation accuracy considerably. Theoretical properties of this estimator are analyzed in detail.
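A minimal sketch of the margin-constrained estimator, assuming the norms m1, m2 of both data vectors are stored exactly: the inner product a is then the value maximizing the bivariate-normal likelihood of the projected pairs. The numeric optimizer below stands in for the paper's closed-form MLE; sizes are illustrative.

```python
# A minimal sketch of margin-constrained random projections: projected
# pairs (v1j, v2j) are bivariate normal with covariance
# [[m1, a], [a, m2]], and a is estimated by maximum likelihood.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
D, k = 5000, 50
x1, x2 = rng.standard_normal(D), rng.standard_normal(D)
R = rng.standard_normal((D, k))
v1, v2 = x1 @ R, x2 @ R                        # projections (k dims)

m1, m2 = x1 @ x1, x2 @ x2                      # known marginal norms

def neg_loglik(a):
    det = m1 * m2 - a * a                      # must stay positive
    quad = m2 * (v1 @ v1) - 2 * a * (v1 @ v2) + m1 * (v2 @ v2)
    return 0.5 * k * np.log(det) + 0.5 * quad / det

bound = np.sqrt(m1 * m2) * 0.999
res = minimize_scalar(neg_loglik, bounds=(-bound, bound), method="bounded")
print(f"MLE {res.x:.1f} vs naive {v1 @ v2 / k:.1f} vs true {x1 @ x2:.1f}")
```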

71 citations


Proceedings Article
04 Dec 2006
TL;DR: This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections, showing using real-world data that CRS often outperforms random projections.
Abstract: We develop Conditional Random Sampling (CRS), a technique particularly suitable for sparse data. In large-scale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections. For boolean (0/1) data, CRS is provably better than random projections. We show using real-world data that CRS often outperforms random projections. This technique can be applied in learning, data mining, information retrieval, and database query optimizations.
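A minimal sketch of the basic CRS idea for boolean data, under simplifying assumptions: columns are permuted once, each row's sketch is its first k nonzero column ids in permuted order, and a pair's inner product is estimated from the prefix that both sketches fully cover, scaled up to the full dimension.

```python
# A minimal sketch of Conditional Random Sampling for sparse 0/1 rows;
# data sizes, density, and k are illustrative.
import numpy as np

rng = np.random.default_rng(5)
n, D, k = 2, 10_000, 100
X = (rng.random((n, D)) < 0.05).astype(int)     # sparse boolean rows

perm = rng.permutation(D)                       # shared random permutation
def sketch(row):
    ids = np.sort(perm[np.flatnonzero(row)])    # permuted nonzero ids
    return ids[:k]                              # first k in permuted order

s1, s2 = sketch(X[0]), sketch(X[1])
Ds = min(s1[-1], s2[-1]) + 1                    # prefix both sketches cover
match = len(np.intersect1d(s1[s1 < Ds], s2[s2 < Ds]))
est = match * D / Ds                            # scale the sample count up
print(f"estimated inner product {est:.0f} vs true {X[0] @ X[1]}")
```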

64 citations


01 Jan 2006
TL;DR: This study considers several regularization path algorithms with grouped variable selection for modeling gene-interactions and proposes a path-following algorithm for the group-Lasso method applied to generalized linear models.
Abstract: In this study, we consider several regularization path algorithms with grouped variable selection for modeling gene-interactions. When fitting with categorical factors, including the genotype measurements, we often define a set of dummy variables that represent a single factor/interaction of factors. Yuan & Lin (2006) proposed the group-Lars and the group-Lasso methods through which these groups of indicators can be selected simultaneously. Here we introduce another version of group-Lars. In addition, we propose a path-following algorithm for the group-Lasso method applied to generalized linear models. We then use all these path algorithms, which select the grouped variables in a smooth way, to identify gene-interactions affecting disease status in an example. We further compare their performances to that of L2 penalized logistic regression with forward stepwise variable selection discussed in Park & Hastie (2006b).
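A minimal sketch of the group-Lasso mechanics these path algorithms build on: the dummy variables for one factor form a group, and a proximal gradient step shrinks each group's coefficient block toward zero as a unit, so a factor enters or leaves the model whole. Groups, lambda, and step size here are illustrative; a path algorithm would track solutions across a grid of lambda values.

```python
# A minimal sketch of group-Lasso fitting via proximal gradient
# (ISTA) on a Gaussian linear model; all settings are illustrative.
import numpy as np

rng = np.random.default_rng(6)
n = 200
groups = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9)]  # 3 factors
X = rng.standard_normal((n, 9))
beta_true = np.r_[1.0, -1.0, 0.5, np.zeros(6)]    # only group 0 active
y = X @ beta_true + 0.5 * rng.standard_normal(n)

lam, step = 20.0, 1.0 / np.linalg.norm(X, 2) ** 2  # step = 1/Lipschitz
beta = np.zeros(9)
for _ in range(500):
    grad = X.T @ (X @ beta - y)                   # squared-error gradient
    beta = beta - step * grad
    for g in groups:                              # block soft-threshold
        nrm = np.linalg.norm(beta[g])
        beta[g] = 0.0 if nrm == 0 else beta[g] * max(0, 1 - step * lam / nrm)

print("active groups:", [i for i, g in enumerate(groups)
                         if np.linalg.norm(beta[g]) > 1e-8])
```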

29 citations



01 Jan 2006
TL;DR: This work proposes a sketch-based sampling algorithm, which effectively exploits the data sparsity and combines the advantages of both conventional random sampling and more modern randomized algorithms such as locality-sensitive hashing (LSH).
Abstract: We propose a sketch-based sampling algorithm, which effectively exploits the data sparsity. Sampling methods have become popular in large-scale data mining and information retrieval, where high data sparsity is a norm. A distinct feature of our algorithm is that it combines the advantages of both conventional random sampling and more modern randomized algorithms such as locality-sensitive hashing (LSH). While most sketch-based algorithms are designed for specific summary statistics, our proposed algorithm is a general-purpose technique, useful for estimating any summary statistics including two-way and multi-way distances and joint histograms.

Journal Article
TL;DR: Hastie et al. as mentioned in this paper considered the reproducing kernel Hilbert space HK (RKHS) and showed that the positive definite kernel K(·,·) has a (possibly finite) eigenexpansion.
Abstract: Trevor Hastie is Professor, Department of Statistics, Stanford University, Stanford, California 94305, USA (e-mail: hastie@stanford.edu). Ji Zhu is Assistant Professor, Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109, USA (e-mail: jizhu@umich.edu). ...the reproducing kernel Hilbert space HK (RKHS) generated by K(·,·) (see Burges, 1998; Evgeniou, Pontil and Poggio, 2000; and Wahba, 1999, for details). Suppose the positive definite kernel K(·,·) has a (possibly finite) eigenexpansion,
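The abstract is truncated here, ending just before the expansion itself. For reference, the standard textbook form such an eigenexpansion takes (a reconstruction from standard RKHS theory, not quoted from the paper) is:

```latex
% Mercer eigenexpansion of a positive definite kernel (standard form;
% the gamma_j are non-negative eigenvalues, the phi_j eigenfunctions):
K(x, x') = \sum_{j=1}^{\infty} \gamma_j \,\phi_j(x)\,\phi_j(x'),
\qquad \gamma_j \ge 0,
% so that a function f in the RKHS expands with squared norm
f(x) = \sum_{j=1}^{\infty} \theta_j \phi_j(x),
\qquad \|f\|_{\mathcal{H}_K}^2 = \sum_{j=1}^{\infty} \frac{\theta_j^2}{\gamma_j}.
```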

Posted Content
TL;DR: The moments of the MLE are analyzed and it is established that k = O(log n / ε²) suffices, with the constants explicitly given; both the sample median and the geometric mean estimators are about 80% efficient compared to the MLE.
Abstract: For dimension reduction in $l_1$, the method of {\em Cauchy random projections} multiplies the original data matrix $\mathbf{A} \in\mathbb{R}^{n\times D}$ with a random matrix $\mathbf{R} \in \mathbb{R}^{D\times k}$ ($k\ll\min(n,D)$) whose entries are i.i.d. samples of the standard Cauchy C(0,1). Because of known impossibility results, one cannot hope to recover the pairwise $l_1$ distances in $\mathbf{A}$ from $\mathbf{B} = \mathbf{AR} \in \mathbb{R}^{n\times k}$ using linear estimators without incurring large errors. However, nonlinear estimators are still useful for certain applications in data stream computation, information retrieval, learning, and data mining. We propose three types of nonlinear estimators: the bias-corrected sample median estimator, the bias-corrected geometric mean estimator, and the bias-corrected maximum likelihood estimator. The sample median estimator and the geometric mean estimator are asymptotically (as $k\to \infty$) equivalent, but the latter is more accurate at small $k$. We derive explicit tail bounds for the geometric mean estimator and establish an analog of the Johnson-Lindenstrauss (JL) lemma for dimension reduction in $l_1$, which is weaker than the classical JL lemma for dimension reduction in $l_2$. Asymptotically, both the sample median estimator and the geometric mean estimator are about 80% efficient compared to the maximum likelihood estimator (MLE). We analyze the moments of the MLE and propose approximating the distribution of the MLE by an inverse Gaussian.
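A minimal sketch of the bias-corrected geometric mean estimator described above: the projected differences are i.i.d. Cauchy C(0, d), where d is the pairwise l1 distance, and since E|x|^(1/k) = d^(1/k)/cos(π/(2k)) for x ~ C(0, d), multiplying the product of the |x_j|^(1/k) by cos(π/(2k))^k gives an unbiased estimate of d. Sizes are illustrative.

```python
# A minimal sketch of the bias-corrected geometric mean estimator for
# l1 distances under Cauchy random projections; n, D, k illustrative.
import numpy as np

rng = np.random.default_rng(7)
n, D, k = 2, 1000, 100
A = rng.standard_normal((n, D))
R = rng.standard_cauchy((D, k))                # i.i.d. C(0,1) entries
B = A @ R                                      # projected data

diff = np.abs(B[0] - B[1])                     # i.i.d. C(0, d), d = l1 dist
d_hat = np.cos(np.pi / (2 * k)) ** k * np.prod(diff ** (1.0 / k))
print(f"estimated l1 {d_hat:.1f} vs true {np.abs(A[0] - A[1]).sum():.1f}")
```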

Journal ArticleDOI
TL;DR: A prognostic signature based upon mRNA expression of 15 genes has been identified using FFPE sections with RT-PCR for early stage, N-, ER+ patients and can help clinicians and patients to choose among different therapeutic options.
Abstract: 506 Background: Gene expression profiles have been shown to predict distant metastasis risk in breast cancer patients. For routine medical practice, molecular prognostic tests need to be able to qu...