
Showing papers by "Trevor Hastie published in 2006"


Journal ArticleDOI
TL;DR: This work introduces a new method called sparse principal component analysis (SPCA) using the lasso (elastic net) to produce modified principal components with sparse loadings and shows that PCA can be formulated as a regression-type optimization problem.
Abstract: Principal component analysis (PCA) is widely used in data processing and dimensionality reduction. However, PCA suffers from the fact that each principal component is a linear combination of all the original variables, thus it is often difficult to interpret the results. We introduce a new method called sparse principal component analysis (SPCA) using the lasso (elastic net) to produce modified principal components with sparse loadings. We first show that PCA can be formulated as a regression-type optimization problem; sparse loadings are then obtained by imposing the lasso (elastic net) constraint on the regression coefficients. Efficient algorithms are proposed to fit our SPCA models for both regular multivariate data and gene expression arrays. We also give a new formula to compute the total variance of modified principal components. As illustrations, SPCA is applied to real and simulated data with encouraging results.
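To make the regression view concrete, here is a minimal sketch of the simplified one-shot variant: regress the first principal component's scores on the original variables with an elastic-net penalty, which yields a sparse loading vector. The data, `alpha`, and `l1_ratio` are illustrative, and the full SPCA algorithm alternates this regression step with an SVD update rather than stopping after one pass.

```python
# A minimal sketch of sparse PCA via the regression formulation,
# assuming the simplified one-shot variant (not the full alternating
# algorithm of the paper).
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))           # toy data, hypothetical
X -= X.mean(axis=0)                          # center columns

# Ordinary first principal component via SVD.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
z = X @ Vt[0]                                # first PC scores

# Elastic-net regression of the PC scores on X gives sparse loadings.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False)
enet.fit(X, z)
v_sparse = enet.coef_ / np.linalg.norm(enet.coef_)   # normalize
print("nonzero loadings:", np.flatnonzero(v_sparse))
```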

3,102 citations


Journal ArticleDOI
TL;DR: Supervised principal components as mentioned in this paper is similar to conventional principal components analysis except that it uses a subset of the predictors selected based on their association with the outcome, and can be applied to regression and generalized regression problems, such as survival analysis.
Abstract: In regression problems where the number of predictors greatly exceeds the number of observations, conventional regression techniques may produce unsatisfactory results. We describe a technique called supervised principal components that can be applied to this type of problem. Supervised principal components is similar to conventional principal components analysis except that it uses a subset of the predictors selected based on their association with the outcome. Supervised principal components can be applied to regression and generalized regression problems, such as survival analysis. It compares favorably to other techniques for this type of problem, and can also account for the effects of other covariates and help identify which predictor variables are most important. We also provide asymptotic consistency results to help support our empirical findings. These methods could become important tools for DNA microarray data, where they may be used to more accurately diagnose and treat cancer.
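A minimal sketch of the idea, assuming a continuous outcome: score each predictor by its univariate association with y, keep those above a threshold, and use the leading principal component of the retained block as a derived feature. The data and cutoff below are hypothetical; the paper chooses the threshold by cross-validation.

```python
# A minimal sketch of supervised principal components; the cutoff 0.25
# and the synthetic data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, p = 80, 500
X = rng.standard_normal((n, p))
y = X[:, :5].sum(axis=1) + rng.standard_normal(n)   # 5 true signals

X = (X - X.mean(0)) / X.std(0)
scores = np.abs(X.T @ (y - y.mean())) / n            # univariate assoc.
keep = scores > 0.25                                 # hypothetical cutoff

U, s, Vt = np.linalg.svd(X[:, keep], full_matrices=False)
z = U[:, 0] * s[0]                                   # supervised PC score
beta = (z @ y) / (z @ z)                             # simple regression on z
print(f"kept {keep.sum()} predictors; fitted slope {beta:.2f}")
```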

773 citations


Proceedings ArticleDOI
20 Aug 2006
TL;DR: This paper proposes sparse random projections, an approximate algorithm for estimating distances between pairs of points in a high-dimensional vector space, which multiplies A by a random matrix R in R^(D x k), reducing the D dimensions down to just k to speed up the computation.
Abstract: There has been considerable interest in random projections, an approximate algorithm for estimating distances between pairs of points in a high-dimensional vector space. Let A in R^(n x D) be our n points in D dimensions. The method multiplies A by a random matrix R in R^(D x k), reducing the D dimensions down to just k to speed up the computation. R typically consists of entries of standard normal N(0,1). It is well known that random projections preserve pairwise distances in expectation. Achlioptas proposed sparse random projections by replacing the N(0,1) entries in R with entries in {-1, 0, 1} with probabilities {1/6, 2/3, 1/6}, achieving a threefold speedup in processing time. We recommend using R with entries in {-1, 0, 1} with probabilities {1/(2√D), 1 - 1/√D, 1/(2√D)}, achieving a significant √D-fold speedup with little loss in accuracy.
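A minimal sketch of the recommended projection, with illustrative sizes: entries of R are drawn from {+1, 0, -1} with probabilities 1/(2√D), 1 - 1/√D, 1/(2√D), and the projected data are rescaled so that pairwise distances are preserved in expectation.

```python
# A minimal sketch of very sparse random projections as described
# above; n, D, and k are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, D, k = 50, 10_000, 200
A = rng.standard_normal((n, D))

s = np.sqrt(D)                                     # sparsity parameter
R = rng.choice([1.0, 0.0, -1.0], size=(D, k),
               p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
B = (A @ R) * np.sqrt(s / k)                       # variance-one scaling

# Compare one pairwise distance before and after projection.
d_true = np.linalg.norm(A[0] - A[1])
d_proj = np.linalg.norm(B[0] - B[1])
print(f"true {d_true:.2f} vs projected {d_proj:.2f}")
```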

668 citations


Journal ArticleDOI
TL;DR: A gene-expression signature of the hypoxia response, derived from studies of cultured mammary and renal tubular epithelial cells, showed coordinated variation in several human cancers, and was a strong predictor of clinical outcomes in breast and ovarian cancers.
Abstract: Background Inadequate oxygen (hypoxia) triggers a multifaceted cellular response that has important roles in normal physiology and in many human diseases. A transcription factor, hypoxia-inducible factor (HIF), plays a central role in the hypoxia response; its activity is regulated by the oxygen-dependent degradation of the HIF-1α protein. Despite the ubiquity and importance of hypoxia responses, little is known about the variation in the global transcriptional response to hypoxia among different cell types or how this variation might relate to tissue- and cell-specific diseases.

618 citations


Journal ArticleDOI
TL;DR: In this paper, the authors analysed relationships between demersal fish species richness, environment and trawl characteristics using an extensive collection of trawl data from the oceans around New Zealand.
Abstract: We analysed relationships between demersal fish species richness, environment and trawl characteristics using an extensive collection of trawl data from the oceans around New Zealand. Analyses were carried out using both generalised additive models and boosted regression trees (sometimes referred to as 'stochastic gradient boosting'). Depth was the single most important environmental predictor of variation in species richness, with highest richness occurring at depths of 900 to 1000 m, and with a broad plateau of moderately high richness between 400 and 1100 m. Richness was higher both in waters with high surface concentrations of chlorophyll a and in zones of mixing of water bodies of contrasting origins. Local variation in temperature was also important, with lower richness occurring in waters that were cooler than expected given their depth. Variables describing trawl length, trawl speed, and cod-end mesh size made a substantial contribution to analysis outcomes, even though functions fitted for trawl distance and cod-end mesh size were constrained to reflect the known performance of trawl gear. Species richness declined with increasing cod-end mesh size and increasing trawl speed, but increased with increasing trawl distance, reaching a plateau once trawl distances exceeded about 3 nautical miles. Boosted regression trees provided a powerful analysis tool, giving substantially superior predictive performance to generalised additive models, despite the fitting of interaction terms in the latter.
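As a rough illustration of the boosted-regression-tree side of this comparison, the sketch below uses scikit-learn's GradientBoostingRegressor (with subsampling, i.e. stochastic gradient boosting) on synthetic stand-in data; the predictors, response surface, and tuning values are all hypothetical, not the study's.

```python
# A minimal sketch of a boosted regression tree fit; the fish-trawl
# variables and settings here are invented for illustration only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
n = 1000
depth = rng.uniform(50, 1500, n)                   # metres, hypothetical
chl_a = rng.lognormal(0.0, 0.5, n)                 # surface chlorophyll a
richness = (20 * np.exp(-((depth - 950) / 500) ** 2)   # peak near 950 m
            + 3 * np.log(chl_a) + rng.normal(0, 2, n))

X = np.column_stack([depth, chl_a])
brt = GradientBoostingRegressor(subsample=0.5,     # "stochastic" boosting
                                max_depth=3, n_estimators=500,
                                learning_rate=0.01)
brt.fit(X, richness)
print("relative influence (depth, chl a):", brt.feature_importances_)
```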

531 citations


Journal ArticleDOI
TL;DR: Six issues are discussed in a methodological framework for generalized regression: links with ecological theory; optimal use of existing data and artificially generated data; incorporating spatial context; integrating ecological and environmental interactions; assessing prediction errors and uncertainties; and predicting distributions of communities or collective properties of biodiversity.
Abstract: Summary 1. Biogeographical models of species’ distributions are essential tools for assessing impacts of changing environmental conditions on natural communities and ecosystems. Practitioners need more reliable predictions to integrate into conservation planning (e.g. reserve design and management). 2. Most models still largely ignore or inappropriately take into account important features of species’ distributions, such as spatial autocorrelation, dispersal and migration, biotic and environmental interactions. Whether distributions of natural communities or ecosystems are better modelled by assembling individual species’ predictions in a bottom-up approach or modelled as collective entities is another important issue. An international workshop was organized to address these issues. 3. We discuss more specifically six issues in a methodological framework for generalized regression: (i) links with ecological theory; (ii) optimal use of existing data and artificially generated data; (iii) incorporating spatial context; (iv) integrating ecological and environmental interactions; (v) assessing prediction errors and uncertainties; and (vi) predicting distributions of communities or collective properties of biodiversity. 4. Synthesis and applications. Better predictions of the effects of impacts on biological communities and ecosystems can emerge only from more robust species’ distribution models and better documentation of the uncertainty associated with these models. An improved understanding of causes of species’ distributions, especially at their range limits, as well as of ecological assembly rules and ecosystem functioning, is necessary if further progress is to be made. A better collaborative effort between theoretical and functional ecologists, ecological modellers and statisticians is required to reach these goals.

506 citations


Journal ArticleDOI
TL;DR: In this paper, two statistical modelling techniques, generalized additive models (GAM) and multivariate adaptive regression splines (MARS), were used to analyse relationships between the distributions of 15 freshwater fish species and their environment.

392 citations


Book ChapterDOI
22 Jun 2006
TL;DR: An improved version of random projections that takes advantage of marginal norms; using a maximum likelihood estimator (MLE), margin-constrained random projections can improve estimation accuracy considerably.
Abstract: We present an improved version of random projections that takes advantage of marginal norms. Using a maximum likelihood estimator (MLE), margin-constrained random projections can improve estimation accuracy considerably. Theoretical properties of this estimator are analyzed in detail.
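A minimal sketch of the margin-constrained estimator, assuming the norms m1, m2 of both data vectors are stored exactly: the inner product a is then the value maximizing the bivariate-normal likelihood of the projected pairs. The numeric optimizer below stands in for the paper's closed-form MLE; sizes are illustrative.

```python
# A minimal sketch of margin-constrained random projections: projected
# pairs (v1j, v2j) are bivariate normal with covariance
# [[m1, a], [a, m2]], and a is estimated by maximum likelihood.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
D, k = 5000, 50
x1, x2 = rng.standard_normal(D), rng.standard_normal(D)
R = rng.standard_normal((D, k))
v1, v2 = x1 @ R, x2 @ R                        # projections (k dims)

m1, m2 = x1 @ x1, x2 @ x2                      # known marginal norms

def neg_loglik(a):
    det = m1 * m2 - a * a                      # must stay positive
    quad = m2 * (v1 @ v1) - 2 * a * (v1 @ v2) + m1 * (v2 @ v2)
    return 0.5 * k * np.log(det) + 0.5 * quad / det

bound = np.sqrt(m1 * m2) * 0.999
res = minimize_scalar(neg_loglik, bounds=(-bound, bound), method="bounded")
print(f"MLE {res.x:.1f} vs naive {v1 @ v2 / k:.1f} vs true {x1 @ x2:.1f}")
```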

71 citations


Proceedings Article
04 Dec 2006
TL;DR: This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections, showing using real-world data that CRS often outperforms random projections.
Abstract: We develop Conditional Random Sampling (CRS), a technique particularly suitable for sparse data. In large-scale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections. For boolean (0/1) data, CRS is provably better than random projections. We show using real-world data that CRS often outperforms random projections. This technique can be applied in learning, data mining, information retrieval, and database query optimizations.
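A minimal sketch of the basic CRS idea for boolean data, under simplifying assumptions: columns are permuted once, each row's sketch is its first k nonzero column ids in permuted order, and a pair's inner product is estimated from the prefix that both sketches fully cover, scaled up to the full dimension.

```python
# A minimal sketch of Conditional Random Sampling for sparse 0/1 rows;
# data sizes, density, and k are illustrative.
import numpy as np

rng = np.random.default_rng(5)
n, D, k = 2, 10_000, 100
X = (rng.random((n, D)) < 0.05).astype(int)     # sparse boolean rows

perm = rng.permutation(D)                       # shared random permutation
def sketch(row):
    ids = np.sort(perm[np.flatnonzero(row)])    # permuted nonzero ids
    return ids[:k]                              # first k in permuted order

s1, s2 = sketch(X[0]), sketch(X[1])
Ds = min(s1[-1], s2[-1]) + 1                    # prefix both sketches cover
match = len(np.intersect1d(s1[s1 < Ds], s2[s2 < Ds]))
est = match * D / Ds                            # scale the sample count up
print(f"estimated inner product {est:.0f} vs true {X[0] @ X[1]}")
```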

64 citations


01 Jan 2006
TL;DR: This study considers several regularization path algorithms with grouped variable selection for modeling gene-interactions and proposes a path-following algorithm for the group-Lasso method applied to generalized linear models.
Abstract: In this study, we consider several regularization path algorithms with grouped variable selection for modeling gene-interactions. When fitting with categorical factors, including the genotype measurements, we often define a set of dummy variables that represent a single factor/interaction of factors. Yuan & Lin (2006) proposed the group-Lars and the group-Lasso methods through which these groups of indicators can be selected simultaneously. Here we introduce another version of group-Lars. In addition, we propose a path-following algorithm for the group-Lasso method applied to generalized linear models. We then use all these path algorithms, which select the grouped variables in a smooth way, to identify gene-interactions affecting disease status in an example. We further compare their performances to that of L2 penalized logistic regression with forward stepwise variable selection discussed in Park & Hastie (2006b).
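A minimal sketch of the group-Lasso mechanics these path algorithms build on: the dummy variables for one factor form a group, and a proximal gradient step shrinks each group's coefficient block toward zero as a unit, so a factor enters or leaves the model whole. Groups, lambda, and step size here are illustrative; a path algorithm would track solutions across a grid of lambda values.

```python
# A minimal sketch of group-Lasso fitting via proximal gradient
# (ISTA) on a Gaussian linear model; all settings are illustrative.
import numpy as np

rng = np.random.default_rng(6)
n = 200
groups = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9)]  # 3 factors
X = rng.standard_normal((n, 9))
beta_true = np.r_[1.0, -1.0, 0.5, np.zeros(6)]    # only group 0 active
y = X @ beta_true + 0.5 * rng.standard_normal(n)

lam, step = 20.0, 1.0 / np.linalg.norm(X, 2) ** 2  # step = 1/Lipschitz
beta = np.zeros(9)
for _ in range(500):
    grad = X.T @ (X @ beta - y)                   # squared-error gradient
    beta = beta - step * grad
    for g in groups:                              # block soft-threshold
        nrm = np.linalg.norm(beta[g])
        beta[g] = 0.0 if nrm == 0 else beta[g] * max(0, 1 - step * lam / nrm)

print("active groups:", [i for i, g in enumerate(groups)
                         if np.linalg.norm(beta[g]) > 1e-8])
```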

29 citations



01 Jan 2006
TL;DR: This work proposes a sketch-based sampling algorithm, which effectively exploits the data sparsity and combines the advantages of both conventional random sampling and more modern randomized algorithms such as locality-sensitive hashing (LSH).
Abstract: We propose a sketch-based sampling algorithm, which effectively exploits the data sparsity. Sampling methods have become popular in large-scale data mining and information retrieval, where high data sparsity is a norm. A distinct feature of our algorithm is that it combines the advantages of both conventional random sampling and more modern randomized algorithms such as locality-sensitive hashing (LSH). While most sketch-based algorithms are designed for specific summary statistics, our proposed algorithm is a general-purpose technique, useful for estimating any summary statistics including two-way and multi-way distances and joint histograms.

Journal Article
TL;DR: Hastie et al. as mentioned in this paper considered the reproducing kernel Hilbert space HK (RKHS) and showed that the positive definite kernel K(·,·) has a (possibly finite) eigenexpansion.
Abstract: Trevor Hastie is Professor, Department of Statistics, Stanford University, Stanford, California 94305, USA (e-mail: hastie@stanford.edu). Ji Zhu is Assistant Professor, Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109, USA (e-mail: jizhu@umich.edu). ...the reproducing kernel Hilbert space HK (RKHS) generated by K(·,·) (see Burges, 1998; Evgeniou, Pontil and Poggio, 2000; and Wahba, 1999, for details). Suppose the positive definite kernel K(·,·) has a (possibly finite) eigenexpansion,
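The abstract is truncated here, ending just before the expansion itself. For reference, the standard textbook form such an eigenexpansion takes (a reconstruction from standard RKHS theory, not quoted from the paper) is:

```latex
% Mercer eigenexpansion of a positive definite kernel (standard form;
% the gamma_j are non-negative eigenvalues, the phi_j eigenfunctions):
K(x, x') = \sum_{j=1}^{\infty} \gamma_j \,\phi_j(x)\,\phi_j(x'),
\qquad \gamma_j \ge 0,
% so that a function f in the RKHS expands with squared norm
f(x) = \sum_{j=1}^{\infty} \theta_j \phi_j(x),
\qquad \|f\|_{\mathcal{H}_K}^2 = \sum_{j=1}^{\infty} \frac{\theta_j^2}{\gamma_j}.
```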

Posted Content
TL;DR: The moments of the MLE are analyzed and it is established that k = O(log n / ε²) suffices, with the constants explicitly given; both the sample median and the geometric mean estimators are about 80% efficient compared to the MLE.
Abstract: For dimension reduction in $l_1$, the method of {\em Cauchy random projections} multiplies the original data matrix $\mathbf{A} \in\mathbb{R}^{n\times D}$ with a random matrix $\mathbf{R} \in \mathbb{R}^{D\times k}$ ($k\ll\min(n,D)$) whose entries are i.i.d. samples of the standard Cauchy C(0,1). Because of known impossibility results, one cannot hope to recover the pairwise $l_1$ distances in $\mathbf{A}$ from $\mathbf{B} = \mathbf{AR} \in \mathbb{R}^{n\times k}$ using linear estimators without incurring large errors. However, nonlinear estimators are still useful for certain applications in data stream computation, information retrieval, learning, and data mining. We propose three types of nonlinear estimators: the bias-corrected sample median estimator, the bias-corrected geometric mean estimator, and the bias-corrected maximum likelihood estimator. The sample median estimator and the geometric mean estimator are asymptotically (as $k\to \infty$) equivalent, but the latter is more accurate at small $k$. We derive explicit tail bounds for the geometric mean estimator and establish an analog of the Johnson-Lindenstrauss (JL) lemma for dimension reduction in $l_1$, which is weaker than the classical JL lemma for dimension reduction in $l_2$. Asymptotically, both the sample median estimator and the geometric mean estimator are about 80% efficient compared to the maximum likelihood estimator (MLE). We analyze the moments of the MLE and propose approximating the distribution of the MLE by an inverse Gaussian.
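A minimal sketch of the bias-corrected geometric mean estimator described above: the projected differences are i.i.d. Cauchy C(0, d), where d is the pairwise l1 distance, and since E|x|^(1/k) = d^(1/k)/cos(π/(2k)) for x ~ C(0, d), multiplying the product of the |x_j|^(1/k) by cos(π/(2k))^k gives an unbiased estimate of d. Sizes are illustrative.

```python
# A minimal sketch of the bias-corrected geometric mean estimator for
# l1 distances under Cauchy random projections; n, D, k illustrative.
import numpy as np

rng = np.random.default_rng(7)
n, D, k = 2, 1000, 100
A = rng.standard_normal((n, D))
R = rng.standard_cauchy((D, k))                # i.i.d. C(0,1) entries
B = A @ R                                      # projected data

diff = np.abs(B[0] - B[1])                     # i.i.d. C(0, d), d = l1 dist
d_hat = np.cos(np.pi / (2 * k)) ** k * np.prod(diff ** (1.0 / k))
print(f"estimated l1 {d_hat:.1f} vs true {np.abs(A[0] - A[1]).sum():.1f}")
```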

Journal ArticleDOI
TL;DR: A prognostic signature based upon mRNA expression of 15 genes has been identified using FFPE sections with RT-PCR for early stage, N-, ER+ patients and can help clinicians and patients to choose among different therapeutic options.
Abstract: 506 Background: Gene expression profiles have been shown to predict distant metastasis risk in breast cancer patients. For routine medical practice, molecular prognostic tests need to be able to qu...