
Showing papers by Trevor Hastie published in 2015


BookDOI
07 May 2015
TL;DR: Statistical Learning with Sparsity: The Lasso and Generalizations presents methods that exploit sparsity to help recover the underlying signal in a set of data and extract useful and reproducible patterns from big datasets.
Abstract: Discover New Methods for Dealing with High-Dimensional Data. A sparse statistical model has only a small number of nonzero parameters or weights; therefore, it is much easier to estimate and interpret than a dense model. Statistical Learning with Sparsity: The Lasso and Generalizations presents methods that exploit sparsity to help recover the underlying signal in a set of data. Top experts in this rapidly evolving field, the authors describe the lasso for linear regression and a simple coordinate descent algorithm for its computation. They discuss the application of $\ell_1$ penalties to generalized linear models and support vector machines, cover generalized penalties such as the elastic net and group lasso, and review numerical methods for optimization. They also present statistical inference methods for fitted (lasso) models, including the bootstrap, Bayesian methods, and recently developed approaches. In addition, the book examines matrix decomposition, sparse multivariate analysis, graphical models, and compressed sensing. It concludes with a survey of theoretical results for the lasso. In this age of big data, the number of features measured on a person or object can be large and might be larger than the number of observations. This book shows how the sparsity assumption allows us to tackle these problems and extract useful and reproducible patterns from big datasets. Data analysts, computer scientists, and theorists will appreciate this thorough and up-to-date treatment of sparse statistical modeling.
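To make the coordinate descent algorithm mentioned above concrete, here is a minimal sketch for the lasso objective $\frac{1}{2n}\|y-X\beta\|_2^2+\lambda\|\beta\|_1$. It assumes standardized predictors and a centered response, and is illustrative only, not code from the book.

```r
# Minimal coordinate-descent lasso sketch (illustrative, not from the book).
# Assumes each column of X is standardized so that crossprod(X[, j]) / n == 1,
# and y is centered (no intercept).
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

lasso_cd <- function(X, y, lambda, n_iter = 100) {
  n <- nrow(X)
  p <- ncol(X)
  beta <- rep(0, p)
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      # partial residual with predictor j removed
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      # univariate least-squares coefficient, soft-thresholded at lambda
      beta[j] <- soft_threshold(crossprod(X[, j], r_j) / n, lambda)
    }
  }
  beta
}
```

In practice one would use an optimized implementation such as the glmnet R package, which computes the entire regularization path with warm starts.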

2,275 citations


Journal ArticleDOI
TL;DR: In this article, the authors bring the two approaches together, leading to an efficient algorithm for large matrix factorization and completion that outperforms both, and they develop an R software package, softImpute, for implementing their approach.
Abstract: The matrix-completion problem has attracted a lot of attention, largely as a result of the celebrated Netflix competition. Two popular approaches for solving the problem are nuclear-norm-regularized matrix approximation (Candes and Tao, 2009; Mazumder et al., 2010), and maximum-margin matrix factorization (Srebro et al., 2005). These two procedures are in some cases solving equivalent problems, but with quite different algorithms. In this article we bring the two approaches together, leading to an efficient algorithm for large matrix factorization and completion that outperforms both of these. We develop a software package softImpute in R for implementing our approaches, and a distributed version for very large matrices using the Spark cluster programming environment.
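As a usage illustration, here is a hedged sketch using the softImpute package's documented interface; the toy matrix and parameter values are ours, not the paper's.

```r
# Hedged sketch of nuclear-norm-regularized matrix completion with softImpute.
library(softImpute)

set.seed(1)
X <- matrix(rnorm(200 * 50), 200, 50)   # toy complete matrix
X[sample(length(X), 4000)] <- NA        # remove 40% of entries at random

# ALS fit with a rank cap and nuclear-norm penalty lambda (values illustrative)
fit <- softImpute(X, rank.max = 10, lambda = 2, type = "als")

# Fill in the missing entries with the low-rank reconstruction
X_hat <- complete(X, fit)
```

For very large problems the package also offers a sparse "Incomplete" matrix representation, and the abstract notes a distributed Spark implementation.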

366 citations


Journal ArticleDOI
TL;DR: A probabilistic model for the joint analysis of presence-only and survey data is proposed that exploits their complementary strengths; by borrowing strength across species that share the same sampling bias, it can obtain an unbiased estimate of a species' geographic range even when only presence-only data are available for that species.
Abstract: Presence-only records may provide data on the distributions of rare species, but commonly suffer from large, unknown biases due to their typically haphazard collection schemes. Presence-absence or count data collected in systematic, planned surveys are more reliable but typically less abundant. We propose a probabilistic model to allow for joint analysis of presence-only and survey data to exploit their complementary strengths. Our method pools presence-only and presence-absence data for many species and maximizes a joint likelihood, simultaneously estimating and adjusting for the sampling bias affecting the presence-only data. By assuming that the sampling bias is the same for all species, we can borrow strength across species to efficiently estimate the bias and improve our inference from presence-only data. We evaluate our model's performance on data for 36 eucalypt species in south-eastern Australia. We find that presence-only records exhibit a strong sampling bias towards the coast and towards Sydney, the largest city. Our data-pooling technique substantially improves the out-of-sample predictive performance of our model when the amount of available presence-absence data for a given species is scarce. If we have only presence-only data and no presence-absence data for a given species, but both types of data for several other species that suffer from the same spatial sampling bias, then our method can obtain an unbiased estimate of the first species' geographic range.
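Schematically, and in notation of ours rather than the paper's, the shared-bias assumption can be written as a thinned intensity for species $k$:

$$\lambda_k^{\mathrm{PO}}(s) = \lambda_k(s)\, b(s), \qquad \log \lambda_k(s) = \alpha_k + \beta_k^{\top} x(s), \qquad \log b(s) = \delta^{\top} z(s),$$

so the observation-bias factor $b(s)$ is common to all species and can be estimated by pooling presence-only records across species, while the presence-absence surveys inform the species intensities $\lambda_k(s)$ directly.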

313 citations


Journal ArticleDOI
TL;DR: In this article, point process models, some of their advantages, and some common methods of fitting them to presence-only data are reviewed, including ways forward on difficult issues such as accounting for sampling bias.
Abstract: Presence-only data are widely used for species distribution modelling, and point process regression models are a flexible tool that has considerable potential for this problem, when data arise as point events. In this paper, we review point process models, some of their advantages and some common methods of fitting them to presence-only data. Advantages include (and are not limited to) clarification of what the response variable is that is modelled; a framework for choosing the number and location of quadrature points (commonly referred to as pseudo-absences or ‘background points’) objectively; clarity of model assumptions and tools for checking them; models to handle spatial dependence between points when it is present; and ways forward regarding difficult issues such as accounting for sampling bias. Point process models are related to some common approaches to presence-only species distribution modelling, which means that a variety of different software tools can be used to fit these models, including maxent or generalised linear modelling software.
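For readers unfamiliar with the quadrature scheme referred to above, the standard device (a summary of a well-known approximation, not text from the paper) replaces the integral in the inhomogeneous Poisson process log-likelihood with a weighted sum over quadrature points:

$$\ell(\beta) = \sum_{i \in \text{presences}} \log \lambda(s_i;\beta) - \int_A \lambda(s;\beta)\,ds \;\approx\; \sum_{i \in \text{presences}} \log \lambda(s_i;\beta) - \sum_{j \in \text{quadrature}} w_j\, \lambda(s_j;\beta),$$

with $\log \lambda(s;\beta) = \beta^{\top}x(s)$ typically log-linear in the environmental covariates $x(s)$, which is why standard GLM or maxent software can be used for the fitting.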

301 citations


Journal ArticleDOI
TL;DR: This work introduces a method for learning pairwise interactions in a linear regression or logistic regression model in a manner that satisfies strong hierarchy: whenever an interaction is estimated to be nonzero, both its associated main effects are also included in the model.
Abstract: We introduce a method for learning pairwise interactions in a linear regression or logistic regression model in a manner that satisfies strong hierarchy: whenever an interaction is estimated to be nonzero, both its associated main effects are also included in the model. We motivate our approach by modeling pairwise interactions for categorical variables with arbitrary numbers of levels, and then show how we can accommodate continuous variables as well. Our approach allows us to dispense with explicitly applying constraints on the main effects and interactions for identifiability, which results in interpretable interaction models. We compare our method with existing approaches on both simulated and real data, including a genome-wide association study, all using our R package glinternet.
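A hedged usage sketch with the glinternet package named in the abstract; the simulated data and settings are ours.

```r
# Hedged sketch: hierarchical pairwise interactions with glinternet.
library(glinternet)

set.seed(1)
n <- 500
x_cat  <- sample(0:2, n, replace = TRUE)   # categorical predictor, 3 levels coded 0,1,2
x_cont <- matrix(rnorm(n * 3), n, 3)       # three continuous predictors
X <- cbind(x_cat, x_cont)
y <- 1.5 * x_cont[, 1] + (x_cat == 1) * x_cont[, 2] + rnorm(n)

# numLevels: number of levels for categorical variables, 1 for continuous ones
numLevels <- c(3, 1, 1, 1)

cvfit <- glinternet.cv(X, y, numLevels, nFolds = 5)
coef(cvfit)   # main effects and interactions obeying strong hierarchy
```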

220 citations


Journal ArticleDOI
TL;DR: This work presents a new pairwise model for graphical models with both continuous and discrete variables that is amenable to structure learning and involves a novel symmetric use of the group-lasso norm.
Abstract: We consider the problem of learning the structure of a pairwise graphical model over continuous and discrete variables. We present a new pairwise model for graphical models with both continuous and discrete variables that is amenable to structure learning. In previous work, authors have considered structure learning of Gaussian graphical models and structure learning of discrete models. Our approach is a natural generalization of these two lines of work to the mixed case. The penalization scheme involves a novel symmetric use of the group-lasso norm and follows naturally from a particular parameterization of the model. Supplementary materials for this article are available online.
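Schematically, and in notation that may differ from the paper's, the pairwise density over continuous variables $x$ and discrete variables $y$ combines Gaussian, mixed, and discrete edge terms:

$$p(x,y) \propto \exp\Big(\sum_s \alpha_s x_s - \tfrac{1}{2}\sum_{s,t}\beta_{st}x_s x_t + \sum_{s,j}\rho_{sj}(y_j)\,x_s + \sum_{j,r}\phi_{jr}(y_j,y_r)\Big),$$

and structure is learned by penalizing each edge's parameter block (a scalar $\beta_{st}$, a vector $\rho_{sj}(\cdot)$, or a table $\phi_{jr}(\cdot,\cdot)$) with a group-lasso norm, so that entire edges are set to zero together.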

154 citations


Journal ArticleDOI
TL;DR: The leiomyosarcoma subtypes showed significant differences in expression levels for genes for which novel targeted therapies are being developed, suggesting that leiomyosarcoma subtypes may respond differentially to these targeted therapies.
Abstract: Purpose: Leiomyosarcoma is a malignant neoplasm with smooth muscle differentiation. Little is known about its molecular heterogeneity and no targeted therapy currently exists for leiomyosarcoma. Recognition of different molecular subtypes is necessary to evaluate novel therapeutic options. In a previous study on 51 leiomyosarcomas, we identified three molecular subtypes in leiomyosarcoma. The current study was performed to determine whether the existence of these subtypes could be confirmed in independent cohorts. Experimental Design: Ninety-nine cases of leiomyosarcoma were expression profiled with 3′end RNA-Sequencing (3SEQ). Consensus clustering was conducted to determine the optimal number of subtypes. Results: We identified 3 leiomyosarcoma molecular subtypes and confirmed this finding by analyzing publicly available data on 82 leiomyosarcomas from The Cancer Genome Atlas (TCGA). We identified two new formalin-fixed, paraffin-embedded tissue-compatible diagnostic immunohistochemical markers: LMOD1 for subtype I leiomyosarcoma and ARL4C for subtype II leiomyosarcoma. A leiomyosarcoma tissue microarray with known clinical outcome was used to show that subtype I leiomyosarcoma is associated with good outcome in extrauterine leiomyosarcoma, while subtype II leiomyosarcoma is associated with poor prognosis in both uterine and extrauterine leiomyosarcoma. The leiomyosarcoma subtypes showed significant differences in expression levels for genes for which novel targeted therapies are being developed, suggesting that leiomyosarcoma subtypes may respond differentially to these targeted therapies. Conclusions: We confirm the existence of 3 molecular subtypes in leiomyosarcoma using two independent datasets and show that the different molecular subtypes are associated with distinct clinical outcomes. The findings offer an opportunity for treating leiomyosarcoma in a subtype-specific targeted approach. Clin Cancer Res; 21(15); 3501–11. ©2015 AACR.

123 citations


Journal ArticleDOI
TL;DR: In this paper, the authors build on recent equivalences between the maximum entropy formalism and Poisson regression to show that CATS is equivalent to a generalized linear model for abundance, with species traits as predictor variables.
Abstract: Shipley, Vile & Garnier (Science 2006; 314: 812) proposed a maximum entropy approach to studying how species relative abundance is mediated by their traits, ‘community assembly via trait selection’ (CATS). In this paper, we build on recent equivalences between the maximum entropy formalism and Poisson regression to show that CATS is equivalent to a generalized linear model for abundance, with species traits as predictor variables. Main advantages gained by access to the machinery of generalized linear models can be summarized as advantages in interpretation, model checking, extensions and inference. A more difficult issue, however, is the development of valid methods of inference for single-site data, as species correlation in abundance is not accounted for in CATS (whether specified as a regression or via maximum entropy). This issue can be circumvented for multisite data using design-based inference. These points are illustrated by example – our plant abundances were found to violate the implicit Poisson assumption of CATS, but a negative binomial regression had much improved fit, and our model was extended to multisite data in order to directly model the environment–trait interaction. Violations of the Poisson assumption were strong and accounting for them qualitatively changed results, presumably because larger counts had undue influence when overdispersion had not been accounted for. We advise that future CATS analysts routinely check for overdispersion and account for it if present.
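A hedged sketch of the GLM view of CATS described above: abundance regressed on species traits, with a negative binomial alternative as an overdispersion check. The simulated traits and abundances are illustrative only.

```r
# Hedged sketch: CATS as a GLM of abundance on traits, plus an overdispersion check.
library(MASS)

set.seed(1)
n_species <- 60
traits <- data.frame(sla = rnorm(n_species),
                     height = rnorm(n_species),
                     seedmass = rnorm(n_species))
mu <- exp(1 + 0.8 * traits$sla - 0.5 * traits$height)
traits$abund <- rnbinom(n_species, mu = mu, size = 1.5)   # overdispersed counts

fit_pois <- glm(abund ~ sla + height + seedmass, family = poisson, data = traits)
fit_nb   <- glm.nb(abund ~ sla + height + seedmass, data = traits)

# A residual deviance well above the residual df flags overdispersion;
# AIC compares the Poisson and negative binomial fits.
summary(fit_pois)$deviance / summary(fit_pois)$df.residual
AIC(fit_pois, fit_nb)
```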

77 citations


Posted Content
TL;DR: This work introduces GAMSEL (Generalized Additive Model Selection), a penalized likelihood approach for fitting sparse generalized additive models in high dimension, and presents a blockwise coordinate descent procedure for efficiently optimizing the penalized likelihood objective over a dense grid of the tuning parameter.
Abstract: We introduce GAMSEL (Generalized Additive Model Selection), a penalized likelihood approach for fitting sparse generalized additive models in high dimension. Our method interpolates between null, linear and additive models by allowing the effect of each variable to be estimated as being either zero, linear, or a low-complexity curve, as determined by the data. We present a blockwise coordinate descent procedure for efficiently optimizing the penalized likelihood objective over a dense grid of the tuning parameter, producing a regularization path of additive models. We demonstrate the performance of our method on both real and simulated data examples, and compare it with existing techniques for additive model selection.
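A hedged usage sketch with the gamsel package; the simulated data and the index chosen along the path are illustrative.

```r
# Hedged sketch: sparse generalized additive model selection with gamsel.
library(gamsel)

set.seed(1)
n <- 300
p <- 12
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + sin(2 * x[, 2]) + rnorm(n)   # one linear and one nonlinear effect

fit   <- gamsel(x, y)        # regularization path: zero / linear / smooth per variable
cvfit <- cv.gamsel(x, y)     # cross-validation over the path

# Predictions at one point on the path (the index would normally be chosen via cvfit)
yhat <- predict(fit, x, index = 20, type = "response")
```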

57 citations


Journal Article
TL;DR: This work proposes a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme, and shows that this method can substantially outperform standard case-control subsampling.
Abstract: For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients $\theta^*$. By contrast, our estimator is consistent for $\theta^*$ provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE - even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to $1+\frac{1}{c}$ if we multiply the baseline acceptance probabilities by $c>1$ (and weight points with acceptance probability greater than 1), taking roughly $\frac{1+c}{2}$ times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
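The following is a hedged sketch of the accept-reject scheme as we read it from the abstract; the pilot-fit details and the specific post-hoc correction shown (adding the pilot coefficients back to the subsample fit) are our reading of the method, not code from the paper.

```r
# Hedged sketch of local case-control subsampling for imbalanced logistic regression.
set.seed(1)
n <- 1e5
p <- 5
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(-6, 1, -1, 0.5, 0, 0)              # intercept first; rare positive class
prob <- drop(plogis(cbind(1, X) %*% beta_true))
y <- rbinom(n, 1, prob)

# 1. Pilot estimate from a small uniform subsample
pilot_idx <- sample(n, 2000)
pilot <- glm(y[pilot_idx] ~ X[pilot_idx, ], family = binomial)
p_tilde <- drop(plogis(cbind(1, X) %*% coef(pilot)))

# 2. Accept each point with probability |y - p_tilde|, so responses that are
#    conditionally rare given their features are kept preferentially
accept <- runif(n) < abs(y - p_tilde)

# 3. Refit on the subsample and apply the post-hoc adjustment
fit_sub <- glm(y[accept] ~ X[accept, ], family = binomial)
theta_hat <- coef(fit_sub) + coef(pilot)
```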

55 citations


Journal ArticleDOI
TL;DR: This work exhibits and theoretically explores various fitting procedures for which degrees of freedom is not monotonic in the model complexity parameter, and can exceed the total dimension of the ambient space even in very simple settings.
Abstract: To most applied statisticians, a fitting procedure's degrees of freedom is synonymous with its model complexity, or its capacity for overfitting to data. In particular, it is often used to parameterize the bias-variance tradeoff in model selection. We argue that, on the contrary, model complexity and degrees of freedom may correspond very poorly. We exhibit and theoretically explore various fitting procedures for which degrees of freedom is not monotonic in the model complexity parameter, and can exceed the total dimension of the ambient space even in very simple settings. We show that the degrees of freedom for any non-convex projection method can be unbounded.
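For context, and using the standard definition rather than anything specific to this paper, the degrees of freedom of a fitting procedure with fitted values $\hat{y}$ is usually defined through the covariance formula

$$\mathrm{df}(\hat{y}) = \frac{1}{\sigma^2} \sum_{i=1}^{n} \mathrm{Cov}(\hat{y}_i, y_i),$$

which reduces to $\mathrm{trace}(H)$ for a linear smoother $\hat{y} = Hy$; the paper's point is that for nonlinear procedures this quantity need not increase with the nominal complexity parameter and can even exceed the dimension of the ambient space.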

Journal ArticleDOI
TL;DR: Important biomedical applications, such as osteoarthritis and weight management, will focus the development of new data science methods, and the Mobilize Center will transform human movement research.

Posted Content
TL;DR: An end-to-end framework for Telugu OCR is presented that segments the text image, classifies the characters with a deep convolutional neural network, and extracts lines using a language model, achieving state-of-the-art error rates.
Abstract: In this paper, we address the task of Optical Character Recognition (OCR) for the Telugu script. We present an end-to-end framework that segments the text image, classifies the characters and extracts lines using a language model. The segmentation is based on mathematical morphology. The classification module, which is the most challenging task of the three, is a deep convolutional neural network. The language is modelled as a third-degree Markov chain at the glyph level. Telugu script is a complex alphasyllabary and the language is agglutinative, making the problem hard. In this paper we apply the latest advances in neural networks to achieve state-of-the-art error rates. We also review convolutional neural networks in great detail and expound the statistical justification behind the many tricks needed to make Deep Learning work.

Journal ArticleDOI
TL;DR: A simple, interpretable strategy is introduced for making predictions on test data when the features of the test data are available at the time of model fitting, and it is applied to a mass-spectrometric imaging data set from an ongoing collaboration in gastric cancer detection, which demonstrates the power and interpretability of the technique.
Abstract: We introduce a simple, interpretable strategy for making predictions on test data when the features of the test data are available at the time of model fitting. Our proposal—customized training—clusters the data to find training points close to each test point and then fits an $\ell_{1}$-regularized model (lasso) separately in each training cluster. This approach combines the local adaptivity of $k$-nearest neighbors with the interpretability of the lasso. Although we use the lasso for the model fitting, any supervised learning method can be applied to the customized training sets. We apply the method to a mass-spectrometric imaging data set from an ongoing collaboration in gastric cancer detection which demonstrates the power and interpretability of the technique. Our idea is simple but potentially useful in situations where the data have some underlying structure.
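A hedged sketch of one simple variant of the customized-training idea described above: cluster the pooled train and test features, then fit a separate lasso within each cluster's training points. Names and settings are illustrative.

```r
# Hedged sketch of customized training: per-cluster lasso fits with glmnet.
library(glmnet)

customized_train_predict <- function(x_train, y_train, x_test, k = 3) {
  # Cluster the train and test features jointly
  cl <- kmeans(rbind(x_train, x_test), centers = k)
  train_cl <- cl$cluster[seq_len(nrow(x_train))]
  test_cl  <- cl$cluster[nrow(x_train) + seq_len(nrow(x_test))]

  pred <- numeric(nrow(x_test))
  for (g in seq_len(k)) {
    tr <- which(train_cl == g)
    te <- which(test_cl == g)
    if (length(te) == 0) next
    # Fall back to the full training set if a cluster has too few training points
    if (length(tr) < 20) tr <- seq_len(nrow(x_train))
    fit <- cv.glmnet(x_train[tr, , drop = FALSE], y_train[tr])
    pred[te] <- predict(fit, x_test[te, , drop = FALSE], s = "lambda.min")
  }
  pred
}
```

Any supervised learner could replace the lasso inside the loop, as the abstract notes.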

Journal ArticleDOI
TL;DR: Electronic health record (EHR) data are used as an example to present an analytic procedure for both creating an analytic sample and analyzing the data to detect clinically meaningful markers of acute myocardial infarction; the results suggest that EHR data can easily be used to detect markers of clinically acute events.
Abstract: Data sources with repeated measurements are an appealing resource to understand the relationship between changes in biological markers and risk of a clinical event. While longitudinal data present opportunities to observe changing risk over time, these analyses can be complicated if the measurement of clinical metrics is sparse and/or irregular, making typical statistical methods unsuitable. In this article, we use electronic health record (EHR) data as an example to present an analytic procedure to both create an analytic sample and analyze the data to detect clinically meaningful markers of acute myocardial infarction (MI). Using an EHR from a large national dialysis organization, we abstracted the records of 64,318 individuals and identified 4769 people who had an MI during the study period. We describe a nested case-control design to sample appropriate controls and an analytic approach using regression splines. Fitting a mixed model with truncated power splines, we perform a series of goodness-of-fit tests to determine whether any of 11 regularly collected laboratory markers are useful clinical predictors. We test the clinical utility of each marker using an independent test set. The results suggest that EHR data can be easily used to detect markers of clinically acute events. Special software or analytic tools are not needed, even with irregular EHR data.
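A hedged sketch of a truncated power spline basis and a patient-level mixed model in the spirit of the analysis described above; the simulated data, knot locations, and model form are illustrative, not from the study.

```r
# Hedged sketch: truncated power splines in a mixed model for sparse longitudinal labs.
library(lme4)

# Truncated power basis of degree 1 (linear splines) at chosen knots
trunc_power <- function(t, knots) sapply(knots, function(k) pmax(t - k, 0))

set.seed(1)
n_pat <- 200
n_obs <- 12
dat <- data.frame(
  id   = rep(seq_len(n_pat), each = n_obs),
  time = rep(seq(-360, -30, length.out = n_obs), n_pat),   # days before index date
  case = rep(rbinom(n_pat, 1, 0.5), each = n_obs)          # MI case vs matched control
)
dat$value <- 10 + 0.002 * dat$time +
  0.01 * dat$case * pmax(dat$time + 90, 0) +               # cases drift in the last 90 days
  rep(rnorm(n_pat), each = n_obs) + rnorm(nrow(dat), sd = 0.5)

knots <- c(-180, -90)                                      # hypothetical knot placement
B <- trunc_power(dat$time, knots)
colnames(B) <- paste0("tp", seq_along(knots))
dat <- cbind(dat, B)

# Random intercept per patient; case-by-spline terms ask whether the marker's
# trajectory bends differently ahead of an event.
fit <- lmer(value ~ (time + tp1 + tp2) * case + (1 | id), data = dat)
summary(fit)
```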

Posted Content
TL;DR: In this article, the authors consider large-scale studies in which thousands of significance tests are performed simultaneously, unify existing confounder-adjustment methods in a common framework, generalize them to include multiple primary variables and multiple nuisance variables, and analyze their statistical properties.
Abstract: We consider large-scale studies in which thousands of significance tests are performed simultaneously. In some of these studies, the multiple testing procedure can be severely biased by latent confounding factors such as batch effects and unmeasured covariates that correlate with both primary variable(s) of interest (e.g. treatment variable, phenotype) and the outcome. Over the past decade, many statistical methods have been proposed to adjust for the confounders in hypothesis testing. We unify these methods in the same framework, generalize them to include multiple primary variables and multiple nuisance variables, and analyze their statistical properties. In particular, we provide theoretical guarantees for RUV-4 and LEAPP, which correspond to two different identification conditions in the framework: the first requires a set of "negative controls" that are known a priori to follow the null distribution; the second requires the true non-nulls to be sparse. Two different estimators which are based on RUV-4 and LEAPP are then applied to these two scenarios. We show that if the confounding factors are strong, the resulting estimators can be asymptotically as powerful as the oracle estimator which observes the latent confounding factors. For hypothesis testing, we show the asymptotic z-tests based on the estimators can control the type I error. Numerical experiments show that the false discovery rate is also controlled by the Benjamini-Hochberg procedure when the sample size is reasonably large.
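The working model underlying this class of methods can be summarized, in generic notation of ours that may differ from the paper's, as a factor-augmented regression

$$Y = X\beta^{\top} + Z\Gamma^{\top} + E,$$

where $Y$ is the $n \times p$ matrix of outcomes (for example, gene expression), $X$ holds the primary variable(s), $Z$ the unobserved confounding factors, and $E$ the noise; RUV-4 and LEAPP then differ in how $\beta$ is identified, via known negative-control outcomes or via sparsity of the non-null effects, respectively.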


Posted Content
TL;DR: Dracula, a 'deep' extension of Compressive Feature Learning, learns a dictionary of $n$-grams that efficiently compresses a given corpus and recursively compresses its own dictionary.
Abstract: We propose `Dracula', a new framework for unsupervised feature selection from sequential data such as text. Dracula learns a dictionary of $n$-grams that efficiently compresses a given corpus and recursively compresses its own dictionary; in effect, Dracula is a `deep' extension of Compressive Feature Learning. It requires solving a binary linear program that may be relaxed to a linear program. Both problems exhibit considerable structure, their solution paths are well behaved, and we identify parameters which control the depth and diversity of the dictionary. We also discuss how to derive features from the compressed documents and show that while certain unregularized linear models are invariant to the structure of the compressed dictionary, this structure may be used to regularize learning. Experiments are presented that demonstrate the efficacy of Dracula's features.

Proceedings Article
12 Jul 2015
TL;DR: This paper uses suffix trees to show that N-gram matrices require memory and time that are at worst linear in the length of the underlying corpus to store and to multiply a vector, and it provides a linear running time and memory framework that screens N-gram features and produces the data structure necessary for fast multiplication.
Abstract: This paper addresses the computational issues of learning with long, and possibly all, N-grams in a document corpus. Our main result uses suffix trees to show that N-gram matrices require memory and time that is at worst linear (in the length of the underlying corpus) to store and to multiply a vector. Our algorithm can speed up any N-gram based machine learning algorithm which uses gradient descent or an optimization procedure that makes progress through multiplication. We also provide a linear running time and memory framework that screens N-gram features according to a multitude of statistical criteria and produces the data structure necessary for fast multiplication. Experiments on natural language and DNA sequence datasets demonstrate the computational savings of our framework; our multiplication algorithm is four orders of magnitude more efficient than naive multiplication on the DNA data. We also show that prediction accuracy on large-scale sentiment analysis problems benefits from long N-grams.