
Showing papers by "Trevor Hastie published in 2013"


Book
28 Jul 2013
TL;DR: In this book, the authors describe the important ideas in statistics, data mining, machine learning, and bioinformatics in a common conceptual framework; the emphasis is on concepts rather than mathematics, with a liberal use of color graphics.
Abstract: During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting---the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for ``wide'' data (p bigger than n), including multiple testing and false discovery rates. Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.

19,261 citations


BookDOI
01 Jan 2013
TL;DR: An Introduction to Statistical Learning provides an accessible overview of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in science, industry, and other sectors over the past twenty years.
Abstract: An Introduction to Statistical Learning: with Applications in R provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modeling and prediction techniques, along with relevant applications. Topics include linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the methods presented. Since the goal of this textbook is to facilitate the use of these statistical learning techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on implementing the analyses and methods presented in R, an extremely popular open source statistical software platform. Two of the authors co-wrote The Elements of Statistical Learning (Hastie, Tibshirani and Friedman, 2nd edition 2009), a popular reference book for statistics and machine learning researchers. An Introduction to Statistical Learning covers many of the same topics, but at a level accessible to a much broader audience. This book is targeted at statisticians and non-statisticians alike who wish to use cutting-edge statistical learning techniques to analyze their data. The text assumes only a previous course in linear regression and no knowledge of matrix algebra.

8,207 citations


Journal ArticleDOI
TL;DR: A regularized model for linear regression with ℓ1 and ℓ2 penalties is introduced, and it is shown to have the desired effect of group-wise and within-group sparsity.
Abstract: For high-dimensional supervised learning problems, using problem-specific assumptions can often lead to greater accuracy. For problems with grouped covariates, which are believed to have sparse effects both on a group and within-group level, we introduce a regularized model for linear regression with ℓ1 and ℓ2 penalties. We discuss the sparsity and other regularization properties of the optimal fit for this model, and show that it has the desired effect of group-wise and within-group sparsity. We propose an algorithm to fit the model via accelerated generalized gradient descent, and extend this model and algorithm to convex loss functions. We also demonstrate the efficacy of our model and the efficiency of our algorithm on simulated data. This article has online supplementary material.
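For orientation, a common way to write this penalized criterion (our sketch; the group weights $\sqrt{p_l}$ and scaling constants follow the usual convention and may differ from the paper's exact choice) is
$$
\min_{\beta}\; \frac{1}{2n}\Big\|y - \sum_{l=1}^{L} X^{(l)}\beta^{(l)}\Big\|_2^2 \;+\; (1-\alpha)\,\lambda \sum_{l=1}^{L} \sqrt{p_l}\,\big\|\beta^{(l)}\big\|_2 \;+\; \alpha\,\lambda\,\|\beta\|_1,
$$
where $X^{(l)}$ and $\beta^{(l)}$ are the design submatrix and coefficients of group $l$ with $p_l$ members. The group-wise ℓ2 penalty zeroes out whole groups, while the ℓ1 penalty induces sparsity within the surviving groups.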

1,233 citations


Journal ArticleDOI
TL;DR: This forum article questions the approach of Royle et al. (2012) that claims to be able to learn the overall species occurrence probability, or prevalence, without making unjustified simplifying assumptions.
Abstract: Presence-only data abounds in ecology, often accompanied by a background sample. Although many interesting aspects of the species’ distribution can be learned from such data, one cannot learn the overall species occurrence probability, or prevalence, without making unjustified simplifying assumptions. In this forum article we question the approach of Royle et al. (2012) that claims to be able to do this.

194 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose infinitely weighted logistic regression, which is exactly equivalent to the inhomogeneous Poisson process (IPP) model in finite samples.
Abstract: Statistical modeling of presence-only data has attracted much recent attention in the ecological literature, leading to a proliferation of methods, including the inhomogeneous Poisson process (IPP) model, maximum entropy (Maxent) modeling of species distributions and logistic regression models. Several recent articles have shown the close relationships between these methods. We explain why the IPP intensity function is a more natural object of inference in presence-only studies than occurrence probability (which is only defined with reference to quadrat size), and why presence-only data only allows estimation of relative, and not absolute intensity of species occurrence. All three of the above techniques amount to parametric density estimation under the same exponential family model (in the case of the IPP, the fitted density is multiplied by the number of presence records to obtain a fitted intensity). We show that IPP and Maxent give the exact same estimate for this density, but logistic regression in general yields a different estimate in finite samples. When the model is misspecified—as it practically always is—logistic regression and the IPP may have substantially different asymptotic limits with large data sets. We propose “infinitely weighted logistic regression,” which is exactly equivalent to the IPP in finite samples. Consequently, many already-implemented methods extending logistic regression can also extend the Maxent and IPP models in directly analogous ways using this technique.
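A hedged illustration of the weighting idea (our reading of the construction, not code from the paper): fit a weighted logistic regression of presence versus background points, giving the background points a very large weight W; the slope estimates then approximate the IPP/Maxent fit of relative intensity, while the intercept absorbs the weighting and is not interpretable.

# Hedged sketch of "infinitely weighted logistic regression" (our reading):
# presence records get weight 1, background points a very large weight W.
set.seed(1)
n1 <- 200; n0 <- 2000
x  <- c(rnorm(n1, mean = 1), rnorm(n0))          # one environmental covariate
y  <- c(rep(1, n1), rep(0, n0))                  # 1 = presence record, 0 = background point
W  <- 1e6                                        # "infinite" weight on the background (assumption)
w  <- ifelse(y == 1, 1, W)
fit <- glm(y ~ x, family = binomial, weights = w)
coef(fit)["x"]                                   # relative-intensity slope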

190 citations


Book ChapterDOI
01 Jan 2013
TL;DR: This chapter describes tree-based methods for regression and classification, which involve stratifying or segmenting the predictor space into a number of simple regions.
Abstract: In this chapter, we describe tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions. In order to make a prediction for a given observation, we typically use the mean or the mode of the training observations in the region to which it belongs. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision tree methods.
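A minimal sketch of the idea in R, using the rpart package and simulated data (the chapter's own labs use a different package and real data sets):

library(rpart)
set.seed(1)
# Response follows a step pattern in x1, which a tree captures by splitting the predictor space
df <- data.frame(x1 = runif(500), x2 = runif(500))
df$y <- ifelse(df$x1 < 0.5, 2, 5) + rnorm(500, sd = 0.3)
fit <- rpart(y ~ x1 + x2, data = df)
# The prediction for a new observation is the mean response in its terminal region
predict(fit, newdata = data.frame(x1 = 0.8, x2 = 0.4))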

145 citations


Book ChapterDOI
01 Jan 2013
TL;DR: This chapter discusses the support vector machine (SVM), an approach for classification that was developed in the computer science community in the 1990s and that has grown in popularity since then.
Abstract: In this chapter, we discuss the support vector machine (SVM), an approach for classification that was developed in the computer science community in the 1990s and that has grown in popularity since then. SVMs have been shown to perform well in a variety of settings, and are often considered one of the best “out of the box” classifiers.
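A minimal usage sketch with the e1071 package on simulated two-class data (the kernel and tuning values below are illustrative choices, not recommendations from the chapter):

library(e1071)
set.seed(1)
# Two-class data with a nonlinear decision boundary
x <- matrix(rnorm(400), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, "outer", "inner"))
dat <- data.frame(x = x, y = y)
fit <- svm(y ~ ., data = dat, kernel = "radial", gamma = 1, cost = 1)   # radial-kernel SVM
table(predicted = predict(fit, dat), truth = dat$y)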

84 citations


Posted Content
TL;DR: This paper proposes a blockwise descent algorithm for group-penalized multiresponse regression and, via a quasi-Newton framework, extends it to group-penalized multinomial regression; the implementation is shown to be an order of magnitude faster than its competitor and can solve gene-expression-sized problems in real time.
Abstract: In this paper we propose a blockwise descent algorithm for group-penalized multiresponse regression. Using a quasi-Newton framework we extend this to group-penalized multinomial regression. We give a publicly available implementation for these in R, and compare the speed of this algorithm to a competing algorithm; we show that our implementation is an order of magnitude faster than its competitor, and can solve gene-expression-sized problems in real time.
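As we understand it, this algorithm underlies the grouped multiresponse family in the glmnet R package; a hedged usage sketch (the family name and call pattern reflect our recollection of glmnet's interface):

library(glmnet)
set.seed(1)
n <- 100; p <- 20; k <- 3
x <- matrix(rnorm(n * p), n, p)
b <- matrix(0, p, k); b[1:5, ] <- rnorm(5 * k)     # 5 active predictors, shared across responses
y <- x %*% b + matrix(rnorm(n * k), n, k)
# "mgaussian" places a group penalty on each predictor's k coefficients,
# so a predictor enters or leaves the model for all responses jointly
fit   <- glmnet(x, y, family = "mgaussian")
cvfit <- cv.glmnet(x, y, family = "mgaussian")
coef(cvfit, s = "lambda.min")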

82 citations


Book ChapterDOI
01 Jan 2013
TL;DR: In the regression setting, the standard linear model is commonly used to describe the relationship between a response Y and a set of variables X1, ..., Xp; as discussed in this chapter, one typically fits this model using least squares.
Abstract: In the regression setting, the standard linear model $$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$$ (6.1) is commonly used to describe the relationship between a response Y and a set of variables \(X_1, X_2, \ldots, X_p\). We have seen in Chapter 3 that one typically fits this model using least squares.
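For concreteness, a minimal least-squares fit of model (6.1) in R on simulated data (the coefficient values are ours, chosen only for illustration):

# Simulate from model (6.1) with p = 2 and fit by least squares
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)      # ordinary least squares
coef(fit)                   # estimates of beta_0, beta_1, beta_2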

52 citations


Posted Content
TL;DR: A method for learning pairwise interactions in a manner that satisfies strong hierarchy: whenever an interaction is estimated to be nonzero, both its associated main effects are also included in the model, which results in interpretable interaction models.
Abstract: We introduce a method for learning pairwise interactions in a manner that satisfies strong hierarchy: whenever an interaction is estimated to be nonzero, both its associated main effects are also included in the model. We motivate our approach by modeling pairwise interactions for categorical variables with arbitrary numbers of levels, and then show how we can accommodate continuous variables and mixtures thereof. Our approach allows us to dispense with explicitly applying constraints on the main effects and interactions for identifiability, which results in interpretable interaction models. We compare our method with existing approaches on both simulated and real data, including a genome wide association study, all using our R package glinternet.
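A hedged usage sketch of the glinternet package mentioned above (the call pattern, the 0-based coding of categorical variables, and the numLevels argument are from memory and may not match the current interface exactly):

library(glinternet)
set.seed(1)
n  <- 200
x1 <- rnorm(n)                              # continuous variable
x2 <- sample(0:2, n, replace = TRUE)        # categorical variable with 3 levels, coded 0, 1, 2
y  <- 1 + x1 + (x2 == 1) + 2 * x1 * (x2 == 1) + rnorm(n)
X  <- cbind(x1, x2)
# numLevels: 1 for continuous variables, number of categories otherwise (our recollection)
fit <- glinternet(X, y, numLevels = c(1, 3))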

35 citations


Proceedings Article
01 Jan 2013
TL;DR: A new pairwise model for graphical models with both continuous and discrete variables that is amenable to structure learning is presented.
Abstract: We consider the problem of learning the structure of a pairwise graphical model over continuous and discrete variables. We present a new pairwise model for graphical models with both continuous and discrete variables that is amenable to structure learning. Previous work has considered structure learning of Gaussian graphical models and of discrete models; our approach is a natural generalization of these two lines of work to the mixed case. The penalization scheme is new and follows naturally from a particular parametrization of the model.
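As we recall the parametrization (a sketch, not the paper's exact notation), the pairwise model over continuous variables $x$ and discrete variables $y$ takes the form
$$
p(x, y;\Theta) \;\propto\; \exp\!\Big( \sum_{s} \alpha_s x_s \;-\; \tfrac{1}{2}\sum_{s,t} \beta_{st}\, x_s x_t \;+\; \sum_{s,j} \rho_{sj}(y_j)\, x_s \;+\; \sum_{j<r} \phi_{jr}(y_j, y_r) \Big),
$$
so the continuous block behaves like a Gaussian graphical model, the discrete block like a pairwise Markov random field, and the $\rho_{sj}$ terms link the two; as we understand it, penalizing each edge's block of parameters as a group then yields the structure-learning scheme described above.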

Posted Content
TL;DR: The authors show how to estimate standard errors for bagged learners and random forests, propose jackknife and infinitesimal jackknife (IJ) variance estimates that need only B on the order of n bootstrap replicates (where n is the size of the training set), and show that the IJ estimator requires 1.7 times fewer bootstrap replicates than the jackknife to achieve a given accuracy.
Abstract: We study the variability of predictions made by bagged learners and random forests, and show how to estimate standard errors for these methods. Our work builds on variance estimates for bagging proposed by Efron (1992, 2012) that are based on the jackknife and the infinitesimal jackknife (IJ). In practice, bagged predictors are computed using a finite number B of bootstrap replicates, and working with a large B can be computationally expensive. Direct applications of jackknife and IJ estimators to bagging require B on the order of n^{1.5} bootstrap replicates to converge, where n is the size of the training set. We propose improved versions that only require B on the order of n replicates. Moreover, we show that the IJ estimator requires 1.7 times fewer bootstrap replicates than the jackknife to achieve a given accuracy. Finally, we study the sampling distributions of the jackknife and IJ variance estimates themselves. We illustrate our findings with multiple experiments and simulation studies.
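A minimal sketch of the infinitesimal-jackknife computation for a bagged prediction at a single test point (our rendering of the estimator; the toy base learner and variable names are ours, not the paper's):

# Hedged sketch of the IJ variance estimate for a bagged prediction at one test point.
ij_variance <- function(N, preds) {
  # N: B x n matrix, N[b, i] = times training point i appears in bootstrap sample b
  # preds: length-B vector of the base learners' predictions at the test point
  pc <- preds - mean(preds)
  Nc <- sweep(N, 2, colMeans(N))        # center each inclusion-count column
  covs <- colMeans(Nc * pc)             # Cov(N[, i], preds) for each training point i
  sum(covs^2)                           # IJ variance estimate
}

# Toy example: bag simple linear fits and estimate the variance of the bagged prediction
set.seed(1)
n <- 50; B <- 2000
x <- rnorm(n); y <- 2 * x + rnorm(n)
x0 <- 0.5
preds <- numeric(B); N <- matrix(0, B, n)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)
  N[b, ] <- tabulate(idx, nbins = n)
  fit <- lm(y[idx] ~ x[idx])
  preds[b] <- coef(fit)[1] + coef(fit)[2] * x0
}
ij_variance(N, preds)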

Proceedings Article
05 Dec 2013
TL;DR: This paper addresses the problem of unsupervised feature learning for text data by using a dictionary-based compression scheme to extract a succinct feature set and finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly.
Abstract: This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning.
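To make the "full k-gram space" concrete, a purely illustrative helper that enumerates the word k-grams of a toy document (not code from the paper; the paper's compression step then selects a much smaller dictionary from this set):

# Enumerate all word k-grams (k = 1..5) of a document
word_kgrams <- function(text, max_k = 5) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  grams <- character(0)
  for (k in 1:max_k) {
    if (length(words) < k) break
    starts <- 1:(length(words) - k + 1)
    grams <- c(grams, sapply(starts, function(i) paste(words[i:(i + k - 1)], collapse = " ")))
  }
  unique(grams)
}
word_kgrams("the quick brown fox jumps over the lazy dog")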

Posted Content
TL;DR: This paper shows that the traditional full-model definition of FDR suffers from unintuitive and potentially undesirable behavior in the presence of correlated predictors, and proposes a new false selection error criterion, the False Variable Rate (FVR), that avoids these problems and behaves in a more intuitive manner.
Abstract: There has been recent interest in extending the ideas of False Discovery Rates (FDR) to variable selection in regression settings. Traditionally the FDR in these settings has been defined in terms of the coefficients of the full regression model. Recent papers have struggled with controlling this quantity when the predictors are correlated. This paper shows that this full model definition of FDR suffers from unintuitive and potentially undesirable behavior in the presence of correlated predictors. We propose a new false selection error criterion, the False Variable Rate (FVR), that avoids these problems and behaves in a more intuitive manner. We discuss the behavior of this criterion and how it compares with the traditional FDR, as well as presenting guidelines for determining which is appropriate in a particular setting. Finally, we present a simple estimation procedure for FVR in stepwise variable selection. We analyze the performance of this estimator and draw connections to recent estimators in the literature.
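For reference, the classical FDR is the expected proportion of false discoveries among the selections,
$$
\mathrm{FDR} \;=\; \mathbb{E}\!\left[\frac{V}{\max(R,\,1)}\right],
$$
where R is the number of selected variables and V the number of selections counted as false. In the full-model convention criticized above, a selection is counted as false when the variable's coefficient is zero in the full regression model; the FVR replaces this notion of a false selection with one that, as described above, behaves more intuitively when predictors are correlated.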


Book ChapterDOI
01 Jan 2013
TL;DR: This chapter relaxes the linearity assumption while still attempting to maintain as much interpretability as possible by examining very simple extensions of linear models like polynomial regression and step functions, as well as more sophisticated approaches such as splines, local regression, and generalized additive models.
Abstract: So far in this book, we have mostly focused on linear models. Linear models are relatively simple to describe and implement, and have advantages over other approaches in terms of interpretation and inference. However, standard linear regression can have significant limitations in terms of predictive power. This is because the linearity assumption is almost always an approximation, and sometimes a poor one. In Chapter 6 we see that we can improve upon least squares using ridge regression, the lasso, principal components regression, and other techniques. In that setting, the improvement is obtained by reducing the complexity of the linear model, and hence the variance of the estimates. But we are still using a linear model, which can only be improved so far! In this chapter we relax the linearity assumption while still attempting to maintain as much interpretability as possible. We do this by examining very simple extensions of linear models like polynomial regression and step functions, as well as more sophisticated approaches such as splines, local regression, and generalized additive models.
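A minimal sketch of two of these extensions in R, polynomial regression via poly() and a natural cubic spline via splines::ns(), on simulated data rather than the book's own data sets:

library(splines)
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
fit_poly   <- lm(y ~ poly(x, 4))        # degree-4 polynomial regression
fit_spline <- lm(y ~ ns(x, df = 4))     # natural cubic spline with 4 degrees of freedom
new_x <- data.frame(x = seq(0, 10, length.out = 100))
pred_spline <- predict(fit_spline, newdata = new_x)   # fitted curve on a grid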


Journal ArticleDOI
TL;DR: This paper proposes a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features.
Abstract: For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients $\theta^*$. By contrast, our estimator is consistent for $\theta^*$ provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE, even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to $1+\frac{1}{c}$ if we multiply the baseline acceptance probabilities by $c>1$ (and weight points with acceptance probability greater than 1), taking roughly $\frac{1+c}{2}$ times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
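A hedged sketch of the accept-reject scheme as we read it (the acceptance probability and the form of the post-hoc correction below are our reconstruction from the abstract and related work, not code from the paper):

# Local accept-reject subsampling for logistic regression, schematically:
# accept each point with probability |y - p_pilot(x)| and correct the subsample fit
# by adding back the pilot coefficients (assumed form of the post-hoc adjustment).
set.seed(1)
n <- 1e5
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-4 + x))                   # severely imbalanced classes
dat <- data.frame(x = x, y = y)

pilot_idx <- c(sample(which(y == 1), 500), sample(which(y == 0), 500))
pilot <- glm(y ~ x, family = binomial, data = dat[pilot_idx, ])   # cheap pilot estimate

p_tilde <- predict(pilot, newdata = dat, type = "response")
accept  <- runif(n) < abs(dat$y - p_tilde)          # one scan over the full data set
sub_fit <- glm(y ~ x, family = binomial, data = dat[accept, ])

theta_hat <- coef(sub_fit) + coef(pilot)            # post-hoc analytic correction (assumed)
theta_hat                                           # compare with the true c(-4, 1)
mean(accept)                                        # fraction of the data retained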

01 Jan 2013
TL;DR: ischemic mitral regurgitation Ischemia in three left ventricular regions: Insights into the pathogenesis of acute cardiac arrhythmia are provided.
Abstract: J Thorac Cardiovasc Surg 2003;125:559-569 and D. Craig Miller Daughters, Paul Dagum, Mary K. Zasio, Sidney Lo, Trevor Hastie, Neil B. Ingels, Jr Tomasz A. Timek, David T. Lai, Frederick Tibayan, David Liang, George T.ischemic mitral regurgitation Ischemia in three left ventricular regions: Insights into the pathogenesis of acutehttp://jtcs.ctsnetjournals.org/cgi/content/full/125/3/559 located on the World Wide Web at: The online version of this article, along with updated information and services, is