
Showing papers by Trevor Hastie published in 2015


BookDOI
07 May 2015
TL;DR: Statistical Learning with Sparsity: The Lasso and Generalizations presents methods that exploit sparsity to help recover the underlying signal in a set of data and extract useful and reproducible patterns from big datasets.
Abstract: Discover New Methods for Dealing with High-Dimensional Data. A sparse statistical model has only a small number of nonzero parameters or weights; therefore, it is much easier to estimate and interpret than a dense model. Statistical Learning with Sparsity: The Lasso and Generalizations presents methods that exploit sparsity to help recover the underlying signal in a set of data. Top experts in this rapidly evolving field, the authors describe the lasso for linear regression and a simple coordinate descent algorithm for its computation. They discuss the application of $\ell_1$ penalties to generalized linear models and support vector machines, cover generalized penalties such as the elastic net and group lasso, and review numerical methods for optimization. They also present statistical inference methods for fitted (lasso) models, including the bootstrap, Bayesian methods, and recently developed approaches. In addition, the book examines matrix decomposition, sparse multivariate analysis, graphical models, and compressed sensing. It concludes with a survey of theoretical results for the lasso. In this age of big data, the number of features measured on a person or object can be large and might be larger than the number of observations. This book shows how the sparsity assumption allows us to tackle these problems and extract useful and reproducible patterns from big datasets. Data analysts, computer scientists, and theorists will appreciate this thorough and up-to-date treatment of sparse statistical modeling.
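To make the coordinate descent algorithm mentioned above concrete, here is a minimal sketch for the lasso objective $\frac{1}{2n}\|y-X\beta\|_2^2+\lambda\|\beta\|_1$. It assumes standardized predictors and a centered response, and is illustrative only, not code from the book.

```r
# Minimal coordinate-descent lasso sketch (illustrative, not from the book).
# Assumes each column of X is standardized so that crossprod(X[, j]) / n == 1,
# and y is centered (no intercept).
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

lasso_cd <- function(X, y, lambda, n_iter = 100) {
  n <- nrow(X)
  p <- ncol(X)
  beta <- rep(0, p)
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      # partial residual with predictor j removed
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      # univariate least-squares coefficient, soft-thresholded at lambda
      beta[j] <- soft_threshold(crossprod(X[, j], r_j) / n, lambda)
    }
  }
  beta
}
```

In practice one would use an optimized implementation such as the glmnet R package, which computes the entire regularization path with warm starts.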

2,275 citations


Journal ArticleDOI
TL;DR: In this article, the authors bring the two approaches together, leading to an efficient algorithm for large matrix factorization and completion that outperforms both, and they develop an R software package, softImpute, for implementing their approach.
Abstract: The matrix-completion problem has attracted a lot of attention, largely as a result of the celebrated Netflix competition. Two popular approaches for solving the problem are nuclear-norm-regularized matrix approximation (Candes and Tao, 2009; Mazumder et al., 2010), and maximum-margin matrix factorization (Srebro et al., 2005). These two procedures are in some cases solving equivalent problems, but with quite different algorithms. In this article we bring the two approaches together, leading to an efficient algorithm for large matrix factorization and completion that outperforms both of these. We develop a software package softImpute in R for implementing our approaches, and a distributed version for very large matrices using the Spark cluster programming environment.
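As a usage illustration, here is a hedged sketch using the softImpute package's documented interface; the toy matrix and parameter values are ours, not the paper's.

```r
# Hedged sketch of nuclear-norm-regularized matrix completion with softImpute.
library(softImpute)

set.seed(1)
X <- matrix(rnorm(200 * 50), 200, 50)   # toy complete matrix
X[sample(length(X), 4000)] <- NA        # remove 40% of entries at random

# ALS fit with a rank cap and nuclear-norm penalty lambda (values illustrative)
fit <- softImpute(X, rank.max = 10, lambda = 2, type = "als")

# Fill in the missing entries with the low-rank reconstruction
X_hat <- complete(X, fit)
```

For very large problems the package also offers a sparse "Incomplete" matrix representation, and the abstract notes a distributed Spark implementation.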

366 citations


Journal ArticleDOI
TL;DR: A probabilistic model for the joint analysis of presence-only and survey data is proposed that exploits their complementary strengths; by borrowing strength across species that share the same sampling bias, it can obtain an unbiased estimate of a species' geographic range even when only presence-only data are available for that species.
Abstract: Presence-only records may provide data on the distributions of rare species, but commonly suffer from large, unknown biases due to their typically haphazard collection schemes. Presence-absence or count data collected in systematic, planned surveys are more reliable but typically less abundant. We propose a probabilistic model to allow for joint analysis of presence-only and survey data to exploit their complementary strengths. Our method pools presence-only and presence-absence data for many species and maximizes a joint likelihood, simultaneously estimating and adjusting for the sampling bias affecting the presence-only data. By assuming that the sampling bias is the same for all species, we can borrow strength across species to efficiently estimate the bias and improve our inference from presence-only data. We evaluate our model's performance on data for 36 eucalypt species in south-eastern Australia. We find that presence-only records exhibit a strong sampling bias towards the coast and towards Sydney, the largest city. Our data-pooling technique substantially improves the out-of-sample predictive performance of our model when the amount of available presence-absence data for a given species is scarce. If we have only presence-only data and no presence-absence data for a given species, but both types of data for several other species that suffer from the same spatial sampling bias, then our method can obtain an unbiased estimate of the first species' geographic range.
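Schematically, and in notation of ours rather than the paper's, the shared-bias assumption can be written as a thinned intensity for species $k$:

$$\lambda_k^{\mathrm{PO}}(s) = \lambda_k(s)\, b(s), \qquad \log \lambda_k(s) = \alpha_k + \beta_k^{\top} x(s), \qquad \log b(s) = \delta^{\top} z(s),$$

so the observation-bias factor $b(s)$ is common to all species and can be estimated by pooling presence-only records across species, while the presence-absence surveys inform the species intensities $\lambda_k(s)$ directly.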

313 citations


Journal ArticleDOI
TL;DR: In this article, point process models, some of their advantages, and some common methods of fitting them to presence-only data are reviewed, including ways forward on difficult issues such as accounting for sampling bias.
Abstract: Presence-only data are widely used for species distribution modelling, and point process regression models are a flexible tool that has considerable potential for this problem, when data arise as point events. In this paper, we review point process models, some of their advantages and some common methods of fitting them to presence-only data. Advantages include (and are not limited to) clarification of what the response variable is that is modelled; a framework for choosing the number and location of quadrature points (commonly referred to as pseudo-absences or ‘background points’) objectively; clarity of model assumptions and tools for checking them; models to handle spatial dependence between points when it is present; and ways forward regarding difficult issues such as accounting for sampling bias. Point process models are related to some common approaches to presence-only species distribution modelling, which means that a variety of different software tools can be used to fit these models, including maxent or generalised linear modelling software.
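For readers unfamiliar with the quadrature scheme referred to above, the standard device (a summary of a well-known approximation, not text from the paper) replaces the integral in the inhomogeneous Poisson process log-likelihood with a weighted sum over quadrature points:

$$\ell(\beta) = \sum_{i \in \text{presences}} \log \lambda(s_i;\beta) - \int_A \lambda(s;\beta)\,ds \;\approx\; \sum_{i \in \text{presences}} \log \lambda(s_i;\beta) - \sum_{j \in \text{quadrature}} w_j\, \lambda(s_j;\beta),$$

with $\log \lambda(s;\beta) = \beta^{\top}x(s)$ typically log-linear in the environmental covariates $x(s)$, which is why standard GLM or maxent software can be used for the fitting.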

301 citations


Journal ArticleDOI
TL;DR: This work introduces a method for learning pairwise interactions in a linear regression or logistic regression model in a manner that satisfies strong hierarchy: whenever an interaction is estimated to be nonzero, both its associated main effects are also included in the model.
Abstract: We introduce a method for learning pairwise interactions in a linear regression or logistic regression model in a manner that satisfies strong hierarchy: whenever an interaction is estimated to be nonzero, both its associated main effects are also included in the model. We motivate our approach by modeling pairwise interactions for categorical variables with arbitrary numbers of levels, and then show how we can accommodate continuous variables as well. Our approach allows us to dispense with explicitly applying constraints on the main effects and interactions for identifiability, which results in interpretable interaction models. We compare our method with existing approaches on both simulated and real data, including a genome-wide association study, all using our R package glinternet.
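A hedged usage sketch with the glinternet package named in the abstract; the simulated data and settings are ours.

```r
# Hedged sketch: hierarchical pairwise interactions with glinternet.
library(glinternet)

set.seed(1)
n <- 500
x_cat  <- sample(0:2, n, replace = TRUE)   # categorical predictor, 3 levels coded 0,1,2
x_cont <- matrix(rnorm(n * 3), n, 3)       # three continuous predictors
X <- cbind(x_cat, x_cont)
y <- 1.5 * x_cont[, 1] + (x_cat == 1) * x_cont[, 2] + rnorm(n)

# numLevels: number of levels for categorical variables, 1 for continuous ones
numLevels <- c(3, 1, 1, 1)

cvfit <- glinternet.cv(X, y, numLevels, nFolds = 5)
coef(cvfit)   # main effects and interactions obeying strong hierarchy
```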

220 citations


Journal ArticleDOI
TL;DR: This work presents a new pairwise model for graphical models with both continuous and discrete variables that is amenable to structure learning and involves a novel symmetric use of the group-lasso norm.
Abstract: We consider the problem of learning the structure of a pairwise graphical model over continuous and discrete variables. We present a new pairwise model for graphical models with both continuous and discrete variables that is amenable to structure learning. In previous work, authors have considered structure learning of Gaussian graphical models and structure learning of discrete models. Our approach is a natural generalization of these two lines of work to the mixed case. The penalization scheme involves a novel symmetric use of the group-lasso norm and follows naturally from a particular parameterization of the model. Supplementary materials for this article are available online.
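Schematically, and in notation that may differ from the paper's, the pairwise density over continuous variables $x$ and discrete variables $y$ combines Gaussian, mixed, and discrete edge terms:

$$p(x,y) \propto \exp\Big(\sum_s \alpha_s x_s - \tfrac{1}{2}\sum_{s,t}\beta_{st}x_s x_t + \sum_{s,j}\rho_{sj}(y_j)\,x_s + \sum_{j,r}\phi_{jr}(y_j,y_r)\Big),$$

and structure is learned by penalizing each edge's parameter block (a scalar $\beta_{st}$, a vector $\rho_{sj}(\cdot)$, or a table $\phi_{jr}(\cdot,\cdot)$) with a group-lasso norm, so that entire edges are set to zero together.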

154 citations


Journal ArticleDOI
TL;DR: The leiomyosarcoma subtypes showed significant differences in expression levels for genes for which novel targeted therapies are being developed, suggesting that leiomyosarcoma subtypes may respond differentially to these targeted therapies.
Abstract: Purpose: Leiomyosarcoma is a malignant neoplasm with smooth muscle differentiation. Little is known about its molecular heterogeneity and no targeted therapy currently exists for leiomyosarcoma. Recognition of different molecular subtypes is necessary to evaluate novel therapeutic options. In a previous study on 51 leiomyosarcomas, we identified three molecular subtypes in leiomyosarcoma. The current study was performed to determine whether the existence of these subtypes could be confirmed in independent cohorts. Experimental Design: Ninety-nine cases of leiomyosarcoma were expression profiled with 3′end RNA-Sequencing (3SEQ). Consensus clustering was conducted to determine the optimal number of subtypes. Results: We identified 3 leiomyosarcoma molecular subtypes and confirmed this finding by analyzing publicly available data on 82 leiomyosarcomas from The Cancer Genome Atlas (TCGA). We identified two new formalin-fixed, paraffin-embedded tissue-compatible diagnostic immunohistochemical markers: LMOD1 for subtype I leiomyosarcoma and ARL4C for subtype II leiomyosarcoma. A leiomyosarcoma tissue microarray with known clinical outcome was used to show that subtype I leiomyosarcoma is associated with good outcome in extrauterine leiomyosarcoma, while subtype II leiomyosarcoma is associated with poor prognosis in both uterine and extrauterine leiomyosarcoma. The leiomyosarcoma subtypes showed significant differences in expression levels for genes for which novel targeted therapies are being developed, suggesting that leiomyosarcoma subtypes may respond differentially to these targeted therapies. Conclusions: We confirm the existence of 3 molecular subtypes in leiomyosarcoma using two independent datasets and show that the different molecular subtypes are associated with distinct clinical outcomes. The findings offer an opportunity for treating leiomyosarcoma in a subtype-specific targeted approach. Clin Cancer Res; 21(15); 3501–11. ©2015 AACR.

123 citations


Journal ArticleDOI
TL;DR: In this paper, the authors build on recent equivalences between the maximum entropy formalism and Poisson regression to show that CATS is equivalent to a generalized linear model for abundance, with species traits as predictor variables.
Abstract: Shipley, Vile & Garnier (Science 2006; 314: 812) proposed a maximum entropy approach to studying how species relative abundance is mediated by their traits, ‘community assembly via trait selection’ (CATS). In this paper, we build on recent equivalences between the maximum entropy formalism and Poisson regression to show that CATS is equivalent to a generalized linear model for abundance, with species traits as predictor variables. Main advantages gained by access to the machinery of generalized linear models can be summarized as advantages in interpretation, model checking, extensions and inference. A more difficult issue, however, is the development of valid methods of inference for single-site data, as species correlation in abundance is not accounted for in CATS (whether specified as a regression or via maximum entropy). This issue can be circumvented for multisite data using design-based inference. These points are illustrated by example – our plant abundances were found to violate the implicit Poisson assumption of CATS, but a negative binomial regression had much improved fit, and our model was extended to multisite data in order to directly model the environment–trait interaction. Violations of the Poisson assumption were strong and accounting for them qualitatively changed results, presumably because larger counts had undue influence when overdispersion had not been accounted for. We advise that future CATS analysts routinely check for overdispersion and account for it if present.
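A hedged sketch of the GLM view of CATS described above: abundance regressed on species traits, with a negative binomial alternative as an overdispersion check. The simulated traits and abundances are illustrative only.

```r
# Hedged sketch: CATS as a GLM of abundance on traits, plus an overdispersion check.
library(MASS)

set.seed(1)
n_species <- 60
traits <- data.frame(sla = rnorm(n_species),
                     height = rnorm(n_species),
                     seedmass = rnorm(n_species))
mu <- exp(1 + 0.8 * traits$sla - 0.5 * traits$height)
traits$abund <- rnbinom(n_species, mu = mu, size = 1.5)   # overdispersed counts

fit_pois <- glm(abund ~ sla + height + seedmass, family = poisson, data = traits)
fit_nb   <- glm.nb(abund ~ sla + height + seedmass, data = traits)

# A residual deviance well above the residual df flags overdispersion;
# AIC compares the Poisson and negative binomial fits.
summary(fit_pois)$deviance / summary(fit_pois)$df.residual
AIC(fit_pois, fit_nb)
```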

77 citations


Posted Content
TL;DR: This work introduces GAMSEL (Generalized Additive Model Selection), a penalized likelihood approach for fitting sparse generalized additive models in high dimension, and presents a blockwise coordinate descent procedure for efficiently optimizing the penalized likelihood objective over a dense grid of the tuning parameter.
Abstract: We introduce GAMSEL (Generalized Additive Model Selection), a penalized likelihood approach for fitting sparse generalized additive models in high dimension. Our method interpolates between null, linear and additive models by allowing the effect of each variable to be estimated as being either zero, linear, or a low-complexity curve, as determined by the data. We present a blockwise coordinate descent procedure for efficiently optimizing the penalized likelihood objective over a dense grid of the tuning parameter, producing a regularization path of additive models. We demonstrate the performance of our method on both real and simulated data examples, and compare it with existing techniques for additive model selection.
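A hedged usage sketch with the gamsel package; the simulated data and the index chosen along the path are illustrative.

```r
# Hedged sketch: sparse generalized additive model selection with gamsel.
library(gamsel)

set.seed(1)
n <- 300
p <- 12
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + sin(2 * x[, 2]) + rnorm(n)   # one linear and one nonlinear effect

fit   <- gamsel(x, y)        # regularization path: zero / linear / smooth per variable
cvfit <- cv.gamsel(x, y)     # cross-validation over the path

# Predictions at one point on the path (the index would normally be chosen via cvfit)
yhat <- predict(fit, x, index = 20, type = "response")
```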

57 citations


Journal Article
TL;DR: This work proposes a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme, and shows that this method can substantially outperform standard case-control subsampling.
Abstract: For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients $\theta^*$. By contrast, our estimator is consistent for $\theta^*$ provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE - even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to $1+\frac{1}{c}$ if we multiply the baseline acceptance probabilities by $c>1$ (and weight points with acceptance probability greater than 1), taking roughly $\frac{1+c}{2}$ times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
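The following is a hedged sketch of the accept-reject scheme as we read it from the abstract; the pilot-fit details and the specific post-hoc correction shown (adding the pilot coefficients back to the subsample fit) are our reading of the method, not code from the paper.

```r
# Hedged sketch of local case-control subsampling for imbalanced logistic regression.
set.seed(1)
n <- 1e5
p <- 5
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(-6, 1, -1, 0.5, 0, 0)              # intercept first; rare positive class
prob <- drop(plogis(cbind(1, X) %*% beta_true))
y <- rbinom(n, 1, prob)

# 1. Pilot estimate from a small uniform subsample
pilot_idx <- sample(n, 2000)
pilot <- glm(y[pilot_idx] ~ X[pilot_idx, ], family = binomial)
p_tilde <- drop(plogis(cbind(1, X) %*% coef(pilot)))

# 2. Accept each point with probability |y - p_tilde|, so responses that are
#    conditionally rare given their features are kept preferentially
accept <- runif(n) < abs(y - p_tilde)

# 3. Refit on the subsample and apply the post-hoc adjustment
fit_sub <- glm(y[accept] ~ X[accept, ], family = binomial)
theta_hat <- coef(fit_sub) + coef(pilot)
```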

55 citations


Journal ArticleDOI
TL;DR: This work exhibits and theoretically explores various fitting procedures for which degrees of freedom is not monotonic in the model complexity parameter, and can exceed the total dimension of the ambient space even in very simple settings.
Abstract: To most applied statisticians, a fitting procedure's degrees of freedom is synonymous with its model complexity, or its capacity for overfitting to data. In particular, it is often used to parameterize the bias-variance tradeoff in model selection. We argue that, on the contrary, model complexity and degrees of freedom may correspond very poorly. We exhibit and theoretically explore various fitting procedures for which degrees of freedom is not monotonic in the model complexity parameter, and can exceed the total dimension of the ambient space even in very simple settings. We show that the degrees of freedom for any non-convex projection method can be unbounded.
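For context, and using the standard definition rather than anything specific to this paper, the degrees of freedom of a fitting procedure with fitted values $\hat{y}$ is usually defined through the covariance formula

$$\mathrm{df}(\hat{y}) = \frac{1}{\sigma^2} \sum_{i=1}^{n} \mathrm{Cov}(\hat{y}_i, y_i),$$

which reduces to $\mathrm{trace}(H)$ for a linear smoother $\hat{y} = Hy$; the paper's point is that for nonlinear procedures this quantity need not increase with the nominal complexity parameter and can even exceed the dimension of the ambient space.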

Journal ArticleDOI
TL;DR: Important biomedical applications, such as osteoarthritis and weight management, will focus the development of new data science methods, and the Mobilize Center will transform human movement research.

Posted Content
TL;DR: An end-to-end framework for Telugu OCR is presented that segments the text image, classifies the characters with a deep convolutional neural network, and extracts lines using a language model, achieving state-of-the-art error rates.
Abstract: In this paper, we address the task of Optical Character Recognition (OCR) for the Telugu script. We present an end-to-end framework that segments the text image, classifies the characters and extracts lines using a language model. The segmentation is based on mathematical morphology. The classification module, which is the most challenging task of the three, is a deep convolutional neural network. The language is modelled as a third-degree Markov chain at the glyph level. Telugu script is a complex alphasyllabary and the language is agglutinative, making the problem hard. In this paper we apply the latest advances in neural networks to achieve state-of-the-art error rates. We also review convolutional neural networks in great detail and expound the statistical justification behind the many tricks needed to make Deep Learning work.

Journal ArticleDOI
TL;DR: A simple, interpretable strategy is introduced for making predictions on test data when the features of the test data are available at the time of model fitting, and it is applied to a mass-spectrometric imaging data set from an ongoing collaboration in gastric cancer detection, which demonstrates the power and interpretability of the technique.
Abstract: We introduce a simple, interpretable strategy for making predictions on test data when the features of the test data are available at the time of model fitting. Our proposal—customized training—clusters the data to find training points close to each test point and then fits an $\ell_{1}$-regularized model (lasso) separately in each training cluster. This approach combines the local adaptivity of $k$-nearest neighbors with the interpretability of the lasso. Although we use the lasso for the model fitting, any supervised learning method can be applied to the customized training sets. We apply the method to a mass-spectrometric imaging data set from an ongoing collaboration in gastric cancer detection which demonstrates the power and interpretability of the technique. Our idea is simple but potentially useful in situations where the data have some underlying structure.
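A hedged sketch of one simple variant of the customized-training idea described above: cluster the pooled train and test features, then fit a separate lasso within each cluster's training points. Names and settings are illustrative.

```r
# Hedged sketch of customized training: per-cluster lasso fits with glmnet.
library(glmnet)

customized_train_predict <- function(x_train, y_train, x_test, k = 3) {
  # Cluster the train and test features jointly
  cl <- kmeans(rbind(x_train, x_test), centers = k)
  train_cl <- cl$cluster[seq_len(nrow(x_train))]
  test_cl  <- cl$cluster[nrow(x_train) + seq_len(nrow(x_test))]

  pred <- numeric(nrow(x_test))
  for (g in seq_len(k)) {
    tr <- which(train_cl == g)
    te <- which(test_cl == g)
    if (length(te) == 0) next
    # Fall back to the full training set if a cluster has too few training points
    if (length(tr) < 20) tr <- seq_len(nrow(x_train))
    fit <- cv.glmnet(x_train[tr, , drop = FALSE], y_train[tr])
    pred[te] <- predict(fit, x_test[te, , drop = FALSE], s = "lambda.min")
  }
  pred
}
```

Any supervised learner could replace the lasso inside the loop, as the abstract notes.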

Journal ArticleDOI
TL;DR: Electronic health record (EHR) data are used as an example to present an analytic procedure for both creating an analytic sample and analyzing the data to detect clinically meaningful markers of acute myocardial infarction; the results suggest that EHR data can easily be used to detect markers of clinically acute events.
Abstract: Data sources with repeated measurements are an appealing resource to understand the relationship between changes in biological markers and risk of a clinical event. While longitudinal data present opportunities to observe changing risk over time, these analyses can be complicated if the measurement of clinical metrics is sparse and/or irregular, making typical statistical methods unsuitable. In this article, we use electronic health record (EHR) data as an example to present an analytic procedure to both create an analytic sample and analyze the data to detect clinically meaningful markers of acute myocardial infarction (MI). Using an EHR from a large national dialysis organization, we abstracted the records of 64,318 individuals and identified 4769 people who had an MI during the study period. We describe a nested case-control design to sample appropriate controls and an analytic approach using regression splines. Fitting a mixed model with truncated power splines, we perform a series of goodness-of-fit tests to determine whether any of 11 regularly collected laboratory markers are useful clinical predictors. We test the clinical utility of each marker using an independent test set. The results suggest that EHR data can be easily used to detect markers of clinically acute events. Special software or analytic tools are not needed, even with irregular EHR data.
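A hedged sketch of a truncated power spline basis and a patient-level mixed model in the spirit of the analysis described above; the simulated data, knot locations, and model form are illustrative, not from the study.

```r
# Hedged sketch: truncated power splines in a mixed model for sparse longitudinal labs.
library(lme4)

# Truncated power basis of degree 1 (linear splines) at chosen knots
trunc_power <- function(t, knots) sapply(knots, function(k) pmax(t - k, 0))

set.seed(1)
n_pat <- 200
n_obs <- 12
dat <- data.frame(
  id   = rep(seq_len(n_pat), each = n_obs),
  time = rep(seq(-360, -30, length.out = n_obs), n_pat),   # days before index date
  case = rep(rbinom(n_pat, 1, 0.5), each = n_obs)          # MI case vs matched control
)
dat$value <- 10 + 0.002 * dat$time +
  0.01 * dat$case * pmax(dat$time + 90, 0) +               # cases drift in the last 90 days
  rep(rnorm(n_pat), each = n_obs) + rnorm(nrow(dat), sd = 0.5)

knots <- c(-180, -90)                                      # hypothetical knot placement
B <- trunc_power(dat$time, knots)
colnames(B) <- paste0("tp", seq_along(knots))
dat <- cbind(dat, B)

# Random intercept per patient; case-by-spline terms ask whether the marker's
# trajectory bends differently ahead of an event.
fit <- lmer(value ~ (time + tp1 + tp2) * case + (1 | id), data = dat)
summary(fit)
```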

Posted Content
TL;DR: In this article, the authors consider large-scale studies in which thousands of significance tests are performed simultaneously, unify existing confounder-adjustment methods in a common framework, generalize them to include multiple primary variables and multiple nuisance variables, and analyze their statistical properties.
Abstract: We consider large-scale studies in which thousands of significance tests are performed simultaneously. In some of these studies, the multiple testing procedure can be severely biased by latent confounding factors such as batch effects and unmeasured covariates that correlate with both primary variable(s) of interest (e.g. treatment variable, phenotype) and the outcome. Over the past decade, many statistical methods have been proposed to adjust for the confounders in hypothesis testing. We unify these methods in the same framework, generalize them to include multiple primary variables and multiple nuisance variables, and analyze their statistical properties. In particular, we provide theoretical guarantees for RUV-4 and LEAPP, which correspond to two different identification conditions in the framework: the first requires a set of "negative controls" that are known a priori to follow the null distribution; the second requires the true non-nulls to be sparse. Two different estimators which are based on RUV-4 and LEAPP are then applied to these two scenarios. We show that if the confounding factors are strong, the resulting estimators can be asymptotically as powerful as the oracle estimator which observes the latent confounding factors. For hypothesis testing, we show the asymptotic z-tests based on the estimators can control the type I error. Numerical experiments show that the false discovery rate is also controlled by the Benjamini-Hochberg procedure when the sample size is reasonably large.
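The working model underlying this class of methods can be summarized, in generic notation of ours that may differ from the paper's, as a factor-augmented regression

$$Y = X\beta^{\top} + Z\Gamma^{\top} + E,$$

where $Y$ is the $n \times p$ matrix of outcomes (for example, gene expression), $X$ holds the primary variable(s), $Z$ the unobserved confounding factors, and $E$ the noise; RUV-4 and LEAPP then differ in how $\beta$ is identified, via known negative-control outcomes or via sparsity of the non-null effects, respectively.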


Posted Content
TL;DR: Dracula, a 'deep' extension of Compressive Feature Learning, learns a dictionary of $n$-grams that efficiently compresses a given corpus and recursively compresses its own dictionary.
Abstract: We propose `Dracula', a new framework for unsupervised feature selection from sequential data such as text. Dracula learns a dictionary of $n$-grams that efficiently compresses a given corpus and recursively compresses its own dictionary; in effect, Dracula is a `deep' extension of Compressive Feature Learning. It requires solving a binary linear program that may be relaxed to a linear program. Both problems exhibit considerable structure, their solution paths are well behaved, and we identify parameters which control the depth and diversity of the dictionary. We also discuss how to derive features from the compressed documents and show that while certain unregularized linear models are invariant to the structure of the compressed dictionary, this structure may be used to regularize learning. Experiments are presented that demonstrate the efficacy of Dracula's features.

Proceedings Article
12 Jul 2015
TL;DR: This paper uses suffix trees to show that N-gram matrices require memory and time that are at worst linear in the length of the underlying corpus to store and to multiply a vector, and it provides a linear running time and memory framework that screens N-gram features and produces the data structure necessary for fast multiplication.
Abstract: This paper addresses the computational issues of learning with long, and possibly all, N-grams in a document corpus. Our main result uses suffix trees to show that N-gram matrices require memory and time that is at worst linear (in the length of the underlying corpus) to store and to multiply a vector. Our algorithm can speed up any N-gram based machine learning algorithm which uses gradient descent or an optimization procedure that makes progress through multiplication. We also provide a linear running time and memory framework that screens N-gram features according to a multitude of statistical criteria and produces the data structure necessary for fast multiplication. Experiments on natural language and DNA sequence datasets demonstrate the computational savings of our framework; our multiplication algorithm is four orders of magnitude more efficient than naive multiplication on the DNA data. We also show that prediction accuracy on large-scale sentiment analysis problems benefits from long N-grams.