Proceedings Article

Efficient L1 regularized logistic regression

16 Jul 2006 - pp 401-408
TL;DR: Theoretical results show that the proposed efficient algorithm for L1 regularized logistic regression is guaranteed to converge to the global optimum, and experiments show that it significantly outperforms standard algorithms for solving convex optimization problems.
Abstract: L1 regularized logistic regression is now a workhorse of machine learning: it is widely used for many classification problems, particularly ones with many features. L1 regularized logistic regression requires solving a convex optimization problem. However, standard algorithms for solving convex optimization problems do not scale well enough to handle the large datasets encountered in many practical settings. In this paper, we propose an efficient algorithm for L1 regularized logistic regression. Our algorithm iteratively approximates the objective function by a quadratic approximation at the current point, while maintaining the L1 constraint. In each iteration, it uses the efficient LARS (Least Angle Regression) algorithm to solve the resulting L1 constrained quadratic optimization problem. Our theoretical results show that our algorithm is guaranteed to converge to the global optimum. Our experiments show that our algorithm significantly outperforms standard algorithms for solving convex optimization problems. Moreover, our algorithm outperforms four previously published algorithms that were specifically designed to solve the L1 regularized logistic regression problem.
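
To make the iteration concrete, here is a minimal sketch of the approach the abstract describes, not the authors' implementation: each pass forms the IRLS (weighted least squares) approximation of the logistic loss at the current weights and hands it to a LARS-style lasso solver. It assumes a dense feature matrix X, binary labels y in {0, 1}, and uses scikit-learn's LassoLars with a penalty weight alpha in place of the paper's explicit L1-norm constraint; the backtracking line search the paper uses to guarantee convergence is omitted.

```python
# Minimal sketch of the IRLS + LARS idea described above (not the authors' code).
# Assumptions: dense X (n x d), binary labels y in {0, 1}, and scikit-learn's
# LassoLars standing in for LARS, so the L1 constraint appears in penalized
# form (alpha) rather than as an explicit L1-ball constraint.
import numpy as np
from sklearn.linear_model import LassoLars

def irls_lars_sketch(X, y, alpha=0.01, n_iter=20, tol=1e-6):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))          # current probabilities
        W = np.clip(p * (1.0 - p), 1e-8, None)    # IRLS weights
        z = X @ w + (y - p) / W                   # working response
        sqrtW = np.sqrt(W)
        # L1-penalized weighted least squares = quadratic approximation + L1
        lars = LassoLars(alpha=alpha, fit_intercept=False)
        lars.fit(X * sqrtW[:, None], z * sqrtW)
        w_new = lars.coef_
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w
```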


Citations
Proceedings ArticleDOI
13 Aug 2016
TL;DR: In this article, the authors propose LIME, a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem.
Abstract: Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust, which is fundamental if one plans to take action based on a prediction, or when choosing whether to deploy a new model. Such understanding also provides insights into the model, which can be used to transform an untrustworthy model or prediction into a trustworthy one. In this work, we propose LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. We also propose a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem. We demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). We show the utility of explanations via novel experiments, both simulated and with human subjects, on various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and identifying why a classifier should not be trusted.
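
The core of the method is easy to sketch: perturb the instance being explained, query the black-box model on the perturbations, weight them by proximity, and fit a sparse linear surrogate whose coefficients act as the explanation. The snippet below is a simplified illustration of that idea for tabular data, not the authors' LIME library; `predict_proba`, the Gaussian perturbation scheme, and the kernel width are all assumptions.

```python
# Rough sketch of LIME's core idea for tabular data (not the authors' library):
# sample perturbations around one instance, weight them by proximity, and fit a
# sparse linear surrogate whose coefficients serve as the explanation.
# `predict_proba` is any black-box classifier's probability function (assumed).
import numpy as np
from sklearn.linear_model import Lasso

def explain_instance(x, predict_proba, n_samples=5000, kernel_width=0.75, alpha=0.01):
    d = x.shape[0]
    # Perturb the instance with Gaussian noise (a simplification of LIME's sampling).
    Z = x + np.random.normal(scale=1.0, size=(n_samples, d))
    y = predict_proba(Z)[:, 1]                           # black-box predictions
    dist = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dist ** 2) / (kernel_width ** 2))  # proximity kernel
    # Weighted sparse linear surrogate: its nonzero coefficients are the explanation.
    surrogate = Lasso(alpha=alpha)
    surrogate.fit(Z, y, sample_weight=weights)
    return surrogate.coef_
```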

11,104 citations

Journal ArticleDOI
TL;DR: The performance of lasso penalized logistic regression in case–control disease gene mapping with a large number of SNP (single nucleotide polymorphism) predictors is evaluated; the coeliac disease results replicate previous SNP findings and shed light on possible interactions among the SNPs.
Abstract: Motivation: In ordinary regression, imposition of a lasso penalty makes continuous model selection straightforward. Lasso penalized regression is particularly advantageous when the number of predictors far exceeds the number of observations. Method: The present article evaluates the performance of lasso penalized logistic regression in case–control disease gene mapping with a large number of SNP (single nucleotide polymorphism) predictors. The strength of the lasso penalty can be tuned to select a predetermined number of the most relevant SNPs and other predictors. For a given value of the tuning constant, the penalized likelihood is quickly maximized by cyclic coordinate ascent. Once the most potent marginal predictors are identified, their two-way and higher order interactions can also be examined by lasso penalized logistic regression. Results: This strategy is tested on both simulated and real data. Our findings on coeliac disease replicate the previous SNP results and shed light on possible interactions among the SNPs. Availability: The software discussed is available in Mendel 9.0 at the UCLA Human Genetics web site. Contact: klange@ucla.edu Supplementary information: Supplementary data are available at Bioinformatics online.
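
As a rough illustration of the tuning step described above (selecting a predetermined number of predictors by adjusting the penalty), the sketch below sweeps the regularization strength of an L1-penalized logistic regression and keeps the fit whose support size is closest to a target k. It uses scikit-learn's liblinear solver rather than the paper's cyclic coordinate ascent, and the grid of C values is an arbitrary choice.

```python
# Illustrative sketch (not the Mendel software): fit an L1-penalized logistic
# regression and tune the penalty until roughly a target number of predictors
# (e.g. SNPs) remain nonzero. scikit-learn's liblinear solver is assumed; the
# paper itself maximizes the penalized likelihood by cyclic coordinate ascent.
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_k_predictors(X, y, k, c_grid=np.logspace(-3, 1, 40)):
    best = None
    for C in c_grid:  # weaker penalty as C grows
        model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
        model.fit(X, y)
        n_nonzero = np.count_nonzero(model.coef_)
        if best is None or abs(n_nonzero - k) < abs(best[0] - k):
            best = (n_nonzero, C, model)
    return best  # (number selected, tuning constant, fitted model)
```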

784 citations

Proceedings ArticleDOI
Galen Andrew, Jianfeng Gao
20 Jun 2007
TL;DR: This work presents an algorithm Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), based on L-BFGS, that can efficiently optimize the L1-regularized log-likelihood of log-linear models with millions of parameters.
Abstract: The L-BFGS limited-memory quasi-Newton method is the algorithm of choice for optimizing the parameters of large-scale log-linear models with L2 regularization, but it cannot be used for an L1-regularized loss due to its non-differentiability whenever some parameter is zero. Efficient algorithms have been proposed for this task, but they are impractical when the number of parameters is very large. We present an algorithm Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), based on L-BFGS, that can efficiently optimize the L1-regularized log-likelihood of log-linear models with millions of parameters. In our experiments on a parse reranking task, our algorithm was several orders of magnitude faster than an alternative algorithm, and substantially faster than L-BFGS on the analogous L2-regularized problem. We also present a proof that OWL-QN is guaranteed to converge to a globally optimal parameter vector.
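
The distinctive ingredient of OWL-QN is the orthant-wise pseudo-gradient, which replaces the ordinary gradient wherever the L1 term is non-differentiable. The sketch below shows just that computation; the full algorithm also projects each quasi-Newton step onto the current orthant and maintains L-BFGS curvature pairs, which are omitted here.

```python
# Sketch of the orthant-wise pseudo-gradient at the heart of OWL-QN.
# `grad_loss` is the gradient of the smooth loss at w, and `c` is the L1 weight.
import numpy as np

def pseudo_gradient(w, grad_loss, c):
    pg = np.zeros_like(w)
    pos = w > 0
    neg = w < 0
    pg[pos] = grad_loss[pos] + c
    pg[neg] = grad_loss[neg] - c
    zero = ~(pos | neg)
    right = grad_loss[zero] + c   # directional derivative moving positive
    left = grad_loss[zero] - c    # directional derivative moving negative
    pg_zero = np.zeros(zero.sum())
    pg_zero[right < 0] = right[right < 0]   # objective decreases going positive
    pg_zero[left > 0] = left[left > 0]      # objective decreases going negative
    pg[zero] = pg_zero                      # otherwise zero is already optimal
    return pg
```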

612 citations


Cites methods from "Efficient L1 regularized logistic r..."

  • ...Lee et al. (2006) propose the algorithm irls-lars, inspired by Newton’s method, which iteratively minimizes the function’s second order Taylor expansion, subject to linear constraints....

    [...]


01 Jan 2007
TL;DR: An efficient interior-point method for solving large-scale L1-regularized logistic regression problems is described; a variation that uses a preconditioned conjugate gradient method to compute the search step scales to very large problems, such as the 20 Newsgroups data set.
Abstract: Logistic regression with L1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale L1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently.
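
The warm-start idea in the last sentence can be illustrated independently of the interior-point machinery: sweep the penalty from strong to weak and start each solve from the previous solution. The sketch below does this with scikit-learn's saga solver (an assumption; the paper uses its own interior-point method), so it mirrors only the path-following strategy, not the solver.

```python
# Sketch of warm-starting a regularization path for L1 logistic regression.
# Uses scikit-learn's saga solver as a stand-in for the paper's interior-point
# method; warm_start=True reuses the previous coefficients at each step.
import numpy as np
from sklearn.linear_model import LogisticRegression

def l1_logreg_path(X, y, Cs=np.logspace(-3, 2, 30)):
    model = LogisticRegression(penalty="l1", solver="saga",
                               warm_start=True, max_iter=5000)
    path = []
    for C in np.sort(Cs):          # increasing C = decreasing penalty
        model.C = C
        model.fit(X, y)            # starts from the previous solution
        path.append(model.coef_.copy())
    return np.vstack(path)
```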

596 citations

References
Journal ArticleDOI
TL;DR: Consideration of the primal and dual problems together leads to important new insights into the characteristics of the LASSO estimator and to an improved method for estimating its covariance matrix.
Abstract: Proposed by Tibshirani, the least absolute shrinkage and selection operator (LASSO) estimates a vector of regression coefficients by minimizing the residual sum of squares subject to a constraint on the l1-norm of the coefficient vector. The LASSO estimator typically has one or more zero elements and thus shares characteristics of both shrinkage estimation and variable selection. In this article we treat the LASSO as a convex programming problem and derive its dual. Consideration of the primal and dual problems together leads to important new insights into the characteristics of the LASSO estimator and to an improved method for estimating its covariance matrix. Using these results we also develop an efficient algorithm for computing LASSO estimates which is usable even in cases where the number of regressors exceeds the number of observations. An S-Plus library based on this algorithm is available from StatLib.
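
For reference, the primal problem the abstract treats can be written in the standard constrained form (a textbook statement, not quoted from the paper):

```latex
% LASSO primal: least squares subject to an L1-norm constraint on the coefficients
\min_{\beta \in \mathbb{R}^p} \; \tfrac{1}{2}\,\|y - X\beta\|_2^2
\quad \text{subject to} \quad \|\beta\|_1 \le t
```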

763 citations


"EfficientL 1 regularized logistic r..." refers methods in this paper

  • ...Roth (2004) proposed an algorithm called generalized LASSO that extends a LASSO algorithm proposed by Osborne et al. (2000)....

    [...]


Journal ArticleDOI
TL;DR: The scope of application of iteratively reweighted least squares to statistical estimation problems is considerably wider than is generally appreciated: it extends beyond exponential-family-type generalized linear models to other distributions, to non-linear parameterizations, and to dependent observations.
Abstract: The scope of application of iteratively reweighted least squares to statistical estimation problems is considerably wider than is generally appreciated. It extends beyond the exponential-family-type generalized linear models to other distributions, to non-linear parameterizations, and to dependent observations. Various criteria for estimation other than maximum likelihood, including resistant alternatives, may be used. The algorithms are generally numerically stable, easily programmed without the aid of packages, and highly suited to interactive computation.
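
To make the reformulation concrete for the logistic case the citing paper relies on, one IRLS/Newton step reduces to a weighted ordinary least squares solve with weights p(1-p) and a working response built from the current fit. The sketch below is illustrative only and assumes a dense X and labels y in {0, 1}.

```python
# One IRLS/Newton step for (unregularized) logistic regression, to make the
# "weighted least squares" reformulation concrete. Purely illustrative.
import numpy as np

def irls_step(X, y, w):
    p = 1.0 / (1.0 + np.exp(-X @ w))        # fitted probabilities
    W = np.clip(p * (1.0 - p), 1e-8, None)  # weights from the current fit
    z = X @ w + (y - p) / W                 # working (adjusted) response
    # Weighted least squares: solve (X^T W X) w_new = X^T W z
    XtW = X.T * W
    return np.linalg.solve(XtW @ X, XtW @ z)
```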

586 citations


"EfficientL 1 regularized logistic r..." refers background or methods in this paper

  • ...(Green 1984; Minka 2003) IRLS reformulates the problem of finding the step direction for Newton’s method as a weighted ordinary least squares problem....

    [...]

  • ...(See Green 1984, or Minka 2003 for details of this derivation.)...

    [...]

  • ...In particular, our algorithm can be used for parameter learning for L1 constrained generalized linear models....

    [...]

  • ...In the k’th iteration, it finds a step direction γ(k) by solving the constrained least squares problem of Equation (11)....

    [...]

01 Jan 2004
TL;DR: This note compares eight different algorithms for computing the maximum a-posteriori parameter estimate and finds the fastest algorithms turn out to be conjugate gradient ascent and quasi-Newton algorithms, which far outstrip Iterative Scaling and its variants.
Abstract: Logistic regression is a workhorse of statistics and is closely related to methods used in Machine Learning, including the Perceptron and the Support Vector Machine. This note compares eight different algorithms for computing the maximum a-posteriori parameter estimate. A full derivation of each algorithm is given. In particular, a new derivation of Iterative Scaling is given which applies more generally than the conventional one. A new derivation is also given for the Modified Iterative Scaling algorithm of Collins et al. (2002). Most of the algorithms operate in the primal space, but can also work in dual space. All algorithms are compared in terms of computational complexity by experiments on large data sets. The fastest algorithms turn out to be conjugate gradient ascent and quasi-Newton algorithms, which far outstrip Iterative Scaling and its variants.
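
As an illustration of the quasi-Newton approach the note finds fastest, the sketch below maximizes the MAP objective for logistic regression with a Gaussian prior (L2 penalty) using SciPy's L-BFGS-B; labels are assumed to be in {-1, +1} and the prior precision alpha is arbitrary. It is a stand-in, not the note's benchmark code.

```python
# Quasi-Newton MAP estimation for logistic regression with a Gaussian prior.
# Labels y are assumed to be in {-1, +1}; alpha is the prior precision.
import numpy as np
from scipy.optimize import minimize

def map_logistic(X, y, alpha=1.0):
    n, d = X.shape

    def neg_log_posterior(w):
        m = X @ w
        # sum of log(1 + exp(-y*m)) plus the Gaussian prior term
        loss = np.sum(np.logaddexp(0.0, -y * m)) + 0.5 * alpha * w @ w
        p = 1.0 / (1.0 + np.exp(y * m))          # sigmoid(-y*m)
        grad = -(X.T @ (y * p)) + alpha * w
        return loss, grad

    res = minimize(neg_log_posterior, np.zeros(d), jac=True, method="L-BFGS-B")
    return res.x
```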

290 citations


"EfficientL 1 regularized logistic r..." refers background or methods in this paper

  • ...(Green 1984; Minka 2003) IRLS reformulates the problem of finding the step direction for Newton’s method as a weighted ordinary least squares problem....

    [...]

  • ...(See Green 1984, or Minka 2003 for details of this derivation.)...

    [...]

Journal ArticleDOI
Volker Roth
TL;DR: This paper presents a different class of kernel regressors, generalized LASSO regression, that effectively overcomes several drawbacks of the SVM, and a highly efficient algorithm with guaranteed global convergence that defines a unique framework for sparse regression models in the very rich class of IRLS models.
Abstract: In the last few years, the support vector machine (SVM) method has motivated new interest in kernel regression techniques. Although the SVM has been shown to exhibit excellent generalization properties in many experiments, it suffers from several drawbacks, both of a theoretical and a technical nature: the absence of probabilistic outputs, the restriction to Mercer kernels, and the steep growth of the number of support vectors with increasing size of the training set. In this paper, we present a different class of kernel regressors that effectively overcome the above problems. We call this approach generalized LASSO regression. It has a clear probabilistic interpretation, can handle learning sets that are corrupted by outliers, produces extremely sparse solutions, and is capable of dealing with large-scale problems. For regression functionals which can be modeled as iteratively reweighted least-squares (IRLS) problems, we present a highly efficient algorithm with guaranteed global convergence. This defines a unique framework for sparse regression models in the very rich class of IRLS models, including various types of robust regression models and logistic regression. Performance studies for many standard benchmark datasets effectively demonstrate the advantages of this model over related approaches.

281 citations


"EfficientL 1 regularized logistic r..." refers methods in this paper

  • ...Roth (2004) proposed an algorithm called generalized LASSO that extends a LASSO algorithm proposed by Osborne et al. (2000)....

    [...]

  • ...Experimental details on the other algorithms We compared our algorithm (IRLS-LARS) to four previously published algorithms: Grafting (Perkins & Theiler 2003), Generalized LASSO (Roth 2004), SCGIS (Goodman 2004), and Gl1ce (Lokhorst 1999)....

    [...]

Proceedings Article
21 Aug 2003
TL;DR: It is argued that existing feature selection methods do not perform well in this scenario, and a promising alternative method is described, based on a stagewise gradient descent technique which is called grafting.
Abstract: In the standard feature selection problem, we are given a fixed set of candidate features for use in a learning problem, and must select a subset that will be used to train a model that is "as good as possible" according to some criterion. In this paper, we present an interesting and useful variant, the online feature selection problem, in which, instead of all features being available from the start, features arrive one at a time. The learner's task is to select a subset of features and return a corresponding model at each time step which is as good as possible given the features seen so far. We argue that existing feature selection methods do not perform well in this scenario, and describe a promising alternative method, based on a stagewise gradient descent technique which we call grafting.
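
Grafting's admission rule is compact enough to show directly: a feature currently fixed at zero enters the working set only when the magnitude of the loss gradient with respect to its weight exceeds the L1 penalty, since otherwise the penalty keeps it at zero. The helper below is a hypothetical illustration of that test; `grad_loss`, `active_mask`, and `lam` are assumed inputs.

```python
# Sketch of grafting's feature-admission test: pick the inactive feature whose
# loss gradient most exceeds the L1 penalty, or report that none qualifies.
import numpy as np

def grafting_candidate(grad_loss, active_mask, lam):
    violation = np.abs(grad_loss) - lam          # how strongly each feature "pulls"
    violation[active_mask] = -np.inf             # only consider inactive features
    j = int(np.argmax(violation))
    return j if violation[j] > 0 else None       # None: current working set is optimal
```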

211 citations


"EfficientL 1 regularized logistic r..." refers methods in this paper

  • ...Figure 1 shows the results for the five algorithms specifically designed for L1 regularized logistic regression (IRLS-LARS, Grafting, SCGIS, GenLASSO and Gl1ce) on 12 datasets....

    [...]

  • ...More specifically, in 8 (out of 12) datasets our method was more than 8 times faster than Grafting....

    [...]

  • ...For example, for the algorithms that are based on conjugate gradient and Newton’s method, we extensively tuned the parameters of the line-search algorithm; for the algorithms using conjugate gradient (Grafting, CG-epsL1, CG-Huber and CGL1), we tested both our own conjugate gradient implementation as well as the MATLAB optimization toolbox’s conjugate gradient; for the algorithms that use the approximate L1 penalty term, we tried many choices for ε in the range 10^-15 < ε < 0.01 and chose the one with the shortest running time; etc....

    [...]

  • ...Grafting uses a local derivative test in each iteration of the conjugate gradient method, to choose an additional feature that is allowed to differ from zero....

    [...]

  • ...Experimental details on the other algorithms We compared our algorithm (IRLS-LARS) to four previously published algorithms: Grafting (Perkins & Theiler 2003), Generalized LASSO (Roth 2004), SCGIS (Goodman 2004), and Gl1ce (Lokhorst 1999)....

    [...]