Journal ArticleDOI

Genome-wide association analysis by lasso penalized logistic regression

01 Mar 2009 - Bioinformatics (Oxford University Press) - Vol. 25, Iss. 6, pp. 714-721
TL;DR: The performance of lasso penalized logistic regression in case-control disease gene mapping with a large number of SNP (single nucleotide polymorphism) predictors is evaluated; the coeliac disease results replicate previous SNP findings and shed light on possible interactions among the SNPs.
Abstract: Motivation: In ordinary regression, imposition of a lasso penalty makes continuous model selection straightforward. Lasso penalized regression is particularly advantageous when the number of predictors far exceeds the number of observations. Method: The present article evaluates the performance of lasso penalized logistic regression in case–control disease gene mapping with a large number of SNP (single nucleotide polymorphism) predictors. The strength of the lasso penalty can be tuned to select a predetermined number of the most relevant SNPs and other predictors. For a given value of the tuning constant, the penalized likelihood is quickly maximized by cyclic coordinate ascent. Once the most potent marginal predictors are identified, their two-way and higher order interactions can also be examined by lasso penalized logistic regression. Results: This strategy is tested on both simulated and real data. Our findings on coeliac disease replicate the previous SNP results and shed light on possible interactions among the SNPs. Availability: The software discussed is available in Mendel 9.0 at the UCLA Human Genetics web site. Contact: klange@ucla.edu Supplementary information: Supplementary data are available at Bioinformatics online.
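
The coordinate-wise strategy described in the abstract is easy to prototype. Below is a minimal NumPy sketch of lasso-penalized logistic regression fit by cyclic coordinate descent on 0/1/2-coded genotypes; it is illustrative only, not the Mendel 9.0 implementation, and the function names and the simple majorized curvature bound are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of the lasso penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_logistic_cd(X, y, lam, n_iter=200):
    """L1-penalized logistic regression by cyclic coordinate descent (sketch).

    X   : (n, p) matrix of predictors, e.g. SNP genotypes coded 0/1/2.
    y   : (n,) vector of 0/1 case-control labels.
    lam : lasso penalty strength; larger values retain fewer predictors.
    The intercept is left unpenalized, as is conventional.
    """
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    # 0.25 * sum_i x_ij^2 bounds the logistic curvature in coordinate j,
    # so each update below is a guaranteed-improvement step on the
    # penalized log-likelihood (a majorize-minimize argument).
    L = np.maximum(0.25 * np.sum(X ** 2, axis=0), 1e-12)
    eta = np.full(n, beta0)              # linear predictor beta0 + X @ beta
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-eta))  # fitted case probabilities
        beta0_new = beta0 - np.sum(mu - y) / (0.25 * n)   # unpenalized intercept
        eta += beta0_new - beta0
        beta0 = beta0_new
        for j in range(p):
            mu = 1.0 / (1.0 + np.exp(-eta))
            grad_j = X[:, j] @ (mu - y)  # partial derivative of the negative log-likelihood
            z = beta[j] - grad_j / L[j]
            b_new = soft_threshold(z, lam / L[j])
            eta += X[:, j] * (b_new - beta[j])
            beta[j] = b_new
    return beta0, beta
```

Sweeping lam from large to small traces out models with an increasing number of selected SNPs, which is how a predetermined number of predictors can be targeted.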


Citations
Journal ArticleDOI
TL;DR: In comparative timings, the new algorithms are considerably faster than competing methods and can handle large problems and can also deal efficiently with sparse features.
Abstract: We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.
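
For the Gaussian case with standardized predictors, the coordinate-wise update that these algorithms cycle through has a closed form; the sketch below paraphrases the update in Friedman et al.'s notation, with r_i^{(j)} the partial residual that leaves out predictor j.

```latex
\hat\beta_j \leftarrow
\frac{S\!\left(\tfrac{1}{n}\sum_{i=1}^{n} x_{ij}\, r_i^{(j)},\ \lambda\alpha\right)}{1+\lambda(1-\alpha)},
\qquad
S(z,\gamma)=\operatorname{sign}(z)\,(|z|-\gamma)_{+},
\qquad
r_i^{(j)} = y_i - \hat\beta_0 - \sum_{k\neq j} x_{ik}\hat\beta_k
```

Here α = 1 gives the lasso and α = 0 ridge regression; cycling over j to convergence and warm-starting across a decreasing grid of λ values produces the regularization path mentioned in the abstract.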

13,656 citations


Cites background from "Genome-wide association analysis by..."

  • ...Several other researchers have also re-discovered coordinate descent, many for solving the same problems we address in this paper—notably Shevade and Keerthi (2003), Krishnapuram and Hartemink (2005), Genkin et al. (2007) and Wu et al. (2009)....

    [...]


BookDOI
07 May 2015
TL;DR: Statistical Learning with Sparsity: The Lasso and Generalizations presents methods that exploit sparsity to help recover the underlying signal in a set of data and extract useful and reproducible patterns from big datasets.
Abstract: Discover New Methods for Dealing with High-Dimensional Data. A sparse statistical model has only a small number of nonzero parameters or weights; therefore, it is much easier to estimate and interpret than a dense model. Statistical Learning with Sparsity: The Lasso and Generalizations presents methods that exploit sparsity to help recover the underlying signal in a set of data. Top experts in this rapidly evolving field, the authors describe the lasso for linear regression and a simple coordinate descent algorithm for its computation. They discuss the application of ℓ1 penalties to generalized linear models and support vector machines, cover generalized penalties such as the elastic net and group lasso, and review numerical methods for optimization. They also present statistical inference methods for fitted (lasso) models, including the bootstrap, Bayesian methods, and recently developed approaches. In addition, the book examines matrix decomposition, sparse multivariate analysis, graphical models, and compressed sensing. It concludes with a survey of theoretical results for the lasso. In this age of big data, the number of features measured on a person or object can be large and might be larger than the number of observations. This book shows how the sparsity assumption allows us to tackle these problems and extract useful and reproducible patterns from big datasets. Data analysts, computer scientists, and theorists will appreciate this thorough and up-to-date treatment of sparse statistical modeling.

2,275 citations

Journal ArticleDOI
TL;DR: It is shown that published studies with significant association of polygenic scores have been well powered, whereas those with negative results can be explained by low sample size, and that useful levels of prediction may only be approached when predictors are estimated from very large samples.
Abstract: Polygenic scores have recently been used to summarise genetic effects among an ensemble of markers that do not individually achieve significance in a large-scale association study. Markers are selected using an initial training sample and used to construct a score in an independent replication sample by forming the weighted sum of associated alleles within each subject. Association between a trait and this composite score implies that a genetic signal is present among the selected markers, and the score can then be used for prediction of individual trait values. This approach has been used to obtain evidence of a genetic effect when no single markers are significant, to establish a common genetic basis for related disorders, and to construct risk prediction models. In some cases, however, the desired association or prediction has not been achieved. Here, the power and predictive accuracy of a polygenic score are derived from a quantitative genetics model as a function of the sizes of the two samples, explained genetic variance, selection thresholds for including a marker in the score, and methods for weighting effect sizes in the score. Expressions are derived for quantitative and discrete traits, the latter allowing for case/control sampling. A novel approach to estimating the variance explained by a marker panel is also proposed. It is shown that published studies with significant association of polygenic scores have been well powered, whereas those with negative results can be explained by low sample size. It is also shown that useful levels of prediction may only be approached when predictors are estimated from very large samples, up to an order of magnitude greater than currently available. Therefore, polygenic scores currently have more utility for association testing than predicting complex traits, but prediction will become more feasible as sample sizes continue to grow.
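
The score construction described in the abstract is essentially a thresholded, weighted allele count. A minimal NumPy sketch follows; the function name, the 0/1/2 genotype coding, and the p-value threshold are illustrative assumptions, not the authors' code.

```python
import numpy as np

def polygenic_score(genotypes, betas, pvalues, p_threshold=0.05):
    """Weighted allele-count score from a marker panel (illustrative sketch).

    genotypes : (n_subjects, n_markers) allele counts (0/1/2) in the
                replication sample.
    betas     : per-marker effect estimates from the training GWAS.
    pvalues   : per-marker association p-values from the training GWAS.
    Markers passing the selection threshold in the training sample each
    contribute their weighted allele count to a subject's score.
    """
    keep = pvalues <= p_threshold
    return genotypes[:, keep] @ betas[keep]
```

Association between the trait and this composite score in the replication sample is then tested, or the score is used directly for individual risk prediction.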

1,393 citations


Cites background from "Genome-wide association analysis by..."

  • ...The normal distribution simplifies some calculations, but various heavy-tailed distributions have also been proposed for GWAS data [38,39] and would lead to improved prediction if such models held in truth....

    [...]

Journal ArticleDOI
TL;DR: The development of pathway-based approaches for GWA studies is reviewed, their practical use and caveats are discussed, and it is suggested that pathway-based approaches may also be useful for future GWA studies with sequencing data.
Abstract: Genome-wide association (GWA) studies have typically focused on the analysis of single markers, which often lacks the power to uncover the relatively small effect sizes conferred by most genetic variants. Recently, pathway-based approaches have been developed, which use prior biological knowledge on gene function to facilitate more powerful analysis of GWA study data sets. These approaches typically examine whether a group of related genes in the same functional pathway are jointly associated with a trait of interest. Here we review the development of pathway-based approaches for GWA studies, discuss their practical use and caveats, and suggest that pathway-based approaches may also be useful for future GWA studies with sequencing data.

796 citations


Cites methods from "Genome-wide association analysis by..."

  • ...In fact, hierarchical models using LASSO have been successfully applied in simultaneous multivariate analyses of all GWA study SNP...

    [...]

Journal ArticleDOI
TL;DR: This work applies a Bayesian sparse linear mixed model (BSLMM) and compares it with other methods for two polygenic modeling applications, estimating the proportion of variance in phenotypes explained (PVE) by available genotypes and phenotype (or breeding value) prediction, and demonstrates that for prediction BSLMM considerably outperforms both standard LMMs and sparse regression.
Abstract: Both linear mixed models (LMMs) and sparse regression models are widely used in genetics applications, including, recently, polygenic modeling in genome-wide association studies. These two approaches make very different assumptions, so are expected to perform well in different situations. However, in practice, for a given dataset one typically does not know which assumptions will be more accurate. Motivated by this, we consider a hybrid of the two, which we refer to as a “Bayesian sparse linear mixed model” (BSLMM) that includes both these models as special cases. We address several key computational and statistical issues that arise when applying BSLMM, including appropriate prior specification for the hyper-parameters and a novel Markov chain Monte Carlo algorithm for posterior inference. We apply BSLMM and compare it with other methods for two polygenic modeling applications: estimating the proportion of variance in phenotypes explained (PVE) by available genotypes, and phenotype (or breeding value) prediction. For PVE estimation, we demonstrate that BSLMM combines the advantages of both standard LMMs and sparse regression modeling. For phenotype prediction it considerably outperforms either of the other two methods, as well as several other large-scale regression methods previously suggested for this problem. Software implementing our method is freely available from http://stephenslab.uchicago.edu/software.html.
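
In rough form, the hybrid combines a sparse "large effect" component with a dense polygenic random effect. The formulation below is a paraphrase of that idea in standard notation, not copied from the paper (the paper's own parameterization includes an additional residual-precision scaling).

```latex
\mathbf{y} = \mathbf{1}_n\mu + \mathbf{X}\tilde{\boldsymbol\beta} + \mathbf{u} + \boldsymbol\varepsilon,
\qquad
\tilde\beta_j \sim \pi\,N(0,\sigma_a^2) + (1-\pi)\,\delta_0,
\qquad
\mathbf{u} \sim N(\mathbf{0}, \sigma_b^2\mathbf{K}),
\qquad
\boldsymbol\varepsilon \sim N(\mathbf{0}, \sigma_e^2\mathbf{I}_n)
```

Here K is a genetic relatedness matrix computed from the genotypes; π = 0 recovers a standard LMM, while σ_b² = 0 recovers a pure sparse regression, which is why the hybrid includes both as special cases.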

764 citations

References
Journal ArticleDOI
TL;DR: In this paper, a different approach to problems of multiple significance testing is presented, which calls for controlling the expected proportion of falsely rejected hypotheses, the false discovery rate (FDR); this error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise.
Abstract: The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses, the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.
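
The step-up procedure the abstract refers to can be stated in a few lines. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level alpha.

    Returns a boolean mask of rejected hypotheses. Valid for independent
    (or positively correlated) test statistics.
    """
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)                      # order statistics P(1) <= ... <= P(n)
    thresholds = alpha * np.arange(1, n + 1) / n
    below = p[order] <= thresholds
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest j with P(j) <= (j/n) * alpha
        reject[order[: k + 1]] = True          # reject H(1), ..., H(j)
    return reject
```

The paper under discussion uses this procedure to benchmark single-SNP tests against the lasso selection.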

83,420 citations


"Genome-wide association analysis by..." refers background or methods in this paper

  • ...In the Simes procedure highlighted by Benjamini and Hochberg (1995) in their analysis of FDR, there are n null hypotheses H1,...,Hn and n corresponding P-values P1,...,Pn. The latter are replaced by their order statistics P(1),...,P(n). If for a given α≥0, we choose the largest integer j such that P(i) ≤ (i/n)α for all i≤ j, then we can reject the hypotheses H(1), ...,H(j) at an FDR of α or better. This procedure is justified in theory when the tests are independent or positively correlated. In the presence of linkage equilibrium, association tests are independent; in the presence of linkage disequilibrium, they are positively correlated. For a more detailed discussion of the multiple testing issues in SNP studies, see Nyholt (2004)....

    [...]


Journal ArticleDOI
TL;DR: A new method for estimation in linear models called the lasso, which minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, is proposed.
Abstract: We propose a new method for estimation in linear models. The 'lasso' minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.
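
In symbols, the lasso estimate described in the abstract solves the constrained least-squares problem below (shown here as a worked restatement, not a quotation from the paper):

```latex
\hat{\boldsymbol\beta}
= \arg\min_{\beta_0,\,\boldsymbol\beta}
  \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
  \quad\text{subject to}\quad \sum_{j=1}^{p} |\beta_j| \le t
```

The equivalent Lagrangian form adds λ Σ_j |β_j| to the residual sum of squares; the non-differentiability of the absolute-value penalty at zero is what drives some coefficients exactly to 0.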

40,785 citations


"Genome-wide association analysis by..." refers background in this paper

  • ...The lasso penalty is an effective device for continuous model selection, especially in problems where the number of predictors p far exceeds the number of observations n (Chen et al., 1998; Claerbout and Muir, 1973; Santosa and Symes, 1986; Taylor et al., 1979; Tibshirani, 1996)....

    [...]

Journal ArticleDOI
TL;DR: In comparative timings, the new algorithms are considerably faster than competing methods and can handle large problems and can also deal efficiently with sparse features.
Abstract: We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.

13,656 citations


"Genome-wide association analysis by..." refers background in this paper

  • ...Schwender and Ickstadt (2008) and Kooperberg and Ruczinski (2005) identify interactions using logic regression. These and other relevant papers are reviewed by Liang and Kelemen (2008). We focus on a coordinate descent algorithm because it appears to be the fastest available....

    [...]


Journal ArticleDOI
TL;DR: Basis Pursuit (BP) is a principle for decomposing a signal into an "optimal" superposition of dictionary elements, where optimal means having the smallest ℓ1 norm of coefficients among all such decompositions.
Abstract: The time-frequency and time-scale communities have recently developed a large number of overcomplete waveform dictionaries --- stationary wavelets, wavelet packets, cosine packets, chirplets, and warplets, to name a few. Decomposition into overcomplete systems is not unique, and several methods for decomposition have been proposed, including the method of frames (MOF), Matching pursuit (MP), and, for special dictionaries, the best orthogonal basis (BOB). Basis Pursuit (BP) is a principle for decomposing a signal into an "optimal" superposition of dictionary elements, where optimal means having the smallest ℓ1 norm of coefficients among all such decompositions. We give examples exhibiting several advantages over MOF, MP, and BOB, including better sparsity and superresolution. BP has interesting relations to ideas in areas as diverse as ill-posed problems, abstract harmonic analysis, total variation denoising, and multiscale edge denoising. BP in highly overcomplete dictionaries leads to large-scale optimization problems. With signals of length 8192 and a wavelet packet dictionary, one gets an equivalent linear program of size 8192 by 212,992. Such problems can be attacked successfully only because of recent advances in linear programming by interior-point methods. We obtain reasonable success with a primal-dual logarithmic barrier method and conjugate-gradient solver.
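
Stated compactly, for a signal s and dictionary Φ, basis pursuit is the convex program below (a standard restatement of the principle the abstract describes):

```latex
\min_{\boldsymbol\alpha}\ \lVert\boldsymbol\alpha\rVert_{1}
\quad\text{subject to}\quad \boldsymbol\Phi\boldsymbol\alpha = \mathbf{s}
```

Splitting α into positive and negative parts turns this into the large linear program mentioned above, which is why interior-point methods are needed at that scale.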

9,950 citations


"Genome-wide association analysis by..." refers background in this paper


Journal ArticleDOI
TL;DR: It is proved that replacing the usual quadratic regularizing penalties by weighted ℓp penalties on the coefficients of such expansions, with 1 ≤ p ≤ 2, still regularizes the problem.
Abstract: We consider linear inverse problems where the solution is assumed to have a sparse expansion on an arbitrary preassigned orthonormal basis. We prove that replacing the usual quadratic regularizing penalties by weighted ℓp penalties on the coefficients of such expansions, with 1 ≤ p ≤ 2, still regularizes the problem. Use of such ℓp-penalized problems with p < 2 is often advocated when one expects the underlying ideal noiseless solution to have a sparse expansion with respect to the basis under consideration. To compute the corresponding regularized solutions, we analyze an iterative algorithm that amounts to a Landweber iteration with thresholding (or nonlinear shrinkage) applied at each iteration step. We prove that this algorithm converges in norm. © 2004 Wiley Periodicals, Inc.
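
For p = 1, the iteration the abstract describes reduces to a Landweber (gradient) step followed by soft thresholding. A minimal NumPy sketch (the function name and the 1/L step-size choice are illustrative, not the paper's exact scheme, which assumes an operator of norm less than one):

```python
import numpy as np

def ista(A, y, lam, n_iter=500):
    """Iterative soft-thresholding: a Landweber step followed by shrinkage.

    Minimizes 0.5 * ||A x - y||^2 + lam * ||x||_1, the p = 1 case of the
    weighted l_p penalties discussed in the abstract. A step size of 1/L,
    with L bounding the largest eigenvalue of A^T A, ensures convergence.
    """
    L = np.linalg.norm(A, 2) ** 2              # squared spectral norm of A
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)               # Landweber (gradient) step
        z = x - grad / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # shrinkage
    return x
```

The same shrinkage operator appears coordinate-by-coordinate in the lasso algorithms cited elsewhere on this page.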

4,339 citations


"Genome-wide association analysis by..." refers background in this paper
