Journal ArticleDOI

A solution to the problem of separation in logistic regression

30 Aug 2002, Statistics in Medicine (Stat Med), Vol. 21, Iss. 16, pp. 2409-2419
TL;DR: A procedure by Firth, originally developed to reduce the bias of maximum likelihood estimates, is shown to provide an ideal solution to separation: it produces finite parameter estimates by means of penalized maximum likelihood estimation.
Abstract: The phenomenon of separation or monotone likelihood is observed in the fitting process of a logistic model if the likelihood converges while at least one parameter estimate diverges to +/- infinity. Separation primarily occurs in small samples with several unbalanced and highly predictive risk factors. A procedure by Firth originally developed to reduce the bias of maximum likelihood estimates is shown to provide an ideal solution to separation. It produces finite parameter estimates by means of penalized maximum likelihood estimation. Corresponding Wald tests and confidence intervals are available but it is shown that penalized likelihood ratio tests and profile penalized likelihood confidence intervals are often preferable. The clear advantage of the procedure over previous options of analysis is impressively demonstrated by the statistical analysis of two cancer studies.
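As a concrete illustration of the penalized approach summarized above, the sketch below is a minimal Fisher-scoring implementation of Firth-type penalized-likelihood logistic regression in Python. It is not the authors' software (mature implementations exist, e.g. the R package logistf or the FIRTH option of SAS PROC LOGISTIC); the function name and the toy data with complete separation are our own.

```python
import numpy as np

def firth_logistic(X, y, max_iter=100, tol=1e-8):
    """Minimal Firth-type penalized-likelihood logistic regression (sketch).

    X : (n, p) design matrix including an intercept column.
    y : (n,) array of 0/1 outcomes.
    Returns the penalized maximum likelihood estimate of beta.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))       # fitted probabilities
        W = pi * (1.0 - pi)                          # diagonal of the weight matrix
        info = X.T @ (W[:, None] * X)                # Fisher information I(beta)
        info_inv = np.linalg.inv(info)
        # leverages h_i: diagonal of W^(1/2) X (X'WX)^(-1) X' W^(1/2)
        h = W * np.einsum("ij,jk,ik->i", X, info_inv, X)
        # Firth-modified score: X' (y - pi + h * (1/2 - pi))
        score = X.T @ (y - pi + h * (0.5 - pi))
        step = info_inv @ score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy data with complete separation: x <= 2 gives y = 0, x >= 3 gives y = 1.
# Production code (e.g. logistf) adds step-halving; plain scoring suffices here.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])
print(firth_logistic(X, y))
```

With these data the ordinary maximum likelihood estimate of the slope diverges, while the penalized estimate returned above remains finite.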
Citations
Journal ArticleDOI
01 May 1981
TL;DR: This work discusses detecting influential observations and outliers, methods for detecting and assessing collinearity, and their applications and remedies.
Abstract: 1. Introduction and Overview. 2. Detecting Influential Observations and Outliers. 3. Detecting and Assessing Collinearity. 4. Applications and Remedies. 5. Research Issues and Directions for Extensions. Bibliography. Author Index. Subject Index.

4,948 citations

Book
29 Mar 2012
TL;DR: This book covers the problem of missing data, the concepts of MCAR, MAR and MNAR, simple solutions that do not (always) work, multiple imputation in a nutshell, and some dangers, do's and don'ts.
Abstract: Basics: Introduction (The problem of missing data; Concepts of MCAR, MAR and MNAR; Simple solutions that do not (always) work; Multiple imputation in a nutshell; Goal of the book; What the book does not cover; Structure of the book; Exercises). Multiple imputation (Historic overview; Incomplete data concepts; Why and when multiple imputation works; Statistical intervals and tests; Evaluation criteria; When to use multiple imputation; How many imputations?; Exercises). Univariate missing data (How to generate multiple imputations; Imputation under the normal linear model; Imputation under non-normal distributions; Predictive mean matching; Categorical data; Other data types; Classification and regression trees; Multilevel data; Non-ignorable methods; Exercises). Multivariate missing data (Missing data pattern; Issues in multivariate imputation; Monotone data imputation; Joint modeling; Fully conditional specification; FCS and JM; Conclusion; Exercises). Imputation in practice (Overview of modeling choices; Ignorable or non-ignorable?; Model form and predictors; Derived variables; Algorithmic options; Diagnostics; Conclusion; Exercises). Analysis of imputed data (What to do with the imputed data?; Parameter pooling; Statistical tests for multiple imputation; Stepwise model selection; Conclusion; Exercises). Case studies: Measurement issues (Too many columns; Sensitivity analysis; Correct prevalence estimates from self-reported data; Enhancing comparability; Exercises). Selection issues (Correcting for selective drop-out; Correcting for non-response; Exercises). Longitudinal data (Long and wide format; SE Fireworks Disaster Study; Time raster imputation; Conclusion; Exercises). Extensions: Conclusion (Some dangers, some do's and some don'ts; Reporting; Other applications; Future developments; Exercises). Appendices: Software (R, S-Plus, Stata, SAS, SPSS, other software). References. Author Index. Subject Index.
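The "Parameter pooling" step listed in this outline follows Rubin's rules: average the point estimates across the m completed datasets and combine within- and between-imputation variability. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one scalar parameter across m imputed datasets (Rubin's rules).

    estimates : length-m array of point estimates, one per completed dataset.
    variances : length-m array of their squared standard errors.
    Returns the pooled estimate and its total variance.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    qbar = estimates.mean()            # pooled point estimate
    ubar = variances.mean()            # average within-imputation variance
    b = estimates.var(ddof=1)          # between-imputation variance
    t = ubar + (1.0 + 1.0 / m) * b     # total variance of the pooled estimate
    return qbar, t

# Hypothetical estimates of one regression coefficient from m = 5 imputations.
est = [0.42, 0.45, 0.39, 0.44, 0.41]
var = [0.010, 0.011, 0.009, 0.010, 0.012]
print(pool_rubin(est, var))
```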

2,156 citations

Journal ArticleDOI
TL;DR: In this paper, the authors propose a new default prior distribution for logistic regression models, constructed by first scaling all nonbinary variables to have mean 0 and standard deviation 0.5 and then placing independent Student-t (by default, Cauchy) prior distributions on the coefficients.
Abstract: We propose a new prior distribution for classical (nonhierarchical) logistic regression models, constructed by first scaling all nonbinary variables to have mean 0 and standard deviation 0.5, and then placing independent Student-t prior distributions on the coefficients. As a default choice, we recommend the Cauchy distribution with center 0 and scale 2.5, which in the simplest setting is a longer-tailed version of the distribution attained by assuming one-half additional success and one-half additional failure in a logistic regression. Cross-validation on a corpus of datasets shows the Cauchy class of prior distributions to outperform existing implementations of Gaussian and Laplace priors. We recommend this prior distribution as a default choice for routine applied use. It has the advantage of always giving answers, even when there is complete separation in logistic regression (a common problem, even when the sample size is large and the number of predictors is small), and also automatically applying more shrinkage to higher-order interactions. This can be useful in routine data analysis as well as in automated procedures such as chained equations for missing-data imputation. We implement a procedure to fit generalized linear models in R with the Student-t prior distribution by incorporating an approximate EM algorithm into the usual iteratively weighted least squares. We illustrate with several applications, including a series of logistic regressions predicting voting preferences, a small bioassay experiment, and an imputation model for a public health data set.
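A rough way to reproduce the behaviour described here is to maximize the log-posterior directly rather than via the paper's approximate EM step inside iteratively weighted least squares (available as bayesglm in the R package arm). The sketch below does this on invented, completely separated data, using the rescaling and prior scales quoted in the abstract; the function name is our own.

```python
import numpy as np
from scipy.optimize import minimize

def cauchy_map_logit(X, y, scale_intercept=10.0, scale_coef=2.5):
    """MAP logistic regression with independent Cauchy priors on the coefficients.

    X is assumed to be already rescaled (binary inputs shifted to mean 0, other
    inputs scaled to mean 0 and sd 0.5), with the first column an intercept.
    """
    p = X.shape[1]
    scales = np.full(p, scale_coef)
    scales[0] = scale_intercept              # weaker prior on the intercept

    def neg_log_posterior(beta):
        eta = X @ beta
        # logistic log-likelihood, written in a numerically stable form
        loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
        # independent Cauchy(0, scale) log-densities (additive constants dropped)
        logprior = -np.sum(np.log1p((beta / scales) ** 2))
        return -(loglik + logprior)

    return minimize(neg_log_posterior, np.zeros(p), method="BFGS").x

# Invented data with complete separation; the MAP estimate stays finite.
rng = np.random.default_rng(0)
x = rng.normal(size=40)
y = (x > 0).astype(float)
x_scaled = 0.5 * (x - x.mean()) / x.std()    # rescale to mean 0, sd 0.5
X = np.column_stack([np.ones_like(x_scaled), x_scaled])
print(cauchy_map_logit(X, y))
```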

1,598 citations

Journal ArticleDOI
TL;DR: It is shown that the collapsing method, which involves collapsing genotypes across variants and applying a univariate test, is powerful for analyzing rare variants, whereas multivariate analysis is robust against inclusion of noncausal variants.
Abstract: Although whole-genome association studies using tagSNPs are a powerful approach for detecting common variants, they are underpowered for detecting associations with rare variants. Recent studies have demonstrated that common diseases can be due to functional variants with a wide spectrum of allele frequencies, ranging from rare to common. An effective way to identify rare variants is through direct sequencing. The development of cost-effective sequencing technologies enables association studies to use sequence data from candidate genes and, in the future, from the entire genome. Although methods used for analysis of common variants are applicable to sequence data, their performance might not be optimal. In this study, it is shown that the collapsing method, which involves collapsing genotypes across variants and applying a univariate test, is powerful for analyzing rare variants, whereas multivariate analysis is robust against inclusion of noncausal variants. Both methods are superior to analyzing each variant individually with univariate tests. In order to unify the advantages of both collapsing and multiple-marker tests, we developed the Combined Multivariate and Collapsing (CMC) method and demonstrated that the CMC method is both powerful and robust. The CMC method can be applied to either candidate-gene or whole-genome sequence data.
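The collapsing step described above amounts to little more than one line of code. The sketch below shows only collapse-plus-univariate-test on hypothetical genotype data; the full CMC method additionally carries the common variants into a multivariate test (Hotelling's T^2 in the paper). Function name, allele frequencies and sample sizes are invented.

```python
import numpy as np
from scipy.stats import fisher_exact

def collapse_rare(genotypes, rare_cols):
    """Collapse rare variants into a single carrier indicator per individual.

    genotypes : (n_individuals, n_variants) array of minor-allele counts (0/1/2).
    rare_cols : indices of the variants deemed rare (e.g. frequency below 1%).
    Returns a 0/1 vector: 1 if the individual carries any rare variant.
    """
    return (genotypes[:, rare_cols].sum(axis=1) > 0).astype(int)

# Hypothetical genotypes for cases and controls at three rare variants.
rng = np.random.default_rng(1)
cases = rng.binomial(2, [0.02, 0.01, 0.015], size=(200, 3))
controls = rng.binomial(2, [0.005, 0.004, 0.006], size=(200, 3))

carrier_cases = collapse_rare(cases, [0, 1, 2])
carrier_controls = collapse_rare(controls, [0, 1, 2])

# Univariate test on the collapsed indicator (2x2 carrier-by-status table).
table = [[carrier_cases.sum(), len(carrier_cases) - carrier_cases.sum()],
         [carrier_controls.sum(), len(carrier_controls) - carrier_controls.sum()]]
print(fisher_exact(table))
```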

1,500 citations


Cites methods from "A solution to the problem of separa..."

  • ...It is a well-known phenomenon that low cell counts or empty cells can cause numerical instability of the maximum-likelihood estimation.(34) When logistic-regression analysis was applied to collapsed variants or to the CMC method, type I error was well controlled; however, this might not be the case if after collapsing the total allele frequency is still very low....

    [...]

Journal ArticleDOI
TL;DR: Monte Carlo analysis demonstrates that, for the types of hazards one often sees in substantive research, the polynomial approximation always outperforms time dummies and generally performs as well as splines or even more flexible autosmoothing procedures.
Abstract: Since Beck, Katz, and Tucker (1998), the standard method for modeling time dependence in binary data has been to incorporate time dummies or splined time in logistic regressions. Although we agree with the need for modeling time dependence, we demonstrate that time dummies can induce estimation problems due to separation. Splines do not suffer from these problems. However, the complexity of splines has led substantive researchers (1) to use knot values that may be inappropriate for their data and (2) to ignore any substantive discussion concerning temporal dependence. We propose a relatively simple alternative: including t, t^2, and t^3 in the regression. This cubic polynomial approximation is trivial to implement—and, therefore, interpret—and it avoids problems such as quasi-complete separation. Monte Carlo analysis demonstrates that, for the types of hazards one often sees in substantive research, the polynomial approximation always outperforms time dummies and generally performs as well as splines or even more flexible autosmoothing procedures. Due to its simplicity, this method also accommodates nonproportional hazards in a straightforward way. We reanalyze Crowley and Skocpol (2001) using nonproportional hazards and find new empirical support for the historical-institutionalist perspective.
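A minimal sketch of the proposed cubic-polynomial specification, on simulated data of our own construction (the rescaling of t is only to keep the cubic term numerically small), could look like this:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical binary time-series cross-section data: 50 units observed for
# 20 periods each, one substantive covariate x and a binary outcome y.
rng = np.random.default_rng(2)
n_units, n_periods = 50, 20
x = rng.normal(size=(n_units, n_periods))
t = np.tile(np.arange(1, n_periods + 1), (n_units, 1))   # time counter per unit
p = 1.0 / (1.0 + np.exp(-(-3.0 + 0.10 * t + 0.5 * x)))   # rising baseline hazard
y = rng.binomial(1, p)

# The proposal: add t, t^2 and t^3 to the specification instead of
# time dummies or splines (t is rescaled only for numerical stability).
ts = (t / 10.0).ravel()
X = sm.add_constant(np.column_stack([x.ravel(), ts, ts ** 2, ts ** 3]))
fit = sm.Logit(y.ravel(), X).fit(disp=0)
print(fit.params)
```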

1,314 citations

References
Book
08 Jul 1980
TL;DR: In this book, the authors present methods for detecting influential observations and outliers and for detecting and assessing collinearity, together with applications, remedies, and directions for extensions.
Abstract: 1. Introduction and Overview. 2. Detecting Influential Observations and Outliers. 3. Detecting and Assessing Collinearity. 4. Applications and Remedies. 5. Research Issues and Directions for Extensions. Bibliography. Author Index. Subject Index.

6,449 citations

Journal ArticleDOI
01 May 1981
TL;DR: This work discusses detecting influential observations and outliers, methods for detecting and assessing collinearity, and their applications and remedies.
Abstract: 1. Introduction and Overview. 2. Detecting Influential Observations and Outliers. 3. Detecting and Assessing Collinearity. 4. Applications and Remedies. 5. Research Issues and Directions for Extensions. Bibliography. Author Index. Subject Index.

4,948 citations

Journal ArticleDOI
TL;DR: In this paper, the first-order term is removed from the asymptotic bias of maximum likelihood estimates by a suitable modification of the score function, and the effect is to penalize the likelihood by the Jeffreys invariant prior.
Abstract: It is shown how, in regular parametric problems, the first-order term is removed from the asymptotic bias of maximum likelihood estimates by a suitable modification of the score function. In exponential families with canonical parameterization the effect is to penalize the likelihood by the Jeffreys invariant prior. In binomial logistic models, Poisson log linear models and certain other generalized linear models, the Jeffreys prior penalty function can be imposed in standard regression software using a scheme of iterative adjustments to the data.
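For binomial logistic models, the penalized log-likelihood and modified score implied by this abstract can be written as follows (a restatement in our notation, not a quotation from the paper):

```latex
% Jeffreys-prior penalized log-likelihood and Firth's modified score
% equations for binomial logistic regression, with I(beta) = X' W X.
\log L^{*}(\beta) \;=\; \log L(\beta) \;+\; \tfrac{1}{2}\,\log\bigl|\,I(\beta)\,\bigr| ,
\qquad
U_{r}^{*}(\beta) \;=\; \sum_{i=1}^{n}\Bigl\{\, y_i - \pi_i + h_i\bigl(\tfrac{1}{2} - \pi_i\bigr) \Bigr\}\, x_{ir} \;=\; 0 ,
\quad r = 1,\dots,k .
```

Here π_i is the fitted probability for observation i, W = diag{π_i(1 − π_i)}, I(β) = X'WX, and h_i is the i-th diagonal element of the hat matrix H = W^(1/2) X (X'WX)^(-1) X' W^(1/2); solving the modified score equations is equivalent to maximizing log L*(β).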

3,362 citations


"A solution to the problem of separa..." refers background or methods in this paper

  • ...By using this modification Firth [19] showed that the O(n⁻¹) bias of maximum likelihood...

    [...]

  • ...order to reduce the small sample bias of these estimates Firth [19] suggested basing estimation on modified score equations...

    [...]

  • ...Estimation of standard errors can be based on the roots of the diagonal elements of I(β̂)⁻¹, which is a first-order approximation to {−∂² log L*/(∂β)²}⁻¹ (see Firth, reference [19], p....

    [...]

  • ...In the following section we first review some principal ideas of Firth [19], then deal with their implementation in logistic regression (FL), and, finally, suggest confidence intervals based on the profile penalized likelihood....

    [...]

Book
01 Dec 1993
TL;DR: This book discusses the design of clinical trials, the use of computer software in survival analysis, and some non-parametric procedures for modelling survival data.
Abstract: Some non-parametric procedures. Modelling survival data. The Cox Regression Model. Design of clinical trials. Some other models for survival data. Model checking. Time-dependent covariates. Interval censored survival data. Multi-state survival models. Some additional topics. Use of computer software in survival analysis. Appendices: Example data sets. Maximum likelihood estimation, score statistics and information. GLIM macros for survival analysis.

2,564 citations