scispace - formally typeset
Author

Joseph L. Schafer

Bio: Joseph L. Schafer is an academic researcher from Pennsylvania State University. He has contributed to research on missing data and imputation (statistics), has an h-index of 20, and has co-authored 29 publications receiving 15,403 citations.

Papers
Journal ArticleDOI
TL;DR: Presents two highly recommended general approaches to missing data, maximum likelihood (ML) and Bayesian multiple imputation (MI), and discusses newer developments that may eventually extend the ML and MI methods that currently represent the state of the art.
Abstract: Statistical procedures for missing data have vastly improved, yet misconception and unsound practice still abound. The authors frame the missing-data problem, review methods, offer advice, and raise issues that remain unresolved. They clear up common misunderstandings regarding the missing at random (MAR) concept. They summarize the evidence against older procedures and, with few exceptions, discourage their use. They present, in both technical and practical language, 2 general approaches that come highly recommended: maximum likelihood (ML) and Bayesian multiple imputation (MI). Newer developments are discussed, including some for dealing with missing data that are not MAR. Although not yet in the mainstream, these procedures may eventually extend the ML and MI methods that currently represent the state of the art.

10,568 citations
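The MI workflow the abstract recommends — impute several times, analyze each completed data set, then pool — can be sketched in a few lines. A toy numpy simulation (all data and parameters are invented for illustration; drawing imputations from a single fitted model, as here, is "improper" MI, whereas real MI software also draws the imputation-model parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 20

# Toy data: x fully observed, y missing at random (MAR) given x.
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)          # true E[y] = 2
missing = rng.random(n) < 1 / (1 + np.exp(-x))  # P(missing) rises with x
obs = ~missing

# Imputation model fitted on complete cases (valid under MAR given x).
slope, intercept = np.polyfit(x[obs], y[obs], 1)
resid_sd = np.std(y[obs] - (intercept + slope * x[obs]), ddof=2)

estimates, variances = [], []
for _ in range(m):
    y_imp = y.copy()
    # Stochastic regression imputation: prediction plus random noise.
    y_imp[missing] = (intercept + slope * x[missing]
                      + rng.normal(0, resid_sd, missing.sum()))
    estimates.append(y_imp.mean())
    variances.append(y_imp.var(ddof=1) / n)

# Rubin's rules: pool the m point estimates and their variances.
qbar = np.mean(estimates)              # pooled point estimate
ubar = np.mean(variances)              # within-imputation variance
b = np.var(estimates, ddof=1)          # between-imputation variance
t = ubar + (1 + 1 / m) * b             # total variance
print(qbar, np.sqrt(t))
```

The total variance exceeds the within-imputation variance by a between-imputation term, which is how MI propagates the uncertainty that single imputation ignores.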

Journal ArticleDOI
TL;DR: A simulation assesses the potential costs and benefits of a restrictive strategy, which makes minimal use of auxiliary variables, versus an inclusive strategy, which makes liberal use of such variables; it shows that the inclusive strategy is to be greatly preferred.
Abstract: Two classes of modern missing data procedures, maximum likelihood (ML) and multiple imputation (MI), tend to yield similar results when implemented in comparable ways. In either approach, it is possible to include auxiliary variables solely for the purpose of improving the missing data procedure. A simulation was presented to assess the potential costs and benefits of a restrictive strategy, which makes minimal use of auxiliary variables, versus an inclusive strategy, which makes liberal use of such variables. The simulation showed that the inclusive strategy is to be greatly preferred. With an inclusive strategy not only is there a reduced chance of inadvertently omitting an important cause of missingness, there is also the possibility of noticeable gains in terms of increased efficiency and reduced bias, with only minor costs. As implemented in currently available software, the ML approach tends to encourage the use of a restrictive strategy, whereas the MI approach makes it relatively simple to use an inclusive strategy.

2,153 citations
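The restrictive-versus-inclusive contrast can be illustrated with a toy simulation (hypothetical data, not the authors' simulation design): an auxiliary variable drives missingness and correlates with the outcome, so ignoring it biases the estimate, while bringing it into the imputation model removes the bias.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Auxiliary variable a is correlated with y and drives missingness,
# but is not part of the analysis model (whose target is just E[y] = 0).
a = rng.normal(size=n)
y = a + rng.normal(size=n)
missing = rng.random(n) < 1 / (1 + np.exp(-2 * a))  # MAR given a
obs = ~missing

# Restrictive strategy: ignore a; mean-impute from observed y. Biased,
# because high-a cases, which tend to have high y, are missing more often.
restrictive_est = np.where(obs, y, y[obs].mean()).mean()

# Inclusive strategy: bring a into the imputation model via y ~ a.
slope, intercept = np.polyfit(a[obs], y[obs], 1)
inclusive_est = np.where(obs, y, intercept + slope * a).mean()

print(restrictive_est, inclusive_est)   # true value is 0
```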

Journal ArticleDOI
Abstract: Latent class analysis (LCA) is a statistical method used to identify a set of discrete, mutually exclusive latent classes of individuals based on their responses to a set of observed categorical variables. In multiple-group LCA, both the measurement part and structural part of the model can vary across groups, and measurement invariance across groups can be empirically tested. LCA with covariates extends the model to include predictors of class membership. In this article, we introduce PROC LCA, a new SAS procedure for conducting LCA, multiple-group LCA, and LCA with covariates. The procedure is demonstrated using data on alcohol use behavior in a national sample of high school seniors.

888 citations
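The measurement model behind LCA — class prevalences plus class-conditional item-response probabilities, typically fitted by EM — can be sketched in numpy. This is an illustrative toy for two classes and binary items, not PROC LCA; all parameter values are invented:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 2000, 4

# Simulate two latent classes with distinct item-response profiles.
true_pi = np.array([0.6, 0.4])                 # class prevalences
true_rho = np.array([[0.9, 0.8, 0.9, 0.8],     # P(item = 1 | class 0)
                     [0.2, 0.1, 0.2, 0.1]])    # P(item = 1 | class 1)
z = (rng.random(n) < true_pi[1]).astype(int)   # latent class labels
x = (rng.random((n, k)) < true_rho[z]).astype(float)

# EM for a 2-class latent class model with binary indicators.
pi = np.array([0.5, 0.5])
rho = rng.uniform(0.3, 0.7, size=(2, k))
for _ in range(200):
    # E-step: posterior class membership given the observed responses.
    log_lik = (x @ np.log(rho).T + (1 - x) @ np.log(1 - rho).T
               + np.log(pi))
    post = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    # M-step: update prevalences and item-response probabilities.
    pi = post.mean(axis=0)
    rho = (post.T @ x) / post.sum(axis=0)[:, None]

print(pi)
print(rho)
```

Class labels are only identified up to permutation, so any check of the recovered parameters has to match classes by prevalence or profile first.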

Journal ArticleDOI
TL;DR: Doubly robust (DR) procedures apply an outcome model and a missingness model simultaneously and produce a consistent estimate of the parameter if either of the two models has been correctly specified; a simulation nevertheless demonstrates that, in at least some settings, two wrong models are not better than one.
Abstract: When outcomes are missing for reasons beyond an investigator's control, there are two different ways to adjust a parameter estimate for covariates that may be related both to the outcome and to missingness. One approach is to model the relationships between the covariates and the outcome and use those relationships to predict the missing values. Another is to model the probabilities of missingness given the covariates and incorporate them into a weighted or stratified estimate. Doubly robust (DR) procedures apply both types of model simultaneously and produce a consistent estimate of the parameter if either of the two models has been correctly specified. In this article, we show that DR estimates can be constructed in many ways. We compare the performance of various DR and non-DR estimates of a population mean in a simulated example where both models are incorrect but neither is grossly misspecified. Methods that use inverse-probabilities as weights, whether they are DR or not, are sensitive to misspecification of the propensity model when some estimated propensities are small. Many DR methods perform better than simple inverse-probability weighting. None of the DR methods we tried, however, improved upon the performance of simple regression-based prediction of the missing values. This study does not represent every missing-data problem that will arise in practice. But it does demonstrate that, in at least some settings, two wrong models are not better than one.

529 citations
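The DR construction described here can be sketched as an augmented inverse-probability-weighted (AIPW) estimate of a mean with missing outcomes, alongside the two single-model estimates it combines. A toy numpy example (invented data; the true response probabilities are plugged in for brevity, where practice would estimate them):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

x = rng.normal(size=n)
y = 1.0 + x + rng.normal(size=n)             # true E[y] = 1
e_true = 1 / (1 + np.exp(-x))                # P(y observed | x)
r = (rng.random(n) < e_true).astype(float)   # 1 = outcome observed

# Outcome model: regress y on x among responders.
slope, intercept = np.polyfit(x[r == 1], y[r == 1], 1)
m_hat = intercept + slope * x                # prediction for everyone

# Propensity model: true response probabilities plugged in here;
# in practice they would be estimated, e.g. by logistic regression.
e_hat = e_true

# Doubly robust (AIPW) estimate of E[y]: inverse-probability-weighted
# observed outcomes, augmented by the outcome-model predictions.
mu_dr = np.mean(r * y / e_hat - (r - e_hat) / e_hat * m_hat)

mu_cc = y[r == 1].mean()                     # complete cases (biased)
mu_reg = m_hat.mean()                        # regression prediction alone
print(mu_cc, mu_reg, mu_dr)
```

With both working models correct, the DR and regression estimates land near the truth while the complete-case mean is biased upward, since responders tend to have high x and hence high y.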

Journal ArticleDOI
TL;DR: The authors review Rubin's definition of an average causal effect (ACE) as the average difference between potential outcomes under different treatments, and describe nine strategies for estimating ACEs based on regression, propensity scores, and doubly robust methods.
Abstract: In a well-designed experiment, random assignment of participants to treatments makes causal inference straightforward. However, if participants are not randomized (as in observational study, quasi-experiment, or nonequivalent control-group designs), group comparisons may be biased by confounders that influence both the outcome and the alleged cause. Traditional analysis of covariance, which includes confounders as predictors in a regression model, often fails to eliminate this bias. In this article, the authors review Rubin's definition of an average causal effect (ACE) as the average difference between potential outcomes under different treatments. The authors distinguish an ACE and a regression coefficient. The authors review 9 strategies for estimating ACEs on the basis of regression, propensity scores, and doubly robust methods, providing formulas for standard errors not given elsewhere. To illustrate the methods, the authors simulate an observational study to assess the effects of dieting on emotional distress. Drawing repeated samples from a simulated population of adolescent girls, the authors assess each method in terms of bias, efficiency, and interval coverage. Throughout the article, the authors offer insights and practical guidance for researchers who attempt causal inference with observational data.

461 citations
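One of the regression-based strategies reviewed — predict each person's potential outcomes under both treatments from a fitted outcome model, then average the difference — can be sketched as follows (toy confounded data with invented parameters, not the authors' dieting simulation):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000

# Confounder c affects both treatment uptake and the outcome.
c = rng.normal(size=n)
t = (rng.random(n) < 1 / (1 + np.exp(-c))).astype(float)
y = 2.0 * t + 1.5 * c + rng.normal(size=n)       # true ACE = 2

# Naive group difference is confounded by c.
naive = y[t == 1].mean() - y[t == 0].mean()

# Regression estimation of the ACE: fit an outcome model with treatment
# and confounder, then contrast each person's predicted potential
# outcomes under treatment and under control, averaged over the sample.
X = np.column_stack([np.ones(n), t, c])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
y1 = beta[0] + beta[1] * 1 + beta[2] * c         # predicted Y(1)
y0 = beta[0] + beta[1] * 0 + beta[2] * c         # predicted Y(0)
ace = (y1 - y0).mean()

print(naive, ace)
```

In this linear model the averaged contrast equals the treatment coefficient; the potential-outcome formulation matters once the outcome model includes interactions or nonlinear terms.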


Cited by
Journal ArticleDOI
TL;DR: Presents two highly recommended general approaches to missing data, maximum likelihood (ML) and Bayesian multiple imputation (MI), and discusses newer developments that may eventually extend the ML and MI methods that currently represent the state of the art.
Abstract: Statistical procedures for missing data have vastly improved, yet misconception and unsound practice still abound. The authors frame the missing-data problem, review methods, offer advice, and raise issues that remain unresolved. They clear up common misunderstandings regarding the missing at random (MAR) concept. They summarize the evidence against older procedures and, with few exceptions, discourage their use. They present, in both technical and practical language, 2 general approaches that come highly recommended: maximum likelihood (ML) and Bayesian multiple imputation (MI). Newer developments are discussed, including some for dealing with missing data that are not MAR. Although not yet in the mainstream, these procedures may eventually extend the ML and MI methods that currently represent the state of the art.

10,568 citations

Journal ArticleDOI
TL;DR: The R package mice adds new functionality for imputing multilevel data, automatic predictor selection, data handling, post-processing of imputed values, specialized pooling routines, model selection tools, and diagnostic graphs.
Abstract: The R package mice imputes incomplete multivariate data by chained equations. The software mice 1.0 appeared in the year 2000 as an S-PLUS library, and in 2001 as an R package. mice 1.0 introduced predictor selection, passive imputation and automatic pooling. This article documents mice, which extends the functionality of mice 1.0 in several ways. In mice, the analysis of imputed data is made completely general, whereas the range of models under which pooling works is substantially extended. mice adds new functionality for imputing multilevel data, automatic predictor selection, data handling, post-processing imputed values, specialized pooling routines, model selection tools, and diagnostic graphs. Imputation of categorical data is improved in order to bypass problems caused by perfect prediction. Special attention is paid to transformations, sum scores, indices and interactions using passive imputation, and to the proper setup of the predictor matrix. mice can be downloaded from the Comprehensive R Archive Network. This article provides a hands-on, stepwise approach to solve applied incomplete data problems.

10,234 citations
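The chained-equations idea that mice implements — cycle through the incomplete variables, regressing each on the others and imputing with noise — can be sketched for two variables in numpy. This toy is not the mice package (which is R) and omits its predictor selection, pooling, and diagnostics:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2_000

# Correlated pair with missingness scattered across both variables.
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)   # corr(x, y) = 0.8
x_mis = rng.random(n) < 0.2
y_mis = rng.random(n) < 0.2

# Start from mean imputation, then cycle: impute x from y, then y from x,
# refitting each model on the originally observed cases and adding noise.
xi = np.where(x_mis, x[~x_mis].mean(), x)
yi = np.where(y_mis, y[~y_mis].mean(), y)
for _ in range(10):
    for target, mis, other in ((xi, x_mis, yi), (yi, y_mis, xi)):
        obs = ~mis
        slope, intercept = np.polyfit(other[obs], target[obs], 1)
        resid_sd = np.std(target[obs] - (intercept + slope * other[obs]),
                          ddof=2)
        target[mis] = (intercept + slope * other[mis]
                       + rng.normal(0, resid_sd, mis.sum()))

# Adding residual noise (unlike plain mean imputation) keeps the
# correlation between the completed variables near its true value.
r_imp = np.corrcoef(xi, yi)[0, 1]
print(r_imp)
```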

Journal ArticleDOI
TL;DR: The propensity score is a balancing score: conditional on the propensity score, the distribution of observed baseline covariates is similar between treated and untreated subjects. The article describes four propensity score methods, balance diagnostics, and different causal average treatment effects and their relationship with propensity score analyses.
Abstract: The propensity score is the probability of treatment assignment conditional on observed baseline characteristics. The propensity score allows one to design and analyze an observational (nonrandomized) study so that it mimics some of the particular characteristics of a randomized controlled trial. In particular, the propensity score is a balancing score: conditional on the propensity score, the distribution of observed baseline covariates will be similar between treated and untreated subjects. I describe 4 different propensity score methods: matching on the propensity score, stratification on the propensity score, inverse probability of treatment weighting using the propensity score, and covariate adjustment using the propensity score. I describe balance diagnostics for examining whether the propensity score model has been adequately specified. Furthermore, I discuss differences between regression-based methods and propensity score-based methods for the analysis of observational data. I describe different causal average treatment effects and their relationship with propensity score analyses.

7,895 citations
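Of the four methods listed, inverse probability of treatment weighting is the quickest to sketch, together with the balance diagnostic the abstract emphasizes. A toy numpy example (invented data; the true propensity score is used, where practice would estimate it, e.g. by logistic regression):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50_000

# Confounded treatment assignment, as in an observational study.
c = rng.normal(size=n)
e = 1 / (1 + np.exp(-c))                     # true propensity P(T=1 | c)
t = (rng.random(n) < e).astype(float)
y = 1.0 * t + 2.0 * c + rng.normal(size=n)   # true treatment effect = 1

# IPTW: weight each subject by 1 / P(received own treatment | covariates),
# which reweights both arms to resemble the full sample.
w = t / e + (1 - t) / (1 - e)
ate_iptw = (np.sum(w * t * y) / np.sum(w * t)
            - np.sum(w * (1 - t) * y) / np.sum(w * (1 - t)))

# Balance diagnostic: after weighting, the confounder should have
# similar means in the treated and untreated groups.
c_treated = np.sum(w * t * c) / np.sum(w * t)
c_control = np.sum(w * (1 - t) * c) / np.sum(w * (1 - t))

naive = y[t == 1].mean() - y[t == 0].mean()
print(naive, ate_iptw, c_treated, c_control)
```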

Book
01 Jan 2006
TL;DR: Detailed, worked-through examples drawn from psychology, management, and sociology studies illustrate the procedures, pitfalls, and extensions of CFA methodology.
Abstract: With its emphasis on practical and conceptual aspects, rather than mathematics or formulas, this accessible book has established itself as the go-to resource on confirmatory factor analysis (CFA). Detailed, worked-through examples drawn from psychology, management, and sociology studies illustrate the procedures, pitfalls, and extensions of CFA methodology. The text shows how to formulate, program, and interpret CFA models using popular latent variable software packages (LISREL, Mplus, EQS, SAS/CALIS); understand the similarities and differences between CFA and exploratory factor analysis (EFA); and report results from a CFA study. It is filled with useful advice and tables that outline the procedures. The companion website offers data and program syntax files for most of the research examples, as well as links to CFA-related resources. New to This Edition: *Updated throughout to incorporate important developments in latent variable modeling. *Chapter on Bayesian CFA and multilevel measurement models. *Addresses new topics (with examples): exploratory structural equation modeling, bifactor analysis, measurement invariance evaluation with categorical indicators, and a new method for scaling latent variables. *Utilizes the latest versions of major latent variable software packages.

7,620 citations

Journal ArticleDOI
TL;DR: Describes the principles of multiple imputation by chained equations, shows how to impute categorical and quantitative variables (including skewed variables), and covers the practical analysis of multiply imputed data, including model building and model checking.
Abstract: Multiple imputation by chained equations is a flexible and practical approach to handling missing data. We describe the principles of the method and show how to impute categorical and quantitative variables, including skewed variables. We give guidance on how to specify the imputation model and how many imputations are needed. We describe the practical analysis of multiply imputed data, including model building and model checking. We stress the limitations of the method and discuss the possible pitfalls. We illustrate the ideas using a data set in mental health, giving Stata code fragments. Copyright © 2010 John Wiley & Sons, Ltd.

6,349 citations
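For skewed variables of the kind discussed here, a common device is predictive mean matching (PMM), which imputes only actually-observed values and so preserves the variable's shape. A minimal single-donor sketch in numpy (real implementations typically draw at random among several nearest donors):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 3_000

x = rng.normal(size=n)
y = np.exp(0.5 * x + rng.normal(scale=0.5, size=n))  # right-skewed, y > 0
mis = rng.random(n) < 0.4                            # missing completely at random
obs = ~mis
y_obs = np.where(mis, np.nan, y)

# PMM: fit y ~ x on observed cases, then for each missing case copy the
# observed y whose predicted value is nearest. Imputed values are always
# real observed values, so the skewed, strictly positive shape of y is
# preserved (normal-theory regression imputation could go negative).
slope, intercept = np.polyfit(x[obs], y_obs[obs], 1)
pred = intercept + slope * x
donors = np.where(obs)[0]
y_imp = y_obs.copy()
for i in np.where(mis)[0]:
    j = donors[np.argmin(np.abs(pred[donors] - pred[i]))]
    y_imp[i] = y_obs[j]

print(y_imp.min(), y_imp.mean())
```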