
Showing papers by "Donald B. Rubin" published in 2001


Journal ArticleDOI
TL;DR: In this article, it was shown that the likelihood ratio statistic based on the Kullback-Leibler information criterion of the null hypothesis that a random sample is drawn from a k0-component normal mixture distribution against the alternative hypothesis that the sample is drawn from a k1-component normal mixture distribution is asymptotically distributed as a weighted sum of independent chi-squared random variables with one degree of freedom, under general regularity conditions.
Abstract: We demonstrate that, under a theorem proposed by Vuong, the likelihood ratio statistic based on the Kullback-Leibler information criterion of the null hypothesis that a random sample is drawn from a k0-component normal mixture distribution against the alternative hypothesis that the sample is drawn from a k1-component normal mixture distribution is asymptotically distributed as a weighted sum of independent chi-squared random variables with one degree of freedom, under general regularity conditions. We report simulation studies of two cases where we are testing a single normal versus a two-component normal mixture and a two-component normal mixture versus a three-component normal mixture. An empirical adjustment to the likelihood ratio statistic is proposed that appears to improve the rate of convergence to the limiting distribution.
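The limiting distribution above is easiest to appreciate through the kind of simulation the authors report. Below is a minimal sketch, assuming NumPy and scikit-learn are available, of computing the likelihood ratio statistic for a single normal versus a two-component normal mixture on data generated under the null; the paper's empirical adjustment is not implemented here, and repeating the calculation over many simulated samples would approximate the null distribution.

```python
# Sketch: likelihood ratio statistic for H0: one normal vs H1: two-component
# normal mixture, on data generated under H0. Illustrative only.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n = 500
x = rng.normal(loc=0.0, scale=1.0, size=(n, 1))  # data generated under H0

def max_loglik(data, k):
    """Maximized log-likelihood of a k-component normal mixture (EM fit)."""
    gm = GaussianMixture(n_components=k, n_init=10, random_state=0).fit(data)
    return gm.score(data) * len(data)  # score() is the mean log-likelihood

lrt = 2.0 * (max_loglik(x, 2) - max_loglik(x, 1))
print(f"likelihood ratio statistic: {lrt:.3f}")
# Per the result above, its null distribution is a weighted sum of independent
# chi-squared(1) variables rather than a single chi-squared.
```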

3,531 citations


Journal ArticleDOI
TL;DR: How to use the techniques of matching, subclassification and/or weighting to help design observational studies is the primary focus of this article, and a new diagnostic table is proposed to aid in this endeavor.
Abstract: Propensity score methodology can be used to help design observational studies in a way analogous to the way randomized experiments are designed: without seeing any answers involving outcome variables. The typical models used to analyze observational data (e.g., least squares regressions, difference of difference methods) involve outcomes, and so cannot be used for design in this sense. Because the propensity score is a function only of covariates, not outcomes, repeated analyses attempting to balance covariate distributions across treatment groups do not bias estimates of the treatment effect on outcome variables. This theme will be the primary focus of this article: how to use the techniques of matching, subclassification and/or weighting to help design observational studies. The article also proposes a new diagnostic table to aid in this endeavor, which is especially useful when there are many covariates under consideration. The conclusion of the initial design phase may be that the treatment and control groups are too far apart to produce reliable effect estimates without heroic modeling assumptions. In such cases, it may be wisest to abandon the intended observational study and search for a more acceptable data set where such heroic modeling assumptions are not necessary. The ideas and techniques will be illustrated using the initial design of an observational study for use in the tobacco litigation based on the NMES data set.
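As a hedged illustration of the design-stage workflow described above (not the paper's own diagnostic table), the sketch below estimates propensity scores by logistic regression on hypothetical covariates, forms five propensity-score subclasses, and reports standardized mean differences before and within subclasses; all variable names and data are invented for the example.

```python
# Sketch: propensity-score design stage -- estimate scores from covariates only
# (no outcomes) and check covariate balance. Illustrative, not the paper's method.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "age": rng.normal(50, 10, n),
    "income": rng.normal(40, 15, n),
})
# Hypothetical treatment assignment that depends on the covariates
p_true = 1 / (1 + np.exp(-(df["age"] - 50) / 10))
df["treated"] = rng.binomial(1, p_true)

covs = ["age", "income"]
ps_model = LogisticRegression().fit(df[covs], df["treated"])
df["pscore"] = ps_model.predict_proba(df[covs])[:, 1]
df["subclass"] = pd.qcut(df["pscore"], 5, labels=False)  # 5 subclasses

def std_diff(d, var):
    """Standardized mean difference of var between treated and control."""
    t, c = d.loc[d.treated == 1, var], d.loc[d.treated == 0, var]
    pooled_sd = np.sqrt((t.var() + c.var()) / 2)
    return (t.mean() - c.mean()) / pooled_sd

for v in covs:
    overall = std_diff(df, v)
    within = np.mean([std_diff(d, v) for _, d in df.groupby("subclass")])
    print(f"{v}: std. diff = {overall:.2f} overall, {within:.2f} within subclasses")
```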

1,830 citations


Journal ArticleDOI
TL;DR: In this article, the effects of the magnitude of lottery prizes on economic behavior were analyzed using an original survey of people playing the lottery in Massachusetts in the mid-1980's, and the authors found that unearned income reduces labor earnings, with a marginal propensity to consume leisure of approximately 11 percent, with larger effects for individuals between 55 and 65 years old.
Abstract: This paper provides empirical evidence about the effect of unearned income on earnings, consumption, and savings. Using an original survey of people playing the lottery in Massachusetts in the mid-1980's, the effects of the magnitude of lottery prizes on economic behavior are analyzed. The critical assumption is that among lottery winners the magnitude of the prize is randomly assigned. It is found that unearned income reduces labor earnings, with a marginal propensity to consume leisure of approximately 11 percent, with larger effects for individuals between 55 and 65 years old. After receiving about half their prize, individuals saved about 16 percent.
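To make the identification idea concrete, here is a small sketch, with entirely simulated data and hypothetical variable names, of the regression implied by treating prize size as randomly assigned among winners; it uses statsmodels and is not the paper's actual specification.

```python
# Sketch: regress labor earnings on the annual prize payment; under random
# assignment of prize size among winners, the slope estimates the effect of
# unearned income. Simulated data, illustrative only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 400
yearly_prize = rng.exponential(scale=20_000, size=n)           # unearned income ($/yr)
earnings = 35_000 - 0.11 * yearly_prize + rng.normal(0, 8_000, n)

X = sm.add_constant(yearly_prize)
fit = sm.OLS(earnings, X).fit()
# The negative of the slope plays the role of the marginal propensity to
# consume leisure; the paper reports an estimate of roughly 11 percent.
print(fit.params)
```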

344 citations


Journal ArticleDOI
TL;DR: A method is proposed and illustrated that uses marginal information in the database to select mixture models, identifies sets of records for clerks to review based on the models and marginal information, incorporates clerically reviewed data into estimates of model parameters, and classifies pairs as links, nonlinks, or in need of further clerical review.
Abstract: The goal of record linkage is to link quickly and accurately records that correspond to the same person or entity. Whereas certain patterns of agreements and disagreements on variables are more likely among records pertaining to a single person than among records for different people, the observed patterns for pairs of records can be viewed as arising from a mixture of matches and nonmatches. Mixture model estimates can be used to partition record pairs into two or more groups that can be labeled as probable matches (links) and probable nonmatches (nonlinks). A method is proposed and illustrated that uses marginal information in the database to select mixture models, identifies sets of records for clerks to review based on the models and marginal information, incorporates clerically reviewed data, as they become available, into estimates of model parameters, and classifies pairs as links, nonlinks, or in need of further clerical review. The procedure is illustrated with five datasets from the U.S. Bureau ...
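A minimal sketch of the mixture idea (though not of the paper's use of marginal database information or its incorporation of clerical review): fit a two-class mixture of independent agreement indicators by EM and classify pairs by their posterior match probability, with an intermediate band sent to clerical review. Field names and thresholds are invented.

```python
# Sketch: two-class mixture over binary agreement patterns for record pairs,
# fit by EM under conditional independence, then thresholded into links,
# nonlinks, and a clerical-review band. Illustrative only.
import numpy as np

rng = np.random.default_rng(3)

# Simulated agreement indicators on 3 fields (e.g., name, birth date, address)
n_match, n_nonmatch = 200, 2000
matches = rng.binomial(1, [0.95, 0.90, 0.80], size=(n_match, 3))
nonmatches = rng.binomial(1, [0.10, 0.05, 0.15], size=(n_nonmatch, 3))
gamma = np.vstack([matches, nonmatches])          # observed patterns only

# EM for P(pattern) = pi * prod m^g (1-m)^(1-g) + (1-pi) * prod u^g (1-u)^(1-g)
pi, m, u = 0.5, np.full(3, 0.8), np.full(3, 0.2)
for _ in range(200):
    lik_m = pi * np.prod(m**gamma * (1 - m)**(1 - gamma), axis=1)
    lik_u = (1 - pi) * np.prod(u**gamma * (1 - u)**(1 - gamma), axis=1)
    w = lik_m / (lik_m + lik_u)                   # posterior match probability
    pi = w.mean()
    m = (w[:, None] * gamma).sum(0) / w.sum()
    u = ((1 - w)[:, None] * gamma).sum(0) / (1 - w).sum()

links = w > 0.99
nonlinks = w < 0.01
review = ~(links | nonlinks)                      # send to clerks
print(links.sum(), nonlinks.sum(), review.sum())
```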

177 citations


Journal ArticleDOI
TL;DR: The manner in which the presence of refreshment samples allows the researcher to test various models for attrition in panel data, including models based on the assumption that missing data are missing at random, is described.
Abstract: … statistical models that allow for more complex relationships than can be inferred using only cross-sectional data. Panel, i.e., longitudinal, data, in which the same units are observed repeatedly at different points in time, can often provide the richer data needed for such models (e.g., Chamberlain (1984), Hsiao (1986), Baltagi (1995), Arellano and Honore (forthcoming)). Missing data problems, however, can be more severe in panels, because even those units that respond in initial waves of the panel may drop out of the sample in subsequent waves (e.g., Hausman and Wise (1979), Robins and West (1986), Ridder (1990), Verbeek and Nijman (1992), Abowd, Crepon, Kramarz, and Trognon (1995), Fitzgerald, Gottschalk, and Moffitt (1998), and Vella (1998)). Sometimes, in the hope of mitigating the effects of such attrition, panel data sets are augmented by replacing the units that have dropped out with new units randomly sampled from the original population. Following Ridder (1992), who used such replacement units to test alternative models for attrition, we call such additional samples refreshment samples. Here we explore the benefits of refreshment samples for inference in the presence of attrition. Two general approaches are often used to deal with attrition in panel data sets when refreshment samples are not available. One model, based on the missing at random assumption (MAR, Rubin (1976), Little and Rubin (1987)), allows the probability of attrition to depend on lagged but not on contemporaneous variables that have missing values. The other model (denoted by HW in the remainder of the paper, given the similarity to a model developed by Hausman and Wise (1979)), allows the probability of attrition to depend on such contemporaneous, but not on lagged, variables. Both sets of models have some theoretical plausibility, but they rely on fundamentally different restrictions on the …
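The distinction between the two attrition models is easy to state in code. The sketch below, with purely simulated two-wave data, generates dropout under a MAR-type mechanism (attrition depends on the lagged, observed outcome) and under an HW-type mechanism (attrition depends on the contemporaneous, missing outcome); a refreshment sample would supply the extra wave-2 cross-section needed to test between them. Everything here is an invented illustration, not the paper's model.

```python
# Sketch of the two attrition mechanisms contrasted above, in a two-wave panel.
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
y1 = rng.normal(size=n)                       # wave-1 outcome (always observed)
y2 = 0.7 * y1 + rng.normal(size=n)            # wave-2 outcome (missing for attriters)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

attrit_mar = rng.binomial(1, sigmoid(-1 + 1.5 * y1))   # depends on lagged y1 only
attrit_hw = rng.binomial(1, sigmoid(-1 + 1.5 * y2))    # depends on contemporaneous y2

# A refreshment sample drawn from the original population in wave 2 provides an
# extra cross-section of y2 against which either attrition model can be tested.
print(attrit_mar.mean(), attrit_hw.mean())
```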

174 citations


Journal ArticleDOI
TL;DR: Three statistical models are developed for multiply imputing the missing values of airborne particulate matter and it is expected that these models are useful for creating multiple imputations in a variety of incomplete multivariate time series data sets.
Abstract: Summary. Many chemical and environmental data sets are complicated by the existence of fully missing values or censored values known to lie below detection thresholds. For example, week-long samples of airborne particulate matter were obtained at Alert, NWT, Canada, between 1980 and 1991, where some of the concentrations of 24 particulate constituents were coarsened in the sense of being either fully missing or below detection limits. To facilitate scientific analysis, it is appealing to create complete data by filling in missing values so that standard complete-data methods can be applied. We briefly review commonly used strategies for handling missing values and focus on the multiple-imputation approach, which generally leads to valid inferences when faced with missing data. Three statistical models are developed for multiply imputing the missing values of airborne particulate matter. We expect that these models are useful for creating multiple imputations in a variety of incomplete multivariate time series data sets.
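As a hedged sketch of the basic imputation step for values known only to lie below a detection limit (not the paper's three models), the code below fits a lognormal to the observed concentrations and draws the censored values from the fitted distribution truncated above at the limit, repeated to form several imputations; it uses SciPy and simulated data, and the crude parameter estimates would be replaced by a censored-data likelihood or Bayesian draws in a real analysis.

```python
# Sketch: multiple imputation of concentrations reported only as "below the
# detection limit", by drawing from a fitted lognormal truncated above at the limit.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
true_conc = rng.lognormal(mean=0.0, sigma=1.0, size=300)
limit = 0.5
observed = np.where(true_conc >= limit, true_conc, np.nan)   # censored values -> NaN

# Crude parameters from the uncensored values only (illustration, not inference)
mu, sigma = np.nanmean(np.log(observed)), np.nanstd(np.log(observed))

M = 5                                          # number of multiple imputations
imputed_sets = []
for m in range(M):
    z = observed.copy()
    n_miss = np.isnan(z).sum()
    # draw log-concentrations below log(limit) from the fitted normal
    draws = stats.truncnorm.rvs(
        a=-np.inf, b=(np.log(limit) - mu) / sigma, loc=mu, scale=sigma,
        size=n_miss, random_state=rng)
    z[np.isnan(z)] = np.exp(draws)
    imputed_sets.append(z)

print([round(np.mean(z), 3) for z in imputed_sets])
```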

121 citations


Journal ArticleDOI
TL;DR: It is shown that, in general in this setting, administrative censoring times are not independent of survival times within the two subgroups, nondropouts and sampled dropouts, and the stratified Kaplan–Meier estimator is not appropriate for the cohort survival curve.
Abstract: We investigate the use of follow-up samples of individuals to estimate survival curves from studies that are subject to right censoring from two sources: (i) early termination of the study, namely, administrative censoring, or (ii) censoring due to lost data prior to administrative censoring, so-called dropout. We assume that, for the full cohort of individuals, administrative censoring times are independent of the subjects' inherent characteristics, including survival time. To address the loss to censoring due to dropout, which we allow to be possibly selective, we consider an intensive second phase of the study where a representative sample of the originally lost subjects is subsequently followed and their data recorded. As with double-sampling designs in survey methodology, the objective is to provide data on a representative subset of the dropouts. Despite assumed full response from the follow-up sample, we show that, in general in our setting, administrative censoring times are not independent of survival times within the two subgroups, nondropouts and sampled dropouts. As a result, the stratified Kaplan-Meier estimator is not appropriate for the cohort survival curve. Moreover, using the concept of potential outcomes, as opposed to observed outcomes, and thereby explicitly formulating the problem as a missing data problem, reveals and addresses these complications. We present an estimation method based on the likelihood of an easily observed subset of the data and study its properties analytically for large samples. We evaluate our method in a realistic situation by simulating data that match published margins on survival and dropout from an actual hip-replacement study. Limitations and extensions of our design and analytic method are discussed.
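The design (though not the paper's likelihood-based estimator) can be sketched in a few lines of simulation: a cohort with administrative censoring, selective dropout that depends on survival time, and an intensively followed random subsample of the dropouts. All numbers below are invented for illustration.

```python
# Sketch of the double-sampling design described above (design only).
import numpy as np

rng = np.random.default_rng(6)
n = 5000
survival = rng.exponential(scale=36.0, size=n)              # true survival time (months)
entry = rng.uniform(0, 24, size=n)
admin_censor = 60.0 - entry                                  # study closes at month 60

# Selective dropout: subjects with shorter survival are more likely to be lost
p_drop = 1 / (1 + np.exp((survival - 24.0) / 12.0))
dropout = rng.binomial(1, p_drop).astype(bool)

# Second phase: follow up a representative 20% random sample of the dropouts
followed = dropout & (rng.uniform(size=n) < 0.20)

# Records available for analysis: nondropouts (censored at the administrative
# time) plus the followed dropouts with recovered outcomes; per the paper, the
# stratified Kaplan-Meier over these two groups is not appropriate here.
available = ~dropout | followed
time_obs = np.minimum(survival, admin_censor)[available]
event = (survival <= admin_censor)[available]
print(f"{available.mean():.2%} of cohort usable; {event.mean():.2%} observed events")
```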

58 citations


Journal ArticleDOI
TL;DR: It is shown that, in general in this setting, administrative censoring times are not independent of survival times within the two subgroups, nondropouts and sampled dropouts, and the stratified Kaplan-Meier estimator is not appropriate for the cohort survival curve.
Abstract: We thank the editorial board for the opportunity to have discussion on the issue of study design, which is often more important than analysis for obtaining reliable information, especially in problems with missing data. Double sampling designs to address dropout require allocating resources for recovering data for a subgroup of dropouts, but there are often positive trade-offs when doing so. Although such ideas have long been used for surveys with one-time enrollment, we have little evidence that they are being used systematically in studies in public health, where enrollment is often longitudinal. The conclusion is that, in these longitudinal settings, either double sampling is not often used or, as we suspect based on communications, it is employed implicitly, and the data are being analyzed with unknown methods. Our goals therefore were to …

39 citations


Journal ArticleDOI
TL;DR: The formulation presented here, although described for the problem of estimating excess health care expenditures due to the alleged misconduct of the tobacco industry, is more general and can be applied to any outcome, such as mortality, morbidity, or income from excise taxes, as well as to any situation in which consequences due to alleged misconduct or due to hypothetical programmes are to be estimated.
Abstract: An important application of statistics in recent years has been to address the causal effects of smoking. There is little doubt that there are health risks associated with smoking. However, more general issues concern the causal effects due to the alleged misconduct of the tobacco industry or due to programmes designed to curtail tobacco use. To address any such causal question, assumptions must be made. Although some of the issues are well known in the statistical and epidemiological literature, there does not appear to be a unified treatment that provides prescriptive guidance on the estimation of these causal effects with explication of the needed assumptions. A 'conduct attributable fraction' is derived, which allows for arbitrary changes in smoking and non-smoking health care expenditure related factors in a counterfactual world without the alleged misconduct, and therefore generalizes the traditional 'smoking attributable fraction'. The formulation presented here, although described for the problem of estimating excess health care expenditures due to the alleged misconduct of the tobacco industry, is more general. It can be applied to any outcome, such as mortality, morbidity, or income from excise taxes, as well as to any situation in which consequences due to alleged misconduct (for example, of two entities, such as the tobacco and the asbestos industries) or due to hypothetical programmes (for example, extra smoking reduction initiatives) are to be estimated.
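In rough notation of our own (not the paper's), the contrast between the traditional smoking attributable fraction and the conduct attributable fraction for expenditures can be written as follows, where the counterfactual behind the conduct fraction allows both smoking-related and non-smoking-related expenditure factors to differ from the actual world.

```latex
% Notation is ours, for illustration; see the paper for the precise definitions.
% E       : actual health care expenditures
% E^{ns}  : expenditures in a counterfactual world with no smoking
% E^{nm}  : expenditures in a counterfactual world without the alleged misconduct
\mathrm{SAF} = \frac{E - E^{ns}}{E},
\qquad
\mathrm{CAF} = \frac{E - E^{nm}}{E}.
```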

31 citations


Book ChapterDOI
01 Jan 2001
TL;DR: It is shown how to accelerate convergence using parameter extended EM (PX-EM), a recent modification of the EM algorithm, which reduces the number of required iterations by approximately one-half without adding any appreciable computation to each EM-iteration.
Abstract: Several large-scale educational surveys use item response theory (IRT) models to summarize complex cognitive responses and relate them to educational and demographic variables. The IRT models are often multidimensional with prespecified traits, and maximum likelihood estimates (MLEs) are found using an EM algorithm, which is typically very slow to converge. We show that the slow convergence is due primarily to missing information about the correlations between latent traits relative to the information that would be present if these traits were observed (“complete data”). We show how to accelerate convergence using parameter extended EM (PX-EM), a recent modification of the EM algorithm. The PX-EM modification is simple to implement for the IRT survey model and reduces the number of required iterations by approximately one-half without adding any appreciable computation to each EM-iteration.
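PX-EM is easiest to see on a toy model rather than on the multidimensional IRT survey model. The sketch below fits a location-scale t distribution with known degrees of freedom, a standard textbook illustration of PX-EM: the only change from plain EM is that the variance update divides by the sum of the E-step weights rather than the sample size, which typically cuts the number of iterations needed. This illustrates the general idea only; it is not the chapter's estimation code.

```python
# Sketch: EM vs PX-EM for the MLE of (mu, sigma) in a t_nu location-scale model
# with known degrees of freedom nu. Toy illustration of PX-EM, not the IRT model.
import numpy as np

rng = np.random.default_rng(7)
nu = 4.0
x = rng.standard_t(nu, size=1000) * 2.0 + 5.0    # true mu = 5, sigma = 2

def fit_t(x, nu, px=False, iters=200):
    mu, s2 = x.mean(), x.var()
    for _ in range(iters):
        w = (nu + 1.0) / (nu + (x - mu) ** 2 / s2)   # E-step weights
        mu = np.sum(w * x) / np.sum(w)
        resid2 = np.sum(w * (x - mu) ** 2)
        # M-step: plain EM divides by n; PX-EM divides by the sum of the weights
        s2 = resid2 / (np.sum(w) if px else len(x))
    return mu, np.sqrt(s2)

print("EM:   ", fit_t(x, nu, px=False))
print("PX-EM:", fit_t(x, nu, px=True))   # same MLE, reached in fewer iterations
```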

9 citations


01 Jan 2001
TL;DR: This study can be favorably contrasted with other school choice evaluations in terms of the consideration that went into the randomized experimental design, the newly developed Propensity Matched Pairs Design, and the rigorous data collection and compliance-encouraging efforts.

Book ChapterDOI
01 Jan 2001
TL;DR: More principled methods, namely multiple imputation, maximum likelihood, and inference from Bayesian posterior distributions, are presented, along with an introduction to computation via the EM algorithm and Markov chain Monte Carlo.
Abstract: Missing data are a ubiquitous problem in the social and behavioral sciences. Here we overview the problem and possible solutions. We begin by distinguishing between the pattern of missing data and the mechanism that creates the missing data. Then we consider common, but limited, approaches: complete cases, available cases, weighting analyses, and single imputation. More principled methods, namely multiple imputation, maximum likelihood, and inference from Bayesian posterior distributions, are then presented, along with an introduction to computation via the EM algorithm and Markov chain Monte Carlo.
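As a generic companion to the multiple-imputation discussion above (not the chapter's own code), the sketch below creates several completed data sets with scikit-learn's IterativeImputer and combines the resulting estimates of a mean using Rubin's combining rules; the data are simulated and the missingness depends only on an observed variable.

```python
# Sketch: multiple imputation with a chained-equations-style imputer plus
# Rubin's combining rules for a mean estimate. Generic illustration only.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(8)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)
data = np.column_stack([x1, x2])
# Make x2 missing at random, with missingness depending only on observed x1
miss = rng.uniform(size=n) < 1 / (1 + np.exp(-x1))
data[miss, 1] = np.nan

M = 10
estimates, variances = [], []
for m in range(M):
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    completed = imp.fit_transform(data)
    estimates.append(completed[:, 1].mean())
    variances.append(completed[:, 1].var(ddof=1) / n)   # variance of the mean

qbar = np.mean(estimates)                    # combined point estimate
W = np.mean(variances)                       # within-imputation variance
B = np.var(estimates, ddof=1)                # between-imputation variance
T = W + (1 + 1 / M) * B                      # total variance (Rubin's rules)
print(qbar, np.sqrt(T))
```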

Journal ArticleDOI
01 Mar 2001 - Chance
TL;DR: Thinking carefully about the issues raised by the type of self-experimentation Seth Roberts discusses leads to a better understanding of the foundations of causal inference, which involves the comparison of two "potential outcomes" from the observation of only one.
Abstract: Comment: Self-Experimentation for Causal Effects. CHANCE, Vol. 14, No. 2, pp. 16-17 (2001).