
Showing papers by "Donald B. Rubin" published in 2004


Journal ArticleDOI
TL;DR: In this paper, principal stratification is used to clarify the meaning of direct and indirect causal effects in the context of epidemiology and biomedicine; a current study of anthrax vaccine illustrates the ideas.
Abstract: The use of the concept of 'direct' versus 'indirect' causal effects is common, not only in statistics but also in many areas of social and economic sciences. The related terms of 'biomarkers' and 'surrogates' are common in pharmacological and biomedical sciences. Sometimes this concept is represented by graphical displays of various kinds. The view here is that there is a great deal of imprecise discussion surrounding this topic and, moreover, that the most straightforward way to clarify the situation is by using potential outcomes to define causal effects. In particular, I suggest that the use of principal stratification is key to understanding the meaning of direct and indirect causal effects. A current study of anthrax vaccine will be used to illustrate ideas.

327 citations
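To make the principal-stratification idea concrete, here is a minimal simulation sketch in Python. It is not from the paper: the data-generating process, effect sizes, and variable names are invented for illustration. Units are stratified by the joint potential values of an intermediate variable (say, immune response under vaccine and under placebo); within strata where the intermediate is unchanged by treatment, the comparison of potential outcomes isolates a 'direct' effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Potential values of the intermediate variable S (e.g., immune response)
# under control (s0) and under treatment (s1); the pair defines the stratum.
s0 = rng.binomial(1, 0.2, n)
s1 = np.maximum(s0, rng.binomial(1, 0.5, n))   # treatment never lowers S here

# Potential outcomes depend on the intermediate potential values plus a
# direct treatment effect of +3 that does not run through S.
y0 = rng.normal(50 + 5 * s0, 10)
y1 = rng.normal(50 + 5 * s1 + 3, 10)

strata = {
    "S unaffected, low":  (s0 == 0) & (s1 == 0),
    "S raised by trt":    (s0 == 0) & (s1 == 1),
    "S unaffected, high": (s0 == 1) & (s1 == 1),
}
for name, mask in strata.items():
    print(f"{name:>18}: average causal effect = {(y1[mask] - y0[mask]).mean():.2f}")
```

In the two strata where treatment does not move the intermediate, the within-stratum effect is close to the direct effect of 3; only in the "raised" stratum does the intermediate contribute, which is the sense in which principal strata give meaning to 'direct' versus 'indirect'.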



Journal ArticleDOI
TL;DR: Ballou et al., as discussed by the authors, used a variety of statistical models, known as "value-added" models in the education literature, to estimate school and teacher effects.
Abstract: There has been substantial interest in recent years in the performance and accountability of teachers and schools, partially due to the No Child Left Behind legislation, which requires states to develop a system of sanctions and rewards to hold districts and schools accountable for academic achievement. This focus has led to an increase in “high-stakes” testing with publicized school rankings and test results. The papers by Ballou et al. (2004), McCaffrey et al. (2004) and Tekwe et al. (2004) approach the estimation of school and teacher effects through a variety of statistical models, known as “value-added” models in the education literature. There are many complex issues involved, and we applaud the authors for addressing this challenging topic.

253 citations
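As a rough sketch of what a simple 'value-added' specification can look like, the snippet below fits a random-intercept model for school effects on test scores given a prior-year score. This is only a toy version: the cited papers use richer longitudinal models, and the data and column names here are simulated assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_schools, n_per = 40, 50
school = np.repeat(np.arange(n_schools), n_per)
school_eff = rng.normal(0, 3, n_schools)          # latent "value added"
prior = rng.normal(50, 10, n_schools * n_per)     # prior-year score
score = 10 + 0.8 * prior + school_eff[school] + rng.normal(0, 8, len(prior))
df = pd.DataFrame({"score": score, "prior": prior, "school": school})

# Random-intercept (school) model: the estimated random effects give one
# crude notion of each school's value added, conditional on prior score.
fit = smf.mixedlm("score ~ prior", df, groups=df["school"]).fit()
value_added = {g: float(re.iloc[0]) for g, re in fit.random_effects.items()}
print(fit.summary())
```

The debate in the cited papers is largely about what such random effects do and do not measure, not about the mechanics of fitting them.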


Patent
16 Apr 2004
TL;DR: In this article, a user generates a predictive model based on historical data about a system being modeled; the project includes a series of user choice points and actions or parameter settings that govern the generation of the model and, based on rules, direct the user to select and apply an optimal model.
Abstract: Models are generated using a variety of tools and features of a model generation platform. For example, in connection with a project in which a user generates a predictive model based on historical data about a system being modeled, the user is provided through a graphical user interface a structured sequence of model generation activities to be followed, the sequence including dimension reduction, model generation, model process validation, and model re-generation. The project includes a series of user choice points and actions or parameter settings that govern the generation of the model based on rules, which direct the user to select and apply an optimal model.

106 citations
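The claimed sequence (dimension reduction, model generation, model process validation, model re-generation) maps naturally onto a standard modeling pipeline. Below is a minimal sketch using scikit-learn; the dataset, component counts, and model choice are invented for illustration and are not the patent's implementation.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
X_hist, X_new, y_hist, y_new = train_test_split(X, y, random_state=0)

# 1. dimension reduction  2. model generation
pipe = Pipeline([("reduce", PCA(n_components=10)),
                 ("model", LogisticRegression(max_iter=1000))])

# 3. model process validation on the historical data
scores = cross_val_score(pipe, X_hist, y_hist, cv=5)
print("validation accuracy:", scores.mean())

# 4. model re-generation: refit the validated process on all historical data
pipe.fit(X_hist, y_hist)
print("holdout accuracy:", pipe.score(X_new, y_new))
```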


Patent
16 Apr 2004
TL;DR: In this article, a user can generate a predictive model based on historical data about a system being modeled and validate the model development process with cross-validation between at least two subsets of the historical data.
Abstract: Models are generated using a variety of tools and features of a model generation platform. For example, in connection with a project in which a user generates a predictive model based on historical data about a system being modeled, the user is provided through a graphical user interface a structured sequence of model generation activities to be followed, the sequence including dimension reduction, model generation, model process validation, and model re-generation. In connection with a project in which a user generates a predictive model based on historical data about a system being modeled, the user is enabled to validate the model development process with cross-validation between at least two subsets of the historical data; the validated model development process is enabled to be reapplied.

97 citations
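A minimal sketch of the 'cross-validation between at least two subsets of the historical data' idea, with the validated process then reapplied to all of the data. The model, split, and metric are assumptions made for illustration, not the patent's method.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=1000, n_features=20, noise=10, random_state=0)

# Split the historical data into two subsets and validate in both directions:
# develop (fit) on one subset, check predictive quality on the other.
idx = rng.permutation(len(y))
a, b = idx[:500], idx[500:]
for train, test in [(a, b), (b, a)]:
    model = Ridge().fit(X[train], y[train])
    print("R^2 on held-out subset:", round(model.score(X[test], y[test]), 3))

# If both directions look acceptable, reapply the validated process to all data.
final_model = Ridge().fit(X, y)
```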


Journal ArticleDOI
TL;DR: Inference for causal effects is a critical activity in many branches of science and public policy as mentioned in this paper, and it is arguably essential that departments of statistics teach courses in causal inference to both graduate and undergraduate students.
Abstract: Inference for causal effects is a critical activity in many branches of science and public policy. The field of statistics is the one field most suited to address such problems, whether from designed experiments or observational studies. Consequently, it is arguably essential that departments of statistics teach courses in causal inference to both graduate and undergraduate students. This article discusses an outline of such courses based on repeated experience over more than a decade.

97 citations


Patent
16 Apr 2004
TL;DR: In this paper, a user generating a predictive model based on historical data about a system being modeled is provided, through a graphical user interface, a structured sequence of model generation activities to be followed: dimension reduction, model generation, model process validation, and model re-generation.
Abstract: Models are generated using a variety of tools and features of a model generation platform. For example, in connection with a project in which a user generates a predictive model based on historical data about a system being modeled, the user is provided through a graphical user interface a structured sequence of model generation activities to be followed, the sequence including dimension reduction, model generation, model process validation, and model re-generation. Historical multi-dimensional data is received representing multiple variables, transformed to be maximally predictive for at least one outcome variable, to be used as input to a predictive model of a commercial system; the model development process is validated for one or more sets of such variables, and a user of a model generation tool is enabled to combine at least two of the variables from the sets of variables.

92 citations
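One way to read 'variables transformed to be maximally predictive ... combine at least two of the variables' is as a supervised univariate transformation followed by feature combination. The sketch below uses quantile-bin target encoding as the transformation; that specific choice, and all names and data, are my assumptions rather than the patent's method.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = (np.sin(x1) + np.abs(x2) + rng.normal(scale=0.5, size=n)) > 1.0

def target_encode(x, y, bins=10):
    """Replace a variable by the mean outcome within its quantile bin --
    one simple way to transform a variable to be 'maximally predictive'."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1))
    which = np.clip(np.searchsorted(edges, x) - 1, 0, bins - 1)
    return np.array([y[which == b].mean() for b in range(bins)])[which]

t1, t2 = target_encode(x1, y), target_encode(x2, y)
combined = t1 + t2                     # combine two transformed variables
print("corr(combined, y):", round(np.corrcoef(combined, y.astype(float))[0, 1], 3))
```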


Patent
16 Apr 2004
TL;DR: In this article, a user generating a predictive model based on historical data about a system being modeled is provided, through a graphical user interface, a structured sequence of model generation activities to be followed: dimension reduction, model generation, model process validation, and model re-generation.
Abstract: Models are generated using a variety of tools and features of a model generation platform. For example, in connection with a project in which a user generates a predictive model based on historical data about a system being modeled, the user is provided through a graphical user interface a structured sequence of model generation activities to be followed, the sequence including dimension reduction, model generation, model process validation, and model re-generation. Historical multi-dimensional data is received representing multiple source variables to be used as an input to a predictive model of a commercial system, and transformations are applied to the data that are selected based on the strength of measurement represented by a variable; variables are transformed into new, more predictive variables, including the Bayesian renormalization of sparsely sampled variables and the imputation of missing values for categorical or continuous variables.

77 citations
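The 'Bayesian renormalization of sparsely sampled variables' can be read as empirical-Bayes shrinkage of noisy per-category estimates toward a global mean, with heavier shrinkage where data are sparse. A minimal sketch under that reading (the shrinkage formula, the prior strength, and the simple mean imputation at the end are assumptions, not the patent's algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sparsely sampled categorical variable: some categories have very few rows.
cats = rng.choice(20, size=300, p=np.r_[[0.4], np.full(19, 0.6 / 19)])
y = rng.normal(cats * 0.1, 1.0)

global_mean = y.mean()
prior_strength = 10.0  # pseudo-observations; an assumed tuning constant

# Shrink each category mean toward the global mean in proportion to how
# sparsely the category is sampled (few observations -> heavy shrinkage).
encoded = np.empty_like(y)
for c in np.unique(cats):
    mask = cats == c
    n_c = mask.sum()
    encoded[mask] = (n_c * y[mask].mean() + prior_strength * global_mean) / (
        n_c + prior_strength)

# Simple imputation of missing values in a continuous variable.
x = rng.normal(size=300)
x[rng.random(300) < 0.1] = np.nan
x_imputed = np.where(np.isnan(x), np.nanmean(x), x)
```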


Patent
16 Apr 2004
TL;DR: In this paper, a user generating a predictive model based on historical data about a system being modeled is provided, through a graphical user interface, a structured sequence of model generation activities to be followed: dimension reduction, model generation, model process validation, and model re-generation.
Abstract: Models are generated using a variety of tools and features of a model generation platform. For example, in connection with a project in which a user generates a predictive model based on historical data about a system being modeled, the user is provided through a graphical user interface a structured sequence of model generation activities to be followed, the sequence including dimension reduction, model generation, model process validation, and model re-generation. Historical multi-dimensional data is received representing multiple variables to be used as an input to a predictive model of a commercial system; variables are pruned for which the data is sparse or missing, and the population of variables is adjusted to represent main effects exhibited by the data as well as interaction and non-linear effects exhibited by the data.

51 citations
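A minimal sketch of the pruning-and-adjustment step described above: drop variables with too much missing data, keep the main effects, and augment with interaction and simple non-linear (quadratic) terms. Thresholds and names are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 5)), columns=list("abcde"))
df.loc[rng.random(1000) < 0.6, "e"] = np.nan  # a mostly-missing variable

# Prune variables for which the data are sparse or missing.
keep = [c for c in df.columns if df[c].isna().mean() < 0.5]
pruned = df[keep]

# Adjust the population of variables: keep main effects, add interaction
# and simple non-linear (quadratic) effects.
augmented = pruned.copy()
for i, c1 in enumerate(keep):
    augmented[f"{c1}^2"] = pruned[c1] ** 2
    for c2 in keep[i + 1:]:
        augmented[f"{c1}*{c2}"] = pruned[c1] * pruned[c2]
print(augmented.columns.tolist())
```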


Journal ArticleDOI
TL;DR: The best approach to nonresponse (missingness) in surveys is seen to be one where we can (1) insert more than one value for each missing datum, and (2) have the inserted values reflect a variety of models for the dataset.
Abstract: The general approach to nonresponse (missingness) in surveys that I will take here will be to impute values for missing data (really, several values for each missing datum). The approach that imputes one value for each missing datum is standard in practice, although often criticized by more mathematical statisticians who prefer to think about estimating parameters under some model. I am very sympathetic with the imputation position. There do not exist parameters except under hypothetical models; there do, however, exist actual observed values and values that would have been observed. Focusing on the estimation of parameters is often not what we want to do, since a hypothetical model is simply a structure that guides us to do sensible things with observed values. Of course (1) imputing one value for a missing datum can't be correct in general, and (2) in order to insert sensible values for a missing datum we must rely more or less on some model relating unobserved values to observed values. Hence, I see the best approach to be one where we can (1) insert more than one value for a missing datum, and (2) have the inserted values reflect a variety of models for the dataset. This position, focusing on values to impute rather than parameters to be estimated, is actually very Bayesian, and the Bayesian perspective guides us in our design of a general system for nonresponse. What we really want to impute is the "predictive distribution" of the missing values given the observed values (having integrated, i.e., averaged, over all model parameters). The theoretical Bayesian position tells us that (1) the missing data have a distribution given the observed data (the predictive distribution) and (2) this distribution depends on assumptions that have been made about the model. Notice that the (1)'s and (2)'s in the above paragraphs are meant to refer to the same two points. The related practical questions are (1) how do we represent in a dataset a distribution of values to impute for each missing datum? And (2) what models should we use to tie observed and unobserved values to each other in order to produce the predictive distribution needed in (1)? Section 2 addresses the first question and Section 3 addresses the second question.

50 citations
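The 'predictive distribution' view leads directly to multiple imputation: draw several completed datasets, each with missing values drawn from a model for the missing data given the observed data, then combine estimates across the completed datasets. Below is a self-contained toy sketch for a univariate normal model with values missing completely at random; the model and constants are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(10, 2, size=200)
y[rng.random(200) < 0.25] = np.nan           # missing completely at random

obs = y[~np.isnan(y)]
n_obs, n_mis = obs.size, np.isnan(y).sum()
m = 5                                        # number of imputations

completed = []
for _ in range(m):
    # Draw parameters from their posterior (normal model, vague prior),
    # then draw the missing values from the resulting predictive distribution.
    sigma2 = (n_obs - 1) * obs.var(ddof=1) / rng.chisquare(n_obs - 1)
    mu = rng.normal(obs.mean(), np.sqrt(sigma2 / n_obs))
    y_imp = y.copy()
    y_imp[np.isnan(y_imp)] = rng.normal(mu, np.sqrt(sigma2), n_mis)
    completed.append(y_imp)

# Rubin's combining rule for the mean across the m completed datasets.
est = np.array([c.mean() for c in completed])
within = np.array([c.var(ddof=1) / c.size for c in completed])
total_var = within.mean() + (1 + 1 / m) * est.var(ddof=1)
print("MI estimate:", est.mean(), " std. error:", np.sqrt(total_var))
```

Drawing the parameters afresh for each imputation, rather than plugging in point estimates, is what makes the imputations reflect a distribution of values rather than a single guess.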



Journal ArticleDOI
TL;DR: In the following articles, as discussed in this paper, we get some personal insight into how statisticians do research, and we also see that research is an activity for all statisticians, not just those in academia.
Abstract: Editor's Note: In the May 2004 TAS, Hamada and Sitter provided advice for the statistics graduate student on doing research. We asked several prominent statistics researchers to discuss this article, and TAS readers were also invited to contribute discussion. In the following articles, we get some personal insight on how statisticians do research, and we also see that research is an activity for all statisticians, not just the ones in academia. - James Albert, Editor, The American Statistician

01 Jan 2004
TL;DR: A Bayesian hierarchical random effects regression model for serum creatinine, a critically important blood measurement for Fabry patients, is developed, using a historical data base compiled from medical records and patient registries, to aid in multiply imputing missing placebo data from a clinical trial for a new drug specifically developed to treat Fabry disease.
Abstract: Management of chronic diseases often involves monitoring one or more outcomes over time. Modeling long term disease progression in untreated patients can be useful for drug development as well as patient monitoring. The outcomes of interest in chronic diseases often have expected monotone or approximately monotone progression over time, possibly preceded by a period during which the outcome measurement is relatively constant. Many diseases have some treatment available for current patients, if only to relieve some symptoms or to slow progression, making disease modeling in untreated current patients difficult. Then historical data can often be useful for developing models for disease progression without treatment. The first part of this thesis presents a model for disease progression in untreated patients suffering from Fabry disease, an X-linked recessive disorder. We develop a Bayesian hierarchical random effects regression model for serum creatinine, a critically important blood measurement for Fabry patients, using a historical data base compiled from medical records and patient registries. We then use this model and results from the historical patients' data to aid in multiply imputing missing placebo data from a clinical trial for a new drug specifically developed to treat Fabry disease; data are missing when placebo patients drop out or switch to the new drug, and thus imputation using only the current untreated patients would rely solely on extrapolation. The second and third parts of this thesis expand upon two topics encountered in developing the disease progression models considered in Part One. Part Two outlines a general strategy for testing the correctness of software for fitting Bayesian models, capitalizing on properties of Bayesian posterior distributions. Software testing becomes a critically important enterprise with models fit using Markov Chain Monte Carlo methods. Part Three presents a method for constructing vague prior distributions for complex Bayesian models, based on the asymptotic normality of the likelihood function. Throughout, we illustrate concepts and techniques using the disease progression model developed in Part One.
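Part Two's software-testing idea can be illustrated in miniature: if the software draws correctly from the posterior, then across repeated simulations a parameter drawn from the prior should land at a Uniform(0, 1) quantile of its own posterior draws. The sketch below applies that check to a conjugate normal-mean model where the correct posterior is known; the general pattern, not this particular model, is the point, and the constants are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_reps, n_data, n_post = 500, 20, 1000
quantiles = []

for _ in range(n_reps):
    theta = rng.normal(0, 1)                     # draw parameter from prior
    data = rng.normal(theta, 1, size=n_data)     # draw data given parameter
    # "Software under test": posterior for a N(0,1) prior and N(theta,1) data.
    post_var = 1 / (1 + n_data)
    post_mean = post_var * data.sum()
    draws = rng.normal(post_mean, np.sqrt(post_var), n_post)
    quantiles.append((draws < theta).mean())     # posterior quantile of truth

# If the sampler is correct, these quantiles are approximately Uniform(0, 1);
# a lopsided histogram signals a bug in the fitting software.
print(np.histogram(quantiles, bins=10, range=(0, 1))[0])
```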

Journal ArticleDOI
TL;DR: This article describes the kinds of assumptions needed in the anthrax vaccine experiments, as acknowledged by Professor Lauritzen, and discusses the randomization-based frequentist approaches to causal inference of Fisher and Neyman.
Abstract: I want to begin by thanking the editorial board of Scandinavian Journal of Statistics for assembling this interesting package of materials on causal inference. Although I find much to agree with in the discussions of Aalen and Lauritzen, I also find points of disagreement or at least puzzlement. First, I agree with Professor Aalen about the crucial role statistics has to play in many areas of empirical investigation into causal effects, including medicine. Also, I agree on the criticality of substantive knowledge in this endeavour, as I hope I indicated in my very brief description of the kinds of assumptions needed in the anthrax vaccine experiments, as acknowledged by Professor Lauritzen. In fact, I have been personally involved in real applications (in psychology, medicine, economics, astrophysics, pharmacology, etc.) for four decades, and have only worked on statistical topics that were stimulated by real problems. Perhaps my article should have emphasized this more clearly to avoid the implicit criticism in Professor Aalen's final concluding paragraph, where he emphasizes the need for statistics to interact with substantive fields. Secondly, comments on modes of inference are relevant to both discussions. Although completely compatible, the randomization-based frequentist approaches to causal inference, due initially to Fisher (1926) and Neyman (1923), are distinctly different from the predictive Bayesian (model-based) approach (Rubin, 1978). The former base inferences solely on a model for the assignment mechanism, Pr(W|Y(0), Y(1), X), where the potential outcomes Y(0), Y(1) and covariates X are treated as fixed, possibly unknown, quantities without any distribution (or notions such as independence) assumed, and W indicates the treatment assignments. The assignment mechanism models what actions we take to try to learn about causal effects of the treatment on the potential outcomes at different values of the covariates. The Bayesian approach supplements the model for the assignment mechanism with a model for the 'science' Pr(Y(0), Y(1), X) = Pr(Y(0), Y(1)|X) Pr(X) – about which we hope to learn by comparing Y(1) with Y(0) at various values of X. With respect to Aalen's discussion, it is puzzling to me to read Aalen's comment that 'It is the impression of this discussant that the distinction between Bayesian and frequentist analysis is rapidly disappearing, at least in a conceptual sense.' Perhaps Aalen includes within 'frequentist analysis' all varieties of model-based, hierarchical random parameter approaches and excludes the classical randomization-based approaches to causal inference of Fisher and Neyman. If so, I still find that the conceptual differences between the approaches are important to understand for practice. The realization that so much can be accomplished from the randomization-based perspective alone is, in my view, responsible for the great advances in experimental design (e.g. as reflected in the classic texts by Cochran & Cox, 1950; Kempthorne, 1952; Cox, 1958) and more modest advances in the design of observational studies (e.g. Rosenbaum & Rubin, 1983; Rosenbaum, 1995; Rubin, 2002). With respect to Lauritzen's discussion, the critical distinction between what we can do to learn about causal effects through design, the assignment mechanism, and how the gods have aligned the world (the science) seems to be largely obfuscated in the graphical approach.
How does a graph represent the classical Fisher–Neyman perspective where the only random quantity is the assignment indicator, W, or represent classical designs such as split-plots with their levels of randomization? Evidently, missing arrows indicate both (a) the implications of randomization (actual or assumed) on conditional distributions and ...
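As a concrete instance of the randomization-based perspective described above, where the only random quantity is the assignment indicator W and the potential outcomes are fixed, here is a sketch of a Fisher randomization test of the sharp null hypothesis of no effect. The data, design, and test statistic are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
w = np.zeros(n, dtype=int)
w[rng.choice(n, n // 2, replace=False)] = 1      # completely randomized design
y = rng.normal(0, 1, n) + 0.8 * w                # observed outcomes

# Under the sharp null Y(0) = Y(1), the outcomes are fixed; only W is random,
# so the null distribution of the statistic comes from re-randomizing W.
t_obs = y[w == 1].mean() - y[w == 0].mean()
t_null = []
for _ in range(10_000):
    w_star = np.zeros(n, dtype=int)
    w_star[rng.choice(n, n // 2, replace=False)] = 1
    t_null.append(y[w_star == 1].mean() - y[w_star == 0].mean())

p_value = np.mean(np.abs(t_null) >= abs(t_obs))
print("randomization p-value:", p_value)
```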