# Papers in "Statistical Science", 2007

---

TL;DR: This discussion aims to complement the presentation of the authors by elaborating on the view from the vantage point of semi-parametric theory, focusing on the assumptions embedded in the statistical models leading to different “types” of estimators rather than on the forms of the estimators themselves.

Abstract: We congratulate Drs. Kang and Schafer (KS henceforth) for a careful and thought-provoking contribution to the literature regarding the so-called “double robustness” property, a topic that still engenders some confusion and disagreement. The authors’ approach of focusing on the simplest situation of estimation of the population mean μ of a response y when y is not observed on all subjects according to a missing at random (MAR) mechanism (equivalently, estimation of the mean of a potential outcome in a causal model under the assumption of no unmeasured confounders) is commendable, as the fundamental issues can be explored without the distractions of the messier notation and considerations required in more complicated settings. Indeed, as the article demonstrates, this simple setting is sufficient to highlight a number of key points.
As noted eloquently by Molenberghs (2005), in regard to how such missing data/causal inference problems are best addressed, two “schools” may be identified: the “likelihood-oriented” school and the “weighting-based” school. As we have emphasized previously (Davidian, Tsiatis and Leon, 2005), we prefer to view inference from the vantage point of semi-parametric theory, focusing on the assumptions embedded in the statistical models leading to different “types” of estimators (i.e., “likelihood-oriented” or “weighting-based”) rather than on the forms of the estimators themselves. In this discussion, we hope to complement the presentation of the authors by elaborating on this point of view.
Throughout, we use the same notation as in the paper.

906 citations

---

ETH Zurich

TL;DR: A statistical perspective on boosting is presented, with special emphasis on estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis.

Abstract: We present a statistical perspective on boosting. Special emphasis is given to estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis. Concepts of degrees of freedom and corresponding Akaike or Bayesian information criteria, particularly useful for regularization and variable selection in high-dimensional covariate spaces, are discussed as well. The practical aspects of boosting procedures for fitting statistical models are illustrated by means of the dedicated open-source software package mboost. This package implements functions which can be used for model fitting, prediction and variable selection. It is flexible, allowing for the implementation of new boosting algorithms optimizing user-specified loss functions.
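The componentwise L2-boosting that mboost implements for linear models can be sketched in a few lines. This is our own minimal illustration of the idea, not the package's API; all names and data are invented:

```python
import numpy as np

def l2_boost(X, y, steps=100, nu=0.1):
    """Componentwise L2-boosting: at each step, refit the single
    predictor that best reduces the residual sum of squares."""
    n, p = X.shape
    beta = np.zeros(p)
    intercept = y.mean()
    resid = y - intercept
    for _ in range(steps):
        # least-squares coefficient of each predictor on the residuals
        coefs = X.T @ resid / (X ** 2).sum(axis=0)
        rss = ((resid[:, None] - X * coefs) ** 2).sum(axis=0)
        j = rss.argmin()                  # best-fitting component
        beta[j] += nu * coefs[j]          # shrunken (step-size nu) update
        resid -= nu * coefs[j] * X[:, j]
    return intercept, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.1, size=200)
b0, beta = l2_boost(X, y, steps=500)
```

With a small step size and enough iterations the fit approaches the least-squares solution while visiting informative predictors first, which is why early stopping acts as regularization and variable selection.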

820 citations

---

TL;DR: Doubly robust (DR) procedures apply both types of model simultaneously and produce a consistent estimate of the parameter if either of the two models has been correctly specified. The study demonstrates, however, that in at least some settings two wrong models are not better than one.

Abstract: When outcomes are missing for reasons beyond an investigator's control, there are two different ways to adjust a parameter estimate for covariates that may be related both to the outcome and to missingness. One approach is to model the relationships between the covariates and the outcome and use those relationships to predict the missing values. Another is to model the probabilities of missingness given the covariates and incorporate them into a weighted or stratified estimate. Doubly robust (DR) procedures apply both types of model simultaneously and produce a consistent estimate of the parameter if either of the two models has been correctly specified. In this article, we show that DR estimates can be constructed in many ways. We compare the performance of various DR and non-DR estimates of a population mean in a simulated example where both models are incorrect but neither is grossly misspecified. Methods that use inverse probabilities as weights, whether they are DR or not, are sensitive to misspecification of the propensity model when some estimated propensities are small. Many DR methods perform better than simple inverse-probability weighting. None of the DR methods we tried, however, improved upon the performance of simple regression-based prediction of the missing values. This study does not represent every missing-data problem that will arise in practice. But it does demonstrate that, in at least some settings, two wrong models are not better than one.

439 citations

---

TL;DR: In this article, the authors discuss, in the context of several ongoing public health and social surveys, how to develop general families of multilevel probability models that yield reasonable Bayesian inferences.

Abstract: The general principles of Bayesian data analysis imply that models for survey responses should be constructed conditional on all variables that affect the probability of inclusion and nonresponse, which are also the variables used in survey weighting and clustering. However, such models can quickly become very complicated, with potentially thousands of poststratification cells. It is then a challenge to develop general families of multilevel probability models that yield reasonable Bayesian inferences. We discuss these issues in the context of several ongoing public health and social surveys. This work is currently open-ended, and we conclude with thoughts on how research could proceed to solve these problems.
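The poststratification step underlying these models can be sketched in miniature. This is a hypothetical example with four cells; in the paper's setting the cell estimates would come from a multilevel model rather than raw averages, and there may be thousands of cells:

```python
import numpy as np

# hypothetical population counts for four poststratification cells
# (e.g. age x sex categories)
pop_counts = np.array([4000, 6000, 3000, 7000])
cell_means = np.array([0.52, 0.48, 0.61, 0.43])  # estimated mean response per cell

# poststratified estimate: population-weighted average of the cell estimates
post_stratified = (pop_counts * cell_means).sum() / pop_counts.sum()
```

The challenge the abstract describes is precisely that `cell_means` must be estimated stably when many cells contain few or no respondents, which is where multilevel modeling enters.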

348 citations

---

TL;DR: The authors revisit principal components as a reductive method in regression, develop several model-based extensions, and end with descriptions of general approaches to model-based and model-free dimension reduction in regression. They argue that the role for principal components and related methodology may be broader than previously seen and that the common practice of conditioning on observed values of the predictors may unnecessarily limit the choice of regression methodology.

Abstract: Beginning with a discussion of R. A. Fisher’s early written remarks that relate to dimension reduction, this article revisits principal components as a reductive method in regression, develops several model-based extensions and ends with descriptions of general approaches to model-based and model-free dimension reduction in regression. It is argued that the role for principal components and related methodology may be broader than previously seen and that the common practice of conditioning on observed values of the predictors may unnecessarily limit the choice of regression methodology.

299 citations

---

TL;DR: The authors outline the complexities involved in synthesizing regression slopes, describe existing methods of analysis, and present a multivariate generalized least squares approach to the synthesis of regression slopes.

Abstract: Research on methods of meta-analysis (the synthesis of related study results) has dealt with many simple study indices, but less attention has been paid to the issue of summarizing regression slopes. In part this is because of the many complications that arise when real sets of regression models are accumulated. We outline the complexities involved in synthesizing slopes, describe existing methods of analysis and present a multivariate generalized least squares approach to the synthesis of regression slopes.
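The generalized least squares step can be sketched as follows. This is a minimal illustration with invented numbers; in the simplest case, where every study estimates one common slope, it collapses to the familiar inverse-variance-weighted pooled estimate:

```python
import numpy as np

# hypothetical: three studies report slope estimates for the same predictor,
# with sampling variances assumed known
b = np.array([0.40, 0.55, 0.35])        # estimated slopes
V = np.diag([0.010, 0.020, 0.005])      # covariance matrix of the estimates
W = np.ones((3, 1))                     # design: all estimate one common slope

Vinv = np.linalg.inv(V)
A = W.T @ Vinv @ W
beta = np.linalg.solve(A, W.T @ Vinv @ b)   # GLS pooled slope
se = np.sqrt(np.linalg.inv(A))[0, 0]        # its standard error
```

The multivariate machinery matters when `V` has off-diagonal blocks (correlated slopes within studies) and `W` maps heterogeneous study-level coefficients onto a common parameter vector.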

281 citations

---

TL;DR: In this paper, the authors describe centering and noncentering methodology as complementary techniques for use in parametrization of broad classes of hierarchical models, with a view to the construction of effective MCMC algorithms for exploring posterior distributions from these models.

Abstract: In this paper, we describe centering and noncentering methodology as complementary techniques for use in parametrization of broad classes of hierarchical models, with a view to the construction of effective MCMC algorithms for exploring posterior distributions from these models. We give a clear qualitative understanding as to when centering and noncentering work well, and introduce theory concerning the convergence time complexity of Gibbs samplers using centered and noncentered parametrizations. We give general recipes for the construction of noncentered parametrizations, including an auxiliary variable technique called the state-space expansion technique. We also describe partially noncentered methods, and demonstrate their use in constructing robust Gibbs sampler algorithms whose convergence properties are not overly sensitive to the data.
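The centered/noncentered distinction can be made concrete for the simplest normal hierarchical model. The sketch below is our own, not the authors' code; it shows that the two parametrizations induce the same joint distribution and differ only in how the latent variables are expressed:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, tau, sigma, n = 3.0, 2.0, 1.0, 100_000

# centered parametrization: theta_i ~ N(mu, tau^2), y_i ~ N(theta_i, sigma^2)
theta_c = rng.normal(mu, tau, size=n)
y_c = rng.normal(theta_c, sigma)

# noncentered parametrization: theta_i = mu + tau * eps_i with eps_i ~ N(0, 1)
eps = rng.normal(size=n)
theta_nc = mu + tau * eps
y_nc = rng.normal(theta_nc, sigma)
# both generate the same law for (theta, y); what differs is how strongly
# the latent variables depend on (mu, tau) a posteriori, which is what
# drives the convergence of centered vs. noncentered Gibbs samplers
```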

248 citations

---

TL;DR: The authors provide a review of doubly robust estimators of the mean of a response Y when Y is missing at random (MAR) (but not completely at random) and of the average treatment effect in an observational study under the assumption of strong ignorability.

Abstract: We thank the editor Ed George for the opportunity to discuss the paper by Kang and Schafer. The authors’ paper provides a review of double-robust (equivalently, double-protected) estimators of (i) the mean μ = E(Y) of a response Y when Y is missing at random (MAR) (but not completely at random) and of (ii) the average treatment effect in an observational study under the assumption of strong ignorability. In our discussion we will depart from the notation in Kang and Schafer (throughout, K&S) and use capital letters to denote random variables and lowercase letters to denote their possible values. In the missing-data setting (i), one observes n i.i.d. copies of O = (T, X, TY), where X is a vector of always observed covariates and T is the indicator that the response Y is observed. An estimator of μ is double-robust (throughout, DR) if it remains consistent and asymptotically normal (throughout, CAN) when either (but not necessarily both) a model for the propensity score π(X) ≡ P(T = 1|X) = P(T = 1|X, Y) or a model for the conditional mean m(X) ≡ E(Y|X) is correctly specified.
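For concreteness, the DR property just defined can be illustrated with the standard augmented inverse-probability-weighted (AIPW) estimator. This is a hedged sketch on simulated data, not the discussants' own code; the data-generating model is ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
X = rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-(0.5 + X)))    # true propensity P(T=1 | X)
T = rng.binomial(1, pi)                  # response indicator (MAR given X)
Y = 1.0 + 2.0 * X + rng.normal(size=n)   # E[Y | X] = 1 + 2X, so mu = E[Y] = 1

def aipw(Y, T, pi_hat, m_hat):
    """Augmented IPW estimate of mu = E(Y): consistent if either the
    propensity model pi_hat or the outcome model m_hat is correct."""
    return np.mean(T * Y / pi_hat - (T - pi_hat) / pi_hat * m_hat)

m_good = 1.0 + 2.0 * X                 # correctly specified m(X)
m_bad = np.full(n, Y[T == 1].mean())   # badly misspecified m(X) (complete-case mean)
pi_bad = np.full(n, T.mean())          # badly misspecified pi(X) (constant)

mu_hat_1 = aipw(Y, T, pi, m_good)      # both models right
mu_hat_2 = aipw(Y, T, pi_bad, m_good)  # wrong propensity, right outcome model
mu_hat_3 = aipw(Y, T, pi, m_bad)       # right propensity, wrong outcome model
# all three recover mu = 1; with BOTH models wrong, consistency is lost
```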

241 citations

---

TL;DR: Hospital profiling involves comparing a health care provider's structure, processes of care, or outcomes to a standard, often in the form of a report card; despite the ubiquity of such ratings in contemporary American culture, report cards are a relatively recent phenomenon in health care.

Abstract: Hospital profiling involves a comparison of a health care provider’s structure, processes of care, or outcomes to a standard, often in the form of a report card. Given the ubiquity of report cards and similar consumer ratings in contemporary American culture, it is notable that these are a relatively recent phenomenon in health care. Prior to the 1986 release of Medicare hospital outcome data, little such information was publicly available. We review the historical evolution of hospital profiling with special emphasis on outcomes; present a detailed history of cardiac surgery report cards, the paradigm for modern provider profiling; discuss the potential unintended negative consequences of public report cards; and describe various statistical methodologies for quantifying the relative performance of cardiac surgery programs. Outstanding statistical issues are also described.

138 citations

---

TL;DR: Gaussian graphical model selection can be performed by multiple testing of hypotheses about vanishing (partial) correlation coefficients associated with individual edges that are absent from the graph; the authors demonstrate how this approach allows one to perform model selection while controlling error rates for incorrect edge inclusion.

Abstract: Graphical models provide a framework for exploration of multivariate dependence patterns. The connection between graph and statistical model is made by identifying the vertices of the graph with the observed variables and translating the pattern of edges in the graph into a pattern of conditional independences that is imposed on the variables’ joint distribution. Focusing on Gaussian models, we review classical graphical models. For these models the defining conditional independences are equivalent to vanishing of certain (partial) correlation coefficients associated with individual edges that are absent from the graph. Hence, Gaussian graphical model selection can be performed by multiple testing of hypotheses about vanishing (partial) correlation coefficients. We show and exemplify how this approach allows one to perform model selection while controlling error rates for incorrect edge inclusion.
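The edge-wise testing idea can be sketched roughly as follows: estimate partial correlations by scaling the inverse sample covariance, apply Fisher's z-transform, and test each candidate edge with a multiplicity correction. This is a minimal illustration on a simulated chain graph, not the authors' exact procedure (which treats error-rate control more carefully):

```python
import numpy as np
from math import atanh, erfc, sqrt

def partial_correlations(S):
    """Partial correlations via the scaled inverse covariance (precision)
    matrix; off-diagonal zeros correspond to absent edges."""
    K = np.linalg.inv(S)
    d = np.sqrt(np.diag(K))
    return -K / np.outer(d, d)

rng = np.random.default_rng(3)
n, p = 2000, 4
# true graph: the chain 0 - 1 - 2 - 3 (tridiagonal precision matrix)
K_true = np.eye(p) + np.diag([-0.4] * (p - 1), 1) + np.diag([-0.4] * (p - 1), -1)
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(K_true), size=n)

P = partial_correlations(np.cov(X, rowvar=False))
edges = []
for i in range(p):
    for j in range(i + 1, p):
        z = sqrt(n - p - 1) * atanh(P[i, j])  # Fisher z statistic
        pval = erfc(abs(z) / sqrt(2.0))       # two-sided normal p-value
        if pval < 0.05 / 6:                   # Bonferroni over 6 candidate edges
            edges.append((i, j))
```

With this setup the three true chain edges are recovered while the absent edges are (with high probability) rejected.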

137 citations

---

TL;DR: This work describes a simple “building block” approach to formulating discrete-time models, and shows how to estimate the parameters of such models from time series of data, and how to quantify uncertainty in those estimates and in numbers of individuals of different types in populations, using computer-intensive Bayesian methods.

Abstract: Increasing pressures on the environment are generating an ever-increasing need to manage animal and plant populations sustainably, and to protect and rebuild endangered populations. Effective management requires reliable mathematical models, so that the effects of management action can be predicted, and the uncertainty in these predictions quantified. These models must be able to predict the response of populations to anthropogenic change, while handling the major sources of uncertainty. We describe a simple “building block” approach to formulating discrete-time models. We show how to estimate the parameters of such models from time series of data, and how to quantify uncertainty in those estimates and in numbers of individuals of different types in populations, using computer-intensive Bayesian methods. We also discuss advantages and pitfalls of the approach, and give an example using the British grey seal population.
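The deterministic skeleton of such a "building block" model can be sketched with a small age-structured projection matrix. This is a hypothetical two-class example of ours, not the seal model itself; the paper's contribution lies in fitting such models to time series with Bayesian methods:

```python
import numpy as np

# hypothetical two-age-class model: pups and adults
phi_pup, phi_adult, fecundity = 0.6, 0.9, 0.5

# projection matrix assembled from survival and birth "blocks"
L = np.array([[0.0,     fecundity * phi_adult],  # pups born to surviving adults
              [phi_pup, phi_adult]])             # survival into / within adulthood

n = np.array([100.0, 400.0])   # initial pups, adults
for _ in range(50):
    n = L @ n                  # deterministic skeleton, one year per step

growth_rate = np.abs(np.linalg.eigvals(L)).max()  # long-run annual growth factor
```

Stochastic versions add process noise to each block and observation error to the counts, which is what makes the Bayesian state-space treatment necessary.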

---

TL;DR: In this article, a review of the history of the idea of maximum likelihood is presented, from well before Fisher's work to the time of Lucien Le Cam's dissertation.

Abstract: At a superficial level, the idea of maximum likelihood must be prehistoric: early hunters and gatherers may not have used the words "method of maximum likelihood" to describe their choice of where and how to hunt and gather, but it is hard to believe they would have been surprised if their method had been described in those terms. It seems a simple, even unassailable idea: Who would rise to argue in favor of a method of minimum likelihood, or even mediocre likelihood? And yet the mathematical history of the topic shows this "simple idea" is really anything but simple. Joseph Louis Lagrange, Daniel Bernoulli, Leonhard Euler, Pierre-Simon Laplace and Carl Friedrich Gauss are only some of those who explored the topic, not always in ways we would sanction today. In this article, that history is reviewed from well before Fisher to the time of Lucien Le Cam's dissertation. In the process Fisher's unpublished 1930 characterization of conditions for the consistency and efficiency of maximum likelihood estimates is presented, and the mathematical basis of his three proofs discussed. In particular, Fisher's derivation of the information inequality is seen to stem from his work on the analysis of variance, and his later approach via estimating functions was derived from Euler's Relation for homogeneous functions. The reaction to Fisher's work is reviewed, and some lessons drawn.

---

TL;DR: The utility of principal stratification, stratification on a single potential auxiliary variable, and stratification on an observed auxiliary variable is evaluated for decision making and understanding causal processes.

Abstract: It has recently become popular to define treatment effects for subsets of the target population characterized by variables not observable at the time a treatment decision is made. Characterizing and estimating such treatment effects is tricky; the most popular but naive approach inappropriately adjusts for variables affected by treatment and so is biased. We consider several appropriate ways to formalize the effects: principal stratification, stratification on a single potential auxiliary variable, stratification on an observed auxiliary variable and stratification on expected levels of auxiliary variables. We then outline identifying assumptions for each type of estimand. We evaluate the utility of these estimands and estimation procedures for decision making and understanding causal processes, contrasting them with the concepts of direct and indirect effects. We motivate our development with examples from nephrology and cancer screening, and use simulated data and real data on cancer screening to illustrate the estimation methods.

---

TL;DR: This work systematically examines the propensity score (PS) and outcome regression (OR) approaches and doubly robust (DR) estimation, all discussed by KS, to clarify and improve understanding of the three interrelated subjects.

Abstract: We congratulate Kang and Schafer (KS) on their excellent article comparing various estimators of a population mean in the presence of missing data, and thank the Editor for organizing the discussion. In this communication, we systematically examine the propensity score (PS) and the outcome regression (OR) approaches and doubly robust (DR) estimation, which are all discussed by KS. The aim is to clarify and better our understanding of the three interrelated subjects. Sections 1 and 2 contain the following main points, respectively.

---

York University

TL;DR: These data provide the opportunity to ask how modern methods of statistics, graphics, thematic cartography and geovisualization can shed further light on Guerry's challenge for multivariate spatial statistics.

Abstract: André-Michel Guerry’s (1833) Essai sur la Statistique Morale de la France was one of the foundation studies of modern social science. Guerry assembled data on crimes, suicides, literacy and other “moral statistics,” and used tables and maps to analyze a variety of social issues in perhaps the first comprehensive study relating such variables. Indeed, the Essai may be considered the book that launched modern empirical social science, for the questions raised and the methods Guerry developed to try to answer them. Guerry’s data consist of a large number of variables recorded for each of the departments of France in the 1820–1830s and therefore involve both multivariate and geographical aspects. In addition to historical interest, these data provide the opportunity to ask how modern methods of statistics, graphics, thematic cartography and geovisualization can shed further light on the questions he raised. We present a variety of methods attempting to address Guerry’s challenge for multivariate spatial statistics.

---

TL;DR: The authors investigate Bayesian methods for model checking, concentrating on objective Bayesian methods in which careful specification of an informative prior distribution is avoided; different proposals are investigated and critically compared.

Abstract: Hierarchical models are increasingly used in many applications. Along with this increased use comes a desire to investigate whether the model is compatible with the observed data. Bayesian methods are well suited to eliminate the many (nuisance) parameters in these complicated models; in this paper we investigate Bayesian methods for model checking. Since we contemplate model checking as a preliminary, exploratory analysis, we concentrate on objective Bayesian methods in which careful specification of an informative prior distribution is avoided. Numerous examples are given and different proposals are investigated and critically compared.

---

RAND Corporation

TL;DR: The apparently strong performance of OLS and the finding that no method outperformed OLS ran counter to the discussants' intuition and experience with propensity score weighting and doubly robust (DR) estimators.

Abstract: This article is an excellent introduction to doubly robust methods and we congratulate the authors for their thoroughness in bringing together the wide array of methods from different traditions that all share the property of being doubly robust. Statisticians at RAND have been making extensive use of propensity score weighting in education (McCaffrey and Hamilton, 2007), policing and criminal justice (Ridgeway, 2006), drug treatment evaluation (Morral et al., 2006), and military workforce issues (Harrell, Lim, Castañeda and Golinelli, 2004). More recently, we have been adopting doubly robust (DR) methods in these applications believing that we could achieve further bias and variance reduction. Initially, this article made us second-guess our decision. The apparently strong performance of OLS and the authors' finding that no method outperformed OLS ran counter to our intuition and experience with propensity score weighting and DR estimators. We posited two potential explanations for this. First, we suspected that the high variance reported by the authors when using propensity score weights could result from their use of standard logistic regression. Second, stronger interaction effects in the outcome regression model might favor the DR approach.

---

TL;DR: This paper discusses two forms of posterior model check, one based on cross-validation and one based on replication of new groups in a hierarchical model; the discussants think both checks are good ideas that can become even more effective when understood in the context of posterior predictive checking.

Abstract: Bayarri and Castellanos (BC) have written an interesting paper discussing two forms of posterior model check, one based on cross-validation and one based on replication of new groups in a hierarchical model. We think both these checks are good ideas and can become even more effective when understood in the context of posterior predictive checking. For the purpose of discussion, however, it is most interesting to focus on the areas where we disagree with BC:

---

TL;DR: This paper overviews the fundamental statistical foundations for predictive modeling and the general questions associated with unlabeled data, highlighting the relevance of sampling design and prior specification, illustrated with a series of central illustrative examples and two substantial real data analyses.

Abstract: The incorporation of unlabeled data in regression and classification analysis is an increasing focus of the applied statistics and machine learning literatures, with a number of recent examples demonstrating the potential for unlabeled data to contribute to improved predictive accuracy. The statistical basis for this semisupervised analysis does not appear to have been well delineated; as a result, the underlying theory and rationale may be underappreciated, especially by nonstatisticians. There is also room for statisticians to become more fully engaged in the vigorous research in this important area of intersection of the statistical and computer sciences. Much of the theoretical work in the literature has focused, for example, on geometric and structural properties of the unlabeled data in the context of particular algorithms, rather than probabilistic and statistical questions. This paper overviews the fundamental statistical foundations for predictive modeling and the general questions associated with unlabeled data, highlighting the relevance of venerable concepts of sampling design and prior specification. This theory, illustrated with a series of central illustrative examples and two substantial real data analyses, shows precisely when, why and how unlabeled data matter.

---

TL;DR: The article presents a sampling of the diverse areas in which majorization has been found to be useful in the past 25 years.

Abstract: The appearance of Marshall and Olkin’s 1979 book on inequalities with special emphasis on majorization generated a surge of interest in potential applications of majorization and Schur convexity in a broad spectrum of fields. After 25 years this continues to be the case. The present article presents a sampling of the diverse areas in which majorization has been found to be useful in the past 25 years.
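For reference, the ordering at issue: a vector x is majorized by y, written x ≺ y, when (with components sorted in decreasing order, x₍₁₎ ≥ … ≥ x₍ₙ₎)

```latex
x \prec y \quad\Longleftrightarrow\quad
\sum_{i=1}^{k} x_{[i]} \le \sum_{i=1}^{k} y_{[i]} \;\;(k = 1, \dots, n-1)
\quad\text{and}\quad
\sum_{i=1}^{n} x_{[i]} = \sum_{i=1}^{n} y_{[i]},
```

and a function φ is Schur-convex when x ≺ y implies φ(x) ≤ φ(y), which is what links majorization to the inequalities surveyed in Marshall and Olkin's book.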

---

TL;DR: This article extends Bayarri and Berger's (1999) proposal for model evaluation using “partial posterior” p values to the evaluation of second-stage model assumptions in hierarchical models, and defines the reference distribution of a test statistic t by the partial posterior distribution.

Abstract: This article extends Bayarri and Berger’s (1999) proposal for model evaluation using “partial posterior” p values to the evaluation of second-stage model assumptions in hierarchical models. Applications focus on normal-normal hierarchical models, although the final example involves an application to a beta-binomial model in which the distribution of the test statistic is assumed to be approximately normal. The notion of using partial posterior p values is potentially appealing because it avoids what the authors refer to as “double use” of the data, that is, use of the data for both fitting model parameters and evaluating model fit. In classical terms, this phenomenon is synonymous with masking and is widely known to reduce the power of test statistics for diagnosing model inadequacy. In the present context, masking is avoided by defining the reference distribution of a test statistic t by the partial posterior distribution, defined as the posterior obtained after removing the contribution of the observed statistic tobs from the likelihood, π(θ | x \ tobs) ∝ f(x | θ)π(θ)/f(tobs | θ).

---

TL;DR: The discussants elaborate on the connections between L2-boosting of a linear model and infinitesimal forward stagewise linear regression, and take the authors to task on their definition of degrees of freedom.

Abstract: We congratulate the authors (hereafter BH) for an interesting take on the boosting technology, and for developing a modular computational environment in R for exploring their models. Their use of low-degree-of-freedom smoothing splines as a base learner provides an interesting approach to adaptive additive modeling. The notion of “Twin Boosting” is interesting as well; besides the adaptive lasso, we have seen the idea applied more directly for the lasso and Dantzig selector (James, Radchenko and Lv, 2007). In this discussion we elaborate on the connections between L2-boosting of a linear model and infinitesimal forward stagewise linear regression. We then take the authors to task on their definition of degrees of freedom.

---

TL;DR: In the ideal samples of survey sampling textbooks, weights are the inverses of the inclusion probabilities for the units. But nonresponse and undercoverage occur, and survey statisticians try to compensate for the resulting bias by adjusting the sampling weights.

Abstract: In the ideal samples of survey sampling textbooks, weights are the inverses of the inclusion probabilities for the units. But nonresponse and undercoverage occur, and survey statisticians try to compensate for the resulting bias by adjusting the sampling weights. There has been much debate about when and whether weights should be used in analyses, and how they should be constructed. Professor Gelman deserves thanks for clarifying the discussion about weights and for raising interesting issues and questions. If we use weights in estimation, what would we like them to accomplish? Here are some desirable properties:
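The textbook starting point can be sketched numerically: with unequal inclusion probabilities, the design-weighted (Hájek-style) mean corrects the bias of the plain sample mean. A hypothetical four-unit example:

```python
import numpy as np

# hypothetical sample of four units drawn with unequal inclusion probabilities
y = np.array([10.0, 12.0, 30.0, 28.0])      # observed responses
incl_prob = np.array([0.5, 0.5, 0.1, 0.1])  # known inclusion probabilities
w = 1.0 / incl_prob                         # textbook design weights

weighted_mean = (w * y).sum() / w.sum()     # design-weighted estimate
plain_mean = y.mean()                       # ignores the design: biased here
```

The debate the abstract refers to begins once `incl_prob` must itself be adjusted for nonresponse and undercoverage rather than read off the sampling design.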

---

TL;DR: The author recounts encountering this sort of statement in the documentation of opinion poll data being analyzed in political science: “A weight is assigned to each sample record, and MUST be used for all tabulations.”

Abstract: I keep encountering this sort of statement in the documentation of opinion poll data we analyze in political science: “A weight is assigned to each sample record, and MUST be used for all tabulations.” (This particular version was in the codebook for the 1988 CBS News/New York Times Poll; as you can see, this is a problem that has been bugging me for a long time.) Computing weighted averages is fine, but weighted regression is a little more tricky—I do not really know what a weighted logistic regression likelihood, for example, is supposed to represent. When constructing the weighting for the New York City Social Indicators Survey (SIS), it quickly became clear that we had to make many arbitrary choices about inclusion and smoothing of weighting variables, and we could not find any good general guidelines. In another project, we wanted to estimate state-level public opinion from national polls. If our surveys were simple random samples, this would be basic Bayes hierarchical modeling (with 50 groups, possibly linked using state-level predictors). Actually, though, the surveys suffer differential nonresponse (lower response by men, younger people, ethnic minorities, etc.), as signaled to the user (such as myself) via a vector of weights.

---

TL;DR: The first biography of De Moivre, on which almost all subsequent ones have since relied, was written in French by Matthew Maty and published in 1755 in the Journal britannique.

Abstract: November 27, 2004, marked the 250th anniversary of the death of Abraham De Moivre, best known in statistical circles for his famous large-sample approximation to the binomial distribution, whose generalization is now referred to as the Central Limit Theorem. De Moivre was one of the great pioneers of classical probability theory. He also made seminal contributions in analytic geometry, complex analysis and the theory of annuities. The first biography of De Moivre, on which almost all subsequent ones have since relied, was written in French by Matthew Maty. It was published in 1755 in the Journal britannique. The authors provide here, for the first time, a complete translation into English of Maty's biography of De Moivre. New material, much of it taken from modern sources, is given in footnotes, along with numerous annotations designed to provide additional clarity to Maty's biography for contemporary readers.

---

TL;DR: The authors provide a well-written and up-to-date overview of boosting that originated with the seminal algorithms of Freund and Schapire, and display its potential with extensions from classification to least squares, exponential family models, survival analysis, and base-learners other than trees.

Abstract: The authors are doing the readers of Statistical Science a true service with a well-written and up-to-date overview of boosting that originated with the seminal algorithms of Freund and Schapire. Equally, we are grateful for high-level software that will permit a larger readership to experiment with, or simply apply, boosting-inspired model fitting. The authors show us a world of methodology that illustrates how a fundamental innovation can penetrate every nook and cranny of statistical thinking and practice. They introduce the reader to one particular interpretation of boosting and then give a display of its potential with extensions from classification (where it all started) to least squares, exponential family models, survival analysis, to base-learners other than trees such as smoothing splines, to degrees of freedom and regularization, and to fascinating recent work in model selection. The uninitiated reader will find that the authors did a nice job of presenting a certain coherent and useful interpretation of boosting. The other reader, though, who has watched the business of boosting for a while, may have quibbles with the authors over details of the historic record and, more importantly, over their optimism about the current state of theoretical knowledge. In fact, as much as “the statistical view” has proven fruitful, it has also resulted in some ideas about why boosting works that may be misconceived, and in some recommendations that may be misguided.

---

ETH Zurich

TL;DR: Rejoinder to “Boosting Algorithms: Regularization, Prediction and Model Fitting” [arXiv:0804.2752].

Abstract: We are grateful that Hastie points out the connection to degrees of freedom for LARS which leads to another—and often better—definition of degrees of freedom for boosting in generalized linear models. As Hastie writes and as we said in the paper, our formula for degrees of freedom is only an approximation: the cost of searching, for example, for the best variable in componentwise linear least squares or componentwise smoothing splines, is ignored. Hence, our approximation formula will tend to understate the actual degrees of freedom.

---

TL;DR: Research on meta-analysis, and particularly multivariate meta-analysis, has been greatly influenced by the work of Ingram Olkin; this paper documents Olkin's contributions by way of citation counts and outlines several areas of contribution by Olkin and his academic descendants.

Abstract: The research on meta-analysis and particularly multivariate meta-analysis has been greatly influenced by the work of Ingram Olkin. This paper documents Olkin’s contributions by way of citation counts and outlines several areas of contribution by Olkin and his academic descendants. An academic family tree is provided.