
Showing papers in "The American Statistician in 2004"


Journal ArticleDOI
TL;DR: The principle behind MM algorithms is explained, some methods for constructing them are suggested, and some of their attractive features are discussed.
Abstract: Most problems in frequentist statistics involve optimization of a function such as a likelihood or a sum of squares. EM algorithms are among the most effective algorithms for maximum likelihood estimation because they consistently drive the likelihood uphill by maximizing a simple surrogate function for the log-likelihood. Iterative optimization of a surrogate function as exemplified by an EM algorithm does not necessarily require missing data. Indeed, every EM algorithm is a special case of the more general class of MM optimization algorithms, which typically exploit convexity rather than missing data in majorizing or minorizing an objective function. In our opinion, MM algorithms deserve to be part of the standard toolkit of professional statisticians. This article explains the principle behind MM algorithms, suggests some methods for constructing them, and discusses some of their attractive features. We include numerous examples throughout the article to illustrate the concepts described. In addition t...
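To make the majorization idea concrete, here is a minimal sketch (our own illustration, not taken from the article) of an MM iteration for the sample median: each absolute deviation is majorized by a quadratic at the current iterate, so every MM step reduces to a weighted mean. The function name mm_median and the convergence settings are ours.

```r
# Minimal MM sketch: minimize sum(|x - m|) by majorizing each |x - m| at the
# current iterate with a quadratic, turning every update into a weighted mean.
mm_median <- function(x, m = mean(x), tol = 1e-8, max_iter = 500) {
  for (iter in seq_len(max_iter)) {
    w <- 1 / pmax(abs(x - m), 1e-12)   # weights from the current iterate
    m_new <- sum(w * x) / sum(w)       # minimizer of the quadratic surrogate
    if (abs(m_new - m) < tol) break
    m <- m_new
  }
  m
}

set.seed(1)
x <- rexp(200)
c(mm = mm_median(x), sample_median = median(x))
```

Each surrogate minimization drives the objective downhill, mirroring the way an EM step drives the log-likelihood uphill.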

1,756 citations


Journal ArticleDOI
TL;DR: The authors used bootstrap resampling in conjunction with automated variable selection methods to develop parsimonious prediction models using data on patients admitted to hospital with a heart attack, and demonstrated that selecting those variables that were identified as independent predictors of mortality in at least 60% of the bootstrap samples resulted in a parsimonious model with excellent predictive ability.
Abstract: Researchers frequently use automated model selection methods such as backwards elimination to identify variables that are independent predictors of an outcome under consideration. We propose using bootstrap resampling in conjunction with automated variable selection methods to develop parsimonious prediction models. Using data on patients admitted to hospital with a heart attack, we demonstrate that selecting those variables that were identified as independent predictors of mortality in at least 60% of the bootstrap samples resulted in a parsimonious model with excellent predictive ability.
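A rough sketch of the procedure described above, run on simulated data (the heart-attack dataset is not reproduced here). The 60% retention threshold comes from the abstract; the predictor names, sample size, and use of AIC-based backward elimination via step() are our assumptions.

```r
# Hedged sketch of bootstrap-based variable selection: resample the data,
# run backwards elimination on each bootstrap sample, and tabulate how often
# each candidate variable is retained.
set.seed(2)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(dat$x1 - 0.8 * dat$x2))

B <- 100
vars <- c("x1", "x2", "x3", "x4")
selected <- matrix(0, nrow = B, ncol = length(vars), dimnames = list(NULL, vars))
for (b in seq_len(B)) {
  boot <- dat[sample(nrow(dat), replace = TRUE), ]
  full <- glm(y ~ x1 + x2 + x3 + x4, data = boot, family = binomial)
  fit  <- step(full, direction = "backward", trace = 0)   # backwards elimination
  selected[b, vars %in% attr(terms(fit), "term.labels")] <- 1
}
colMeans(selected)   # keep variables selected in at least 60% of the samples
```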

515 citations


Journal ArticleDOI
TL;DR: Crawley's book as mentioned in this paper is intended as a textbook for students and statistical novices, but it is in fact much more suitable as a reference book for experienced statisticians, a vehicle for learning the S statistical computing language, or a resource for statistics instructors.
Abstract: Crawley's unusual book is intended as a textbook for students and statistical novices, but it in fact is much more suitable as a reference book for experienced statisticians, a vehicle for learning the S statistical computing language, or a resource for statistics instructors. The commercial statistical package S-Plus, as well as the very similar but freely available R, provides a superb environment for data exploration, manipulation, analysis, and graphical display. In addition to very high-level built-in functions for statistical and graphical analysis, S-Plus and R include a well-developed programming language, which makes them highly extensible. The strongest feature of Crawley's book is its clear, step-by-step, example-based presentation of how to use S-Plus for exploratory data analysis and statistical testing and modeling. The subject matter ranges from simple descriptive statistics and standard hypothesis-testing procedures through much more advanced topics including bootstrap and jackknife estimation of bias and standard errors, permutation tests, complex linear models, linear and nonlinear regression, generalized linear models, tree models, nonparametric smoothing, survival analysis, time series, and spatial statistics. No exercises are included. Datasets and code for all examples in the book are available from the companion Web site. The Preface states that "the computing is presented in S-Plus, but all the examples will also work in the freeware program called R." The text also includes some excellent hypothetical examples that promote intuitive understanding of such ideas as the perils of nonrandomized assignment of experimental units to treatments and pseudoreplication. It is written in an engaging, conversational style with liberal dashes of British humor. Unfortunately, the poor presentation of elementary and crucial concepts of statistics and probability renders the book unsuitable for use by people who are not already well-versed in these subjects. Again and again, Crawley states as fact the very misconceptions that we statistics instructors work so hard to prevent our students from acquiring. A typical example is on page 174, where, in the context of the two-sample t test, Crawley states, "our null hypothesis is that the two sample means are the same." After obtaining a large P value for a Wilcoxon rank-sum test, he states (p. 178), "this p-value of 0.433 is much bigger than 0.05, so we accept the null hypothesis ... we have just demonstrated, rather convincingly, that the true mu is equal to 0." After getting a value of −3.87 for a t statistic, he states (p. 175), "with t tests you can ignore the minus sign; it is only the absolute value of the difference between the two sample means that concerns us." Of course, this holds only for a two-sided test and is not a characteristic of t tests per se. The following comes from Chapter 7 on the normal distribution:

467 citations


Posted Content

330 citations


Journal ArticleDOI
TL;DR: A fully Bayesian version of this approach is developed, implemented via Markov chain Monte Carlo (MCMC) methods, and used to jointly model the longitudinal and survival data from an AIDS clinical trial comparing two treatments, didanosine and zalcitabine.
Abstract: Many clinical trials and other medical and reliability studies generate both longitudinal (repeated measurement) and survival (time to event) data. Many well-established methods exist for analyzing such data separately, but these may be inappropriate when the longitudinal variable is correlated with patient health status, hence the survival endpoint (as well as the possibility of study dropout). To remedy this, an earlier article proposed a joint model for longitudinal and survival data, obtaining maximum likelihood estimates via the EM algorithm. The longitudinal and survival responses are assumed independent given a linking latent bivariate Gaussian process and available covariates. We develop a fully Bayesian version of this approach, implemented via Markov chain Monte Carlo (MCMC) methods. We use the approach to jointly model the longitudinal and survival data from an AIDS clinical trial comparing two treatments, didanosine (ddI) and zalcitabine (ddC). Despite the complexity of the model, we find it t...

324 citations


Journal ArticleDOI
TL;DR: In this article, the authors present six alternative approaches to constructing standardized logistic regression coefficients. The least attractive of these is the unstandardized coefficient divided by its standard error (actually the normal-distribution version of the Wald statistic), while a slightly more complex alternative most closely parallels the standardized coefficient in ordinary least squares regression, in the sense of being based on the variances of the dependent variable and the predictors.
Abstract: This article reviews six alternative approaches to constructing standardized logistic regression coefficients. The least attractive of the options is the one currently most readily available in logistic regression software, the unstandardized coefficient divided by its standard error (which is actually the normal distribution version of the Wald statistic). One alternative has the advantage of simplicity, while a slightly more complex alternative most closely parallels the standardized coefficient in ordinary least squares regression, in the sense of being based on variance in the dependent variable and the predictors. The sixth alternative, based on information theory, may be the best from a conceptual standpoint, but unless and until appropriate algorithms are constructed to simplify its calculation, its use is limited to relatively simple logistic regression models in practical application.
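For illustration, a short sketch of two standardizations of the kind discussed above: the Wald-type ratio the article calls least attractive, and a latent-variable standardization that scales each coefficient by the predictor's standard deviation and an estimate of the standard deviation of the underlying logistic response. This is our own example and is not claimed to match any of the six reviewed variants exactly.

```r
# Hedged illustration of two simple standardizations for logistic regression
# coefficients (simulated data; not the article's specific proposals).
set.seed(3)
n <- 500
x1 <- rnorm(n, sd = 2); x2 <- rnorm(n, sd = 0.5)
y  <- rbinom(n, 1, plogis(0.5 * x1 + 1.5 * x2))
fit <- glm(y ~ x1 + x2, family = binomial)

b  <- coef(fit)[-1]
se <- summary(fit)$coefficients[-1, "Std. Error"]
z  <- b / se                                  # the Wald-type "standardization"
eta <- predict(fit, type = "link")
b_full <- b * apply(cbind(x1, x2), 2, sd) /   # scale by SD of x and of the
          sqrt(var(eta) + pi^2 / 3)           # latent logistic response
rbind(wald_z = z, fully_standardized = b_full)
```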

260 citations


Journal ArticleDOI
TL;DR: Marginal structural models are a flexible new class of causal models; because their parameters are estimated by weighting rather than by including confounders in the structural model, they allow more flexibility in choosing covariates and let the model more precisely reflect the scientific questions of interest.
Abstract: In traditional regression modeling, to control for confounding by a variable one must include it in the structural part of the statistical model. Marginal structural models are a flexible new set of causal models. The estimation methods used to estimate model parameters use weighting to control for confounding; this allows more flexibility in choosing covariates for inclusion in the structural model and allows the model to more precisely reflect the scientific questions of interest. An important example of this is in multicenter observational studies where there is confounding by cluster. We illustrate these points with data from a study of surgery to provide vascular access for hemodialysis and a study comparing different timings for coronary angioplasty.

224 citations


Journal ArticleDOI

144 citations


Journal ArticleDOI
TL;DR: MVA can also impute values using the EM algorithm, but values are imputed without residual variation, so analyses that use the imputed values can be biased, and EM's implementation in MVA is limited to point estimates of means, variances, and covariances.
Abstract: In addition to SPSS Base software, SPSS Inc. sells a number of add-on packages, including a package called Missing Value Analysis (MVA). In version 12.0, MVA offers four general methods for analyzing data with missing values. Unfortunately, none of these methods is wholly satisfactory when values are missing at random. The first two methods, listwise and pairwise deletion, are well known to be biased. The third method, regression imputation, uses a regression model to impute missing values, but the regression parameters are biased because they are derived using pairwise deletion. The final method, expectation maximization (EM), produces asymptotically unbiased estimates, but EM's implementation in MVA is limited to point estimates (without standard errors) of means, variances, and covariances. MVA can also impute values using the EM algorithm, but values are imputed without residual variation, so analyses that use the imputed values can be biased.
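A small simulation of the point made above about imputation without residual variation: filling in conditional means understates the variance of the completed variable, whereas adding a residual draw roughly preserves it. The regression setup and missingness rate are our own choices, not SPSS output.

```r
# Imputing conditional means without residual variation shrinks the variance
# of the imputed variable; adding residual noise roughly restores it.
set.seed(4)
n <- 1000
x <- rnorm(n)
y <- x + rnorm(n)                  # true var(y) = 2
miss <- runif(n) < 0.4             # 40% of y missing at random given x
fit <- lm(y[!miss] ~ x[!miss])

y_det  <- y; y_det[miss]  <- cbind(1, x[miss]) %*% coef(fit)   # deterministic imputation
y_stoc <- y; y_stoc[miss] <- cbind(1, x[miss]) %*% coef(fit) +
  rnorm(sum(miss), sd = summary(fit)$sigma)                    # plus residual noise
c(true = var(y), deterministic = var(y_det), stochastic = var(y_stoc))
```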

131 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider the problem of determining the number of mail questionnaires to send out and the number of personal interviews to take in following up nonresponses to the mail questionnaire, in order to attain the required precision at a minimum cost.
Abstract: The mail questionnaire is used in a number of surveys because of the economies involved. The principal objection to this method of collecting factual information is that it generally involves a large nonresponse rate, and an unknown bias is involved in an assumption [that] those responding are representative of the combined total of respondents and nonrespondents. Personal interviews generally elicit a substantially complete response, but the cost per schedule is, of course, considerably higher than it would be for the mail questionnaire method. The purpose of this paper is to indicate a technique which combines the advantages of both procedures. The problem considered is to determine the number of mail questionnaires to be sent out and the number of personal interviews to take in following up nonresponses to the mail questionnaire, in order to attain the required precision at a minimum cost. The procedure outlined below can be applied whatever the methods of collecting data are. For example, perhaps equally important as the problem of nonresponse in using mail questionnaires is the problem of call-backs in taking field interviews. In this latter problem the procedure to minimize cost for a given degree of reliability would call for taking a larger sample of first interviews and calling back on a fraction of "those not at home." The technique presented herein makes it possible to use unbiased designs at a reasonable cost where the excessive cost of ordinary methods of follow-up has frequently led to abandoning them. As an illustration, let us assume we want to estimate the number of employees in retail stores during a specified period in the State of Indiana. We shall assume we have a listing of all establishments having one or more employees, say from Social Security records, and their corresponding mailing addresses. A procedure sometimes followed is to take a sample of addresses from this list, mail out the questionnaires, and then depend exclusively on the mail returns for the estimate of number of employees for all retail stores in the State. The result of this procedure usually will be biased. It may be seriously so if there is a large rate of nonresponse. On the other hand, if all the addresses were actually visited by an enumerator, the cost of collecting the information would be much greater. Suppose the cost of mailing is 10 cents per questionnaire mailed, and the cost of processing the returns is 40 cents per questionnaire returned. Suppose, on the other hand, that the cost of carrying through field interviews is $4.10 per questionnaire, and that this cost, together with the cost of processing the field returns, is $4.50 per questionnaire. For the cost of one field visit we could then obtain about eight mail questionnaires with only a 20 percent response rate. This does not mean that we should take our entire sample by mail even though for the fixed cost we
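As a quick arithmetic check, one way to reproduce the "about eight" figure quoted above is to compare the mailing cost per returned questionnaire at a 20 percent response rate with the cost of a single field interview, setting aside the per-return processing cost, which is similar for the two modes. This reading of the quoted unit costs is our interpretation.

```r
# Cost comparison under the unit costs quoted in the abstract.
mail_cost_per_return <- 0.10 / 0.20           # $0.50 of mailing cost per returned form
field_cost_per_visit <- 4.10                  # field interview cost per visit
field_cost_per_visit / mail_cost_per_return   # roughly 8 mail returns per field visit
```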

122 citations


Journal ArticleDOI
TL;DR: The authors found that, for many bowlers, the probability of rolling a strike is not independent of previous outcomes and that the number of strikes rolled varies more across games than can be explained by chance alone.
Abstract: Earlier analysis of basketball data debunked the common perception that players sometimes have “hot hands.” That analysis, however, did not control for several confounding influences. Our analysis of professional bowling indicates that, for many bowlers, the probability of rolling a strike is not independent of previous outcomes and the number of strikes rolled varies more across games than can be explained by chance alone. For example, most bowlers have a higher strike proportion after j consecutive strikes than after j consecutive nonstrikes, and this difference becomes more pronounced as j increases from 1 to 4.
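The comparison described above can be computed directly. The sketch below tabulates, for a single simulated sequence of independent frames, the strike proportion after j consecutive strikes versus after j consecutive nonstrikes; the article works bowler by bowler and game by game, which we do not reproduce here.

```r
# Conditional strike proportions after runs of j strikes vs. j nonstrikes,
# for a simulated sequence with no "hot hand" (independent frames).
cond_strike_prop <- function(strikes, j) {
  n <- length(strikes)
  after_strikes <- after_nonstrikes <- logical(n)
  for (i in seq_len(n - j)) {
    run <- strikes[i:(i + j - 1)]
    if (all(run == 1)) after_strikes[i + j] <- TRUE
    if (all(run == 0)) after_nonstrikes[i + j] <- TRUE
  }
  c(after_j_strikes    = mean(strikes[after_strikes]),
    after_j_nonstrikes = mean(strikes[after_nonstrikes]))
}

set.seed(5)
frames <- rbinom(500, 1, 0.55)          # independent frames, strike probability 0.55
sapply(1:4, function(j) cond_strike_prop(frames, j))
```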

Journal ArticleDOI
TL;DR: WinBUGS is highly recommended for both simple and complex Bayesian analyses, with the caveat that users require knowledge of both Bayesian methods and issues in MCMC.
Abstract: WinBUGS, a software package that uses Markov chain Monte Carlo (MCMC) methods to fit Bayesian statistical models, has facilitated Bayesian analysis in a wide variety of applications areas. This review shows the steps required to fit a Bayesian model with WinBUGS, and discusses the package's strengths and weaknesses. WinBUGS is highly recommended for both simple and complex Bayesian analyses, with the caveat that users require knowledge of both Bayesian methods and issues in MCMC.

Journal ArticleDOI
TL;DR: The authors' simulations show that MCMC performs much better than ML if the label-switching problem is adequately addressed, and that asymmetric prior information performs as well as or better than the other proposed methods.
Abstract: Likelihood functions from finite mixture models have many unusual features. Maximum likelihood (ML) estimates may behave poorly over repeated samples, and the abnormal shape of the likelihood often makes it difficult to assess the uncertainty in parameter estimates. Bayesian inference via Markov chain Monte Carlo (MCMC) can be a useful alternative to ML, but the component labels may switch during the MCMC run, making the output difficult to interpret. Two basic methods for handling the label-switching problem have been proposed: imposing constraints on the parameter space and cluster-based relabeling of the simulated parameters. We have found that label switching may also be reduced by supplying small amounts of prior information that are asymmetric with respect to the mixture components. Simply assigning one observation to each component a priori may effectively eliminate the problem. Using a very simple example—a univariate sample from a mixture of two exponentials—we evaluate the performance of likelih...
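A rough sketch of the device described above for a two-component exponential mixture: a standard data-augmentation Gibbs sampler in which the smallest and largest observations are pinned to different components at every iteration. The priors, starting values, and the choice of which observations to pin are our assumptions, not the article's exact setup.

```r
# Gibbs sampler for a mixture of two exponentials with one observation
# pre-assigned to each component to discourage label switching.
set.seed(6)
x <- c(rexp(60, rate = 5), rexp(40, rate = 0.5))
n <- length(x)
pin_lo <- which.min(x); pin_hi <- which.max(x)   # observations pinned to components 1 and 2

n_iter <- 2000
lambda <- c(2, 1); p <- 0.5
draws <- matrix(NA, n_iter, 3, dimnames = list(NULL, c("lambda1", "lambda2", "p")))
for (t in seq_len(n_iter)) {
  # update component labels given current parameters
  w1 <- p * dexp(x, lambda[1])
  w2 <- (1 - p) * dexp(x, lambda[2])
  z <- rbinom(n, 1, w1 / (w1 + w2))              # 1 = component 1
  z[pin_lo] <- 1; z[pin_hi] <- 0                 # the asymmetric prior information
  # update rates (Gamma(1, 1) priors) and mixing weight (Beta(1, 1) prior)
  lambda[1] <- rgamma(1, 1 + sum(z),     1 + sum(x[z == 1]))
  lambda[2] <- rgamma(1, 1 + sum(1 - z), 1 + sum(x[z == 0]))
  p <- rbeta(1, 1 + sum(z), 1 + sum(1 - z))
  draws[t, ] <- c(lambda, p)
}
colMeans(draws[-(1:500), ])                      # posterior means after burn-in
```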

Book ChapterDOI
TL;DR: In this paper, the authors describe three of the best known paradoxes (Simpson's paradox, Kelley's Paradox, and Lord's Paradox) and illustrate them in a single data set.
Abstract: Interpreting group differences observed in aggregated data is a practice that must be done with enormous care. Often the truth underlying such data is quite different than a naive first look would indicate. The confusions that can arise are so perplexing that some of the more frequently occurring ones have been dubbed paradoxes. In this paper we describe three of the best known of these paradoxes -- Simpson's Paradox, Kelley's Paradox, and Lord's Paradox -- and illustrate them in a single data set. The data set contains the score distributions, separated by race, on the biological sciences component of the Medical College Admission Test (MCAT) and Step 1 of the United States Medical Licensing Examination™ (USMLE). Our goal in examining these data was to move toward a greater understanding of race differences in admissions policies in medical schools. As we demonstrate, the path toward this goal is hindered by differences in the score distributions, which give rise to these three paradoxes. The ease with which we were able to illustrate all of these paradoxes within a single data set is indicative of how widespread they are likely to be in practice.
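As a small numeric illustration of the first of these paradoxes (with made-up counts, not the MCAT/USMLE data): treatment A has the higher success proportion within each stratum, yet treatment B looks better in the pooled table because the strata are mixed in different proportions.

```r
# Toy example of Simpson's paradox: A beats B within both strata,
# but B beats A after the strata are pooled.
tab <- array(c(81, 6, 234, 36,     # stratum 1: success/failure counts for A, then B
               192, 71, 55, 25),   # stratum 2: success/failure counts for A, then B
             dim = c(2, 2, 2),
             dimnames = list(result = c("success", "failure"),
                             treatment = c("A", "B"),
                             stratum = c("s1", "s2")))
prop_success <- function(m) m["success", ] / colSums(m)
rbind(s1     = prop_success(tab[, , 1]),
      s2     = prop_success(tab[, , 2]),
      pooled = prop_success(tab[, , 1] + tab[, , 2]))
```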

Journal ArticleDOI
TL;DR: The best approach to nonresponse (missingness) in surveys is seen to be one where the authors can (1) insert more than one value for a missing datum, and (2) the inserted values reflect a variety of models for the dataset.
Abstract: The general approach to nonresponse (missingness) in surveys that I will take here will be to impute values for missing data (really, several values for each missing datum). The approach that imputes one value for each missing datum is standard in practice, although often criticized by more mathematical statisticians who prefer to think about estimating parameters under some model. I am very sympathetic with the imputation position. There do not exist parameters except under hypothetical models; there do, however, exist actual observed values and values that would have been observed. Focusing on the estimation of parameters is often not what we want to do since a hypothetical model is simply a structure that guides us to do sensible things with observed values. Of course (1) imputing one value for a missing datum can't be correct in general, and (2) in order to insert sensible values for a missing datum we must rely more or less on some model relating unobserved values to observed values. Hence, I see the best approach to be one where we can (1) insert more than one value for a missing datum, and (2) the inserted values reflect a variety of models for the dataset. This position focusing on values to impute rather than parameters to be estimated is actually very Bayesian, and the Bayesian perspective guides us in our design of a general system for nonresponse. What we really want to impute is the "predictive distribution" of the missing values given the observed values (having integrated, or averaged, over all model parameters). The theoretical Bayesian position tells us that (1) the missing data have a distribution given the observed data (the predictive distribution) and (2) this distribution depends on assumptions that have been made about the model. Notice that the (1)'s and (2)'s in the above paragraphs are meant to refer to the same two points. The related practical questions are (1) how do we represent in a dataset a distribution of values to impute for each missing datum? And (2) what models should we use to tie observed and unobserved values to each other in order to produce the predictive distribution needed in (1)? Section 2 addresses the first question and Section 3 addresses the second question.
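A compressed sketch of the "several values per missing datum" idea under a simple normal regression model. The approximate posterior-predictive draws (coefficients perturbed by their estimated covariance, residual noise added, residual variance treated as fixed) and the choice of M = 5 imputations are simplifications of our own; the imputed datasets are then combined with Rubin's rules.

```r
# Multiple imputation sketch: draw several completed datasets from an
# approximate predictive distribution, then combine the per-dataset estimates.
set.seed(7)
n <- 200
x <- rnorm(n); y <- 2 + 3 * x + rnorm(n)
y[runif(n) < 0.3] <- NA                       # 30% of y missing at random
obs <- !is.na(y)
fit <- lm(y ~ x, subset = obs)

M <- 5
est <- se <- numeric(M)
for (m in seq_len(M)) {
  beta_m <- MASS::mvrnorm(1, coef(fit), vcov(fit))        # perturb coefficients
  y_m <- y
  y_m[!obs] <- cbind(1, x[!obs]) %*% beta_m +
    rnorm(sum(!obs), sd = summary(fit)$sigma)             # add residual variation
  fit_m <- lm(y_m ~ x)
  est[m] <- coef(fit_m)["x"]
  se[m]  <- summary(fit_m)$coefficients["x", "Std. Error"]
}
# Rubin's combining rules: total variance = within + (1 + 1/M) * between
qbar  <- mean(est)
T_var <- mean(se^2) + (1 + 1/M) * var(est)
c(estimate = qbar, std_error = sqrt(T_var))
```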

Journal ArticleDOI
TL;DR: How R can be used in a mathematical statistics course as a toolbox for experimentation, using a series of case studies and activities that provide an introduction to the framework and idioms available in this rich environment.
Abstract: The R language, a freely available environment for statistical computing and graphics is widely used in many fields. This “expert-friendly” system has a powerful command language and programming environment, combined with an active user community. We discuss how R is ideal as a platform to support experimentation in mathematical statistics, both at the undergraduate and graduate levels. Using a series of case studies and activities, we describe how R can be used in a mathematical statistics course as a toolbox for experimentation. Examples include the calculation of a running average, maximization of a nonlinear function, resampling of a statistic, simple Bayesian modeling, sampling from multivariate normal, and estimation of power. These activities, often requiring only a few dozen lines of code, offer students the opportunity to explore statistical concepts and experiment. In addition, they provide an introduction to the framework and idioms available in this rich environment.
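One activity of the kind listed above, written as we might assign it rather than taken from the article: estimate the power of a two-sample t test by simulation and compare the answer with the closed-form calculation.

```r
# Estimate the power of a two-sample t test by simulation.
power_sim <- function(n, delta, sigma = 1, alpha = 0.05, reps = 5000) {
  rejections <- replicate(reps, {
    x <- rnorm(n, 0, sigma); y <- rnorm(n, delta, sigma)
    t.test(x, y)$p.value < alpha
  })
  mean(rejections)
}
c(simulated   = power_sim(30, 0.5),
  theoretical = power.t.test(n = 30, delta = 0.5, sd = 1)$power)
```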

Journal ArticleDOI
TL;DR: In this paper, the author shows that both the uniform first significant digit (FSD) pattern and Benford's law belong to a family of FSD distributions arising from mixtures of uniforms, and characterizes the FSD patterns for a one-parameter subset of the family.
Abstract: Traditional tests searching for human influence in data assume that, barring such influence, first significant digits (FSD) are uniformly distributed. More recent tests rely on Benford's law, postulating that lower digits are more likely than higher ones. I show that both patterns belong to a family arising from mixtures of uniforms, and characterize the FSD patterns for a one-parameter subset of the family. I also show that all family members exhibit decreasing FSD probabilities. The empirical analysis suggests that although the uniform FSD pattern and Benford's law are reasonable models for some data, alternative family members better fit other data.
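For reference, the two benchmark FSD patterns the article starts from: the uniform pattern and Benford's law, under which P(d) = log10(1 + 1/d) for d = 1, ..., 9.

```r
# The two benchmark first-significant-digit distributions.
d <- 1:9
rbind(uniform = rep(1 / 9, 9),
      benford = log10(1 + 1 / d))
```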

Journal ArticleDOI
TL;DR: In this paper, the authors show that when the observations are dependent, even slightly, the coverage probabilities of the usual confidence intervals can deviate noticeably from their nominal level and propose modified confidence intervals that incorporate the dependence structure.
Abstract: The binomial model is widely used in statistical applications. Usually, the success probability, p, and its associated confidence interval are estimated from a random sample. Thus, the observations are independent and identically distributed. Motivated by a legal case where some grand jurors could serve a second year, this article shows that when the observations are dependent, even slightly, the coverage probabilities of the usual confidence intervals can deviate noticeably from their nominal level. Several modified confidence intervals that incorporate the dependence structure are proposed and examined. Our results show that the modified Wilson, Agresti-Coull, and Jeffreys confidence intervals perform well and can be recommended for general use.
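For context, here are the standard independence-based forms of the three recommended intervals; the article's dependence-adjusted modifications are not reproduced here, and the helper function is our own.

```r
# Wilson, Agresti-Coull, and Jeffreys intervals for a binomial proportion
# under the usual independence assumption.
binom_cis <- function(x, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  p <- x / n
  # Wilson
  center <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  # Agresti-Coull: add z^2/2 successes and failures, then use the Wald form
  nt <- n + z^2; pt <- (x + z^2 / 2) / nt
  ac <- pt + c(-1, 1) * z * sqrt(pt * (1 - pt) / nt)
  # Jeffreys: quantiles of the Beta posterior under a Beta(1/2, 1/2) prior
  jf <- qbeta(c((1 - conf) / 2, 1 - (1 - conf) / 2), x + 0.5, n - x + 0.5)
  rbind(wilson = center + c(-1, 1) * half, agresti_coull = ac, jeffreys = jf)
}
binom_cis(x = 18, n = 40)
```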

Journal ArticleDOI
TL;DR: The BCS has been a controversial system since its implementation prior to the 1998 season with the most recent 2003 season producing a disputed championship, the very thing the BCS system was developed to avoid as mentioned in this paper.
Abstract: The U.S. college football champion is determined each winter by the Bowl Championship Series (BCS), a set of four college football games and an associated ranking system that helps to determine the participants in the four games. One game each winter (the specific game rotates among the four participating games) hosts the national championship game between the top two teams in the BCS ranking. The BCS has been a controversial system since its implementation prior to the 1998 season, with the most recent 2003 season producing a disputed championship, the very thing the BCS system was developed to avoid. The current article reviews the history of the college football national championship, the rise of the BCS, the BCS ranking system, and the contributions that statistical thinking can make toward improving the BCS. Though the problem of optimally ranking sports teams is a difficult one, there is clearly room for improvement in the present system!


Journal ArticleDOI
TL;DR: In this paper, the authors examined the use of bootstrap hypothesis tests for testing the equality of two multivariate distributions and found that the test levels are conservative or anti-conservative when the sample sizes are small and the number of variables is large.
Abstract: This article examines the use of bootstrap hypothesis tests for testing the equality of two multivariate distributions. The test statistic used is the maximum of the univariate two-sample t-statistics. Depending upon the type of bootstrap resampling used, the simulation studies show that the test levels are conservative or anti-conservative when the sample sizes are small and the number of variables is large. For small sample sizes, using the bootstrap resampling that preserves the Type I error can lead to a testing procedure that has lower power, sometimes dramatically lower, than a permutation test.
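A sketch of the max-t comparison described above, implemented here as a permutation test, which is one of the procedures the article compares bootstrap tests against; the simulated data and the number of permutations are arbitrary choices of ours.

```r
# Permutation test for equality of two multivariate distributions using the
# maximum absolute two-sample t statistic across variables.
max_t_stat <- function(x, y) max(abs(sapply(seq_len(ncol(x)),
                       function(j) t.test(x[, j], y[, j])$statistic)))

perm_test <- function(x, y, B = 999) {
  obs <- max_t_stat(x, y)
  z <- rbind(x, y); n1 <- nrow(x)
  perm <- replicate(B, {
    idx <- sample(nrow(z), n1)                        # random relabeling
    max_t_stat(z[idx, , drop = FALSE], z[-idx, , drop = FALSE])
  })
  mean(c(obs, perm) >= obs)                           # permutation p value
}

set.seed(8)
x <- matrix(rnorm(20 * 5), 20, 5)
y <- matrix(rnorm(25 * 5), 25, 5)
perm_test(x, y)
```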

Journal ArticleDOI
TL;DR: In this article, the authors review the history of the St. Petersburg paradox and some related games and conclude that the run-up in stock prices in the late 1990s and the subsequent declines in 2000 could have been avoided by an analysis and application of the St. Petersburg paradox.
Abstract: During the late 1990s high technology growth stock prices were raised to unprecedented levels by avid stock purchasers around the world. In early 2000, share prices subsequently underwent prolonged declines, leaving many purchasers with devastating losses. This article reviews some aspects of the history of the St. Petersburg paradox and some related games. We recount a remarkable article by Durand in which the valuation of growth stocks is related to the St. Petersburg paradox. Our conclusion is that the run-up in stock prices in the late 1990s and the subsequent declines in 2000 could have been avoided by an analysis and application of the St. Petersburg paradox.

Posted Content
TL;DR: In this paper, the authors examine the curriculum in computing for undergraduate and graduate programs in statistics and present a small sampling of computing curriculum in graduate programs of statistics in the United States.
Abstract: Statisticians have always been heavy users of computing facilities. In the past these facilities were calculators and the "computers" were those who operated them. Most applied statisticians knew how to use the calculators and even had personal calculators. Students in applied statistics courses were given instruction in the use of calculators, usually in labs associated with the courses. When "computers" became machines and use of the computing facilities required software programs, a divergence developed in how the computing facilities were used. A statistician who just wanted to do a regression analysis could find a program already written that would do the computations for the regression. Students in applied statistics courses were told about these programs—already written by someone else and available for standard statistical analyses. The training required was only marginally greater than that required for use of calculators for simpler tasks. Statisticians who wanted to do something "nonstandard" had a whole new world before them. They could now write instructions for the computer to perform whatever calculations, no matter how intricate, to implement their new analysis method. The training required to do this correctly was of a different order of magnitude. An important question is how to integrate education in computing into the statistics curriculum. The answers to this question obviously must change with technology. They must also account for the differing types of computer usage. Almost all statisticians, whether their academic degrees are bachelors, masters, or PhDs, must use prepackaged statistical analysis software to analyze data. Their statistics training should prepare them for this. Research statisticians who develop new methodology must somehow understand the methods implemented in software. Statisticians who just want to study and compare different statistical methods are often faced with a certain amount of computer programming. These activities require a different level of computer expertise than what is required to do an analysis using a statistics software package. It is helpful to examine how computing is taught in academic statistics programs. The three articles in this special section represent a small (and nonrandom!) sampling of computing curriculum in graduate programs in statistics. Also they reflect the personal perspectives of the authors on the topics that should receive special emphasis. Certainly "one size does not fit all"; different programs will have different emphases and should approach the question differently. But the articles should be useful in guiding the curriculum in computing for undergraduate and graduate programs in statistics.

Journal ArticleDOI
TL;DR: Woolson and Clarke as mentioned in this paper presented an intermediate-level reference text for medical, public health, and biological researchers, and an introductory textbook for graduate students in the biostatistics field.
Abstract: This is the second edition of a well-written book that was published by Woolson alone in 1987. The book is an intermediate-level reference text for medical, public health, and biological researchers, and an introductory textbook for graduate students in the biostatistics field. There is a considerable amount of algebra that might make the book more suitable for audiences with a mathematical background, and discourage researchers who prefer to skip the theory that underlies the statistical techniques. The book covers descriptive statistics, probability distributions, confidence intervals and hypothesis testing, and methods of inference for comparing two or more groups, all in depth. There is also a chapter on measures of association, which includes odds ratio and measures of reliability and agreement, and a chapter on estimation and comparison of survival curves. The second edition also includes a chapter on multiple linear and logistic regression. Each chapter has an extensive introduction that focuses on a specific problem and discusses design and analysis questions. The authors use many point-by-point examples taken from epidemiological and clinical settings to illustrate the methods and techniques that are described in each chapter. Different methods of solving the same problem are compared. The authors emphasize the importance of the design when selecting the method of analysis. The type of design shapes the statistical analysis that is most appropriate for that data. As stated in the book: "once a design is selected, the statistical analysis should conform to that design." The authors have made four major revisions to the book: (1) they have reworked and included more exercises at the end of each chapter to illustrate the topics discussed in that chapter; (2) they have added a section at the end of each chapter that describes how to perform the statistical analyses described using the SAS software; (3) they have added a section to the chapter "Least-Squares Regression Methods" that describes how to examine residuals to determine if the underlying assumptions for simple linear regression are valid; and (4) they have added a new chapter that describes and illustrates the techniques of multiple linear and multiple logistic regression, including a section on analysis of covariance. A characteristic that makes this book more attractive than other books addressed to similar audiences is the applied approach that the authors use to introduce and compare statistical methods through step-by-step examples integrated with the theory underlying these methods. Woolson and Clarke say in the Preface that "an understanding of the theory behind a technique should alert the user as to when it would be inappropriate to apply that technique." And they have done a great job in making statistical theory more accessible. Other strengths of the book are an extensive coverage of nonparametric tests and multiple comparisons, and a chapter on analysis of epidemiologic and clinical data. However, topics like the Cox proportional hazards model and discriminant analysis are not covered, and repeated-measures designs are covered only for the paired t test. As a biostatistician who has worked in clinical research for more than 15 years, I do not use this book as frequently now as I did many years ago. An experienced statistician or researcher will probably use more specialized books.
But I have used the first edition of this book when teaching introductory statistics courses to clinicians and to public health students, and I have recommended it to research colleagues who have asked me to recommend an introductory or intermediate book on statistical methods. A criticism that I have is that the authors have updated very few bibliographic references. For example, some reference books that I have found very useful for survival analysis, such as Collet (1994) or Hosmer and Lemeshow (1999), are not cited. The newest reference in the chapter on analysis of epidemiologic and clinical data is from 1986. To summarize, I find this book interesting and useful. I recommend it as an addition to your statistical library, and if you already own the first edition, it would be worthwhile to update it.

Journal ArticleDOI
TL;DR: In this paper, the authors show that the criteria commonly used to evaluate principal components are not adequate for evaluating such alternatives, and propose two new criteria that are more suitable for this purpose.
Abstract: Principal components are the benchmark for linear dimension reduction, but they are not always easy to interpret. For this reason, some alternatives have been proposed in recent years. These methods produce components that, unlike principal components, are correlated and/or have nonorthogonal loadings. This article shows that the criteria commonly used to evaluate principal components are not adequate for evaluating such alternatives, and proposes two new criteria that are more suitable for this purpose.

Journal ArticleDOI
TL;DR: For new graduate students, the authors discuss issues and aspects of doing statistical research and provide advice, answering questions that we had when we were beginners, such as When do I start, How do I find out what has already been done, how do I make progress, and What else can I do?
Abstract: For new graduate students, we discuss issues and aspects of doing statistical research and provide advice. We answer questions that we had when we were beginners, like When do I start?, How do I start?, How do I find out what has already been done?, How do I make progress?, How do I finish?, and What else can I do?.


Journal ArticleDOI
TL;DR: Six software packages for fitting either marginal or random effects models to correlated survival data are considered: SAS, Stata, S-Plus and R, MLwiN, and WinBUGS.
Abstract: This article provides a review of software packages for fitting either marginal or random effects models to correlated survival data. Six packages are considered: SAS, Stata, S-Plus and R, MLwiN, and WinBUGS. Each software package is reviewed with respect to Cox and parametric accelerated failure time (AFT) models. The article aims to give the reader a summary of the different capabilities of each package.

Journal ArticleDOI
TL;DR: The most recent survey of undergraduate education in the mathematical sciences in the United States was conducted by the Conference Board of the Mathematical Sciences (CBMS) with the support of the National Science Foundation (NSF) as discussed by the authors.
Abstract: Every five years since 1965 the Conference Board of the Mathematical Sciences (CBMS), with the support of the National Science Foundation, has conducted a national survey of undergraduate education in the mathematical sciences in the United States. The survey collects information on undergraduate enrollments in courses in the mathematical sciences and on the demographics of faculty members. It also asks about the undergraduate curriculum to determine what is taught, who teaches it, and how it is taught. The 2000 CBMS survey, for the first time, sampled departments of statistics separately and asked questions about the educational backgrounds of those teaching statistics in departments of mathematics. This article presents a summary of the 2000 CBMS survey results of particular interest to statisticians.

Journal ArticleDOI
TL;DR: The first two regression lines, and the first correlations, were calculated by Francis Galton, in his work on heredity in sweet peas and in humans as mentioned in this paper, using the family data on stature, which they obtained directly from Galton's notebooks, to compare the sharpness of his methods, relative to modern-day ones, for dealing with this complication.
Abstract: The first two regression lines, and the first correlations, were calculated by Francis Galton, in his work on heredity in sweet peas and in humans. When “regressing” the heights of adult children on those of their parents, Galton had to deal with the fact that men are generally taller than women—but without modern-day statistical tools such as multiple regression and partial correlation. This article uses the family data on stature, which we obtained directly from Galton's notebooks, to (a) compare the sharpness of his methods, relative to modern-day ones, for dealing with this complication; and (b) estimate the additional familial component of variance in stature beyond that contributed by the parental heights. In keeping with Galton's plea for “a manuscript library of original data,” these historical and pedagogically valuable data are now available to the statistical community as digital photographs and as a dataset ready for further analyses.