
Showing papers in "Journal of the American Statistical Association in 2004"


Journal ArticleDOI
TL;DR: The Elements of Statistical Learning: Data Mining, Inference, and Prediction is a widely used text that presents data mining, inference, and prediction within a unified statistical framework.
Abstract: (2004). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Journal of the American Statistical Association: Vol. 99, No. 466, pp. 567-567.

10,549 citations


Journal ArticleDOI
TL;DR: The theory of proper scoring rules on general probability spaces is reviewed and developed, and the intuitively appealing interval score is proposed as a utility function in interval estimation that addresses width as well as coverage.
Abstract: Scoring rules assess the quality of probabilistic forecasts, by assigning a numerical score based on the predictive distribution and on the event or value that materializes. A scoring rule is proper if the forecaster maximizes the expected score for an observation drawn from the distribution F if he or she issues the probabilistic forecast F, rather than G ≠ F. It is strictly proper if the maximum is unique. In prediction problems, proper scoring rules encourage the forecaster to make careful assessments and to be honest. In estimation problems, strictly proper scoring rules provide attractive loss and utility functions that can be tailored to the problem at hand. This article reviews and develops the theory of proper scoring rules on general probability spaces, and proposes and discusses examples thereof. Proper scoring rules derive from convex functions and relate to information measures, entropy functions, and Bregman divergences. In the case of categorical variables, we prove a rigorous version of the ...

4,644 citations
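The interval score mentioned in the summary has a simple closed form. The sketch below assumes the standard negatively oriented version for a central (1 − α) × 100% prediction interval with endpoints l and u: the width of the interval plus a penalty of 2/α times the amount by which the observation falls outside it. Lower scores are better; the function name and values are illustrative.

```python
# Hedged sketch: the (negatively oriented) interval score for a central
# (1 - alpha) x 100% prediction interval [l, u] and realized value x.
# Lower scores are better; names and numbers are illustrative only.

def interval_score(l, u, x, alpha):
    """Interval width plus penalties for observations falling outside [l, u]."""
    score = u - l
    if x < l:
        score += (2.0 / alpha) * (l - x)
    elif x > u:
        score += (2.0 / alpha) * (x - u)
    return score

# Example: a central 90% interval (alpha = 0.10) that misses the observation.
print(interval_score(l=1.2, u=3.8, x=4.1, alpha=0.10))  # width 2.6 + 20 * 0.3 = 8.6
```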


Journal ArticleDOI
TL;DR: The default ad hoc adjustment, provided as part of the Affymetrix system, can be improved through the use of estimators derived from a statistical model that uses probe sequence information, which greatly improves the performance of the technology in various practical applications.
Abstract: High-density oligonucleotide expression arrays are widely used in many areas of biomedical research. Affymetrix GeneChip arrays are the most popular. In the Affymetrix system, a fair amount of further preprocessing and data reduction occurs after the image-processing step. Statistical procedures developed by academic groups have been successful in improving the default algorithms provided by the Affymetrix system. In this article we present a solution to one of the preprocessing steps—background adjustment—based on a formal statistical framework. Our solution greatly improves the performance of the technology in various practical applications. These arrays use short oligonucleotides to probe for genes in an RNA sample. Typically, each gene is represented by 11–20 pairs of oligonucleotide probes. The first component of these pairs is referred to as a perfect match probe and is designed to hybridize only with transcripts from the intended gene (i.e., specific hybridization). However, hybridization by other...

1,925 citations
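To illustrate the kind of model-based background adjustment the abstract describes, here is a minimal numerical sketch. It assumes a simple convolution model in which an observed intensity O is signal S plus background B, with S exponential and B normal, and reports the posterior mean E[S | O = o] computed by numerical integration. It is not the authors' method or the Affymetrix software; the parameter values are placeholders.

```python
# Hedged sketch of a model-based background adjustment: observed intensity
# O = S + B with signal S ~ Exponential(rate=lam) and background B ~ Normal(mu, sigma).
# The adjusted value is the posterior mean E[S | O = o], obtained here by numerical
# integration rather than a closed form. Parameter values are illustrative.
import numpy as np

def adjust(o, lam=0.01, mu=100.0, sigma=20.0, n_grid=20_000):
    s = np.linspace(0.0, max(o, mu) + 10.0 / lam, n_grid)     # grid over possible signals
    # Unnormalized posterior density of S given O = o:  p(s | o) ∝ f_S(s) * f_B(o - s)
    log_post = -lam * s - 0.5 * ((o - s - mu) / sigma) ** 2
    w = np.exp(log_post - log_post.max())
    return float(np.sum(s * w) / np.sum(w))                   # posterior mean of the signal

for o in (80.0, 150.0, 400.0):
    print(o, round(adjust(o), 1))                             # adjusted values stay positive
```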


Journal ArticleDOI
TL;DR: It is proposed that GAM's with a ridge penalty provide a practical solution in such circumstances, and a multiple smoothing parameter selection method suitable for use in the presence of such a penalty is developed.
Abstract: Representation of generalized additive models (GAM's) using penalized regression splines allows GAM's to be employed in a straightforward manner using penalized regression methods. Not only is inference facilitated by this approach, but it is also possible to integrate model selection in the form of smoothing parameter selection into model fitting in a computationally efficient manner using well founded criteria such as generalized cross-validation. The current fitting and smoothing parameter selection methods for such models are usually effective, but do not provide the level of numerical stability to which users of linear regression packages, for example, are accustomed. In particular the existing methods cannot deal adequately with numerical rank deficiency of the GAM fitting problem, and it is not straightforward to produce methods that can do so, given that the degree of rank deficiency can be smoothing parameter dependent. In addition, models with the potential flexibility of GAM's can also present ...

1,657 citations
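The following sketch illustrates the basic mechanism the abstract refers to: representing a smooth term by penalized regression splines and choosing the smoothing parameter by generalized cross-validation. It is a single-smooth toy example with a naive basis and solver, not the numerically stable, rank-aware multiple-smoothing-parameter method developed in the article (which underlies the author's mgcv package for R).

```python
# Hedged sketch: a single penalized regression spline with its smoothing parameter
# chosen by generalized cross-validation (GCV). Illustrative only; not the stable,
# rank-aware fitting method of the article.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

# Cubic truncated-power basis with interior knots (an illustrative basis choice).
knots = np.linspace(0.05, 0.95, 20)
X = np.column_stack([np.ones(n), x, x**2, x**3] +
                    [np.clip(x - k, 0, None) ** 3 for k in knots])
# Penalize only the truncated-power coefficients (a simple ridge-type penalty).
D = np.diag([0, 0, 0, 0] + [1] * len(knots)).astype(float)

def gcv(lam):
    A = X.T @ X + lam * D
    H = X @ np.linalg.solve(A, X.T)          # influence (hat) matrix of the penalized fit
    resid = y - H @ y
    edf = np.trace(H)                        # effective degrees of freedom
    return n * np.sum(resid**2) / (n - edf) ** 2

lams = 10.0 ** np.linspace(-8, 2, 41)
print("GCV-selected lambda:", min(lams, key=gcv))
```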




Journal ArticleDOI
TL;DR: In this paper, a unified strategy for selecting spatially balanced probability samples of natural resources is presented, which is based on creating a function that maps two-dimensional space into one-dimensional space, thereby defining an ordered spatial address.
Abstract: The spatial distribution of a natural resource is an important consideration in designing an efficient survey or monitoring program for the resource. Generally, sample sites that are spatially balanced, that is, more or less evenly dispersed over the extent of the resource, are more efficient than simple random sampling. We review a unified strategy for selecting spatially balanced probability samples of natural resources. The technique is based on creating a function that maps two-dimensional space into one-dimensional space, thereby defining an ordered spatial address. We use a restricted randomization to randomly order the addresses, so that systematic sampling along the randomly ordered linear structure results in a spatially well-balanced random sample. Variable inclusion probability, proportional to an arbitrary positive ancillary variable, is easily accommodated. The basic technique selects points in a two-dimensional continuum, but is also applicable to sampling finite populations or one-dimension...

1,082 citations
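The core idea, mapping two-dimensional locations to an ordered one-dimensional address, randomizing the order in a restricted way, and then sampling systematically, can be sketched in a few lines. The code below is a simplified, equal-probability illustration with a quadrant-recursive address; it is not the authors' production algorithm and omits variable inclusion probabilities.

```python
# Hedged sketch of spatially balanced sampling: points in the unit square get a
# quadrant-recursive 1-D address, the quadrant order is randomly permuted within each
# cell (a restricted randomization), and a systematic sample with a random start is
# drawn along the randomized order. Simplified illustration only.
import numpy as np

rng = np.random.default_rng(1)

def make_address(levels=8):
    perms = {}                                    # one random quadrant permutation per cell
    def address(pt):
        x, y = pt
        digits, cell = [], ()
        for _ in range(levels):
            qx, qy = int(x * 2), int(y * 2)       # which quadrant of the current cell
            x, y = x * 2 - qx, y * 2 - qy         # rescale coordinates into that quadrant
            if cell not in perms:
                perms[cell] = rng.permutation(4)  # hierarchical (restricted) randomization
            digits.append(int(perms[cell][2 * qy + qx]))
            cell = cell + (2 * qy + qx,)
        return digits
    return address

pts = rng.uniform(0, 1, size=(500, 2))            # candidate sites in the unit square
addr = make_address()
order = sorted(range(len(pts)), key=lambda i: addr(pts[i]))

n_sample, step = 25, len(pts) / 25
start = rng.uniform(0, step)
chosen = [order[int(start + k * step)] for k in range(n_sample)]
print(pts[chosen][:5])                            # a spatially well-spread sample
```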


Journal ArticleDOI
TL;DR: This paper develops the theoretical properties of the propensity function, a generalization of the Rosenbaum and Rubin propensity score, yielding theory and methods that encompass existing propensity score techniques and widen their applicability by allowing for arbitrary treatment regimes.
Abstract: In this article we develop the theoretical properties of the propensity function, which is a generalization of the propensity score of Rosenbaum and Rubin. Methods based on the propensity score have long been used for causal inference in observational studies; they are easy to use and can effectively reduce the bias caused by nonrandom treatment assignment. Although treatment regimes need not be binary in practice, the propensity score methods are generally confined to binary treatment scenarios. Two possible exceptions have been suggested for ordinal and categorical treatments. In this article we develop theory and methods that encompass all of these techniques and widen their applicability by allowing for arbitrary treatment regimes. We illustrate our propensity function methods by applying them to two datasets; we estimate the effect of smoking on medical expenditure and the effect of schooling on wages. We also conduct simulation studies to investigate the performance of our methods.

859 citations
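In the spirit of the approach described above, the sketch below handles a continuous treatment: it assumes a normal linear model for treatment given covariates, uses the fitted conditional mean as the estimated propensity function, and subclassifies on it before estimating the dose-response slope. The data, model, and variable names are synthetic stand-ins, not the smoking or schooling applications.

```python
# Hedged sketch: a propensity-function style analysis for a continuous treatment,
# assuming T | X ~ N(X @ beta, sigma^2). Units are subclassified on the estimated
# conditional mean of T before the outcome comparison. Everything here is synthetic.
import numpy as np

rng = np.random.default_rng(2)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
T = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)      # continuous treatment
Y = 2.0 * T + X[:, 1] + rng.normal(size=n)                   # outcome; X[:, 1] confounds

# Step 1: estimate the propensity function parameters (here, OLS of T on X).
beta_hat, *_ = np.linalg.lstsq(X, T, rcond=None)
e_hat = X @ beta_hat                                          # estimated E[T | X]

# Step 2: subclassify on the estimated propensity function, estimate the slope of
# Y on T within each subclass, and average across subclasses.
strata = np.digitize(e_hat, np.quantile(e_hat, [0.2, 0.4, 0.6, 0.8]))
slopes, sizes = [], []
for s in range(5):
    idx = strata == s
    slopes.append(np.polyfit(T[idx], Y[idx], 1)[0])           # within-subclass slope
    sizes.append(idx.sum())
print("subclassified effect estimate:", round(float(np.average(slopes, weights=sizes)), 3))
print("naive (unadjusted) slope:     ", round(float(np.polyfit(T, Y, 1)[0]), 3))
```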


Journal ArticleDOI
TL;DR: The MSVM is proposed, which extends the binary SVM to the multicategory case and has good theoretical properties, and an approximate leave-one-out cross-validation function is derived, analogous to the binary case.
Abstract: Two-category support vector machines (SVM) have been very popular in the machine learning community for classification problems. Solving multicategory problems by a series of binary classifiers is quite common in the SVM paradigm; however, this approach may fail under various circumstances. We propose the multicategory support vector machine (MSVM), which extends the binary SVM to the multicategory case and has good theoretical properties. The proposed method provides a unifying framework when there are either equal or unequal misclassification costs. As a tuning criterion for the MSVM, an approximate leave-one-out cross-validation function, called Generalized Approximate Cross Validation, is derived, analogous to the binary case. The effectiveness of the MSVM is demonstrated through the applications to cancer classification using microarray data and cloud classification with satellite radiance profiles.

767 citations


Posted Content
TL;DR: This book is a valuable resource, both for the statistician needing an introduction to machine learning and related fields and for the computer scientist wishing to learn more about statistics, and statisticians will especially appreciate that it is written in their own language.
Abstract: In the words of the authors, the goal of this book was to “bring together many of the important new ideas in learning, and explain them in a statistical framework.” The authors have been quite successful in achieving this objective, and their work is a welcome addition to the statistics and learning literatures. Statistics has always been interdisciplinary, borrowing ideas from diverse fields and repaying the debt with contributions, both theoretical and practical, to the other intellectual disciplines. For statistical learning, this cross-fertilization is especially noticeable. This book is a valuable resource, both for the statistician needing an introduction to machine learning and related fields and for the computer scientist wishing to learn more about statistics. Statisticians will especially appreciate that it is written in their own language. The level of the book is roughly that of a second-year doctoral student in statistics, and it will be useful as a textbook for such students. In a stimulating article, Breiman (2001) argued that statistics has been focused too much on a “data modeling culture,” where the model is paramount. Breiman argued instead for an “algorithmic modeling culture,” with emphasis on black-box types of prediction. Breiman’s article is controversial, and in his discussion, Efron objects that “prediction is certainly an interesting subject, but Leo’s paper overstates both its role and our profession’s lack of interest in it.” Although I mostly agree with Efron, I worry that the courses offered by most statistics departments include little, if any, treatment of statistical learning and prediction. (Stanford, where Efron and the authors of this book teach, is an exception.) Graduate students in statistics certainly need to know more than they do now about prediction, machine learning, statistical learning, and data mining (not disjoint subjects). I hope that graduate courses covering the topics of this book will become more common in statistics curricula. Most of the book is focused on supervised learning, where one has inputs and outputs from some system and wishes to predict unknown outputs corresponding to known inputs. The methods discussed for supervised learning include linear and logistic regression; basis expansion, such as splines and wavelets; kernel techniques, such as local regression, local likelihood, and radial basis functions; neural networks; additive models; decision trees based on recursive partitioning, such as CART; and support vector machines. There is a final chapter on unsupervised learning, including association rules, cluster analysis, self-organizing maps, principal components and curves, and independent component analysis. Many statisticians will be unfamiliar with at least some of these algorithms. Association rules are popular for mining commercial data in what is called “market basket analysis.” The aim is to discover types of products often purchased together. Such knowledge can be used to develop marketing strategies, such as store or catalog layouts. Self-organizing maps (SOMs) involve essentially constrained k-means clustering, where prototypes are mapped to a two-dimensional curved coordinate system. Independent components analysis is similar to principal components analysis and factor analysis, but it uses higher-order moments to achieve independence, not merely zero correlation between components. A strength of the book is the attempt to organize a plethora of methods into a coherent whole.
The relationships among the methods are emphasized. I know of no other book that covers so much ground. Of course, with such broad coverage, it is not possible to cover any single topic in great depth, so this book will encourage further reading. Fortunately, each chapter includes bibliographic notes surveying the recent literature. These notes and the extensive references provide a good introduction to the learning literature, including much outside of statistics. The book might be more suitable as a textbook if less material were covered in greater depth; however, such a change would compromise the book’s usefulness as a reference, and so I am happier with the book as it was written.

631 citations


Journal ArticleDOI
TL;DR: In this article, methods for performing smoothing computations in general state-space models are developed, relying on a particle representation of the filtering distributions and their evolution through time via sequential importance sampling and resampling ideas.
Abstract: We develop methods for performing smoothing computations in general state-space models. The methods rely on a particle representation of the filtering distributions, and their evolution through time using sequential importance sampling and resampling ideas. In particular, novel techniques are presented for generation of sample realizations of historical state sequences. This is carried out in a forward-filtering backward-smoothing procedure that can be viewed as the nonlinear, non-Gaussian counterpart of standard Kalman filter-based simulation smoothers in the linear Gaussian case. Convergence in the mean squared error sense of the smoothed trajectories is proved, showing the validity of our proposed method. The methods are tested in a substantial application for the processing of speech signals represented by a time-varying autoregression and parameterized in terms of time-varying partial correlation coefficients, comparing the results of our algorithm with those from a simple smoother based on the filte...
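The forward-filtering backward-smoothing idea can be illustrated on a toy linear-Gaussian AR(1) state-space model, where the Kalman smoother would be the exact benchmark. The sketch below is a generic bootstrap particle filter followed by backward simulation of smoothed trajectories; it conveys the structure only and is not the authors' algorithm or their time-varying autoregression application to speech.

```python
# Hedged sketch: forward-filtering backward-simulation particle smoothing for a toy
# model x_t = phi*x_{t-1} + v_t, y_t = x_t + w_t with Gaussian noise. Illustrative only.
import numpy as np

rng = np.random.default_rng(3)
phi, q, r, T, N = 0.9, 0.5, 1.0, 100, 500

# Simulate states and observations.
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + rng.normal(scale=np.sqrt(q))
y = x + rng.normal(scale=np.sqrt(r), size=T)

# Forward pass: bootstrap particle filter, storing particles and filtering weights.
P, W = np.zeros((T, N)), np.zeros((T, N))
for t in range(T):
    if t == 0:
        P[t] = rng.normal(scale=2.0, size=N)                  # diffuse initial particles
    else:
        anc = rng.choice(N, size=N, p=W[t - 1])               # multinomial resampling
        P[t] = phi * P[t - 1, anc] + rng.normal(scale=np.sqrt(q), size=N)
    logw = -0.5 * (y[t] - P[t]) ** 2 / r                      # Gaussian measurement weight
    w = np.exp(logw - logw.max())
    W[t] = w / w.sum()

# Backward pass: simulate smoothed trajectories with weights
# proportional to w_t^i * N(x_{t+1} | phi * x_t^i, q).
def backward_trajectory():
    traj = np.zeros(T)
    j = rng.choice(N, p=W[T - 1])
    traj[T - 1] = P[T - 1, j]
    for t in range(T - 2, -1, -1):
        logw = np.log(W[t] + 1e-300) - 0.5 * (traj[t + 1] - phi * P[t]) ** 2 / q
        w = np.exp(logw - logw.max())
        j = rng.choice(N, p=w / w.sum())
        traj[t] = P[t, j]
    return traj

smoothed = np.mean([backward_trajectory() for _ in range(50)], axis=0)
print("RMSE of smoothed mean vs. true states:",
      round(float(np.sqrt(np.mean((smoothed - x) ** 2))), 3))
```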


Journal ArticleDOI
TL;DR: In this paper, the authors evaluate the performance of full matching for the first time, modifying it in order to minimize variance as well as bias and then using it to compare coached and uncoached takers of the SAT.
Abstract: Among matching techniques for observational studies, full matching is in principle the best, in the sense that its alignment of comparable treated and control subjects is as good as that of any alternate method, and potentially much better. This article evaluates the practical performance of full matching for the first time, modifying it in order to minimize variance as well as bias and then using it to compare coached and uncoached takers of the SAT. In this new version, with restrictions on the ratio of treated subjects to controls within matched sets, full matching makes use of many more observations than does pair matching, but achieves far closer matches than does matching with k ≥ 2 controls. Prior to matching, the coached and uncoached groups are separated on the propensity score by 1.1 SDs. Full matching reduces this separation to 1% or 2% of an SD. In older literature comparing matching and regression, Cochran expressed doubts that any method of adjustment could substantially reduce observed bias ...

Journal ArticleDOI
TL;DR: This article shows that cross-validation produces asymptotically optimal smoothing for relevant components, while eliminating irrelevant components by oversmoothing in the problem of nonparametric estimation of a conditional density.
Abstract: Many practical problems, especially some connected with forecasting, require nonparametric estimation of conditional densities from mixed data. For example, given an explanatory data vector X for a prospective customer, with components that could include the customer's salary, occupation, age, sex, marital status, and address, a company might wish to estimate the density of the expenditure, Y, that could be made by that person, basing the inference on observations of (X, Y) for previous clients. Choosing appropriate smoothing parameters for this problem can be tricky, not least because plug-in rules take a particularly complex form in the case of mixed data. An obvious difficulty is that there exists no general formula for the optimal smoothing parameters. More insidiously, and more seriously, it can be difficult to determine which components of X are relevant to the problem of conditional inference. For example, if the jth component of X is independent of Y, then that component is irrelevant to es...
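A small sketch of the mechanism described above: a kernel conditional density estimator with one relevant continuous predictor and one irrelevant binary predictor, with bandwidths chosen by leave-one-out likelihood cross-validation (a common variant; the article studies a squared-error cross-validation criterion). When the binary predictor's bandwidth is driven to its upper limit (0.5 for a two-category Aitchison-Aitken kernel), that component is effectively smoothed out of the estimate.

```python
# Hedged sketch: cross-validated bandwidth choice for a kernel conditional density
# estimate f(y | x) with mixed predictors. Synthetic data; illustrative criterion.
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
n = 300
xc = rng.normal(size=n)                       # relevant continuous predictor
xd = rng.integers(0, 2, size=n)               # irrelevant binary predictor
y = 1.5 * xc + rng.normal(scale=0.5, size=n)  # y depends on xc only

def loo_loglik(h_y, h_c, lam):
    """Leave-one-out log-likelihood of the conditional density estimate."""
    total = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        wc = np.exp(-0.5 * ((xc[i] - xc[mask]) / h_c) ** 2)       # continuous kernel
        wd = np.where(xd[mask] == xd[i], 1.0 - lam, lam)          # Aitchison-Aitken kernel
        w = wc * wd
        ky = np.exp(-0.5 * ((y[i] - y[mask]) / h_y) ** 2) / (h_y * np.sqrt(2 * np.pi))
        # f_hat(y_i | x_i) = sum_j w_j K_hy(y_i - y_j) / sum_j w_j
        total += np.log(np.sum(w * ky) / np.sum(w) + 1e-300)
    return total

grid = product([0.2, 0.4, 0.8], [0.2, 0.5, 1.0], [0.1, 0.3, 0.5])
best = max(grid, key=lambda p: loo_loglik(*p))
print("selected (h_y, h_xc, lambda_xd):", best)   # lambda near 0.5 => xd smoothed out
```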

Journal ArticleDOI
TL;DR: In this paper, it is shown that not all parameters in the Matern class can be estimated consistently if data are observed in an increasing density in a fixed domain, regardless of the estimation methods used.
Abstract: It is shown that in model-based geostatistics, not all parameters in the Matern class can be estimated consistently if data are observed in an increasing density in a fixed domain, regardless of the estimation methods used. Nevertheless, one quantity can be estimated consistently by the maximum likelihood method, and this quantity is more important to spatial interpolation. The results are established by using the properties of equivalence and orthogonality of probability measures. Some sufficient conditions are provided for both Gaussian and non-Gaussian equivalent measures, and necessary conditions are provided for Gaussian equivalent measures. Two simulation studies are presented that show that the fixed-domain asymptotic properties can explain some finite-sample behavior of both interpolation and estimation when the sample size is moderately large.

Journal ArticleDOI
TL;DR: A Rao–Blackwell type of relation is derived in which nonparametric methods such as cross-validation are seen to be randomized versions of their covariance penalty counterparts.
Abstract: Having constructed a data-based estimation rule, perhaps a logistic regression or a classification tree, the statistician would like to know its performance as a predictor of future cases. There are two main theories concerning prediction error: (1) penalty methods such as Cp, Akaike's information criterion, and Stein's unbiased risk estimate that depend on the covariance between data points and their corresponding predictions; and (2) cross-validation and related nonparametric bootstrap techniques. This article concerns the connection between the two theories. A Rao–Blackwell type of relation is derived in which nonparametric methods such as cross-validation are seen to be randomized versions of their covariance penalty counterparts. The model-based penalty methods offer substantially better accuracy, assuming that the model is believable.
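The two theories of prediction error mentioned above can be placed side by side in a few lines for the simplest case, a linear smoother with known error variance, where the covariance penalty reduces to the familiar Cp term 2σ²tr(H)/n. The sketch below compares that estimate with leave-one-out cross-validation on synthetic data; it is only an illustration of the connection, not the article's Rao-Blackwell argument.

```python
# Hedged sketch: two estimates of prediction error for an OLS fit, a Cp-type
# covariance penalty (2 * sigma^2 * trace(H) / n, with sigma assumed known) and
# leave-one-out cross-validation. Synthetic data.
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 100, 5, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)
y = X @ beta + rng.normal(scale=sigma, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix of the linear smoother
resid = y - H @ y

cp_estimate = np.mean(resid ** 2) + 2 * sigma ** 2 * np.trace(H) / n
loocv = np.mean((resid / (1 - np.diag(H))) ** 2)
print("covariance-penalty (Cp) estimate:", round(float(cp_estimate), 3))
print("leave-one-out CV estimate:      ", round(float(loocv), 3))
```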


Journal ArticleDOI
TL;DR: In this paper, two new approaches are proposed for estimating the regression coefficients in a semiparametric model and the asymptotic normality of the resulting estimators is established.
Abstract: Semiparametric regression models are very useful for longitudinal data analysis. The complexity of semiparametric models and the structure of longitudinal data pose new challenges to parametric inferences and model selection that frequently arise from longitudinal data analysis. In this article, two new approaches are proposed for estimating the regression coefficients in a semiparametric model. The asymptotic normality of the resulting estimators is established. An innovative class of variable selection procedures is proposed to select significant variables in the semiparametric models. The proposed procedures are distinguished from others in that they simultaneously select significant variables and estimate unknown parameters. Rates of convergence of the resulting estimators are established. With a proper choice of regularization parameters and penalty functions, the proposed variable selection procedures are shown to perform as well as an oracle estimator. A robust standard error formula is derived usi...

Journal ArticleDOI
TL;DR: In this paper, the authors examine the many tools proposed in the literature for selecting the “best model,” from both frequentist and Bayesian perspectives, through the lens of Bayesian decision theory.
Abstract: Model selection is an important part of any statistical analysis and, indeed, is central to the pursuit of science in general. Many authors have examined the question of model selection from both frequentist and Bayesian perspectives, and many tools for selecting the “best model” have been suggested in the literature. This paper considers the various proposals from a Bayesian decision–theoretic perspective.

Journal ArticleDOI
TL;DR: In this article, the limiting distribution of a quantile autoregression estimator and its t-statistic is derived; it is a linear combination of the Dickey-Fuller distribution and the standard normal, with the weight determined by the correlation coefficient of related time series.
Abstract: We study statistical inference in quantile autoregression models when the largest autoregressive coefficient may be unity. The limiting distribution of a quantile autoregression estimator and its t-statistic is derived. The asymptotic distribution is not the conventional Dickey–Fuller distribution, but rather a linear combination of the Dickey–Fuller distribution and the standard normal, with the weight determined by the correlation coefficient of related time series. Inference methods based on the estimator are investigated asymptotically. Monte Carlo results indicate that the new inference procedures have power gains over the conventional least squares-based unit root tests in the presence of non-Gaussian disturbances. An empirical application of the model to U. S. macroeconomic time series data further illustrates the potential of the new approach.
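For orientation, fitting a quantile autoregression itself is straightforward; the sketch below regresses y_t on y_{t-1} at several quantiles using statsmodels' QuantReg on a simulated near-unit-root series with heavy-tailed errors. It produces the QAR coefficient estimates only and does not implement the unit-root inference or limiting distributions developed in the article.

```python
# Hedged sketch: quantile autoregression of y_t on y_{t-1} at several quantiles.
# Fits coefficients only; the article's inference procedures are not implemented.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
T = 500
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.95 * y[t - 1] + rng.standard_t(df=3)     # near-unit-root, heavy-tailed errors

Y = y[1:]
X = sm.add_constant(y[:-1])
for q in (0.1, 0.5, 0.9):
    res = sm.QuantReg(Y, X).fit(q=q)
    print(f"tau = {q}: autoregressive coefficient = {res.params[1]:.3f}")
```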

Journal ArticleDOI
TL;DR: A model is proposed that describes dependence across random distributions in an analysis of variance (ANOVA)-type fashion; it can be rewritten as a DP mixture of ANOVA models and thus inherits all the computational advantages of standard DP mixture models.
Abstract: We consider dependent nonparametric models for related random probability distributions. For example, the random distributions might be indexed by a categorical covariate indicating the treatment levels in a clinical trial and might represent random effects distributions under the respective treatment combinations. We propose a model that describes dependence across random distributions in an analysis of variance (ANOVA)-type fashion. We define a probability model in such a way that marginally each random measure follows a Dirichlet process (DP) and use the dependent Dirichlet process to define the desired dependence across the related random measures. The resulting probability model can alternatively be described as a mixture of ANOVA models with a DP prior on the unknown mixing measure. The main features of the proposed approach are ease of interpretation and computational simplicity. Because the model follows the standard ANOVA structure, interpretation and inference parallels conventions for ANOVA mode...


Journal ArticleDOI
TL;DR: In this article, the authors adopt a decision-theoretic approach, using loss functions that combine the competing goals of discovering as many differentially expressed genes as possible, while keeping the number of false discoveries manageable.
Abstract: We consider the choice of an optimal sample size for multiple-comparison problems. The motivating application is the choice of the number of microarray experiments to be carried out when learning about differential gene expression. However, the approach is valid in any application that involves multiple comparisons in a large number of hypothesis tests. We discuss two decision problems in the context of this setup: the sample size selection and the decision about the multiple comparisons. We adopt a decision-theoretic approach, using loss functions that combine the competing goals of discovering as many differentially expressed genes as possible, while keeping the number of false discoveries manageable. For consistency, we use the same loss function for both decisions. The decision rule that emerges for the multiple-comparison problem takes the exact form of the rules proposed in the recent literature to control the posterior expected false discovery rate. For the sample size selection, we combine the expe...
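The kind of decision rule the abstract refers to, controlling the posterior expected false discovery rate, can be sketched directly once posterior probabilities of differential expression are available. In the toy example below those probabilities are synthetic placeholders; the rule flags the genes with the largest posterior probabilities while keeping the posterior expected FDR below a target level.

```python
# Hedged sketch: flag genes by thresholding posterior probabilities of differential
# expression so that the posterior expected FDR stays below a target. Synthetic inputs.
import numpy as np

rng = np.random.default_rng(7)
post_prob = rng.beta(0.3, 0.3, size=1000)     # posterior P(gene i is differentially expressed)
alpha = 0.10                                  # target posterior expected FDR

order = np.argsort(-post_prob)                # most promising genes first
sorted_p = post_prob[order]
# Posterior expected FDR when flagging the top k genes: mean of (1 - p) over those k.
fdr_k = np.cumsum(1 - sorted_p) / np.arange(1, len(sorted_p) + 1)
k = int(np.max(np.nonzero(fdr_k <= alpha)[0]) + 1) if np.any(fdr_k <= alpha) else 0
flagged = order[:k]
print(f"flagged {k} genes; posterior expected FDR = {fdr_k[k - 1]:.3f}" if k else "none flagged")
```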

Journal ArticleDOI
TL;DR: In this paper, the authors discuss the strengths and weaknesses of design-based and model-based inference for surveys and suggest that models that take into account the sample design and make weak parametric assumptions can produce reliable and efficient inferences in survey settings.
Abstract: Finite population sampling is perhaps the only area of statistics in which the primary mode of analysis is based on the randomization distribution, rather than on statistical models for the measured variables. This article reviews the debate between design-based and model-based inference. The basic features of the two approaches are illustrated using the case of inference about the mean from stratified random samples. Strengths and weaknesses of design-based and model-based inference for surveys are discussed. It is suggested that models that take into account the sample design and make weak parametric assumptions can produce reliable and efficient inferences in survey settings. These ideas are illustrated using the problem of inference from unequal probability samples. A model-based regression analysis that leads to a combination of design-based and model-based weighting is described.
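The unequal-probability example mentioned in the abstract can be illustrated numerically: under an informative, roughly size-proportional design, the design-based Horvitz-Thompson weighted mean stays near the population mean while the unweighted sample mean (a model that ignores the design) is badly biased. The population and design below are synthetic, and the inclusion probabilities are only approximate.

```python
# Hedged sketch: design-based (Horvitz-Thompson) vs. design-ignoring estimation of a
# population mean under an approximately size-proportional sample. Synthetic data.
import numpy as np

rng = np.random.default_rng(8)
N = 10_000
size_var = rng.gamma(2.0, 2.0, size=N)                # auxiliary "size" variable
y = 5.0 + 3.0 * size_var + rng.normal(size=N)         # y correlated with size

n = 500
pi = n * size_var / size_var.sum()                    # approximate inclusion probabilities
sample = rng.choice(N, size=n, replace=False, p=size_var / size_var.sum())  # approx PPS draw

ht_mean = np.sum(y[sample] / pi[sample]) / N          # design-based weighted estimate
naive_mean = y[sample].mean()                         # unweighted mean, ignoring the design
print("true mean      :", round(float(y.mean()), 2))
print("HT estimate    :", round(float(ht_mean), 2))
print("unweighted mean:", round(float(naive_mean), 2))
```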


Journal ArticleDOI
TL;DR: In this paper, the authors propose regime-switching space–time (RST) models to forecast wind speed and wind power at wind energy sites in the U.S. Pacific Northwest.
Abstract: With the global proliferation of wind power, the need for accurate short-term forecasts of wind resources at wind energy sites is becoming paramount. Regime-switching space–time (RST) models merge meteorological and statistical expertise to obtain accurate and calibrated, fully probabilistic forecasts of wind speed and wind power. The model formulation is parsimonious, yet takes into account all of the salient features of wind speed: alternating atmospheric regimes, temporal and spatial correlation, diurnal and seasonal nonstationarity, conditional heteroscedasticity, and non-Gaussianity. The RST method identifies forecast regimes at a wind energy site and fits a conditional predictive model for each regime. Geographically dispersed meteorological observations in the vicinity of the wind farm are used as off-site predictors. The RST technique was applied to 2-hour-ahead forecasts of hourly average wind speed near the Stateline wind energy center in the U. S. Pacific Northwest. The RST point forecasts and ...
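The structure of a regime-switching forecast, identify a regime from information available at forecast time, then apply a regime-specific predictive equation, can be sketched on synthetic data. The toy below classifies each time point by whether an off-site predictor was strengthening or easing two hours earlier and fits a separate linear forecast per regime; it only gestures at the RST approach and omits the calibrated probabilistic output and meteorological detail of the article.

```python
# Hedged sketch: a two-regime, two-step-ahead point forecast with regime-specific
# linear predictors. Synthetic data; illustrative of the structure only.
import numpy as np

rng = np.random.default_rng(9)
T = 2000
offsite = np.sin(np.arange(T) / 50.0) + rng.normal(scale=0.3, size=T)     # off-site wind speed
regime = (np.roll(offsite, 2) - np.roll(offsite, 3) > 0).astype(int)      # strengthening vs easing
wind = 5 + (0.5 + regime) * np.roll(offsite, 2) + rng.normal(scale=0.5, size=T)

# Forecast wind[t] from information available at t-2: lagged on-site and off-site speeds.
lag_wind, lag_off = np.roll(wind, 2), np.roll(offsite, 2)
X = np.column_stack([np.ones(T), lag_wind, lag_off])[3:]                  # drop wrap-around rows
y, reg = wind[3:], regime[3:]

coef = {r: np.linalg.lstsq(X[reg == r], y[reg == r], rcond=None)[0] for r in (0, 1)}
pred = np.array([X[i] @ coef[int(reg[i])] for i in range(len(y))])
print("RMSE of regime-switching point forecasts:",
      round(float(np.sqrt(np.mean((pred - y) ** 2))), 3))
```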

Journal ArticleDOI
TL;DR: This work discusses examples in censored and truncated data, mixture modeling, multivariate imputation, stochastic processes, and multilevel models.
Abstract: Progress in statistical computation often leads to advances in statistical modeling. For example, it is surprisingly common that an existing model is reparameterized, solely for computational purposes, but then this new configuration motivates a new family of models that is useful in applied statistics. One reason why this phenomenon may not have been noticed in statistics is that reparameterizations do not change the likelihood. In a Bayesian framework, however, a transformation of parameters typically suggests a new family of prior distributions. We discuss examples in censored and truncated data, mixture modeling, multivariate imputation, stochastic processes, and multilevel models.

Journal ArticleDOI
TL;DR: A “borrow-strength estimation procedure” is proposed, first estimating the value of the latent variable from recurrent event data and then using the estimated value in the failure time model; the approach is flexible in that no parametric assumptions on the distributions of censoring times and latent variables are made.
Abstract: Recurrent event data are commonly encountered in longitudinal follow-up studies related to biomedical science, econometrics, reliability, and demography. In many studies, recurrent events serve as important measurements for evaluating disease progression, health deterioration, or insurance risk. When analyzing recurrent event data, an independent censoring condition is typically required for the construction of statistical methods. In some situations, however, the terminating time for observing recurrent events could be correlated with the recurrent event process, thus violating the assumption of independent censoring. In this article, we consider joint modeling of a recurrent event process and a failure time in which a common subject-specific latent variable is used to model the association between the intensity of the recurrent event process and the hazard of the failure time. The proposed joint model is flexible in that no parametric assumptions on the distributions of censoring times and latent variab...

Journal ArticleDOI
TL;DR: In this article, the authors provide improvements in the semiparametric modeling of time series data on air pollution and health, whose findings represent a critical component of the evidence used in the PM Criteria Document.
Abstract: In 2002, methodological issues around time series analyses of air pollution and health attracted the attention of the scientific community, policy makers, the press, and the diverse stakeholders concerned with air pollution. As the U. S. Environmental Protection Agency (EPA) was finalizing its most recent review of epidemiologic evidence on particulate matter air pollution (PM), statisticians and epidemiologists found that the S–PLUS implementation of generalized additive models (GAMs) can overestimate effects of air pollution and understate statistical uncertainty in time series studies of air pollution and health. This discovery delayed completion of the PM Criteria Document prepared as part of the review of the U. S. National Ambient Air Quality Standard, because the time series findings represented a critical component of the evidence. In addition, it raised concerns about the adequacy of current model formulations and their software implementations. In this article we provide improvements in semipara...

Journal ArticleDOI
TL;DR: The purpose of this article is to develop a line of argument that demonstrates that probability theory has a sufficiently rich structure for incorporating fuzzy sets within its framework, and thus probabilities of fuzzy events can be logically induced.
Abstract: The notion of fuzzy sets has proven useful in the context of control theory, pattern recognition, and medical diagnosis. However, it has also spawned the view that classical probability theory is unable to deal with uncertainties in natural language and machine learning, so that alternatives to probability are needed. One such alternative is what is known as “possibility theory.” Such alternatives have come into being because past attempts at making fuzzy set theory and probability theory work in concert have been unsuccessful. The purpose of this article is to develop a line of argument that demonstrates that probability theory has a sufficiently rich structure for incorporating fuzzy sets within its framework. Thus probabilities of fuzzy events can be logically induced. The philosophical underpinnings that make this happen are a subjectivistic interpretation of probability, an introduction of Laplace's famous genie, and the mathematics of encoding expert testimony. The benefit of making probability theo...