
Showing papers in "Journal of the American Statistical Association in 2007"


Journal ArticleDOI
TL;DR: A review of the book Statistical Methods in the Atmospheric Sciences.
Abstract: (2007). Statistical Methods in the Atmospheric Sciences. Journal of the American Statistical Association: Vol. 102, No. 477, pp. 380-380.

7,052 citations


Journal ArticleDOI
TL;DR: A review of the book Bioinformatics and Computational Biology Solutions Using R and Bioconductor.
Abstract: (2007). Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Journal of the American Statistical Association: Vol. 102, No. 477, pp. 388-389.

1,743 citations


Journal ArticleDOI
TL;DR: This paper analyzed data from 125,000 pedestrian stops by the New York Police Department over a 15-month period and compared stop rates by racial and ethnic groups, controlling for previous race-specific arrest rates.
Abstract: Recent studies by police departments and researchers confirm that police stop persons of racial and ethnic minority groups more often than whites relative to their proportions in the population. However, it has been argued that stop rates more accurately reflect rates of crimes committed by each ethnic group, or that stop rates reflect elevated rates in specific social areas, such as neighborhoods or precincts. Most of the research on stop rates and police–citizen interactions has focused on traffic stops, and analyses of pedestrian stops are rare. In this article we analyze data from 125,000 pedestrian stops by the New York Police Department over a 15-month period. We disaggregate stops by police precinct and compare stop rates by racial and ethnic group, controlling for previous race-specific arrest rates. We use hierarchical multilevel models to adjust for precinct-level variability, thus directly addressing the question of geographic heterogeneity that arises in the analysis of pedestrian stops. We fi...

669 citations


Journal ArticleDOI
TL;DR: In this article, the authors developed an information criterion for determining the number q of common shocks in the general dynamic factor model developed by Forni et al., as opposed to the restricted dynamic model considered by Bai and Ng and by Amengual and Watson.
Abstract: This article develops an information criterion for determining the number q of common shocks in the general dynamic factor model developed by Forni et al., as opposed to the restricted dynamic model considered by Bai and Ng and by Amengual and Watson. Our criterion is based on the fact that this number q is also the number of diverging eigenvalues of the spectral density matrix of the observations as the number n of series goes to infinity. We provide sufficient conditions for consistency of the criterion for large n and T (where T is the series length). We show how the method can be implemented and provide simulations and empirics illustrating its very good finite-sample performance. Application to real data adds a new empirical facet to an ongoing debate on the number of factors driving the U.S. economy.

581 citations



Journal ArticleDOI
TL;DR: In this article, a hierarchical model for the intensity and frequency of extreme precipitation events in a region in Colorado is presented, where the authors assume that the regional extreme precipitation is driven by a latent spatial process characterized by geographical and climatological covariates.
Abstract: Quantification of precipitation extremes is important for flood planning purposes, and a common measure of extreme events is the r-year return level. We present a method for producing maps of precipitation return levels and uncertainty measures and apply it to a region in Colorado. Separate hierarchical models are constructed for the intensity and the frequency of extreme precipitation events. For intensity, we model daily precipitation above a high threshold at 56 weather stations with the generalized Pareto distribution. For frequency, we model the number of exceedances at the stations as binomial random variables. Both models assume that the regional extreme precipitation is driven by a latent spatial process characterized by geographical and climatological covariates. Effects not fully described by the covariates are captured by spatial structure in the hierarchies. Spatial methods were improved by working in a space with climatological coordinates. Inference is provided by a Markov chain Monte Carlo ...

513 citations
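
As a minimal illustration of the two building blocks described in the abstract above (not the article's hierarchical spatial model), the sketch below fits a generalized Pareto distribution to threshold exceedances and a binomial exceedance frequency at a single synthetic station, then computes an r-year return level; the data, threshold choice, and return period are invented for the sketch.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
daily_precip = rng.gamma(shape=0.4, scale=8.0, size=20 * 365)   # synthetic daily totals
threshold = np.quantile(daily_precip, 0.98)                     # a high threshold

excess = daily_precip[daily_precip > threshold] - threshold
# Fit the generalized Pareto distribution to the excesses over the threshold.
shape, _, scale = stats.genpareto.fit(excess, floc=0.0)

# Exceedance frequency: fraction of days above the threshold (binomial MLE).
p_exceed = excess.size / daily_precip.size

# r-year return level: the level exceeded on average once every r years.
r = 50
m = r * 365.25 * p_exceed            # expected number of exceedances in r years
return_level = threshold + stats.genpareto.ppf(1 - 1 / m, shape, loc=0.0, scale=scale)
print(f"estimated {r}-year return level: {return_level:.1f}")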


Journal ArticleDOI
Bing Li, Shaoli Wang
TL;DR: In this paper, the authors introduce directional regression (DR) as a method for dimension reduction, which is derived from empirical directions, but achieves higher accuracy and requires substantially less computation.
Abstract: We introduce directional regression (DR) as a method for dimension reduction. Like contour regression, DR is derived from empirical directions, but achieves higher accuracy and requires substantially less computation. DR naturally synthesizes the dimension reduction estimators based on conditional moments, such as sliced inverse regression and sliced average variance estimation, and in doing so combines the advantages of these methods. Under mild conditions, it provides an exhaustive and √n-consistent estimate of the dimension reduction space. We develop the asymptotic distribution of the DR estimator, and from that a sequential test procedure to determine the dimension of the central space. We compare the performance of DR with that of existing methods by simulation and find strong evidence of its advantage over a wide range of models. Finally, we apply DR to analyze a data set concerning the identification of hand-written digits.

459 citations
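
Directional regression itself is not reproduced here, but the sketch below illustrates sliced inverse regression, one of the conditional-moment estimators that the abstract says DR synthesizes, on simulated data with a one-dimensional central subspace; the design, the sin link, and the choice of H = 10 slices are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 6
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[0] = 1.0                     # true central subspace: span{e1}
y = np.sin(X @ beta) + 0.1 * rng.normal(size=n)

Sigma = np.cov(X, rowvar=False)
A = np.linalg.inv(np.linalg.cholesky(Sigma).T)        # one choice of Sigma^{-1/2}
Z = (X - X.mean(axis=0)) @ A                          # standardized predictors

H = 10                                                # number of slices
order = np.argsort(y)
slice_means = np.array([Z[idx].mean(axis=0) for idx in np.array_split(order, H)])
M = slice_means.T @ slice_means / H                   # SIR kernel matrix

eigvals, eigvecs = np.linalg.eigh(M)
direction = A @ eigvecs[:, -1]                        # leading direction on the X scale (up to sign)
print("estimated direction:", np.round(direction / np.linalg.norm(direction), 2))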


Journal ArticleDOI
TL;DR: In a randomized experiment comparing two treatments, there is interference between units if applying the treatment to one unit may affect other units as discussed by the authors, which implies that treatment effects are not comparisons of two potential responses that a unit may exhibit, one under treatment and the other under control.
Abstract: In a randomized experiment comparing two treatments, there is interference between units if applying the treatment to one unit may affect other units. Interference implies that treatment effects are not comparisons of two potential responses that a unit may exhibit, one under treatment and the other under control, but instead are inherently more complex. Interference is common in social settings where people communicate, compete, or spread disease; in studies that treat one part of an organism using a symmetrical part as control; in studies that apply different treatments to the same organism at different times; and in many other situations. Available statistical tools are limited. For instance, Fisher's sharp null hypothesis of no treatment effect implicitly entails no interference, and so his randomization test may be used to test no effect, but conventional ways of inverting the test to obtain confidence intervals, say for an additive effect, are not applicable with interference. Another commonly used ...

372 citations
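
For context, here is a minimal sketch of the Fisher randomization test of the sharp null of no treatment effect mentioned in the abstract above; as the abstract notes, the sharp null implicitly entails no interference. The data and the completely randomized two-arm design are assumptions of the sketch.

import numpy as np

rng = np.random.default_rng(2)
n = 60
treat = rng.permutation(np.r_[np.ones(n // 2, int), np.zeros(n // 2, int)])
outcome = rng.normal(size=n) + 0.8 * treat            # synthetic responses

observed = outcome[treat == 1].mean() - outcome[treat == 0].mean()

# Under the sharp null every unit's outcome is fixed regardless of assignment,
# so we can re-randomize the treatment labels and recompute the statistic.
null_stats = []
for _ in range(10_000):
    perm = rng.permutation(treat)
    null_stats.append(outcome[perm == 1].mean() - outcome[perm == 0].mean())

p_value = np.mean(np.abs(null_stats) >= abs(observed))
print(f"difference in means: {observed:.2f}, randomization p-value: {p_value:.4f}")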


Journal ArticleDOI
TL;DR: The robust truncated hinge loss SVM (RSVM) is proposed, which is shown to be more robust to outliers and to deliver more accurate classifiers using a smaller set of SVs than the standard SVM.
Abstract: The support vector machine (SVM) has been widely applied for classification problems in both machine learning and statistics. Despite its popularity, however, SVM has some drawbacks in certain situations. In particular, the SVM classifier can be very sensitive to outliers in the training sample. Moreover, the number of support vectors (SVs) can be very large in many applications. To circumvent these drawbacks, we propose the robust truncated hinge loss SVM (RSVM), which uses a truncated hinge loss. The RSVM is shown to be more robust to outliers and to deliver more accurate classifiers using a smaller set of SVs than the standard SVM. Our theoretical results show that the RSVM is Fisher-consistent, even when there is no dominating class, a scenario that is particularly challenging for multicategory classification. Similar results are obtained for a class of margin-based classifiers.

367 citations
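
A small numerical illustration of the idea behind a truncated hinge loss: writing it as the difference of two hinge functions caps the loss at a constant, so a badly misclassified outlier no longer dominates the objective. The truncation point s = 0 below is an arbitrary choice for the sketch, not necessarily the article's recommendation.

import numpy as np

def hinge(u):
    return np.maximum(1.0 - u, 0.0)

def truncated_hinge(u, s=0.0):
    # difference of two hinge functions; equals min(hinge(u), 1 - s)
    return hinge(u) - np.maximum(s - u, 0.0)

margins = np.array([-5.0, -1.0, 0.0, 0.5, 2.0])            # y * f(x) for a few training points
print("hinge loss:          ", hinge(margins))             # the outlier at -5 dominates
print("truncated hinge loss:", truncated_hinge(margins))   # its contribution is capped at 1 - s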


Journal ArticleDOI
TL;DR: A least squares approximation (LSA) method is proposed for which the standard asymptotic theory can be established and the LARS algorithm can be applied; if the adaptive LASSO penalty and a Bayes information criterion–type tuning parameter selector are used, the resulting LSA estimator can be as efficient as the oracle.
Abstract: We propose a method of least squares approximation (LSA) for unified yet simple LASSO estimation. Our general theoretical framework includes ordinary least squares, generalized linear models, quantile regression, and many others as special cases. Specifically, LSA can transfer many different types of LASSO objective functions into their asymptotically equivalent least squares problems. Thereafter, the standard asymptotic theory can be established and the LARS algorithm can be applied. In particular, if the adaptive LASSO penalty and a Bayes information criterion–type tuning parameter selector are used, the resulting LSA estimator can be as efficient as the oracle. Extensive numerical studies confirm our theory.

352 citations
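
The sketch below illustrates two ingredients named above, the adaptive LASSO penalty and a BIC-type tuning parameter selector, in the ordinary least squares special case only; it does not implement the general LSA transfer to other models, and the simulated design is arbitrary.

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(3)
n, p = 200, 8
X = rng.normal(size=(n, p))
beta = np.array([3.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 2.0])
y = X @ beta + rng.normal(size=n)

# Adaptive weights from an initial consistent estimate (here: OLS).
beta_init = LinearRegression(fit_intercept=False).fit(X, y).coef_
weights = 1.0 / np.abs(beta_init)
Xw = X / weights                                   # a weighted LASSO solved by rescaling columns

best = None
for lam in np.logspace(-3, 1, 50):
    coef_w = Lasso(alpha=lam, fit_intercept=False, max_iter=10_000).fit(Xw, y).coef_
    coef = coef_w / weights                        # back to the original scale
    rss = np.sum((y - X @ coef) ** 2)
    df = np.count_nonzero(coef)
    bic = n * np.log(rss / n) + np.log(n) * df     # BIC-type criterion
    if best is None or bic < best[0]:
        best = (bic, lam, coef)
print("selected coefficients:", np.round(best[2], 2))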


Journal ArticleDOI
TL;DR: A compound decision theory framework for multiple-testing problems is developed and an oracle rule based on the z values is derived that minimizes the false nondiscovery rate (FNR) and is more efficient than the conventional p value–based methods.
Abstract: We develop a compound decision theory framework for multiple-testing problems and derive an oracle rule based on the z values that minimizes the false nondiscovery rate (FNR) subject to a constraint on the false discovery rate (FDR). We show that many commonly used multiple-testing procedures, which are p value–based, are inefficient, and propose an adaptive procedure based on the z values. The z value–based adaptive procedure asymptotically attains the performance of the z value oracle procedure and is more efficient than the conventional p value–based methods. We investigate the numerical performance of the adaptive procedure using both simulated and real data. In particular, we demonstrate our method in an analysis of the microarray data from a human immunodeficiency virus study that involves testing a large number of hypotheses simultaneously.
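
For reference, the conventional p value–based baseline that the article compares against is exemplified by the Benjamini–Hochberg step-up procedure, sketched below; the article's z value–based oracle and adaptive rules are not reproduced here.

import numpy as np

def benjamini_hochberg(pvals, q=0.1):
    """Boolean array marking the hypotheses rejected by BH at FDR level q."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()             # largest i with p_(i) <= q*i/m
        reject[order[: k + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74, 0.96]
print(benjamini_hochberg(pvals, q=0.05))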

Journal ArticleDOI
TL;DR: Efron, B., and Tibshirani, R. (1993), An Introduction to the Bootstrap, New York: Chapman & Hall as mentioned in this paper, and Franke, J., and Härdle, W. (1992), “On Bootstrapping Kernel Estimates,” The Annals of Statistics, 20, 121-145.
Abstract: Davison, A. C., and Hinkley, D. V. (1997), Bootstrap Methods and Their Application, Cambridge, U.K.: Cambridge University Press. Efron, B., and Tibshirani, R. (1993), An Introduction to the Bootstrap, New York: Chapman & Hall. Franke, J., and Härdle, W. (1992), “On Bootstrapping Kernel Estimates,” The Annals of Statistics, 20, 121–145. Rissanen, J. (1983), “A Universal Prior for Integers and Estimation by Minimum Description Length,” The Annals of Statistics, 11, 416–431.

Journal ArticleDOI
TL;DR: In this paper, the authors present inference procedures for evaluating binary classification rules based on various prediction precision measures quantified by the overall misclassification rate, sensitivity and specificity, and positive and negative predictive values.
Abstract: Suppose that we are interested in establishing simple but reliable rules for predicting future t-year survivors through censored regression models. In this article we present inference procedures for evaluating such binary classification rules based on various prediction precision measures quantified by the overall misclassification rate, sensitivity and specificity, and positive and negative predictive values. Specifically, under various working models, we derive consistent estimators for the above measures through substitution and cross-validation estimation procedures. Furthermore, we provide large-sample approximations to the distributions of these nonsmooth estimators without assuming that the working model is correctly specified. Confidence intervals, for example, for the difference of the precision measures between two competing rules can then be constructed. All of the proposals are illustrated with real examples, and their finite-sample properties are evaluated through a simulation study.
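
The precision measures themselves are elementary; the sketch below computes them for a binary rule on fully observed labels. The article's actual contribution, consistent estimation under censoring and large-sample theory for these nonsmooth estimators, is not attempted here, and the ten hypothetical patients are invented.

import numpy as np

def precision_measures(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, bool), np.asarray(y_pred, bool)
    tp = np.sum(y_pred & y_true)
    tn = np.sum(~y_pred & ~y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    return {
        "misclassification rate": (fp + fn) / y_true.size,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }

# predicted vs. actual t-year survival for ten hypothetical patients
print(precision_measures([1, 1, 0, 0, 1, 0, 1, 1, 0, 0],
                         [1, 0, 0, 1, 1, 0, 1, 1, 0, 0]))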

Journal ArticleDOI
TL;DR: Distance-weighted discrimination, which is based on second-order cone programming, a modern computationally intensive optimization method, is developed.
Abstract: High-dimension low–sample size statistical analysis is becoming increasingly important in a wide range of applied contexts. In such situations, the popular support vector machine suffers from "data piling" at the margin, which can diminish generalizability. This leads naturally to the development of distance-weighted discrimination, which is based on second-order cone programming, a modern computationally intensive optimization method.
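
Below is a minimal sketch of one standard distance-weighted discrimination formulation, posed as a convex second-order cone–representable program and solved with cvxpy; the simulated high-dimension low-sample-size data and the penalty constant C are arbitrary choices made for the sketch, not the article's setup.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n, d = 40, 200                                  # high dimension, low sample size
y = np.r_[np.ones(n // 2), -np.ones(n // 2)]
X = rng.normal(size=(n, d)) + 0.4 * y[:, None]  # two slightly separated classes

w, b = cp.Variable(d), cp.Variable()
xi = cp.Variable(n, nonneg=True)
resid = cp.multiply(y, X @ w + b) + xi          # per-point "distance" r_i (kept positive)
C = 100.0
problem = cp.Problem(cp.Minimize(cp.sum(cp.inv_pos(resid)) + C * cp.sum(xi)),
                     [cp.norm(w, 2) <= 1])
problem.solve()

pred = np.sign(X @ w.value + b.value)
print("training accuracy:", np.mean(pred == y))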

Journal ArticleDOI
TL;DR: A review of the book Test Equating, Scaling, and Linking: Methods and Practices.
Abstract: (2007). Test Equating, Scaling, and Linking: Methods and Practices. Journal of the American Statistical Association: Vol. 102, No. 478, pp. 762-763.

Journal ArticleDOI
TL;DR: In this article, two versions of functional Principal Component Regression (PCR) are developed, both using B-splines and roughness penalties, and the regularized-components version applies such a penalty to the construction of the principal components.
Abstract: Regression of a scalar response on signal predictors, such as near-infrared (NIR) spectra of chemical samples, presents a major challenge when, as is typically the case, the dimension of the signals far exceeds their number. Most solutions to this problem reduce the dimension of the predictors either by regressing on components [e.g., principal component regression (PCR) and partial least squares (PLS)] or by smoothing methods, which restrict the coefficient function to the span of a spline basis. This article introduces functional versions of PCR and PLS, which combine both of the foregoing dimension-reduction approaches. Two versions of functional PCR are developed, both using B-splines and roughness penalties. The regularized-components version applies such a penalty to the construction of the principal components (i.e., it uses functional principal components), whereas the regularized-regression version incorporates a penalty in the regression. For the latter form of functional PCR, the penalty parame...
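
As a point of reference for the functional versions described above, the sketch below runs ordinary principal component regression on discretized signal predictors; the B-spline bases and roughness penalties that define the article's regularized variants are omitted, and the simulated "spectra" are invented.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(5)
n, m = 100, 300                                        # 100 samples, 300-point signals
t = np.linspace(0, 1, m)
scores = rng.normal(size=(n, 2))
signals = (scores[:, [0]] * np.sin(2 * np.pi * t)      # smooth curves built from
           + scores[:, [1]] * np.cos(4 * np.pi * t)    # two underlying components
           + 0.05 * rng.normal(size=(n, m)))
response = 2.0 * scores[:, 0] - 1.0 * scores[:, 1] + 0.1 * rng.normal(size=n)

pcr = make_pipeline(PCA(n_components=2), LinearRegression())
pcr.fit(signals, response)
print("in-sample R^2:", round(pcr.score(signals, response), 3))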

Journal ArticleDOI
TL;DR: In this article, periodic extensions of dynamic long-memory regression models with autoregressive conditional heteroscedastic errors are considered for the analysis of daily electricity spot prices, and the parameters of the model with mean and variance specifications are estimated simultaneously by the method of approximate maximum likelihood.
Abstract: Novel periodic extensions of dynamic long-memory regression models with autoregressive conditional heteroscedastic errors are considered for the analysis of daily electricity spot prices. The parameters of the model with mean and variance specifications are estimated simultaneously by the method of approximate maximum likelihood. The methods are implemented for time series of 1,200–4,400 daily price observations in four European power markets. Apart from persistence, heteroscedasticity, and extreme observations in prices, a novel empirical finding is the importance of day-of-the-week periodicity in the autocovariance function of electricity spot prices. In particular, the very persistent daily log prices from the Nord Pool power exchange of Norway are effectively modeled by our framework, which is also extended with explanatory variables to capture supply-and-demand effects. The daily log prices of the other three electricity markets—EEX in Germany, Powernext in France, and APX in The Netherlands—are less...

Journal ArticleDOI
TL;DR: A novel shotgun stochastic search (SSS) approach that explores “interesting” regions of the resulting high-dimensional model spaces and quickly identifies regions of high posterior probability over models.
Abstract: Model search in regression with very large numbers of candidate predictors raises challenges for both model specification and computation, for which standard approaches such as Markov chain Monte Carlo (MCMC) methods are often infeasible or ineffective. We describe a novel shotgun stochastic search (SSS) approach that explores “interesting” regions of the resulting high-dimensional model spaces and quickly identifies regions of high posterior probability over models. We describe algorithmic and modeling aspects, priors over the model space that induce sparsity and parsimony over and above the traditional dimension penalization implicit in Bayesian and likelihood analyses, and parallel computation using cluster computers. We discuss an example from gene expression cancer genomics, comparisons with MCMC and other methods, and theoretical and simulation-based aspects of performance characteristics in large-scale regression model searches. We also provide software implementing the methods.

Journal ArticleDOI
TL;DR: Some of the main adaptations of the multiple-imputation framework, including missing data in large and small samples, data confidentiality, and measurement error, are described, and the combining rules for each setting are reviewed and explained.
Abstract: Multiple imputation was first conceived as a tool that statistical agencies could use to handle nonresponse in large-sample public use surveys. In the last two decades, the multiple-imputation framework has been adapted for other statistical contexts. For example, individual researchers use multiple imputation to handle missing data in small samples, statistical agencies disseminate multiply-imputed data sets for purposes of protecting data confidentiality, and survey methodologists and epidemiologists use multiple imputation to correct for measurement errors. In some of these settings, Rubin's original rules for combining the point and variance estimates from the multiply-imputed data sets are not appropriate, because what is known—and thus the conditional expectations and variances used to derive inferential methods—differs from that in the missing-data context. These applications require new combining rules and methods of inference. In fact, more than 10 combining rules exist in the published literatur...
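
Rubin's original combining rules, which the article takes as the starting point, reduce to a few lines; the sketch below implements them for m completed-data point estimates and variances. The newer settings discussed in the article require different combining rules that are not shown here, and the numbers below are invented.

import numpy as np

def rubin_combine(estimates, variances):
    q = np.asarray(estimates, float)       # point estimate from each completed data set
    u = np.asarray(variances, float)       # its estimated variance in each data set
    m = q.size
    qbar = q.mean()                        # combined point estimate
    ubar = u.mean()                        # within-imputation variance
    b = q.var(ddof=1)                      # between-imputation variance
    total_var = ubar + (1 + 1 / m) * b
    df = (m - 1) * (1 + ubar / ((1 + 1 / m) * b)) ** 2   # Rubin's degrees of freedom
    return qbar, total_var, df

print(rubin_combine([2.1, 1.8, 2.4, 2.0, 2.2], [0.25, 0.22, 0.27, 0.24, 0.26]))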

Journal ArticleDOI
TL;DR: A version of Whittle's approximation to the Gaussian log-likelihood for spatial regular lattices with missing values and for irregularly spaced datasets, which requires O(n log2 n) operations and does not involve calculating determinants.
Abstract: Likelihood approaches for large, irregularly spaced spatial datasets are often very difficult, if not infeasible, to implement due to computational limitations. Even when we can assume normality, exact calculations of the likelihood for a Gaussian spatial process observed at n locations requires O(n³) operations. We present a version of Whittle's approximation to the Gaussian log-likelihood for spatial regular lattices with missing values and for irregularly spaced datasets. This method requires O(n log2 n) operations and does not involve calculating determinants. We present simulations and theoretical results to show the benefits and the performance of the spatial likelihood approximation method presented here for spatial irregularly spaced datasets and lattices with missing values. We apply these methods to estimate the spatial structure of sea surface temperatures using satellite data with missing values.
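
The Whittle idea in its simplest one-dimensional form is sketched below: the Gaussian log-likelihood of a stationary series is approximated by matching the periodogram to the model spectral density at the Fourier frequencies. This is only an AR(1) time-series illustration with invented data; the article's extension to spatial lattices with missing values and irregularly spaced data is not reproduced.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)
n, phi_true = 2000, 0.6
x = np.zeros(n)
for t in range(1, n):                                    # simulate an AR(1) series
    x[t] = phi_true * x[t - 1] + rng.normal()

freqs = 2 * np.pi * np.arange(1, n // 2) / n             # positive Fourier frequencies
periodogram = np.abs(np.fft.fft(x)[1 : n // 2]) ** 2 / (2 * np.pi * n)

def neg_whittle_loglik(phi, sigma2=1.0):
    # AR(1) spectral density evaluated at the Fourier frequencies
    spec = sigma2 / (2 * np.pi * (1 - 2 * phi * np.cos(freqs) + phi**2))
    return np.sum(np.log(spec) + periodogram / spec)

fit = minimize_scalar(neg_whittle_loglik, bounds=(-0.99, 0.99), method="bounded")
print("Whittle estimate of phi:", round(fit.x, 3))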

Journal ArticleDOI
TL;DR: In this article, a class of semiparametric models for the covariance function that imposes a parametric correlation structure while allowing a nonparametric variance function is proposed, and a kernel estimator is developed.
Abstract: Improving efficiency for regression coefficients and predicting trajectories of individuals are two important aspects in the analysis of longitudinal data. Both involve estimation of the covariance function. Yet challenges arise in estimating the covariance function of longitudinal data collected at irregular time points. A class of semiparametric models for the covariance function that imposes a parametric correlation structure while allowing a nonparametric variance function is proposed. A kernel estimator for estimating the nonparametric variance function is developed. Two methods for estimating parameters in the correlation structure—a quasi-likelihood approach and a minimum generalized variance method—are proposed. A semiparametric varying coefficient partially linear model for longitudinal data is introduced, and an estimation procedure for model coefficients using a profile weighted least squares approach is proposed. Sampling properties of the proposed estimation procedures are studied, and asy...


Journal ArticleDOI
TL;DR: The stochastic approximation Monte Carlo (SAMC) algorithm is proposed, which overcomes the shortcomings of the WL algorithm and establishes a theorem concerning its convergence.
Abstract: The Wang–Landau (WL) algorithm is an adaptive Markov chain Monte Carlo algorithm used to calculate the spectral density for a physical system. A remarkable feature of the WL algorithm is that it is not trapped by local energy minima, which is very important for systems with rugged energy landscapes. This feature has led to many successful applications of the algorithm in statistical physics and biophysics; however, there does not exist rigorous theory to support its convergence, and the estimates produced by the algorithm can reach only a limited statistical accuracy. In this article we propose the stochastic approximation Monte Carlo (SAMC) algorithm, which overcomes the shortcomings of the WL algorithm. We establish a theorem concerning its convergence. The estimates produced by SAMC can be improved continuously as the simulation proceeds. SAMC also extends applications of the WL algorithm to continuum systems. The potential uses of SAMC in statistics are discussed through two classes of applications, i...

Journal ArticleDOI
TL;DR: A penalized likelihood approach for variable selection in FMR models is proposed; it introduces penalties that depend on the size of the regression coefficients and the mixture structure and requires much less computing power than existing methods.
Abstract: In the applications of finite mixture of regression (FMR) models, often many covariates are used, and their contributions to the response variable vary from one component to another of the mixture model. This creates a complex variable selection problem. Existing methods, such as the Akaike information criterion and the Bayes information criterion, are computationally expensive as the number of covariates and components in the mixture model increases. In this article we introduce a penalized likelihood approach for variable selection in FMR models. The new method introduces penalties that depend on the size of the regression coefficients and the mixture structure. The new method is shown to be consistent for variable selection. A data-adaptive method for selecting tuning parameters and an EM algorithm for efficient numerical computations are developed. Simulations show that the method performs very well and requires much less computing power than existing methods. The new method is illustrated by analyzin...

Journal ArticleDOI
TL;DR: This paper developed a nonparametric foundation for assessing how assumptions on the reporting error process affect inferences on the employment gap between the disabled and nondisabled, and derived sets of bounds that formalize the identifying power of primitive non-parametric assumptions that appear to share broad consensus in the literature.
Abstract: Measurement error in health and disability status has been widely accepted as a central problem in social science research. Long-standing debates about the prevalence of disability, the role of health in labor market outcomes, and the influence of federal disability policy on declining employment rates have all emphasized issues regarding the reliability of self-reported disability. In addition to random error, inaccuracy in survey datasets may be produced by a host of economic, social, and psychological factors that can lead respondents to misreport work capacity. We develop a nonparametric foundation for assessing how assumptions on the reporting error process affect inferences on the employment gap between the disabled and nondisabled. Rather than imposing the strong assumptions required to obtain point identification, we derive sets of bounds that formalize the identifying power of primitive nonparametric assumptions that appear to share broad consensus in the literature. Within this framework, we int...

Journal ArticleDOI
TL;DR: The proposed wavelet methods can successfully remove the jumps in the price processes, the integrated volatility can be estimated as accurately as when no jumps are present, and the estimators have outstanding statistical efficiency.
Abstract: The wide availability of high-frequency data for many financial instruments stimulates an upsurge of interest in statistical research on the estimation of volatility. Jump-diffusion processes observed with market microstructure noise are frequently used to model high-frequency financial data. Yet existing methods are developed for either noisy data from a continuous-diffusion price model or data from a jump-diffusion price model without noise. We propose methods to cope with both jumps in the price and market microstructure noise in the observed data. These methods allow us to estimate both integrated volatility and jump variation from the data sampled from jump-diffusion price processes, contaminated with the market microstructure noise. Our approach is to first remove jumps from the data and then apply noise-resistant methods to estimate the integrated volatility. The asymptotic analysis and the simulation study reveal that the proposed wavelet methods can successfully remove the jumps in the price process
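
The sketch below is not the article's wavelet approach; it is a standard noise-free illustration of the quantities involved, in which realized variance picks up integrated volatility plus jump variation while bipower variation is robust to the jump, so their difference estimates the jump contribution. The simulated "trading day" and volatility level are invented.

import numpy as np

rng = np.random.default_rng(7)
n = 23_400                                   # one day of 1-second returns
sigma = 0.2 / np.sqrt(252)                   # constant daily volatility for simplicity
returns = sigma / np.sqrt(n) * rng.normal(size=n)
returns[n // 2] += 0.01                      # add one price jump

realized_var = np.sum(returns**2)
bipower_var = (np.pi / 2) * np.sum(np.abs(returns[1:]) * np.abs(returns[:-1]))

print("integrated variance (true):", round(sigma**2, 6))
print("bipower variation:         ", round(bipower_var, 6))
print("RV - BV (jump variation):  ", round(realized_var - bipower_var, 6))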

Journal ArticleDOI
TL;DR: A hierarchical testing procedure that first tests clusters, then tests locations within rejected clusters is developed; it is shown formally that this procedure controls the desired location error rate asymptotically, and extensive simulations support the conjecture that it does so in realistic settings as well.
Abstract: The problem of multiple testing for the presence of signal in spatial data can involve numerous locations. Traditionally, each location is tested separately for signal presence, but then the findings are reported in terms of clusters of nearby locations. This is an indication that the units of interest for testing are clusters rather than individual locations. The investigator may know a priori these more natural units or an approximation to them. We suggest testing these cluster units rather than individual locations, thus increasing the signal-to-noise ratio within the unit tested as well as reducing the number of hypothesis tests conducted. Because the signal may be absent from part of each cluster, we define a cluster as containing a signal if the signal is present somewhere within the cluster. We suggest controlling the false discovery rate (FDR) on clusters (i.e., the expected proportion of clusters rejected erroneously out of all clusters rejected) or its extension to general weights (WFDR). We int...

Journal ArticleDOI
TL;DR: A new class of models, mixed HMMs (MHMMs), where both covariates and random effects are used to capture differences among processes, is presented, and it is shown that the model can describe the heterogeneity among patients.
Abstract: Hidden Markov models (HMMs) are a useful tool for capturing the behavior of overdispersed, autocorrelated data. These models have been applied to many different problems, including speech recognition, precipitation modeling, and gene finding and profiling. Typically, HMMs are applied to individual stochastic processes; HMMs for simultaneously modeling multiple processes—as in the longitudinal data setting—have not been widely studied. In this article I present a new class of models, mixed HMMs (MHMMs), where I use both covariates and random effects to capture differences among processes. I define the models using the framework of generalized linear mixed models and discuss their interpretation. I then provide algorithms for parameter estimation and illustrate the properties of the estimators via a simulation study. Finally, to demonstrate the practical uses of MHMMs, I provide an application to data on lesion counts in multiple sclerosis patients. I show that my model, while parsimonious, can describe the...
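
The likelihood computation that underlies any HMM analysis, the scaled forward recursion, is sketched below for a plain two-state Poisson HMM with invented parameters and counts; the article's mixed HMMs add covariates and random effects on top of this machinery.

import numpy as np
from scipy.stats import poisson

def poisson_hmm_loglik(counts, init, trans, rates):
    """Log-likelihood of counts under a Poisson HMM via the scaled forward algorithm."""
    alpha = init * poisson.pmf(counts[0], rates)
    scale = alpha.sum()
    loglik = np.log(scale)
    alpha /= scale
    for y in counts[1:]:
        alpha = (alpha @ trans) * poisson.pmf(y, rates)
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha /= scale
    return loglik

counts = np.array([0, 1, 0, 0, 3, 5, 4, 0, 0, 1])      # e.g. hypothetical monthly lesion counts
init = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],                          # "quiescent" vs. "active" states
                  [0.2, 0.8]])
rates = np.array([0.3, 4.0])
print("log-likelihood:", round(poisson_hmm_loglik(counts, init, trans, rates), 3))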

Journal ArticleDOI
TL;DR: In this paper, the authors consider quantile regression in reproducing kernel Hilbert spaces, which they call kernel quantile regression (KQR), and propose an efficient algorithm that computes the entire solution path of the KQR, with essentially the same computational cost as fitting one KQR model.
Abstract: In this article we consider quantile regression in reproducing kernel Hilbert spaces, which we call kernel quantile regression (KQR). We make three contributions: (1) we propose an efficient algorithm that computes the entire solution path of the KQR, with essentially the same computational cost as fitting one KQR model; (2) we derive a simple formula for the effective dimension of the KQR model, which allows convenient selection of the regularization parameter; and (3) we develop an asymptotic theory for the KQR model.

Journal ArticleDOI
TL;DR: In this article, local empirical likelihood-based inference for a varying coefficient model with longitudinal data is investigated, and it is shown that the naive empirical likelihood ratio is asymptotically standard chi-squared when undersmoothing is employed.
Abstract: In this article local empirical likelihood-based inference for a varying coefficient model with longitudinal data is investigated. First, we show that the naive empirical likelihood ratio is asymptotically standard chi-squared when undersmoothing is employed. The ratio is self-scale invariant and the plug-in estimate of the limiting variance is not needed. Second, to enhance the performance of the ratio, mean-corrected and residual-adjusted empirical likelihood ratios are recommended. The merit of these two bias corrections is that without undersmoothing, both also have standard chi-squared limits. Third, a maximum empirical likelihood estimator (MELE) of the time-varying coefficient is defined, the asymptotic equivalence to the weighted least-squares estimator (WLSE) is provided, and the asymptotic normality is shown. By the empirical likelihood ratios and the normal approximation of the MELE/WLSE, the confidence regions of the time-varying coefficients are constructed. Fourth, when some components are o...
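
The basic device behind these ratios can be seen in its simplest form below: the empirical likelihood ratio for the mean of an i.i.d. sample, whose -2 log ratio is asymptotically chi-squared with one degree of freedom. The sample and the hypothesized mean are invented; the article's local versions for time-varying coefficients build on the same calibration but are not reproduced here.

import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

def el_ratio_statistic(x, mu):
    """-2 log empirical likelihood ratio for H0: E[X] = mu (mu must lie inside the data range)."""
    z = np.asarray(x, float) - mu
    # The Lagrange multiplier solves sum z_i / (1 + lam * z_i) = 0 with all weights positive.
    lo = -1.0 / z.max() + 1e-8
    hi = -1.0 / z.min() - 1e-8
    lam = brentq(lambda lam: np.sum(z / (1.0 + lam * z)), lo, hi)
    return 2.0 * np.sum(np.log1p(lam * z))

rng = np.random.default_rng(8)
sample = rng.exponential(scale=2.0, size=80)
stat = el_ratio_statistic(sample, mu=2.0)
print("statistic:", round(stat, 3), " p-value:", round(chi2.sf(stat, df=1), 3))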