
Showing papers in "Journal of the American Statistical Association in 2010"


Journal ArticleDOI
TL;DR: In this paper, the authors investigated the applicability of synthetic control methods to comparative case studies and found that, following Proposition 99, tobacco consumption fell markedly in California relative to a comparable synthetic control region, and that by the year 2000 annual per-capita cigarette sales in California were about 26 packs lower than what they would have been in the absence of Proposition 99.
Abstract: Building on an idea in Abadie and Gardeazabal (2003), this article investigates the application of synthetic control methods to comparative case studies. We discuss the advantages of these methods and apply them to study the effects of Proposition 99, a large-scale tobacco control program that California implemented in 1988. We demonstrate that, following Proposition 99, tobacco consumption fell markedly in California relative to a comparable synthetic control region. We estimate that by the year 2000 annual per-capita cigarette sales in California were about 26 packs lower than what they would have been in the absence of Proposition 99. Using new inferential methods proposed in this article, we demonstrate the significance of our estimates. Given that many policy interventions and events of interest in social sciences take place at an aggregate level (countries, regions, cities, etc.) and affect a small number of aggregate units, the potential applicability of synthetic control methods to comparative cas...

2,815 citations
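
The core computation behind a synthetic control is a constrained least-squares fit of donor weights. The sketch below is not the authors' code; it shows one generic way to obtain nonnegative weights that sum to one with `scipy.optimize.minimize`, using made-up predictor arrays `X0` and `X1`.

```python
# Minimal sketch of the synthetic control weighting step (illustrative only):
# choose nonnegative donor weights summing to one so that the weighted donors
# match the treated unit's pre-treatment predictors as closely as possible.
import numpy as np
from scipy.optimize import minimize

def synthetic_control_weights(X1, X0):
    """X1: (k,) predictors of the treated unit; X0: (k, J) predictors of J donor units."""
    J = X0.shape[1]
    loss = lambda w: np.sum((X1 - X0 @ w) ** 2)
    constraints = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},)
    bounds = [(0.0, 1.0)] * J
    res = minimize(loss, np.full(J, 1.0 / J), bounds=bounds, constraints=constraints)
    return res.x

# Hypothetical usage with random placeholder data.
rng = np.random.default_rng(0)
X0, X1 = rng.normal(size=(5, 20)), rng.normal(size=5)
w = synthetic_control_weights(X1, X0)
print(w.round(3))
```

The treatment-effect estimate is then the gap between the treated unit's observed outcome path and the weighted combination of donor outcomes.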


Journal ArticleDOI
TL;DR: A novel framework for sparse clustering is proposed, in which the observations are clustered using an adaptively chosen subset of the features, selected with a lasso-type penalty.
Abstract: We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse clustering, in which one clusters the observations using an adaptively chosen subset of the features. The method uses a lasso-type penalty to select the features. We use this framework to develop simple methods for sparse K-means and sparse hierarchical clustering. A single criterion governs both the selection of the features and the resulting clusters. These approaches are demonstrated on simulated and genomic data.

643 citations
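
The alternating scheme behind sparse K-means can be sketched in a few lines: cluster on weighted features, then soft-threshold each feature's between-cluster dispersion to update the weights. The code below is an illustrative simplification, not the authors' implementation; the sparsity bound `s`, the iteration count, and the use of scikit-learn's `KMeans` are assumptions.

```python
# Rough sketch of sparse K-means via lasso-type feature weighting (simplified).
import numpy as np
from sklearn.cluster import KMeans

def soft_threshold(a, delta):
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def feature_weights(X, labels, s):
    """Soft-thresholded per-feature between-cluster sum of squares, scaled to
    unit L2 norm with L1 norm at most s (threshold found by binary search)."""
    total_ss = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within_ss = np.zeros(X.shape[1])
    for k in np.unique(labels):
        Xk = X[labels == k]
        within_ss += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    a = total_ss - within_ss                  # between-cluster dispersion per feature
    lo, hi = 0.0, np.abs(a).max()
    for _ in range(50):
        delta = (lo + hi) / 2
        w = soft_threshold(a, delta)
        if np.linalg.norm(w) > 0:
            w = w / np.linalg.norm(w)
        if np.sum(np.abs(w)) > s:
            lo = delta
        else:
            hi = delta
    return w

def sparse_kmeans(X, n_clusters, s, n_iter=10):
    w = np.full(X.shape[1], 1.0 / np.sqrt(X.shape[1]))
    for _ in range(n_iter):
        labels = KMeans(n_clusters, n_init=10).fit_predict(X * np.sqrt(w))
        w = feature_weights(X, labels, s)
    return labels, w
```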


Journal ArticleDOI
TL;DR: In this article, a flexible parametric family of matrix-valued covariance functions for multivariate spatial random fields is introduced, where each constituent component is a Matérn process; the model parameters are interpretable in terms of process variance, smoothness, correlation length, and colocated correlation coefficients.
Abstract: We introduce a flexible parametric family of matrix-valued covariance functions for multivariate spatial random fields, where each constituent component is a Matérn process. The model parameters are interpretable in terms of process variance, smoothness, correlation length, and colocated correlation coefficients, which can be positive or negative. Both the marginal and the cross-covariance functions are of the Matérn type. In a data example on error fields for numerical predictions of surface pressure and temperature over the North American Pacific Northwest, we compare the bivariate Matérn model to the traditional linear model of coregionalization.

409 citations
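
As a rough illustration of this parameterization, the sketch below evaluates a Matérn correlation with `scipy.special.kv` and assembles marginal and cross covariances from separate smoothness, scale, and colocated-correlation parameters. It does not check the validity constraints on the cross-covariance parameters derived in the article, and all parameter values are placeholders.

```python
# Sketch of Matérn marginal and cross covariances (illustrative only).
import numpy as np
from scipy.special import kv, gamma

def matern(h, nu, a):
    """Matérn correlation at distance h with smoothness nu and scale parameter a."""
    h = np.atleast_1d(np.asarray(h, dtype=float))
    out = np.ones_like(h)
    pos = h > 0
    x = a * h[pos]
    out[pos] = (2 ** (1 - nu) / gamma(nu)) * (x ** nu) * kv(nu, x)
    return out

def bivariate_matern_cov(h, sig1, sig2, nu1, nu2, nu12, a1, a2, a12, rho):
    """Marginal covariances C11, C22 and cross covariance C12 at distance h."""
    C11 = sig1 ** 2 * matern(h, nu1, a1)
    C22 = sig2 ** 2 * matern(h, nu2, a2)
    C12 = rho * sig1 * sig2 * matern(h, nu12, a12)   # colocated correlation rho
    return C11, C12, C22

print(bivariate_matern_cov(0.5, sig1=1.0, sig2=2.0, nu1=0.5, nu2=1.5,
                           nu12=1.0, a1=1.0, a2=1.0, a12=1.0, rho=-0.3))
```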


Journal ArticleDOI
TL;DR: In this article, the authors develop inferentially practical, likelihood-based methods for fitting max-stable processes, derived from a composite-likelihood approach that is sufficiently reliable and versatile to permit the simultaneous modeling of marginal and dependence parameters in the spatial context at a moderate computational cost.
Abstract: The last decade has seen max-stable processes emerge as a common tool for the statistical modeling of spatial extremes. However, their application is complicated due to the unavailability of the multivariate density function, and so likelihood-based methods remain far from providing a complete and flexible framework for inference. In this article we develop inferentially practical, likelihood-based methods for fitting max-stable processes derived from a composite-likelihood approach. The procedure is sufficiently reliable and versatile to permit the simultaneous modeling of marginal and dependence parameters in the spatial context at a moderate computational cost. The utility of this methodology is examined via simulation, and illustrated by the analysis of United States precipitation extremes.

351 citations


Journal ArticleDOI
TL;DR: This work derives the distribution of the minimal depth and applies it to high-dimensional variable selection with random survival forests, developing a new regularized algorithm termed RSF-Variable Hunting.
Abstract: The minimal depth of a maximal subtree is a dimensionless order statistic measuring the predictiveness of a variable in a survival tree. We derive the distribution of the minimal depth and use it f...

346 citations
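
Minimal depth is simply the depth of the node closest to the root that splits on a given variable. The sketch below computes it for a scikit-learn regression tree used as a stand-in for a survival tree; the toy data and tree settings are arbitrary, and this shows only the bookkeeping, not the distributional theory or the random survival forest machinery of the article.

```python
# Illustrative computation of minimal depth in a single fitted tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def minimal_depths(tree, n_features):
    t = tree.tree_
    depths = np.full(n_features, np.inf)
    stack = [(0, 0)]                       # (node id, depth), starting at the root
    while stack:
        node, depth = stack.pop()
        if t.children_left[node] == -1:    # leaf node: no split here
            continue
        f = t.feature[node]
        depths[f] = min(depths[f], depth)  # record the shallowest split on feature f
        stack.append((t.children_left[node], depth + 1))
        stack.append((t.children_right[node], depth + 1))
    return depths

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X[:, 0] + 0.5 * X[:, 3] + rng.normal(size=200)
fit = DecisionTreeRegressor(max_depth=5).fit(X, y)
print(minimal_depths(fit, X.shape[1]))     # small values flag predictive variables
```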


Journal ArticleDOI
TL;DR: In this article, the authors consider differential privacy from a statistical perspective and compare several data-release mechanisms by computing the rate of convergence of distributions and densities constructed from the released data.
Abstract: One goal of statistical privacy research is to construct a data release mechanism that protects individual privacy while preserving information content. An example is a random mechanism that takes an input database X and outputs a random database Z according to a distribution Q_n(·|X). Differential privacy is a particular privacy requirement developed by computer scientists in which Q_n(·|X) is required to be insensitive to changes in one data point in X. This makes it difficult to infer from Z whether a given individual is in the original database X. We consider differential privacy from a statistical perspective. We consider several data-release mechanisms that satisfy the differential privacy requirement. We show that it is useful to compare these schemes by computing the rate of convergence of distributions and densities constructed from the released data. We study a general privacy method, called the exponential mechanism, introduced by McSherry and Talwar (2007). We show that the accuracy of this meth...

345 citations
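
The exponential mechanism referenced above releases an output with probability proportional to exp(eps * utility / (2 * sensitivity)). Below is a minimal, generic sketch; the candidate grid, the rank-based utility function, and the claimed sensitivity in the usage example are illustrative assumptions, not constructions from the article.

```python
# Minimal sketch of the exponential mechanism.
import numpy as np

def exponential_mechanism(candidates, utilities, eps, sensitivity, rng):
    scores = eps * np.asarray(utilities) / (2.0 * sensitivity)
    probs = np.exp(scores - scores.max())        # numerically stabilized softmax
    probs /= probs.sum()
    return rng.choice(candidates, p=probs)

# Hypothetical example: privately release an approximate median of x.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
grid = np.linspace(-3, 3, 61)
# Utility: negative distance between a candidate's rank and n/2 (assumed sensitivity 1).
utility = -np.abs(np.sum(x[:, None] <= grid, axis=0) - len(x) / 2)
print(exponential_mechanism(grid, utility, eps=1.0, sensitivity=1.0, rng=rng))
```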


Journal ArticleDOI
TL;DR: An improved practical version of this method is obtained by combining it with a reduced version of the dynamic programming algorithm, and it is proved that, in an appropriate asymptotic framework, the method provides consistent estimators of the change points at an almost optimal rate.
Abstract: We propose a new approach for dealing with the estimation of the location of change-points in one-dimensional piecewise constant signals observed in white noise. Our approach consists in reframing this task in a variable selection context. We use a penalized least-squares criterion with an l1-type penalty for this purpose. We explain how to implement this method in practice by using the LARS/LASSO algorithm. We then prove that, in an appropriate asymptotic framework, this method provides consistent estimators of the change points with an almost optimal rate. We finally provide an improved practical version of this method by combining it with a reduced version of the dynamic programming algorithm, and we successfully compare it with classical methods.

329 citations
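
The reframing of change-point estimation as variable selection can be illustrated by regressing the signal on step indicators and letting a lasso fit choose the active jumps. The sketch below uses scikit-learn's `Lasso` on simulated data with an arbitrary penalty level, and omits the LARS implementation and the dynamic-programming refinement described in the abstract.

```python
# Sketch: change-point detection as an l1-penalized regression on step regressors.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 300
signal = np.concatenate([np.zeros(100), 2 * np.ones(100), 0.5 * np.ones(100)])
y = signal + rng.normal(scale=0.5, size=n)

# Column t is the indicator of a jump at time t: 1 for indices >= t, else 0.
X = np.tril(np.ones((n, n)))[:, 1:]            # drop the all-ones column (intercept)
fit = Lasso(alpha=0.1, fit_intercept=True, max_iter=10000).fit(X, y)
change_points = np.nonzero(fit.coef_)[0] + 1   # candidate jump locations
print(change_points)
```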


Journal ArticleDOI
TL;DR: These tests for sphericity and identity of high-dimensional covariance matrices can accommodate situations where the data dimension is much larger than the sample size, namely the “large p, small n” situations.
Abstract: We propose tests for sphericity and identity of high-dimensional covariance matrices. The tests are nonparametric without assuming a specific parametric distribution for the data. They can accommodate situations where the data dimension is much larger than the sample size, namely the “large p, small n” situations. We demonstrate by both theoretical and empirical studies that the tests have good properties for a wide range of dimensions and sample sizes. We apply the proposed tests to a microarray dataset on Yorkshire gilts and test the covariance structure of the expression levels for sets of genes.

298 citations


Journal ArticleDOI
TL;DR: In this article, a consistent and efficient estimator of the high-frequency covariance (quadratic covariation) of two arbitrary assets, observed asynchronously with market microstructure noise, is proposed.
Abstract: This article proposes a consistent and efficient estimator of the high-frequency covariance (quadratic covariation) of two arbitrary assets, observed asynchronously with market microstructure noise. This estimator is built on the marriage of the quasi–maximum likelihood estimator of the quadratic variation and the proposed generalized synchronization scheme and thus is not influenced by the Epps effect. Moreover, the estimation procedure is free of tuning parameters or bandwidths and is readily implementable. Monte Carlo simulations show the advantage of this estimator by comparing it with a variety of estimators with specific synchronization methods. The empirical studies of six foreign exchange future contracts illustrate the time-varying correlations of the currencies during the 2008 global financial crisis, demonstrating the similarities and differences in their roles as key currencies in the global market.

286 citations


Journal ArticleDOI
TL;DR: This proposal makes a connection between the classical variable selection criteria and the regularization parameter selection for nonconcave penalized likelihood approaches, showing that the BIC-type selector identifies the true model consistently and that the resulting estimator possesses the oracle property in the terminology of Fan and Li (2001).
Abstract: We apply the nonconcave penalized likelihood approach to obtain variable selections as well as shrinkage estimators. This approach relies heavily on the choice of regularization parameter, which controls the model complexity. In this paper, we propose employing the generalized information criterion, encompassing the commonly used Akaike information criterion (AIC) and Bayesian information criterion (BIC), for selecting the regularization parameter. Our proposal makes a connection between the classical variable selection criteria and the regularization parameter selections for the nonconcave penalized likelihood approaches. We show that the BIC-type selector enables identification of the true model consistently, and the resulting estimator possesses the oracle property in the terminology of Fan and Li (2001). In contrast, however, the AIC-type selector tends to overfit with positive probability. We further show that the AIC-type selector is asymptotically loss efficient, while the BIC-type selector is not....

272 citations
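
The spirit of the proposal, choosing the regularization parameter by minimizing an information criterion along the solution path, can be sketched as follows. This toy version uses the lasso path from scikit-learn rather than a nonconcave penalty, and counts nonzero coefficients as the degrees of freedom, a common working approximation rather than the article's exact criterion.

```python
# Sketch of BIC-type selection of the regularization parameter along a lasso path.
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(size=n)

alphas, coefs, _ = lasso_path(X, y)            # coefs has shape (p, n_alphas)
bic = []
for j in range(len(alphas)):
    resid = y - X @ coefs[:, j]
    df = np.count_nonzero(coefs[:, j])         # working degrees of freedom
    bic.append(n * np.log(resid @ resid / n) + np.log(n) * df)
print("BIC-selected alpha:", alphas[int(np.argmin(bic))])
```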


Journal ArticleDOI
TL;DR: In this article, the asymptotic bias and variance of the standard estimators of the posterior distribution, which are based on rejection sampling and linear adjustment, are derived, and an original estimator based on quadratic adjustment is introduced whose bias contains fewer terms than that of the estimator with linear adjustment.
Abstract: Approximate Bayesian Computation is a family of likelihood-free inference techniques that are well suited to models defined in terms of a stochastic generating mechanism. In a nutshell, Approximate Bayesian Computation proceeds by computing summary statistics s_obs from the data and simulating summary statistics for different values of the parameter θ. The posterior distribution is then approximated by an estimator of the conditional density g(θ|s_obs). In this paper, we derive the asymptotic bias and variance of the standard estimators of the posterior distribution, which are based on rejection sampling and linear adjustment. Additionally, we introduce an original estimator of the posterior distribution based on quadratic adjustment and we show that its bias contains fewer terms than that of the estimator with linear adjustment. Although we find that the estimators with adjustment are not universally superior to the estimator based on rejection sampling, we find that they can achieve better performance ...
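
The rejection-plus-adjustment pipeline described above can be sketched on a toy normal-mean model: simulate parameters from the prior, keep those whose summaries land closest to the observed summary, then apply the linear regression adjustment. Everything in the example (prior, tolerance, sample sizes) is a made-up illustration, and the quadratic adjustment studied in the article is omitted.

```python
# Minimal sketch of ABC with rejection sampling and linear regression adjustment.
import numpy as np

rng = np.random.default_rng(0)
n_obs = 50
data = rng.normal(loc=1.5, scale=1.0, size=n_obs)
s_obs = data.mean()                                  # observed summary statistic

n_sim, accept_frac = 20000, 0.01
theta = rng.normal(loc=0.0, scale=3.0, size=n_sim)   # draws from an assumed prior
s_sim = rng.normal(loc=theta, scale=1.0 / np.sqrt(n_obs))  # simulated summaries

dist = np.abs(s_sim - s_obs)
keep = np.argsort(dist)[: int(accept_frac * n_sim)]  # rejection step

# Linear adjustment: regress accepted theta on accepted summaries, then shift each
# accepted draw to what it "would have been" had its summary equaled s_obs.
slope = np.polyfit(s_sim[keep], theta[keep], deg=1)[0]
theta_adj = theta[keep] - slope * (s_sim[keep] - s_obs)
print(theta[keep].mean(), theta_adj.mean())
```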

Journal ArticleDOI
TL;DR: In this paper, a self-normalization (SN) based Kolmogorov–Smirnov test is proposed for testing for a change point in a time series, where the formation of the self-normalizer takes the change-point alternative into account.
Abstract: This article considers the CUSUM-based (cumulative sum) test for a change point in a time series. In the case of testing for a mean shift, the traditional Kolmogorov–Smirnov test statistic involves a consistent long-run variance estimator, which is needed to make the limiting null distribution free of nuisance parameters. The commonly used lag-window type long-run variance estimator requires the choice of a bandwidth parameter, and its selection is a difficult task in practice. A bandwidth that is a fixed function of the sample size (e.g., n^{1/3}, where n is the sample size) is not adaptive to the magnitude of the dependence in the series, whereas a data-dependent bandwidth could lead to nonmonotonic power, as shown in previous studies. In this article, we propose a self-normalization (SN) based Kolmogorov–Smirnov test, where the formation of the self-normalizer takes the change point alternative into account. The resulting test statistic is asymptotically distribution free and its power is monotonic. Furthermore...


Journal ArticleDOI
TL;DR: A method is developed to estimate both individual social network size and the distribution of network sizes in a population by asking respondents how many people they know in specific subpopulations, building on the scale-up method of Killworth et al. (1998b).
Abstract: In this article we develop a method to estimate both individual social network size (i.e., degree) and the distribution of network sizes in a population by asking respondents how many people they know in specific subpopulations (e.g., people named Michael). Building on the scale-up method of Killworth et al. (1998b) and other previous attempts to estimate individual network size, we propose a latent non-random mixing model which resolves three known problems with previous approaches. As a byproduct, our method also provides estimates of the rate of social mixing between population groups. We demonstrate the model using a sample of 1,370 adults originally collected by McCarty et al. (2001). Based on insights developed during the statistical modeling, we conclude by offering practical guidelines for the design of future surveys to estimate social network size. Most importantly, we show that if the first names asked about are chosen properly, the estimates from the simple scale-up model enjoy the same bias-r...
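
The basic scale-up estimate that the article builds on divides a respondent's total count of acquaintances in the probe subpopulations by the share of the population those subpopulations represent. A minimal sketch with entirely hypothetical counts and subpopulation sizes; the latent non-random mixing model of the article is not implemented here.

```python
# Sketch of the Killworth-style scale-up estimate of personal network size (degree).
import numpy as np

def scale_up_degree(y, subpop_sizes, population_size):
    """y: (n_respondents, n_subpops) counts of acquaintances in each subpopulation."""
    y = np.asarray(y, dtype=float)
    return population_size * y.sum(axis=1) / np.sum(subpop_sizes)

# Hypothetical numbers: 3 respondents asked about 2 first names.
y = [[2, 1], [0, 3], [5, 4]]
print(scale_up_degree(y, subpop_sizes=[1_000_000, 1_500_000],
                      population_size=300_000_000))
```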

Journal ArticleDOI
TL;DR: This work considers the problem of variable selection in regression modeling in high-dimensional spaces where there is known structure among the covariates, and approaches this problem through the Bayesian variable selection framework, assuming that the covariates lie on an undirected graph and formulating an Ising prior on the model space to incorporate structural information.
Abstract: We consider the problem of variable selection in regression modeling in high-dimensional spaces where there is known structure among the covariates. This is an unconventional variable selection problem for two reasons: (1) The dimension of the covariate space is comparable to, and often much larger than, the number of subjects in the study, and (2) the covariate space is highly structured, and in some cases it is desirable to incorporate this structural information into the model-building process. We approach this problem through the Bayesian variable selection framework, where we assume that the covariates lie on an undirected graph and formulate an Ising prior on the model space for incorporating structural information. Certain computational and statistical problems arise that are unique to such high-dimensional, structured settings, the most interesting being the phenomenon of phase transitions. We propose theoretical and computational schemes to mitigate these problems. We illustrate our methods on two ...
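
The Ising prior mentioned above can be written down, up to its normalizing constant, as a sparsity term plus a term rewarding joint inclusion of neighboring covariates on the graph. The sketch below shows only this unnormalized log prior; the hyperparameters and the toy chain graph are assumptions, and the posterior computation of the article is not attempted.

```python
# Sketch of an (unnormalized) Ising prior over inclusion indicators on a covariate graph.
import numpy as np

def ising_log_prior(gamma, edges, a=-2.0, b=0.5):
    """gamma: 0/1 inclusion vector; edges: list of (i, j) pairs of the undirected graph."""
    gamma = np.asarray(gamma)
    sparsity = a * gamma.sum()                                # penalizes large models
    smoothness = b * sum(gamma[i] * gamma[j] for i, j in edges)  # rewards connected selections
    return sparsity + smoothness

edges = [(0, 1), (1, 2), (2, 3)]          # a small chain graph
print(ising_log_prior([1, 1, 0, 0], edges), ising_log_prior([1, 0, 1, 0], edges))
```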

Journal ArticleDOI
TL;DR: Extensive simulations, along with an analysis of real-world data, demonstrate that variational methods achieve accuracy competitive with Markov chain Monte Carlo at a small fraction of the computational cost.
Abstract: Discrete choice models are commonly used by applied statisticians in numerous fields, such as marketing, economics, finance, and operations research. When agents in discrete choice models are assumed to have differing preferences, exact inference is often intractable. Markov chain Monte Carlo techniques make approximate inference possible, but the computational cost is prohibitive on the large datasets now becoming routinely available. Variational methods provide a deterministic alternative for approximation of the posterior distribution. We derive variational procedures for empirical Bayes and fully Bayesian inference in the mixed multinomial logit model of discrete choice. The algorithms require only that we solve a sequence of unconstrained optimization problems, which are shown to be convex. One version of the procedures relies on a new approximation to the variational objective function, based on the multivariate delta method. Extensive simulations, along with an analysis of real-world data, demonstr...

Journal ArticleDOI
TL;DR: In this paper, the authors use moving averages to develop new classes of models in a flexible modeling framework for stream networks, which can account for the volume and direction of flowing water.
Abstract: In this article we use moving averages to develop new classes of models in a flexible modeling framework for stream networks. Streams and rivers are among our most important resources, yet models with autocorrelated errors for spatially continuous stream networks have been described only recently. We develop models based on stream distance rather than on Euclidean distance. Spatial autocovariance models developed for Euclidean distance may not be valid when using stream distance. We begin by describing a stream topology. We then use moving averages to build several classes of valid models for streams. Various models are derived depending on whether the moving average has a “tail-up” stream, a “tail-down” stream, or a “two-tail” construction. These models also can account for the volume and direction of flowing water. The data for this article come from the Ecosystem Health Monitoring Program in Southeast Queensland, Australia, an important national program aimed at monitoring water quality. We model two w...

Journal ArticleDOI
TL;DR: In this paper, the authors generalize and improve upon the model of Gneiting et al. (2006) for the average wind speed two hours ahead, which is based on both spatial and temporal information but is split into nonunique regimes based on the wind direction at an offsite location.
Abstract: The technology to harvest electricity from wind energy is now advanced enough to make entire cities powered by it a reality. High-quality, short-term forecasts of wind speed are vital to making this a more reliable energy source. Gneiting et al. (2006) have introduced a model for the average wind speed two hours ahead based on both spatial and temporal information. The forecasts produced by this model are accurate, and subject to accuracy, the predictive distribution is sharp, that is, highly concentrated around its center. However, this model is split into nonunique regimes based on the wind direction at an offsite location. This paper both generalizes and improves upon this model by treating wind direction as a circular variable and including it in the model. It is robust in many experiments, such as predicting wind at other locations. We compare this with the more common approach of modeling wind speeds and directions in the Cartesian space and use a skew-t distribution for the errors. The quality of t...
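
One way to read "treating wind direction as a circular variable" is to enter sine and cosine components of the direction as regressors instead of splitting the data into direction regimes. The toy least-squares sketch below illustrates just that idea; it ignores the skew-t errors and the full spatio-temporal structure of the article's model, and all data are simulated placeholders.

```python
# Sketch: a circular covariate entered through its sine and cosine components.
import numpy as np

rng = np.random.default_rng(0)
n = 500
theta = rng.uniform(0, 2 * np.pi, size=n)            # offsite wind direction (radians)
speed_now = rng.gamma(shape=2.0, scale=3.0, size=n)  # current wind speed
speed_2h = 0.7 * speed_now + 2.0 * np.cos(theta) + rng.normal(scale=1.0, size=n)

Z = np.column_stack([np.ones(n), speed_now, np.sin(theta), np.cos(theta)])
coef, *_ = np.linalg.lstsq(Z, speed_2h, rcond=None)
print("fitted coefficients:", coef.round(2))
```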

Journal ArticleDOI
TL;DR: In this article, the authors use data cloning, a simple computational method that exploits advances in Bayesian computation, in particular the Markov Chain Monte Carlo method, to obtain maximum likelihood estimators of the parameters in these models.
Abstract: Maximum likelihood estimation for Generalized Linear Mixed Models (GLMM), an important class of statistical models with substantial applications in epidemiology, medical statistics, and many other fields, poses significant computational difficulties. In this article, we use data cloning, a simple computational method that exploits advances in Bayesian computation, in particular the Markov Chain Monte Carlo method, to obtain maximum likelihood estimators of the parameters in these models. This method also leads to a simple estimator of the asymptotic variance of the maximum likelihood estimators. Determining estimability of the parameters in a mixed model is, in general, a very difficult problem. Data cloning provides a simple graphical test to not only check if the full set of parameters is estimable but also, and perhaps more importantly, if a specified function of the parameters is estimable. One of the goals of mixed models is to predict random effects. We suggest a frequentist method to obtain predict...

Journal ArticleDOI
TL;DR: Numerical results indicate that the proposed extension of the LASSO tends to remove irrelevant variables more effectively and provide better prediction performance than previous work, while automatically enforcing the heredity constraint.
Abstract: In this paper, we extend the LASSO method (Tibshirani 1996) for simultaneously fitting a regression model and identifying important interaction terms. Unlike most of the existing variable selection methods, our method automatically enforces the heredity constraint, that is, an interaction term can be included in the model only if the corresponding main terms are also included in the model. Furthermore, we extend our method to generalized linear models, and show that it performs as well as if the true model were given in advance, that is, the oracle property as in Fan and Li (2001) and Fan and Peng (2004). The proof of the oracle property is given in online supplemental materials. Numerical results on both simulation data and real data indicate that our method tends to remove irrelevant variables more effectively and provide better prediction performance than previous work (Yuan, Joseph, and Lin 2007 and Zhao, Rocha, and Yu 2009 as well as the classical LASSO method).

Journal ArticleDOI
TL;DR: In this article, an instrument is defined as a random nudge toward acceptance of a treatment that affects outcomes only to the extent that it affects acceptance of the treatment, i.e., it is used to extract bits of random treatment assignment from a setting that is quite biased in its treatment assignments.
Abstract: An instrument is a random nudge toward acceptance of a treatment that affects outcomes only to the extent that it affects acceptance of the treatment. Nonetheless, in settings in which treatment assignment is mostly deliberate and not random, there may exist some essentially random nudges to accept treatment, so that use of an instrument might extract bits of random treatment assignment from a setting that is otherwise quite biased in its treatment assignments. An instrument is weak if the random nudges barely influence treatment assignment or strong if the nudges are often decisive in influencing treatment assignment. Although ideally an ostensibly random instrument is perfectly random and not biased, it is not possible to be certain of this; thus a typical concern is that even the instrument might be biased to some degree. It is known from theoretical arguments that weak instruments are invariably sensitive to extremely small biases; for this reason, strong instruments are preferred. The strength of an ...

Journal ArticleDOI
TL;DR: In this article, a new resampling procedure, the dependent wild bootstrap, is proposed for stationary time series; it can be easily extended to irregularly spaced time series with no implementational difficulty.
Abstract: We propose a new resampling procedure, the dependent wild bootstrap, for stationary time series. As a natural extension of the traditional wild bootstrap to time series setting, the dependent wild bootstrap offers a viable alternative to the existing block-based bootstrap methods, whose properties have been extensively studied over the last two decades. Unlike all of the block-based bootstrap methods, the dependent wild bootstrap can be easily extended to irregularly spaced time series with no implementational difficulty. Furthermore, it preserves the favorable bias and mean squared error property of the tapered block bootstrap, which is the state-of-the-art block-based method in terms of asymptotic accuracy of variance estimation and distribution approximation. The consistency of the dependent wild bootstrap in distribution approximation is established under the framework of the smooth function model. In addition, we obtain the bias and variance expansions of the dependent wild bootstrap variance estimat...
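
The mechanics of the dependent wild bootstrap for a sample mean can be sketched directly: center the series, multiply by external multipliers whose correlation decays with the lag, and add back the sample mean. The Bartlett-shaped multiplier covariance and the block length below are illustrative choices, not prescriptions from the article.

```python
# Sketch of the dependent wild bootstrap for the sample mean of a stationary series.
import numpy as np

def dependent_wild_bootstrap_means(x, l, n_boot, rng):
    n = len(x)
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    cov = np.maximum(1.0 - lags / l, 0.0)            # Bartlett-type multiplier covariance
    W = rng.multivariate_normal(np.zeros(n), cov, size=n_boot)
    centered = x - x.mean()
    return x.mean() + (W * centered).mean(axis=1)    # bootstrap draws of the mean

rng = np.random.default_rng(0)
e = rng.normal(size=500)
x = np.array([e[t] + 0.6 * e[t - 1] for t in range(1, 500)])   # an MA(1) series
boot_means = dependent_wild_bootstrap_means(x, l=10, n_boot=1000, rng=rng)
print("bootstrap s.e. of the mean:", boot_means.std())
```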

Journal ArticleDOI
TL;DR: The benefit of considering group structure is demonstrated by presenting a p-value weighting procedure which utilizes the relative importance of each group while controlling the false discovery rate under weak conditions.
Abstract: In the context of large-scale multiple hypothesis testing, the hypotheses often possess certain group structures based on additional information such as Gene Ontology in gene expression data and phenotypes in genome-wide association studies. It is hence desirable to incorporate such information when dealing with multiplicity problems to increase statistical power. In this article, we demonstrate the benefit of considering group structure by presenting a p-value weighting procedure which utilizes the relative importance of each group while controlling the false discovery rate under weak conditions. The procedure is easy to implement and shown to be more powerful than the classical Benjamini–Hochberg procedure in both theoretical and simulation studies. By estimating the proportion of true null hypotheses, the data-driven procedure controls the false discovery rate asymptotically. Our analysis on one breast cancer dataset confirms that the procedure performs favorably compared with the classical method.
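
A p-value weighting procedure of the kind described can be sketched as weighted Benjamini–Hochberg: divide each p-value by its group weight (weights averaging one) and apply the usual step-up rule. The group weights in the example are arbitrary; the article's procedure constructs them from the estimated relative importance of each group.

```python
# Sketch of weighted Benjamini-Hochberg with group-level p-value weights.
import numpy as np

def weighted_bh(pvals, weights, alpha=0.05):
    p = np.asarray(pvals) / np.asarray(weights)      # weighted p-values
    m = len(p)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(p[order] <= thresh)[0]
    if below.size == 0:
        return np.zeros(m, dtype=bool)
    cutoff = p[order][below.max()]                   # largest k with p_(k) <= alpha*k/m
    return p <= cutoff

rng = np.random.default_rng(0)
p_group1 = rng.uniform(size=50) ** 3                 # an enriched group (smaller p-values)
p_group2 = rng.uniform(size=50)
pvals = np.concatenate([p_group1, p_group2])
weights = np.concatenate([np.full(50, 1.5), np.full(50, 0.5)])   # mean weight 1
print(weighted_bh(pvals, weights).sum(), "rejections")
```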

Journal ArticleDOI
TL;DR: This paper concerns the accuracy of summary statistics for the collection of normal variates, such as their empirical cdf or a false discovery rate statistic, and shows that good accuracy approximations can be based on the root mean square correlation over all N ⋅ (N − 1)/2 pairs.
Abstract: We consider large-scale studies in which there are hundreds or thousands of correlated cases to investigate, each represented by its own normal variate, typically a z-value. A familiar example is provided by a microarray experiment comparing healthy with sick subjects’ expression levels for thousands of genes. This paper concerns the accuracy of summary statistics for the collection of normal variates, such as their empirical cdf or a false discovery rate statistic. It seems like we must estimate an N by N correlation matrix, N the number of cases, but our main result shows that this is not necessary: good accuracy approximations can be based on the root mean square correlation over all N ⋅ (N − 1)/2 pairs, a quantity often easily estimated. A second result shows that z-values closely follow normal distributions even under nonnull conditions, supporting application of the main theorem. Practical application of the theory is illustrated for a large leukemia microarray study.

Journal ArticleDOI
TL;DR: In this paper, the authors propose a complete methodology of cumulative slicing estimation for sufficient dimension reduction, developing three methods termed cumulative mean estimation, cumulative variance estimation, and cumulative directional regression.
Abstract: In this paper we offer a complete methodology of cumulative slicing estimation to sufficient dimension reduction. In parallel to the classical slicing estimation, we develop three methods that are termed, respectively, as cumulative mean estimation, cumulative variance estimation, and cumulative directional regression. The strong consistency for p = O(n^{1/2}/log n) and the asymptotic normality for p = o(n^{1/2}) are established, where p is the dimension of the predictors and n is the sample size. Such asymptotic results improve the rate p = o(n^{1/3}) in many existing contexts of semiparametric modeling. In addition, we propose a modified BIC-type criterion to estimate the structural dimension of the central subspace. Its consistency is established when p = o(n^{1/2}). Extensive simulations are carried out for comparison with existing methods and a real data example is presented for illustration.
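
Cumulative mean estimation, the first of the three methods named in the abstract, replaces slicing by an average of indicator-weighted predictor means. A small sketch on a toy single-index model follows, with standardized predictors and an assumed structural dimension of one; the BIC-type dimension selection of the article is not included.

```python
# Sketch of cumulative mean estimation for sufficient dimension reduction.
import numpy as np

def cume_directions(X, y, n_dirs=1):
    n, p = X.shape
    L = np.linalg.cholesky(np.cov(X.T))
    Xs = (X - X.mean(axis=0)) @ np.linalg.inv(L).T   # standardized predictors
    M = np.zeros((p, p))
    for yi in y:
        m = Xs[y <= yi].sum(axis=0) / n              # cumulative mean m(yi)
        M += np.outer(m, m) / n                      # kernel matrix of outer products
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(vals)[::-1][:n_dirs]]  # leading eigenvectors

rng = np.random.default_rng(0)
n, p = 400, 6
X = rng.normal(size=(n, p))
beta = np.array([1.0, -1.0, 0, 0, 0, 0]) / np.sqrt(2)
y = np.sin(X @ beta) + 0.1 * rng.normal(size=n)
print(cume_directions(X, y).ravel().round(2))        # should align with beta's direction
```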

Journal ArticleDOI
TL;DR: In this paper, the dependence structure of continuous-valued time series data is expressed using a sequence of bivariate copulas, yielding a parsimonious representation of a time-inhomogeneous Markov process of varying order.
Abstract: Copulas have proven to be very successful tools for the flexible modeling of cross-sectional dependence. In this paper we express the dependence structure of continuous-valued time series data using a sequence of bivariate copulas. This corresponds to a type of decomposition recently called a “vine” in the graphical models literature, where each copula is entitled a “pair-copula.” We propose a Bayesian approach for the estimation of this dependence structure for longitudinal data. Bayesian selection ideas are used to identify any independence pair-copulas, with the end result being a parsimonious representation of a time-inhomogeneous Markov process of varying order. Estimates are Bayesian model averages over the distribution of the lag structure of the Markov process. Using a simulation study we show that the selection approach is reliable and can improve the estimates of both conditional and unconditional pairwise dependencies substantially. We also show that a vine with selection outperforms a Gaussian...

Journal ArticleDOI
TL;DR: In this paper, a Bayesian hierarchical model is proposed to reconstruct past temperatures that integrates information from different sources, such as proxies with different temporal resolution and forcings acting as the external drivers of large scale temperature evolution.
Abstract: Understanding the dynamics of climate change in its full richness requires the knowledge of long temperature time series. Although long-term, widely distributed temperature observations are not available, there are other forms of data, known as climate proxies, that can have a statistical relationship with temperatures and have been used to infer temperatures in the past before direct measurements. We propose a Bayesian hierarchical model to reconstruct past temperatures that integrates information from different sources, such as proxies with different temporal resolution and forcings acting as the external drivers of large scale temperature evolution. Additionally, this method allows us to quantify the uncertainty of the reconstruction in a rigorous manner. The reconstruction method is assessed, using a global climate model as the true climate system and with synthetic proxy data derived from the simulation. The target is to reconstruct Northern Hemisphere temperature from proxies that mimic the sampling...

Journal ArticleDOI
TL;DR: A novel homotopy method is introduced for computing an entire solution surface through regularization involving a piecewise linear penalty; this penalty permits adaptive grouping and nearly unbiased estimation and is handled with a novel concept of grouped subdifferentials and difference convex programming for efficient computation.
Abstract: Extracting grouping structure or identifying homogeneous subgroups of predictors in regression is crucial for high-dimensional data analysis. A low-dimensional structure in particular—grouping, when captured in a regression model—enables enhanced predictive performance and facilitates a model’s interpretability. Grouping pursuit extracts homogeneous subgroups of predictors most responsible for outcomes of a response. This is the case in gene network analysis, where grouping reveals gene functionalities with regard to progression of a disease. To address challenges in grouping pursuit, we introduce a novel homotopy method for computing an entire solution surface through regularization involving a piecewise linear penalty. This nonconvex and overcomplete penalty permits adaptive grouping and nearly unbiased estimation, which is treated with a novel concept of grouped subdifferentials and difference convex programming for efficient computation. Finally, the proposed method not only achieves high performanc...

Journal ArticleDOI
TL;DR: This work introduces a new approach, “Variable selection using Adaptive Nonlinear Interaction Structures in High dimensions” (VANISH), that is based on a penalized least squares criterion and is designed for high-dimensional nonlinear problems; it is suggested that VANISH should outperform certain natural competitors when the true interaction structure is sufficiently sparse.
Abstract: Numerous penalization based methods have been proposed for fitting a traditional linear regression model in which the number of predictors, p, is large relative to the number of observations, n. Most of these approaches assume sparsity in the underlying coefficients and perform some form of variable selection. Recently, some of this work has been extended to nonlinear additive regression models. However, in many contexts one wishes to allow for the possibility of interactions among the predictors. This poses serious statistical and computational difficulties when p is large, as the number of candidate interaction terms is of order p^2. We introduce a new approach, “Variable selection using Adaptive Nonlinear Interaction Structures in High dimensions” (VANISH), that is based on a penalized least squares criterion and is designed for high dimensional nonlinear problems. Our criterion is convex and enforces the heredity constraint, in other words, if an interaction term is added to the model, then the correspo...

Journal ArticleDOI
TL;DR: A composite likelihood version of the Bayes information criterion (BIC) is proposed and shown to be selection-consistent under some mild regularity conditions, where the number of potential model parameters is allowed to increase to infinity at a certain rate with the sample size.
Abstract: For high-dimensional data sets with complicated dependency structures, the full likelihood approach often leads to intractable computational complexity. This imposes difficulty on model selection, given that most traditionally used information criteria require evaluation of the full likelihood. We propose a composite likelihood version of the Bayes information criterion (BIC) and establish its consistency property for the selection of the true underlying marginal model. Our proposed BIC is shown to be selection-consistent under some mild regularity conditions, where the number of potential model parameters is allowed to increase to infinity at a certain rate of the sample size. Simulation studies demonstrate the empirical performance of this new BIC, especially for the scenario where the number of parameters increases with sample size. Technical proofs of our theoretical results are provided in the online supplemental materials.