scispace - formally typeset
Search or ask a question

Showing papers in "arXiv: Methodology in 2011"


Journal ArticleDOI
TL;DR: The purpose of this article is to clarify the distinction between explanatory and predictive modeling, to discuss its sources, and to reveal the practical implications of the distinction to each step in the modeling process.
Abstract: Statistical modeling is a powerful tool for developing and testing theories by way of causal explanation, prediction, and description. In many disciplines there is near-exclusive use of statistical modeling for causal explanation and the assumption that models with high explanatory power are inherently of high predictive power. Conflation between explanation and prediction is common, yet the distinction must be understood for progressing scientific knowledge. While this distinction has been recognized in the philosophy of science, the statistical literature lacks a thorough discussion of the many differences that arise in the process of modeling for an explanatory versus a predictive goal. The purpose of this article is to clarify the distinction between explanatory and predictive modeling, to discuss its sources, and to reveal the practical implications of the distinction to each step in the modeling process.

1,384 citations


Posted Content
TL;DR: In this article, a constrained L1 minimization method is proposed for estimating a sparse inverse covariance matrix based on a sample of $n$ iid $p$-variate random variables.
Abstract: A constrained L1 minimization method is proposed for estimating a sparse inverse covariance matrix based on a sample of $n$ iid $p$-variate random variables. The resulting estimator is shown to enjoy a number of desirable properties. In particular, it is shown that the rate of convergence between the estimator and the true $s$-sparse precision matrix under the spectral norm is $s\sqrt{\log p/n}$ when the population distribution has either exponential-type tails or polynomial-type tails. Convergence rates under the elementwise $L_{\infty}$ norm and Frobenius norm are also presented. In addition, graphical model selection is considered. The procedure is easily implementable by linear programming. Numerical performance of the estimator is investigated using both simulated and real data. In particular, the procedure is applied to analyze a breast cancer dataset. The procedure performs favorably in comparison to existing methods.

674 citations


Posted Content
TL;DR: This work develops a novel estimation and uniformly valid inference method for the treatment effect in this setting, called the "post-double-selection" method, which resolves the problem of uniform inference after model selection for a large, interesting class of models.
Abstract: We propose robust methods for inference on the effect of a treatment variable on a scalar outcome in the presence of very many controls. Our setting is a partially linear model with possibly non-Gaussian and heteroscedastic disturbances. Our analysis allows the number of controls to be much larger than the sample size. To make informative inference feasible, we require the model to be approximately sparse; that is, we require that the effect of confounding factors can be controlled for up to a small approximation error by conditioning on a relatively small number of controls whose identities are unknown. The latter condition makes it possible to estimate the treatment effect by selecting approximately the right set of controls. We develop a novel estimation and uniformly valid inference method for the treatment effect in this setting, called the "post-double-selection" method. Our results apply to Lasso-type methods used for covariate selection as well as to any other model selection method that is able to find a sparse model with good approximation properties. The main attractive feature of our method is that it allows for imperfect selection of the controls and provides confidence intervals that are valid uniformly across a large class of models. In contrast, standard post-model selection estimators fail to provide uniform inference even in simple cases with a small, fixed number of controls. Thus our method resolves the problem of uniform inference after model selection for a large, interesting class of models. We illustrate the use of the developed methods with numerical simulations and an application to the effect of abortion on crime rates.

568 citations


Posted Content
TL;DR: In this article, the problem of estimating multiple related but distinct graphical models on the basis of a high-dimensional data set with observations that belong to distinct classes was considered, and a joint graphical lasso was proposed to solve the problem.
Abstract: We consider the problem of estimating multiple related but distinct graphical models on the basis of a high-dimensional data set with observations that belong to distinct classes. A motivating example occurs in the analysis of gene expression data for tissue samples with and without cancer. In this case, we might wish to estimate a gene expression network for the normal tissue and a gene expression network for the tumor tissue. We expect the two gene expression networks to be similar but not identical to each other, and so more accurate estimation of these two networks may be possible using a joint approach. We propose the joint graphical lasso for this purpose. Rather than estimating a graphical model for each class separately, or a single graphical model across all classes, we borrow strength across the classes in order to estimate multiple graphical models that share certain characteristics, such as the locations or weights of nonzero edges. Our approach is based upon maximizing a penalized log likelihood. We employ fused lasso or group lasso penalties, and implement a very fast computational approach that solves the joint graphical lasso problem. In a simulation study we demonstrate that our proposed approach leads to more accurate estimation of networks and covariance structure than competing approaches. We further illustrate our proposal on a publicly-available lung cancer gene expression data set.

511 citations


Journal ArticleDOI
TL;DR: This work proposes the new RFCI algorithm, which is much faster than FCI, and proves consistency of FCI and RFCI in sparse high-dimensional settings, and demonstrates in simulations that the estimation performances of the algorithms are very similar.
Abstract: We consider the problem of learning causal information between random variables in directed acyclic graphs (DAGs) when allowing arbitrarily many latent and selection variables. The FCI (Fast Causal Inference) algorithm has been explicitly designed to infer conditional independence and causal information in such settings. However, FCI is computationally infeasible for large graphs. We therefore propose the new RFCI algorithm, which is much faster than FCI. In some situations the output of RFCI is slightly less informative, in particular with respect to conditional independence information. However, we prove that any causal information in the output of RFCI is correct in the asymptotic limit. We also define a class of graphs on which the outputs of FCI and RFCI are identical. We prove consistency of FCI and RFCI in sparse high-dimensional settings, and demonstrate in simulations that the estimation performances of the algorithms are very similar. All software is implemented in the R-package pcalg.

277 citations


Posted Content
TL;DR: The Bag of Little Bootstraps (BLB) as mentioned in this paper is a new procedure which incorporates features of both the bootstrap and subsampling to yield a robust, computationally efficient means of assessing the quality of estimators.
Abstract: The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large datasets---which are increasingly prevalent---the computation of bootstrap-based quantities can be prohibitively demanding computationally. While variants such as subsampling and the $m$ out of $n$ bootstrap can be used in principle to reduce the cost of bootstrap computations, we find that these methods are generally not robust to specification of hyperparameters (such as the number of subsampled data points), and they often require use of more prior information (such as rates of convergence of estimators) than the bootstrap. As an alternative, we introduce the Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to yield a robust, computationally efficient means of assessing the quality of estimators. BLB is well suited to modern parallel and distributed computing architectures and furthermore retains the generic applicability and statistical efficiency of the bootstrap. We demonstrate BLB's favorable statistical performance via a theoretical analysis elucidating the procedure's properties, as well as a simulation study comparing BLB to the bootstrap, the $m$ out of $n$ bootstrap, and subsampling. In addition, we present results from a large-scale distributed implementation of BLB demonstrating its computational superiority on massive data, a method for adaptively selecting BLB's hyperparameters, an empirical study applying BLB to several real datasets, and an extension of BLB to time series data.

245 citations


Posted Content
TL;DR: This paper gives a graph theoretic criterion for two DAGs being Markov equivalent under interventions and shows that each interventional Markov equivalence class can be uniquely represented by a chain graph called interventional essential graph (also known as CPDAG in the observational case).
Abstract: The investigation of directed acyclic graphs (DAGs) encoding the same Markov property, that is the same conditional independence relations of multivariate observational distributions, has a long tradition; many algorithms exist for model selection and structure learning in Markov equivalence classes. In this paper, we extend the notion of Markov equivalence of DAGs to the case of interventional distributions arising from multiple intervention experiments. We show that under reasonable assumptions on the intervention experiments, interventional Markov equivalence defines a finer partitioning of DAGs than observational Markov equivalence and hence improves the identifiability of causal models. We give a graph theoretic criterion for two DAGs being Markov equivalent under interventions and show that each interventional Markov equivalence class can, analogously to the observational case, be uniquely represented by a chain graph called interventional essential graph (also known as CPDAG in the observational case). These are key insights for deriving a generalization of the Greedy Equivalence Search algorithm aimed at structure learning from interventional data. This new algorithm is evaluated in a simulation study.

243 citations


Posted Content
TL;DR: In this article, the half-Cauchy distribution is proposed as a default prior for a top-level scale parameter in Bayesian hierarchical models, at least for cases where a proper prior is necessary.
Abstract: This paper argues that the half-Cauchy distribution should replace the inverse-Gamma distribution as a default prior for a top-level scale parameter in Bayesian hierarchical models, at least for cases where a proper prior is necessary. Our arguments involve a blend of Bayesian and frequentist reasoning, and are intended to complement the original case made by Gelman (2006) in support of the folded-t family of priors. First, we generalize the half-Cauchy prior to the wider class of hypergeometric inverted-beta priors. We derive expressions for posterior moments and marginal densities when these priors are used for a top-level normal variance in a Bayesian hierarchical model. We go on to prove a proposition that, together with the results for moments and marginals, allows us to characterize the frequentist risk of the Bayes estimators under all global-shrinkage priors in the class. These theoretical results, in turn, allow us to study the frequentist properties of the half-Cauchy prior versus a wide class of alternatives. The half-Cauchy occupies a sensible 'middle ground' within this class: it performs very well near the origin, but does not lead to drastic compromises in other parts of the parameter space. This provides an alternative, classical justification for the repeated, routine use of this prior. We also consider situations where the underlying mean vector is sparse, where we argue that the usual conjugate choice of an inverse-gamma prior is particularly inappropriate, and can lead to highly distorted posterior inferences. Finally, we briefly summarize some open issues in the specification of default priors for scale terms in hierarchical models.

231 citations


Journal ArticleDOI
TL;DR: This work forms an ERGM for networks whose ties are counts and discusses issues that arise when moving beyond the binary case and introduces model terms that generalize and model common social network features for such data and apply these methods to a network dataset whose values are counts of interactions.
Abstract: Exponential-family random graph models (ERGMs) provide a principled and flexible way to model and simulate features common in social networks, such as propensities for homophily, mutuality, and friend-of-a-friend triad closure, through choice of model terms (sufficient statistics). However, those ERGMs modeling the more complex features have, to date, been limited to binary data: presence or absence of ties. Thus, analysis of valued networks, such as those where counts, measurements, or ranks are observed, has necessitated dichotomizing them, losing information and introducing biases. In this work, we generalize ERGMs to valued networks. Focusing on modeling counts, we formulate an ERGM for networks whose ties are counts and discuss issues that arise when moving beyond the binary case. We introduce model terms that generalize and model common social network features for such data and apply these methods to a network dataset whose values are counts of interactions.

215 citations


Journal ArticleDOI
TL;DR: In this article, the authors consider the problem of detecting multiple changepoints in large data sets and propose a new method for finding the minimum of such cost functions and hence the optimal number and location of changepoints that has a computational cost which, under mild conditions, is linear in the number of observations.
Abstract: We consider the problem of detecting multiple changepoints in large data sets. Our focus is on applications where the number of changepoints will increase as we collect more data: for example in genetics as we analyse larger regions of the genome, or in finance as we observe time-series over longer periods. We consider the common approach of detecting changepoints through minimising a cost function over possible numbers and locations of changepoints. This includes several established procedures for detecting changing points, such as penalised likelihood and minimum description length. We introduce a new method for finding the minimum of such cost functions and hence the optimal number and location of changepoints that has a computational cost which, under mild conditions, is linear in the number of observations. This compares favourably with existing methods for the same problem whose computational cost can be quadratic or even cubic. In simulation studies we show that our new method can be orders of magnitude faster than these alternative exact methods. We also compare with the Binary Segmentation algorithm for identifying changepoints, showing that the exactness of our approach can lead to substantial improvements in the accuracy of the inferred segmentation of the data.

205 citations


Posted Content
TL;DR: The method proposed turns the regression data into an approximate Gaussian sequence of point estimators of individual regression coefficients, which can be used to select variables after proper thresholding, and demonstrates the accuracy of the coverage probability and other desirable properties of the confidence intervals proposed.
Abstract: The purpose of this paper is to propose methodologies for statistical inference of low-dimensional parameters with high-dimensional data. We focus on constructing confidence intervals for individual coefficients and linear combinations of several of them in a linear regression model, although our ideas are applicable in a much broad context. The theoretical results presented here provide sufficient conditions for the asymptotic normality of the proposed estimators along with a consistent estimator for their finite-dimensional covariance matrices. These sufficient conditions allow the number of variables to far exceed the sample size. The simulation results presented here demonstrate the accuracy of the coverage probability of the proposed confidence intervals, strongly supporting the theoretical results.

Posted Content
TL;DR: A simple and effective classifier is introduced by estimating the product Ωδ directly through constrained ℓ1 minimization and it has superior finite sample performance and significant computational advantages over the existing methods that require separate estimation of Ω and δ.
Abstract: This paper considers sparse linear discriminant analysis of high-dimensional data. In contrast to the existing methods which are based on separate estimation of the precision matrix $\O$ and the difference $\de$ of the mean vectors, we introduce a simple and effective classifier by estimating the product $\O\de$ directly through constrained $\ell_1$ minimization. The estimator can be implemented efficiently using linear programming and the resulting classifier is called the linear programming discriminant (LPD) rule. The LPD rule is shown to have desirable theoretical and numerical properties. It exploits the approximate sparsity of $\O\de$ and as a consequence allows cases where it can still perform well even when $\O$ and/or $\de$ cannot be estimated consistently. Asymptotic properties of the LPD rule are investigated and consistency and rate of convergence results are given. The LPD classifier has superior finite sample performance and significant computational advantages over the existing methods that require separate estimation of $\O$ and $\de$. The LPD rule is also applied to analyze real datasets from lung cancer and leukemia studies. The classifier performs favorably in comparison to existing methods.

ReportDOI
TL;DR: This chapter reviews estimation methods for HDS models that make use of l 1 -penalization and then provides a set of novel inference results that illustrate the potential wide applicability of H DS models and methods in econometrics.
Abstract: Introduction We consider linear, high-dimensional sparse (HDS) regression models in econometrics. The HDS regression model allows for a large number of regressors, p , which is possibly much larger than the sample size, n , but imposes that the model is sparse. That is, we assume that only s ≪ n of these regressors are important for capturing the main features of the regression function. This assumption makes it possible to effectively estimate HDS models by searching for approximately the correct set of regressors. In this chapter, we review estimation methods for HDS models that make use of l 1 -penalization and then provide a set of novel inference results. We also provide empirical examples that illustrate the potential wide applicability of HDS models and methods in econometrics. The motivation for considering HDS models comes in part from the wide availability of datasets with many regressors. For example, the American Housing Survey records prices as well as a multitude of features of houses sold, and scanner datasets record prices and numerous characteristics of products sold at a store or on the Internet. HDS models also are partly motivated by the use of series methods in econometrics. Series methods use many constructed or series regressors – regressors formed as transformation of elementary regressors – to approximate regression functions. In these applications, it is important to have a parsimonious yet accurate approximation of the regression function. One way to achieve this is to use the data to select as mall of number of informative terms from among a very large set of control variables or approximating functions.

Journal ArticleDOI
TL;DR: An adjustment of the monitoring schemes to overcome problems of estimation of the in-control state is suggested, which guarantees, with a certain probability, a conditional performance given the estimated in- control state.
Abstract: To use control charts in practice, the in-control state usually has to be estimated. This estimation has a detrimental effect on the performance of control charts, which is often measured for example by the false alarm probability or the average run length. We suggest an adjustment of the monitoring schemes to overcome these problems. It guarantees, with a certain probability, a conditional performance given the estimated in-control state. The suggested method is based on bootstrapping the data used to estimate the in-control state. The method applies to different types of control charts, and also works with charts based on regression models, survival models, etc. If a nonparametric bootstrap is used, the method is robust to model errors. We show large sample properties of the adjustment. The usefulness of our approach is demonstrated through simulation studies.

Journal ArticleDOI
TL;DR: In this article, a spike-and-slab prior structure for function selection is proposed, which allows to include or exclude single coefficients as well as blocks of coefficients representing specific model terms.
Abstract: Structured additive regression provides a general framework for complex Gaussian and non-Gaussian regression models, with predictors comprising arbitrary combinations of nonlinear functions and surfaces, spatial effects, varying coefficients, random effects and further regression terms. The large flexibility of structured additive regression makes function selection a challenging and important task, aiming at (1) selecting the relevant covariates, (2) choosing an appropriate and parsimonious representation of the impact of covariates on the predictor and (3) determining the required interactions. We propose a spike-and-slab prior structure for function selection that allows to include or exclude single coefficients as well as blocks of coefficients representing specific model terms. A novel multiplicative parameter expansion is required to obtain good mixing and convergence properties in a Markov chain Monte Carlo simulation approach and is shown to induce desirable shrinkage properties. In simulation studies and with (real) benchmark classification data, we investigate sensitivity to hyperparameter settings and compare performance to competitors. The flexibility and applicability of our approach are demonstrated in an additive piecewise exponential model with time-varying effects for right-censored survival times of intensive care patients with sepsis. Geoadditive and additive mixed logit model applications are discussed in an extensive appendix.

Posted Content
TL;DR: The Bayesian bridge estimator for regularized regression and classification was proposed in this article, and two key mixture representations for the bridge model are developed: (1) a scale mixture of normals with respect to an alpha-stable random variable; and (2) a mixture of Bartlett-Fejer kernels (or triangle densities) with respect with a two-component mixture of gamma random variables.
Abstract: We propose the Bayesian bridge estimator for regularized regression and classification. Two key mixture representations for the Bayesian bridge model are developed: (1) a scale mixture of normals with respect to an alpha-stable random variable; and (2) a mixture of Bartlett--Fejer kernels (or triangle densities) with respect to a two-component mixture of gamma random variables. Both lead to MCMC methods for posterior simulation, and these methods turn out to have complementary domains of maximum efficiency. The first representation is a well known result due to West (1987), and is the better choice for collinear design matrices. The second representation is new, and is more efficient for orthogonal problems, largely because it avoids the need to deal with exponentially tilted stable random variables. It also provides insight into the multimodality of the joint posterior distribution, a feature of the bridge model that is notably absent under ridge or lasso-type priors. We prove a theorem that extends this representation to a wider class of densities representable as scale mixtures of betas, and provide an explicit inversion formula for the mixing distribution. The connections with slice sampling and scale mixtures of normals are explored. On the practical side, we find that the Bayesian bridge model outperforms its classical cousin in estimation and prediction across a variety of data sets, both simulated and real. We also show that the MCMC for fitting the bridge model exhibits excellent mixing properties, particularly for the global scale parameter. This makes for a favorable contrast with analogous MCMC algorithms for other sparse Bayesian models. All methods described in this paper are implemented in the R package BayesBridge. An extensive set of simulation results are provided in two supplemental files.

Posted Content
TL;DR: In this paper, the authors develop a nonparametric QR-series framework for performing inference on the entire conditional quantile function and its linear functionals, which is a principal regression method for analyzing the impact of covariates on outcomes, and they demonstrate the practical utility of these results with an example, where they estimate the price elasticity function and test the Slutsky condition of the individual demand for gasoline, as indexed by the individual propensity for gasoline consumption.
Abstract: Quantile regression (QR) is a principal regression method for analyzing the impact of covariates on outcomes. The impact is described by the conditional quantile function and its functionals. In this paper we develop the nonparametric QR-series framework, covering many regressors as a special case, for performing inference on the entire conditional quantile function and its linear functionals. In this framework, we approximate the entire conditional quantile function by a linear combination of series terms with quantile-specific coefficients and estimate the function-valued coefficients from the data. We develop large sample theory for the QR-series coefficient process, namely we obtain uniform strong approximations to the QR-series coefficient process by conditionally pivotal and Gaussian processes. Based on these strong approximations, or couplings, we develop four resampling methods (pivotal, gradient bootstrap, Gaussian, and weighted bootstrap) that can be used for inference on the entire QR-series coefficient function. We apply these results to obtain estimation and inference methods for linear functionals of the conditional quantile function, such as the conditional quantile function itself, its partial derivatives, average partial derivatives, and conditional average partial derivatives. Specifically, we obtain uniform rates of convergence and show how to use the four resampling methods mentioned above for inference on the functionals. All of the above results are for function-valued parameters, holding uniformly in both the quantile index and the covariate value, and covering the pointwise case as a by-product. We demonstrate the practical utility of these results with an example, where we estimate the price elasticity function and test the Slutsky condition of the individual demand for gasoline, as indexed by the individual unobserved propensity for gasoline consumption.

Journal ArticleDOI
TL;DR: A self-tuning Lasso method that simultaneously resolves three important practical problems in high-dimensional regression analysis, namely it handles the unknown scale, heteroscedasticity and (drastic) non-Gaussianity of the noise, and generates sharp bounds even in extreme cases, such as the infinite variance case and the noiseless case.
Abstract: We propose a self-tuning $\sqrt{\mathrm {Lasso}}$ method that simultaneously resolves three important practical problems in high-dimensional regression analysis, namely it handles the unknown scale, heteroscedasticity and (drastic) non-Gaussianity of the noise. In addition, our analysis allows for badly behaved designs, for example, perfectly collinear regressors, and generates sharp bounds even in extreme cases, such as the infinite variance case and the noiseless case, in contrast to Lasso. We establish various nonasymptotic bounds for $\sqrt{\mathrm {Lasso}}$ including prediction norm rate and sparsity. Our analysis is based on new impact factors that are tailored for bounding prediction norm. In order to cover heteroscedastic non-Gaussian noise, we rely on moderate deviation theory for self-normalized sums to achieve Gaussian-like results under weak conditions. Moreover, we derive bounds on the performance of ordinary least square (ols) applied to the model selected by $\sqrt{\mathrm {Lasso}}$ accounting for possible misspecification of the selected model. Under mild conditions, the rate of convergence of ols post $\sqrt{\mathrm {Lasso}}$ is as good as $\sqrt{\mathrm {Lasso}}$'s rate. As an application, we consider the use of $\sqrt{\mathrm {Lasso}}$ and ols post $\sqrt{\mathrm {Lasso}}$ as estimators of nuisance parameters in a generic semiparametric problem (nonlinear moment condition or $Z$-problem), resulting in a construction of $\sqrt{n}$-consistent and asymptotically normal estimators of the main parameters.

Journal ArticleDOI
TL;DR: This work proposes a new model that uses a combination of low-order regression terms and compactly supported correlation functions to recreate the desired predictive behavior of the emulator at a fraction of the computational cost.
Abstract: Statistical emulators of computer simulators have proven to be useful in a variety of applications. The widely adopted model for emulator building, using a Gaussian process model with strictly positive correlation function, is computationally intractable when the number of simulator evaluations is large. We propose a new model that uses a combination of low-order regression terms and compactly supported correlation functions to recreate the desired predictive behavior of the emulator at a fraction of the computational cost. Following the usual approach of taking the correlation to be a product of correlations in each input dimension, we show how to impose restrictions on the ranges of the correlations, giving sparsity, while also allowing the ranges to trade off against one another, thereby giving good predictive performance. We illustrate the method using data from a computer simulator of photometric redshift with 20,000 simulator evaluations and 80,000 predictions.

Journal ArticleDOI
TL;DR: In this paper, the authors consider the problem of selecting an objective prior from a large number of sources of information or past data and compare the performance of different objective priors in the presence of nuisance parameters, and show that a prior different from Jeffreys' is optimal under the chi-square divergence criterion even in the absence of nuisance parameter.
Abstract: Bayesian methods are increasingly applied in these days in the theory and practice of statistics. Any Bayesian inference depends on a likelihood and a prior. Ideally one would like to elicit a prior from related sources of information or past data. However, in its absence, Bayesian methods need to rely on some "objective" or "default" priors, and the resulting posterior inference can still be quite valuable. Not surprisingly, over the years, the catalog of objective priors also has become prohibitively large, and one has to set some specific criteria for the selection of such priors. Our aim is to review some of these criteria, compare their performance, and illustrate them with some simple examples. While for very large sample sizes, it does not possibly matter what objective prior one uses, the selection of such a prior does influence inference for small or moderate samples. For regular models where asymptotic normality holds, Jeffreys' general rule prior, the positive square root of the determinant of the Fisher information matrix, enjoys many optimality properties in the absence of nuisance parameters. In the presence of nuisance parameters, however, there are many other priors which emerge as optimal depending on the criterion selected. One new feature in this article is that a prior different from Jeffreys' is shown to be optimal under the chi-square divergence criterion even in the absence of nuisance parameters. The latter is also invariant under one-to-one reparameterization.

Journal ArticleDOI
TL;DR: This survey walks the reader through the construction of several specific MM algorithms, a special case of a more general algorithm called the MM algorithm, which has potential in solving high-dimensional optimization and estimation problems.
Abstract: The EM algorithm is a special case of a more general algorithm called the MM algorithm. Specific MM algorithms often have nothing to do with missing data. The first M step of an MM algorithm creates a surrogate function that is optimized in the second M step. In minimization, MM stands for majorize--minimize; in maximization, it stands for minorize--maximize. This two-step process always drives the objective function in the right direction. Construction of MM algorithms relies on recognizing and manipulating inequalities rather than calculating conditional expectations. This survey walks the reader through the construction of several specific MM algorithms. The potential of the MM algorithm in solving high-dimensional optimization and estimation problems is its most attractive feature. Our applications to random graph models, discriminant analysis and image restoration showcase this ability.

Posted Content
TL;DR: This work proposes an alternative approach that involves linear projection of all the data points onto a lower-dimensional subspace and demonstrates the superiority of this approach from a theoretical perspective and through simulated and real data examples.
Abstract: Gaussian processes (GPs) are widely used in nonparametric regression, classification and spatio-temporal modeling, motivated in part by a rich literature on theoretical properties. However, a well known drawback of GPs that limits their use is the expensive computation, typically O($n^3$) in performing the necessary matrix inversions with $n$ denoting the number of data points. In large data sets, data storage and processing also lead to computational bottlenecks and numerical stability of the estimates and predicted values degrades with $n$. To address these problems, a rich variety of methods have been proposed, with recent options including predictive processes in spatial data analysis and subset of regressors in machine learning. The underlying idea in these approaches is to use a subset of the data, leading to questions of sensitivity to the subset and limitations in estimating fine scale structure in regions that are not well covered by the subset. Motivated by the literature on compressive sensing, we propose an alternative random projection of all the data points onto a lower-dimensional subspace. We demonstrate the superiority of this approach from a theoretical perspective and through the use of simulated and real data examples. Some Keywords: Bayesian; Compressive Sensing; Dimension Reduction; Gaussian Processes; Random Projections; Subset Selection

Journal ArticleDOI
TL;DR: A discriminative latent mixture (DLM) model which fits the data in a latent orthonormal discrim inative subspace with an intrinsic dimension lower than the dimension of the original space is presented.
Abstract: Clustering in high-dimensional spaces is nowadays a recurrent problem in many scientific domains but remains a difficult task from both the clustering accuracy and the result understanding points of view. This paper presents a discriminative latent mixture (DLM) model which fits the data in a latent orthonormal discriminative subspace with an intrinsic dimension lower than the dimension of the original space. By constraining model parameters within and between groups, a family of 12 parsimonious DLM models is exhibited which allows to fit onto various situations. An estimation algorithm, called the Fisher-EM algorithm, is also proposed for estimating both the mixture parameters and the discriminative subspace. Experiments on simulated and real datasets show that the proposed approach performs better than existing clustering methods while providing a useful representation of the clustered data. The method is as well applied to the clustering of mass spectrometry data.

Posted Content
TL;DR: A new class of normal scale mixtures is proposed through a novel generalized beta distribution that encompasses many interesting priors as special cases and develops a class of variational Bayes approximations that will scale more efficiently to the types of truly massive data sets that are now encountered routinely.
Abstract: In recent years, a rich variety of shrinkage priors have been proposed that have great promise in addressing massive regression problems. In general, these new priors can be expressed as scale mixtures of normals, but have more complex forms and better properties than traditional Cauchy and double exponential priors. We first propose a new class of normal scale mixtures through a novel generalized beta distribution that encompasses many interesting priors as special cases. This encompassing framework should prove useful in comparing competing priors, considering properties and revealing close connections. We then develop a class of variational Bayes approximations through the new hierarchy presented that will scale more efficiently to the types of truly massive data sets that are now encountered routinely.

Posted Content
TL;DR: A kriging surrogate of the performance function is proposed to use as a means to build a quasi-optimal importance sampling density and proves efficient up to 100 random variables.
Abstract: Structural reliability methods aim at computing the probability of failure of systems with respect to some prescribed performance functions. In modern engineering such functions usually resort to running an expensive-to-evaluate computational model (e.g. a finite element model). In this respect simulation methods, which may require $10^{3-6}$ runs cannot be used directly. Surrogate models such as quadratic response surfaces, polynomial chaos expansions or kriging (which are built from a limited number of runs of the original model) are then introduced as a substitute of the original model to cope with the computational cost. In practice it is almost impossible to quantify the error made by this substitution though. In this paper we propose to use a kriging surrogate of the performance function as a means to build a quasi-optimal importance sampling density. The probability of failure is eventually obtained as the product of an augmented probability computed by substituting the meta-model for the original performance function and a correction term which ensures that there is no bias in the estimation even if the meta-model is not fully accurate. The approach is applied to analytical and finite element reliability problems and proves efficient up to 100 random variables.

Posted Content
TL;DR: In this paper, a model-assisted approach is proposed to sample hard-to-reach human populations by link-tracing over their social networks, where each person sampled is given a small number of uniquely identified coupons to distribute to other members of the target population, making them eligible for enrollment in the study.
Abstract: Respondent-Driven Sampling is a method to sample hard-to-reach human populations by link-tracing over their social networks. Beginning with a convenience sample, each person sampled is given a small number of uniquely identified coupons to distribute to other members of the target population, making them eligible for enrollment in the study. This can be an effective means to collect large diverse samples from many populations. Inference from such data requires specialized techniques for two reasons. Unlike in standard sampling designs, the sampling process is both partially beyond the control of the researcher, and partially implicitly defined. Therefore, it is not generally possible to directly compute the sampling weights necessary for traditional design-based inference. Any likelihood-based inference requires the modeling of the complex sampling process often beginning with a convenience sample. We introduce a model-assisted approach, resulting in a design-based estimator leveraging a working model for the structure of the population over which sampling is conducted. We demonstrate that the new estimator has improved performance compared to existing estimators and is able to adjust for the bias induced by the selection of the initial sample. We present sensitivity analyses for unknown population sizes and the misspecification of the working network model. We develop a bootstrap procedure to compute measures of uncertainty. We apply the method to the estimation of HIV prevalence in a population of injecting drug users (IDU) in the Ukraine, and show how it can be extended to include application-specific information.

Posted Content
TL;DR: In this article, a class of nonparametric covariance regression models, which allow an unknown p x p covariance matrix to change flexibly with predictors, is proposed.
Abstract: Although there is a rich literature on methods for allowing the variance in a univariate regression model to vary with predictors, time and other factors, relatively little has been done in the multivariate case. Our focus is on developing a class of nonparametric covariance regression models, which allow an unknown p x p covariance matrix to change flexibly with predictors. The proposed modeling framework induces a prior on a collection of covariance matrices indexed by predictors through priors for predictor-dependent loadings matrices in a factor model. In particular, the predictor-dependent loadings are characterized as a sparse combination of a collection of unknown dictionary functions (e.g, Gaussian process random functions). The induced covariance is then a regularized quadratic function of these dictionary elements. Our proposed framework leads to a highly-flexible, but computationally tractable formulation with simple conjugate posterior updates that can readily handle missing data. Theoretical properties are discussed and the methods are illustrated through simulations studies and an application to the Google Flu Trends data.

Journal ArticleDOI
TL;DR: It is argued that the Calibrated Bayesian (CB) approach to statistical inference capitalizes on the strength of Bayesian and frequen- tist approaches to Statistical inference.
Abstract: It is argued that the Calibrated Bayesian (CB) approach to statistical inference capitalizes on the strength of Bayesian and frequentist approaches to statistical inference. In the CB approach, inferences under a particular model are Bayesian, but frequentist methods are useful for model development and model checking. In this article the CB approach is outlined. Bayesian methods for missing data are then reviewed from a CB perspective. The basic theory of the Bayesian approach, and the closely related technique of multiple imputation, is described. Then applications of the Bayesian approach to normal models are described, both for monotone and nonmonotone missing data patterns. Sequential Regression Multivariate Imputation and Penalized Spline of Propensity Models are presented as two useful approaches for relaxing distributional assumptions.

Journal ArticleDOI
TL;DR: In this paper, an approach based on both metamodels and advanced simulation techniques is explored to develop a strategy for solving reliability-based design optimization (RBDO) problems that remains applicable when the performance models are expensive to evaluate starting from the premise that simulation-based approaches are not affordable for such problems, and that the most-probable-failure-pointbased approaches do not permit to quantify the error on the estimation of the failure probability.
Abstract: The aim of the present paper is to develop a strategy for solving reliability-based design optimization (RBDO) problems that remains applicable when the performance models are expensive to evaluate Starting with the premise that simulation-based approaches are not affordable for such problems, and that the most-probable-failure-point-based approaches do not permit to quantify the error on the estimation of the failure probability, an approach based on both metamodels and advanced simulation techniques is explored The kriging metamodeling technique is chosen in order to surrogate the performance functions because it allows one to genuinely quantify the surrogate error The surrogate error onto the limit-state surfaces is propagated to the failure probabilities estimates in order to provide an empirical error measure This error is then sequentially reduced by means of a population-based adaptive refinement technique until the kriging surrogates are accurate enough for reliability analysis This original refinement strategy makes it possible to add several observations in the design of experiments at the same time Reliability and reliability sensitivity analyses are performed by means of the subset simulation technique for the sake of numerical efficiency The adaptive surrogate-based strategy for reliability estimation is finally involved into a classical gradient-based optimization algorithm in order to solve the RBDO problem The kriging surrogates are built in a so-called augmented reliability space thus making them reusable from one nested RBDO iteration to the other The strategy is compared to other approaches available in the literature on three academic examples in the field of structural mechanics

Journal ArticleDOI
TL;DR: The authors developed and applied Bayesian association methods to address the problem of heterogeneity in genetic association meta-analyses, and applied these tools to two large genetic association studies: one a meta-analysis of genome-wide association studies from the Global Lipids consortium, and the second a cross-population analysis for expression quantitative trait loci (eQTLs).
Abstract: Genetic association analyses often involve data from multiple potentially-heterogeneous subgroups. The expected amount of heterogeneity can vary from modest (e.g., a typical meta-analysis) to large (e.g., a strong gene--environment interaction). However, existing statistical tools are limited in their ability to address such heterogeneity. Indeed, most genetic association meta-analyses use a "fixed effects" analysis, which assumes no heterogeneity. Here we develop and apply Bayesian association methods to address this problem. These methods are easy to apply (in the simplest case, requiring only a point estimate for the genetic effect and its standard error, from each subgroup) and effectively include standard frequentist meta-analysis methods, including the usual "fixed effects" analysis, as special cases. We apply these tools to two large genetic association studies: one a meta-analysis of genome-wide association studies from the Global Lipids consortium, and the second a cross-population analysis for expression quantitative trait loci (eQTLs). In the Global Lipids data we find, perhaps surprisingly, that effects are generally quite homogeneous across studies. In the eQTL study we find that eQTLs are generally shared among different continental groups, and discuss consequences of this for study design.