Showing papers in "arXiv: Methodology in 2008"


Journal ArticleDOI
TL;DR: Doubly robust (DR) procedures apply both types of model simultaneously and produce a consistent estimate of the parameter if either of the two models has been correctly specified. This paper demonstrates that, in at least some settings, two wrong models are not better than one.
Abstract: When outcomes are missing for reasons beyond an investigator's control, there are two different ways to adjust a parameter estimate for covariates that may be related both to the outcome and to missingness. One approach is to model the relationships between the covariates and the outcome and use those relationships to predict the missing values. Another is to model the probabilities of missingness given the covariates and incorporate them into a weighted or stratified estimate. Doubly robust (DR) procedures apply both types of model simultaneously and produce a consistent estimate of the parameter if either of the two models has been correctly specified. In this article, we show that DR estimates can be constructed in many ways. We compare the performance of various DR and non-DR estimates of a population mean in a simulated example where both models are incorrect but neither is grossly misspecified. Methods that use inverse-probabilities as weights, whether they are DR or not, are sensitive to misspecification of the propensity model when some estimated propensities are small. Many DR methods perform better than simple inverse-probability weighting. None of the DR methods we tried, however, improved upon the performance of simple regression-based prediction of the missing values. This study does not represent every missing-data problem that will arise in practice. But it does demonstrate that, in at least some settings, two wrong models are not better than one.

529 citations
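
As a rough illustration of the three families of estimators compared in the abstract above, the sketch below computes a regression-prediction estimate, a simple inverse-probability-weighted estimate, and one common DR form (regression prediction plus inverse-probability-weighted residuals) of a population mean. The function names, and the convention that fitted outcomes `m_hat` and estimated response probabilities `pi_hat` are supplied by the caller, are assumptions made for this example; the paper itself considers many DR constructions.

```python
from typing import Optional, Sequence


def regression_mean(m_hat: Sequence[float]) -> float:
    """Regression-based prediction: average the fitted outcome over all units."""
    return sum(m_hat) / len(m_hat)


def ipw_mean(y: Sequence[Optional[float]], observed: Sequence[bool],
             pi_hat: Sequence[float]) -> float:
    """Simple inverse-probability weighting: observed outcomes weighted by 1/pi_hat."""
    n = len(observed)
    return sum(yi / pi for yi, r, pi in zip(y, observed, pi_hat) if r) / n


def dr_mean(y: Sequence[Optional[float]], observed: Sequence[bool],
            m_hat: Sequence[float], pi_hat: Sequence[float]) -> float:
    """One DR form: fitted outcome plus weighted residual for observed units."""
    n = len(observed)
    total = 0.0
    for yi, r, m, pi in zip(y, observed, m_hat, pi_hat):
        total += m                      # outcome-model prediction for every unit
        if r:
            total += (yi - m) / pi      # residual correction, observed units only
    return total / n


# Hypothetical toy data: None marks a missing outcome.
y = [1.0, None, 3.0, None]
observed = [True, False, True, False]
m_hat = [1.0, 2.0, 3.0, 4.0]            # assumed fitted values from an outcome model
pi_hat = [0.5, 0.5, 0.5, 0.5]           # assumed estimated response probabilities
estimate = dr_mean(y, observed, m_hat, pi_hat)
```

Because the residual term is divided by `pi_hat`, very small estimated probabilities inflate individual contributions, which is the sensitivity the abstract describes for methods that use inverse probabilities as weights.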


Posted Content
TL;DR: Inference across multiple random splits can be aggregated while maintaining asymptotic control over the inclusion of noise variables, and it is shown that the resulting p-values can be used for control of both family-wise error and false discovery rate.
Abstract: Assigning significance in high-dimensional regression is challenging. Most computationally efficient selection algorithms cannot guard against inclusion of noise variables. Asymptotically valid p-values are not available. An exception is a recent proposal by Wasserman and Roeder (2008) which splits the data into two parts. The number of variables is then reduced to a manageable size using the first split, while classical variable selection techniques can be applied to the remaining variables, using the data from the second split. This yields asymptotic error control under minimal conditions. It involves, however, a one-time random split of the data. Results are sensitive to this arbitrary choice: it amounts to a `p-value lottery' and makes it difficult to reproduce results. Here, we show that inference across multiple random splits can be aggregated, while keeping asymptotic control over the inclusion of noise variables. We show that the resulting p-values can be used for control of both family-wise error (FWER) and false discovery rate (FDR). In addition, the proposed aggregation is shown to improve power while reducing the number of falsely selected variables substantially.

337 citations
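
The abstract above does not spell out how p-values from several random splits are combined. The sketch below assumes a quantile-based rule: for each variable, divide its per-split p-values by a fixed `gamma`, take the empirical `gamma`-quantile, and cap the result at 1. The function names and the input layout (one list of already-adjusted p-values per split) are assumptions made for this example.

```python
import math
from typing import List, Sequence


def empirical_quantile(values: Sequence[float], gamma: float) -> float:
    """Smallest order statistic whose empirical CDF value is at least gamma."""
    ordered = sorted(values)
    k = max(1, math.ceil(gamma * len(ordered)))
    return ordered[k - 1]


def aggregate_pvalues(pvals_per_split: Sequence[Sequence[float]],
                      gamma: float = 0.5) -> List[float]:
    """Combine per-split p-values into one p-value per variable.

    pvals_per_split[b][j] is the p-value of variable j in random split b.
    """
    n_vars = len(pvals_per_split[0])
    aggregated = []
    for j in range(n_vars):
        scaled = [split[j] / gamma for split in pvals_per_split]
        aggregated.append(min(1.0, empirical_quantile(scaled, gamma)))
    return aggregated


# Hypothetical p-values for two variables over three random splits.
splits = [[0.01, 0.9], [0.02, 0.4], [0.03, 0.8]]
combined = aggregate_pvalues(splits, gamma=0.5)
```

Taking a quantile over many splits, rather than the p-value of a single split, is what removes the dependence on one arbitrary partition of the data.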


Journal ArticleDOI
TL;DR: A comment on the performance of double-robust estimators when "inverse probability" weights are highly variable.
Abstract: Comment on ``Performance of Double-Robust Estimators When ``Inverse Probability'' Weights Are Highly Variable'' [arXiv:0804.2958]

292 citations


Journal ArticleDOI
TL;DR: This paper proposes to use summary measures of the set of possible causal effects to determine variable importance and uses the minimum absolute value of this set, since that is a lower bound on the size of the causal effect.
Abstract: We assume that we have observational data generated from an unknown underlying directed acyclic graph (DAG) model. A DAG is typically not identifiable from observational data, but it is possible to consistently estimate the equivalence class of a DAG. Moreover, for any given DAG, causal effects can be estimated using intervention calculus. In this paper, we combine these two parts. For each DAG in the estimated equivalence class, we use intervention calculus to estimate the causal effects of the covariates on the response. This yields a collection of estimated causal effects for each covariate. We show that the distinct values in this set can be consistently estimated by an algorithm that uses only local information of the graph. This local approach is computationally fast and feasible in high-dimensional problems. We propose to use summary measures of the set of possible causal effects to determine variable importance. In particular, we use the minimum absolute value of this set, since that is a lower bound on the size of the causal effect. We demonstrate the merits of our methods in a simulation study and on a data set about riboflavin production.

288 citations


Journal ArticleDOI
TL;DR: This paper gives an overview of Least Angle Regression, which explains the similar behavior of LASSO and forward stagewise regression and provides a fast implementation of both, and of the current state of related research.
Abstract: Least Angle Regression is a promising technique for variable selection applications, offering a nice alternative to stepwise regression. It provides an explanation for the similar behavior of LASSO ($\ell_1$-penalized regression) and forward stagewise regression, and provides a fast implementation of both. The idea has caught on rapidly, and sparked a great deal of research interest. In this paper, we give an overview of Least Angle Regression and the current state of related research.

268 citations


Journal ArticleDOI
TL;DR: BART, a Bayesian "sum-of-trees" model, is a nonparametric Bayesian regression approach which uses dimensionally adaptive random basis elements; each tree is constrained by a regularization prior to be a weak learner.
Abstract: We develop a Bayesian "sum-of-trees" model where each tree is constrained by a regularization prior to be a weak learner, and fitting and inference are accomplished via an iterative Bayesian backfitting MCMC algorithm that generates samples from a posterior. Effectively, BART is a nonparametric Bayesian regression approach which uses dimensionally adaptive random basis elements. Motivated by ensemble methods in general, and boosting algorithms in particular, BART is defined by a statistical model: a prior and a likelihood. This approach enables full posterior inference including point and interval estimates of the unknown regression function as well as the marginal effects of potential predictors. By keeping track of predictor inclusion frequencies, BART can also be used for model-free variable selection. BART's many features are illustrated with a bake-off against competing methods on 42 different data sets, with a simulation experiment and on a drug discovery classification problem.

266 citations


Posted Content
TL;DR: In this paper, the Gaussian process model which gives analytical expressions of Sobol indices is discussed, and the techniques are finally applied to a real case of hydrogeological modeling.
Abstract: Global sensitivity analysis of complex numerical models can be performed by calculating variance-based importance measures of the input variables, such as the Sobol indices. However, these techniques, requiring a large number of model evaluations, are often unacceptable for time expensive computer codes. A well known and widely used approach consists in replacing the computer code by a metamodel, which predicts the model responses with a negligible computation time and makes the estimation of Sobol indices straightforward. In this paper, we discuss the Gaussian process model, which gives analytical expressions of Sobol indices. Two approaches are studied to compute the Sobol indices: the first based on the predictor of the Gaussian process model and the second based on the global stochastic process model. Comparisons between the two estimates, made on analytical examples, show the superiority of the second approach in terms of convergence and robustness. Moreover, the second approach makes it possible to integrate the modeling error of the Gaussian process model by directly giving some confidence intervals on the Sobol indices. These techniques are finally applied to a real case of hydrogeological modeling.

206 citations
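
To make the notion of a variance-based importance measure concrete, the sketch below estimates first-order Sobol indices by plain Monte Carlo with a "pick-freeze" scheme. It is not the Gaussian process approach of the paper above: the cheap function `f` merely stands in for a fast metamodel predictor, and the inputs are assumed independent and uniform on [0, 1]. All names are illustrative.

```python
import random
from typing import Callable, List


def first_order_sobol(f: Callable[[List[float]], float], dim: int,
                      n_samples: int = 20000, seed: int = 0) -> List[float]:
    """Estimate S_i = Var(E[Y | X_i]) / Var(Y) for each input i."""
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(dim)] for _ in range(n_samples)]
    b = [[rng.random() for _ in range(dim)] for _ in range(n_samples)]
    y = [f(row) for row in a]
    mean_y = sum(y) / n_samples
    var_y = sum((v - mean_y) ** 2 for v in y) / n_samples
    indices = []
    for i in range(dim):
        # Keep ("freeze") input i from sample a and redraw every other input from b.
        y_i = [f(rb[:i] + [ra[i]] + rb[i + 1:]) for ra, rb in zip(a, b)]
        mean_i = sum(y_i) / n_samples
        cov = sum((u - mean_y) * (v - mean_i) for u, v in zip(y, y_i)) / n_samples
        indices.append(cov / var_y)
    return indices


def toy_model(x: List[float]) -> float:
    """Hypothetical test function; its exact indices are 1/5 and 4/5."""
    return x[0] + 2.0 * x[1]


sobol = first_order_sobol(toy_model, dim=2, n_samples=20000, seed=1)
```

Each index requires a fresh batch of model evaluations, which is why the abstract describes direct estimation as impractical for expensive computer codes and motivates a metamodel.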


Journal ArticleDOI
Abstract: The use of quantiles to obtain insights about multivariate data is addressed. It is argued that incisive insights can be obtained by considering directional quantiles, the quantiles of projections. Directional quantile envelopes are proposed as a way to condense this kind of information; it is demonstrated that they are essentially halfspace (Tukey) depth level sets, coinciding for elliptic distributions (in particular multivariate normal) with density contours. Relevant questions concerning their indexing, the possibility of the reverse retrieval of directional quantile information, invariance with respect to affine transformations, and approximation/asymptotic properties are studied. It is argued that the analysis in terms of directional quantiles and their envelopes offers a straightforward probabilistic interpretation and thus conveys a concrete quantitative meaning; the directional definition can be adapted to elaborate frameworks, like estimation of extreme quantiles and directional quantile regression, the regression of depth contours on covariates. The latter facilitates the construction of multivariate growth charts---the question that motivated all the development.

98 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose selection criteria based on a fully Bayes formulation with a generalization of Zellner's $g$-prior which allows for $p>n$.
Abstract: For the normal linear model variable selection problem, we propose selection criteria based on a fully Bayes formulation with a generalization of Zellner's $g$-prior which allows for $p>n$. A special case of the prior formulation is seen to yield tractable closed forms for marginal densities and Bayes factors which reveal new model evaluation characteristics of potential interest.

95 citations


Journal ArticleDOI
TL;DR: Comment on ``Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data''
Abstract: Comment on ``Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data'' [arXiv:0804.2958]

92 citations


Journal ArticleDOI
TL;DR: A completely new direction will be considered here to study BVS with a Gibbs posterior originating in statistical mechanics, and a convenient Markov chain Monte Carlo algorithm is developed to implement BVS with the Gibbs posterior.
Abstract: In the popular approach of "Bayesian variable selection" (BVS), one uses prior and posterior distributions to select a subset of candidate variables to enter the model. A completely new direction will be considered here to study BVS with a Gibbs posterior originating in statistical mechanics. The Gibbs posterior is constructed from a risk function of practical interest (such as the classification error) and aims at minimizing a risk function without modeling the data probabilistically. This can improve the performance over the usual Bayesian approach, which depends on a probability model which may be misspecified. Conditions will be provided to achieve good risk performance, even in the presence of high dimensionality, when the number of candidate variables "$K$" can be much larger than the sample size "$n$." In addition, we develop a convenient Markov chain Monte Carlo algorithm to implement BVS with the Gibbs posterior.

Journal ArticleDOI
TL;DR: This paper proposes two novel types of regularization in the context of the multicategory SVM for simultaneous classification and variable selection, which lead to sparse multi-classifiers with enhanced interpretability and improved accuracy, especially for high dimensional low sample size data.
Abstract: The Support Vector Machine (SVM) is a popular classification paradigm in machine learning and has achieved great success in real applications. However, the standard SVM can not select variables automatically and therefore its solution typically utilizes all the input variables without discrimination. This makes it difficult to identify important predictor variables, which is often one of the primary goals in data analysis. In this paper, we propose two novel types of regularization in the context of the multicategory SVM (MSVM) for simultaneous classification and variable selection. The MSVM generally requires estimation of multiple discriminating functions and applies the argmax rule for prediction. For each individual variable, we propose to characterize its importance by the supnorm of its coefficient vector associated with different functions, and then minimize the MSVM hinge loss function subject to a penalty on the sum of supnorms. To further improve the supnorm penalty, we propose the adaptive regularization, which allows different weights imposed on different variables according to their relative importance. Both types of regularization automate variable selection in the process of building classifiers, and lead to sparse multi-classifiers with enhanced interpretability and improved accuracy, especially for high dimensional low sample size data. One big advantage of the supnorm penalty is its easy implementation via standard linear programming. Several simulated examples and one real gene data analysis demonstrate the outstanding performance of the adaptive supnorm penalty in various data settings.

ReportDOI
TL;DR: This paper proposes an $\ell_1$-penalized pseudo-likelihood estimate for the inverse covariance matrix, named SPLICE, which gives the best overall performance in terms of three metrics on the precision matrix and ROC curve for model selection.
Abstract: Given $n$ observations of a $p$-dimensional random vector, the covariance matrix and its inverse (precision matrix) are needed in a wide range of applications. Sample covariance (e.g. its eigenstructure) can misbehave when $p$ is comparable to the sample size $n$. Regularization is often used to mitigate the problem. In this paper, we propose an $\ell_1$-penalized pseudo-likelihood estimate for the inverse covariance matrix. This estimate is sparse due to the $\ell_1$ penalty, and we term this method SPLICE. Its regularization path can be computed via an algorithm based on the homotopy/LARS-Lasso algorithm. Simulation studies are carried out for various inverse covariance structures for $p = 15$ and $n = 20, 1000$. We compare SPLICE with the $\ell_1$-penalized likelihood estimate and an $\ell_1$-penalized Cholesky decomposition based method. SPLICE gives the best overall performance in terms of three metrics on the precision matrix and ROC curve for model selection. Moreover, our simulation results demonstrate that the SPLICE estimates are positive-definite for most of the regularization path even though the restriction is not enforced.

Journal ArticleDOI
TL;DR: In this paper, variance reduction techniques for the estimation of quantiles of the output of a complex model with random input parameters are discussed, based on the use of a reduced model, such as a metamodel or a response surface.
Abstract: In this paper we propose and discuss variance reduction techniques for the estimation of quantiles of the output of a complex model with random input parameters. These techniques are based on the use of a reduced model, such as a metamodel or a response surface. The reduced model can be used as a control variate; or a rejection method can be implemented to sample the realizations of the input parameters in prescribed relevant strata; or the reduced model can be used to determine a good biased distribution of the input parameters for the implementation of an importance sampling strategy. The different strategies are analyzed and the asymptotic variances are computed, which shows the benefit of an adaptive controlled stratification method. This method is finally applied to a real example (computation of the peak cladding temperature during a large-break loss of coolant accident in a nuclear reactor).

Journal ArticleDOI
TL;DR: This paper presents a review of approaches to spatial design to enable informed decisions to be made about developing practical and optimal spatial designs for future monitoring of streams.
Abstract: Spatial designs for monitoring stream networks, especially ephemeral systems, are typically non-standard, 'sparse' and can be very complex, reflecting the complexity of the ecosystem being monitored, the scale of the population, and the competing multiple monitoring objectives. The main purpose of this paper is to present a review of approaches to spatial design to enable informed decisions to be made about developing practical and optimal spatial designs for future monitoring of streams.

Posted Content
TL;DR: A two-stage regularization method is proposed that learns linear models with high prediction performance; varying a suitable parameter trades sparsity for the inclusion of correlated genes and produces gene lists which are almost perfectly nested.
Abstract: Gene expression analysis aims at identifying the genes able to accurately predict biological parameters like, for example, disease subtyping or progression. While accurate prediction can be achieved by means of many different techniques, gene identification, due to gene correlation and the limited number of available samples, is a much more elusive problem. Small changes in the expression values often produce different gene lists, and solutions which are both sparse and stable are difficult to obtain. We propose a two-stage regularization method able to learn linear models characterized by a high prediction performance. By varying a suitable parameter, these linear models make it possible to trade sparsity for the inclusion of correlated genes and to produce gene lists which are almost perfectly nested. Experimental results on synthetic and microarray data confirm the interesting properties of the proposed method and its potential as a starting point for further biological investigations.

Book ChapterDOI
TL;DR: In this paper, a phase II Shewhart-type chart is considered for location, based on reference data from phase I analysis and the well-known Mann-Whitney statistic, and the in-control performance of the proposed chart is shown to be much superior to the classical Shewhart $\bar{X}$ chart.
Abstract: Nonparametric or distribution-free charts can be useful in statistical process control problems when there is limited or lack of knowledge about the underlying process distribution. In this paper, a phase II Shewhart-type chart is considered for location, based on reference data from phase I analysis and the well-known Mann-Whitney statistic. Control limits are computed using Lugannani-Rice-saddlepoint, Edgeworth, and other approximations along with Monte Carlo estimation. The derivations take account of estimation and the dependence from the use of a reference sample. An illustrative numerical example is presented. The in-control performance of the proposed chart is shown to be much superior to the classical Shewhart $\bar{X}$ chart. Further comparisons on the basis of some percentiles of the out-of-control conditional run length distribution and the unconditional out-of-control ARL show that the proposed chart is almost as good as the Shewhart $\bar{X}$ chart for the normal distribution, but is more powerful for a heavy-tailed distribution such as the Laplace, or for a skewed distribution such as the Gamma. Interactive software, enabling a complete implementation of the chart, is made available on a website.

Journal ArticleDOI
TL;DR: In this paper, a latent structure on the concentration matrix is used to drive a penalty matrix and thus to recover a graphical model with a constrained topology, which corresponds to estimating the graph of conditional dependencies between the variables.
Abstract: Our concern is selecting the concentration matrix's nonzero coefficients for a sparse Gaussian graphical model in a high-dimensional setting. This corresponds to estimating the graph of conditional dependencies between the variables. We describe a novel framework taking into account a latent structure on the concentration matrix. This latent structure is used to drive a penalty matrix and thus to recover a graphical model with a constrained topology. Our method uses an $\ell_1$ penalized likelihood criterion. Inference of the graph of conditional dependencies between the variates and of the hidden variables is performed simultaneously in an iterative \textsc{em}-like algorithm. The performance of our method is illustrated on synthetic as well as real data, the latter concerning breast cancer.

Journal ArticleDOI
TL;DR: PeMS, a traffic performance measurement system, currently functions as a statewide repository for traffic data gathered by thousands of automatic sensors. It has integrated data collection, processing and communications infrastructure with data storage and analytical tools.
Abstract: A traffic performance measurement system, PeMS, currently functions as a statewide repository for traffic data gathered by thousands of automatic sensors. It has integrated data collection, processing and communications infrastructure with data storage and analytical tools. In this paper, we discuss statistical issues that have emerged as we attempt to process a data stream of 2 GB per day of wildly varying quality. In particular, we focus on detecting sensor malfunction, imputation of missing or bad data, estimation of velocity and forecasting of travel times on freeway networks.

Posted Content
TL;DR: In this paper, a review of recent developments in optimality criteria and comparison of non-regular fractional factorial designs is presented, including projection properties, generalized resolution, various generalized minimum aberration criteria, optimality results, construction methods and analysis strategies.
Abstract: Nonregular fractional factorial designs such as Plackett-Burman designs and other orthogonal arrays are widely used in various screening experiments for their run size economy and flexibility. The traditional analysis focuses on main effects only. Hamada and Wu (1992) went beyond the traditional approach and proposed an analysis strategy to demonstrate that some interactions could be entertained and estimated beyond a few significant main effects. Their groundbreaking work stimulated much of the recent developments in design criterion creation, construction and analysis of nonregular designs. This paper reviews important developments in optimality criteria and comparison, including projection properties, generalized resolution, various generalized minimum aberration criteria, optimality results, construction methods and analysis strategies for nonregular designs.

Posted Content
TL;DR: The authors investigated the predictive probabilities that underlie the Dirichlet and Pitman-Yor processes and the implicit "rich-get-richer" characteristic of the resulting partitions.
Abstract: Prior distributions play a crucial role in Bayesian approaches to clustering. Two commonly-used prior distributions are the Dirichlet and Pitman-Yor processes. In this paper, we investigate the predictive probabilities that underlie these processes, and the implicit "rich-get-richer" characteristic of the resulting partitions. We explore an alternative prior for nonparametric Bayesian clustering -- the uniform process -- for applications where the "rich-get-richer" property is undesirable. We also explore the cost of this process: partitions are no longer exchangeable with respect to the ordering of variables. We present new asymptotic and simulation-based results for the clustering characteristics of the uniform process and compare these with known results for the Dirichlet and Pitman-Yor processes. We compare performance on a real document clustering task, demonstrating the practical advantage of the uniform process despite its lack of exchangeability over orderings.

Posted Content
TL;DR: In this paper, a hierarchical model and estimation procedure for pooling principal axes across several populations is developed, based on a matrix-valued antipodally symmetric Bingham distribution that can flexibly describe notions of "center" and "spread" for a population of orthonormal matrices.
Abstract: While a set of covariance matrices corresponding to different populations are unlikely to be exactly equal they can still exhibit a high degree of similarity. For example, some pairs of variables may be positively correlated across most groups, while the correlation between other pairs may be consistently negative. In such cases much of the similarity across covariance matrices can be described by similarities in their principal axes, the axes defined by the eigenvectors of the covariance matrices. Estimating the degree of across-population eigenvector heterogeneity can be helpful for a variety of estimation tasks. Eigenvector matrices can be pooled to form a central set of principal axes, and to the extent that the axes are similar, covariance estimates for populations having small sample sizes can be stabilized by shrinking their principal axes towards the across-population center. To this end, this article develops a hierarchical model and estimation procedure for pooling principal axes across several populations. The model for the across-group heterogeneity is based on a matrix-valued antipodally symmetric Bingham distribution that can flexibly describe notions of ``center'' and ``spread'' for a population of orthonormal matrices.

Journal ArticleDOI
TL;DR: The appearance of Marshall and Olkin's 1979 book on inequalities with special emphasis on majorization generated a surge of interest in potential applications of majorization and Schur convexity in a broad spectrum of fields.
Abstract: The appearance of Marshall and Olkin's 1979 book on inequalities with special emphasis on majorization generated a surge of interest in potential applications of majorization and Schur convexity in a broad spectrum of fields. After 25 years this continues to be the case. The present article presents a sampling of the diverse areas in which majorization has been found to be useful in the past 25 years.

Book ChapterDOI
Susan Holmes
TL;DR: In this article, the authors present exploratory techniques for multivariate data, many of them well known to French statisticians and ecologists, but few well understood in North American culture.
Abstract: This paper presents exploratory techniques for multivariate data, many of them well known to French statisticians and ecologists, but few well understood in North American culture. We present the general framework of duality diagrams which encompasses discriminant analysis, correspondence analysis and principal components and we show how this framework can be generalized to the regression of graphs on covariates.

Posted Content
TL;DR: Stability selection as discussed by the authors is based on subsampling in combination with (high-dimensional) selection algorithms and provides finite sample control for some error rates of false discoveries and hence a transparent principle to choose a proper amount of regularisation for structure estimation.
Abstract: Estimation of structure, such as in variable selection, graphical modelling or cluster analysis, is notoriously difficult, especially for high-dimensional data. We introduce stability selection. It is based on subsampling in combination with (high-dimensional) selection algorithms. As such, the method is extremely general and has a very wide range of applicability. Stability selection provides finite sample control for some error rates of false discoveries and hence a transparent principle to choose a proper amount of regularisation for structure estimation. Variable selection and structure estimation improve markedly for a range of selection methods if stability selection is applied. We prove for randomised Lasso that stability selection will be variable selection consistent even if the necessary conditions needed for consistency of the original Lasso method are violated. We demonstrate stability selection for variable selection and Gaussian graphical modelling, using real and simulated data.
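
The subsampling idea in the abstract above can be sketched as follows: run a selection algorithm on many random half-samples, record how often each variable is selected, and keep the variables whose selection frequency reaches a threshold. The base selector here (the `k` variables with the largest absolute correlation with the response) is a deliberately simple stand-in chosen for this example, not the Lasso or randomised Lasso studied in the paper, and the threshold of 0.8 is an arbitrary illustrative value.

```python
import random
from typing import List, Sequence, Set, Tuple


def abs_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def top_k_selector(columns: Sequence[Sequence[float]], y: Sequence[float],
                   k: int) -> Set[int]:
    """Stand-in selection algorithm: indices of the k most correlated columns."""
    scores = [abs_correlation(col, y) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def stability_selection(columns: Sequence[Sequence[float]], y: Sequence[float],
                        k: int = 1, n_subsamples: int = 50,
                        threshold: float = 0.8,
                        seed: int = 0) -> Tuple[List[float], List[int]]:
    """Return per-variable selection frequencies and the stable set."""
    rng = random.Random(seed)
    n = len(y)
    counts = [0] * len(columns)
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)      # half-sample without replacement
        sub_cols = [[col[i] for i in idx] for col in columns]
        sub_y = [y[i] for i in idx]
        for j in top_k_selector(sub_cols, sub_y, k):
            counts[j] += 1
    freqs = [c / n_subsamples for c in counts]
    stable = [j for j, fr in enumerate(freqs) if fr >= threshold]
    return freqs, stable
```

Any selector with the same interface (data in, set of indices out) can replace `top_k_selector`, which reflects the abstract's point that the method is general with respect to the underlying selection algorithm.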

Posted Content
TL;DR: The indirect cross-validation method uniformly outperforms LSCV in a simulation study, a real data example, and a simulated example in which bandwidths are chosen locally.
Abstract: A new method of bandwidth selection for kernel density estimators is proposed. The method, termed indirect cross-validation, or ICV, makes use of so-called selection kernels. Least squares cross-validation (LSCV) is used to select the bandwidth of a selection-kernel estimator, and this bandwidth is appropriately rescaled for use in a Gaussian kernel estimator. The proposed selection kernels are linear combinations of two Gaussian kernels, and need not be unimodal or positive. Theory is developed showing that the relative error of ICV bandwidths can converge to 0 at a rate of $n^{-1/4}$, which is substantially better than the $n^{-1/10}$ rate of LSCV. Interestingly, the selection kernels that are best for purposes of bandwidth selection are very poor if used to actually estimate the density function. This property appears to be part of the larger and well-documented paradox to the effect that "the harder the estimation problem, the better cross-validation performs." The ICV method uniformly outperforms LSCV in a simulation study, a real data example, and a simulated example in which bandwidths are chosen locally.

Journal ArticleDOI
TL;DR: Karl Pearson played an enormous role in determining the content and organization of statistical research in his day, through his research, his teaching, his establishment of laboratories, and his initiation of a vast publishing program.
Abstract: Karl Pearson played an enormous role in determining the content and organization of statistical research in his day, through his research, his teaching, his establishment of laboratories, and his initiation of a vast publishing program. His technical contributions had initially and continue today to have a profound impact upon the work of both applied and theoretical statisticians, partly through their inadequately acknowledged influence upon Ronald A. Fisher. Particular attention is drawn to two of Pearson's major errors that nonetheless have left a positive and lasting impression upon the statistical world.

Posted Content
TL;DR: DB priors have simple forms and desirable properties and are often similar to other existing proposals like intrinsic priors; in normal linear model scenarios, they reproduce the Jeffreys-Zellner-Siow priors exactly.
Abstract: In this paper we introduce objective proper prior distributions for hypothesis testing and model selection based on measures of divergence between the competing models; we call them divergence based (DB) priors. DB priors have simple forms and desirable properties, like information (finite sample) consistency; often, they are similar to other existing proposals like the intrinsic priors; moreover, in normal linear model scenarios, they exactly reproduce Jeffreys-Zellner-Siow priors. Most importantly, in challenging scenarios such as irregular models and mixture models, the DB priors are well defined and very reasonable, while alternative proposals are not. We derive approximations to the DB priors as well as MCMC and asymptotic expressions for the associated Bayes factors.

Journal ArticleDOI
TL;DR: In this paper, a statistical perspective on boosting is presented, with special emphasis on estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis.
Abstract: We present a statistical perspective on boosting. Special emphasis is given to estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis. Concepts of degrees of freedom and corresponding Akaike or Bayesian information criteria, particularly useful for regularization and variable selection in high-dimensional covariate spaces, are discussed as well. The practical aspects of boosting procedures for fitting statistical models are illustrated by means of the dedicated open-source software package mboost. This package implements functions which can be used for model fitting, prediction and variable selection. It is flexible, allowing for the implementation of new boosting algorithms optimizing user-specified loss functions.

Posted Content
TL;DR: This work develops a Gaussian process prior constructed with the limiting linear model in mind, and shows that its practical benefits extend well beyond the computational and conceptual simplicity of the linear model.
Abstract: Gaussian processes retain the linear model either as a special case, or in the limit. We show how this relationship can be exploited when the data are at least partially linear. However, from the perspective of the Bayesian posterior, the Gaussian processes which encode the linear model either have probability of nearly zero or are otherwise unattainable without the explicit construction of a prior with the limiting linear model in mind. We develop such a prior, and show that its practical benefits extend well beyond the computational and conceptual simplicity of the linear model. For example, linearity can be extracted on a per-dimension basis, or can be combined with treed partition models to yield a highly efficient nonstationary model. Our approach is demonstrated on synthetic and real datasets of varying linearity and dimensionality.