
Showing papers in "Statistics and Computing in 2021"


Journal ArticleDOI
TL;DR: In this article, the authors review and advocate against the use of permute-and-predict (PaP) methods for interpreting black box functions, providing further demonstrations of their drawbacks along with a detailed explanation as to why they occur.
Abstract: This paper reviews and advocates against the use of permute-and-predict (PaP) methods for interpreting black box functions. Methods such as the variable importance measures proposed for random forests, partial dependence plots, and individual conditional expectation plots remain popular because they are both model-agnostic and depend only on the pre-trained model output, making them computationally efficient and widely available in software. However, numerous studies have found that these tools can produce diagnostics that are highly misleading, particularly when there is strong dependence among features. The purpose of our work here is to (i) review this growing body of literature, (ii) provide further demonstrations of these drawbacks along with a detailed explanation as to why they occur, and (iii) advocate for alternative measures that involve additional modeling. In particular, we describe how breaking dependencies between features in hold-out data places undue emphasis on sparse regions of the feature space by forcing the original model to extrapolate to regions where there is little to no data. We explore these effects across various model setups and find support for previous claims in the literature that PaP metrics can vastly over-emphasize correlated features in both variable importance measures and partial dependence plots. As an alternative, we discuss and recommend more direct approaches that involve measuring the change in model performance after muting the effects of the features under investigation.
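To make the contrast concrete, here is a minimal sketch (not from the paper; the synthetic data, random forest model and scikit-learn calls are illustrative) comparing permute-and-predict importance against a remove-and-refit importance that measures the change in model performance after muting a feature, in the presence of two strongly correlated features:

```python
# Sketch contrasting permute-and-predict importance with a remove-and-refit
# alternative; dataset, model choice and settings are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)  # strongly correlated with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x3 + 0.1 * rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
full = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# Permute-and-predict (PaP) importance: breaks the x1-x2 dependence in hold-out data.
pap = permutation_importance(full, X_te, y_te, n_repeats=10, random_state=0)
print("PaP importances:", pap.importances_mean)

# Remove-and-refit importance: drop in test R^2 after muting each feature.
base = full.score(X_te, y_te)
for j in range(X.shape[1]):
    keep = [k for k in range(X.shape[1]) if k != j]
    reduced = RandomForestRegressor(random_state=0).fit(X_tr[:, keep], y_tr)
    print(f"drop in R^2 without feature {j}: {base - reduced.score(X_te[:, keep], y_te):.3f}")
```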

29 citations


Journal ArticleDOI
TL;DR: In this paper, a single-pass algorithm for computing the gradient and Fisher information of Vecchia's Gaussian process loglikelihood approximation is proposed, which provides a computationally efficient means for applying the Fisher scoring algorithm for maximizing the log likelihood.
Abstract: We derive a single-pass algorithm for computing the gradient and Fisher information of Vecchia’s Gaussian process loglikelihood approximation, which provides a computationally efficient means for applying the Fisher scoring algorithm for maximizing the loglikelihood. The advantages of the optimization techniques are demonstrated in numerical examples and in an application to Argo ocean temperature data. The new methods find the maximum likelihood estimates much faster and more reliably than an optimization method that uses only function evaluations, especially when the covariance function has many parameters. This allows practitioners to fit nonstationary models to large spatial and spatial–temporal datasets.
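As background, a generic Fisher scoring loop of the kind being accelerated looks roughly as follows; loglik, grad_loglik and fisher_info are placeholders standing in for the paper's single-pass Vecchia-based computations:

```python
# Generic Fisher scoring loop: theta <- theta + I(theta)^{-1} grad(theta).
# loglik, grad_loglik and fisher_info are placeholders for the paper's
# single-pass Vecchia computations; step halving guards against overshooting.
import numpy as np

def fisher_scoring(theta0, loglik, grad_loglik, fisher_info, tol=1e-6, max_iter=100):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad_loglik(theta)
        I = fisher_info(theta)
        step = np.linalg.solve(I, g)
        # simple step-halving so that the loglikelihood does not decrease
        alpha, ll_old = 1.0, loglik(theta)
        while loglik(theta + alpha * step) < ll_old and alpha > 1e-4:
            alpha *= 0.5
        theta = theta + alpha * step
        if np.linalg.norm(alpha * step) < tol:
            break
    return theta
```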

19 citations


Journal ArticleDOI
TL;DR: In this paper, the authors analyzed the continuous version of EKI, a coupled SDE system, and proved the mean-field limit of this SDE system for both the linear and the weakly nonlinear case.
Abstract: Ensemble Kalman inversion (EKI) has been a very popular algorithm used in Bayesian inverse problems (Iglesias et al. in Inverse Probl 29: 045001, 2013). It samples particles from a prior distribution and introduces a motion to move the particles around in pseudo-time. As the pseudo-time goes to infinity, the method finds the minimizer of the objective function, and when the pseudo-time stops at 1, the ensemble distribution of the particles resembles, in some sense, the posterior distribution in the linear setting. The ideas trace back further to the ensemble Kalman filter and the associated analysis (Evensen in J Geophys Res: Oceans 99: 10143–10162, 1994; Reich in BIT Numer Math 51: 235–249, 2011), but to this day it is still largely unknown why EKI works when viewed as a sampling method, and in what sense and at what rate the method converges. In this paper, we analyze the continuous version of EKI, a coupled SDE system, and prove the mean-field limit of this SDE system. In particular, we show that 1. as the number of particles goes to infinity, the empirical measure of particles following the SDE converges to the solution to a Fokker–Planck equation in Wasserstein 2-distance with an optimal rate, for both the linear and weakly nonlinear cases; 2. the solution to the Fokker–Planck equation reconstructs the target distribution in finite time in the linear case, as suggested in Iglesias et al. (Inverse Probl 29: 045001, 2013).
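A minimal discrete-time EKI sketch for a linear forward map (synthetic operator, data and ensemble size; not the paper's continuous SDE analysis) illustrates the particle system whose mean-field limit is studied:

```python
# Minimal ensemble Kalman inversion (EKI) sketch for a linear forward map G(u) = A u.
# Forward map, noise level and true parameter are synthetic; perturbed observations
# are used at each pseudo-time step.
import numpy as np

rng = np.random.default_rng(1)
d, m, J = 5, 10, 200                      # parameter dim, data dim, ensemble size
A = rng.normal(size=(m, d))
u_true = rng.normal(size=d)
Gamma = 0.1 * np.eye(m)                   # observational noise covariance
y = A @ u_true + rng.multivariate_normal(np.zeros(m), Gamma)

U = rng.normal(size=(J, d))               # prior ensemble
for _ in range(20):                       # pseudo-time steps
    W = U @ A.T                           # forward-mapped ensemble G(u_j)
    Ubar, Wbar = U.mean(axis=0), W.mean(axis=0)
    Cuw = (U - Ubar).T @ (W - Wbar) / J   # cross-covariance
    Cww = (W - Wbar).T @ (W - Wbar) / J   # covariance of the mapped ensemble
    Y = y + rng.multivariate_normal(np.zeros(m), Gamma, size=J)  # perturbed data
    K = Cuw @ np.linalg.inv(Cww + Gamma)  # Kalman-type gain
    U = U + (Y - W) @ K.T

print("ensemble mean:", U.mean(axis=0))
print("true u       :", u_true)
```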

17 citations


Journal ArticleDOI
TL;DR: In this paper, the maximum a posteriori estimate of a Gauss-Markov prior with an iterated extended Kalman smoother was studied under mild conditions on the vector field and convergence rates were obtained via nonlinear analysis and scattered data approximation.
Abstract: There is a growing interest in probabilistic numerical solutions to ordinary differential equations. In this paper, the maximum a posteriori estimate is studied under the class of $$\nu $$ times differentiable linear time-invariant Gauss–Markov priors, which can be computed with an iterated extended Kalman smoother. The maximum a posteriori estimate corresponds to an optimal interpolant in the reproducing kernel Hilbert space associated with the prior, which in the present case is equivalent to a Sobolev space of smoothness $$\nu + 1$$. Subject to mild conditions on the vector field, convergence rates of the maximum a posteriori estimate are then obtained via methods from nonlinear analysis and scattered data approximation. These results closely resemble classical convergence results in the sense that a $$\nu $$ times differentiable prior process obtains a global order of $$\nu $$, which is demonstrated in numerical examples.

14 citations


Journal ArticleDOI
TL;DR: In this article, the authors consider the problem of estimating a signal from its scaled, cyclically shifted and noisy observations, and derive a procedure which is proved to consistently estimate the signal.
Abstract: In recent years, there has been a growing need for processing methods aimed at extracting useful information from large datasets. In many cases, the challenge is to discover a low-dimensional structure in the data, often concealed by the existence of nuisance parameters and noise. Motivated by such challenges, we consider the problem of estimating a signal from its scaled, cyclically shifted and noisy observations. We focus on the particularly challenging regime of low signal-to-noise ratio (SNR), where different observations cannot be shift-aligned. We show that an accurate estimation of the signal from its noisy observations is possible, and derive a procedure which is proved to consistently estimate the signal. The asymptotic sample complexity (the number of observations required to recover the signal) of the procedure is $$1/\text {SNR}^4$$. Additionally, we propose a procedure which is experimentally shown to improve the sample complexity by a factor equal to the signal’s length. Finally, we present numerical experiments which demonstrate the performance of our algorithms and corroborate our theoretical findings.

13 citations


Journal ArticleDOI
TL;DR: In this article, the authors develop a new methodology to unbiasedly estimate the gradient of the log-likelihood with respect to the unknown parameter, i.e. the expectation of the estimate has no discretization bias.
Abstract: We consider the problem of estimating a parameter $$\theta \in \Theta \subseteq {\mathbb {R}}^{d_{\theta }}$$ associated with a Bayesian inverse problem. Typically one must resort to a numerical approximation of the gradient of the log-likelihood and also adopt a discretization of the problem in space and/or time. We develop a new methodology to unbiasedly estimate the gradient of the log-likelihood with respect to the unknown parameter, i.e. the expectation of the estimate has no discretization bias. Such a property is not only useful for estimation in terms of the original stochastic model of interest, but can be used in stochastic gradient algorithms which benefit from unbiased estimates. Under appropriate assumptions, we prove that our estimator is not only unbiased but of finite variance. In addition, when implemented on a single processor, we show that the cost to achieve a given level of error is comparable to multilevel Monte Carlo methods, both practically and theoretically. However, the new algorithm is highly amenable to parallel computation.

11 citations


Journal ArticleDOI
TL;DR: In this article, the proposed methodology hybridizes global inducing point and data subset-based local GP approximation; a cascade of strategies for planning the selection of local inducing points is provided, and comparisons are drawn to related methodology with emphasis on computer surrogate modeling applications.
Abstract: Gaussian processes (GPs) serve as flexible surrogates for complex surfaces, but buckle under the cubic cost of matrix decompositions with big training data sizes. Geospatial and machine learning communities suggest pseudo-inputs, or inducing points, as one strategy to obtain an approximation easing that computational burden. However, we show how placement of inducing points and their multitude can be thwarted by pathologies, especially in large-scale dynamic response surface modeling tasks. As remedy, we suggest porting the inducing point idea, which is usually applied globally, over to a more local context where selection is both easier and faster. In this way, our proposed methodology hybridizes global inducing point and data subset-based local GP approximation. A cascade of strategies for planning the selection of local inducing points is provided, and comparisons are drawn to related methodology with emphasis on computer surrogate modeling applications. We show that local inducing points extend their global and data subset component parts on the accuracy–computational efficiency frontier. Illustrative examples are provided on benchmark data and a large-scale real-simulation satellite drag interpolation problem.

10 citations


Journal ArticleDOI
TL;DR: The Ensemble Slice Sampling (ESS) algorithm as discussed by the authors adapts to the characteristics of the target distribution with minimal hand-tuning by using an ensemble of parallel walkers in order to handle strong correlations between parameters.
Abstract: Slice sampling has emerged as a powerful Markov Chain Monte Carlo algorithm that adapts to the characteristics of the target distribution with minimal hand-tuning. However, Slice Sampling’s performance is highly sensitive to the user-specified initial length scale hyperparameter and the method generally struggles with poorly scaled or strongly correlated distributions. This paper introduces Ensemble Slice Sampling (ESS), a new class of algorithms that bypasses such difficulties by adaptively tuning the initial length scale and utilising an ensemble of parallel walkers in order to efficiently handle strong correlations between parameters. These affine-invariant algorithms are trivial to construct, require no hand-tuning, and can easily be implemented in parallel computing environments. Empirical tests show that Ensemble Slice Sampling can improve efficiency by more than an order of magnitude compared to conventional MCMC methods on a broad range of highly correlated target distributions. In cases of strongly multimodal target distributions, Ensemble Slice Sampling can sample efficiently even in high dimensions. We argue that the parallel, black-box and gradient-free nature of the method renders it ideal for use in scientific fields such as physics, astrophysics and cosmology which are dominated by a wide variety of computationally expensive and non-differentiable models.
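For orientation, the basic univariate slice sampler with stepping-out and shrinkage, whose sensitivity to the initial length scale motivates the ensemble scheme, can be sketched as follows (standard normal target and all settings illustrative):

```python
# Minimal univariate slice sampler (stepping-out and shrinkage), the basic
# building block that Ensemble Slice Sampling adapts; target and initial
# length scale w are illustrative.
import numpy as np

def slice_sample(logp, x0, w=1.0, n_samples=5000, rng=None):
    rng = rng or np.random.default_rng()
    x, out = x0, []
    for _ in range(n_samples):
        log_y = logp(x) + np.log(rng.uniform())      # slice level
        # stepping out
        L = x - w * rng.uniform()
        R = L + w
        while logp(L) > log_y:
            L -= w
        while logp(R) > log_y:
            R += w
        # shrinkage
        while True:
            x_new = rng.uniform(L, R)
            if logp(x_new) > log_y:
                x = x_new
                break
            if x_new < x:
                L = x_new
            else:
                R = x_new
        out.append(x)
    return np.array(out)

samples = slice_sample(lambda x: -0.5 * x**2, x0=0.0)
print(samples.mean(), samples.std())  # roughly 0 and 1 for a standard normal target
```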

10 citations


Journal ArticleDOI
TL;DR: In this article, the unadjusted Langevin algorithm is used to construct the stochastic approximation, which leads to a highly efficient algorithm with favorable convergence properties that can be quantified explicitly and easily checked.
Abstract: Stochastic approximation methods play a central role in maximum likelihood estimation problems involving intractable likelihood functions, such as marginal likelihoods arising in problems with missing or incomplete data, and in parametric empirical Bayesian estimation. Combined with Markov chain Monte Carlo algorithms, these stochastic optimisation methods have been successfully applied to a wide range of problems in science and industry. However, this strategy scales poorly to large problems because of methodological and theoretical difficulties related to using high-dimensional Markov chain Monte Carlo algorithms within a stochastic approximation scheme. This paper proposes to address these difficulties by using unadjusted Langevin algorithms to construct the stochastic approximation. This leads to a highly efficient stochastic optimisation methodology with favourable convergence properties that can be quantified explicitly and easily checked. The proposed methodology is demonstrated with three experiments, including a challenging application to statistical audio analysis and a sparse Bayesian logistic regression with random effects problem.
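A schematic sketch of the general idea on a toy model (latent Gaussian variables with unknown prior mean; the model and step sizes are illustrative, and this is not the paper's algorithm verbatim): an unadjusted Langevin chain on the latent variables is interleaved with a stochastic-approximation update of the parameter via Fisher's identity.

```python
# Schematic Langevin-within-stochastic-approximation scheme on a toy model:
# latent x_i ~ N(theta, 1), observation y_i ~ N(x_i, sigma^2).  An unadjusted
# Langevin step on the latents is interleaved with a stochastic-approximation
# step on theta; the exact marginal MLE here is mean(y), which the scheme
# should approach.
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 0.5
n = 50
theta_true = 2.0
x_true = theta_true + rng.normal(size=n)
y = x_true + np.sqrt(sigma2) * rng.normal(size=n)

theta, x = 0.0, np.zeros(n)
gamma, delta = 0.05, 0.01                    # Langevin and stochastic-approximation step sizes
for k in range(5000):
    # unadjusted Langevin step targeting p(x | y, theta)
    grad_x = -(x - theta) - (x - y) / sigma2
    x = x + gamma * grad_x + np.sqrt(2 * gamma) * rng.normal(size=n)
    # stochastic approximation step on theta using grad_theta log p(y, x | theta)
    theta = theta + delta * np.mean(x - theta)

print("estimated theta:", theta, "  marginal MLE:", y.mean())
```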

10 citations


Journal ArticleDOI
TL;DR: The Zig-Zag sampler as mentioned in this paper is a rejection-free sampling scheme based on a non-reversible continuous piecewise deterministic Markov process, which expands the diffusion path in a truncated Faber-Schauder basis.
Abstract: We introduce the use of the Zig-Zag sampler to the problem of sampling conditional diffusion processes (diffusion bridges). The Zig-Zag sampler is a rejection-free sampling scheme based on a non-reversible continuous piecewise deterministic Markov process. Similar to the Lévy–Ciesielski construction of a Brownian motion, we expand the diffusion path in a truncated Faber–Schauder basis. The coefficients within the basis are sampled using a Zig-Zag sampler. A key innovation is the use of the fully local algorithm for the Zig-Zag sampler, which makes it possible to exploit the sparsity structure implied by the dependency graph of the coefficients and by the subsampling technique to reduce the complexity of the algorithm. We illustrate the performance of the proposed methods in a number of examples.
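To illustrate the rejection-free switching mechanism in its simplest form, here is a one-dimensional Zig-Zag sampler for a standard normal target (illustrative only; the paper applies the fully local version to the Faber–Schauder coefficients of diffusion bridges):

```python
# 1D Zig-Zag sampler for a standard normal target (potential U(x) = x^2/2).
# The velocity v flips at event times drawn exactly by inverting the integrated
# switching rate max(0, v x); positions along the piecewise-linear path are
# sub-sampled on a regular time grid.  Purely illustrative of the mechanism.
import numpy as np

rng = np.random.default_rng(3)
x, v, t = 0.0, 1.0, 0.0
T, dt = 20000.0, 0.5
grid, samples = dt, []

while t < T:
    a = v * x
    E = rng.exponential()
    tau = -a + np.sqrt(max(a, 0.0) ** 2 + 2.0 * E)   # exact next event time
    # record positions at grid times falling inside this linear segment
    while grid <= t + tau and grid <= T:
        samples.append(x + v * (grid - t))
        grid += dt
    x, t = x + v * tau, t + tau
    v = -v                                            # velocity flip at the event

samples = np.array(samples)
print(samples.mean(), samples.std())  # roughly 0 and 1
```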

9 citations


Journal ArticleDOI
TL;DR: In this paper, the authors show that the filtering, predictive and smoothing distributions in dynamic probit models with Gaussian state variables are, in fact, available and belong to a class of unified skew-normals (sun) whose parameters can be updated recursively in time via analytical expressions.
Abstract: Non-Gaussian state-space models arise in several applications, and within this framework the binary time series setting provides a relevant example. However, unlike for Gaussian state-space models — where filtering, predictive and smoothing distributions are available in closed form — binary state-space models require approximations or sequential Monte Carlo strategies for inference and prediction. This is due to the apparent absence of conjugacy between the Gaussian states and the likelihood induced by the observation equation for the binary data. In this article we prove that the filtering, predictive and smoothing distributions in dynamic probit models with Gaussian state variables are, in fact, available and belong to a class of unified skew-normals (sun) whose parameters can be updated recursively in time via analytical expressions. Also the key functionals of these distributions are, in principle, available, but their calculation requires the evaluation of multivariate Gaussian cumulative distribution functions. Leveraging sun properties, we address this issue via novel Monte Carlo methods based on independent samples from the smoothing distribution, that can easily be adapted to the filtering and predictive case, thus improving state-of-the-art approximate and sequential Monte Carlo inference in small-to-moderate dimensional studies. Novel sequential Monte Carlo procedures that exploit the sun properties are also developed to deal with online inference in high dimensions. Performance gains over competitors are outlined in a financial application.

Journal ArticleDOI
TL;DR: The hpHawkes package, as discussed by the authors, provides a high-performance computing framework for Bayesian analysis of big gunshot data generated in Washington D.C. between 2006 and 2019, thereby extending a past analysis of the same data from under 10,000 to over 85,000 observations.
Abstract: The Hawkes process and its extensions effectively model self-excitatory phenomena including earthquakes, viral pandemics, financial transactions, neural spike trains and the spread of memes through social networks. The usefulness of these stochastic process models within a host of economic sectors and scientific disciplines is undercut by the processes’ computational burden: complexity of likelihood evaluations grows quadratically in the number of observations for both the temporal and spatiotemporal Hawkes processes. We show that, with care, one may parallelize these calculations using both central and graphics processing unit implementations to achieve over 100-fold speedups over single-core processing. Using a simple adaptive Metropolis–Hastings scheme, we apply our high-performance computing framework to a Bayesian analysis of big gunshot data generated in Washington D.C. between the years of 2006 and 2019, thereby extending a past analysis of the same data from under 10,000 to over 85,000 observations. To encourage widespread use, we provide hpHawkes, an open-source R package, and discuss high-level implementation and program design for leveraging aspects of computational hardware that become necessary in a big data setting.
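The quadratic-cost computation being parallelized can be seen in a naive univariate temporal Hawkes log-likelihood with exponential kernel (synthetic event times and parameter values; the exponential kernel is an illustrative choice, not necessarily the paper's):

```python
# Naive O(n^2) log-likelihood of a univariate temporal Hawkes process with
# excitation kernel alpha * beta * exp(-beta * dt); this per-evaluation cost is
# what CPU/GPU parallelization targets.  Event times and parameters are synthetic.
import numpy as np

def hawkes_loglik(times, T, mu, alpha, beta):
    times = np.asarray(times)
    ll = 0.0
    for i, t in enumerate(times):
        past = times[:i]
        lam = mu + np.sum(alpha * beta * np.exp(-beta * (t - past)))  # intensity at t
        ll += np.log(lam)
    # compensator: integral of the intensity over [0, T]
    ll -= mu * T + np.sum(alpha * (1.0 - np.exp(-beta * (T - times))))
    return ll

rng = np.random.default_rng(4)
toy_times = np.sort(rng.uniform(0, 100, size=500))
print(hawkes_loglik(toy_times, T=100.0, mu=1.0, alpha=0.5, beta=2.0))
```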

Journal ArticleDOI
TL;DR: In this paper, the adaptively scaled independence sampler is applied to logistic regression and accelerated failure time models, in combination with data augmentation, Laplace approximation and the correlated pseudo-marginal method.
Abstract: Bayesian variable selection is an important method for discovering variables which are most useful for explaining the variation in a response. The widespread use of this method has been restricted by the challenging computational problem of sampling from the corresponding posterior distribution. Recently, the use of adaptive Monte Carlo methods has been shown to lead to performance improvement over traditionally used algorithms in linear regression models. This paper looks at applying one of these algorithms (the adaptively scaled independence sampler) to logistic regression and accelerated failure time models. We investigate the use of this algorithm with data augmentation, Laplace approximation and the correlated pseudo-marginal method. The performance of the algorithms is compared on several genomic data sets.

Journal ArticleDOI
TL;DR: MOTR-BART as mentioned in this paper is an extension of BART that considers piecewise linear functions at node level instead of piecewise constants: rather than having a unique value at node level for the prediction, a linear predictor is estimated using the covariates that have been used as split variables in the corresponding tree.
Abstract: Bayesian additive regression trees (BART) is a tree-based machine learning method that has been successfully applied to regression and classification problems. BART assumes regularisation priors on a set of trees that work as weak learners and is very flexible for predicting in the presence of nonlinearity and high-order interactions. In this paper, we introduce an extension of BART, called model trees BART (MOTR-BART), that considers piecewise linear functions at node levels instead of piecewise constants. In MOTR-BART, rather than having a unique value at node level for the prediction, a linear predictor is estimated considering the covariates that have been used as the split variables in the corresponding tree. In our approach, local linearities are captured more efficiently and fewer trees are required to achieve equal or better performance than BART. Via simulation studies and real data applications, we compare MOTR-BART to its main competitors. R code for MOTR-BART implementation is available at https://github.com/ebprado/MOTR-BART .

Journal ArticleDOI
TL;DR: In this paper, a delayed-acceptance kernel for Markov chain Monte Carlo (MCMC) is proposed to reduce the number of expensive likelihood evaluations required to approximate a posterior expectation.
Abstract: Delayed-acceptance is a technique for reducing computational effort for Bayesian models with expensive likelihoods. Using a delayed-acceptance kernel for Markov chain Monte Carlo can reduce the number of expensive likelihood evaluations required to approximate a posterior expectation. Delayed-acceptance uses a surrogate, or approximate, likelihood to avoid evaluation of the expensive likelihood when possible. Within the sequential Monte Carlo framework, we utilise the history of the sampler to adaptively tune the surrogate likelihood to yield better approximations of the expensive likelihood and use a surrogate-first annealing schedule to further increase computational efficiency. Moreover, we propose a framework for optimising computation time whilst avoiding particle degeneracy, which encapsulates existing strategies in the literature. Overall, we develop a novel algorithm for computationally efficient SMC with expensive likelihood functions. The method is applied to static Bayesian models, which we demonstrate on toy and real examples.
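A minimal delayed-acceptance Metropolis–Hastings step of the kind embedded in such samplers (in the style of Christen and Fox; the flat prior, surrogate and exact log-likelihoods, and random-walk proposal below are placeholders, not the paper's models) looks as follows:

```python
# Minimal delayed-acceptance Metropolis-Hastings step: a cheap surrogate
# log-likelihood screens proposals before the expensive log-likelihood is
# evaluated.  log_prior, log_lik_cheap, log_lik_expensive and the random-walk
# scale are placeholders.
import numpy as np

def da_mh_step(theta, scale, log_prior, log_lik_cheap, log_lik_expensive, rng):
    prop = theta + scale * rng.normal(size=theta.shape)   # symmetric proposal
    # Stage 1: accept/reject using the surrogate posterior only
    log_a1 = (log_prior(prop) + log_lik_cheap(prop)) - (log_prior(theta) + log_lik_cheap(theta))
    if np.log(rng.uniform()) >= log_a1:
        return theta, False                               # rejected cheaply
    # Stage 2: correct with the expensive likelihood; the surrogate ratio is divided back out
    log_a2 = (log_lik_expensive(prop) - log_lik_expensive(theta)) \
             - (log_lik_cheap(prop) - log_lik_cheap(theta))
    if np.log(rng.uniform()) < log_a2:
        return prop, True
    return theta, False

# toy demo: exact target N(0, 1) with flat prior, surrogate deliberately mis-scaled
rng = np.random.default_rng(11)
theta, chain = np.zeros(1), []
for _ in range(5000):
    theta, _ = da_mh_step(theta, 1.0,
                          log_prior=lambda t: 0.0,
                          log_lik_cheap=lambda t: -0.5 * float(t @ t) / 1.5,
                          log_lik_expensive=lambda t: -0.5 * float(t @ t),
                          rng=rng)
    chain.append(theta[0])
print(np.mean(chain), np.std(chain))  # roughly 0 and 1
```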

Journal ArticleDOI
TL;DR: In this article, a novel method for data imputation in multivariate non-stationary time series, based on the so-called locally stationary wavelet modelling paradigm, is introduced.
Abstract: Many multivariate time series observed in practice are second order nonstationary, i.e. their covariance properties vary over time. In addition, missing observations in such data are encountered in many applications of interest, due to recording failures or sensor dropout, hindering successful analysis. This article introduces a novel method for data imputation in multivariate nonstationary time series, based on the so-called locally stationary wavelet modelling paradigm. Our methodology is shown to perform well across a range of simulation scenarios, with a variety of missingness structures, as well as being competitive in the stationary time series setting. We also demonstrate our technique on data arising in a health monitoring application.

Journal ArticleDOI
TL;DR: In this paper, the optimised adaptive importance samplers (OAIS) are investigated and nonasymptotic error bounds for the mean squared errors (MSEs) of these adaptive Monte Carlo algorithms are proved, depending explicitly on the number of iterations and the number of samples.
Abstract: Adaptive importance samplers are adaptive Monte Carlo algorithms to estimate expectations with respect to some target distribution which adapt themselves to obtain better estimators over a sequence of iterations. Although it is straightforward to show that they have the same $$\mathcal {O}(1/\sqrt{N})$$ convergence rate as standard importance samplers, where N is the number of Monte Carlo samples, the behaviour of adaptive importance samplers over the number of iterations has been left relatively unexplored. In this work, we investigate an adaptation strategy based on convex optimisation which leads to a class of adaptive importance samplers termed optimised adaptive importance samplers (OAIS). These samplers rely on the iterative minimisation of the $$\chi ^2$$ -divergence between an exponential family proposal and the target. The analysed algorithms are closely related to the class of adaptive importance samplers which minimise the variance of the weight function. We first prove non-asymptotic error bounds for the mean squared errors (MSEs) of these algorithms, which explicitly depend on the number of iterations and the number of samples together. The non-asymptotic bounds derived in this paper imply that when the target belongs to the exponential family, the $$L_2$$ errors of the optimised samplers converge to the optimal rate of $$\mathcal {O}(1/\sqrt{N})$$ and the rate of convergence in the number of iterations are explicitly provided. When the target does not belong to the exponential family, the rate of convergence is the same but the asymptotic $$L_2$$ error increases by a factor $$\sqrt{\rho ^\star } > 1$$ , where $$\rho ^\star - 1$$ is the minimum $$\chi ^2$$ -divergence between the target and an exponential family proposal.

Journal ArticleDOI
TL;DR: In this paper, the authors propose energetic variational inference (EVI), a new variational inference framework that minimizes the VI objective function based on a prescribed energy-dissipation law, and use it to derive new particle-based schemes.
Abstract: We introduce a new variational inference (VI) framework, called energetic variational inference (EVI). It minimizes the VI objective function based on a prescribed energy-dissipation law. Using the EVI framework, we can derive many existing particle-based variational inference (ParVI) methods, including the popular Stein variational gradient descent (SVGD). More importantly, many new ParVI schemes can be created under this framework. For illustration, we propose a new particle-based EVI scheme, which performs the particle-based approximation of the density first and then uses the approximated density in the variational procedure, or “Approximation-then-Variation” for short. Thanks to this order of approximation and variation, the new scheme can maintain the variational structure at the particle level, and can significantly decrease the KL-divergence in each iteration. Numerical experiments show the proposed method outperforms some existing ParVI methods in terms of fidelity to the target distribution.
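For reference, Stein variational gradient descent, cited above as one ParVI scheme recoverable from the EVI framework, can be sketched as follows (RBF kernel with median-heuristic bandwidth and a standard normal target, both illustrative):

```python
# Minimal Stein variational gradient descent (SVGD) sketch, one of the ParVI
# schemes the EVI framework recovers; RBF kernel with the median heuristic and
# a standard normal target, both purely illustrative.
import numpy as np

def svgd_step(X, grad_logp, stepsize=0.1):
    n = X.shape[0]
    diffs = X[:, None, :] - X[None, :, :]                 # pairwise differences x_j - x_i
    sq = np.sum(diffs ** 2, axis=-1)
    h = np.median(sq) / np.log(n + 1) + 1e-12             # median-heuristic bandwidth
    K = np.exp(-sq / h)                                   # RBF kernel matrix
    grad_K = -2.0 / h * diffs * K[:, :, None]             # d k(x_j, x_i) / d x_j
    phi = (K @ grad_logp(X) + grad_K.sum(axis=0)) / n     # SVGD update direction
    return X + stepsize * phi

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 2)) + 5.0                       # particles start far from the target
for _ in range(500):
    X = svgd_step(X, lambda Z: -Z)                        # target: standard normal
print(X.mean(axis=0), X.std(axis=0))
```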

Journal ArticleDOI
TL;DR: In this article, the authors investigate the performance of a class of block-adaptive particle filters (PFs) that can automatically tune their computational complexity by evaluating online certain predictive statistics which are invariant for a broad class of state-space models.
Abstract: We investigate the performance of a class of particle filters (PFs) that can automatically tune their computational complexity by evaluating online certain predictive statistics which are invariant for a broad class of state-space models. To be specific, we propose a family of block-adaptive PFs based on the methodology of Elvira et al. (IEEE Trans Signal Process 65(7):1781–1794, 2017). In this class of algorithms, the number of Monte Carlo samples (known as particles) is adjusted periodically, and we prove that the theoretical error bounds of the PF actually adapt to the updates in the number of particles. The evaluation of the predictive statistics that lies at the core of the methodology is done by generating fictitious observations, i.e., particles in the observation space. We study, both analytically and numerically, the impact of the number K of these particles on the performance of the algorithm. In particular, we prove that if the predictive statistics with K fictitious observations converged exactly, then the particle approximation of the filtering distribution would match the first K elements in a series of moments of the true filter. This result can be understood as a converse to some convergence theorems for PFs. From this analysis, we deduce an alternative predictive statistic that can be computed (for some models) without sampling any fictitious observations at all. Finally, we conduct an extensive simulation study that illustrates the theoretical results and provides further insights into the complexity, performance and behavior of the new class of algorithms.
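As a baseline for the block-adaptive variant, a basic bootstrap particle filter with a fixed number of particles on an illustrative linear-Gaussian state-space model can be sketched as follows:

```python
# Basic bootstrap particle filter with a fixed number of particles N, the
# non-adaptive baseline for block-adaptive schemes.  The linear-Gaussian
# state-space model below is purely illustrative.
import numpy as np

rng = np.random.default_rng(6)
T, N = 100, 500
a, sx, sy = 0.9, 1.0, 1.0

# simulate data from x_t = a x_{t-1} + sx * noise, y_t = x_t + sy * noise
x = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + sx * rng.normal()
y = x + sy * rng.normal(size=T)

particles = rng.normal(size=N)
means = []
for t in range(T):
    particles = a * particles + sx * rng.normal(size=N)        # propagate through the transition kernel
    logw = -0.5 * (y[t] - particles) ** 2 / sy**2              # Gaussian observation weights
    w = np.exp(logw - logw.max())
    w /= w.sum()
    means.append(np.sum(w * particles))                        # filtering mean estimate
    idx = rng.choice(N, size=N, p=w)                           # multinomial resampling
    particles = particles[idx]

print("filtering mean estimates at the last 5 steps:", np.round(means[-5:], 3))
```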

Journal ArticleDOI
TL;DR: In this article, a tile-low-rank representation of covariance matrices with a block-reordering scheme for efficient quasi-Monte Carlo simulation is presented. But the method is not suitable for high-dimensional skew-normal random fields.
Abstract: We present a preconditioned Monte Carlo method for computing high-dimensional multivariate normal and Student-t probabilities arising in spatial statistics. The approach combines a tile-low-rank representation of covariance matrices with a block-reordering scheme for efficient quasi-Monte Carlo simulation. The tile-low-rank representation decomposes the high-dimensional problem into many diagonal-block-size problems and low-rank connections. The block-reordering scheme reorders between and within the diagonal blocks to reduce the impact of integration variables from right to left, thus improving the Monte Carlo convergence rate. Simulations up to dimension 65,536 suggest that the new method can improve the run time by an order of magnitude compared with the hierarchical quasi-Monte Carlo method and two orders of magnitude compared with the dense quasi-Monte Carlo method. Our method also forms a strong substitute for the approximate conditioning methods as a more robust estimation with error guarantees. An application study to wind stochastic generators is provided to illustrate that the new computational method makes the maximum likelihood estimation feasible for high-dimensional skew-normal random fields.

Journal ArticleDOI
TL;DR: In this paper, the authors recast Fast Incremental Expectation Maximization (FIEM) and other incremental EM type algorithms in the Stochastic Approximation within EM framework and provided nonasymptotic bounds for the convergence in expectation as a function of the number of examples n and of the maximal number of iterations.
Abstract: Fast incremental expectation maximization (FIEM) is a version of the EM framework for large datasets. In this paper, we first recast FIEM and other incremental EM type algorithms in the Stochastic Approximation within EM framework. Then, we provide nonasymptotic bounds for the convergence in expectation as a function of the number of examples n and of the maximal number of iterations $$K_\mathrm {max}$$ . We propose two strategies for achieving an $$\epsilon $$ -approximate stationary point, respectively with $$K_\mathrm {max}= O(n^{2/3}/\epsilon )$$ and $$K_\mathrm {max}= O(\sqrt{n}/\epsilon ^{3/2})$$ , both strategies relying on a random termination rule before $$K_\mathrm {max}$$ and on a constant step size in the Stochastic Approximation step. Our bounds provide some improvements on the literature. First, they allow $$K_\mathrm {max}$$ to scale as $$\sqrt{n}$$ which is better than $$n^{2/3}$$ which was the best rate obtained so far; it is at the cost of a larger dependence upon the tolerance $$\epsilon $$ , thus making this control relevant for small to medium accuracy with respect to the number of examples n. Second, for the $$n^{2/3}$$ -rate, the numerical illustrations show that thanks to an optimized choice of the step size and of the bounds in terms of quantities characterizing the optimization problem at hand, our results design a less conservative choice of the step size and provide a better control of the convergence in expectation.

Journal ArticleDOI
TL;DR: In this paper, the authors consider the problem of learning the level set for which a noisy black-box function exceeds a given threshold, and investigate Gaussian process (GP) metamodels.
Abstract: We consider the problem of learning the level set for which a noisy black-box function exceeds a given threshold. To efficiently reconstruct the level set, we investigate Gaussian process (GP) metamodels. Our focus is on strongly stochastic samplers, in particular with heavy-tailed simulation noise and low signal-to-noise ratio. To guard against noise misspecification, we assess the performance of three variants: (i) GPs with Student-t observations; (ii) Student-t processes (TPs); and (iii) classification GPs modeling the sign of the response. In conjunction with these metamodels, we analyze several acquisition functions for guiding the sequential experimental designs, extending existing stepwise uncertainty reduction criteria to the stochastic contour-finding context. This also motivates our development of (approximate) updating formulas to efficiently compute such acquisition functions. Our schemes are benchmarked by using a variety of synthetic experiments in 1–6 dimensions. We also consider an application of level set estimation for determining the optimal exercise policy of Bermudan options in finance.

Journal ArticleDOI
TL;DR: Parsimonious Online Gaussian Processes (POG) as discussed by the authors is the first online Gaussian process approximation that preserves convergence to the population posterior, i.e., asymptotic posterior consistency, while ameliorating its intractable complexity growth with the sample size.
Abstract: Gaussian processes provide a framework for nonlinear nonparametric Bayesian inference widely applicable across science and engineering. Unfortunately, their computational burden scales cubically with the training sample size, which in the case that samples arrive in perpetuity, approaches infinity. This issue necessitates approximations for use with streaming data, which to date mostly lack convergence guarantees. Thus, we develop the first online Gaussian process approximation that preserves convergence to the population posterior, i.e., asymptotic posterior consistency, while ameliorating its intractable complexity growth with the sample size. We propose an online compression scheme that, following each a posteriori update, fixes an error neighborhood with respect to the Hellinger metric centered at the current posterior, and greedily tosses out past kernel dictionary elements until its boundary is hit. We call the resulting method Parsimonious Online Gaussian Processes (POG). For diminishing error radius, asymptotic statistical stationarity is achieved (Theorem 1ii) at the cost of unbounded memory in the limit. On the other hand, for constant error radius, POG converges to a neighborhood of stationarity (Theorem 1ii) but with finite memory at worst determined by the metric entropy of the feature space (Theorem 2). Here stationarity refers to the distributional distance between sequential marginal posteriors approaching null with the time index. Experimental results are presented on several nonlinear regression problems which illuminate the merits of this approach as compared with alternatives that fix the subspace dimension defining the history of past points.

Journal ArticleDOI
TL;DR: This paper presents and derives new expressions for the extension of an algorithm classically used for single-factor LMM parameter estimation, Fisher Scoring, to multiple, crossed-factor designs, and provides a new method for LMM Satterthwaite degrees of freedom estimation based on analytical results, which does not require iterative gradient estimation.
Abstract: The analysis of longitudinal, heterogeneous or unbalanced clustered data is of primary importance to a wide range of applications. The linear mixed model (LMM) is a popular and flexible extension of the linear model specifically designed for such purposes. Historically, a large proportion of material published on the LMM concerns the application of popular numerical optimization algorithms, such as Newton–Raphson, Fisher Scoring and expectation maximization to single-factor LMMs (i.e. LMMs that only contain one “factor” by which observations are grouped). However, in recent years, the focus of the LMM literature has moved towards the development of estimation and inference methods for more complex, multi-factored designs. In this paper, we present and derive new expressions for the extension of an algorithm classically used for single-factor LMM parameter estimation, Fisher Scoring, to multiple, crossed-factor designs. Through simulation and real data examples, we compare five variants of the Fisher Scoring algorithm with one another, as well as against a baseline established by the R package lme4, and find evidence of correctness and strong computational efficiency for four of the five proposed approaches. Additionally, we provide a new method for LMM Satterthwaite degrees of freedom estimation based on analytical results, which does not require iterative gradient estimation. Via simulation, we find that this approach produces estimates with both lower bias and lower variance than the existing methods.

Journal ArticleDOI
TL;DR: In this paper, the authors propose a general approach to achieve robustness in fitting generalized additive models for location, scale and shape (GAMLSS) by limiting the contribution of observations with low log-likelihood values.
Abstract: The validity of estimation and smoothing parameter selection for the wide class of generalized additive models for location, scale and shape (GAMLSS) relies on the correct specification of a likelihood function. Deviations from such assumption are known to mislead any likelihood-based inference and can hinder penalization schemes meant to ensure some degree of smoothness for nonlinear effects. We propose a general approach to achieve robustness in fitting GAMLSSs by limiting the contribution of observations with low log-likelihood values. Robust selection of the smoothing parameters can be carried out either by minimizing information criteria that naturally arise from the robustified likelihood or via an extended Fellner–Schall method. The latter allows for automatic smoothing parameter selection and is particularly advantageous in applications with multiple smoothing parameters. We also address the challenge of tuning robust estimators for models with nonlinear effects by proposing a novel median downweighting proportion criterion. This enables a fair comparison with existing robust estimators for the special case of generalized additive models, where our estimator competes favorably. The overall good performance of our proposal is illustrated by further simulations in the GAMLSS setting and by an application to functional magnetic resonance brain imaging using bivariate smoothing splines.

Journal ArticleDOI
TL;DR: In this paper, a state-space approach to deep Gaussian process regression is proposed, which is based on a non-linear hierarchical system of linear stochastic differential equations, where each SDE corresponds to a conditional GP.
Abstract: This paper is concerned with a state-space approach to deep Gaussian process (DGP) regression. We construct the DGP by hierarchically putting transformed Gaussian process (GP) priors on the length scales and magnitudes of the next level of Gaussian processes in the hierarchy. The idea of the state-space approach is to represent the DGP as a non-linear hierarchical system of linear stochastic differential equations (SDEs), where each SDE corresponds to a conditional GP. The DGP regression problem then becomes a state estimation problem, and we can estimate the state efficiently with sequential methods by using the Markov property of the state-space DGP. The computational complexity scales linearly with respect to the number of measurements. Based on this, we formulate state-space MAP as well as Bayesian filtering and smoothing solutions to the DGP regression problem. We demonstrate the performance of the proposed models and methods on synthetic non-stationary signals and apply the state-space DGP to detection of the gravitational waves from LIGO measurements.
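The single-layer building block can be illustrated as follows: GP regression with a Matérn-1/2 (exponential) kernel is exactly an Ornstein–Uhlenbeck state-space model, so its posterior mean can be computed by a Kalman filter in linear time (grid, kernel parameters and noise level below are illustrative; the paper stacks such layers hierarchically):

```python
# Kalman-filter solution of GP regression with a Matern-1/2 (exponential)
# kernel, i.e. the stationary Ornstein-Uhlenbeck process, so filtering is O(n).
# Grid, length-scale, magnitude and noise variance are illustrative.
import numpy as np

rng = np.random.default_rng(7)
t = np.linspace(0, 10, 200)
f_true = np.sin(t)
y = f_true + 0.3 * rng.normal(size=t.size)

ell, sigma2, R = 1.0, 1.0, 0.3**2          # length-scale, GP variance, noise variance
m, P = 0.0, sigma2                          # stationary prior at the first input
means = []
for k in range(t.size):
    if k > 0:
        dt = t[k] - t[k - 1]
        A = np.exp(-dt / ell)               # OU transition
        Q = sigma2 * (1 - np.exp(-2 * dt / ell))
        m, P = A * m, A * P * A + Q         # predict
    S = P + R                               # innovation variance
    K = P / S                               # Kalman gain
    m, P = m + K * (y[k] - m), (1 - K) * P  # update
    means.append(m)

print("RMSE of the filtered mean:", np.sqrt(np.mean((np.array(means) - f_true) ** 2)))
```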

Journal ArticleDOI
TL;DR: In this paper, a probabilistic uncertainty quantification for the unknown solution of a non-linear partial differential equation (PDE) is proposed, which is based on discretisation of the nonlinear differential operator.
Abstract: The numerical solution of differential equations can be formulated as an inference problem to which formal statistical approaches can be applied. However, nonlinear partial differential equations (PDEs) pose substantial challenges from an inferential perspective, most notably the absence of explicit conditioning formulae. This paper extends earlier work on linear PDEs to a general class of initial value problems specified by nonlinear PDEs, motivated by problems for which evaluations of the right-hand-side, initial conditions, or boundary conditions of the PDE have a high computational cost. The proposed method can be viewed as exact Bayesian inference under an approximate likelihood, which is based on discretisation of the nonlinear differential operator. Proof-of-concept experimental results demonstrate that meaningful probabilistic uncertainty quantification for the unknown solution of the PDE can be performed, while controlling the number of times the right-hand-side, initial and boundary conditions are evaluated. A suitable prior model for the solution of PDEs is identified using novel theoretical analysis of the sample path properties of Matérn processes, which may be of independent interest.

Journal ArticleDOI
TL;DR: In this article, a discrete method of constructing Gaussian Random Fields based on a combination of modified spectral representations, Fourier and Blob is proposed for Direct Numerical Simulations of the V-Langevin equations.
Abstract: We propose a novel discrete method of constructing Gaussian Random Fields based on a combination of modified spectral representations, Fourier and Blob. The method is intended for Direct Numerical Simulations of the V-Langevin equations. The latter are stereotypical descriptions of anomalous stochastic transport in various physical systems. From an Eulerian perspective, our method is designed to exhibit improved convergence rates. From a Lagrangian perspective, our method offers a pertinent description of particle trajectories in turbulent velocity fields: the exact Lagrangian invariant laws are well reproduced. From a computational perspective, the computing time is reduced by a factor of two in comparison with Fourier-like or Blob-like methods and an order of magnitude in comparison with FFT algorithms.
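For orientation, the classic randomized Fourier (spectral) representation that such methods build on approximates a homogeneous Gaussian random field by a sum of cosine modes with wavenumbers drawn from the spectral density; the sketch below uses a 2D field with squared-exponential covariance purely for illustration (it is not the paper's Fourier/Blob hybrid):

```python
# Classic randomized Fourier (spectral) approximation of a homogeneous Gaussian
# random field with covariance sigma^2 * exp(-|r|^2 / (2 ell^2)), whose spectral
# measure is Gaussian.  Illustrative only.
import numpy as np

rng = np.random.default_rng(8)
N_modes, sigma, ell = 512, 1.0, 0.5

omega = rng.normal(scale=1.0 / ell, size=(N_modes, 2))   # wavenumbers drawn from the spectral density
phi = rng.uniform(0, 2 * np.pi, size=N_modes)            # random phases

def grf(points):
    """Evaluate one realization of the field at an (n, 2) array of points."""
    return sigma * np.sqrt(2.0 / N_modes) * np.cos(points @ omega.T + phi).sum(axis=1)

# evaluate on a grid and check that the variance is roughly sigma^2
g = np.stack(np.meshgrid(np.linspace(0, 4, 64), np.linspace(0, 4, 64)), axis=-1).reshape(-1, 2)
field = grf(g)
print("empirical variance:", field.var(), " target:", sigma**2)
```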

Journal ArticleDOI
TL;DR: In this paper, Monte Carlo integration with variance reduction by means of control variates can be implemented by the ordinary least squares estimator for the intercept in a multiple linear regression model with the integrand as response and the control variates as covariates.
Abstract: Monte Carlo integration with variance reduction by means of control variates can be implemented by the ordinary least squares estimator for the intercept in a multiple linear regression model with the integrand as response and the control variates as covariates. Even without special knowledge on the integrand, significant efficiency gains can be obtained if the control variate space is sufficiently large. Incorporating a large number of control variates in the ordinary least squares procedure may however result in (i) a certain instability of the ordinary least squares estimator and (ii) a possibly prohibitive computation time. Regularizing the ordinary least squares estimator by preselecting appropriate control variates via the Lasso turns out to increase the accuracy without additional computational cost. The findings in the numerical experiment are confirmed by concentration inequalities for the integration error.
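The estimator described above is easy to sketch: the control-variate estimate of the integral is the OLS intercept from regressing the integrand on zero-mean control variates (the integrand exp(x) on [0,1] and the polynomial control variates below are illustrative; the Lasso preselection step is not shown):

```python
# Monte Carlo integration with control variates via OLS: the intercept of a
# regression of the integrand on zero-mean control variates is the
# control-variate estimator.  Integrand exp(x) on [0,1] and polynomial controls
# x^k - 1/(k+1) are illustrative.
import numpy as np

rng = np.random.default_rng(9)
n, K = 2000, 5
X = rng.uniform(size=n)
f = np.exp(X)                                        # integrand; true integral is e - 1

H = np.column_stack([X**k - 1.0 / (k + 1) for k in range(1, K + 1)])  # zero-mean controls
design = np.column_stack([np.ones(n), H])
beta, *_ = np.linalg.lstsq(design, f, rcond=None)

print("plain Monte Carlo  :", f.mean())
print("OLS control variate:", beta[0])
print("true value         :", np.e - 1)
```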

Journal ArticleDOI
TL;DR: This paper developed a deep version of mixtures of unigrams for unsupervised classification of very short documents with a large number of terms, by allowing for models with additional, deeper latent layers; the proposal is derived in a Bayesian framework.
Abstract: Mixtures of unigrams are one of the simplest and most efficient tools for clustering textual data, as they assume that documents related to the same topic have similar distributions of terms, naturally described by multinomials. When the classification task is particularly challenging, such as when the document-term matrix is high-dimensional and extremely sparse, a more composite representation can provide better insight into the grouping structure. In this work, we developed a deep version of mixtures of unigrams for the unsupervised classification of very short documents with a large number of terms, by allowing for models with additional, deeper latent layers; the proposal is derived in a Bayesian framework. The behavior of the deep mixtures of unigrams is empirically compared with that of other traditional and state-of-the-art methods, namely k-means with cosine distance, k-means with Euclidean distance on data transformed according to semantic analysis, partition around medoids, mixture of Gaussians on semantic-based transformed data, hierarchical clustering according to Ward’s method with cosine dissimilarity, latent Dirichlet allocation, mixtures of unigrams estimated via the EM algorithm, spectral clustering and affinity propagation clustering. The performance is evaluated in terms of both correct classification rate and Adjusted Rand Index. Simulation studies and real data analysis prove that going deep in clustering such data highly improves the classification accuracy.
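For context, the shallow model being extended, a mixture of unigrams (multinomial mixture) fitted by EM on a synthetic document-term matrix, can be sketched as follows (vocabulary size, number of clusters and data are illustrative):

```python
# EM for a (shallow) mixture of unigrams, i.e. a multinomial mixture over the
# document-term matrix, the base model that the deep Bayesian version extends.
# Vocabulary size, number of clusters and the synthetic counts are illustrative.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(10)
D, V, K = 300, 50, 3                                     # documents, terms, clusters
true_theta = rng.dirichlet(0.2 * np.ones(V), size=K)
z = rng.integers(K, size=D)
X = np.vstack([rng.multinomial(40, true_theta[k]) for k in z])   # document-term counts

pi = np.full(K, 1.0 / K)
theta = rng.dirichlet(np.ones(V), size=K)
for _ in range(100):
    # E-step: responsibilities proportional to pi_k * prod_v theta_kv^{x_dv}
    log_r = np.log(pi)[None, :] + X @ np.log(theta).T
    log_r -= logsumexp(log_r, axis=1, keepdims=True)
    r = np.exp(log_r)
    # M-step: update mixing weights and term distributions (with small smoothing)
    pi = r.mean(axis=0)
    theta = r.T @ X + 1e-6
    theta /= theta.sum(axis=1, keepdims=True)

print("estimated mixing weights:", np.round(np.sort(pi), 3))
```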