
Showing papers on "Gaussian process published in 2018"


Proceedings Article
20 Jun 2018
TL;DR: This paper introduces the Neural Tangent Kernel (NTK) formalism and presents a number of results that give insight into the dynamics of neural networks during training and into their generalization behaviour.
Abstract: At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit, thus connecting them to kernel methods. We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function (which maps input vectors to output vectors) follows the so-called kernel gradient associated with a new object, which we call the Neural Tangent Kernel (NTK). This kernel is central to describing the generalization features of ANNs. While the NTK is random at initialization and varies during training, in the infinite-width limit it converges to an explicit limiting kernel and stays constant during training. This makes it possible to study the training of ANNs in function space instead of parameter space. Convergence of the training can then be related to the positive-definiteness of the limiting NTK. We then focus on the setting of least-squares regression and show that in the infinite-width limit, the network function follows a linear differential equation during training. The convergence is fastest along the largest kernel principal components of the input data with respect to the NTK, hence suggesting a theoretical motivation for early stopping. Finally we study the NTK numerically, observe its behavior for wide networks, and compare it to the infinite-width limit.
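As an illustration of the object defined above (not code from the paper), the empirical NTK of a finite network is the Gram matrix of parameter gradients of the network outputs, NTK(x, x') = <df(x)/dtheta, df(x')/dtheta>. Below is a minimal NumPy sketch for a one-hidden-layer network with hand-derived gradients; the network, scalings, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny fully-connected network f(x) = w2 @ tanh(W1 @ x) with NTK-style scaling.
d_in, width = 3, 512
W1 = rng.standard_normal((width, d_in))
w2 = rng.standard_normal(width)

def forward_and_grads(x):
    """Return f(x) and the gradient of f with respect to all parameters."""
    pre = W1 @ x / np.sqrt(d_in)            # pre-activations
    h = np.tanh(pre)                        # hidden layer
    f = w2 @ h / np.sqrt(width)             # scalar output
    # Gradients computed by hand (chain rule) for this tiny network.
    grad_w2 = h / np.sqrt(width)
    grad_W1 = np.outer(w2 / np.sqrt(width) * (1 - np.tanh(pre) ** 2),
                       x / np.sqrt(d_in))
    return f, np.concatenate([grad_W1.ravel(), grad_w2])

def empirical_ntk(X):
    """NTK(x, x') = inner product of parameter gradients, for all pairs in X."""
    J = np.stack([forward_and_grads(x)[1] for x in X])
    return J @ J.T

X = rng.standard_normal((5, d_in))
print(empirical_ntk(X))   # 5x5 positive semi-definite kernel matrix
```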

1,787 citations


Proceedings Article
15 Feb 2018
TL;DR: The exact equivalence between infinitely wide deep networks and GPs is derived; test performance is found to increase as finite-width trained networks are made wider and more similar to a GP, so that GP predictions typically outperform those of finite-width networks.
Abstract: It has long been known that a single-layer fully-connected neural network with an i.i.d. prior over its parameters is equivalent to a Gaussian process (GP), in the limit of infinite network width. This correspondence enables exact Bayesian inference for infinite width neural networks on regression tasks by means of evaluating the corresponding GP. Recently, kernel functions which mimic multi-layer random neural networks have been developed, but only outside of a Bayesian framework. As such, previous work has not identified that these kernels can be used as covariance functions for GPs and allow fully Bayesian prediction with a deep neural network. In this work, we derive the exact equivalence between infinitely wide deep networks and GPs. We further develop a computationally efficient pipeline to compute the covariance function for these GPs. We then use the resulting GPs to perform Bayesian inference for wide deep neural networks on MNIST and CIFAR-10. We observe that trained neural network accuracy approaches that of the corresponding GP with increasing layer width, and that the GP uncertainty is strongly correlated with trained network prediction error. We further find that test performance increases as finite-width trained networks are made wider and more similar to a GP, and thus that GP predictions typically outperform those of finite-width networks. Finally we connect the performance of these GPs to the recent theory of signal propagation in random neural networks.
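The recursion behind this correspondence can be sketched compactly: for a fully-connected ReLU network, the layer-wise covariance has a closed form (the arc-cosine kernel). The following NumPy sketch of that recursion is a hedged stand-in, not the authors' pipeline; the depth and the variance hyperparameters sigma_w2 and sigma_b2 are illustrative.

```python
import numpy as np

def nngp_kernel(X, depth=3, sigma_w2=2.0, sigma_b2=0.1):
    """Recursive NNGP covariance for an infinitely wide ReLU network (sketch).

    K^0(x, x') = sigma_b2 + sigma_w2 * x.x'/d, and for each layer
    K^{l+1}(x, x') = sigma_b2 + sigma_w2 * E[relu(u) relu(v)],
    where (u, v) is Gaussian with covariance K^l and the expectation has a
    closed form for ReLU (arc-cosine kernel).
    """
    d = X.shape[1]
    K = sigma_b2 + sigma_w2 * (X @ X.T) / d
    for _ in range(depth):
        std = np.sqrt(np.diag(K))
        norm = np.outer(std, std)
        cos_theta = np.clip(K / norm, -1.0, 1.0)
        theta = np.arccos(cos_theta)
        # Closed-form E[relu(u) relu(v)] for a bivariate Gaussian:
        expect = norm * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2 * np.pi)
        K = sigma_b2 + sigma_w2 * expect
    return K

X = np.random.default_rng(1).standard_normal((4, 10))
print(nngp_kernel(X))
```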

757 citations


Journal ArticleDOI
TL;DR: This tutorial introduces the reader to Gaussian process regression as an expressive tool to model, actively explore and exploit unknown functions and describes a situation modelling risk-averse exploration in which an additional constraint needs to be accounted for.
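For readers new to the topic, the core computation the tutorial builds on is the GP regression posterior, which is available in closed form. A minimal sketch with an RBF kernel and synthetic data; the kernel, noise level, and data are illustrative assumptions.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel on 1-D inputs."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean and variance of GP regression with an RBF kernel."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf(x_train, x_test)
    K_ss = rbf(x_test, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v ** 2, axis=0)
    return mean, var

x = np.linspace(0, 5, 20)
y = np.sin(x) + 0.1 * np.random.default_rng(2).standard_normal(20)
mu, var = gp_posterior(x, y, np.linspace(0, 5, 50))
```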

585 citations


Posted Content
TL;DR: This work presents an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set, and reduces the computation time of self-attention from quadratic to linear in the number of elements in the set.
Abstract: Many machine learning tasks such as multiple instance learning, 3D shape recognition, and few-shot image classification are defined on sets of instances. Since solutions to such problems do not depend on the order of elements of the set, models used to address them should be permutation invariant. We present an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set. The model consists of an encoder and a decoder, both of which rely on attention mechanisms. In an effort to reduce computational complexity, we introduce an attention scheme inspired by inducing point methods from sparse Gaussian process literature. It reduces the computation time of self-attention from quadratic to linear in the number of elements in the set. We show that our model is theoretically attractive and we evaluate it on a range of tasks, demonstrating the state-of-the-art performance compared to recent methods for set-structured data.
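The inducing-point attention idea can be sketched in a few lines: attend from m inducing points to the n set elements and back, so the cost is O(nm) rather than O(n^2). This is a simplified stand-in (single head, no layer norm or feed-forward blocks, random rather than learned inducing points), not the full Set Transformer block.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def induced_set_attention(X, I):
    """Attend from m inducing points to the n set elements and back.

    Cost is O(n*m) instead of the O(n^2) of full self-attention.
    """
    H = attention(I, X, X)      # (m, d): inducing points summarise the set
    return attention(X, H, H)   # (n, d): elements attend to the summary

rng = np.random.default_rng(3)
X = rng.standard_normal((1000, 16))   # a set of 1000 elements
I = rng.standard_normal((32, 16))     # 32 inducing points (random stand-ins)
out = induced_set_attention(X, I)
print(out.shape)                      # (1000, 16)
```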

500 citations


Posted Content
TL;DR: In this paper, the authors study the relationship between random, wide, fully connected, feedforward networks with more than one hidden layer and Gaussian processes with a recursive kernel definition and show that, under broad conditions, as they make the architecture increasingly wide, the implied random function converges in distribution to a Gaussian process.
Abstract: Whilst deep neural networks have shown great empirical success, there is still much work to be done to understand their theoretical properties. In this paper, we study the relationship between random, wide, fully connected, feedforward networks with more than one hidden layer and Gaussian processes with a recursive kernel definition. We show that, under broad conditions, as we make the architecture increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks. To evaluate convergence rates empirically, we use maximum mean discrepancy. We then compare finite Bayesian deep networks from the literature to Gaussian processes in terms of the key predictive quantities of interest, finding that in some cases the agreement can be very close. We discuss the desirability of Gaussian process behaviour and review non-Gaussian alternative models from the literature.

257 citations


Journal ArticleDOI
TL;DR: An FD technique combining the generalized CCA with threshold-setting based on the randomized algorithm is proposed and applied to the simulated traction drive control system of high-speed trains; the results show that the proposed method is able to improve the detection performance significantly in comparison with the standard generalized CCA-based FD method.
Abstract: In this paper, we first study a generalized canonical correlation analysis (CCA)-based fault detection (FD) method aiming at maximizing the fault detectability under an acceptable false alarm rate. More specifically, two residual signals are generated for detecting faults in the input and output subspaces, respectively. The minimum covariances of the two residual signals are achieved by taking the correlation between input and output into account. Considering the limited application scope of the generalized CCA due to the Gaussian assumption on the process noises, an FD technique combining the generalized CCA with the threshold-setting based on the randomized algorithm is proposed and applied to the simulated traction drive control system of high-speed trains. The achieved results show that the proposed method is able to improve the detection performance significantly in comparison with the standard generalized CCA-based FD method.

252 citations


Posted Content
TL;DR: This paper is an attempt to bridge the conceptual gaps between researchers working on the two widely used approaches based on positive definite kernels: Bayesian learning or inference using Gaussian processes on the one side, and frequentist kernel methods based on reproducing kernel Hilbert spaces on the other.
Abstract: This paper is an attempt to bridge the conceptual gaps between researchers working on the two widely used approaches based on positive definite kernels: Bayesian learning or inference using Gaussian processes on the one side, and frequentist kernel methods based on reproducing kernel Hilbert spaces on the other. It is widely known in machine learning that these two formalisms are closely related; for instance, the estimator of kernel ridge regression is identical to the posterior mean of Gaussian process regression. However, they have been studied and developed almost independently by two essentially separate communities, and this makes it difficult to seamlessly transfer results between them. Our aim is to overcome this potential difficulty. To this end, we review several old and new results and concepts from either side, and juxtapose algorithmic quantities from each framework to highlight close similarities. We also provide discussions on subtle philosophical and theoretical differences between the two approaches.
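The equivalence mentioned in the abstract can be checked numerically: the kernel ridge regression estimator with regularisation parameter identified with the noise variance coincides with the GP posterior mean. A small sketch under those assumptions, with illustrative data and kernel.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 5, 30))
y = np.sin(x) + 0.1 * rng.standard_normal(30)
x_star = np.linspace(0, 5, 7)
noise_var = 0.01                       # GP noise variance = KRR regularisation

K = rbf(x, x)
k_star = rbf(x, x_star)

# GP posterior mean: k_*^T (K + sigma^2 I)^{-1} y, computed via Cholesky.
L = np.linalg.cholesky(K + noise_var * np.eye(len(x)))
gp_mean = k_star.T @ np.linalg.solve(L.T, np.linalg.solve(L, y))

# Kernel ridge regression: dual coefficients alpha = (K + lambda I)^{-1} y.
alpha = np.linalg.solve(K + noise_var * np.eye(len(x)), y)
krr_mean = k_star.T @ alpha

print(np.allclose(gp_mean, krr_mean))  # True: the two estimators coincide
```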

224 citations


Journal ArticleDOI
TL;DR: A Gaussian process with a quasi-periodic covariance kernel is used to infer stellar rotation periods; because the method delivers posterior probability density functions, it will enable hierarchical studies involving stellar rotation, particularly those involving population modelling, such as inferring stellar ages, obliquities in exoplanet systems, or characterising star-planet interactions.
Abstract: Variability in the light curves of spotted, rotating stars is often non-sinusoidal and quasi-periodic --- spots move on the stellar surface and have finite lifetimes, causing stellar flux variations to slowly shift in phase. A strictly periodic sinusoid therefore cannot accurately model a rotationally modulated stellar light curve. Physical models of stellar surfaces have many drawbacks preventing effective inference, such as highly degenerate or high-dimensional parameter spaces. In this work, we test an appropriate effective model: a Gaussian Process with a quasi-periodic covariance kernel function. This highly flexible model allows sampling of the posterior probability density function of the periodic parameter, marginalising over the other kernel hyperparameters using a Markov Chain Monte Carlo approach. To test the effectiveness of this method, we infer rotation periods from 333 simulated stellar light curves, demonstrating that the Gaussian process method produces periods that are more accurate than both a sine-fitting periodogram and an autocorrelation function method. We also demonstrate that it works well on real data, by inferring rotation periods for 275 Kepler stars with previously measured periods. We provide a table of rotation periods for these 1132 Kepler objects of interest and their posterior probability density function samples. Because this method delivers posterior probability density functions, it will enable hierarchical studies involving stellar rotation, particularly those involving population modelling, such as inferring stellar ages, obliquities in exoplanet systems, or characterising star-planet interactions. The code used to implement this method is available online.
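A common form of quasi-periodic kernel in this literature is a periodic term damped by a squared-exponential envelope, so the phase of the variability can drift slowly (spot evolution). The sketch below is illustrative and is not necessarily the exact parameterisation or hyperparameter values used in the paper.

```python
import numpy as np

def quasi_periodic_kernel(t1, t2, amp=1.0, ell=20.0, gamma=1.0, period=10.0):
    """Quasi-periodic covariance: periodic term times a squared-exponential envelope."""
    tau = t1[:, None] - t2[None, :]
    return amp ** 2 * np.exp(
        -0.5 * tau ** 2 / ell ** 2
        - gamma * np.sin(np.pi * tau / period) ** 2
    )

t = np.linspace(0, 90, 500)                       # e.g. days of a light curve
K = quasi_periodic_kernel(t, t) + 1e-8 * np.eye(t.size)
# One draw from the GP prior: a quasi-periodic synthetic "light curve".
sample = np.linalg.cholesky(K) @ np.random.default_rng(5).standard_normal(t.size)
```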

171 citations


Journal ArticleDOI
TL;DR: A new algorithm, TSEMO, is proposed that uses Gaussian processes as surrogates; it is simple, requires no a priori knowledge, reduces hypervolume calculations to approach linear scaling with respect to the number of objectives, can handle noise, and supports batch-sequential usage.
Abstract: Many engineering problems require the optimization of expensive, black-box functions involving multiple conflicting criteria, such that commonly used methods like multiobjective genetic algorithms are inadequate. To tackle this problem several algorithms have been developed using surrogates. However, these often have disadvantages such as the requirement of a priori knowledge of the output functions or exponentially scaling computational cost with respect to the number of objectives. In this paper a new algorithm is proposed, TSEMO, which uses Gaussian processes as surrogates. The Gaussian processes are sampled using spectral sampling techniques to make use of Thompson sampling in conjunction with the hypervolume quality indicator and NSGA-II to choose a new evaluation point at each iteration. The reference point required for the hypervolume calculation is estimated within TSEMO. Further, a simple extension was proposed to carry out batch-sequential design. TSEMO was compared to ParEGO, an expected hypervolume implementation, and NSGA-II on nine test problems with a budget of 150 function evaluations. Overall, TSEMO shows promising performance, while giving a simple algorithm without the requirement of a priori knowledge, reduced hypervolume calculations to approach linear scaling with respect to the number of objectives, the capacity to handle noise and lastly the ability for batch-sequential usage.
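The spectral-sampling step can be sketched for a single objective: approximate the GP with random Fourier features, draw one posterior weight vector (a Thompson sample), and optimise the resulting explicit function. This is a hedged stand-in; TSEMO additionally combines such samples with the hypervolume indicator and NSGA-II for the multiobjective case, which is omitted here, and all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

def rff_features(X, W, b):
    """Random Fourier features approximating an RBF kernel."""
    return np.sqrt(2.0 / W.shape[0]) * np.cos(X @ W.T + b)

def sample_posterior_function(X, y, n_features=200, lengthscale=0.3, noise=1e-2):
    """Draw one approximate GP posterior sample as an explicit function (sketch)."""
    d = X.shape[1]
    W = rng.standard_normal((n_features, d)) / lengthscale
    b = rng.uniform(0, 2 * np.pi, n_features)
    Phi = rff_features(X, W, b)
    # Bayesian linear regression on the features (unit prior variance).
    A = Phi.T @ Phi + noise * np.eye(n_features)
    mean = np.linalg.solve(A, Phi.T @ y)
    cov = noise * np.linalg.inv(A)
    cov = (cov + cov.T) / 2
    theta = rng.multivariate_normal(mean, cov)     # one posterior weight sample
    return lambda X_new: rff_features(X_new, W, b) @ theta

# Thompson sampling step: optimise the sampled function over candidate points.
X = rng.uniform(0, 1, (20, 2))
y = np.sin(6 * X[:, 0]) + X[:, 1]
f_sample = sample_posterior_function(X, y)
candidates = rng.uniform(0, 1, (1000, 2))
x_next = candidates[np.argmax(f_sample(candidates))]
```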

167 citations


Journal ArticleDOI
TL;DR: A unified view of likelihood based Gaussian process regression for simulation experiments exhibiting input-dependent noise is presented, and a latent-variable idea from machine learning is borrowed to address heteroscedasticity, thereby simultaneously leveraging the computational and statistical efficiency of designs with replication.
Abstract: We present a unified view of likelihood based Gaussian process regression for simulation experiments exhibiting input-dependent noise. Replication plays an important role in that context, however ...

166 citations


Journal ArticleDOI
TL;DR: This article investigates the state-of-the-art multi-output Gaussian processes (MOGPs) that can transfer the knowledge across related outputs in order to improve prediction quality and gives some recommendations regarding the usage of MOGPs.
Abstract: Multi-output regression problems arise extensively in the modern engineering community. This article investigates the state-of-the-art multi-output Gaussian processes (MOGPs) that can transfer the knowledge across related outputs in order to improve prediction quality. We classify existing MOGPs into two main categories: (1) symmetric MOGPs that improve the predictions for all the outputs, and (2) asymmetric MOGPs, particularly the multi-fidelity MOGPs, that focus on the improvement of high fidelity output via the useful information transferred from related low fidelity outputs. We review existing symmetric/asymmetric MOGPs and analyze their characteristics, e.g., the covariance functions (separable or non-separable), the modeling process (integrated or decomposed), the information transfer (bidirectional or unidirectional), and the hyperparameter inference (joint or separate). Besides, we assess the performance of ten representative MOGPs thoroughly on eight examples in symmetric/asymmetric scenarios by considering, e.g., different training data (heterotopic or isotopic), different training sizes (small, moderate and large), different output correlations (low or high), and different output sizes (up to four outputs). Based on the qualitative and quantitative analysis, we give some recommendations regarding the usage of MOGPs and highlight potential research directions.
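The simplest symmetric MOGP with a separable covariance, the intrinsic coregionalisation model, can be written in a few lines. The sketch below uses an illustrative RBF base kernel and coregionalisation matrix and is not tied to any specific model from the review.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def icm_covariance(x, B, ell=1.0):
    """Intrinsic coregionalisation model: K((x,i),(x',j)) = B[i,j] * k(x,x').

    B is a positive semi-definite coregionalisation matrix coupling the outputs;
    the full covariance over all outputs is the Kronecker product B (x) k(X, X).
    """
    return np.kron(B, rbf(x, x, ell))

x = np.linspace(0, 1, 25)
A = np.array([[1.0, 0.0], [0.9, 0.5]])     # low-rank factor (illustrative)
B = A @ A.T                                 # couples the two outputs
K = icm_covariance(x, B) + 1e-8 * np.eye(2 * x.size)
samples = np.linalg.cholesky(K) @ np.random.default_rng(7).standard_normal(2 * x.size)
y1, y2 = samples[:x.size], samples[x.size:]   # correlated samples of the two outputs
```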

Journal ArticleDOI
TL;DR: In this article, a Gaussian process-based surrogate model of the laser powder-bed-fusion (L-PBF) process is used to predict melt pool depth in single-track experiments given a laser power, scan speed, and laser beam size combination.
Abstract: Laser Powder-Bed Fusion (L-PBF) metal-based additive manufacturing (AM) is complex and not fully understood. Successful processing for one material might not necessarily apply to a different material. This paper describes a workflow process that aims at creating a material data sheet standard that describes regimes where the process can be expected to be robust. The procedure consists of building a Gaussian process-based surrogate model of the L-PBF process that predicts melt pool depth in single-track experiments given a laser power, scan speed, and laser beam size combination. The predictions are then mapped onto a power versus scan speed diagram delimiting the conduction from the keyhole melting controlled regimes. This statistical framework is shown to be robust even for cases where experimental training data might be suboptimal in quality, if appropriate physics-based filters are applied. Additionally, it is demonstrated that a high-fidelity simulation model of L-PBF can equally be successfully used for building a surrogate model, which is beneficial since simulations are becoming more efficient and it is more practical to study the response of different materials in simulation than to re-tool an AM machine for a new material powder.

Journal ArticleDOI
TL;DR: This paper model the shape variations with a Gaussian process, which they represent using the leading components of its Karhunen-Loève expansion, and introduces a simple algorithm for fitting a GPMM to a surface or image, which results in a non-rigid registration approach whose regularization properties are defined by a G PMM.
Abstract: Models of shape variations have become a central component for the automated analysis of images. An important class of shape models are point distribution models (PDMs). These models represent a class of shapes as a normal distribution of point variations, whose parameters are estimated from example shapes. Principal component analysis (PCA) is applied to obtain a low-dimensional representation of the shape variation in terms of the leading principal components. In this paper, we propose a generalization of PDMs, which we refer to as Gaussian Process Morphable Models (GPMMs). We model the shape variations with a Gaussian process, which we represent using the leading components of its Karhunen-Loeve expansion. To compute the expansion, we make use of an approximation scheme based on the Nystrom method. The resulting model can be seen as a continuous analog of a standard PDM. However, while for PDMs the shape variation is restricted to the linear span of the example data, with GPMMs we can define the shape variation using any Gaussian process. For example, we can build shape models that correspond to classical spline models and thus do not require any example data. Furthermore, Gaussian processes make it possible to combine different models. For example, a PDM can be extended with a spline model, to obtain a model that incorporates learned shape characteristics but is flexible enough to explain shapes that cannot be represented by the PDM. We introduce a simple algorithm for fitting a GPMM to a surface or image. This results in a non-rigid registration approach whose regularization properties are defined by a GPMM. We show how we can obtain different registration schemes, including methods for multi-scale or hybrid registration, by constructing an appropriate GPMM. As our approach strictly separates modeling from the fitting process, this is all achieved without changes to the fitting algorithm. To demonstrate the applicability and versatility of GPMMs, we perform a set of experiments in typical usage scenarios in medical image analysis and computer vision: The model-based segmentation of 3D forearm images and the building of a statistical model of the face. To complement the paper, we have made all our methods available as open source.
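The Nyström step can be illustrated in one dimension: eigendecompose the kernel on a small set of landmark points and extend the eigenfunctions to arbitrary inputs, giving a low-rank Karhunen-Loève basis. This is a hedged 1-D stand-in for the surface-valued Gaussian processes used in GPMMs; the kernel, landmark layout, and truncation are illustrative.

```python
import numpy as np

def rbf(a, b, ell=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def nystrom_basis(landmarks, n_components, ell=0.2):
    """Approximate the leading Karhunen-Loeve eigenpairs from m landmark points."""
    m = landmarks.size
    K_mm = rbf(landmarks, landmarks, ell)
    evals, evecs = np.linalg.eigh(K_mm)
    idx = np.argsort(evals)[::-1][:n_components]
    evals, evecs = evals[idx], evecs[:, idx]

    def phi(x):
        # Nystrom extension of the discrete eigenvectors to new points x.
        return rbf(x, landmarks, ell) @ evecs * np.sqrt(m) / evals

    return evals / m, phi          # approximate eigenvalues and eigenfunctions

x = np.linspace(0, 1, 200)
lams, phi = nystrom_basis(np.linspace(0, 1, 30), n_components=10)
# Low-rank GP sample: f(x) = sum_i sqrt(lam_i) * a_i * phi_i(x), a_i ~ N(0, 1).
a = np.random.default_rng(8).standard_normal(10)
f = phi(x) @ (np.sqrt(lams) * a)
```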

Journal ArticleDOI
TL;DR: This paper re-fit an accurate PES of formaldehyde and compares PES errors on the entire point set used to solve the vibrational Schrödinger equation, i.e., the only error that matters in quantum dynamics calculations.
Abstract: For molecules with more than three atoms, it is difficult to fit or interpolate a potential energy surface (PES) from a small number of (usually ab initio) energies at points. Many methods have been proposed in recent decades, each claiming a set of advantages. Unfortunately, there are few comparative studies. In this paper, we compare neural networks (NNs) with Gaussian process (GP) regression. We re-fit an accurate PES of formaldehyde and compare PES errors on the entire point set used to solve the vibrational Schrodinger equation, i.e., the only error that matters in quantum dynamics calculations. We also compare the vibrational spectra computed on the underlying reference PES and the NN and GP potential surfaces. The NN and GP surfaces are constructed with exactly the same points, and the corresponding spectra are computed with the same points and the same basis. The GP fitting error is lower, and the GP spectrum is more accurate. The best NN fits to 625/1250/2500 symmetry unique potential energy poin...

Proceedings Article
16 Aug 2018
TL;DR: It is shown that the output of a (residual) convolutional neural network (CNN) with an appropriate prior over the weights and biases is a Gaussian process (GP) in the limit of infinitely many Convolutional filters, extending similar results for dense networks.
Abstract: We show that the output of a (residual) convolutional neural network (CNN) with an appropriate prior over the weights and biases is a Gaussian process (GP) in the limit of infinitely many convolutional filters, extending similar results for dense networks. For a CNN, the equivalent kernel can be computed exactly and, unlike "deep kernels", has very few parameters: only the hyperparameters of the original CNN. Further, we show that this kernel has two properties that allow it to be computed efficiently; the cost of evaluating the kernel for a pair of images is similar to a single forward pass through the original CNN with only one filter per layer. The kernel equivalent to a 32-layer ResNet obtains 0.84% classification error on MNIST, a new record for GPs with a comparable number of parameters.

Journal ArticleDOI
TL;DR: The authors present a scheme to construct classical $n$-body force fields using Gaussian Process (GP) Regression, appropriately mapped over explicit n-body functions (M-FFs), which are as fast as classical parametrized potentials, since they avoid lengthy summations over database entries or weight parameters.
Abstract: The authors present a scheme to construct classical $n$-body force fields using Gaussian Process (GP) Regression, appropriately mapped over explicit n-body functions (M-FFs). The procedure is possible, and will yield accurate forces, whenever prior knowledge allows to restrict the interactions to a finite order $n$, so that the ``universal approximator'' resolving power of standard GPs or Neural Networks is not needed. Under these conditions, the proposed construction preserves flexibility of training, systematically improvable accuracy, and a clear framework for validation of the underlying machine learning technique. Moreover, the M-FFs are as fast as classical parametrized potentials, since they avoid lengthy summations over database entries or weight parameters.

Journal ArticleDOI
TL;DR: In this article, the performance of permutationally invariant polynomials, neural networks, and Gaussian approximation potentials (GAPs) in representing water two-body and three-body interaction energies was investigated.
Abstract: The accurate representation of multidimensional potential energy surfaces is a necessary requirement for realistic computer simulations of molecular systems. The continued increase in computer power accompanied by advances in correlated electronic structure methods nowadays enables routine calculations of accurate interaction energies for small systems, which can then be used as references for the development of analytical potential energy functions (PEFs) rigorously derived from many-body (MB) expansions. Building on the accuracy of the MB-pol many-body PEF, we investigate here the performance of permutationally invariant polynomials (PIPs), neural networks, and Gaussian approximation potentials (GAPs) in representing water two-body and three-body interaction energies, denoting the resulting potentials PIP-MB-pol, Behler-Parrinello neural network-MB-pol, and GAP-MB-pol, respectively. Our analysis shows that all three analytical representations exhibit similar levels of accuracy in reproducing both two-body and three-body reference data as well as interaction energies of small water clusters obtained from calculations carried out at the coupled cluster level of theory, the current gold standard for chemical accuracy. These results demonstrate the synergy between interatomic potentials formulated in terms of a many-body expansion, such as MB-pol, that are physically sound and transferable, and machine-learning techniques that provide a flexible framework to approximate the short-range interaction energy terms.

Posted Content
TL;DR: This work introduces a class of neural latent variable models called Neural Processes (NPs) that combine the strengths of Gaussian processes and neural networks: like GPs they define distributions over functions and estimate the uncertainty in their predictions, while remaining computationally efficient to train and evaluate like NNs.
Abstract: A neural network (NN) is a parameterised function that can be tuned via gradient descent to approximate a labelled collection of data with high precision. A Gaussian process (GP), on the other hand, is a probabilistic model that defines a distribution over possible functions, and is updated in light of data via the rules of probabilistic inference. GPs are probabilistic, data-efficient and flexible, however they are also computationally intensive and thus limited in their applicability. We introduce a class of neural latent variable models which we call Neural Processes (NPs), combining the best of both worlds. Like GPs, NPs define distributions over functions, are capable of rapid adaptation to new observations, and can estimate the uncertainty in their predictions. Like NNs, NPs are computationally efficient during training and evaluation but also learn to adapt their priors to data. We demonstrate the performance of NPs on a range of learning tasks, including regression and optimisation, and compare and contrast with related models in the literature.

Journal ArticleDOI
TL;DR: In this article, the authors consider the information contained in the residuals in the regions where the experimental information exists and evaluate the predictive power of global mass models towards more unstable neutron-rich nuclei and provide uncertainty quantification of predictions.
Abstract: Background: The mass, or binding energy, is the basic property of the atomic nucleus. It determines its stability and reaction and decay rates. Quantifying the nuclear binding is important for understanding the origin of elements in the universe. The astrophysical processes responsible for the nucleosynthesis in stars often take place far from the valley of stability, where experimental masses are not known. In such cases, missing nuclear information must be provided by theoretical predictions using extreme extrapolations. To take full advantage of the information contained in mass model residuals, i.e., deviations between experimental and calculated masses, one can utilize Bayesian machine-learning techniques to improve predictions. Purpose: To improve the quality of model-based predictions of nuclear properties of rare isotopes far from stability, we consider the information contained in the residuals in the regions where the experimental information exists. As a case in point, we discuss two-neutron separation energies S2n of even-even nuclei. Through this observable, we assess the predictive power of global mass models towards more unstable neutron-rich nuclei and provide uncertainty quantification of predictions. Methods: We consider 10 global models based on nuclear density functional theory with realistic energy density functionals as well as two more phenomenological mass models. The emulators of S2n residuals and credibility intervals (Bayesian confidence intervals) defining theoretical error bars are constructed using Bayesian Gaussian processes and Bayesian neural networks. We consider a large training dataset pertaining to nuclei whose masses were measured before 2003. For the testing datasets, we considered those exotic nuclei whose masses have been determined after 2003. By establishing statistical methodology and parameters, we carried out extrapolations toward the 2n dripline. Results: While both Gaussian processes and Bayesian neural networks reduce the root-mean-square (rms) deviation from experiment significantly, GP offers a better and much more stable performance. The increase in the predictive power of microscopic models aided by the statistical treatment is quite astonishing: The resulting rms deviations from experiment on the testing dataset are similar to those of more phenomenological models. We found that Bayesian neural network results are prone to instabilities caused by the large number of parameters in this method. Moreover, since the classical sigmoid activation function used in this approach has linear tails that do not vanish, it is poorly suited for a bounded extrapolation. The empirical coverage probability curves we obtain match the reference values very well, in a slightly conservative way in most cases, which is highly desirable to ensure honesty of uncertainty quantification. The estimated credibility intervals on predictions make it possible to evaluate predictive power of individual models and also make quantified predictions using groups of models. Conclusions: The proposed robust statistical approach to extrapolation of nuclear model results can be useful for assessing the impact of current and future experiments in the context of model developments. The new Bayesian capability to evaluate residuals is also expected to impact research in the domains where experiments are currently impossible, for instance, in simulations of the astrophysical r process.
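The residual-emulation idea generalises beyond nuclear masses and can be sketched on a 1-D toy problem: fit a GP to the deviations between "experiment" and a theoretical model where data exist, then add the predicted correction and its uncertainty band to the model elsewhere. All functions and values below are hypothetical stand-ins, not the paper's mass models.

```python
import numpy as np

def rbf(a, b, ell=1.0, var=1.0):
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_predict(x_train, y_train, x_test, ell=1.0, var=1.0, noise=1e-2):
    K = rbf(x_train, x_train, ell, var) + noise * np.eye(x_train.size)
    k_s = rbf(x_train, x_test, ell, var)
    mean = k_s.T @ np.linalg.solve(K, y_train)
    cov = rbf(x_test, x_test, ell, var) - k_s.T @ np.linalg.solve(K, k_s)
    return mean, np.sqrt(np.clip(np.diag(cov), 0, None))

# Hypothetical "theory" and "experiment" on a 1-D proxy for the nuclear chart.
model = lambda x: 0.1 * x ** 2
truth = lambda x: 0.1 * x ** 2 + 0.3 * np.sin(2 * x)
x_exp = np.linspace(0, 5, 30)                      # region with measurements
y_exp = truth(x_exp)

residuals = y_exp - model(x_exp)                   # deviations to be emulated
x_new = np.linspace(0, 7, 100)                     # includes an extrapolation region
corr, sigma = gp_predict(x_exp, residuals, x_new, ell=1.5)
prediction = model(x_new) + corr                   # statistically corrected model
lower, upper = prediction - 2 * sigma, prediction + 2 * sigma   # credibility band
```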

Proceedings Article
01 Jan 2018
TL;DR: This work proposes a multi-task adaptive Bayesian linear regression model for transfer learning in BO, whose complexity is linear in the function evaluations: one Bayesianlinear regression model is associated to each black-box function optimization problem (or task), while transfer learning is achieved by coupling the models through a shared deep neural net.
Abstract: Bayesian optimization (BO) is a model-based approach for gradient-free black-box function optimization, such as hyperparameter optimization. Typically, BO relies on conventional Gaussian process (GP) regression, whose algorithmic complexity is cubic in the number of evaluations. As a result, GP-based BO cannot leverage large numbers of past function evaluations, for example, to warm-start related BO runs. We propose a multi-task adaptive Bayesian linear regression model for transfer learning in BO, whose complexity is linear in the function evaluations: one Bayesian linear regression model is associated to each black-box function optimization problem (or task), while transfer learning is achieved by coupling the models through a shared deep neural net. Experiments show that the neural net learns a representation suitable for warm-starting the black-box optimization problems and that BO runs can be accelerated when the target black-box function (e.g., validation loss) is learned together with other related signals (e.g., training loss). The proposed method was found to be at least one order of magnitude faster than methods recently published in the literature.
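The per-task Bayesian linear regression head has a closed-form posterior whose cost is linear in the number of evaluations. Below is a hedged sketch using a fixed random feature map as a stand-in for the shared neural network; the prior, noise precision, data, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)

def features(X, W, b):
    """Stand-in for the shared neural-net representation (random tanh features)."""
    return np.tanh(X @ W.T + b)

def blr_posterior(Phi, y, alpha=1.0, beta=100.0):
    """Closed-form Bayesian linear regression on the shared features.

    Prior w ~ N(0, alpha^{-1} I), noise precision beta.  The cost is linear in
    the number of observations n (one pass to form Phi^T Phi) and cubic only
    in the feature dimension d.
    """
    d = Phi.shape[1]
    A = alpha * np.eye(d) + beta * Phi.T @ Phi
    mean = beta * np.linalg.solve(A, Phi.T @ y)
    return mean, np.linalg.inv(A)      # posterior mean and covariance of w

def predict(X_new, W, b, mean, cov, beta=100.0):
    Phi = features(X_new, W, b)
    mu = Phi @ mean
    var = np.sum(Phi @ cov * Phi, axis=1) + 1.0 / beta
    return mu, var

d_in, d_feat = 3, 50
W, b = rng.standard_normal((d_feat, d_in)), rng.standard_normal(d_feat)
X = rng.uniform(0, 1, (200, d_in))                # past hyperparameter evaluations
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2            # e.g. observed validation losses
mean, cov = blr_posterior(features(X, W, b), y)
mu, var = predict(rng.uniform(0, 1, (5, d_in)), W, b, mean, cov)
```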

Posted Content
TL;DR: In this article, an equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs) was derived for CNNs both with and without pooling layers, and achieved state-of-the-art results on CIFAR10 for GPs without trainable kernels.
Abstract: There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers, and achieve state of the art results on CIFAR10 for GPs without trainable kernels. We also introduce a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible. Surprisingly, in the absence of pooling layers, the GPs corresponding to CNNs with and without weight sharing are identical. As a consequence, translation equivariance, beneficial in finite channel CNNs trained with stochastic gradient descent (SGD), is guaranteed to play no role in the Bayesian treatment of the infinite channel limit - a qualitative difference between the two regimes that is not present in the FCN case. We confirm experimentally, that while in some scenarios the performance of SGD-trained finite CNNs approaches that of the corresponding GPs as the channel count increases, with careful tuning SGD-trained CNNs can significantly outperform their corresponding GPs, suggesting advantages from SGD training compared to fully Bayesian parameter estimation.
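The Monte Carlo estimate mentioned in the abstract can be sketched directly: instantiate many finite random networks and average the outer products of their outputs over initializations. The code below is a simplified fully-connected stand-in (the paper's construction applies to CNNs); widths, depths, and weight scales are illustrative.

```python
import numpy as np

rng = np.random.default_rng(10)

def random_relu_net(X, width=256, depth=3, sigma_w=np.sqrt(2.0), sigma_b=0.1):
    """One random finite-width network (forward pass only, no training)."""
    h = X
    for _ in range(depth):
        W = rng.standard_normal((h.shape[1], width)) * sigma_w / np.sqrt(h.shape[1])
        b = rng.standard_normal(width) * sigma_b
        h = np.maximum(h @ W + b, 0.0)
    w = rng.standard_normal((h.shape[1], 1)) * sigma_w / np.sqrt(h.shape[1])
    return (h @ w).ravel()

def monte_carlo_nngp(X, n_samples=500, **kwargs):
    """Estimate the GP kernel as E[f(x) f(x')] over random initializations."""
    outs = np.stack([random_relu_net(X, **kwargs) for _ in range(n_samples)])
    return outs.T @ outs / n_samples

X = rng.standard_normal((6, 8))
K_mc = monte_carlo_nngp(X)   # approaches the analytic kernel as width and samples grow
```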

Journal ArticleDOI
TL;DR: It is shown that the dynamic GP produces sharper prediction intervals (PIs) than the static GP with significantly lower computational burden, but at the cost of the ability to capture sharp peaks.

Journal ArticleDOI
TL;DR: This paper presents a framework for creating a lightweight thermal prediction system suitable for run-time management decisions, and develops alternative neural network and linear regression-based methods to perform a comprehensive comparative study of prediction methods.
Abstract: Elevated temperatures limit the peak performance of systems because of frequent interventions by thermal throttling. Non-uniform thermal states across system nodes also cause performance variation within seemingly equivalent nodes leading to significant degradation of overall performance. In this paper we present a framework for creating a lightweight thermal prediction system suitable for run-time management decisions. We pursue two avenues to explore optimized lightweight thermal predictors. First, we use feature selection algorithms to improve the performance of previously designed machine learning methods. Second, we develop alternative methods using neural network and linear regression-based methods to perform a comprehensive comparative study of prediction methods. We show that our optimized models achieve improved performance with better prediction accuracy and lower overhead as compared with the Gaussian process model proposed previously. Specifically we present a reduced version of the Gaussian process model, a neural network–based model, and a linear regression–based model. Using the optimization methods, we are able to reduce the average prediction errors in the Gaussian process from $4.2^\circ$ C to $2.9^\circ$ C. We also show that the newly developed models using neural network and Lasso linear regression have average prediction errors of $2.9^\circ$ C and $3.8^\circ$ C respectively. The prediction overheads are 0.22, 0.097, and 0.026 ms per prediction for reduced Gaussian process, neural network, and Lasso linear regression models, respectively, compared with 0.57 ms per prediction for the previous Gaussian process model. We have implemented our proposed thermal prediction models on a two-node system configuration to help identify the optimal task placement. The task placement identified by the models reduces the average system temperature by up to $11.9^\circ$ C without any performance degradation. Furthermore, these models respectively achieve 75, 82.5, and 74.17 percent success rates in correctly pointing to those task placements with better thermal response, compared with 72.5 percent success for the original model in achieving the same objective. Finally, we extended our analysis to a 16-node system and we were able to train models and execute them in real time to guide task migration and achieve on average 17 percent reduction in the overall system cooling power.

Journal ArticleDOI
TL;DR: In this paper, a systematic study of how ordering affects the accuracy of Vecchia's approximation of Gaussian process parameters is presented, showing that random orderings can give dramatically sharper approximations than default coordinate-based orderings.
Abstract: Vecchia’s approximate likelihood for Gaussian process parameters depends on how the observations are ordered, which has been cited as a deficiency. This article takes the alternative standpoint that the ordering can be tuned to sharpen the approximations. Indeed, the first part of the article includes a systematic study of how ordering affects the accuracy of Vecchia’s approximation. We demonstrate the surprising result that random orderings can give dramatically sharper approximations than default coordinate-based orderings. Additional ordering schemes are described and analyzed numerically, including orderings capable of improving on random orderings. The second contribution of this article is a new automatic method for grouping calculations of components of the approximation. The grouping methods simultaneously improve approximation accuracy and reduce computational burden. In common settings, reordering combined with grouping reduces Kullback–Leibler divergence from the target model by more th...
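Vecchia's approximation replaces the joint Gaussian likelihood with a product of univariate conditionals, each conditioning on at most m previously ordered points, so the ordering directly shapes the approximation. A hedged sketch that makes this explicit; the kernel, conditioning-set rule (nearest previously ordered points), and data are illustrative.

```python
import numpy as np

def rbf(a, b, ell=0.2):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def vecchia_loglik(X, y, order, m=10, ell=0.2, noise=1e-4):
    """Vecchia approximate Gaussian log-likelihood.

    log p(y) ~= sum_i log p(y_i | y_{c(i)}), where c(i) contains at most m
    previously ordered points (here: the nearest ones among them).
    """
    X, y = X[order], y[order]
    loglik = 0.0
    for i in range(len(y)):
        if i == 0:
            mu, var = 0.0, rbf(X[:1], X[:1], ell)[0, 0] + noise
        else:
            past = np.arange(i)
            d = np.linalg.norm(X[past] - X[i], axis=1)
            c = past[np.argsort(d)[:m]]                 # conditioning set
            K_cc = rbf(X[c], X[c], ell) + noise * np.eye(len(c))
            k_ci = rbf(X[c], X[i:i + 1], ell).ravel()
            sol = np.linalg.solve(K_cc, k_ci)
            mu = sol @ y[c]
            var = rbf(X[i:i + 1], X[i:i + 1], ell)[0, 0] + noise - sol @ k_ci
        loglik += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mu) ** 2 / var)
    return loglik

rng = np.random.default_rng(11)
X = rng.uniform(0, 1, (300, 2))
y = np.sin(4 * X[:, 0]) * np.cos(4 * X[:, 1]) + 0.05 * rng.standard_normal(300)
print(vecchia_loglik(X, y, order=np.argsort(X[:, 0])))   # coordinate-based ordering
print(vecchia_loglik(X, y, order=rng.permutation(300)))  # random ordering
```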

Journal ArticleDOI
TL;DR: A single framework is proposed that unifies, extends, and improves a general-purpose modelling strategy, based on the assumption that any process can emerge by transforming a specific “parent” Gaussian process, and is augmented with flexible parametric correlation structures that parsimoniously describe observed correlations.

Journal ArticleDOI
TL;DR: This work proposes to learn individual surrogate models on the observations of each data set and then combine all surrogates into a joint one using ensembling techniques, and extends the framework to directly estimate the acquisition function in the same setting, using a novel technique named the “transfer acquisition function”.
Abstract: Algorithm selection as well as hyperparameter optimization are tedious tasks that have to be dealt with when applying machine learning to real-world problems. Sequential model-based optimization (SMBO), based on so-called “surrogate models”, has been employed to allow for faster and more direct hyperparameter optimization. A surrogate model is a machine learning regression model which is trained on the meta-level instances in order to predict the performance of an algorithm on a specific data set given the hyperparameter settings and data set descriptors. Gaussian processes, for example, make good surrogate models as they provide probability distributions over labels. Recent work on SMBO also includes meta-data, i.e. observed hyperparameter performances on other data sets, in the process of hyperparameter optimization. This can, for example, be accomplished by learning transfer surrogate models on all available instances of meta-knowledge; however, the increasing amount of meta-information can make Gaussian processes infeasible, as they require the inversion of a large covariance matrix which grows with the number of instances. Consequently, instead of learning a joint surrogate model on all of the meta-data, we propose to learn individual surrogate models on the observations of each data set and then combine all surrogates into a joint one using ensembling techniques. The final surrogate is a weighted sum of all data set specific surrogates plus an additional surrogate that is solely learned on the target observations. Within our framework, any surrogate model can be used; we explore Gaussian processes in this scenario. We present two different strategies for finding the weights used in the ensemble: the first is based on a probabilistic product of experts approach, and the second is based on kernel regression. Additionally, we extend the framework to directly estimate the acquisition function in the same setting, using a novel technique which we name the “transfer acquisition function”. In an empirical evaluation including comparisons to the current state-of-the-art on two publicly available meta-data sets, we are able to demonstrate that our proposed approach not only scales to large meta-data, but also finds stronger prediction models.
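The ensembling idea can be sketched compactly: one GP surrogate per meta-data set plus one fitted only to the target observations, combined as a weighted sum of predictive means. The softmax-over-error weighting below is a simple stand-in for the paper's product-of-experts and kernel-regression weightings, and all data are synthetic.

```python
import numpy as np

def rbf(a, b, ell=0.3):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_mean(x_tr, y_tr, x_te, noise=1e-2):
    K = rbf(x_tr, x_tr) + noise * np.eye(x_tr.size)
    return rbf(x_tr, x_te).T @ np.linalg.solve(K, y_tr)

rng = np.random.default_rng(12)

# Meta-data: hyperparameter/performance observations from three previous data sets.
meta_tasks = []
for k in range(3):
    x = rng.uniform(0, 1, 40)
    y = np.sin(5 * x) + 0.2 * k + 0.05 * rng.standard_normal(40)
    meta_tasks.append((x, y))

# A few observations on the new target data set.
x_t = rng.uniform(0, 1, 5)
y_t = np.sin(5 * x_t) + 0.1
x_grid = np.linspace(0, 1, 200)

# One surrogate per meta-data set, plus one trained only on the target observations.
surrogates = meta_tasks + [(x_t, y_t)]
preds_on_target = [gp_mean(x, y, x_t) for x, y in surrogates]
preds_on_grid = [gp_mean(x, y, x_grid) for x, y in surrogates]

# Weight each surrogate by how well it explains the target observations.
errors = np.array([np.mean((p - y_t) ** 2) for p in preds_on_target])
weights = np.exp(-errors / errors.mean())
weights /= weights.sum()

ensemble_mean = sum(w * p for w, p in zip(weights, preds_on_grid))
```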

Journal ArticleDOI
TL;DR: In this article, a Gaussian process (a nonparametric machine learning approach) based algorithm for condition monitoring is proposed, which uses the standard IEC binned power curve together with individual bin probability distributions to identify operational anomalies.
Abstract: The penetration of wind energy into power systems is steadily increasing; this highlights the importance of operations and maintenance, and specifically the role of condition monitoring. Wind turbine power curves based on supervisory control and data acquisition data provide a cost-effective approach to wind turbine health monitoring. This study proposes a Gaussian process (a non-parametric machine learning approach) based algorithm for condition monitoring. The standard IEC binned power curve together with individual bin probability distributions can be used to identify operational anomalies. The IEC approach can also be modified to create a form of real-time power curve. Both of these approaches will be compared with a Gaussian process model to assess both speed and accuracy of anomaly detection. Significant yaw misalignment, reflecting a yaw control error or fault, results in a loss of power. Such a fault is quite common and early detection is important to prevent loss of power generation. Yaw control error provides a useful case study to demonstrate the effectiveness of the proposed algorithms and allows the advantages and limitations of the proposed methods to be determined.
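The binned power-curve baseline described above can be sketched as follows: bin SCADA data by wind speed, compute per-bin statistics, and flag observations falling outside a per-bin band. The data, power-curve shape, and thresholds are illustrative and do not reproduce the IEC 61400-12 procedure in full.

```python
import numpy as np

rng = np.random.default_rng(13)

# Hypothetical SCADA history: wind speed (m/s) and power output (kW).
wind = rng.uniform(3, 15, 5000)
power = 3000 / (1 + np.exp(-(wind - 9))) + 50 * rng.standard_normal(5000)

bin_edges = np.arange(3, 15.5, 0.5)               # 0.5 m/s bins, as in IEC-style binning
bin_idx = np.digitize(wind, bin_edges) - 1
bin_mean = np.array([power[bin_idx == b].mean() for b in range(len(bin_edges) - 1)])
bin_std = np.array([power[bin_idx == b].std() for b in range(len(bin_edges) - 1)])

def is_anomalous(wind_new, power_new, n_sigma=3.0):
    """Flag observations outside an n-sigma band of their wind-speed bin."""
    b = np.clip(np.digitize(wind_new, bin_edges) - 1, 0, len(bin_mean) - 1)
    return np.abs(power_new - bin_mean[b]) > n_sigma * bin_std[b]

# A yaw misalignment shows up as a systematic power deficit at a given wind speed.
print(is_anomalous(np.array([10.0]), np.array([1600.0])))
```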

Journal ArticleDOI
19 Feb 2018
TL;DR: In this paper, the authors present a derivation and implementation of efficient and scalable gradient computations using the celerite algorithm for Gaussian Process (GP) modeling, which can be easily integrated into existing automatic differentiation frameworks to provide a scalable method for evaluating the gradients of the GP likelihood with respect to all input parameters.
Abstract: This research note presents a derivation and implementation of efficient and scalable gradient computations using the celerite algorithm for Gaussian Process (GP) modeling. The algorithms are derived in a "reverse accumulation" or "backpropagation" framework and they can be easily integrated into existing automatic differentiation frameworks to provide a scalable method for evaluating the gradients of the GP likelihood with respect to all input parameters. The algorithm derived in this note uses less memory and is more efficient than versions using automatic differentiation and the computational cost scales linearly with the number of data points.

Journal ArticleDOI
TL;DR: This study proposes a complete method based on Gaussian process data pre-filtering and ANN modeling of wind turbine power curves that improves the network performance significantly, and saves substantial time and resources.