

Journal of Machine Learning Research 18 (2018) 1-52 Submitted 11/16; Revised 11/17; Published 4/18
Variational Fourier Features for Gaussian Processes
James Hensman
james@prowler.io
Nicolas Durrande
nicolas@prowler.io
PROWLER.io
66-68 Hills Road
Cambridge, CB2 1LA, UK
Arno Solin
arno.solin@aalto.fi
Aalto University
FI-00076 AALTO
Espoo, Finland
Editor: Manfred Opper
Abstract
This work brings together two powerful concepts in Gaussian processes: the variational approach to sparse approximation and the spectral representation of Gaussian processes. This gives rise to an approximation that inherits the benefits of the variational approach but with the representational power and computational scalability of spectral representations. The work hinges on a key result that there exist spectral features related to a finite domain of the Gaussian process which exhibit almost-independent covariances. We derive these expressions for Matérn kernels in one dimension, and generalize to more dimensions using kernels with specific structures. Under the assumption of additive Gaussian noise, our method requires only a single pass through the data set, making for very fast and accurate computation. We fit a model to 4 million training points in just a few minutes on a standard laptop. With non-conjugate likelihoods, our MCMC scheme reduces the cost of computation from O(NM^2) (for a sparse Gaussian process) to O(NM) per iteration, where N is the number of data and M is the number of features.
Keywords: Gaussian processes, Fourier features, variational inference
1. Introduction
Efficient computation in Gaussian process (GP) models is of broad interest across machine learning and statistics, with applications in the fields of spatial epidemiology (Diggle, 2013; Banerjee et al., 2014), robotics and control (Deisenroth et al., 2015), signal processing (e.g. Särkkä et al., 2013), Bayesian optimization and probabilistic numerics (e.g. Osborne, 2010; Briol et al., 2016; Hennig et al., 2015), and many others.
Computation is challenging for several reasons. First, the computational complexity usually scales cubically with the number of data N, as one must decompose an N × N dense covariance matrix. Second, the posterior is both high-dimensional and non-Gaussian if the data likelihood is not conjugate. Third, GP models have a hierarchical structure, with the latent values of the function being conditioned on parameters of the covariance function.

This work was partially completed whilst JH was affiliated to the Centre for Health Information, Computation, and Statistics (CHICAS), Faculty of Health and Medicine, Lancaster University, UK.
This work was partially completed whilst ND was affiliated to Institut Fayol–LIMOS, Mines Saint-Étienne, 158 cours Fauriel, Saint-Étienne, France.

© 2018 James Hensman, Nicolas Durrande, and Arno Solin.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided
at http://jmlr.org/papers/v18/16-579.html.
In this work, we adopt a variational Bayesian framework for dealing with these issues. By this we mean that we minimize the Kullback-Leibler (KL) divergence KL[q||p], where q denotes an approximation to the posterior, and p is the posterior itself. Many authors have applied this framework to GP problems before: perhaps the earliest example is Csató et al. (2002). Seeger (2003) also made use of this KL criterion in the context of Gaussian processes. Particularly influential works include Titsias (2009), who developed the first sparse GP using the KL divergence, and Opper and Archambeau (2009), who made a case for this approximation in the non-conjugate (but not sparse) case.
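Written out (this is the standard definition, stated here only to fix notation), the objective being minimized is

\mathrm{KL}[q \,\|\, p] = \mathbb{E}_{q(f)}\big[\log q(f) - \log p(f \,|\, y)\big] \ge 0 ,

which is zero exactly when q coincides with the posterior.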
The variational framework for Gaussian processes has several advantages. First, the three challenges mentioned above can be tackled by a single unified objective. Second, the accuracy of the approximation can easily be shown to increase monotonically with increasing complexity of the proposed approximating family (i.e. more inducing points is always better). Third, the framework provides an evidence lower bound (or ELBO), which can be evaluated to compare different approximations.
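For concreteness, the bound in the third point takes the generic variational-inference form (a standard identity, not specific to the construction developed in this paper)

\log p(y) \;\ge\; \mathbb{E}_{q(f)}\big[\log p(y \,|\, f)\big] - \mathrm{KL}\big[q(f) \,\|\, p(f)\big] \;=\; \mathrm{ELBO}(q) ,

and the gap between log p(y) and the ELBO is exactly the divergence KL[q||p] being minimized, so a larger ELBO certifies a better approximation.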
In this work, we combine the variational approach with Fourier features; we refer to our method as Variational Fourier Features (VFF). Whilst most sparse Gaussian process approximations rely on inducing or pseudo inputs, which lie in the same domain as the inputs to the GP, Fourier features lie in the spectral domain of the process. As inducing-point approximations build up the posterior through kernel functions, Fourier features build up the approximate posterior through sinusoids. Our approach differs from existing works which use random frequencies (Rahimi and Recht, 2008, 2009) or optimized frequencies (Lázaro-Gredilla et al., 2010), in that we use frequencies placed on a regular grid (cf. Solin and Särkkä, 2014), which is key in our derivation of a computationally convenient matrix structure.
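To make "frequencies placed on a regular grid" concrete, the following sketch (our own illustration, not code from the paper; the window end-points a and b, the harmonic spacing 2*pi/(b - a), and the feature ordering are assumptions, and the actual VFF construction is derived in Section 3) builds harmonic frequencies on an interval and evaluates the corresponding sinusoidal features:

```python
import numpy as np

def fourier_feature_matrix(x, a, b, num_freqs):
    """Evaluate sinusoidal features on a regular frequency grid over the window [a, b].

    Frequencies are the harmonics omega_m = 2*pi*m / (b - a); the returned matrix
    stacks a constant column, then cosine columns, then sine columns.
    (Illustrative only; the exact VFF features are defined in Section 3.)
    """
    omegas = 2.0 * np.pi * np.arange(1, num_freqs + 1) / (b - a)  # regular grid
    phase = omegas[None, :] * (x[:, None] - a)
    return np.hstack([np.ones((len(x), 1)), np.cos(phase), np.sin(phase)])

x = np.linspace(0.0, 1.0, 5)
Phi = fourier_feature_matrix(x, a=0.0, b=1.0, num_freqs=3)
print(Phi.shape)  # (5, 7): one constant plus cos/sin pairs for 3 harmonics
```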
Naive combination of Fourier features with sparse GP approaches is not feasible because the standard Fourier transform of a stochastic process does not converge, meaning that these features are associated with random variables of infinite variance. Matthews et al. (2016) have shown that valid inducing features for the variational approach require finite, deterministic projections of the process. To combat this, Figueiras-Vidal and Lázaro-Gredilla (2009) used a windowing technique, which made the integrals converge to valid inducing features (for the squared-exponential kernel), but lost perhaps the most important aspect of a Fourier approach: that the features should be independent. In this work, we combine a finite window with a Reproducing Kernel Hilbert Space (RKHS) projection to construct features that are almost independent, having covariance structures that are easily decomposed. This gives rise to a complexity of O(NM), comparing favourably with the O(NM^2) complexity of other variational approaches, where M denotes the number of features.
The contributions of this paper are the following:

• We introduce a novel approximation scheme for representing a GP using a Fourier decomposition. We limit our interest to the widely used Matérn family of stationary kernels, for which our proposed decomposition exhibits an almost-independent structure.

• We combine our approximation scheme with the variational Bayesian framework for Gaussian processes, and present how the methodology generalizes to GP inference problems with general likelihoods.
Consequently, this paper aims to provide an approximate Gaussian process inference scheme which is theoretically sound, has strong representative power, is extremely fast, and, most importantly, works well in practice.
The rest of this paper is structured as follows. Section 2 reviews the relevant background related to Gaussian process inference and previous work in sparse and spectral approximations to GP models. In Section 3 the core methodology for Variational Fourier Features is presented for the one-dimensional case. In Section 4 this is extended to multidimensional cases with additive and product structures. Implementation details and computational complexity of the method are covered in Section 5. Section 6 is dedicated to a set of illustrative toy examples and a number of empirical experiments, where practical aspects of Variational Fourier Feature inference are demonstrated. Finally, we conclude the paper with a discussion in Section 7.
2. Background
In this section, we first set up the family of Gaussian process models that we will consider, which allows us to establish some notation. Next, we review some basic results regarding the spectrum of stationary Gaussian processes, and recall how Fourier features approximate the kernel. We relate sparse Gaussian process approximations and Fourier approximations by explicating them as alternative models (Quiñonero-Candela and Rasmussen, 2005). We then recap the variational approximation to Gaussian processes, including expressions for sparse approximations and approximations for non-conjugate likelihoods. The two final subsections in this section discuss decomposition of Gaussian processes: we first link decomposition and conditioning and then discuss inter-domain decomposition.
2.1 Gaussian process models
Gaussian processes are distributions over functions, defined by a mean function and a covariance function (see Williams and Rasmussen, 2006, for an introduction). In this section we consider GPs over the real line, x \in \mathbb{R}. Making the standard assumption that the mean function is zero, we write

f(x) \sim \mathcal{GP}\big( 0,\; k(x, x') \big) .   (1)
The data y = [y_n]_{n=1}^{N} at locations X = [x_n]_{n=1}^{N} are conditioned on the function evaluations f = [f(x_n)]_{n=1}^{N} through some factorising likelihood

p(y \,|\, f(x)) = p(y \,|\, f) = \prod_n p(y_n \,|\, f(x_n)) ,   (2)

which we do not in general assume to be Gaussian. A defining property of Gaussian processes is that any finite number of function evaluations follow a multivariate normal distribution, so

p(f) = \mathcal{N}(0, K_{ff}) ,   (3)

and the process conditioned on these values is given by a conditional Gaussian process:

f(x) \,|\, f \sim \mathcal{GP}\big( k_f(x)^\top K_{ff}^{-1} f ,\;\; k(x, x') - k_f(x)^\top K_{ff}^{-1} k_f(x') \big) .   (4)
To compute the posterior over functions given the data, f(x) | y, it is possible to compute the posterior only for the finite set p(f | y) using standard methodologies like Markov Chain Monte Carlo (MCMC) (e.g. Murray et al., 2010; Filippone et al., 2013) or variational Bayes (e.g. Opper and Archambeau, 2009; Khan et al., 2012, 2013). Evaluating the function at a test point means averaging equation (4) over this posterior p(f | y).
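Concretely, writing x_* for a test input, this averaging is

p(f(x_*) \,|\, y) = \int p(f(x_*) \,|\, f)\, p(f \,|\, y)\, \mathrm{d}f ,

where p(f(x_*) | f) is the conditional process of equation (4) evaluated at x_*.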
In our notation, the matrix K_{ff} is given by evaluating the covariance function at all pairs of data points, K_{ff}[i, j] = k(x_i, x_j). The vector-valued function k_f(x) is given similarly, k_f(x) = [k(x_1, x), k(x_2, x), \dots, k(x_N, x)]^\top. We omit dependence on covariance function parameters for clarity.
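As a minimal numerical illustration of equations (3) and (4) (a sketch of our own; the squared-exponential kernel is used purely as a placeholder, and the conditioning is on noise-free function values exactly as in equation (4)):

```python
import numpy as np

def kernel(x1, x2, sigma2=1.0, ell=0.2):
    # placeholder squared-exponential covariance: sigma^2 * exp(-(x - x')^2 / (2 ell^2))
    d = x1[:, None] - x2[None, :]
    return sigma2 * np.exp(-0.5 * (d / ell) ** 2)

X = np.array([0.1, 0.4, 0.8])      # training inputs
f = np.array([0.3, -0.5, 1.2])     # observed (noise-free) function values f(X)
Xs = np.linspace(0.0, 1.0, 50)     # test inputs

Kff = kernel(X, X)                 # K_ff as in equation (3)
Ksf = kernel(Xs, X)                # each row is k_f(x)^T at one test point
Kss = kernel(Xs, Xs)

mean = Ksf @ np.linalg.solve(Kff, f)             # k_f(x)^T K_ff^{-1} f
cov = Kss - Ksf @ np.linalg.solve(Kff, Ksf.T)    # covariance term of equation (4)
print(mean.shape, cov.shape)       # (50,), (50, 50)
```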
2.2 Spectral representations of stationary covariances
Stationary Gaussian processes are those with a covariance function that can be written as a function of the distance between observations, k(x, x') = k(|x - x'|) = k(r). Of particular interest in this work will be the Matérn family with half-integer orders, the first three of which are
k_{1/2}(r) = \sigma^2 \exp(-r/\ell) ,
k_{3/2}(r) = \sigma^2 \big(1 + \sqrt{3}\, r/\ell\big) \exp\!\big(-\sqrt{3}\, r/\ell\big) ,
k_{5/2}(r) = \sigma^2 \big(1 + \sqrt{5}\, r/\ell + \tfrac{5}{3}\, r^2/\ell^2\big) \exp\!\big(-\sqrt{5}\, r/\ell\big) ,   (5)

where \sigma^2 and \ell are the variance and lengthscale parameters, respectively.
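These expressions translate directly into code; the helpers below (our own naming, a literal transcription of equation (5)) can be used wherever a Matérn covariance on distances r is required:

```python
import numpy as np

def matern12(r, sigma2, ell):
    return sigma2 * np.exp(-r / ell)

def matern32(r, sigma2, ell):
    s3 = np.sqrt(3.0) * r / ell
    return sigma2 * (1.0 + s3) * np.exp(-s3)

def matern52(r, sigma2, ell):
    s5 = np.sqrt(5.0) * r / ell
    return sigma2 * (1.0 + s5 + 5.0 * r ** 2 / (3.0 * ell ** 2)) * np.exp(-s5)

r = np.linspace(0.0, 2.0, 5)
print(matern12(r, 1.0, 0.5))
print(matern32(r, 1.0, 0.5))
print(matern52(r, 1.0, 0.5))
```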
Bochner's theorem (Akhiezer and Glazman, 1993) tells us that any continuous positive definite function, such as a covariance function, can be represented as the Fourier transform of a positive measure. If the measure has a density, it is known as the spectral density s(\omega) of the covariance function. This gives rise to the Fourier duality of spectral densities and covariance functions, known as the Wiener-Khintchin theorem. It gives the following relations:

s(\omega) = \mathcal{F}\{k(r)\} = \int_{-\infty}^{\infty} k(r)\, e^{-i\omega r} \, \mathrm{d}r ,   (6)

k(r) = \mathcal{F}^{-1}\{s(\omega)\} = \frac{1}{2\pi} \int_{-\infty}^{\infty} s(\omega)\, e^{i\omega r} \, \mathrm{d}\omega .   (7)
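As a worked instance of relation (6) (a standard calculation, included for concreteness), the exponential covariance k_{1/2}(r) = \sigma^2 e^{-|r|/\ell} from equation (5) has

s(\omega) = \int_{-\infty}^{\infty} \sigma^2 e^{-|r|/\ell}\, e^{-i\omega r} \, \mathrm{d}r = \frac{2\sigma^2 \ell^{-1}}{\ell^{-2} + \omega^2} = 2\sigma^2 \lambda \,(\lambda^2 + \omega^2)^{-1} , \quad \lambda = \ell^{-1} ,

which is exactly the Matérn-1/2 spectrum appearing in equation (8) below.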
Since kernels are symmetric real functions, the spectrum of the process is also a symmetric real function. The spectra corresponding to the Matérn covariances are

s_{1/2}(\omega) = 2\sigma^2 \lambda \,(\lambda^2 + \omega^2)^{-1} , \quad \lambda = \ell^{-1} ,   (8)

s_{3/2}(\omega) = 4\sigma^2 \lambda^3 (\lambda^2 + \omega^2)^{-2} , \quad \lambda = \sqrt{3}\, \ell^{-1} ,   (9)

s_{5/2}(\omega) = \tfrac{16}{3}\, \sigma^2 \lambda^5 (\lambda^2 + \omega^2)^{-3} , \quad \lambda = \sqrt{5}\, \ell^{-1} .   (10)
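A quick numerical sanity check of this duality (our own sketch; the grid limits and test point are arbitrary choices) is to push the Matérn-1/2 spectrum of equation (8) back through relation (7) and compare against the covariance in equation (5):

```python
import numpy as np

sigma2, ell = 1.5, 0.7
lam = 1.0 / ell

# dense symmetric frequency grid; the integrand decays like 1/omega^2, so truncation is mild
omega = np.linspace(-400.0, 400.0, 200001)
d_omega = omega[1] - omega[0]
s = 2.0 * sigma2 * lam / (lam ** 2 + omega ** 2)   # eq. (8)

r = 0.3
k_numeric = np.sum(s * np.cos(omega * r)) * d_omega / (2.0 * np.pi)  # eq. (7), real part
k_exact = sigma2 * np.exp(-r / ell)                                  # eq. (5)
print(k_numeric, k_exact)  # should agree closely; the truncated tail contributes only a small error
```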

References

Akhiezer, N. I., and Glazman, I. M. Theory of Linear Operators in Hilbert Space.
Banerjee, S., Carlin, B. P., and Gelfand, A. E. Hierarchical Modeling and Analysis for Spatial Data.
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational Inference: A Review for Statisticians.
Quiñonero-Candela, J., and Rasmussen, C. E. A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research.
Rahimi, A., and Recht, B. Random Features for Large-Scale Kernel Machines. Advances in Neural Information Processing Systems.