Published in: Journal of Machine Learning Research, 01/01/2018. Document version: Publisher's PDF (version of record). License: CC BY.
Please cite the original version:
Hensman, J., Durrande, N., & Solin, A. (2018). Variational Fourier Features for Gaussian Processes. Journal of Machine Learning Research, 18(1), 1–52. [151]. https://dl.acm.org/citation.cfm?id=3242008
Journal of Machine Learning Research 18 (2018) 1-52 Submitted 11/16; Revised 11/17; Published 4/18
Variational Fourier Features for Gaussian Processes
James Hensman∗ james@prowler.io
Nicolas Durrande† nicolas@prowler.io
PROWLER.io
66-68 Hills Road
Cambridge, CB2 1LA, UK
Arno Solin arno.solin@aalto.fi
Aalto University
FI-00076 AALTO
Espoo, Finland
Editor: Manfred Opper
Abstract
This work brings together two powerful concepts in Gaussian processes: the variational approach to sparse approximation and the spectral representation of Gaussian processes. This gives rise to an approximation that inherits the benefits of the variational approach but with the representational power and computational scalability of spectral representations. The work hinges on a key result that there exist spectral features related to a finite domain of the Gaussian process which exhibit almost-independent covariances. We derive these expressions for Matérn kernels in one dimension, and generalize to more dimensions using kernels with specific structures. Under the assumption of additive Gaussian noise, our method requires only a single pass through the data set, making for very fast and accurate computation. We fit a model to 4 million training points in just a few minutes on a standard laptop. With non-conjugate likelihoods, our MCMC scheme reduces the cost of computation from O(NM²) (for a sparse Gaussian process) to O(NM) per iteration, where N is the number of data and M is the number of features.
Keywords: Gaussian processes, Fourier features, variational inference
1. Introduction
Efficient computation in Gaussian process (GP) models is of broad interest across machine learning and statistics, with applications in the fields of spatial epidemiology (Diggle, 2013; Banerjee et al., 2014), robotics and control (Deisenroth et al., 2015), signal processing (e.g. Särkkä et al., 2013), Bayesian optimization and probabilistic numerics (e.g. Osborne, 2010; Briol et al., 2016; Hennig et al., 2015), and many others.
Computation is challenging for several reasons. First, the computational complexity usually scales cubically with the number of data N, as one must decompose an N × N dense covariance matrix. Second, the posterior is both high-dimensional and non-Gaussian if the
∗. This work was partially completed whilst JH was affiliated to the Centre for Health Information, Computation, and Statistics (CHICAS), Faculty of Health and Medicine, Lancaster University, UK.
†. This work was partially completed whilst ND was affiliated to Institut Fayol–LIMOS, Mines Saint-Étienne, 158 cours Fauriel, Saint-Étienne, France.
© 2018 James Hensman, Nicolas Durrande, and Arno Solin.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/16-579.html.
data likelihood is not conjugate. Third, GP models have a hierarchical structure, with the latent values of the function being conditioned on parameters of the covariance function.
In this work, we adopt a variational Bayesian framework for dealing with these issues. By this we mean that we minimize the Kullback–Leibler (KL) divergence KL[q||p], where q denotes an approximation to the posterior, and p is the posterior itself. Many authors have applied this framework to GP problems before: perhaps the earliest example is Csató et al. (2002). Seeger (2003) also made use of this KL criterion in the context of Gaussian processes. Particularly influential works include Titsias (2009), who developed the first sparse GP using the KL divergence, and Opper and Archambeau (2009), who made a case for this approximation in the non-conjugate (but not sparse) case.
The variational framework for Gaussian processes has several advantages. First, the three challenges mentioned above can be tackled by a single unified objective. Second, the accuracy of the approximation can easily be shown to increase monotonically with increasing complexity of the proposed approximating family (i.e. more inducing points is always better). Third, the framework provides an evidence lower bound (or ELBO), which can be evaluated to compare different approximations.
In this work, we combine the variational approach with Fourier features; we refer to our method as Variational Fourier Features (VFF). Whilst most sparse Gaussian process approximations rely on inducing or pseudo inputs, which lie in the same domain as the inputs to the GP, Fourier features lie in the spectral domain of the process. As inducing-point approximations build up the posterior through kernel functions, Fourier features build up the approximate posterior through sinusoids. Our approach differs from existing works which use random frequencies (Rahimi and Recht, 2008, 2009) or optimized frequencies (Lázaro-Gredilla et al., 2010), in that we use frequencies placed on a regular grid (cf. Solin and Särkkä, 2014), which is key in our derivation of a computationally convenient matrix structure.
Naive combination of Fourier features with sparse GP approaches is not feasible because the standard Fourier transform of a stochastic process does not converge, meaning that these features are associated with random variables of infinite variance. Matthews et al. (2016) have shown that valid inducing features for the variational approach require finite, deterministic projections of the process. To combat this, Figueiras-Vidal and Lázaro-Gredilla (2009) used a windowing technique, which made the integrals converge to valid inducing features (for the squared-exponential kernel), but lost perhaps the most important aspect of a Fourier approach: that the features should be independent. In this work, we combine a finite window with a Reproducing Kernel Hilbert Space (RKHS) projection to construct features that are almost independent, having covariance structures that are easily decomposed. This gives rise to a complexity of O(NM), comparing favourably with the O(NM²) complexity of other variational approaches, where M denotes the number of features.
The contributions of this paper are the following:
• We introduce a novel approximation scheme for representing a GP using a Fourier decomposition. We limit our interest to the widely used Matérn family of stationary kernels, for which our proposed decomposition exhibits an almost-independent structure.
• We combine our approximation scheme with the variational Bayesian framework for Gaussian processes, and present how the methodology generalizes to GP inference problems with general likelihoods.
Consequently, this paper aims to provide an approximate Gaussian process inference scheme which is theoretically sound, has strong representative power, is extremely fast, and—most importantly—works well in practice.
The rest of this paper is structured as follows. Section 2 reviews the relevant background related to Gaussian process inference and previous work in sparse and spectral approximations to GP models. In Section 3 the core methodology for Variational Fourier Features is presented for the one-dimensional case. In Section 4 this is extended to multidimensional cases with additive and product structures. Implementation details and computational complexity of the method are covered in Section 5. Section 6 is dedicated to a set of illustrative toy examples and a number of empirical experiments, where practical aspects of Variational Fourier Feature inference are demonstrated. Finally, we conclude the paper with a discussion in Section 7.
2. Background
In this section, we first set up the family of Gaussian process models that we will consider, which allows us to establish some notation. Next, we review some basic results regarding the spectrum of stationary Gaussian processes, and recall how Fourier features approximate the kernel. We relate sparse Gaussian process approximations and Fourier approximations by explicating them as alternative models (Quiñonero-Candela and Rasmussen, 2005). We then recap the variational approximation to Gaussian processes, including expressions for sparse approximations and approximations for non-conjugate likelihoods. The two final subsections in this section discuss decomposition of Gaussian processes: we first link decomposition and conditioning and then discuss inter-domain decomposition.
2.1 Gaussian process models
Gaussian processes are distributions over functions, defined by a mean function and a covariance function (see Williams and Rasmussen, 2006, for an introduction). In this section we consider GPs over the real line, x ∈ R. Making the standard assumption that the mean function is zero, we write

f(x) ∼ GP(0, k(x, x′)) . (1)
The data y = [y_n]_{n=1}^N at locations X = [x_n]_{n=1}^N are conditioned on the function evaluations f = [f(x_n)]_{n=1}^N through some factorising likelihood

p(y | f(x)) = p(y | f) = ∏_n p(y_n | f(x_n)) , (2)

which we do not in general assume to be Gaussian. A defining property of Gaussian processes is that any finite number of function evaluations follow a multivariate normal distribution, so

p(f) = N(0, K_ff) , (3)
and the process conditioned on these values is given by a conditional Gaussian process:

f(x) | f ∼ GP( k_f(x)^⊤ K_ff^{−1} f , k(x, x′) − k_f(x)^⊤ K_ff^{−1} k_f(x′) ) . (4)

To compute the posterior over functions given the data f(x) | y, it is possible to compute the posterior only for the finite set p(f | y) using standard methodologies like Markov Chain Monte Carlo (MCMC) (e.g. Murray et al., 2010; Filippone et al., 2013) or variational Bayes (e.g. Opper and Archambeau, 2009; Khan et al., 2012, 2013). Evaluating the function at a test point means averaging equation (4) over this posterior p(f | y).
In our notation, the matrix K_ff is given by evaluating the covariance function at all pairs of data points, K_ff[i, j] = k(x_i, x_j). The vector-valued function k_f(x) is given similarly, k_f(x) = [k(x_1, x), k(x_2, x), . . . , k(x_N, x)]^⊤. We omit dependence on covariance function parameters for clarity.
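As a concrete illustration of equation (4), the conditional mean and covariance reduce to a single linear solve against K_ff. The sketch below is not part of the paper's method; it uses an illustrative squared-exponential kernel and toy data (all names are our own) purely to show the computation:

```python
import numpy as np

def kernel(x, x2, variance=1.0, lengthscale=0.3):
    # Illustrative squared-exponential covariance; any valid k(x, x') works here.
    r = x[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (r / lengthscale) ** 2)

def gp_conditional(x_star, X, f):
    """Mean and covariance of f(x*) | f, following equation (4)."""
    Kff = kernel(X, X) + 1e-9 * np.eye(len(X))  # jitter for numerical stability
    Kfs = kernel(X, x_star)                     # columns are k_f(x*)
    A = np.linalg.solve(Kff, Kfs)               # K_ff^{-1} k_f(x*)
    mean = A.T @ f
    cov = kernel(x_star, x_star) - Kfs.T @ A
    return mean, cov

X = np.array([0.0, 0.5, 1.0])
f = np.array([0.0, 1.0, -1.0])                  # observed function values
mean, cov = gp_conditional(np.linspace(0.0, 1.0, 5), X, f)
# the conditional interpolates: the mean at the training inputs recovers f,
# and the conditional variance there collapses towards zero
```

In practice one would use a Cholesky factorisation of K_ff rather than a fresh solve per test batch, but the structure of equation (4) is the same.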
2.2 Spectral representations of stationary covariances
Stationary Gaussian processes are those with a covariance function that can be written as a function of the distance between observations, k(x, x′) = k(|x − x′|) = k(r). Of particular interest in this work will be the Matérn family with half-integer orders, the first three of which are
k_{1/2}(r) = σ² exp(−r/ℓ) ,
k_{3/2}(r) = σ² (1 + √3 r/ℓ) exp(−√3 r/ℓ) ,
k_{5/2}(r) = σ² (1 + √5 r/ℓ + (5/3) r²/ℓ²) exp(−√5 r/ℓ) , (5)

where σ² and ℓ are variance and lengthscale parameters respectively.
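The three half-integer Matérn kernels in equation (5) translate directly to code. A minimal NumPy sketch (function names are our own):

```python
import numpy as np

def matern12(r, variance=1.0, lengthscale=1.0):
    # k_{1/2}(r) = σ² exp(−r/ℓ), the exponential covariance
    return variance * np.exp(-r / lengthscale)

def matern32(r, variance=1.0, lengthscale=1.0):
    # k_{3/2}(r) = σ² (1 + √3 r/ℓ) exp(−√3 r/ℓ)
    a = np.sqrt(3.0) * r / lengthscale
    return variance * (1.0 + a) * np.exp(-a)

def matern52(r, variance=1.0, lengthscale=1.0):
    # k_{5/2}(r) = σ² (1 + √5 r/ℓ + (5/3) r²/ℓ²) exp(−√5 r/ℓ);
    # with a = √5 r/ℓ, the polynomial factor is 1 + a + a²/3
    a = np.sqrt(5.0) * r / lengthscale
    return variance * (1.0 + a + a**2 / 3.0) * np.exp(-a)
```

Each takes the distance r = |x − x′| ≥ 0; at r = 0 all three return the variance σ², and for fixed r > 0 the higher-order (smoother) kernels decay more slowly.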
Bochner's theorem (Akhiezer and Glazman, 1993) tells us that any continuous positive definite function, such as a covariance function, can be represented as the Fourier transform of a positive measure. If the measure has a density, it is known as the spectral density s(ω) of the covariance function. This gives rise to the Fourier duality of spectral densities and covariance functions, known as the Wiener–Khintchin theorem. It gives the following relations:

s(ω) = F{k(r)} = ∫_{−∞}^{∞} k(r) e^{−iωr} dr , (6)
k(r) = F^{−1}{s(ω)} = (1/2π) ∫_{−∞}^{∞} s(ω) e^{iωr} dω . (7)
Since kernels are symmetric real functions, the spectrum of the process is also a symmetric real function. The spectra corresponding to the Matérn covariances are

s_{1/2}(ω) = 2σ²λ(λ² + ω²)^{−1} , λ = ℓ^{−1} , (8)
s_{3/2}(ω) = 4σ²λ³(λ² + ω²)^{−2} , λ = √3 ℓ^{−1} , (9)
s_{5/2}(ω) = (16/3) σ²λ⁵(λ² + ω²)^{−3} , λ = √5 ℓ^{−1} . (10)
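As a sanity check (not from the paper), the duality between equations (7) and (8) can be verified numerically: integrating the Matérn-1/2 spectrum against cos(ωr) recovers σ² exp(−r/ℓ) up to quadrature and truncation error. A sketch, assuming NumPy:

```python
import numpy as np

def matern12_spectrum(omega, variance=1.0, lengthscale=1.0):
    # Equation (8): s_{1/2}(ω) = 2σ²λ(λ² + ω²)^{−1}, with λ = ℓ^{−1}
    lam = 1.0 / lengthscale
    return 2.0 * variance * lam / (lam**2 + omega**2)

# Inverse transform, equation (7), approximated on a wide uniform grid;
# the spectrum is real and symmetric, so e^{iωr} reduces to cos(ωr).
omega = np.linspace(-2000.0, 2000.0, 400001)
d = omega[1] - omega[0]
r = 0.7
k_rec = np.sum(matern12_spectrum(omega) * np.cos(omega * r)) * d / (2.0 * np.pi)
# k_rec ≈ σ² exp(−r/ℓ) = exp(−0.7), up to the truncated spectral tail
```

The slow ω^{−2} tail of the Matérn-1/2 spectrum (mirroring the non-differentiability of the exponential kernel at r = 0) is why the grid must extend far beyond the lengthscale-induced bandwidth.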