Published in: Journal of Machine Learning Research, 01/01/2018. Document version: Publisher's PDF (version of record). License: CC BY.
Please cite the original version:
Hensman, J., Durrande, N., & Solin, A. (2018). Variational Fourier Features for Gaussian Processes. Journal of Machine Learning Research, 18(1), 1–52. [151]. https://dl.acm.org/citation.cfm?id=3242008
Journal of Machine Learning Research 18 (2018) 1-52 Submitted 11/16; Revised 11/17; Published 4/18
Variational Fourier Features for Gaussian Processes
James Hensman∗ james@prowler.io
Nicolas Durrande† nicolas@prowler.io
PROWLER.io
66-68 Hills Road
Cambridge, CB2 1LA, UK
Arno Solin arno.solin@aalto.fi
Aalto University
FI-00076 AALTO
Espoo, Finland
Editor: Manfred Opper
Abstract
This work brings together two powerful concepts in Gaussian processes: the variational approach to sparse approximation and the spectral representation of Gaussian processes. This gives rise to an approximation that inherits the benefits of the variational approach but with the representational power and computational scalability of spectral representations. The work hinges on a key result that there exist spectral features related to a finite domain of the Gaussian process which exhibit almost-independent covariances. We derive these expressions for Matérn kernels in one dimension, and generalize to more dimensions using kernels with specific structures. Under the assumption of additive Gaussian noise, our method requires only a single pass through the data set, making for very fast and accurate computation. We fit a model to 4 million training points in just a few minutes on a standard laptop. With non-conjugate likelihoods, our MCMC scheme reduces the cost of computation from O(NM²) (for a sparse Gaussian process) to O(NM) per iteration, where N is the number of data and M is the number of features.
Keywords: Gaussian processes, Fourier features, variational inference
1. Introduction
Efficient computation in Gaussian process (GP) models is of broad interest across machine learning and statistics, with applications in the fields of spatial epidemiology (Diggle, 2013; Banerjee et al., 2014), robotics and control (Deisenroth et al., 2015), signal processing (e.g. Särkkä et al., 2013), Bayesian optimization and probabilistic numerics (e.g. Osborne, 2010; Briol et al., 2016; Hennig et al., 2015), and many others.
Computation is challenging for several reasons. First, the computational complexity usually scales cubically with the number of data N, as one must decompose an N × N dense covariance matrix. Second, the posterior is both high-dimensional and non-Gaussian if the
∗. This work was partially completed whilst JH was affiliated to the Centre for Health Information, Computation, and Statistics (CHICAS), Faculty of Health and Medicine, Lancaster University, UK.
†. This work was partially completed whilst ND was affiliated to Institut Fayol–LIMOS, Mines Saint-Étienne, 158 cours Fauriel, Saint-Étienne, France.
© 2018 James Hensman, Nicolas Durrande, and Arno Solin.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/16-579.html.
data likelihood is not conjugate. Third, GP models have a hierarchical structure, with the latent values of the function being conditioned on parameters of the covariance function.
In this work, we adopt a variational Bayesian framework for dealing with these issues. By this we mean that we minimize the Kullback–Leibler (KL) divergence KL[q||p], where q denotes an approximation to the posterior, and p is the posterior itself. Many authors have applied this framework to GP problems before: perhaps the earliest example is Csató et al. (2002). Seeger (2003) also made use of this KL criterion in the context of Gaussian processes. Particularly influential works include Titsias (2009), who developed the first sparse GP using the KL divergence, and Opper and Archambeau (2009), who made a case for this approximation in the non-conjugate (but not sparse) case.
The variational framework for Gaussian processes has several advantages. First, the three challenges mentioned above can be tackled by a single unified objective. Second, the accuracy of the approximation can easily be shown to increase monotonically with increasing complexity of the proposed approximating family (i.e. more inducing points is always better). Third, the framework provides an evidence lower bound (or ELBO), which can be evaluated to compare different approximations.
In this work, we combine the variational approach with Fourier features; we refer to our method as Variational Fourier Features (VFF). Whilst most sparse Gaussian process approximations rely on inducing or pseudo inputs, which lie in the same domain as the inputs to the GP, Fourier features lie in the spectral domain of the process. As inducing-point approximations build up the posterior through kernel functions, Fourier features build up the approximate posterior through sinusoids. Our approach differs from existing works which use random frequencies (Rahimi and Recht, 2008, 2009) or optimized frequencies (Lázaro-Gredilla et al., 2010), in that we use frequencies placed on a regular grid (cf. Solin and Särkkä, 2014), which is key in our derivation of a computationally convenient matrix structure.
Naive combination of Fourier features with sparse GP approaches is not feasible because the standard Fourier transform of a stochastic process does not converge, meaning that these features are associated with random variables of infinite variance. Matthews et al. (2016) have shown that valid inducing features for the variational approach require finite, deterministic projections of the process. To combat this, Figueiras-Vidal and Lázaro-Gredilla (2009) used a windowing technique, which made the integrals converge to valid inducing features (for the squared-exponential kernel), but lost perhaps the most important aspect of a Fourier approach: that the features should be independent. In this work, we combine a finite window with a Reproducing Kernel Hilbert Space (RKHS) projection to construct features that are almost independent, having covariance structures that are easily decomposed. This gives rise to a complexity of O(NM), comparing favourably with the O(NM²) complexity of other variational approaches, where M denotes the number of features.
The contributions of this paper are the following:
• We introduce a novel approximation scheme for representing a GP using a Fourier decomposition. We limit our interest to the widely used Matérn family of stationary kernels, for which our proposed decomposition exhibits an almost-independent structure.
• We combine our approximation scheme with the variational Bayesian framework for Gaussian processes, and present how the methodology generalizes to GP inference problems with general likelihoods.
Consequently, this paper aims to provide an approximate Gaussian process inference scheme which is theoretically sound, has strong representative power, is extremely fast, and—most importantly—works well in practice.
The rest of this paper is structured as follows. Section 2 reviews the relevant background related to Gaussian process inference and previous work in sparse and spectral approximations to GP models. In Section 3 the core methodology for Variational Fourier Features is presented for the one-dimensional case. In Section 4 this is extended to multidimensional cases with additive and product structures. Implementation details and computational complexity of the method are covered in Section 5. Section 6 is dedicated to a set of illustrative toy examples and a number of empirical experiments, where practical aspects of Variational Fourier Feature inference are demonstrated. Finally, we conclude the paper with a discussion in Section 7.
2. Background
In this section, we first set up the family of Gaussian process models that we will consider, which allows us to establish some notation. Next, we review some basic results regarding the spectrum of stationary Gaussian processes, and recall how Fourier features approximate the kernel. We relate sparse Gaussian process approximations and Fourier approximations by explicating them as alternative models (Quiñonero-Candela and Rasmussen, 2005). We then recap the variational approximation to Gaussian processes, including expressions for sparse approximations and approximations for non-conjugate likelihoods. The two final subsections in this section discuss decomposition of Gaussian processes: we first link decomposition and conditioning and then discuss inter-domain decomposition.
2.1 Gaussian process models
Gaussian processes are distributions over functions, defined by a mean function and a covariance function (see Williams and Rasmussen, 2006, for an introduction). In this section we consider GPs over the real line, x ∈ R. Making the standard assumption that the mean function is zero, we write

f(x) ∼ GP(0, k(x, x′)) . (1)
The data y = [y_n]_{n=1}^N at locations X = [x_n]_{n=1}^N are conditioned on the function evaluations f = [f(x_n)]_{n=1}^N through some factorising likelihood

p(y | f(x)) = p(y | f) = ∏_n p(y_n | f(x_n)) , (2)

which we do not in general assume to be Gaussian. A defining property of Gaussian processes is that any finite number of function evaluations follow a multivariate normal distribution, so

p(f) = N(0, K_ff) , (3)
and the process conditioned on these values is given by a conditional Gaussian process:

f(x) | f ∼ GP( k_f(x)^⊤ K_ff^{−1} f , k(x, x′) − k_f(x)^⊤ K_ff^{−1} k_f(x′) ) . (4)

To compute the posterior over functions given the data f(x) | y, it is possible to compute the posterior only for the finite set p(f | y) using standard methodologies like Markov Chain Monte Carlo (MCMC) (e.g. Murray et al., 2010; Filippone et al., 2013) or variational Bayes (e.g. Opper and Archambeau, 2009; Khan et al., 2012, 2013). Evaluating the function at a test point means averaging equation (4) over this posterior p(f | y).
In our notation, the matrix K_ff is given by evaluating the covariance function at all pairs of data points, K_ff[i, j] = k(x_i, x_j). The vector-valued function k_f(x) is given similarly, k_f(x) = [k(x_1, x), k(x_2, x), . . . , k(x_N, x)]^⊤. We omit dependence on covariance function parameters for clarity.
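As a concrete illustration of equation (4), the conditional mean and covariance reduce to a single linear solve against K_ff. The sketch below is not part of the paper's method; it uses an illustrative squared-exponential kernel and toy data (all names are our own) purely to show the computation:

```python
import numpy as np

def kernel(x, x2, variance=1.0, lengthscale=0.3):
    # Illustrative squared-exponential covariance; any valid k(x, x') works here.
    r = x[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (r / lengthscale) ** 2)

def gp_conditional(x_star, X, f):
    """Mean and covariance of f(x*) | f, following equation (4)."""
    Kff = kernel(X, X) + 1e-9 * np.eye(len(X))  # jitter for numerical stability
    Kfs = kernel(X, x_star)                     # columns are k_f(x*)
    A = np.linalg.solve(Kff, Kfs)               # K_ff^{-1} k_f(x*)
    mean = A.T @ f
    cov = kernel(x_star, x_star) - Kfs.T @ A
    return mean, cov

X = np.array([0.0, 0.5, 1.0])
f = np.array([0.0, 1.0, -1.0])                  # observed function values
mean, cov = gp_conditional(np.linspace(0.0, 1.0, 5), X, f)
# the conditional interpolates: the mean at the training inputs recovers f,
# and the conditional variance there collapses towards zero
```

In practice one would use a Cholesky factorisation of K_ff rather than a fresh solve per test batch, but the structure of equation (4) is the same.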
2.2 Spectral representations of stationary covariances
Stationary Gaussian processes are those with a covariance function that can be written as a function of the distance between observations, k(x, x′) = k(|x − x′|) = k(r). Of particular interest in this work will be the Matérn family with half-integer orders, the first three of which are
k_{1/2}(r) = σ² exp(−r/ℓ) ,
k_{3/2}(r) = σ² (1 + √3 r/ℓ) exp(−√3 r/ℓ) ,
k_{5/2}(r) = σ² (1 + √5 r/ℓ + (5/3) r²/ℓ²) exp(−√5 r/ℓ) , (5)

where σ² and ℓ are variance and lengthscale parameters respectively.
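The three half-integer Matérn kernels in equation (5) translate directly to code. A minimal NumPy sketch (function names are our own):

```python
import numpy as np

def matern12(r, variance=1.0, lengthscale=1.0):
    # k_{1/2}(r) = σ² exp(−r/ℓ), the exponential covariance
    return variance * np.exp(-r / lengthscale)

def matern32(r, variance=1.0, lengthscale=1.0):
    # k_{3/2}(r) = σ² (1 + √3 r/ℓ) exp(−√3 r/ℓ)
    a = np.sqrt(3.0) * r / lengthscale
    return variance * (1.0 + a) * np.exp(-a)

def matern52(r, variance=1.0, lengthscale=1.0):
    # k_{5/2}(r) = σ² (1 + √5 r/ℓ + (5/3) r²/ℓ²) exp(−√5 r/ℓ);
    # with a = √5 r/ℓ, the polynomial factor is 1 + a + a²/3
    a = np.sqrt(5.0) * r / lengthscale
    return variance * (1.0 + a + a**2 / 3.0) * np.exp(-a)
```

Each takes the distance r = |x − x′| ≥ 0; at r = 0 all three return the variance σ², and for fixed r > 0 the higher-order (smoother) kernels decay more slowly.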
Bochner's theorem (Akhiezer and Glazman, 1993) tells us that any continuous positive definite function, such as a covariance function, can be represented as the Fourier transform of a positive measure. If the measure has a density, it is known as the spectral density s(ω) of the covariance function. This gives rise to the Fourier duality of spectral densities and covariance functions, known as the Wiener–Khintchin theorem. It gives the following relations:

s(ω) = F{k(r)} = ∫_{−∞}^{∞} k(r) e^{−iωr} dr , (6)
k(r) = F^{−1}{s(ω)} = (1/2π) ∫_{−∞}^{∞} s(ω) e^{iωr} dω . (7)
Since kernels are symmetric real functions, the spectrum of the process is also a symmetric real function. The spectra corresponding to the Matérn covariances are

s_{1/2}(ω) = 2σ²λ(λ² + ω²)^{−1} , λ = ℓ^{−1} , (8)
s_{3/2}(ω) = 4σ²λ³(λ² + ω²)^{−2} , λ = √3 ℓ^{−1} , (9)
s_{5/2}(ω) = (16/3) σ²λ⁵(λ² + ω²)^{−3} , λ = √5 ℓ^{−1} . (10)
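As a sanity check (not from the paper), the duality between equations (7) and (8) can be verified numerically: integrating the Matérn-1/2 spectrum against cos(ωr) recovers σ² exp(−r/ℓ) up to quadrature and truncation error. A sketch, assuming NumPy:

```python
import numpy as np

def matern12_spectrum(omega, variance=1.0, lengthscale=1.0):
    # Equation (8): s_{1/2}(ω) = 2σ²λ(λ² + ω²)^{−1}, with λ = ℓ^{−1}
    lam = 1.0 / lengthscale
    return 2.0 * variance * lam / (lam**2 + omega**2)

# Inverse transform, equation (7), approximated on a wide uniform grid;
# the spectrum is real and symmetric, so e^{iωr} reduces to cos(ωr).
omega = np.linspace(-2000.0, 2000.0, 400001)
d = omega[1] - omega[0]
r = 0.7
k_rec = np.sum(matern12_spectrum(omega) * np.cos(omega * r)) * d / (2.0 * np.pi)
# k_rec ≈ σ² exp(−r/ℓ) = exp(−0.7), up to the truncated spectral tail
```

The slow ω^{−2} tail of the Matérn-1/2 spectrum (mirroring the non-differentiability of the exponential kernel at r = 0) is why the grid must extend far beyond the lengthscale-induced bandwidth.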