Manifold Gaussian Processes for Regression

Roberto Calandra, Jan Peters, Carl Edward Rasmussen and Marc Peter Deisenroth§

Intelligent Autonomous Systems Lab, Technische Universität Darmstadt, Germany
Max Planck Institute for Intelligent Systems, Tübingen, Germany
Department of Engineering, University of Cambridge, United Kingdom
§ Department of Computing, Imperial College London, United Kingdom
Abstract—Off-the-shelf Gaussian Process (GP) covariance functions encode smoothness assumptions on the structure of the function to be modeled. To model complex and non-differentiable functions, these smoothness assumptions are often too restrictive. One way to alleviate this limitation is to find a different representation of the data by introducing a feature space. This feature space is often learned in an unsupervised way, which might lead to data representations that are not useful for the overall regression task. In this paper, we propose Manifold Gaussian Processes, a novel supervised method that jointly learns a transformation of the data into a feature space and a GP regression from the feature space to the observed space. The Manifold GP is a full GP and allows learning data representations that are useful for the overall regression task. As a proof of concept, we evaluate our approach on complex non-smooth functions where standard GPs perform poorly, such as step functions and robotics tasks with contacts.
1. Introduction
Gaussian Processes (GPs) are a powerful state-of-the-art nonparametric Bayesian regression method. The covariance function of a GP implicitly encodes high-level assumptions about the underlying function to be modeled, e.g., smoothness or periodicity. Hence, the choice of a suitable covariance function for a specific data set is crucial. A standard choice is the squared exponential (Gaussian) covariance function, which implies assumptions such as smoothness and stationarity. Although the squared exponential can be applied to a great range of problems, generic covariance functions may also be inadequate to model a variety of functions where the common smoothness assumptions are violated, such as ground contacts in robot locomotion.
Two common approaches can overcome the limitations of standard covariance functions. The first approach combines multiple standard covariance functions to form a new covariance function (Rasmussen and Williams, 2006; Wilson and Adams, 2013; Duvenaud et al., 2013). This approach allows relatively complex covariance functions to be designed automatically. However, the resulting covariance function is still limited by the properties of the combined covariance functions. The second approach is based on data transformation (or pre-processing), after which the data can be modeled with standard covariance functions. One way to implement this second approach is to transform the output space, as in the Warped GP (Snelson et al., 2004). An alternative is to transform the input space. Transforming the input space and subsequently applying GP regression with a standard covariance function is equivalent to GP regression with a new covariance function that explicitly depends on the transformation (MacKay, 1998). One example is the stationary periodic covariance function (MacKay, 1998; HajiGhassemi and Deisenroth, 2014), which effectively is the squared exponential covariance function applied to a complex representation of the input variables. Common transformations of the inputs include data normalization and dimensionality reduction, e.g., PCA (Pearson, 1901). Generally, these input transformations are good heuristics or optimize an unsupervised objective. However, they may be suboptimal for the overall regression task.
In this paper, we propose the Manifold Gaussian Process (mGP), which is based on MacKay's ideas to devise flexible covariance functions for GPs. Our GP model is equivalent to jointly learning a data transformation into a feature space followed by a GP regression with off-the-shelf covariance functions from feature space to observed space. The model profits from standard GP properties, such as a straightforward incorporation of a prior mean function and a faithful representation of model uncertainty.
Multiple related approaches in the literature attempt joint supervised learning of features and regression/classification. In Salakhutdinov and Hinton (2007), pre-training of the input transformation makes use of computationally expensive unsupervised learning that requires thousands of data points. Snoek et al. (2012) combined both unsupervised and supervised objectives for the optimization of an input transformation in a classification task. Unlike these approaches, the mGP is motivated by the need for stronger (i.e., supervised) guidance to discover suitable transformations for regression problems, while remaining within a Bayesian framework. Damianou and Lawrence (2013) proposed the Deep GP, which stacks multiple layers of GP-LVMs, similarly to a neural network. This model exhibits great flexibility in supervised and unsupervised settings, but the resulting model is not a full GP. Snelson and Ghahramani (2006) proposed a supervised dimensionality reduction by jointly learning a linear transformation of the input and a GP.

Figure 1: Different regression settings to learn the function F : X → Y. (a) Standard supervised regression. (b) Regression with an auxiliary latent space L that allows the task to be simplified. In a full Bayesian framework, L would be integrated out, which is analytically intractable. (c) Decomposition of the overall regression task F into discovering a feature space H using the map M and a subsequent (conditional) regression G|M. (d) Our mGP learns the mappings G and M jointly.
Snoek et al. (2014) transformed the input data using a Beta distribution whose parameters were learned jointly with the GP. However, the purpose of this transformation is to account for skewness in the data, while the mGP allows for a more general class of transformations.
2. Manifold Gaussian Processes
In the following, we review methods for regression, which may use latent or feature spaces. Then, we provide a brief introduction to Gaussian Process regression. Finally, we introduce the Manifold Gaussian Processes, our novel approach to jointly learning a regression model and a suitable feature representation of the data.
2.1. Regression with Learned Features
We assume N training inputs x_n ∈ X ⊆ R^D and respective outputs y_n ∈ Y ⊆ R, where y_n = F(x_n) + w, w ∼ N(0, σ_w²), n = 1, ..., N. The training data is denoted by X and Y for the inputs and targets, respectively. We consider the task of learning a regression function F : X → Y. The corresponding setting is given in Figure 1a. Discovering the regression function F is often challenging for nonlinear functions. A typical way to simplify and distribute the complexity of the regression problem is to introduce an auxiliary latent space L. The function F can then be decomposed into F = G ∘ M, where M : X → L and G : L → Y, as shown in Figure 1b. In a full Bayesian framework, the latent space L is integrated out to solve the regression task F, which is often analytically infeasible (Schmidt and O'Hagan, 2003).
A common approximation to the full Bayesian framework is to introduce a deterministic feature space H and to find the mappings M and G in two consecutive steps. First, M is determined by means of unsupervised feature learning. Second, the regression G is learned in a supervised manner as a conditional model G|M, see Figure 1c. The use of this feature space can reduce the complexity of the learning problem. For example, for complicated non-linear functions, a higher-dimensional (overcomplete) representation H allows learning a simpler mapping G : H → Y. For high-dimensional inputs, the data often lies on a lower-dimensional manifold H, e.g., due to non-discriminant or strongly correlated covariates. The lower-dimensional feature space H reduces the effect of the curse of dimensionality. In this paper, we focus on modeling complex functions with a relatively low-dimensional input space, which, nonetheless, cannot be well modeled by off-the-shelf GP covariance functions.
Typically, unsupervised feature learning methods determine the mapping M by optimizing an unsupervised objective, independent from the objective of the overall regression F. Examples of such unsupervised objectives are the minimization of the input reconstruction error (auto-encoders (Vincent et al., 2008)), maximization of the variance (PCA (Pearson, 1901)), maximization of statistical independence (ICA (Hyvärinen and Oja, 2000)), or the preservation of the distances between data points (isomap (Tenenbaum et al., 2000) or LLE (Roweis and Saul, 2000)). In the context of regression, an unsupervised approach to feature learning can be insufficient, as the learned data representation H might not suit the overall regression task F (Wahlström et al., 2015): unsupervised and supervised learning optimize different objectives, which do not necessarily match, e.g., minimizing the reconstruction error as unsupervised objective and maximizing the marginal likelihood as supervised objective. An approach where feature learning is performed in a supervised manner can instead guide learning the feature mapping M toward representations that are useful for the overall regression F = G ∘ M. This intuition is the key insight of our Manifold Gaussian Processes, where the feature mapping M and the GP G are learned jointly using the same supervised objective, as depicted in Figure 1d.
2.2. Gaussian Process Regression
GPs are a state-of-the-art probabilistic non-parametric regression method (Rasmussen and Williams, 2006). Such a GP is a distribution over functions

F ∼ GP(m, k)    (1)

and is fully defined by a mean function m (in our case, m ≡ 0) and a covariance function k. The GP predictive distribution at a test input x∗ is given by

p(F(x∗) | D, x∗) = N(μ(x∗), σ²(x∗)) ,    (2)
μ(x∗) = k∗^T (K + σ_w² I)^{-1} Y ,    (3)
σ²(x∗) = k∗∗ − k∗^T (K + σ_w² I)^{-1} k∗ ,    (4)

where D = {X, Y} is the training data, K is the kernel matrix with K_ij = k(x_i, x_j), k∗∗ = k(x∗, x∗), k∗ = k(X, x∗), and σ_w² is the measurement noise variance.
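Equations (2)-(4) translate directly into a few lines of linear algebra. The following NumPy sketch is our own illustration, not the authors' code: the covariance function k is passed in as a plain Python function, and a Cholesky decomposition replaces the explicit matrix inverse for numerical stability.

```python
import numpy as np

def gp_predict(X, Y, x_star, k, sigma_w2):
    """GP predictive mean and variance at a single test input x_star (Eqs. 2-4).

    X: (N, D) training inputs, Y: (N,) training targets,
    k: covariance function k(x_p, x_q) -> float, sigma_w2: noise variance.
    """
    N = X.shape[0]
    K = np.array([[k(X[i], X[j]) for j in range(N)] for i in range(N)])  # K_ij = k(x_i, x_j)
    k_star = np.array([k(X[i], x_star) for i in range(N)])               # k_* = k(X, x_*)
    k_ss = k(x_star, x_star)                                             # k_**

    # Solve with (K + sigma_w^2 I) via Cholesky instead of forming the inverse.
    L = np.linalg.cholesky(K + sigma_w2 * np.eye(N))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))   # (K + sigma_w^2 I)^{-1} Y
    v = np.linalg.solve(L, k_star)

    mu = k_star @ alpha               # Eq. (3)
    var = k_ss - v @ v                # Eq. (4)
    return mu, var
```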
In our experiments, we use different covariance functions k. Specifically, we use the squared exponential covariance function with Automatic Relevance Determination (ARD)

k_SE(x_p, x_q) = σ_f² exp(−½ (x_p − x_q)^T Λ^{-1} (x_p − x_q)) ,    (5)

with Λ = diag([l_1², ..., l_D²]), where l_i are the characteristic length-scales and σ_f² is the variance of the latent function F.
Furthermore, we use the neural network covariance function

k_NN(x_p, x_q) = σ_f² sin^{-1}( x_p^T P x_q / sqrt((1 + x_p^T P x_p)(1 + x_q^T P x_q)) ) ,    (6)

where P is a weight matrix.
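Written out in code, the two covariance functions of Equations (5) and (6) are one-liners. The sketch below is our own illustration (the argument names sigma_f2, lengthscales, and P are ours, not from the paper); P is assumed positive semi-definite so that the arcsine argument stays within [−1, 1].

```python
import numpy as np

def k_se_ard(x_p, x_q, sigma_f2, lengthscales):
    """Squared exponential covariance with ARD, Eq. (5); Lambda = diag(lengthscales**2)."""
    d = (x_p - x_q) / lengthscales          # Lambda^{-1/2} (x_p - x_q)
    return sigma_f2 * np.exp(-0.5 * d @ d)

def k_nn(x_p, x_q, sigma_f2, P):
    """Neural network covariance, Eq. (6); P is a positive semi-definite weight matrix."""
    num = x_p @ P @ x_q
    den = np.sqrt((1.0 + x_p @ P @ x_p) * (1.0 + x_q @ P @ x_q))
    return sigma_f2 * np.arcsin(num / den)
```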
Each covariance function possesses various hyperparameters θ to be selected. This selection is performed by minimizing the Negative Log Marginal Likelihood (NLML)

NLML(θ) = −log p(Y | X, θ) ≐ ½ Y^T (K_θ + σ_w² I)^{-1} Y + ½ log |K_θ + σ_w² I| .    (7)

Using the chain rule, the corresponding gradient can be computed analytically as

∂NLML(θ)/∂θ = ∂NLML(θ)/∂K_θ · ∂K_θ/∂θ ,    (8)

which allows us to optimize the hyperparameters using Quasi-Newton optimization, e.g., L-BFGS (Liu and Nocedal, 1989).
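As an illustration of Equation (7) and the quasi-Newton optimization mentioned above, the following sketch evaluates the NLML for an SE-ARD kernel and minimizes it with SciPy's L-BFGS-B. It is a simplified stand-in for the described procedure, not the authors' implementation: the hyperparameters are optimized in log space, the analytic gradient of Equation (8) is replaced by finite differences, and the toy data is arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def nlml(log_theta, X, Y):
    """Negative log marginal likelihood, Eq. (7), up to the constant N/2 log(2*pi).
    log_theta = [log sigma_f^2, log l_1, ..., log l_D, log sigma_w^2]."""
    N, D = X.shape
    sigma_f2 = np.exp(log_theta[0])
    ell = np.exp(log_theta[1:1 + D])
    sigma_w2 = np.exp(log_theta[-1])

    diff = (X[:, None, :] - X[None, :, :]) / ell          # pairwise scaled differences
    K = sigma_f2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))
    L = np.linalg.cholesky(K + sigma_w2 * np.eye(N))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    return 0.5 * Y @ alpha + np.sum(np.log(np.diag(L)))   # 0.5*log|K + s2*I| = sum(log diag(L))

# Example: fit the hyperparameters on toy data (finite-difference gradients).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
theta0 = np.zeros(1 + X.shape[1] + 1)
result = minimize(nlml, theta0, args=(X, Y), method="L-BFGS-B")
print("optimized log-hyperparameters:", result.x)
```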
2.3. Manifold Gaussian Processes
In this section, we describe the mGP model and its parameters θ_mGP, and relate it to standard GP regression. Furthermore, we detail training and prediction with the mGP.
2.3.1. Model. As shown in Figure 1d, the mGP considers the overall regression as a composition of functions

F = G ∘ M .    (9)

The two functions M and G are learned jointly to accomplish the overall regression objective, i.e., the marginal likelihood in Equation (7). In this paper, we assume that M is a deterministic, parametrized function that maps the input space X into the feature space H ⊆ R^Q, which serves as the domain for the GP regression G : H → Y. Performing this transformation of the input data corresponds to training a GP G having H = M(X) as inputs. Therefore, the mGP is equivalent to a GP for a function F : X → Y with a covariance function k̃ defined as

k̃(x_p, x_q) = k(M(x_p), M(x_q)) ,    (10)

i.e., the kernel operates on the Q-dimensional feature space H = M(X). According to MacKay (1998), a function defined as in Equation (10) is a valid covariance function and, therefore, the mGP is a valid GP.
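Equation (10) amounts to composing an ordinary covariance function with the feature map. A minimal sketch of this construction (our own illustration; M stands for any deterministic feature map, and k_se_ard refers to the kernel sketch above):

```python
def manifold_kernel(k, M):
    """Return the mGP covariance k_tilde(x_p, x_q) = k(M(x_p), M(x_q)) of Eq. (10)."""
    def k_tilde(x_p, x_q):
        return k(M(x_p), M(x_q))
    return k_tilde

# Example: wrap an SE-ARD kernel around a (hypothetical) feature map M.
# k_tilde = manifold_kernel(lambda h_p, h_q: k_se_ard(h_p, h_q, sigma_f2, lengthscales), M)
```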
The predictive distribution for the mGP at a test input x∗ can then be derived from the predictive distribution of a standard GP in Equation (2) as

p(F(x∗) | D, x∗) = p((G ∘ M)(x∗) | D, x∗) = N(μ(M(x∗)), σ²(M(x∗))) ,    (11)
μ(M(x∗)) = k̃∗^T (K̃ + σ_w² I)^{-1} Y ,    (12)
σ²(M(x∗)) = k̃∗∗ − k̃∗^T (K̃ + σ_w² I)^{-1} k̃∗ ,    (13)

where K̃ is the kernel matrix constructed as K̃_ij = k̃(x_i, x_j), k̃∗∗ = k̃(x∗, x∗), k̃∗ = k̃(X, x∗), and k̃ is the covariance function from Equation (10). In our experiments, we used the squared exponential covariance function from Equation (5) for the kernel k in Equation (10).
2.3.2. Training. We train the mGP by jointly optimizing the parameters θ_M of the transformation M and the GP hyperparameters θ_G. For learning the parameters θ_mGP = [θ_M, θ_G], we minimize the NLML as in standard GP regression. Considering the composition of the mapping F = G ∘ M, the NLML becomes

NLML(θ_mGP) = −log p(Y | X, θ_mGP) ≐ ½ Y^T (K̃_{θ_mGP} + σ_w² I)^{-1} Y + ½ log |K̃_{θ_mGP} + σ_w² I| .    (14)
Note that K̃_{θ_mGP} depends on both θ_G and θ_M, unlike K_θ from Equation (7), which depends only on θ_G. The analytic gradients ∂NLML/∂θ_G of the objective in Equation (14) with respect to the parameters θ_G are computed as in the standard GP, i.e.,

∂NLML(θ_mGP)/∂θ_G = ∂NLML(θ_mGP)/∂K̃_{θ_mGP} · ∂K̃_{θ_mGP}/∂θ_G .
The gradients of the parameters θ_M of the feature mapping are computed by applying the chain rule

∂NLML(θ_mGP)/∂θ_M = ∂NLML(θ_mGP)/∂K̃_{θ_mGP} · ∂K̃_{θ_mGP}/∂H · ∂H/∂θ_M ,    (15)

where only ∂H/∂θ_M depends on the chosen input transformation M, while ∂K̃_{θ_mGP}/∂H is the gradient of the kernel matrix with respect to the Q-dimensional GP training inputs H = M(X). Similarly to the standard GP, the parameters θ_mGP in the mGP can be obtained using off-the-shelf optimization methods.
2.3.3. Input Transformation. Our approach can use any deterministic parametric data transformation M. We focus on multi-layer neural networks and define their structure as [q_1 ... q_l], where l is the number of layers and q_i is the number of neurons of the i-th layer. Each layer i = 1, ..., l of the neural network performs the transformation

T_i(Z) = σ(W_i Z + B_i) ,    (16)

where Z is the input of the layer, σ is the transfer function, and W_i and B_i are the weights and the bias of the layer, respectively. Therefore, the input transformation M of Equation (9) is M(X) = (T_l ∘ ... ∘ T_1)(X). The parameters θ_M of the neural network M are the weights and biases of the whole network, so that θ_M = [W_1, B_1, ..., W_l, B_l]. The gradients ∂H/∂θ_M in Equation (15) are computed by repeated application of the chain rule (backpropagation).
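Putting the training objective of Section 2.3.2 and the network transformation above together, the following sketch packs θ_M (weights and biases) and θ_G (log GP hyperparameters) into a single vector θ_mGP and minimizes the NLML of Equation (14) with L-BFGS. It is only an illustration, not the authors' code: the analytic chain-rule gradients (Equation (15) and its θ_G counterpart) are replaced by finite differences, the log-sigmoid transfer function anticipates a choice made in the experiments below, the helper names are ours, and the toy data is arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def feature_map(X, Ws, Bs):
    """M(X) = (T_l o ... o T_1)(X) with log-sigmoid layers, cf. Eq. (16).
    Rows of X are data points, so each layer computes sigma(H W + B) (the transpose
    of the W_i Z + B_i convention in Eq. (16))."""
    H = X
    for W, B in zip(Ws, Bs):
        H = 1.0 / (1.0 + np.exp(-(H @ W + B)))
    return H

def mgp_nlml(theta, X, Y, sizes):
    """NLML of Eq. (14); theta stacks theta_M (weights, biases) and theta_G (log hyperparameters)."""
    Ws, Bs, i = [], [], 0
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):           # unpack theta_M
        Ws.append(theta[i:i + d_in * d_out].reshape(d_in, d_out)); i += d_in * d_out
        Bs.append(theta[i:i + d_out]); i += d_out
    Q = sizes[-1]                                             # unpack theta_G
    sigma_f2 = np.exp(theta[i])
    ell = np.exp(theta[i + 1:i + 1 + Q])
    sigma_w2 = np.exp(theta[-1])

    H = feature_map(X, Ws, Bs)                                # GP inputs are the learned features
    diff = (H[:, None, :] - H[None, :, :]) / ell
    K_tilde = sigma_f2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))   # SE-ARD on the feature space
    L = np.linalg.cholesky(K_tilde + sigma_w2 * np.eye(len(Y)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    return 0.5 * Y @ alpha + np.sum(np.log(np.diag(L)))

# Jointly optimize theta_mGP = [theta_M, theta_G] (finite-difference gradients for brevity).
sizes = [1, 6, 2]                                             # network structure [q_1 ... q_l]
n_M = sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))
n_G = 1 + sizes[-1] + 1                                       # log sigma_f^2, Q length-scales, log sigma_w^2
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
Y = np.sin(2 * X[:, 0]) + 0.05 * rng.standard_normal(60)
theta0 = np.concatenate([0.1 * rng.standard_normal(n_M), np.zeros(n_G)])
res = minimize(mgp_nlml, theta0, args=(X, Y, sizes), method="L-BFGS-B", options={"maxiter": 300})
print("final NLML per data point:", res.fun / len(Y))
```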
3. Experimental Results
To demonstrate the efficiency of our proposed approach, we apply the mGP to challenging benchmark problems and a real-world regression task. First, we demonstrate that mGPs can be successfully applied to learning discontinuous functions, a daunting undertaking with an off-the-shelf covariance function due to its underlying smoothness assumptions. Second, we evaluate mGPs on a function with multiple natural length-scales. Third, we assess mGPs on real data from a walking bipedal robot. The locomotion data set is highly challenging due to ground contacts, which cause the regression function to violate standard smoothness assumptions.
To evaluate the goodness of the different models on the training set, we consider the NLML previously introduced in Equations (7) and (14). Additionally, for the test set, we make use of the Negative Log Predictive Probability (NLPP)

−log p(y = y∗ | X, x∗, Y, θ) ,    (17)

where y∗ is the test target for the input x∗, and the predictive distribution is computed as in Equation (2) for the standard GP and Equation (11) for the mGP model.
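For a single test point, the NLPP of Equation (17) is the negative log density of the Gaussian predictive distribution, evaluated at the observed target. A minimal sketch (our own, assuming the predictive mean and variance have been computed as in Equations (2)-(4) or (11)-(13), and that the measurement noise variance is added to obtain the predictive variance of the noisy observation):

```python
import numpy as np

def nlpp(y_star, mu, var, sigma_w2):
    """Negative log predictive probability of a test target y_star, Eq. (17).
    mu, var: GP predictive mean and variance at x_star; sigma_w2: noise variance."""
    s2 = var + sigma_w2                      # predictive variance of the noisy observation
    return 0.5 * np.log(2 * np.pi * s2) + 0.5 * (y_star - mu) ** 2 / s2
```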
We compare our mGP approach with GPs using the SE-ARD and NN covariance functions, which implement the model in Figure 1a. Moreover, we evaluate two unsupervised feature extraction methods, Random Embeddings and PCA, followed by a GP with SE-ARD covariance function, which implements the model in Figure 1c.¹ For the model in Figure 1d, we consider two variants of the mGP with the log-sigmoid σ(x) = 1/(1 + e^{−x}) and the identity σ(x) = x transfer functions. These two transfer functions lead to a non-linear and a linear transformation M, respectively.
3.1. Step Function
In the following, we consider the step function

y = F(x) + w ,  w ∼ N(0, 0.01²) ,
F(x) = 0 if x ≤ 0,  1 if x > 0 .    (18)
For training, 100 input points are sampled from N(0, 1), while the test set is composed of 500 data points uniformly distributed between −5 and +5. The mGP uses a multi-layer neural network of [1-6-2] neurons (such that the feature space H ⊆ R²) for the mapping M and a standard SE-ARD covariance function for the GP regression G. Values of the NLML per data point for the training set and the NLPP per data point for the test set are reported in Table 1. In both performance measures, the mGP using a non-linear transformation outperforms the other models. An example of the resulting predictive mean and the 95% confidence bounds for three models is shown in Figure 2a. Due to the implicit assumptions employed by the SE-ARD and NN covariance functions on the mapping F, neither of them appropriately captures the discontinuous nature of the underlying function or its correct noise level. The GP model applied to the random embedding and the mGP (identity) perform similarly to a standard GP with SE-ARD covariance function, as their linear transformations do not substantially change the function. Compared to these models, the mGP (log-sigmoid) captures the discontinuities of the function better, thanks to its non-linear transformation, while the uncertainty remains small over the whole function's domain.
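The data for this toy experiment are straightforward to reproduce. The sketch below follows the protocol described above (100 training inputs from N(0, 1), observation noise with standard deviation 0.01, and 500 test inputs spread uniformly over [−5, 5], here placed on a regular grid); the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(42)

def step(x):
    """Step function of Eq. (18): 0 for x <= 0, 1 for x > 0."""
    return (x > 0).astype(float)

# 100 training inputs from N(0, 1), targets corrupted by noise with std 0.01.
X_train = rng.standard_normal((100, 1))
Y_train = step(X_train[:, 0]) + 0.01 * rng.standard_normal(100)

# 500 test inputs spread uniformly between -5 and +5 (here a regular grid).
X_test = np.linspace(-5, 5, 500).reshape(-1, 1)
Y_test = step(X_test[:, 0])
```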
Note that the mGP still assumes smoothness in the regression G, which requires the transformation M to take care of the discontinuity. This effect can be observed in Figure 2b, where an example of the 2D learned feature space H is shown. The discontinuity is already encoded in the feature space. Hence, it is easier for the GP to learn the mapping G. Learning the discontinuity in the feature space is a direct result of jointly training M and G, as feature learning is embedded in the overall regression F.
1. The random embedding is computed as the transformation H = αX, where the elements of α are randomly sampled from a normal distribution.

Figure 2: Step Function: (a) Predictive mean and 95% confidence bounds for a GP with SE-ARD covariance function (blue solid), a GP with NN covariance function (red dotted), and a log-sigmoid mGP (green dashed) on the step function of Equation (18). The discontinuity is captured better by an mGP than by a regular GP with either SE-ARD or NN covariance functions. (b) The 2D feature space H discovered by the non-linear mapping M as a function of the input X. The discontinuity of the modeled function is already captured by the non-linear mapping M. Hence, the mapping from feature space H to the output Y is smooth and can be easily managed by the GP.
Table 1: Step Function: Negative Log Marginal Likelihood (NLML) and Negative Log Predictive Probability (NLPP) per data point for the step function of Equation (18). The mGP (log-sigmoid) captures the nature of the underlying function better than a standard GP in both the training and test sets.

Method                 |  Training set               |  Test set
                       |  NLML     RMSE              |  NLPP            RMSE
GP SE-ARD              |  −0.68    1.00 × 10^−2      |  +0.50 × 10^−3   0.58
GP NN                  |  −1.49    0.57 × 10^−2      |  +0.02 × 10^−3   0.14
mGP (log-sigmoid)      |  −2.84    1.06 × 10^−2      |  −6.34 × 10^−3   0.02
mGP (identity)         |  −0.68    1.00 × 10^−2      |  +0.50 × 10^−3   0.58
RandEmb + GP SE-ARD    |  −0.77    5.26 × 10^−2      |  +0.51 × 10^−3   0.52
3.2. Multiple Length-Scales
In the following, we demonstrate that the mGP can be used to model functions that possess multiple intrinsic length-scales. For this purpose, we rotate the function

y = 1 − N(x_2 | 3, 0.5²) − N(x_2 | −3, 0.5²) + x_1/100    (19)

anti-clockwise by 45°. The intensity map of the resulting function is shown in Figure 3a. By itself (i.e., without rotating the function), Equation (19) is a fairly simple function. However, when rotated, the correlation between the covariates substantially complicates modeling. If we consider a horizontal slice of the rotated function, we can see how different spectral frequencies are present in the function, see Figure 3d. The presence of different frequencies is problematic for covariance functions, such as the SE-ARD, which assume a single frequency. When learning the hyperparameters, the length-scale needs to trade off the different frequencies. Typically, the hyperparameter optimization gives preference to shorter length-scales. However, such a trade-off greatly reduces the generalization capabilities of the model.
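To make the benchmark concrete, the sketch below evaluates Equation (19) and rotates the function anti-clockwise by 45° by evaluating it at clockwise-rotated inputs. This is our own reconstruction, not the authors' code: N(x | m, s²) is implemented as a Gaussian density, no observation noise is added, and the 2500-point test grid is assumed to be 50 × 50.

```python
import numpy as np

def gauss_pdf(x, mean, std):
    """Gaussian density N(x | mean, std^2)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def base_function(x1, x2):
    """Eq. (19): y = 1 - N(x2 | 3, 0.5^2) - N(x2 | -3, 0.5^2) + x1 / 100."""
    return 1.0 - gauss_pdf(x2, 3.0, 0.5) - gauss_pdf(x2, -3.0, 0.5) + x1 / 100.0

def rotated_function(x1, x2, angle_deg=45.0):
    """Evaluate the base function rotated anti-clockwise by angle_deg degrees."""
    a = np.deg2rad(angle_deg)
    # Rotating the function by +a corresponds to evaluating Eq. (19) at inputs rotated by -a.
    u = np.cos(a) * x1 + np.sin(a) * x2
    v = -np.sin(a) * x1 + np.cos(a) * x2
    return base_function(u, v)

# 400 random training points and a 50 x 50 regular test grid on [0, 10] x [0, 10].
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 10.0, size=(400, 2))
Y_train = rotated_function(X_train[:, 0], X_train[:, 1])
g1, g2 = np.meshgrid(np.linspace(0, 10, 50), np.linspace(0, 10, 50))
X_test = np.column_stack([g1.ravel(), g2.ravel()])
Y_test = rotated_function(X_test[:, 0], X_test[:, 1])
```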
We compare the performance of a standard GP using SE-ARD and NN covariance functions, random embeddings followed by a GP using the SE-ARD covariance function, and our proposed mGP. We train these models with 400 data points, randomly sampled from a uniform distribution in the intervals x_1 = [0, 10], x_2 = [0, 10]. As a test set, we use 2500 data points distributed on a regular grid in the same intervals. For the mGP with both the log-sigmoid and the identity transfer functions, we use a neural network of [2-10-3] neurons. The NLML and the NLPP per data point are shown in Table 2. The mGP outperforms all other methods evaluated. We believe that this is due to the mapping M, which transforms the input space so as to have a single natural frequency. Figure 3b shows the intensity map of the feature space after the mGP transformed the inputs using a neural network with the identity transfer function. Figure 3c shows the intensity map of the features when the log-sigmoid transfer function is used. Both transformations tend to make the feature space smoother compared to the initial input space. This effect is the result of the transformations, which aim to equalize the natural frequencies of the original function in order to capture them more efficiently with a single length-scale. The effects of these transformations are clearly visible in the spectrogram of the mGP (identity) in Figure 3e and of the mGP (log-sigmoid) in Figure 3f. The smaller support of the spectrum, obtained through the non-linear transformation performed by the mGP using the log-sigmoid transfer function, translates into superior prediction performance.
3.3. Bipedal Robot Locomotion
Modeling data from real robots can be challenging when
the robot has physical interactions with the environment.
Especially in bipedal locomotion, we lack good contact

References

Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science.
Tenenbaum, J. B., de Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science.
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. Philosophical Magazine.
Hyvärinen, A. and Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks.