Manifold Gaussian Processes for Regression

Roberto Calandra, Jan Peters, Carl Edward Rasmussen and Marc Peter Deisenroth§

Intelligent Autonomous Systems Lab, Technische Universität Darmstadt, Germany
Max Planck Institute for Intelligent Systems, Tübingen, Germany
Department of Engineering, University of Cambridge, United Kingdom
§ Department of Computing, Imperial College London, United Kingdom
Abstract—Off-the-shelf Gaussian Process (GP) covariance functions encode smoothness assumptions on the structure of the function to be modeled. To model complex and non-differentiable functions, these smoothness assumptions are often too restrictive. One way to alleviate this limitation is to find a different representation of the data by introducing a feature space. This feature space is often learned in an unsupervised way, which might lead to data representations that are not useful for the overall regression task. In this paper, we propose Manifold Gaussian Processes, a novel supervised method that jointly learns a transformation of the data into a feature space and a GP regression from the feature space to the observed space. The Manifold GP is a full GP and allows learning data representations that are useful for the overall regression task. As a proof of concept, we evaluate our approach on complex non-smooth functions where standard GPs perform poorly, such as step functions and robotics tasks with contacts.
1. Introduction
Gaussian Processes (GPs) are a powerful state-of-the-art nonparametric Bayesian regression method. The covariance function of a GP implicitly encodes high-level assumptions about the underlying function to be modeled, e.g., smoothness or periodicity. Hence, the choice of a suitable covariance function for a specific data set is crucial. A standard choice is the squared exponential (Gaussian) covariance function, which implies assumptions such as smoothness and stationarity. Although the squared exponential can be applied to a great range of problems, generic covariance functions may also be inadequate to model a variety of functions where the common smoothness assumptions are violated, such as ground contacts in robot locomotion.
Two common approaches can overcome the limitations of standard covariance functions. The first approach combines multiple standard covariance functions to form a new covariance function (Rasmussen and Williams, 2006; Wilson and Adams, 2013; Duvenaud et al., 2013). This approach allows relatively complex covariance functions to be designed automatically. However, the resulting covariance function is still limited by the properties of the combined covariance functions. The second approach is based on data transformation (or pre-processing), after which the data can be modeled with standard covariance functions. One way to implement this second approach is to transform the output space, as in the Warped GP (Snelson et al., 2004). An alternative is to transform the input space. Transforming the input space and subsequently applying GP regression with a standard covariance function is equivalent to GP regression with a new covariance function that explicitly depends on the transformation (MacKay, 1998). One example is the stationary periodic covariance function (MacKay, 1998; HajiGhassemi and Deisenroth, 2014), which effectively is the squared exponential covariance function applied to a complex representation of the input variables. Common transformations of the inputs include data normalization and dimensionality reduction, e.g., PCA (Pearson, 1901). Generally, these input transformations are good heuristics or optimize an unsupervised objective. However, they may be suboptimal for the overall regression task.
In this paper, we propose the Manifold Gaussian Process (mGP), which is based on MacKay's ideas to devise flexible covariance functions for GPs. Our GP model is equivalent to jointly learning a data transformation into a feature space followed by a GP regression with off-the-shelf covariance functions from feature space to observed space. The model profits from standard GP properties, such as a straightforward incorporation of a prior mean function and a faithful representation of model uncertainty.
Multiple related approaches in the literature attempt joint supervised learning of features and regression/classification. In Salakhutdinov and Hinton (2007), pre-training of the input transformation makes use of computationally expensive unsupervised learning that requires thousands of data points. Snoek et al. (2012) combined both unsupervised and supervised objectives for the optimization of an input transformation in a classification task. Unlike these approaches, the mGP is motivated by the need for stronger (i.e., supervised) guidance to discover suitable transformations for regression problems, while remaining within a Bayesian framework. Damianou and Lawrence (2013) proposed the Deep GP, which stacks multiple layers of GP-LVMs, similarly to a neural network. This model exhibits great flexibility in supervised and unsupervised settings, but the resulting model is not a full GP. Snelson and Ghahramani (2006) proposed a supervised dimensionality reduction by jointly learning a linear transformation of the input and a GP.

Figure 1: Different regression settings to learn the function F : X → Y. (a) Standard supervised regression. (b) Regression with an auxiliary latent space L that allows the task to be simplified. In a full Bayesian framework, L would be integrated out, which is analytically intractable. (c) Decomposition of the overall regression task F into discovering a feature space H using the map M and a subsequent (conditional) regression G|M. (d) Our mGP learns the mappings G and M jointly.
Snoek et al. (2014) transformed the input data using a Beta distribution whose parameters were learned jointly with the GP. However, the purpose of this transformation is to account for skewness in the data, while the mGP allows for a more general class of transformations.
2. Manifold Gaussian Processes
In the following, we review methods for regression, which may use latent or feature spaces. Then, we provide a brief introduction to Gaussian Process regression. Finally, we introduce the Manifold Gaussian Processes, our novel approach to jointly learning a regression model and a suitable feature representation of the data.
2.1. Regression with Learned Features
We assume N training inputs x_n ∈ X ⊆ R^D and respective outputs y_n ∈ Y ⊆ R, where y_n = F(x_n) + w, w ∼ N(0, σ_w²), n = 1, ..., N. The training data is denoted by X and Y for the inputs and targets, respectively. We consider the task of learning a regression function F : X → Y. The corresponding setting is given in Figure 1a. Discovering the regression function F is often challenging for nonlinear functions. A typical way to simplify and distribute the complexity of the regression problem is to introduce an auxiliary latent space L. The function F can then be decomposed into F = G ∘ M, where M : X → L and G : L → Y, as shown in Figure 1b. In a full Bayesian framework, the latent space L is integrated out to solve the regression task F, which is often analytically infeasible (Schmidt and O'Hagan, 2003).
A common approximation to the full Bayesian framework is to introduce a deterministic feature space H and to find the mappings M and G in two consecutive steps. First, M is determined by means of unsupervised feature learning. Second, the regression G is learned in a supervised manner as a conditional model G|M, see Figure 1c. The use of this feature space can reduce the complexity of the learning problem. For example, for complicated non-linear functions, a higher-dimensional (overcomplete) representation H allows learning a simpler mapping G : H → Y. For high-dimensional inputs, the data often lies on a lower-dimensional manifold H, e.g., due to non-discriminant or strongly correlated covariates. The lower-dimensional feature space H reduces the effect of the curse of dimensionality. In this paper, we focus on modeling complex functions with a relatively low-dimensional input space, which, nonetheless, cannot be well modeled by off-the-shelf GP covariance functions.
Typically, unsupervised feature learning methods determine the mapping M by optimizing an unsupervised objective, independent from the objective of the overall regression F. Examples of such unsupervised objectives are the minimization of the input reconstruction error (auto-encoders (Vincent et al., 2008)), maximization of the variance (PCA (Pearson, 1901)), maximization of statistical independence (ICA (Hyvärinen and Oja, 2000)), or the preservation of the distances between data points (isomap (Tenenbaum et al., 2000) or LLE (Roweis and Saul, 2000)). In the context of regression, an unsupervised approach to feature learning can be insufficient, as the learned data representation H might not suit the overall regression task F (Wahlström et al., 2015): unsupervised and supervised learning optimize different objectives, which do not necessarily match, e.g., minimizing the reconstruction error as unsupervised objective and maximizing the marginal likelihood as supervised objective. An approach where feature learning is performed in a supervised manner can instead guide learning the feature mapping M toward representations that are useful for the overall regression F = G ∘ M. This intuition is the key insight of our Manifold Gaussian Processes, where the feature mapping M and the GP G are learned jointly using the same supervised objective, as depicted in Figure 1d.
2.2. Gaussian Process Regression
GPs are a state-of-the-art probabilistic non-parametric regression method (Rasmussen and Williams, 2006). Such a GP is a distribution over functions

F ∼ GP(m, k)    (1)

and is fully defined by a mean function m (in our case, m ≡ 0) and a covariance function k. The GP predictive distribution at a test input x∗ is given by

p(F(x∗) | D, x∗) = N(μ(x∗), σ²(x∗)) ,    (2)
μ(x∗) = k∗^T (K + σ_w² I)^{-1} Y ,    (3)
σ²(x∗) = k∗∗ − k∗^T (K + σ_w² I)^{-1} k∗ ,    (4)

where D = {X, Y} is the training data, K is the kernel matrix with K_ij = k(x_i, x_j), k∗∗ = k(x∗, x∗), k∗ = k(X, x∗), and σ_w² is the measurement noise variance.
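Equations (2)-(4) translate directly into a few lines of linear algebra. The following NumPy sketch is our own illustration, not the authors' code: the covariance function k is passed in as a plain Python function, and a Cholesky decomposition replaces the explicit matrix inverse for numerical stability.

```python
import numpy as np

def gp_predict(X, Y, x_star, k, sigma_w2):
    """GP predictive mean and variance at a single test input x_star (Eqs. 2-4).

    X: (N, D) training inputs, Y: (N,) training targets,
    k: covariance function k(x_p, x_q) -> float, sigma_w2: noise variance.
    """
    N = X.shape[0]
    K = np.array([[k(X[i], X[j]) for j in range(N)] for i in range(N)])  # K_ij = k(x_i, x_j)
    k_star = np.array([k(X[i], x_star) for i in range(N)])               # k_* = k(X, x_*)
    k_ss = k(x_star, x_star)                                             # k_**

    # Solve with (K + sigma_w^2 I) via Cholesky instead of forming the inverse.
    L = np.linalg.cholesky(K + sigma_w2 * np.eye(N))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))   # (K + sigma_w^2 I)^{-1} Y
    v = np.linalg.solve(L, k_star)

    mu = k_star @ alpha               # Eq. (3)
    var = k_ss - v @ v                # Eq. (4)
    return mu, var
```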
In our experiments, we use different covariance functions k. Specifically, we use the squared exponential covariance function with Automatic Relevance Determination (ARD)

k_SE(x_p, x_q) = σ_f² exp(−½ (x_p − x_q)^T Λ^{-1} (x_p − x_q)) ,    (5)

with Λ = diag([l_1², ..., l_D²]), where l_i are the characteristic length-scales and σ_f² is the variance of the latent function F.
Furthermore, we use the neural network covariance function

k_NN(x_p, x_q) = σ_f² sin^{-1}( x_p^T P x_q / sqrt((1 + x_p^T P x_p)(1 + x_q^T P x_q)) ) ,    (6)

where P is a weight matrix.
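Written out in code, the two covariance functions of Equations (5) and (6) are one-liners. The sketch below is our own illustration (the argument names sigma_f2, lengthscales, and P are ours, not from the paper); P is assumed positive semi-definite so that the arcsine argument stays within [−1, 1].

```python
import numpy as np

def k_se_ard(x_p, x_q, sigma_f2, lengthscales):
    """Squared exponential covariance with ARD, Eq. (5); Lambda = diag(lengthscales**2)."""
    d = (x_p - x_q) / lengthscales          # Lambda^{-1/2} (x_p - x_q)
    return sigma_f2 * np.exp(-0.5 * d @ d)

def k_nn(x_p, x_q, sigma_f2, P):
    """Neural network covariance, Eq. (6); P is a positive semi-definite weight matrix."""
    num = x_p @ P @ x_q
    den = np.sqrt((1.0 + x_p @ P @ x_p) * (1.0 + x_q @ P @ x_q))
    return sigma_f2 * np.arcsin(num / den)
```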
Each covariance function possesses various hyperparameters θ to be selected. This selection is performed by minimizing the Negative Log Marginal Likelihood (NLML)

NLML(θ) = −log p(Y | X, θ) ≐ ½ Y^T (K_θ + σ_w² I)^{-1} Y + ½ log |K_θ + σ_w² I| .    (7)

Using the chain rule, the corresponding gradient can be computed analytically as

∂NLML(θ)/∂θ = ∂NLML(θ)/∂K_θ · ∂K_θ/∂θ ,    (8)

which allows us to optimize the hyperparameters using Quasi-Newton optimization, e.g., L-BFGS (Liu and Nocedal, 1989).
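As an illustration of Equation (7) and the quasi-Newton optimization mentioned above, the following sketch evaluates the NLML for an SE-ARD kernel and minimizes it with SciPy's L-BFGS-B. It is a simplified stand-in for the described procedure, not the authors' implementation: the hyperparameters are optimized in log space, the analytic gradient of Equation (8) is replaced by finite differences, and the toy data is arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def nlml(log_theta, X, Y):
    """Negative log marginal likelihood, Eq. (7), up to the constant N/2 log(2*pi).
    log_theta = [log sigma_f^2, log l_1, ..., log l_D, log sigma_w^2]."""
    N, D = X.shape
    sigma_f2 = np.exp(log_theta[0])
    ell = np.exp(log_theta[1:1 + D])
    sigma_w2 = np.exp(log_theta[-1])

    diff = (X[:, None, :] - X[None, :, :]) / ell          # pairwise scaled differences
    K = sigma_f2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))
    L = np.linalg.cholesky(K + sigma_w2 * np.eye(N))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    return 0.5 * Y @ alpha + np.sum(np.log(np.diag(L)))   # 0.5*log|K + s2*I| = sum(log diag(L))

# Example: fit the hyperparameters on toy data (finite-difference gradients).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
theta0 = np.zeros(1 + X.shape[1] + 1)
result = minimize(nlml, theta0, args=(X, Y), method="L-BFGS-B")
print("optimized log-hyperparameters:", result.x)
```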
2.3. Manifold Gaussian Processes
In this section, we describe the mGP model and its parameters θ_mGP, and relate it to standard GP regression. Furthermore, we detail training and prediction with the mGP.
2.3.1. Model. As shown in Figure 1d, the mGP considers the overall regression as a composition of functions

F = G ∘ M .    (9)

The two functions M and G are learned jointly to accomplish the overall regression objective, i.e., the marginal likelihood in Equation (7). In this paper, we assume that M is a deterministic, parametrized function that maps the input space X into the feature space H ⊆ R^Q, which serves as the domain for the GP regression G : H → Y. Performing this transformation of the input data corresponds to training a GP G having H = M(X) as inputs. Therefore, the mGP is equivalent to a GP for a function F : X → Y with a covariance function k̃ defined as

k̃(x_p, x_q) = k(M(x_p), M(x_q)) ,    (10)

i.e., the kernel operates on the Q-dimensional feature space H = M(X). According to MacKay (1998), a function defined as in Equation (10) is a valid covariance function and, therefore, the mGP is a valid GP.
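Equation (10) amounts to composing an ordinary covariance function with the feature map. A minimal sketch of this construction (our own illustration; M stands for any deterministic feature map, and k_se_ard refers to the kernel sketch above):

```python
def manifold_kernel(k, M):
    """Return the mGP covariance k_tilde(x_p, x_q) = k(M(x_p), M(x_q)) of Eq. (10)."""
    def k_tilde(x_p, x_q):
        return k(M(x_p), M(x_q))
    return k_tilde

# Example: wrap an SE-ARD kernel around a (hypothetical) feature map M.
# k_tilde = manifold_kernel(lambda h_p, h_q: k_se_ard(h_p, h_q, sigma_f2, lengthscales), M)
```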
The predictive distribution for the mGP at a test input x∗ can then be derived from the predictive distribution of a standard GP in Equation (2) as

p(F(x∗) | D, x∗) = p((G ∘ M)(x∗) | D, x∗) = N(μ(M(x∗)), σ²(M(x∗))) ,    (11)
μ(M(x∗)) = k̃∗^T (K̃ + σ_w² I)^{-1} Y ,    (12)
σ²(M(x∗)) = k̃∗∗ − k̃∗^T (K̃ + σ_w² I)^{-1} k̃∗ ,    (13)

where K̃ is the kernel matrix constructed as K̃_ij = k̃(x_i, x_j), k̃∗∗ = k̃(x∗, x∗), k̃∗ = k̃(X, x∗), and k̃ is the covariance function from Equation (10). In our experiments, we used the squared exponential covariance function from Equation (5) for the kernel k in Equation (10).
2.3.2. Training. We train the mGP by jointly optimizing the parameters θ_M of the transformation M and the GP hyperparameters θ_G. For learning the parameters θ_mGP = [θ_M, θ_G], we minimize the NLML as in standard GP regression. Considering the composition of the mapping F = G ∘ M, the NLML becomes

NLML(θ_mGP) = −log p(Y | X, θ_mGP) ≐ ½ Y^T (K̃_{θ_mGP} + σ_w² I)^{-1} Y + ½ log |K̃_{θ_mGP} + σ_w² I| .    (14)
Note that K̃_{θ_mGP} depends on both θ_G and θ_M, unlike K_θ from Equation (7), which depends only on θ_G. The analytic gradients ∂NLML/∂θ_G of the objective in Equation (14) with respect to the parameters θ_G are computed as in the standard GP, i.e.,

∂NLML(θ_mGP)/∂θ_G = ∂NLML(θ_mGP)/∂K̃_{θ_mGP} · ∂K̃_{θ_mGP}/∂θ_G .
The gradients of the parameters θ_M of the feature mapping are computed by applying the chain rule

∂NLML(θ_mGP)/∂θ_M = ∂NLML(θ_mGP)/∂K̃_{θ_mGP} · ∂K̃_{θ_mGP}/∂H · ∂H/∂θ_M ,    (15)

where only ∂H/∂θ_M depends on the chosen input transformation M, while ∂K̃_{θ_mGP}/∂H is the gradient of the kernel matrix with respect to the Q-dimensional GP training inputs H = M(X). Similarly to the standard GP, the parameters θ_mGP in the mGP can be obtained using off-the-shelf optimization methods.
2.3.3. Input Transformation. Our approach can use any deterministic parametric data transformation M. We focus on multi-layer neural networks and define their structure as [q_1 ... q_l], where l is the number of layers and q_i is the number of neurons of the i-th layer. Each layer i = 1, ..., l of the neural network performs the transformation

T_i(Z) = σ(W_i Z + B_i) ,    (16)

where Z is the input of the layer, σ is the transfer function, and W_i and B_i are the weights and the bias of the layer, respectively. Therefore, the input transformation M of Equation (9) is M(X) = (T_l ∘ ... ∘ T_1)(X). The parameters θ_M of the neural network M are the weights and biases of the whole network, so that θ_M = [W_1, B_1, ..., W_l, B_l]. The gradients ∂H/∂θ_M in Equation (15) are computed by repeated application of the chain rule (backpropagation).
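Putting the training objective of Section 2.3.2 and the network transformation above together, the following sketch packs θ_M (weights and biases) and θ_G (log GP hyperparameters) into a single vector θ_mGP and minimizes the NLML of Equation (14) with L-BFGS. It is only an illustration, not the authors' code: the analytic chain-rule gradients (Equation (15) and its θ_G counterpart) are replaced by finite differences, the log-sigmoid transfer function anticipates a choice made in the experiments below, the helper names are ours, and the toy data is arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def feature_map(X, Ws, Bs):
    """M(X) = (T_l o ... o T_1)(X) with log-sigmoid layers, cf. Eq. (16).
    Rows of X are data points, so each layer computes sigma(H W + B) (the transpose
    of the W_i Z + B_i convention in Eq. (16))."""
    H = X
    for W, B in zip(Ws, Bs):
        H = 1.0 / (1.0 + np.exp(-(H @ W + B)))
    return H

def mgp_nlml(theta, X, Y, sizes):
    """NLML of Eq. (14); theta stacks theta_M (weights, biases) and theta_G (log hyperparameters)."""
    Ws, Bs, i = [], [], 0
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):           # unpack theta_M
        Ws.append(theta[i:i + d_in * d_out].reshape(d_in, d_out)); i += d_in * d_out
        Bs.append(theta[i:i + d_out]); i += d_out
    Q = sizes[-1]                                             # unpack theta_G
    sigma_f2 = np.exp(theta[i])
    ell = np.exp(theta[i + 1:i + 1 + Q])
    sigma_w2 = np.exp(theta[-1])

    H = feature_map(X, Ws, Bs)                                # GP inputs are the learned features
    diff = (H[:, None, :] - H[None, :, :]) / ell
    K_tilde = sigma_f2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))   # SE-ARD on the feature space
    L = np.linalg.cholesky(K_tilde + sigma_w2 * np.eye(len(Y)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    return 0.5 * Y @ alpha + np.sum(np.log(np.diag(L)))

# Jointly optimize theta_mGP = [theta_M, theta_G] (finite-difference gradients for brevity).
sizes = [1, 6, 2]                                             # network structure [q_1 ... q_l]
n_M = sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))
n_G = 1 + sizes[-1] + 1                                       # log sigma_f^2, Q length-scales, log sigma_w^2
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
Y = np.sin(2 * X[:, 0]) + 0.05 * rng.standard_normal(60)
theta0 = np.concatenate([0.1 * rng.standard_normal(n_M), np.zeros(n_G)])
res = minimize(mgp_nlml, theta0, args=(X, Y, sizes), method="L-BFGS-B", options={"maxiter": 300})
print("final NLML per data point:", res.fun / len(Y))
```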
3. Experimental Results
To demonstrate the efficiency of our proposed approach, we apply the mGP to challenging benchmark problems and a real-world regression task. First, we demonstrate that mGPs can be successfully applied to learning discontinuous functions, a daunting undertaking with an off-the-shelf covariance function due to its underlying smoothness assumptions. Second, we evaluate mGPs on a function with multiple natural length-scales. Third, we assess mGPs on real data from a walking bipedal robot. The locomotion data set is highly challenging due to ground contacts, which cause the regression function to violate standard smoothness assumptions.
To evaluate the goodness of the different models on the training set, we consider the NLML previously introduced in Equations (7) and (14). Additionally, for the test set, we make use of the Negative Log Predictive Probability (NLPP)

−log p(y = y∗ | X, x∗, Y, θ) ,    (17)

where y∗ is the test target for the input x∗, and the predictive distribution is computed as in Equation (2) for the standard GP and Equation (11) for the mGP model.
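For a single test point, the NLPP of Equation (17) is the negative log density of the Gaussian predictive distribution, evaluated at the observed target. A minimal sketch (our own, assuming the predictive mean and variance have been computed as in Equations (2)-(4) or (11)-(13), and that the measurement noise variance is added to obtain the predictive variance of the noisy observation):

```python
import numpy as np

def nlpp(y_star, mu, var, sigma_w2):
    """Negative log predictive probability of a test target y_star, Eq. (17).
    mu, var: GP predictive mean and variance at x_star; sigma_w2: noise variance."""
    s2 = var + sigma_w2                      # predictive variance of the noisy observation
    return 0.5 * np.log(2 * np.pi * s2) + 0.5 * (y_star - mu) ** 2 / s2
```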
We compare our mGP approach with GPs using the SE-ARD and NN covariance functions, which implement the model in Figure 1a. Moreover, we evaluate two unsupervised feature extraction methods, Random Embeddings and PCA, followed by a GP with SE-ARD covariance function, which implements the model in Figure 1c.¹ For the model in Figure 1d, we consider two variants of the mGP with the log-sigmoid σ(x) = 1/(1 + e^{−x}) and the identity σ(x) = x transfer functions. These two transfer functions lead to a non-linear and a linear transformation M, respectively.
3.1. Step Function
In the following, we consider the step function

y = F(x) + w ,  w ∼ N(0, 0.01²) ,
F(x) = 0 if x ≤ 0,  1 if x > 0 .    (18)
For training, 100 input points are sampled from N(0, 1), while the test set is composed of 500 data points uniformly distributed between −5 and +5. The mGP uses a multi-layer neural network of [1-6-2] neurons (such that the feature space H ⊆ R²) for the mapping M and a standard SE-ARD covariance function for the GP regression G. Values of the NLML per data point for the training set and the NLPP per data point for the test set are reported in Table 1. In both performance measures, the mGP using a non-linear transformation outperforms the other models. An example of the resulting predictive mean and the 95% confidence bounds for three models is shown in Figure 2a. Due to the implicit assumptions employed by the SE-ARD and NN covariance functions on the mapping F, neither of them appropriately captures the discontinuous nature of the underlying function or its correct noise level. The GP model applied to the random embedding and the mGP (identity) perform similarly to a standard GP with SE-ARD covariance function, as their linear transformations do not substantially change the function. Compared to these models, the mGP (log-sigmoid) captures the discontinuities of the function better, thanks to its non-linear transformation, while the uncertainty remains small over the whole function's domain.
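The data for this toy experiment are straightforward to reproduce. The sketch below follows the protocol described above (100 training inputs from N(0, 1), observation noise with standard deviation 0.01, and 500 test inputs spread uniformly over [−5, 5], here placed on a regular grid); the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(42)

def step(x):
    """Step function of Eq. (18): 0 for x <= 0, 1 for x > 0."""
    return (x > 0).astype(float)

# 100 training inputs from N(0, 1), targets corrupted by noise with std 0.01.
X_train = rng.standard_normal((100, 1))
Y_train = step(X_train[:, 0]) + 0.01 * rng.standard_normal(100)

# 500 test inputs spread uniformly between -5 and +5 (here a regular grid).
X_test = np.linspace(-5, 5, 500).reshape(-1, 1)
Y_test = step(X_test[:, 0])
```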
Note that the mGP still assumes smoothness in the regression G, which requires the transformation M to take care of the discontinuity. This effect can be observed in Figure 2b, where an example of the 2D learned feature space H is shown. The discontinuity is already encoded in the feature space. Hence, it is easier for the GP to learn the mapping G. Learning the discontinuity in the feature space is a direct result of jointly training M and G, as feature learning is embedded in the overall regression F.
1. The random embedding is computed as the transformation H = αX, where the elements of α are randomly sampled from a normal distribution.

Figure 2: Step Function: (a) Predictive mean and 95% confidence bounds for a GP with SE-ARD covariance function (blue solid), a GP with NN covariance function (red dotted), and a log-sigmoid mGP (green dashed) on the step function of Equation (18). The discontinuity is captured better by an mGP than by a regular GP with either SE-ARD or NN covariance functions. (b) The 2D feature space H discovered by the non-linear mapping M as a function of the input X. The discontinuity of the modeled function is already captured by the non-linear mapping M. Hence, the mapping from feature space H to the output Y is smooth and can be easily managed by the GP.
Table 1: Step Function: Negative Log Marginal Likelihood (NLML) and Negative Log Predictive Probability (NLPP) per data point for the step function of Equation (18). The mGP (log-sigmoid) captures the nature of the underlying function better than a standard GP in both the training and test sets.

Method                 |  Training set               |  Test set
                       |  NLML     RMSE              |  NLPP            RMSE
GP SE-ARD              |  −0.68    1.00 × 10^−2      |  +0.50 × 10^−3   0.58
GP NN                  |  −1.49    0.57 × 10^−2      |  +0.02 × 10^−3   0.14
mGP (log-sigmoid)      |  −2.84    1.06 × 10^−2      |  −6.34 × 10^−3   0.02
mGP (identity)         |  −0.68    1.00 × 10^−2      |  +0.50 × 10^−3   0.58
RandEmb + GP SE-ARD    |  −0.77    5.26 × 10^−2      |  +0.51 × 10^−3   0.52
3.2. Multiple Length-Scales
In the following, we demonstrate that the mGP can be used to model functions that possess multiple intrinsic length-scales. For this purpose, we rotate the function

y = 1 − N(x_2 | 3, 0.5²) − N(x_2 | −3, 0.5²) + x_1/100    (19)

anti-clockwise by 45°. The intensity map of the resulting function is shown in Figure 3a. By itself (i.e., without rotating the function), Equation (19) is a fairly simple function. However, when rotated, the correlation between the covariates substantially complicates modeling. If we consider a horizontal slice of the rotated function, we can see how different spectral frequencies are present in the function, see Figure 3d. The presence of different frequencies is problematic for covariance functions, such as the SE-ARD, which assume a single frequency. When learning the hyperparameters, the length-scale needs to trade off the different frequencies. Typically, the hyperparameter optimization gives preference to shorter length-scales. However, such a trade-off greatly reduces the generalization capabilities of the model.
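To make the benchmark concrete, the sketch below evaluates Equation (19) and rotates the function anti-clockwise by 45° by evaluating it at clockwise-rotated inputs. This is our own reconstruction, not the authors' code: N(x | m, s²) is implemented as a Gaussian density, no observation noise is added, and the 2500-point test grid is assumed to be 50 × 50.

```python
import numpy as np

def gauss_pdf(x, mean, std):
    """Gaussian density N(x | mean, std^2)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def base_function(x1, x2):
    """Eq. (19): y = 1 - N(x2 | 3, 0.5^2) - N(x2 | -3, 0.5^2) + x1 / 100."""
    return 1.0 - gauss_pdf(x2, 3.0, 0.5) - gauss_pdf(x2, -3.0, 0.5) + x1 / 100.0

def rotated_function(x1, x2, angle_deg=45.0):
    """Evaluate the base function rotated anti-clockwise by angle_deg degrees."""
    a = np.deg2rad(angle_deg)
    # Rotating the function by +a corresponds to evaluating Eq. (19) at inputs rotated by -a.
    u = np.cos(a) * x1 + np.sin(a) * x2
    v = -np.sin(a) * x1 + np.cos(a) * x2
    return base_function(u, v)

# 400 random training points and a 50 x 50 regular test grid on [0, 10] x [0, 10].
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 10.0, size=(400, 2))
Y_train = rotated_function(X_train[:, 0], X_train[:, 1])
g1, g2 = np.meshgrid(np.linspace(0, 10, 50), np.linspace(0, 10, 50))
X_test = np.column_stack([g1.ravel(), g2.ravel()])
Y_test = rotated_function(X_test[:, 0], X_test[:, 1])
```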
We compare the performance of a standard GP using SE-ARD and NN covariance functions, random embeddings followed by a GP using the SE-ARD covariance function, and our proposed mGP. We train these models with 400 data points, randomly sampled from a uniform distribution in the intervals x_1 = [0, 10], x_2 = [0, 10]. As a test set, we use 2500 data points distributed on a regular grid in the same intervals. For the mGP with both the log-sigmoid and the identity transfer functions, we use a neural network of [2-10-3] neurons. The NLML and the NLPP per data point are shown in Table 2. The mGP outperforms all other methods evaluated. We believe that this is due to the mapping M, which transforms the input space so as to have a single natural frequency. Figure 3b shows the intensity map of the feature space after the mGP transformed the inputs using a neural network with the identity transfer function. Figure 3c shows the intensity map of the features when the log-sigmoid transfer function is used. Both transformations tend to make the feature space smoother compared to the initial input space. This effect is the result of the transformations, which aim to equalize the natural frequencies of the original function in order to capture them more efficiently with a single length-scale. The effects of these transformations are clearly visible in the spectrogram of the mGP (identity) in Figure 3e and of the mGP (log-sigmoid) in Figure 3f. The smaller support of the spectrum, obtained through the non-linear transformation performed by the mGP using the log-sigmoid transfer function, translates into superior prediction performance.
3.3. Bipedal Robot Locomotion
Modeling data from real robots can be challenging when
the robot has physical interactions with the environment.
Especially in bipedal locomotion, we lack good contact

References

Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science.
Tenenbaum, J. B., de Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science.
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. Philosophical Magazine.
Hyvärinen, A. and Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks.