Proceedings ArticleDOI

Episodic Learning with Control Lyapunov Functions for Uncertain Robotic Systems

TL;DR: A machine learning framework centered around Control Lyapunov Functions adapts to parametric uncertainty and unmodeled dynamics in general robotic systems and yields a stabilizing quadratic program model-based controller.
Abstract: Many modern nonlinear control methods aim to endow systems with guaranteed properties, such as stability or safety, and have been successfully applied to the domain of robotics. However, model uncertainty remains a persistent challenge, weakening theoretical guarantees and causing implementation failures on physical systems. This paper develops a machine learning framework centered around Control Lyapunov Functions (CLFs) to adapt to parametric uncertainty and unmodeled dynamics in general robotic systems. Our proposed method proceeds by iteratively updating estimates of Lyapunov function derivatives and improving controllers, ultimately yielding a stabilizing quadratic program model-based controller. We validate our approach on a planar Segway simulation, demonstrating substantial performance improvements by iteratively refining on a base model-free controller.

Summary (2 min read)

Introduction

  • The authors instead constructively prescribe a CLF, and focus on learning only the necessary information to choose control inputs that achieve the associated stability guarantees, which can be much lower-dimensional.
  • In particular, exhaustive data collection typically scales exponentially with dimensionality of the joint state and control output space, and so should be avoided.
  • The authors also provide a Python software package implementing their experiments and learning framework.

II. PRELIMINARIES ON CLFS

  • This section provides a brief review of input-output feedback linearization, a control technique which can be used to synthesize a CLF.
  • The resulting CLF will be used to quantify the impact of model uncertainty and specify the learning problem outlined in Section III.
  • Input-Output Linearization is a nonlinear control method that creates stable linear dynamics for a selected set of outputs of a system [21].
  • This implies the desired output trajectory yd is exponentially stable.
  • This conclusion allows us to construct a Lyapunov function for the system using converse theorems found in [21].

B. Control Lyapunov Functions

  • The preceding formulation of a Lyapunov function required the choice of the specific control law given in (6).
  • For optimality purposes, it may be desirable to choose a different control input for the system, thus motivating the following definition.
  • The authors see that the previously constructed Lyapunov function satisfying (10) satisfies (11) by choosing the control input specified in (6).
  • Information about the dynamics is encoded within the scalar function V̇, offering a reduction in dimensionality which will become relevant later in learning.
  • Here S^m_+ denotes the set of m × m symmetric positive semi-definite matrices.

A. Uncertainty Modeling Assumptions

  • As defined in Section II, the authors consider affine robotic control systems that evolve under dynamics described by (1).
  • The authors assume the estimated model (14) satisfies the relative degree condition on the domain R, and thus may use the method of feedback linearization to produce a Control Lyapunov Function (CLF), V , for the system.
  • This holds since the true values of f̃ and g̃, if known, enable choosing control inputs as in (6) that respect the same linear output dynamics (8).
  • Instead of learning the unknown dynamics terms A and b, which scale with both the dimension of the configuration space and the number of inputs, the authors will learn the terms a and b, which scale only with the number of inputs.

B. Motivating a Data-Driven Learning Approach

  • The formulation from (15) and (16) defines a general class of dynamics uncertainty.
  • To motivate their learning-based framework, first consider a simple approach of learning a and b via supervised regression [19]: the authors operate the system using some given state-feedback controller to gather data points along the system’s evolution and learn a function that approximates a and b via supervised learning.
  • An experiment is defined as the evolution of the system over a finite time interval from the initial condition (q0,0) using a discrete-time implementation of the given controller.
  • As a consequence, standard supervised learning with sequential, non-i.i.d data collection often leads to error cascades [24].

A. Episodic Learning Framework

  • Episodic learning refers to learning procedures that iteratively alternate between executing an intermediate controller (also known as a roll-out in reinforcement learning [22]), collecting data from that roll-out, and designing a new controller using the newly collected data.
  • The data set is aggregated and a new ERM problem is solved after each episode.
  • Such exploration can be achieved by randomly perturbing the controller used in an experiment at each time step.
  • Algorithm 1 specifies a method of computing a sequence of Lyapunov function derivative estimates and augmenting controllers.
  • The trust coefficients form a monotonically nondecreasing sequence on the interval [0, 1].

B. Additional Controller Details

  • This is done to avoid chatter that may arise from the optimization based nature of the CLF-QP formulation [27].
  • Note that for this choice of Lyapunov function, the gradient ∂V/∂η, and therefore a, approach 0 as η approaches 0, which occurs close to the desired trajectory.
  • Such relative error causes the optimization problem in (20) to be poorly conditioned near the desired trajectory.
  • As states approach the trajectory, the coefficient of the quadratic term decreases and enables relaxation of the exponential stability inequality constraint.
  • The exploratory control during experiments is naively chosen as additive noise from a centered uniform distribution, with each coordinate drawn i.i.d.

V. APPLICATION ON SEGWAY PLATFORM

  • In this section the authors apply the episodic learning algorithm constructed in Section IV to the Segway platform.
  • The authors seek to track a pitch angle trajectory2 generated for the estimated model.
  • The baseline PD controller and the augmented controller after 20 experiments can be seen in the right portion of Fig. 3.
  • The mean trajectory consistently improves in these later episodes as the trust factor increases.
  • The variation increases but remains small, indicating that the learning problem is robust to randomness in the initialization of the neural networks, in the network training algorithm, and in the noise added during the experiments. (The trajectory was generated using the GPOPS-II optimal control software; models were implemented in Keras.)


Episodic Learning with Control Lyapunov Functions for Uncertain Robotic Systems*

Andrew J. Taylor¹, Victor D. Dorobantu¹, Hoang M. Le, Yisong Yue, Aaron D. Ames
Abstract— Many modern nonlinear control methods aim to endow systems with guaranteed properties, such as stability or safety, and have been successfully applied to the domain of robotics. However, model uncertainty remains a persistent challenge, weakening theoretical guarantees and causing implementation failures on physical systems. This paper develops a machine learning framework centered around Control Lyapunov Functions (CLFs) to adapt to parametric uncertainty and unmodeled dynamics in general robotic systems. Our proposed method proceeds by iteratively updating estimates of Lyapunov function derivatives and improving controllers, ultimately yielding a stabilizing quadratic program model-based controller. We validate our approach on a planar Segway simulation, demonstrating substantial performance improvements by iteratively refining on a base model-free controller.
I. INTRODUCTION
The use of Control Lyapunov Functions (CLFs) [5], [38]
for nonlinear control of robotic systems is becoming increasingly popular [26], [17], [29], often utilizing quadratic
program controllers [3], [2], [17]. While effective, one major
challenge is the need for extensive tuning, which is largely
due to modeling deficiencies such as parametric error and
unmodeled dynamics (cf. [26]). While there has been much
research in developing robust control methods that maintain
stability under uncertainty (e.g., via input-to-state stability
[39]) or in adapting to limited forms of uncertainty (e.g.,
adaptive control [23],[20]), relatively little work has been
done on systematically reducing uncertainty while maintaining stability for general function classes of models.
We take a machine learning approach to address the above
limitations. Learning-based approaches have already shown
great promise for controlling imperfectly modeled robotic
platforms [22], [35]. Successful learning-based approaches
have typically focused on learning model-based uncertainty
[6], [9], [8], [37], or direct model-free controller design [25],
[36], [14], [42], [24].
We are particularly interested in learning-based approaches
that guarantee Lyapunov stability [21]. From that perspective,
the bulk of previous work has focused on using learning to
construct a Lyapunov function [31], [12], [30], or to assess
the region of attraction for a Lyapunov function [10], [7].
*This work was supported by Google Brain Robotics and DARPA Award HR00111890035.
¹Both authors contributed equally.
All authors are with the Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA. ajtaylor@caltech.edu, vdoroban@caltech.edu, hmle@caltech.edu, yyue@caltech.edu, ames@caltech.edu
Fig. 1. CAD model & physical system, a modified Ninebot Segway.
One limitation of previous work is that the learning is conducted
over the full-dimensional state space, which can be data
inefficient. We instead constructively prescribe a CLF, and
focus on learning only the necessary information to choose
control inputs that achieve the associated stability guarantees,
which can be much lower-dimensional.
One challenge in developing learning-based methods for
controller improvement is how best to collect training data
that accurately reflects the desired operating environment
and control goals. In particular, exhaustive data collection
typically scales exponentially with dimensionality of the joint
state and control output space, and so should be avoided. But
pre-collecting data upfront can lead to poor performance
as downstream control behavior may enter states that are not
present in the pre-collected training data. We will leverage
episodic learning approaches such as Dataset Aggregation
(DAgger) [33] to address these challenges in a data-efficient
manner, leading to iteratively refined controllers.
In this paper we present a novel episodic learning approach
that utilizes CLFs to iteratively improve controller design
while maintaining stability. To the best of our knowledge,
our approach is the first that integrates CLFs and general
supervised learning (e.g., including deep learning) in a
mathematically integrated way. Another distinctive aspect is
that our approach performs learning on the projection of state
dynamics onto the CLF time derivative, which can be much
lower dimensional than learning the full state dynamics or
the region of attraction.
Our paper is organized as follows. Section II presents a
review of input-output feedback linearization with a focus
on constructing CLFs for unconstrained robotic systems.
Section III discusses model uncertainty of a general robotic

system and establishes assumptions on the structure of this
uncertainty. These assumptions allow us to prescribe a CLF
for the true system, but leave open the question of how to
model its time derivative. Section IV provides an episodic
learning approach to iteratively improving a model of the
time derivative of the CLF. We also present a variant of
optimal CLF-based control that integrates the learned representation. Finally, Section V provides simulation results
on a model of a modified Ninebot by Segway E+, seen in
Fig. 1. We also provide a Python software package (LyaPy) implementing our experiments and learning framework (https://github.com/vdorobantu/lyapy).
II. PRELIMINARIES ON CLFS
This section provides a brief review of input-output feedback linearization, a control technique which can be used to synthesize a CLF. The resulting CLF will be used to quantify the impact of model uncertainty and specify the learning problem outlined in Section III.
A. Input-Output Linearization
Input-Output Linearization is a nonlinear control method that creates stable linear dynamics for a selected set of outputs of a system [21]. The relevance of Input-Output Linearization is that it provides a constructive way to generate Lyapunov functions for the class of affine robotic control systems. Consider a configuration space Q ⊆ R^n and an input space U ⊆ R^m. Assume Q is path-connected and non-empty. Consider a control system specified by:

$$ D(q)\ddot{q} + \underbrace{C(q,\dot{q})\dot{q} + G(q)}_{H(q,\dot{q})} = Bu, \qquad (1) $$

with generalized coordinates q ∈ Q, coordinate rates q̇ ∈ R^n, input u ∈ U, inertia matrix D : Q → S^n_{++}, centrifugal and Coriolis terms C : Q × R^n → R^{n×n}, gravitational forces G : Q → R^n, and static actuation matrix B ∈ R^{n×m}. Here S^n_{++} denotes the set of n × n symmetric positive definite matrices. Define twice-differentiable outputs y : Q → R^k, with k ≤ m, and assume each output has relative degree 2 on some domain R ⊆ Q (see [34] for more details).
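To make the structure of (1) concrete, the following minimal Python sketch encodes a single-link pendulum driven by a torque input in the manipulator form D(q)q̈ + C(q,q̇)q̇ + G(q) = Bu. The specific system, parameter values, and function names are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Illustrative single-link pendulum in manipulator form (assumed example):
#   D(q) q_dd + C(q, q_dot) q_dot + G(q) = B u,  with n = 1, m = 1.
m, l, grav = 1.0, 0.5, 9.81  # mass [kg], length [m], gravity [m/s^2]

def D(q):            # inertia matrix, 1x1 and positive definite
    return np.array([[m * l**2]])

def C(q, q_dot):     # centrifugal/Coriolis terms (zero for a single link)
    return np.zeros((1, 1))

def G(q):            # gravitational forcing
    return np.array([m * grav * l * np.sin(q[0])])

B = np.array([[1.0]])  # static actuation matrix

def q_ddot(q, q_dot, u):
    """Solve (1) for the acceleration: q_dd = D^{-1}(B u - C q_dot - G)."""
    H = C(q, q_dot) @ q_dot + G(q)          # H(q, q_dot) from (1)
    return np.linalg.solve(D(q), B @ u - H)

# Example: acceleration near the upright configuration with zero input.
print(q_ddot(np.array([np.pi]), np.array([0.0]), np.array([0.0])))
```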
Consider the time interval I = [t₀, t_f] for initial and final times t₀, t_f satisfying t_f > t₀ and define twice-differentiable time-dependent desired outputs y_d : I → R^k with r(t) = [y_d(t)^⊤  ẏ_d(t)^⊤]^⊤. The error between the outputs and the desired outputs (commonly referred to as virtual constraints [44]) yields the dynamic system:

$$ \frac{d}{dt}\begin{bmatrix} y(q) - y_d(t) \\ \dot{y}(q,\dot{q}) - \dot{y}_d(t) \end{bmatrix} = \underbrace{\begin{bmatrix} \frac{\partial y}{\partial q}\dot{q} \\ \frac{\partial \dot{y}}{\partial q}\dot{q} - \frac{\partial y}{\partial q}D(q)^{-1}H(q,\dot{q}) \end{bmatrix}}_{f(q,\dot{q})} - \underbrace{\begin{bmatrix} \dot{y}_d(t) \\ \ddot{y}_d(t) \end{bmatrix}}_{\dot{r}(t)} + \underbrace{\begin{bmatrix} 0_{k\times m} \\ \frac{\partial y}{\partial q}D(q)^{-1}B \end{bmatrix}}_{g(q)} u, \qquad (2) $$
noting that ∂ẏ/∂q̇ = ∂y/∂q. For all q ∈ R, g(q) is full rank by the relative degree assumption. Define η : Q × R^n × I → R^{2k}, f̃ : Q × R^n → R^k, and g̃ : Q → R^{k×m} as:

$$ \eta(q,\dot{q},t) = \begin{bmatrix} y(q) - y_d(t) \\ \dot{y}(q,\dot{q}) - \dot{y}_d(t) \end{bmatrix} \qquad (3) $$

$$ \tilde{f}(q,\dot{q}) = \frac{\partial \dot{y}}{\partial q}\dot{q} - \frac{\partial y}{\partial q}D(q)^{-1}H(q,\dot{q}) \qquad (4) $$

$$ \tilde{g}(q) = \frac{\partial y}{\partial q}D(q)^{-1}B, \qquad (5) $$

and assume U = R^m. The input-output linearizing control input is specified by:

$$ u(q,\dot{q},t) = \tilde{g}(q)^{\dagger}\left(-\tilde{f}(q,\dot{q}) + \ddot{y}_d(t) + \nu(q,\dot{q},t)\right), \qquad (6) $$

with auxiliary input ν(q,q̇,t) ∈ R^k for all q ∈ Q, q̇ ∈ R^n, and t ∈ I, where † denotes the Moore-Penrose pseudoinverse. This controller used in (2) generates linear output dynamics of the form:

$$ \dot{\eta}(q,\dot{q},t) = \underbrace{\begin{bmatrix} 0_{k\times k} & I_{k\times k} \\ 0_{k\times k} & 0_{k\times k} \end{bmatrix}}_{F} \eta(q,\dot{q},t) + \underbrace{\begin{bmatrix} 0_{k\times k} \\ I_{k\times k} \end{bmatrix}}_{G} \nu(q,\dot{q},t), \qquad (7) $$

where (F, G) are a controllable pair. Defining K = [K_p  K_d] where K_p, K_d ∈ S^k_{++}, the auxiliary control input ν(q,q̇,t) = −Kη(q,q̇,t) induces output dynamics:

$$ \dot{\eta}(q,\dot{q},t) = A_{cl}\,\eta(q,\dot{q},t), \qquad (8) $$

where A_{cl} = F − GK is Hurwitz. This implies the desired output trajectory y_d is exponentially stable. This conclusion allows us to construct a Lyapunov function for the system using converse theorems found in [21]. With A_{cl} Hurwitz, for any Q ∈ S^{2k}_{++}, there exists a unique P ∈ S^{2k}_{++} such that the Continuous Time Lyapunov Equation (CTLE):

$$ A_{cl}^{\top}P + PA_{cl} = -Q, \qquad (9) $$

is satisfied. Let C = {η(q,q̇,t) : (q,q̇) ∈ R × R^n, t ∈ I}. Then V(η) = η^⊤Pη, implicitly a function of q, q̇, and t, is a Lyapunov function certifying exponential stability of (8) on C satisfying:

$$ \lambda_{\min}(P)\|\eta\|_2^2 \le V(\eta) \le \lambda_{\max}(P)\|\eta\|_2^2, \qquad \dot{V}(\eta) \le -\lambda_{\min}(Q)\|\eta\|_2^2, \qquad (10) $$

for all η ∈ C. Here λ_min(·) and λ_max(·) denote the minimum and maximum eigenvalues of a symmetric matrix, respectively. Alternatively, a Lyapunov function of the same form can be constructed directly from (7) using the Continuous Algebraic Riccati Equation (CARE) [21].
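As a minimal illustration of this construction, the sketch below forms A_cl for k = 1, solves the CTLE (9) for P with SciPy, and evaluates V(η) = η^⊤Pη. The gain values and the use of scipy.linalg.solve_continuous_lyapunov are assumptions made for the example, not prescribed by the paper.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# k = 1 output: F, G from (7); gains K_p, K_d > 0 chosen arbitrarily here.
k = 1
F = np.block([[np.zeros((k, k)), np.eye(k)],
              [np.zeros((k, k)), np.zeros((k, k))]])
G = np.vstack([np.zeros((k, k)), np.eye(k)])
K = np.hstack([5.0 * np.eye(k), 3.0 * np.eye(k)])   # K = [K_p  K_d]
A_cl = F - G @ K                                     # closed-loop matrix from (8)

# Solve the CTLE (9): A_cl^T P + P A_cl = -Q for a chosen Q > 0.
Q = np.eye(2 * k)
P = solve_continuous_lyapunov(A_cl.T, -Q)            # solves A_cl^T P + P A_cl = -Q

def V(eta):
    """Converse Lyapunov function V(eta) = eta^T P eta satisfying (10)."""
    return float(eta @ P @ eta)

eta = np.array([0.1, -0.2])
print(V(eta), np.linalg.eigvalsh(P))  # P should be symmetric positive definite
```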
B. Control Lyapunov Functions
The preceding formulation of a Lyapunov function required the choice of the specific control law given in (6). For optimality purposes, it may be desirable to choose a different control input for the system, thus motivating the following definition. Let C ⊆ R^{2k}. A function V : R^{2k} → R_+ is a Control Lyapunov Function (CLF) for (1) on C certifying exponential stability if there exist constants c₁, c₂, c₃ > 0 such that:

$$ c_1\|\eta\|_2^2 \le V(\eta) \le c_2\|\eta\|_2^2, \qquad \inf_{u\in U}\dot{V}(\eta,u) \le -c_3\|\eta\|_2^2, \qquad (11) $$

for all η ∈ C. We see that the previously constructed Lyapunov function satisfying (10) satisfies (11) by choosing the control input specified in (6). In the absence of a specific control input, we may write the Lyapunov function time derivative as:

$$ \dot{V}(\eta,u) = \frac{\partial V}{\partial \eta}\dot{\eta} = \frac{\partial V}{\partial \eta}\left(f(q,\dot{q}) - \dot{r}(t) + g(q)u\right). \qquad (12) $$

Information about the dynamics is encoded within the scalar function V̇, offering a reduction in dimensionality which will become relevant later in learning. Also note that V̇ is affine in u. This leads to the class of quadratic program based controllers given by:

$$ u(q,\dot{q},t) = \underset{u\in U}{\arg\min}\ \tfrac{1}{2}u^{\top}Mu + s^{\top}u + r \qquad \text{s.t.}\quad \dot{V}(\eta,u) \le -c_3\|\eta\|_2^2, \qquad (13) $$

for M ∈ S^m_+, s ∈ R^m, and r ∈ R, provided U is a polyhedron. Here S^m_+ denotes the set of m × m symmetric positive semi-definite matrices.
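A minimal sketch of the CLF-QP (13) using CVXPY is given below, assuming U = R^m so the only constraint is the exponential stability inequality. The quantities a_dyn and b_dyn stand for the known coefficients of the affine expression V̇(η,u) = a_dyn^⊤u + b_dyn obtained from (12); all names and numerical values are illustrative.

```python
import numpy as np
import cvxpy as cp

def clf_qp_control(a_dyn, b_dyn, eta, c3=1.0, M=None, s=None):
    """Solve the CLF-QP (13) for U = R^m, where Vdot(eta, u) = a_dyn @ u + b_dyn."""
    m = a_dyn.shape[0]
    M = np.eye(m) if M is None else M        # quadratic cost weight (PSD)
    s = np.zeros(m) if s is None else s      # linear cost weight
    u = cp.Variable(m)
    objective = cp.Minimize(0.5 * cp.quad_form(u, M) + s @ u)
    constraints = [a_dyn @ u + b_dyn <= -c3 * float(eta @ eta)]  # Vdot <= -c3 ||eta||^2
    cp.Problem(objective, constraints).solve()
    return u.value

# Illustrative numbers only (not from the paper).
print(clf_qp_control(a_dyn=np.array([2.0]), b_dyn=0.5, eta=np.array([0.3, -0.1])))
```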
III. UNCERTAINTY MODELS & LEARNING
This section defines the class of model uncertainty we
consider in this work and investigates its impact on the
control system, and concludes with motivation for a data-
driven approach to mitigate this impact.
A. Uncertainty Modeling Assumptions
As defined in Section II, we consider affine robotic control systems that evolve under dynamics described by (1). In practice, we do not know the dynamics of the system exactly, and instead develop our control systems using the estimated model:

$$ \hat{D}(q)\ddot{q} + \underbrace{\hat{C}(q,\dot{q})\dot{q} + \hat{G}(q)}_{\hat{H}(q,\dot{q})} = \hat{B}u. \qquad (14) $$

We assume the estimated model (14) satisfies the relative degree condition on the domain R, and thus may use the method of feedback linearization to produce a Control Lyapunov Function (CLF), V, for the system. Using the definitions established in (2) in conjunction with the estimated model, we see that the true system evolves as:

$$ \dot{\eta} = \hat{f}(q,\dot{q}) - \dot{r}(t) + \hat{g}(q)u + \underbrace{\left(g(q) - \hat{g}(q)\right)}_{A(q)}u + \underbrace{f(q,\dot{q}) - \hat{f}(q,\dot{q})}_{b(q,\dot{q})}. \qquad (15) $$
We note the following features of modeling uncertainty in this fashion:

  • Uncertainty is allowed to enter the system dynamics via parametric error as well as through completely unmodeled dynamics. In particular, the function H can capture a wide variety of nonlinear behavior and only needs to be Lipschitz continuous.
  • This formulation explicitly allows uncertainty in how the input is introduced into the dynamics via uncertainty in the inertia matrix D and static actuation matrix B.
  • This definition of uncertainty is also compatible with a dynamic actuation matrix B : Q × R^n → R^{n×m} given proper assumptions on the relative degree of the system.

Given this formulation of our uncertainty, we make the following assumptions of the true dynamics:

Assumption 1. The true system is assumed to be deterministic, time invariant, and affine in the control input.

Assumption 2. The CLF V, formulated for the estimated model, is a CLF for the true system.
It is sufficient to assume that the true system has relative degree 2 on the domain R to satisfy Assumption 2. This holds since the true values of f̃ and g̃, if known, enable choosing control inputs as in (6) that respect the same linear output dynamics (8). Given that V is a CLF for the true system, its time derivative under uncertainty is given by:

$$ \dot{V}(\eta,u) = \underbrace{\frac{\partial V}{\partial \eta}\left(\hat{f}(q,\dot{q}) - \dot{r}(t) + \hat{g}(q)u\right)}_{\widehat{\dot{V}}(\eta,u)} + \underbrace{\left(\frac{\partial V}{\partial \eta}A(q)\right)^{\top}}_{a(\eta,q)^{\top}}u + \underbrace{\frac{\partial V}{\partial \eta}b(q,\dot{q})}_{b(\eta,q,\dot{q})}, \qquad (16) $$

for all η ∈ R^{2k} and u ∈ U. While V is a CLF for the true system, it is no longer possible to determine if a specific control value will satisfy the derivative condition in (11) due to the unknown components a and b. Rather than form a new Lyapunov function, we seek to better estimate the Lyapunov function derivative V̇ to enable control selection that satisfies the exponential stability requirement. This estimate should be affine in the control input, enabling its use in the controller described in (13). Instead of learning the unknown dynamics terms A and b, which scale with both the dimension of the configuration space and the number of inputs, we will learn the terms a and b, which scale only with the number of inputs. In the case of the planar Segway model we simulate, we reduce the number of learned components from 4 to 2 (assuming kinematics are known). These learned representations need to accurately capture the uncertainty over the domain in which the system is desired to evolve to ensure stability during operation.
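The practical upshot of (16) is that the uncertainty enters V̇ only through an m-dimensional vector a(η,q) and a scalar b(η,q,q̇). A small sketch of how learned estimates of these terms would be combined with the nominal model term V̂̇ is shown below; the callables a_hat, b_hat, and vdot_hat are hypothetical placeholders for whatever regressors are fit, not objects defined in the paper.

```python
import numpy as np

def vdot_estimate(eta, q, q_dot, u, vdot_hat, a_hat, b_hat):
    """Estimate of (16): Vdot ≈ Vdot_hat(eta, u) + a_hat(eta, q)^T u + b_hat(eta, q, q_dot).

    The estimate stays affine in u, so it can replace Vdot in the QP constraint of (13).
    """
    return vdot_hat(eta, u) + a_hat(eta, q) @ u + b_hat(eta, q, q_dot)

# For m inputs, only m + 1 scalar quantities (a_hat and b_hat) must be learned,
# rather than the 2k x m entries of A(q) and the 2k entries of b(q, q_dot).
```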
B. Motivating a Data-Driven Learning Approach
The formulation from (15) and (16) defines a general class
of dynamics uncertainty. It is natural to consider a data-
driven method to estimate the unknown quantities a and b
over the domain of the system. To motivate our learning-
based framework, first consider a simple approach of learning
a and b via supervised regression [19]: we operate the system using some given state-feedback controller to gather data
points along the system’s evolution and learn a function that
approximates a and b via supervised learning.
Concretely, let q₀ ∈ Q be an initial configuration. An experiment is defined as the evolution of the system over a finite time interval from the initial condition (q₀, 0) using a discrete-time implementation of the given controller. A resulting discrete-time state history is obtained, which is then transformed with the Lyapunov function V and finally differentiated numerically to estimate V̇ throughout the experiment. This yields a data set comprised of input-output pairs:

$$ D = \{((q_i, \dot{q}_i, \eta_i, u_i), \dot{V}_i)\}_{i=1}^{N} \subset (Q \times R^n \times R^{2k} \times U) \times R. \qquad (17) $$
Consider a class H_a of nonlinear functions mapping from R^{2k} × Q to R^m and a class H_b of nonlinear functions mapping from R^{2k} × Q × R^n to R. For a given â ∈ H_a and b̂ ∈ H_b, define Ŵ̇ as:

$$ \widehat{\dot{W}}(\eta,q,\dot{q},u) = \widehat{\dot{V}}(\eta,u) + \hat{a}(\eta,q)^{\top}u + \hat{b}(\eta,q,\dot{q}), \qquad (18) $$

and let H be the class of all such estimators mapping R^{2k} × Q × R^n × U to R. Defining a loss function L : R × R → R_+, the supervised regression task is then to find a function in H via empirical risk minimization (ERM):

$$ \inf_{\hat{a}\in H_a,\ \hat{b}\in H_b}\ \frac{1}{N}\sum_{i=1}^{N} L\left(\widehat{\dot{W}}(\eta_i, q_i, \dot{q}_i, u_i), \dot{V}_i\right). \qquad (19) $$
This experiment protocol can be executed either in simulation
or directly on hardware. While being simple to implement,
supervised learning critically assumes independently and
identically distributed (i.i.d) training data. Each experiment
violates this assumption, as the regression target of each data
point is coupled with the input data of the next time step. As
a consequence, standard supervised learning with sequential,
non-i.i.d data collection often leads to error cascades [24].
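As a concrete (and deliberately simplified) sketch of this supervised regression step, the snippet below builds the pairs in (17) by finite-differencing V along a logged trajectory and then fits â and b̂ by least squares over fixed feature maps, so that the estimator (18) stays affine in u. Using linear-in-features regressors (rather than the neural networks used in the paper's experiments) and the specific feature maps are assumptions made to keep the example short.

```python
import numpy as np

def build_dataset(etas, qs, dqs, us, V, dt):
    """Form the pairs in (17): numerically differentiate V along the state history."""
    Vs = np.array([V(eta) for eta in etas])
    Vdots = np.gradient(Vs, dt)                       # estimate of Vdot_i at each sample
    return list(zip(zip(qs, dqs, etas, us), Vdots))

def fit_a_b_least_squares(data, vdot_hat, phi_a, phi_b):
    """Solve the ERM problem (19) with squared loss and linear-in-features a_hat, b_hat.

    a_hat(eta, q) = W_a @ phi_a(eta, q) and b_hat(eta, q, dq) = w_b @ phi_b(eta, q, dq),
    so Wdot_hat is linear in the stacked parameters and (19) reduces to least squares.
    """
    rows, targets = [], []
    for (q, dq, eta, u), vdot in data:
        # Wdot_hat = vdot_hat + u^T W_a phi_a + w_b^T phi_b is linear in (W_a, w_b):
        # u^T W_a phi_a = flatten(W_a) . kron(u, phi_a).
        rows.append(np.concatenate([np.kron(u, phi_a(eta, q)), phi_b(eta, q, dq)]))
        targets.append(vdot - vdot_hat(eta, u))
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return theta  # stacked parameters of a_hat (row-flattened W_a) and b_hat (w_b)
```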
IV. INTEGRATING EPISODIC LEARNING & CLFS
In this section we present the main contribution of this
work: an episodic learning algorithm that captures the uncertainty present in the Lyapunov function derivative in a
learned model and utilizes it in a quadratic program based
controller.
A. Episodic Learning Framework
Episodic learning refers to learning procedures that iteratively alternate between executing an intermediate controller (also known as a roll-out in reinforcement learning [22]), collecting data from that roll-out, and designing a new controller using the newly collected data. Our approach integrates learning a and b with improving the performance and stability of the control policy u in such an iterative fashion. First, assume we are given a nominal state-feedback controller u : Q × R^n × I → U.
Algorithm 1 Dataset Aggregation for Control Lyapunov Functions (DaCLyF)

Require: Control Lyapunov Function V, derivative estimate V̂̇₀, model classes H_a and H_b, loss function L, set of initial configurations Q₀, nominal state-feedback controller u₀, number of experiments T, sequence of trust coefficients 0 ≤ w₁ ≤ ··· ≤ w_T ≤ 1

  D = ∅                                     ▷ Initialize data set
  for k = 1, . . . , T do
      (q₀, 0) ← sample(Q₀ × {0})            ▷ Get initial condition
      D_k ← experiment((q₀, 0), u_{k−1})     ▷ Run experiment
      D ← D ∪ D_k                            ▷ Aggregate data set
      â, b̂ ← ERM(H_a, H_b, L, D, V̂̇₀)        ▷ Fit estimators
      V̂̇_k ← V̂̇₀ + â^⊤u + b̂                   ▷ Update derivative estimator
      u_k ← u₀ + w_k · augment(u₀, V̂̇_k)      ▷ Update controller
  end for
  return V̂̇_T, u_T

With an estimator Ŵ̇ ∈ H as defined in (18), we specify an augmenting controller as:

$$ u'(q,\dot{q},t) = \underset{u'\in R^m}{\arg\min}\ J(u') \qquad \text{s.t.}\quad \widehat{\dot{W}}(\eta, q, \dot{q}, u(q,\dot{q},t) + u') \le -c_3\|\eta\|_2^2, \quad u(q,\dot{q},t) + u' \in U, \qquad (20) $$

where J : R^m → R is any positive semi-definite quadratic cost function.
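A sketch of the augmenting controller (20), again with CVXPY and assuming U = R^m, is shown below. It uses the min-norm cost later introduced in (21), and the learned terms enter only through the affine coefficients of Ŵ̇, so the problem remains a small QP; the function and variable names are illustrative.

```python
import numpy as np
import cvxpy as cp

def augmenting_control(u_nom, a_hat_val, b_hat_val, vdot_hat_coeffs, eta, c3=1.0):
    """Solve (20) with the min-norm cost (21) for U = R^m.

    Wdot_hat at (eta, q, dq) is (vdot_lin + a_hat_val) @ u + vdot_const + b_hat_val,
    evaluated at the total input u = u_nom + u_aug.
    """
    vdot_lin, vdot_const = vdot_hat_coeffs            # affine pieces of Vdot_hat(eta, .)
    m = u_nom.shape[0]
    u_aug = cp.Variable(m)
    total_u = u_nom + u_aug
    wdot_hat = (vdot_lin + a_hat_val) @ total_u + vdot_const + b_hat_val
    objective = cp.Minimize(0.5 * cp.sum_squares(total_u))   # J from (21)
    constraints = [wdot_hat <= -c3 * float(eta @ eta)]
    cp.Problem(objective, constraints).solve()
    return u_aug.value
```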
Our goal is to use this new controller to obtain better
estimates of a and b. One option, as seen in Section III-B,
is to perform experiments and use conventional supervised
regression to update â and b̂. To overcome the limitations
of conventional supervised learning, we leverage reduction
techniques: a sequential prediction problem is reduced to
a sequence of supervised learning problems over multiple
episodes [15], [32]. In particular, in each episode, an experiment generates data using a different controller. The data set
is aggregated and a new ERM problem is solved after each
episode. Our episodic learning implementation is inspired by
the Data Aggregation algorithm (DAgger) [32], with some
key differences:
  • DAgger is a reinforcement learning algorithm, which trains a policy directly in each episode using optimal computational oracles. Our algorithm defines a controller indirectly via a CLF to ensure stability.
  • The ERM problem is underdetermined, i.e., different approximations (â, b̂) may achieve similar loss for a given data set while failing to accurately model a and b. This potentially introduces error in estimating V̇ for control inputs not reflected in the training data, and necessitates the use of exploratory control action to constrain the estimators â and b̂. Such exploration can be achieved by randomly perturbing the controller used in an experiment at each time step. This need for exploration is an analog to the notion of persistent excitation from adaptive systems [28].

Fig. 2. (Left) Model based QP controller fails to track trajectory. (Right) Improvement in angle tracking of system with augmented controller over nominal PD controller. (Bottom) Corresponding visualizations of state data at t = 0 through t = 5. Note that Segway is tilted in the incorrect direction at the end of the QP controller simulation, but is correctly aligned during the augmented controller simulation.
Algorithm 1 specifies a method of computing a sequence
of Lyapunov function derivative estimates and augmenting
controllers. During each episode, the augmenting controller
associated with the estimate of the Lyapunov function derivative is scaled by a factor reflecting trust in the estimate and
added to the nominal controller for use in the subsequent
experiment. The trust coefficients form a monotonically non-
decreasing sequence on the interval [0, 1]. Importantly, this
experiment need not take place in simulation; the same
procedure may be executed directly on hardware. It may be
infeasible to choose a specific configuration for an initial
condition on a hardware platform; therefore we specify a
set of initial configurations Q₀ ⊆ Q from which an initial
condition may be sampled, potentially randomly.
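The sketch below mirrors Algorithm 1 as a plain Python loop; experiment, fit_estimators, and augmenting_control stand in for the roll-out, the ERM step (19), and the QP (20) respectively, and are assumed interfaces rather than functions provided by the paper or by LyaPy.

```python
import numpy as np

def daclyf(vdot_hat_0, experiment, fit_estimators, augmenting_control,
           sample_q0, u_nominal, T, trust_weights):
    """Episodic loop of Algorithm 1 (DaCLyF), written against assumed helper interfaces."""
    dataset = []
    u_k = u_nominal
    a_hat = b_hat = None
    for k in range(T):
        q0 = sample_q0()                                      # initial configuration, zero velocity
        dataset += experiment((q0, np.zeros_like(q0)), u_k)   # roll out current controller
        a_hat, b_hat = fit_estimators(dataset, vdot_hat_0)    # ERM over aggregated data, (19)
        w_k = trust_weights[k]                                # trust in the current estimate

        def u_k(q, dq, t, a_hat=a_hat, b_hat=b_hat, w_k=w_k):
            u0 = u_nominal(q, dq, t)
            u_aug = augmenting_control(u0, a_hat, b_hat, q, dq, t)   # solve (20)
            return u0 + w_k * u_aug                                  # scaled augmentation
    return a_hat, b_hat, u_k
```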
B. Additional Controller Details
During augmentation, we specify the controller in (20) by selecting the minimum-norm cost function:

$$ J(u') = \tfrac{1}{2}\|u(q,\dot{q},t) + u'\|_2^2, \qquad (21) $$

for all u' ∈ R^m, q ∈ Q, q̇ ∈ R^n, and t ∈ I. We additionally incorporate a smoothing regularizer into the cost function of the form:

$$ R(u') = R\,\|u' - u_{prev}\|_2^2, $$

for all u' ∈ R^m, where u_prev ∈ R^m is the previously computed augmenting controller and R > 0. This is done to avoid chatter that may arise from the optimization based nature of the CLF-QP formulation [27].
Note that for this choice of Lyapunov function, the gradient ∂V/∂η, and therefore a, approach 0 as η approaches 0, which occurs close to the desired trajectory. While the estimated Lyapunov function derivative may be fit with low absolute error on the data set, the relative error may still be high for states near the desired trajectory. Such relative error causes the optimization problem in (20) to be poorly conditioned near the desired trajectory. We therefore add a slack term δ ∈ R_+ to the decision variables, which appears in the inequality constraint [3]. The slack term is additionally incorporated into the cost function as:

$$ C(\delta) = \tfrac{1}{2}\,C\,\left\|\left(\frac{\partial V}{\partial \eta}\hat{g}(q)\right)^{\top} + \hat{a}(\eta,q)\right\|_2^2\,\delta^2, \qquad (22) $$

for all δ ∈ R_+, where C > 0. As states approach the
trajectory, the coefficient of the quadratic term decreases
and enables relaxation of the exponential stability inequality
constraint. In practice this leads to input-to-state stable
behavior, described in [40], around the trajectory.
The exploratory control during experiments is naively chosen as additive noise from a centered uniform distribution,
with each coordinate drawn i.i.d. The variance is scaled by
the norm of the underlying controller to introduce exploration
while maintaining a high signal-to-noise ratio.
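A minimal sketch of this exploratory perturbation is given below; the scaling constant and the exact noise law beyond "centered uniform, i.i.d. per coordinate, amplitude scaled by the controller norm" are assumptions for illustration.

```python
import numpy as np

def explore(u, scale=0.1, rng=None):
    """Perturb a control input with centered uniform noise, i.i.d. per coordinate.

    The noise amplitude is proportional to ||u||, keeping a roughly constant
    signal-to-noise ratio as the underlying controller grows or shrinks.
    """
    rng = np.random.default_rng() if rng is None else rng
    amplitude = scale * np.linalg.norm(u)
    return u + rng.uniform(-amplitude, amplitude, size=u.shape)
```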

Citations
Proceedings ArticleDOI
12 Jul 2020
TL;DR: In this article, a reinforcement learning framework was proposed to learn the model uncertainty present in the CBF and CLF constraints, as well as other control-affine dynamic constraints in the quadratic program.
Abstract: In this paper, the issue of model uncertainty in safety-critical control is addressed with a data-driven approach. For this purpose, we utilize the structure of an input-ouput linearization controller based on a nominal model along with a Control Barrier Function and Control Lyapunov Function based Quadratic Program (CBF-CLF-QP). Specifically, we propose a novel reinforcement learning framework which learns the model uncertainty present in the CBF and CLF constraints, as well as other control-affine dynamic constraints in the quadratic program. The trained policy is combined with the nominal model-based CBF-CLF-QP, resulting in the Reinforcement Learning-based CBF-CLF-QP (RL-CBF-CLF-QP), which addresses the problem of model uncertainty in the safety constraints. The performance of the proposed method is validated by testing it on an underactuated nonlinear bipedal robot walking on randomly spaced stepping stones with one step preview, obtaining stable and safe walking under model uncertainty.

132 citations

Posted Content
20 Dec 2019
TL;DR: A machine learning framework utilizing Control Barrier Functions (CBFs) to reduce model uncertainty as it impacts the safe behavior of a system, ultimately achieving safe behavior.
Abstract: Modern nonlinear control theory seeks to endow systems with properties of stability and safety, and have been deployed successfully in multiple domains. Despite this success, model uncertainty remains a significant challenge in synthesizing safe controllers, leading to degradation in the properties provided by the controllers. This paper develops a machine learning framework utilizing Control Barrier Functions (CBFs) to reduce model uncertainty as it impact the safe behavior of a system. This approach iteratively collects data and updates a controller, ultimately achieving safe behavior. We validate this method in simulation and experimentally on a Segway platform.

90 citations


Cites background or methods from "Episodic Learning with Control Lyap..."

  • ...Learning-based approaches have already shown great promise for controlling systems with uncertain models (Schaal and Atkeson (2010); Kober et al. (2013); Khansari-Zadeh and Billard (2014); Cheng et al. (2019); Taylor et al. (2019b); Shi et al. (2019))....


  • ...Future work will seek to investigate the impact of residual error on safe behavior through the analysis established in Taylor et al. (2019a)....


  • ...Furthermore, we build upon recent work utilizing learning in the context of Control Lyapunov Functions (CLFs) (Taylor et al. (2019b)) to construct an approach for learning model uncertainty....


  • ...Additional details on related work are provided in the extended version of this paper (Taylor et al. (2019c))....


  • ...Instead, we take a data-driven approach similar to (Taylor et al. (2019b)) to learn uncertainty as it appears in the time derivative of the CBF, ḣ, given in (6)....


Posted Content
TL;DR: A review of the recent advances made in using machine learning to achieve safe decision making under uncertainties, with a focus on unifying the language and frameworks used in control theory and reinforcement learning research can be found in this article.
Abstract: The last half-decade has seen a steep rise in the number of contributions on safe learning methods for real-world robotic deployments from both the control and reinforcement learning communities. This article provides a concise but holistic review of the recent advances made in using machine learning to achieve safe decision making under uncertainties, with a focus on unifying the language and frameworks used in control theory and reinforcement learning research. Our review includes: learning-based control approaches that safely improve performance by learning the uncertain dynamics, reinforcement learning approaches that encourage safety or robustness, and methods that can formally certify the safety of a learned control policy. As data- and learning-based robot control methods continue to gain traction, researchers must understand when and how to best leverage them in real-world scenarios where safety is imperative, such as when operating in close proximity to humans. We highlight some of the open challenges that will drive the field of robot learning in the coming years, and emphasize the need for realistic physics-based benchmarks to facilitate fair comparisons between control and reinforcement learning approaches.

53 citations

Proceedings ArticleDOI
01 May 2020
TL;DR: Experimental results demonstrate that the proposed controller significantly outperforms a baseline nonlinear tracking controller with up to four times smaller worst-case height tracking errors, and empirically demonstrate the ability of the learned model to generalize to larger swarm sizes.
Abstract: In this paper, we present Neural-Swarm, a nonlinear decentralized stable controller for close-proximity flight of multirotor swarms. Close-proximity control is challenging due to the complex aerodynamic interaction effects between multirotors, such as downwash from higher vehicles to lower ones. Conventional methods often fail to properly capture these interaction effects, resulting in controllers that must maintain large safety distances between vehicles, and thus are not capable of close-proximity flight. Our approach combines a nominal dynamics model with a regularized permutation-invariant Deep Neural Network (DNN) that accurately learns the high-order multi-vehicle interactions. We design a stable nonlinear tracking controller using the learned model. Experimental results demonstrate that the proposed controller significantly outperforms a baseline nonlinear tracking controller with up to four times smaller worst-case height tracking errors. We also empirically demonstrate the ability of our learned model to generalize to larger swarm sizes.

52 citations

Posted Content
TL;DR: This work shows that under suitable smoothness assumptions on the perception map and generative model relating state to high-dimensional data, an affine error model is sufficiently rich to capture all possible error profiles, and can be learned via a robust regression problem.
Abstract: Motivated by vision-based control of autonomous vehicles, we consider the problem of controlling a known linear dynamical system for which partial state information, such as vehicle position, is extracted from complex and nonlinear data, such as a camera image. Our approach is to use a learned perception map that predicts some linear function of the state and to design a corresponding safe set and robust controller for the closed loop system with this sensing scheme. We show that under suitable smoothness assumptions on both the perception map and the generative model relating state to complex and nonlinear data, parameters of the safe set can be learned via appropriately dense sampling of the state space. We then prove that the resulting perception-control loop has favorable generalization properties. We illustrate the usefulness of our approach on a synthetic example and on the self-driving car simulation platform CARLA.

42 citations

References
Posted Content
TL;DR: A new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent, are proposed.
Abstract: We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.

9,020 citations


Additional excerpts

  • ...Successful learning-based approaches have typically focused on learning model-based uncertainty [5], [8], [7], [37], or direct model-free controller design [25], [36], [14], [42], [24]....


Posted Content
TL;DR: This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
Abstract: We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.

4,225 citations


Additional excerpts

  • ...Successful learning-based approaches have typically focused on learning model-based uncertainty [5], [8], [7], [37], or direct model-free controller design [25], [36], [14], [42], [24]....


Journal ArticleDOI
TL;DR: This article attempts to strengthen the links between the two research communities by providing a survey of work in reinforcement learning for behavior generation in robots by highlighting both key challenges in robot reinforcement learning as well as notable successes.
Abstract: Reinforcement learning offers to robotics a framework and set of tools for the design of sophisticated and hard-to-engineer behaviors. Conversely, the challenges of robotic problems provide both inspiration, impact, and validation for developments in reinforcement learning. The relationship between disciplines has sufficient promise to be likened to that between physics and mathematics. In this article, we attempt to strengthen the links between the two research communities by providing a survey of work in reinforcement learning for behavior generation in robots. We highlight both key challenges in robot reinforcement learning as well as notable successes. We discuss how contributions tamed the complexity of the domain and study the role of algorithms, representations, and prior knowledge in achieving these successes. As a result, a particular focus of our paper lies on the choice between model-based and model-free as well as between value-function-based and policy-search methods. By analyzing a simple problem in some detail we demonstrate how reinforcement learning approaches may be profitably applied, and we note throughout open questions and the tremendous potential for future research.

2,391 citations


"Episodic Learning with Control Lyap..." refers background or methods in this paper

  • ...Episodic learning refers to learning procedures that iteratively alternates between executing an intermediate controller (also known as a roll-out in reinforcement learning [22]), collecting data from that roll-out, and designing a new controller using the newly collected data....


  • ...Learning-based approaches have already shown great promise for controlling imperfectly modeled robotic platforms [22], [35]....


Book
16 Apr 2013
TL;DR: How to Construct Nonparametric Regression Estimates * Lower Bounds * Partitioning Estimates * Kernel Estimates * k-NN Estimates * Splitting the Sample * Cross Validation * Uniform Laws of Large Numbers
Abstract: Why is Nonparametric Regression Important? * How to Construct Nonparametric Regression Estimates * Lower Bounds * Partitioning Estimates * Kernel Estimates * k-NN Estimates * Splitting the Sample * Cross Validation * Uniform Laws of Large Numbers * Least Squares Estimates I: Consistency * Least Squares Estimates II: Rate of Convergence * Least Squares Estimates III: Complexity Regularization * Consistency of Data-Dependent Partitioning Estimates * Univariate Least Squares Spline Estimates * Multivariate Least Squares Spline Estimates * Neural Networks Estimates * Radial Basis Function Networks * Orthogonal Series Estimates * Advanced Techniques from Empirical Process Theory * Penalized Least Squares Estimates I: Consistency * Penalized Least Squares Estimates II: Rate of Convergence * Dimension Reduction Techniques * Strong Consistency of Local Averaging Estimates * Semi-Recursive Estimates * Recursive Estimates * Censored Observations * Dependent Observations

1,931 citations


"Episodic Learning with Control Lyap..." refers methods in this paper

  • ...To motivate our learningbased framework, first consider a simple approach of learning a and b via supervised regression [19]: we operate the system using some given state-feedback controller to gather data points along the system’s evolution and learn a function that approximates a and b via supervised learning....


Book
22 Jun 1999
TL;DR: In this article, the authors compare Linear vs. Nonlinear Control of Differential Geometry with Linearization by State Feedback (LSF) by using Linearization and Geometric Non-linear Control (GNC).
Abstract: 1 Linear vs. Nonlinear.- 2 Planar Dynamical Systems.- 3 Mathematical Background.- 4 Input-Output Analysis.- 5 Lyapunov Stability Theory.- 6 Applications of Lyapunov Theory.- 7 Dynamical Systems and Bifurcations.- 8 Basics of Differential Geometry.- 9 Linearization by State Feedback.- 10 Design Examples Using Linearization.- 11 Geometric Nonlinear Control.- 12 Exterior Differential Systems in Control.- 13 New Vistas: Multi-Agent Hybrid Systems.- References.

1,925 citations


"Episodic Learning with Control Lyap..." refers background in this paper

  • ...Input-Output (IO) Linearization is a nonlinear control method that creates stable linear dynamics for a selected set of outputs of a system [34]....


  • ...Define twice-differentiable outputs y : Q → R, with k ≤ m, and assume each output has relative degree 2 on some domain R ⊆ Q (see [34] for details)....


Frequently Asked Questions (12)
Q1. What have the authors contributed in "Episodic learning with control lyapunov functions for uncertain robotic systems*" ?

This paper develops a machine learning framework centered around Control Lyapunov Functions ( CLFs ) to adapt to parametric uncertainty and unmodeled dynamics in general robotic systems. The authors validate their approach on a planar Segway simulation, demonstrating substantial performance improvements by iteratively refining on a base model-free controller. 

There are two main interesting directions for future work. 

The parameters of the model (including mass, inertias, and motor parameters but excluding gravity) are randomly modified by up to 10% of their nominal values and are fixed for the simulations. 

An experiment is defined as the evolution of the system over a finite time interval from the initial condition (q0,0) using a discrete-time implementation of the given controller. 

Episodic learning refers to learning procedures that iteratively alternates between executing an intermediate controller (also known as a roll-out in reinforcement learning [22]), collecting data from that roll-out, and designing a new controller using the newly collected data. 

During augmentation, the authors specify the controller in (20) by selecting the minimum-norm cost function J(u′) = ½‖u(q, q̇, t) + u′‖₂² (21), for all u′ ∈ R^m, q ∈ Q, q̇ ∈ R^n, and t ∈ I.

Given that V is a CLF for the true system, its time derivative under uncertainty is given by V̇(η,u) = ∂V/∂η (f̂(q,q̇) − ṙ(t) + ĝ(q)u) + a(η,q)^⊤u + b(η,q,q̇) (16), for all η ∈ R^{2k} and u ∈ U, where a(η,q)^⊤ = ∂V/∂η A(q) and b(η,q,q̇) = ∂V/∂η b(q,q̇).

During each episode, the augmenting controller associated with the estimate of the Lyapunov function derivative is scaled by a factor reflecting trust in the estimate and added to the nominal controller for use in the subsequent experiment. 

The exploratory control during experiments is naively chosen as additive noise from a centered uniform distribution, with each coordinate drawn i.i.d. 

define Ŵ̇ as Ŵ̇(η,q,q̇,u) = V̂̇(η,u) + â(η,q)^⊤u + b̂(η,q,q̇) (18), and let H be the class of all such estimators mapping R^{2k} × Q × R^n × U to R. Defining a loss function L : R × R → R_+, the supervised regression task is then to find a function in H via empirical risk minimization (ERM): the infimum over â ∈ H_a and b̂ ∈ H_b of (1/N) Σᵢ L(Ŵ̇(ηᵢ, qᵢ, q̇ᵢ, uᵢ), V̇ᵢ) (19).

The slack term is additionally incorporated into the cost function as C(δ) = ½ C ‖(∂V/∂η ĝ(q))^⊤ + â(η,q)‖₂² δ² (22), for all δ ∈ R_+, where C > 0.

In practice, the authors do not know the dynamics of the system exactly, and instead develop their control systems using the estimated model D̂(q)q̈ + Ĉ(q,q̇)q̇ + Ĝ(q) = B̂u (14). The authors assume the estimated model (14) satisfies the relative degree condition on the domain R, and thus may use the method of feedback linearization to produce a Control Lyapunov Function (CLF), V, for the system.