Stochastic Optimization of Bipedal Walking using
Gyro Feedback and Phase Resetting
Felix Faber and Sven Behnke
Computer Science Institute
University of Freiburg, Germany
{faber | behnke}@informatik.uni-freiburg.de
Abstract. We present a method to optimize the walking
pattern of a humanoid robot for forward speed using suitable
metaheuristics. Our starting point is a hand-tuned open-loop gait
that we enhance with two feedback control mechanisms. First,
we employ a P-controller that regulates the foot angle in order to
reduce angular velocity of the robot’s body. Second, we introduce
a phase resetting mechanism that starts the next step at the
moment of foot contact. Using a physics-based simulation, we
demonstrate that such feedback control is essential for achieving
fast and robust locomotion.
For the optimization of open-loop parameters and parameters
of the feedback mechanisms, we compare Policy Gradient Rein-
forcement Learning (PGRL) and Particle Swarm Optimization
(PSO). To make optimization more data-efficient, we extend
PGRL by an adaptive step size and a sequential sampling
procedure. Our experiments show that the proposed extensions
increase the performance of PGRL significantly.
We selected the extended PGRL algorithm to optimize the
gait of a real robot. After optimization, the robot is able to walk
significantly faster.
I. INTRODUCTION
Enabling a humanoid robot to walk fast and stably is non-
trivial. Such robots are highly complex, non-linear dynamical
systems that are hard to control. Sensor noise, imprecise
actuators, and external disturbances further complicate control.
Among the most successful approaches to biped locomotion
are trajectory tracking methods that are based on precomputed
trajectories of the legs or the Zero Moment Point (ZMP) [1].
The joint trajectories are generated offline, for example, by
solving the dynamic equations of motion. During the walking
process, precise controllers are used for following the pre-
computed trajectories. Trajectory tracking, however, requires
a precise model of the robot. At the same time, solving the
dynamical equations of motion can be computationally very
demanding if a robot has many degrees of freedom (DOF). The
approach presented here does not require an accurate model
of the robot and has a low computational complexity.
A completely different approach is that of passive dynamic
walkers (PDW). They use the inherent machine dynamics for
walking and most of them walk without actuation or control.
McGeer [2] first introduced the notion of PDW and showed
that unactuated and uncontrolled planar walking down a slope
is possible. The main interest of the approach is the evolution
of periodic gaits, without considering starting or stopping,
while agility and responsiveness of motion, as required by
a versatile robot such as ours, play a minor role.
In addition, biologically inspired methods have evolved in
which central pattern generators (CPG) [3] generate joint
trajectories by using nonlinear oscillators. CPG-driven lo-
comotion has been successfully applied in cases where a
complete dynamical model of the robot is not available or
too complex to be useful [4], [5]. The disadvantage of CPG-
driven methods is that it is not trivial to determine appropriate
parameter settings for the oscillators in order to achieve a
stable gait. Our approach is similar to CPG methods in that
joint trajectories are generated and adapted online.
Although CPG-based methods do not use a model, stochas-
tic optimization of their parameters is possible. This has been
demonstrated, e.g., with the Aibo quadrupeds [6]–[8]. In order
to reject disturbances, sensory feedback is essential. Conse-
quently, several successful attempts to incorporate sensory
feedback have been reported, e.g. [9], [10].
In this paper, we propose extending an open-loop gait engine
with two feedback mechanisms: gyro feedback that stabilizes
the trunk and phase resetting at foot contact that entrains
the CPG with the natural dynamics of the robot-environment
system. The parameters of these feedback mechanisms are
included in a stochastic optimization procedure. We evaluate
the suitability of two state-of-the-art metaheuristics for our
task and propose extensions that make them more data-
efficient. This includes a sequential sampling procedure, an
adaptive step size, and automatic restarting.
After reviewing some of the related work, we present the
humanoid robot Jupp used for the experiments and its open-
loop gait engine in Sec. III and IV, respectively. The proposed
feedback mechanisms are detailed in Sec. V. The optimization
problem is defined in Sec. VI and Sec. VII presents the PGRL
and PSO metaheuristics together with the proposed extensions.
The heuristics and their extensions are evaluated in a physics-
based simulation, as described in Sec. VIII. Sec. IX reports
the results of transferring stochastic gait optimization to the
real robot.
II. RELATED WORK
Stochastic gait optimization has been successfully demon-
strated for the Aibo quadruped, e.g. by Kohl and Stone [6] who
proposed Policy Gradient Reinforcement Learning (PGRL).
They approximate the gradient of the objective function by
sampling in the vicinity of a parameter vector. In contrast to

our approach, the authors did not make use of sensory feedback. Röfer [11] used the readings of acceleration sensors to judge stability of Aibo gaits during evolutionary optimization. However, the acceleration sensors were not used to adapt the gait online.
Geng et al. [10] successfully used PGRL for gait optimiza-
tion of a planar biped. While their approach optimizes only
two gait parameters, we include many more parameters in
the optimization. This is possible because of the data-efficient
sequential sampling procedure.
Hemker et al. [9] applied a sequential surrogate approach to
optimize the speed of their humanoid robot. In contrast to our
approach, the authors did not optimize feedback parameters.
The gait also does not include a phase resetting mechanism.
Kosse [12] optimized the gait of a KHR-1 humanoid robot
for forward speed using evolutionary strategies. His gait is
completely open-loop.
Nakanishi et al. [4] proposed phase resetting to adapt
the CPG-frequency of a planar biped. They reset the CPG-
oscillators when the foot hits the ground. The approach learns
open-loop gait parameters from demonstration, but does not
include feedback parameters in the learning process.
Aoi and Tsuchiya [5] also incorporated a phase-resetting
mechanism. The authors did not optimize the gait parameters
and did not make use of gyroscope feedback.
III. KIDSIZE HUMANOID ROBOT JUPP
The robot used for our experiments is Jupp, a 19 DOF
soccer robot from team NimbRo. Jupp is 60 cm tall and weighs only 2.3 kg. Each leg has a 3 DOF hip, a knee, and a 2 DOF
ankle joint. A pitch joint allows for bending the trunk, just
above the hip. The arms have 2 DOF shoulders and an elbow
joint. All joints are driven by RC-servo motors.
Most parts of the skeleton are made of rectangular alu-
minum tubes. The arms and the feet are made of carbon
composite. A Pocket PC, located in the upper trunk of the
robot, is the main computer. It generates target positions for all
joints at a rate of 83 Hz. Jupp is equipped with two gyroscopes (ADXRS, ±150/300 °/s) that are located in the robot's trunk.
Two force sensing resistors (FSR) are attached to each heel.
They are used to measure foot contact with the ground. See
[13] for more hardware details.
IV. OPEN-LOOP GAIT ENGINE
The starting point for our work is a clock-driven, open-loop gait developed by Behnke [13]. The gait is driven by a central clock −π ≤ φ_Trunk < π that determines the step frequency ψ. Each leg derives its gait phase φ_Leg by shifting the trunk phase: φ_Leg = φ_Trunk + Λ·π/2. The leg sign Λ is Λ = −1 for the left leg and Λ = 1 for the right leg. To abstract from the individual joints, we implemented a kinematic interface that allows for changing leg extension γ, leg angle θ_Leg = (θ^r_Leg, θ^p_Leg, θ^y_Leg), and foot angle θ_Foot = (θ^r_Foot, θ^p_Foot). Superscripts r, p, and y indicate the roll, pitch, and yaw direction, respectively.
Each leg generates sinusoidal trajectories for this interface from its phase φ_Leg. Although the original gait is omnidirectional, we consider here only walking in forward direction. The key ingredients of the open-loop gait are: shifting the weight from one leg to the other, shortening the leg not needed for support, and leg motion in walking direction. The gait speed and direction can be varied while the robot is walking by adjusting the leg swing amplitude a_Swing = (a^r_Swing, a^p_Swing, a^y_Swing). The robot starts walking by first fading in the lateral weight shift until it is stepping on the spot and then fading in a_Swing. The open-loop gait engine has a set of parameters that have been adapted to the robot. Many of them have an intuitive interpretation.
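As an illustration of this clock-driven structure, the following Python sketch derives a leg phase from the trunk phase and turns it into a sinusoidal swing target. The wrapping helper, the function names, and the amplitude value are illustrative assumptions, not the actual gait engine.

import math

def leg_phase(phi_trunk, leg_sign):
    """Leg phase phi_Leg = phi_Trunk + leg_sign * pi/2, wrapped back into [-pi, pi)."""
    phi = phi_trunk + leg_sign * math.pi / 2.0
    return (phi + math.pi) % (2.0 * math.pi) - math.pi

def sagittal_swing_target(phi_leg, a_swing_pitch):
    """Illustrative sinusoidal leg-angle trajectory in walking (pitch) direction."""
    return a_swing_pitch * math.sin(phi_leg)

# The two legs run half a gait period apart (leg signs -1 and +1).
phi_trunk = 0.3
left = sagittal_swing_target(leg_phase(phi_trunk, -1), a_swing_pitch=0.25)
right = sagittal_swing_target(leg_phase(phi_trunk, +1), a_swing_pitch=0.25)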
V. FEEDBACK CONTROL
Jupp reached a top speed of 22.5 cm/s using the hand-tuned
open-loop gait. In order to achieve higher speeds, we augment
the open-loop gait with two feedback control mechanisms
using readings from the two gyroscopes and the FSRs.
A. Gyroscope Feedback
If the angular velocities measured by the gyroscopes ω^{r/p}_Gyro become too large, the robot is likely to lose balance. In order to induce torque in the direction opposite to the possible fall, we use a P-controller to modify the foot angles θ^{r/p}_Foot to θ^{r/p}_FootGyro. We want the angular velocity of the upper body to be zero. Thus, the gyroscope P-controllers with gains K^{r/p} are

θ^r_FootGyro = θ^r_Foot + Λ · K^r · ω^r_Gyro   and
θ^p_FootGyro = θ^p_Foot + K^p · ω^p_Gyro.
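A minimal Python sketch of these P-controllers, assuming gyro readings in rad/s and foot angles in radians; the function and variable names are placeholders rather than the robot's actual control interface.

def gyro_foot_correction(theta_foot_roll, theta_foot_pitch,
                         omega_gyro_roll, omega_gyro_pitch,
                         k_roll, k_pitch, leg_sign):
    """Offset the commanded foot angles proportionally to the measured trunk angular
    velocities, so that torque is induced against a beginning fall."""
    theta_roll_cmd = theta_foot_roll + leg_sign * k_roll * omega_gyro_roll
    theta_pitch_cmd = theta_foot_pitch + k_pitch * omega_gyro_pitch
    return theta_roll_cmd, theta_pitch_cmd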
B. Phase Reset
For fast and stable walking, it is essential that the frequency
of the central pattern generator inside the robot and the natural
frequency of the robot-environment system are tuned to each
other. While it is possible to find a fixed step frequency that
is on average tuned to the system, disturbances require the
online-adaptation of the step duration.
In order to achieve an entrainment between the internal
clock and the robot-environment system, we reset the trunk
phase φ_Trunk after the foot hits the ground. When contact of the foot with the ground is sensed by the FSRs at time t_FC, the difference between the current trunk phase and the nominal start of the next step Φ_Reset is calculated:

∆_Reset = Φ_Reset − φ_Trunk   at t = t_FC.

Fig. 1. Phase Resetting: After the foot hits the ground at t_FootContact, the trunk phase φ_Trunk changes by ∆_Reset, distributed over a time T_Trans.
In order to avoid discontinuities in the joint trajectories, the required phase change ∆_Reset is distributed over a transient time T_Trans. Let φ_Old be the continued course of φ_Trunk if it was not reset and φ_Target = φ_Trunk + ∆_Reset be the course of the phase after an immediate jump of ∆_Reset. As illustrated in Fig. 1, φ_Trunk linearly fades from φ_Old to φ_Target:

φ_Trunk = φ_Old + (t − t_FC) · ∆_Reset / T_Trans   if t_FC < t < t_FC + T_Trans.
As the phases of all other body parts are derived from the
trunk phase, they reset as well. While our resetting mechanism
changes the timing of the walking motion, the trajectory
generation procedure is not changed.
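A sketch of the phase-reset blend in Python; times are assumed to be in seconds and phases in radians, and the names are illustrative.

def trunk_phase_with_reset(phi_old, delta_reset, t, t_fc, t_trans):
    """Distribute the phase change delta_reset linearly over the transient time t_trans,
    starting at the foot-contact time t_fc, to avoid discontinuities in the trajectories."""
    if t <= t_fc:
        return phi_old                           # reset has not started yet
    if t >= t_fc + t_trans:
        return phi_old + delta_reset             # reset fully applied
    return phi_old + (t - t_fc) * delta_reset / t_trans   # linear fade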
VI. STOCHASTIC OPTIMIZATION
Both the open-loop gait engine and the proposed feedback
mechanisms have parameters that need to be adjusted to
produce fast and stable walking patterns. As we do not have
a good model of the robot, we follow a model-free approach.
A fitness function f(x), f: R^N → R, expresses our optimization goals: speed and robustness. This function evaluates a parameter vector x = (x_1, x_2, ..., x_N) through a walking experiment, which is called a trial. The robot starts from a standing position and walks for a certain distance d_exp while
using the gait parameters x. The quality of the gait is judged
using the time needed to reach the desired distance. In case
the gait is unstable and the robot falls, the distance traveled
is used. Because we prefer slower robust gaits over faster
unstable ones, we formulate the fitness function in a way that
any fall-free trial receives a higher score than a fall.
The speed v of the robot is divided by a maximum speed v_max, which will not be exceeded. In case the robot falls, this relative speed is multiplied with the relative distance traveled (d/d_exp). Otherwise, the robot receives an extra bonus of +1 for stability:

f(d, v) = (v / v_max) · (d / d_exp)   if d < d_exp,
f(d, v) = v / v_max + 1               otherwise.
The optimization problem is now to find the parameter vector x* with the highest fitness:

x* = arg max_x f(x).
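A direct Python transcription of this fitness function (a sketch; d, v, d_exp, and v_max are the quantities defined above):

def fitness(d, v, d_exp, v_max):
    """Fitness of one trial: relative speed scaled by relative distance if the robot
    fell before reaching d_exp, and relative speed plus a +1 stability bonus otherwise."""
    if d < d_exp:                      # the robot fell before completing the distance
        return (v / v_max) * (d / d_exp)
    return v / v_max + 1.0             # any fall-free trial scores higher than any fall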
The optimization procedure faces several difficulties. First,
f is not a deterministic function, but the fitness of a trial is
a random variable that depends on disturbances encountered.
Evaluating the same parameter vector twice might result in completely different fitness values, e.g. if the robot falls in one trial and walks stably in the other. To reduce the variance of the fitness, we evaluate each parameter vector three times, if not stated otherwise. Another difficulty is that the fitness function is not necessarily smooth. Slight parameter changes can make a stable gait pattern fall. Hence, common gradient descent methods cannot be applied for optimizing f. Finally, the individual parameters are highly dependent on each other, which makes it impossible to tune them individually, and the mapping from parameters to the fitness is quite non-linear.
Various metaheuristics have been proposed in the literature to
tackle such difficult optimization problems.
VII. METAHEURISTICS
We evaluated two state-of-the-art metaheuristics for suit-
ability to our task: Policy Gradient Reinforcement Learning
(PGRL), introduced by Kohl and Stone [6], and Particle Swarm Optimization (PSO) by Kennedy and Eberhart [14].
Both metaheuristics have been successfully applied to similar
optimization problems in the past [6], [15].
A. Policy Gradient Reinforcement Learning
In PGRL, each parameter vector is considered as an open-loop policy that can be executed by the robot. The algorithm randomly generates B test policies {x^1, x^2, ..., x^B} around policy x^π, which is to be improved. The parameters x^i_j of the test policies x^i are set randomly to either x^π_j − ε, to x^π_j, or to x^π_j + ε, for all 1 ≤ i ≤ B and 1 ≤ j ≤ N. The disturbance ε is a small constant value.

The test policies x^i are executed. The obtained fitness samples are grouped into three categories for each dimension j: S^−_j, S^0_j, and S^+_j, depending on whether the jth parameter of x^i is modified by −ε, 0, or +ε. For each set, the average fitness z^−_j, z^0_j, and z^+_j is computed and an adjustment vector a = (a_1, a_2, ..., a_N) is constructed as follows:

a_j = 0               if z^0_j > z^+_j and z^0_j > z^−_j,
a_j = z^+_j − z^−_j   otherwise.

The adjustment a is normalized to a scalar step-size η and added to x^π:

x^π ← x^π + η · a / |a|.

PGRL continues to test the vicinity of the adjusted policy x^π for possible improvements in the next iteration.
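The following Python sketch shows one PGRL iteration as described above. Here evaluate stands for running one or more walking trials with a policy and returning an (averaged) fitness value; it is an assumed callback, not part of the paper.

import random
from statistics import mean

def pgrl_iteration(policy, evaluate, epsilon, eta, num_test_policies):
    """One PGRL iteration: perturb each parameter of the test policies by -epsilon, 0, or
    +epsilon, estimate the gradient per dimension from the grouped average fitnesses,
    and move the policy by a normalized step of size eta."""
    n = len(policy)
    perturbations = [[random.choice((-epsilon, 0.0, epsilon)) for _ in range(n)]
                     for _ in range(num_test_policies)]
    fitnesses = [evaluate([p + d for p, d in zip(policy, pert)]) for pert in perturbations]

    adjustment = []
    for j in range(n):
        s_minus = [f for f, pert in zip(fitnesses, perturbations) if pert[j] < 0.0]
        s_zero = [f for f, pert in zip(fitnesses, perturbations) if pert[j] == 0.0]
        s_plus = [f for f, pert in zip(fitnesses, perturbations) if pert[j] > 0.0]
        if not s_minus or not s_zero or not s_plus:
            adjustment.append(0.0)       # one group received no samples, skip dimension
            continue
        z_minus, z_zero, z_plus = mean(s_minus), mean(s_zero), mean(s_plus)
        if z_zero > z_plus and z_zero > z_minus:
            adjustment.append(0.0)       # unperturbed value looks best, do not move
        else:
            adjustment.append(z_plus - z_minus)

    norm = sum(a * a for a in adjustment) ** 0.5
    if norm == 0.0:
        return list(policy)              # no informative gradient estimate this iteration
    return [p + eta * a / norm for p, a in zip(policy, adjustment)]

With the settings used later in Section VIII-A, this sketch would be called with epsilon = 0.015, num_test_policies = 10, and eta from the adaptive schedule described below.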

Extensions to the PGRL Algorithm
1) Adaptive Step Size: To implement a coarse-to-fine search, we adapt the step size η. We start with a step size η_max. After g steps, we decrease η linearly to η_min:

η = η_max                                          if s < g,
η = η_max − (s − g) · (η_max − η_min) / (S − g)    otherwise.
Here, s counts the number of fitness function evaluations and S
denotes the maximum allowed number of function evaluations.
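A sketch of this schedule in Python, using s, g, S, η_max, and η_min as defined above:

def step_size(s, g, S, eta_max, eta_min):
    """Coarse-to-fine step size: constant eta_max for the first g fitness evaluations,
    then a linear decrease that reaches eta_min at the evaluation budget S."""
    if s < g:
        return eta_max
    return eta_max - (s - g) * (eta_max - eta_min) / (S - g)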
2) Sequential Sampling: Instead of sampling the fitness function for each test policy x^i three times (through three walking trials), we apply an approach proposed by Branke and Schmidt [16] that does not fix the number of samples in advance. Instead, samples are generated one at a time until a certain level of confidence for the fitness is achieved, or a maximum number of samples M have been generated. The target level of confidence is determined by comparing the fitness of a test policy x^i to that of the policy x^π, which is to be improved. If the fitnesses of the two policies are very different, only a few samples suffice for estimating the direction of the gradient of the fitness function. For policies with similar fitness values, more samples are required to guarantee a certain level of confidence. The sequential sampling procedure is summarized in Algorithm 1.
Algorithm 1 Sequential Sampling Procedure
Input: Average policy fitness z^π, test policy x^i, threshold for fitness difference ν, max. no. of samplings M
Output: Average fitness z^i of policy x^i, number of samplings used m
  z^i ← fitness obtained by sampling policy x^i once
  d ← |z^i − z^π|
  for m = 1 to M − 1 do
    if d > ν then
      return (z^i, m)
    else
      z'^i ← fitness obtained by sampling x^i one more time
      z^i ← (z^i · m + z'^i) / (m + 1)
      d ← |z^i − z^π|
    end if
  end for
  return (z^i, m)
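A Python sketch of Algorithm 1; sample_fitness stands for executing one walking trial with the test policy and is an assumed callback.

def sequential_sampling(z_pi, sample_fitness, nu, max_samples):
    """Sample a test policy one trial at a time until its running-average fitness differs
    from the reference fitness z_pi by more than nu, or max_samples trials were used.
    Returns the average fitness and the number of samples drawn."""
    z_i = sample_fitness()                      # first sample
    m = 1
    while m < max_samples and abs(z_i - z_pi) <= nu:
        z_new = sample_fitness()                # one more walking trial
        z_i = (z_i * m + z_new) / (m + 1)       # update running average
        m += 1
    return z_i, m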
B. Particle Swarm Optimization
Particle Swarm Optimization (PSO) works on a set (swarm) of parameter vectors (particles) x^i. Each particle has a velocity v^i = (v^i_1, v^i_2, ..., v^i_N). The velocity indicates how much the value of the corresponding parameter, and thus the position of the particle, changes in the next iteration of the algorithm. It is altered according to the best point in parameter space a particle has visited so far, p^i = (p^i_1, p^i_2, ..., p^i_N), and the global best point PSO has found so far, p^g:

v^i_j ← w · v^i_j + c_1 · χ_1 · (p^i_j − x^i_j) + c_2 · χ_2 · (p^g_j − x^i_j).

Here, parameters w, c_1, and c_2 influence the balance between local and global exploration. The weights χ_{1/2} add stochasticity to the search. They are drawn uniformly from [0, 1]. The positions of the particles are updated accordingly:

x^i ← x^i + v^i.
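A minimal Python sketch of this update for one particle; representing particles as plain lists of per-dimension values is an illustrative choice.

import random

def pso_update(x, v, p_best, g_best, w, c1, c2):
    """In-place PSO update: momentum plus attraction towards the particle's own best
    point p_best and towards the swarm's global best point g_best."""
    for j in range(len(x)):
        chi1, chi2 = random.random(), random.random()   # drawn uniformly from [0, 1]
        v[j] = w * v[j] + c1 * chi1 * (p_best[j] - x[j]) + c2 * chi2 * (g_best[j] - x[j])
        x[j] += v[j]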
Extension to PSO
To test for convergence of the algorithm, we sum up the distances of all J particles to the current global best position:

q_PSO = Σ_{j=1..J} |x^j − p^g|.

When q_PSO is smaller than a threshold τ_PSO, the algorithm has converged and only small improvements can be expected by continuing the search. We then restart the algorithm with a new set of randomly generated particles.
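The convergence test and restart could look as follows (a sketch; random_particle is a hypothetical helper that re-initializes a particle within the normalized parameter ranges):

def has_converged(swarm, g_best, tau_pso):
    """Sum the Euclidean distances of all particles to the global best position and
    compare the sum against the restart threshold tau_pso."""
    q_pso = sum(sum((xj - gj) ** 2 for xj, gj in zip(x, g_best)) ** 0.5 for x in swarm)
    return q_pso < tau_pso

# Usage sketch: restart with freshly randomized particles once the swarm has collapsed.
# if has_converged(swarm, g_best, tau_pso=0.3):
#     swarm = [random_particle() for _ in swarm]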
VIII. EXPERIMENTS IN A PHYSICS-BASED SIMULATION
The simulator used in this work is based on the Open
Dynamics Engine (ODE), a library for simulating rigid body
dynamics [17]. For the simulation experiments, the experimental distance is set to d_exp = 4.5 m. One trial is complete when the robot has walked this distance in any direction¹ or if it falls beforehand.
In order to be able to transfer the optimization procedure
to the real robot, we limit the number of trials (evaluations
of the fitness function) to 1000. We call 1000 trials an
episode. A configuration for the learning problem includes a
metaheuristic, parameters of the metaheuristic (e.g. step size,
number of start individuals), and a parameter set subject to
optimization.
When comparing different configurations, we consider not only the overall best (fittest) parameter vector encountered during an episode. To obtain a better characterization of the
optimization performance, we obtain a distribution of 1000
fitness values as follows: We evaluate each configuration 10
times. From each of the 10 episodes, we take the 10 best
individuals. We then evaluate these 100 individuals 10 times.
We compare the resulting distributions using notched box plots
which can be viewed as a visualization of a t-test.
Because the number of evaluations of the fitness function is
limited both in simulation and in the real-world experiments,
not all parameters of the gait can be optimized simultaneously.
To avoid the curse of dimensionality, we select nine open-loop
trajectory generation parameters and four feedback parameters
that we consider to be relevant candidates for optimization. For
an overview of the parameters, see Table I. In order to avoid
scaling problems, the ranges of parameter values considered
for optimization are normalized to the interval [0, 1].
¹ In the real-world experiments presented in Section IX, we use a more restrictive goal formulation.

TABLE I
OPTIMIZED PARAMETERS

Open-Loop Parameters
  ψ            step frequency
  a_Swing      swing amplitude of the leg in forward direction
  a_Shift      amplitude for lateral weight shifting
  λ_FootSwing  partial balance of the leg swing with the foot angle
  λ_FootPitch  general tilt of the robot in sagittal direction
  a_FootPitch  oscillates the robot in sagittal direction with every step
  ρ_FootPitch  phase shift of the a_FootPitch oscillation
  λ_LegSpread  offset to the lateral leg angle
  a_Arms       amplitude of the arm movement

Feedback Control Parameters
  K^r          gain of lateral gyro feedback
  K^p          gain of sagittal gyro feedback
  Φ_Reset      nominal trunk phase for phase resetting
  T_Trans      transition time for phase resetting

Parameters used for the real-world experiment
[Box plot figure: fitness (0.9 to 1.6) per configuration: RandomSearch, PGRL 1, PGRL 2, PSO 1, PSO 2.]
Fig. 2. Comparing metaheuristics. See text for configuration descriptions.
A. Comparing PGRL and PSO Metaheuristics
In this section, we compare two metaheuristics and their
variants in order to find the most adequate one for our task. We
use a uniform random search as baseline. All metaheuristics
optimize the parameter set
P_1 = {ψ, a_Robot, a_Shift, λ_FootPitch, K^r, K^p, Φ_Reset, T_Trans}.
The hand-tuned parameters x_init are used to initialize PGRL.
We evaluate two different configurations of extended PGRL.
Both configurations use a threshold for selecting additional samples in the sequential sampling procedure of ν = 0.015 and generate B = 10 test policies with a disturbance of ε = 0.015. The step size of the first configuration (PGRL 1) falls from η_max = 0.7 to η_min = 0.1 after g = 100 steps. The second configuration (PGRL 2) uses η_min = 0.3.
We evaluate two configurations of the PSO. The first PSO
configuration (PSO 1) works on J = 10 particles. Both c_1 and c_2 are set to 1. The second configuration (PSO 2) works on J = 20 particles with c_1 = c_2 = 0.5. For both configurations, a restart threshold τ_PSO = 0.3 and a weight momentum of w = 0.3 are used.
The box plots for comparing these six configurations are
depicted in Figure 2. All tested metaheuristics perform better
than the random search.
[Box plot figure: fitness (0.4 to 1.6) per configuration: none, SS, SS+AS, PR, PR+SS, PR+SS+AS.]
Fig. 3. Relevance of phase resetting (PR), sequential sampling (SS), and an adaptive step size (AS).
When comparing the two PGRL variants, the configuration with the larger target step size η_min = 0.3 (PGRL 2) outperforms the one with the smaller step size η_min = 0.1 (PGRL 1), indicating that η_min = 0.1 is chosen too small. Both PSO configurations find a best individual with a very high fitness (1.61 and 1.6). PSO 2 performs better than PSO 1.
The results of the PSO configurations should be taken with
a grain of salt: When the PSO has converged, the particles are
concentrated within a small region of the search space. Thus,
taking the 10 best individuals from one episode comes close
to taking the same (best) individual ten times. As PGRL does not suffer from this problem and PGRL 2 performed nearly as well as the best PSO configuration, we decided to use PGRL 2 in further simulation experiments and for the real-world experiment.
B. Relevance of phase resetting, sequential sampling and an
adaptive step size
To show the relevance of phase resetting, sequential sampling, and an adaptive step size, we compare six different configurations. The first two configurations use neither sequential sampling nor an adaptive step size. The second two configurations make use of sequential sampling, but do not adapt the step size. The third two configurations use both sequential sampling and an adaptive step size. Within each pair, one configuration uses phase resetting and the other does not.
Box plots for all six configurations are shown in Fig. 3.
The configurations using phase resetting (PR) outperform the
configurations without phase resetting in all three cases. This
indicates that the phase reset mechanism is essential to achieve
fast and robust locomotion.
It can also be seen that configurations using sequential sampling (SS) outperform configurations without it, and that configurations additionally using an adaptive step size (AS) outperform the corresponding configurations without it. This indicates that both sequential sampling and an adaptive step size improve the PGRL algorithm and lead to better results.

References
Kennedy, J. and Eberhart, R.: Particle swarm optimization.
McGeer, T.: Passive dynamic walking.
Vukobratović, M. and Borovac, B.: Zero-moment point — thirty five years of its life.
Kohl, N. and Stone, P.: Policy gradient reinforcement learning for fast quadrupedal locomotion.
Lizotte, D., et al.: Automatic gait optimization with Gaussian process regression.