Towards Meaningful Robot Gesture
Maha Salem, Stefan Kopp, Ipke Wachsmuth, Frank Joublin
Abstract Humanoid robot companions that are intended to engage in natural and
fluent human-robot interaction are supposed to combine speech with non-verbal
modalities for comprehensible and believable behavior. We present an approach to
enable the humanoid robot ASIMO to flexibly produce and synchronize speech and
co-verbal gestures at run-time, while not being limited to a predefined repertoire
of motor action. Since this research challenge has already been tackled in various
ways within the domain of virtual conversational agents, we build upon the ex-
perience gained from the development of a speech and gesture production model
used for our virtual human Max. Being one of the most sophisticated multi-modal
schedulers, the Articulated Communicator Engine (ACE) has replaced the use of
lexicons of canned behaviors with an on-the-spot production of flexibly planned be-
havior representations. As an underlying action generation architecture, we explain
how ACE draws upon a tight, bi-directional coupling of ASIMO’s perceptuo-motor
system with multi-modal scheduling via both efferent control signals and afferent
feedback.
Maha Salem
Research Institute for Cognition and Robotics, Bielefeld University, Germany, e-mail:
msalem@cor-lab.uni-bielefeld.de
Stefan Kopp
Sociable Agents Group, Bielefeld University, Germany, e-mail: skopp@techfak.uni-bielefeld.de
Ipke Wachsmuth
Artificial Intelligence Group, Bielefeld University, Germany, e-mail: ipke@techfak.uni-bielefeld.de
Frank Joublin
Honda Research Institute Europe, Offenbach, Germany, e-mail: frank.joublin@honda-ri.de

1 Introduction
Non-verbal expression via gesture is an important feature of social interaction, fre-
quently used by human speakers to emphasize or supplement what they express in
speech. For example, pointing to objects being referred to or giving spatial direc-
tions conveys information that can hardly be encoded solely by speech. Accord-
ingly, humanoid robot companions that are intended to engage in natural and fluent
human-robot interaction must be able to produce speech-accompanying non-verbal
behaviors from conceptual, to-be-communicated information. Forming an integral
part of human communication, hand and arm gestures are primary candidates for
extending the communicative capabilities of social robots.
According to McNeill [13], co-verbal gestures are mostly generated uncon-
sciously and are strongly connected to speech as part of an integrated utterance,
yielding semantic, pragmatic and temporal synchrony between both modalities. This
suggests that gestures are influenced by the communicative intent and by the accom-
panying verbal utterance in various ways. In contrast to task-oriented movements
like reaching or grasping, human gestures are derived to some extent from a kind of
internal representation of shape [8], especially when iconic or metaphoric gestures
are used. The characteristic shape and dynamic properties of gestural movements enable humans to distinguish gestures from subsidiary movements and to perceive them as meaningful [17]. Consequently, the generation of co-verbal gestures for artificial humanoid bodies, e.g., those of virtual agents or robots, demands a high degree of control and flexibility over the shape and timing of the gesture, while ensuring a natural appearance of the movement.
In this paper, we first discuss related work, highlighting the fact that not much
research has so far focused on the generation of robot gesture (Section 2). In Section
3, we describe our multi-modal behavior realizer, the Articulated Communicator
Engine (ACE), which implements the speech-gesture production model originally
designed for the virtual agent Max and is now used for the humanoid robot ASIMO.
We then present a concept for the generation of meaningful arm movements for the
humanoid robot ASIMO based on ACE in Section 4. Finally, we conclude and give
an outlook on future work in Section 5.
2 Related Work
At present, both the generation of robot gesture and the evaluation of its effects are largely unexplored. Traditional robotics has focused mainly on the recognition rather than the synthesis of gesture. In existing cases of gesture synthesis, however, the models typically address object manipulation serving little or no communicative function. Furthermore, gesture generation is often based on prior recognition of perceived gestures, hence the aim is often to imitate these movements. In many
cases in which robot gesture is actually generated with a communicative intent,
these arm movements are not produced at run-time, but are pre-recorded for demon-
stration purposes and are not finely coordinated with speech. Generally, only a few
approaches share any similarities with ours; however, they are mostly realized on less sophisticated platforms with less complex robot bodies (e.g., limited mobility, fewer degrees of freedom (DOF), etc.). One example is the personal robot Maggie
[6] whose aim is to interact with humans in a natural way, so that a peer-to-peer
relationship can be established. For this purpose, the robot is equipped with a set
of pre-defined gestures, but it can also learn some gestures from the user. Another
example of robot gesture is given by the penguin robot Mel [16] which is able to
engage with humans in a collaborative conversation, using speech and gesture to in-
dicate engagement behaviors. However, gestures used in this context are predefined
in a set of action descriptions called the “recipe library”. A further approach is that
of the communication robot Fritz [1], using speech, facial expression, eye-gaze and
gesture to appear livelier while interacting with people. Gestures produced during
interactional conversations are generated on-line and mainly consist of human-like
arm movements and pointing gestures performed with eyes, head, and arms.
As Minato et al. [14] state, not only the behavior but also the appearance of a
robot influences human-robot interaction. Therefore, the importance of the robot’s
design should not be underestimated when the robot is used as a research platform to study the ef-
fect of robot gesture on humans. In general, only a few scientific studies regarding the perception and acceptance of robot gesture have been carried out so far. Much research on how humans perceive robots depending on their appearance and level of embodiment has been conducted by MacDorman and Ishiguro
[12], the latter widely known as the inventor of several android robots. In their test-
ing scenarios with androids, however, non-verbal expression via gesture and gaze
was generally hard-coded and hence pre-defined. Nevertheless, MacDorman and
Ishiguro consider androids a key testing ground for social, cognitive, and neuro-
scientific theories. They argue that androids provide an experimental apparatus that can
be controlled more precisely than any human actor. This is in line with initial re-
sults, indicating that only robots strongly resembling humans can elicit the broad
spectrum of responses that people typically direct toward each other. These findings
highlight the importance of the robot’s design when used as a research platform for
the evaluation of human-robot interaction scenarios.
While still a fairly new area in robotics, the generation of speech-accompanying gesture has already been addressed in various ways within the domain of virtual humanoid agents. Cassell et al. introduced the REA system [2] over a decade ago,
employing a conversational humanoid agent named Rea that plays the role of a real
estate salesperson. A further approach, the BEAT (Behavior Expression Animation
Toolkit) system [3], allows for appropriate and synchronized non-verbal behaviors by predicting the timing of gesture animations from synthesized speech, such that the expressive phase coincides with the prominent syllable. Gibet et al.
generate and animate sign-language from script-like specifications, resulting in a
simulation of fairly natural movement characteristics [4]. However, even in this do-
main most existing systems either neglect the meaning a gesture conveys, or they
simplify matters by using lexicons of words and canned non-verbal behaviors in the
form of pre-produced gestures.
In contrast, the framework underlying the virtual agent Max [9] is geared towards
an integrated architecture in which the planning of both content and form across both modalities is coupled [7], thus doing justice to the meaning conveyed in non-verbal
utterances. According to Reiter and Dale [15], computational approaches to gener-
ating multi-modal behavior can be modeled in terms of three consecutive tasks:
firstly, determining what to convey (i.e., content planning); secondly, determining
how to convey it (i.e., micro-planning); finally, realizing the planned behaviors (i.e.,
surface realization). Although the Articulated Communicator Engine (ACE) itself
operates on the surface realization layer of the generation pipeline, the overall sys-
tem used for Max also provides an integrated content planning and micro-planning
framework [7]. Within the scope of this paper, however, only ACE is considered and
described, since it marks the starting point required for the interface endowing the
robot ASIMO with similar multi-modal behavior.
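To make this division of labor concrete, the following minimal Python sketch outlines the three stages; all function names and the toy data are illustrative assumptions, not part of ACE or the Max system, and only serve to show where a surface realizer such as ACE sits in the pipeline.

```python
# Illustrative sketch of the three-stage generation pipeline after Reiter and
# Dale; function names and data are hypothetical, not ACE's actual API.

def content_planning(goal: dict) -> dict:
    """Determine WHAT to convey (content planning)."""
    return {"referent": "the window", "property": "round"}

def micro_planning(content: dict) -> str:
    """Determine HOW to convey it (micro-planning): words plus affiliated
    gesture form features, serialized here as a MURML-like string."""
    return (f"<specification>{content['referent']} is "
            f"<time id='t1'/>{content['property']}<time id='t2'/></specification>")

def surface_realization(spec: str) -> None:
    """Realize the planned behavior (surface realization), the layer on
    which ACE operates, scheduling synchronized speech and gesture."""
    print("realizing:", spec)

surface_realization(micro_planning(content_planning({"intent": "describe-object"})))
```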
3 An Incremental Model of Speech-Gesture Production
Our approach is based on straightforward descriptions of the designated outer
form of the to-be-communicated multi-modal utterances. For this purpose, we use
MURML [11], the XML-based Multi-modal Utterance Representation Markup Lan-
guage, to specify verbal utterances in combination with co-verbal gestures [9].
These, in turn, are explicitly described in terms of form features (i.e., the target posture of the gesture stroke), specifying their affiliation to dedicated linguistic elements based on matching time identifiers. Fig. 1 shows an example of a MURML
specification which can be used as input for our production model. For more infor-
mation on MURML see [11].
Fig. 1 Example of a MURML specification for multi-modal utterances.
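As the figure itself is not reproduced in this text, the following sketch gives a rough impression of such a specification and of how matching time identifiers tie a gesture to its affiliated words. Only the <time id="..."/> and <affiliate onset="..." end="..."/> tags are taken from the description below; all other element names are assumptions rather than actual MURML syntax, for which see [11].

```python
# Schematic, MURML-like specification; only the <time/> and <affiliate/> tags
# follow the text, the remaining element names are illustrative guesses.
import xml.etree.ElementTree as ET

SPEC = """
<utterance>
  <specification>
    This <time id="t1"/> round <time id="t2"/> window is open.
  </specification>
  <gesture>
    <affiliate onset="t1" end="t2"/>
    <!-- form features of the stroke posture (hand shape, wrist location,
         palm orientation) would be specified here -->
  </gesture>
</utterance>
"""

root = ET.fromstring(SPEC)
time_ids = [t.get("id") for t in root.iter("time")]
affiliate = root.find(".//affiliate")
print("time identifiers in the verbal phrase:", time_ids)
print("gesture stroke affiliated with span:",
      affiliate.get("onset"), "to", affiliate.get("end"))
```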
The concept underlying the multi-modal production model is based on an empir-
ically suggested assumption referred to as segmentation hypothesis [13], according
to which the co-production of continuous speech and gesture is organized in succes-
sive segments. Each of these, in turn, represents a single idea unit which we refer

Towards Meaningful Robot Gesture 5
to as a chunk of speech-gesture production. A given chunk consists of an intona-
tion phrase and a co-expressive gesture phrase, concertedly conveying a prominent
concept [10]. Within a chunk, synchrony is mainly achieved by adapting the gesture to the structure and timing of the speech, while absolute time information is obtained at
phoneme level and used to establish timing constraints for co-verbal gestural move-
ments. Given the MURML specification shown in Fig. 1, the correspondence be-
tween the verbal phrase and the accompanying gesture is established by the <time id="..."/> tag with unique identifier attributes. Accordingly, the beginning and ending of the affiliate gesture is defined using the <affiliate onset="..." end="..."/> tag.
The incremental production of successive coherent chunks is realized by processing
each chunk on a separate ‘blackboard’ running through a sequence of states (Fig. 2).
These states augment the classical two-phase planning-execution procedure with additional phases in which the production processes of subsequent chunks can interact with one another.
Fig. 2 Blackboards run through a sequence of processing states for incremental production of
multi-modal chunks.
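As a minimal illustration of this incremental scheme, the sketch below lets chunks pass through a sequence of blackboard states; the particular state names and the interaction rule (skipping full retraction when a successor chunk is pending) are assumptions for illustration and do not reproduce ACE's actual state set.

```python
# Sketch of incremental chunk processing on per-chunk blackboards; the state
# names are illustrative, not ACE's actual processing states.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class State(Enum):
    PLANNED = 0     # chunk specified (intonation phrase + gesture phrase)
    PREPARED = 1    # gesture timing constrained by phoneme-level speech timing
    EXECUTING = 2   # both modalities running, stroke aligned with its affiliate
    SUBSIDING = 3   # retraction phase
    DONE = 4

@dataclass
class Chunk:
    intonation_phrase: str
    gesture_phrase: str
    state: State = State.PLANNED

def step(chunk: Chunk, successor: Optional[Chunk] = None) -> None:
    """Advance one chunk's blackboard by one state. The subsiding phase is
    where successive chunks interact: if a successor is already planned,
    full retraction is skipped so its preparation can follow on smoothly."""
    if chunk.state is State.SUBSIDING and successor is not None:
        chunk.state = State.DONE
    elif chunk.state is not State.DONE:
        chunk.state = State(chunk.state.value + 1)

# Example: two successive chunks of one utterance.
c1 = Chunk("This window", "iconic: round shape")
c2 = Chunk("is open", "deictic: pointing")
for _ in range(5):
    step(c1, successor=c2)
print(c1.state)   # State.DONE
```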
This approach to gesture motor control is based on a hierarchical concept: Dur-
ing higher-level planning, the motor planner is provided with timed form features
as described in the MURML specification. This information is then transferred to
independent motor control modules. Such a functional-anatomical decomposition
of motor control aims at breaking down the complex control problem into solvable
sub-problems [18]. ACE [10] provides specific motor planning modules, amongst
others, for the arms, the wrists, and the hands, which instantiate local motor pro-
grams (LMPs). These are used to animate required sub-movements and operate
within a limited set of DOFs and over a designated period of time (Fig. 3). For each
limb’s motion, an abstract motor control program (MCP) coordinates and synchro-
nizes the concurrently running LMPs for an overall solution to the control prob-
lem. The overall control framework, however, does not attend to how such sub-
movements are controlled. To allow for an effective interplay of the LMPs within an MCP, the planning modules arrange them into a controller network which defines
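The hierarchical decomposition described in this section, with one abstract motor control program per limb coordinating several concurrently running local motor programs, might be outlined roughly as follows; class names, methods, and DOF labels are illustrative assumptions, not ACE's actual interfaces.

```python
# Structural sketch of the LMP/MCP hierarchy; names and methods are
# illustrative assumptions, not the interfaces used by ACE.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class LocalMotorProgram:
    """Animates one sub-movement over a limited set of DOFs and a designated
    period of time, as derived from the chunk's timing constraints."""
    dofs: List[str]
    start: float   # seconds
    end: float

    def active(self, t: float) -> bool:
        return self.start <= t <= self.end

    def joint_targets(self, t: float) -> Dict[str, float]:
        # Placeholder: a real LMP would interpolate the planned trajectory.
        return {d: 0.0 for d in self.dofs}

@dataclass
class MotorControlProgram:
    """Per-limb coordinator that synchronizes its concurrently running LMPs
    into an overall solution for that limb's control problem."""
    limb: str
    lmps: List[LocalMotorProgram] = field(default_factory=list)

    def step(self, t: float) -> Dict[str, float]:
        targets: Dict[str, float] = {}
        for lmp in self.lmps:
            if lmp.active(t):
                targets.update(lmp.joint_targets(t))
        return targets   # handed on to lower-level robot control

# Example: one MCP for the right arm with two overlapping LMPs.
right_arm = MotorControlProgram("right_arm", [
    LocalMotorProgram(["shoulder_pitch", "elbow_flex"], 0.0, 0.8),
    LocalMotorProgram(["wrist_flex"], 0.4, 0.8),
])
print(right_arm.step(0.5))
```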

References
Cassell, J., Sullivan, J., Prevost, S., Churchill, E. (eds.): Embodied Conversational Agents. MIT Press, Cambridge, MA (2000)
Cassell, J., Vilhjálmsson, H., Bickmore, T.: BEAT: the Behavior Expression Animation Toolkit. In: Proceedings of ACM SIGGRAPH (2001)
McNeill, D.: Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press, Chicago (1992)
Reiter, E., Dale, R.: Building Natural Language Generation Systems. Cambridge University Press, Cambridge (2000)