Towards Meaningful Robot Gesture
Maha Salem, Stefan Kopp, Ipke Wachsmuth, Frank Joublin
Abstract Humanoid robot companions that are intended to engage in natural and
fluent human-robot interaction are supposed to combine speech with non-verbal
modalities for comprehensible and believable behavior. We present an approach to
enable the humanoid robot ASIMO to flexibly produce and synchronize speech and
co-verbal gestures at run-time, while not being limited to a predefined repertoire
of motor action. Since this research challenge has already been tackled in various
ways within the domain of virtual conversational agents, we build upon the ex-
perience gained from the development of a speech and gesture production model
used for our virtual human Max. Being one of the most sophisticated multi-modal
schedulers, the Articulated Communicator Engine (ACE) has replaced the use of
lexicons of canned behaviors with an on-the-spot production of flexibly planned be-
havior representations. As an underlying action generation architecture, we explain
how ACE draws upon a tight, bi-directional coupling of ASIMO’s perceptuo-motor
system with multi-modal scheduling via both efferent control signals and afferent
feedback.
Maha Salem
Research Institute for Cognition and Robotics, Bielefeld University, Germany, e-mail:
msalem@cor-lab.uni-bielefeld.de
Stefan Kopp
Sociable Agents Group, Bielefeld University, Germany, e-mail: skopp@techfak.uni-bielefeld.de
Ipke Wachsmuth
Artificial Intelligence Group, Bielefeld University, Germany, e-mail: ipke@techfak.uni-bielefeld.de
Frank Joublin
Honda Research Institute Europe, Offenbach, Germany, e-mail: frank.joublin@honda-ri.de

1 Introduction
Non-verbal expression via gesture is an important feature of social interaction, fre-
quently used by human speakers to emphasize or supplement what they express in
speech. For example, pointing to objects being referred to or giving spatial direc-
tions conveys information that can hardly be encoded solely by speech. Accord-
ingly, humanoid robot companions that are intended to engage in natural and fluent
human-robot interaction must be able to produce speech-accompanying non-verbal
behaviors from conceptual, to-be-communicated information. Forming an integral
part of human communication, hand and arm gestures are primary candidates for
extending the communicative capabilities of social robots.
According to McNeill [13], co-verbal gestures are mostly generated uncon-
sciously and are strongly connected to speech as part of an integrated utterance,
yielding semantic, pragmatic and temporal synchrony between both modalities. This
suggests that gestures are influenced by the communicative intent and by the accom-
panying verbal utterance in various ways. In contrast to task-oriented movements
like reaching or grasping, human gestures are derived to some extent from a kind of
internal representation of shape [8], especially when iconic or metaphoric gestures
are used. The characteristic shape and dynamic properties of gestural movements enable humans to distinguish gestures from subsidiary movements and to perceive them as meaningful [17]. Consequently, the generation of co-verbal gestures for artificial humanoid bodies, e.g., those of virtual agents or robots, demands a high degree of control and flexibility over the shape and timing of the gesture, while ensuring a natural appearance of the movement.
In this paper, we first discuss related work, highlighting the fact that not much
research has so far focused on the generation of robot gesture (Section 2). In Section
3, we describe our multi-modal behavior realizer, the Articulated Communicator
Engine (ACE), which implements the speech-gesture production model originally
designed for the virtual agent Max and is now used for the humanoid robot ASIMO.
We then present a concept for the generation of meaningful arm movements for the
humanoid robot ASIMO based on ACE in Section 4. Finally, we conclude and give
an outlook on future work in Section 5.
2 Related Work
At present, both the generation of robot gesture and the evaluation of its effects are largely unexplored. Traditional robotics has focused mainly on the recognition rather than the synthesis of gesture. In existing cases of gesture synthesis, however, the models typically address object manipulation serving little or no communicative function. Furthermore, gesture generation is often based on prior recognition of perceived gestures, hence the aim is often to imitate these movements. In many
cases in which robot gesture is actually generated with a communicative intent,
these arm movements are not produced at run-time, but are pre-recorded for demon-
stration purposes and are not finely coordinated with speech. Generally, only a few
approaches share any similarities with ours; however, they are mostly realized on less sophisticated platforms with less complex robot bodies (e.g., limited mobility, fewer degrees of freedom (DOF), etc.). One example is the personal robot Maggie
[6] whose aim is to interact with humans in a natural way, so that a peer-to-peer
relationship can be established. For this purpose, the robot is equipped with a set
of pre-defined gestures, but it can also learn some gestures from the user. Another
example of robot gesture is given by the penguin robot Mel [16] which is able to
engage with humans in a collaborative conversation, using speech and gesture to in-
dicate engagement behaviors. However, gestures used in this context are predefined
in a set of action descriptions called the “recipe library”. A further approach is that
of the communication robot Fritz [1], using speech, facial expression, eye-gaze and
gesture to appear livelier while interacting with people. Gestures produced during
interactional conversations are generated on-line and mainly consist of human-like
arm movements and pointing gestures performed with eyes, head, and arms.
As Minato et al. [14] state, not only the behavior but also the appearance of a
robot influences human-robot interaction. Therefore, the importance of the robot’s
design should not be underestimated when the robot is used as a research platform to study the ef-
fect of robot gesture on humans. In general, only a few scientific studies regarding the perception and acceptance of robot gesture have been carried out so far. Much research on how humans perceive robots depending on their appearance and level of embodiment has been conducted by MacDorman and Ishiguro
[12], the latter widely known as the inventor of several android robots. In their test-
ing scenarios with androids, however, non-verbal expression via gesture and gaze
was generally hard-coded and hence pre-defined. Nevertheless, MacDorman and
Ishiguro consider androids a key testing ground for social, cognitive, and neuro-
scientific theories. They argue that androids provide an experimental apparatus that can
be controlled more precisely than any human actor. This is in line with initial re-
sults, indicating that only robots strongly resembling humans can elicit the broad
spectrum of responses that people typically direct toward each other. These findings
highlight the importance of the robot’s design when used as a research platform for
the evaluation of human-robot interaction scenarios.
While still a fairly new area in robotics, the generation of speech-accompanying gesture has already been addressed in various ways within the domain of virtual humanoid agents. Cassell et al. introduced the REA system [2] over a decade ago,
employing a conversational humanoid agent named Rea that plays the role of a real
estate salesperson. A further approach, the BEAT (Behavior Expression Animation
Toolkit) system [3], allows for appropriate and synchronized non-verbal behaviors by predicting the timing of gesture animations from synthesized speech, such that the expressive phase coincides with the prominent syllable. Gibet et al.
generate and animate sign-language from script-like specifications, resulting in a
simulation of fairly natural movement characteristics [4]. However, even in this do-
main most existing systems either neglect the meaning a gesture conveys, or they
simplify matters by using lexicons of words and canned non-verbal behaviors in the
form of pre-produced gestures.
In contrast, the framework underlying the virtual agent Max [9] is geared towards
an integrated architecture in which the planning of both content and form across both modalities is coupled [7], thus doing justice to the meaning conveyed in non-verbal
utterances. According to Reiter and Dale [15], computational approaches to gener-
ating multi-modal behavior can be modeled in terms of three consecutive tasks:
firstly, determining what to convey (i.e., content planning); secondly, determining
how to convey it (i.e., micro-planning); finally, realizing the planned behaviors (i.e.,
surface realization). Although the Articulated Communicator Engine (ACE) itself
operates on the surface realization layer of the generation pipeline, the overall sys-
tem used for Max also provides an integrated content planning and micro-planning
framework [7]. Within the scope of this paper, however, only ACE is considered and
described, since it marks the starting point required for the interface endowing the
robot ASIMO with similar multi-modal behavior.
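To make this division of labor concrete, the following minimal Python sketch outlines the three stages; all function names and the toy data are illustrative assumptions, not part of ACE or the Max system, and only serve to show where a surface realizer such as ACE sits in the pipeline.

```python
# Illustrative sketch of the three-stage generation pipeline after Reiter and
# Dale; function names and data are hypothetical, not ACE's actual API.

def content_planning(goal: dict) -> dict:
    """Determine WHAT to convey (content planning)."""
    return {"referent": "the window", "property": "round"}

def micro_planning(content: dict) -> str:
    """Determine HOW to convey it (micro-planning): words plus affiliated
    gesture form features, serialized here as a MURML-like string."""
    return (f"<specification>{content['referent']} is "
            f"<time id='t1'/>{content['property']}<time id='t2'/></specification>")

def surface_realization(spec: str) -> None:
    """Realize the planned behavior (surface realization), the layer on
    which ACE operates, scheduling synchronized speech and gesture."""
    print("realizing:", spec)

surface_realization(micro_planning(content_planning({"intent": "describe-object"})))
```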
3 An Incremental Model of Speech-Gesture Production
Our approach is based on straightforward descriptions of the designated outer
form of the to-be-communicated multi-modal utterances. For this purpose, we use
MURML [11], the XML-based Multi-modal Utterance Representation Markup Lan-
guage, to specify verbal utterances in combination with co-verbal gestures [9].
These, in turn, are explicitly described in terms of form features (i.e., the target posture of the gesture stroke), specifying their affiliation to dedicated linguistic elements based on matching time identifiers. Fig. 1 shows an example of a MURML
specification which can be used as input for our production model. For more infor-
mation on MURML see [11].
Fig. 1 Example of a MURML specification for multi-modal utterances.
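As the figure itself is not reproduced in this text, the following sketch gives a rough impression of such a specification and of how matching time identifiers tie a gesture to its affiliated words. Only the <time id="..."/> and <affiliate onset="..." end="..."/> tags are taken from the description below; all other element names are assumptions rather than actual MURML syntax, for which see [11].

```python
# Schematic, MURML-like specification; only the <time/> and <affiliate/> tags
# follow the text, the remaining element names are illustrative guesses.
import xml.etree.ElementTree as ET

SPEC = """
<utterance>
  <specification>
    This <time id="t1"/> round <time id="t2"/> window is open.
  </specification>
  <gesture>
    <affiliate onset="t1" end="t2"/>
    <!-- form features of the stroke posture (hand shape, wrist location,
         palm orientation) would be specified here -->
  </gesture>
</utterance>
"""

root = ET.fromstring(SPEC)
time_ids = [t.get("id") for t in root.iter("time")]
affiliate = root.find(".//affiliate")
print("time identifiers in the verbal phrase:", time_ids)
print("gesture stroke affiliated with span:",
      affiliate.get("onset"), "to", affiliate.get("end"))
```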
The concept underlying the multi-modal production model is based on an empir-
ically suggested assumption referred to as segmentation hypothesis [13], according
to which the co-production of continuous speech and gesture is organized in succes-
sive segments. Each of these, in turn, represents a single idea unit which we refer

Towards Meaningful Robot Gesture 5
to as a chunk of speech-gesture production. A given chunk consists of an intona-
tion phrase and a co-expressive gesture phrase, concertedly conveying a prominent
concept [10]. Within a chunk, synchrony is mainly achieved by adapting the gesture to the structure and timing of the speech, while absolute time information is obtained at
phoneme level and used to establish timing constraints for co-verbal gestural move-
ments. Given the MURML specification shown in Fig. 1, the correspondence be-
tween the verbal phrase and the accompanying gesture is established by the <time id="..."/> tag with unique identifier attributes. Accordingly, the beginning and ending of the affiliate gesture is defined using the <affiliate onset="..." end="..."/> tag.
The incremental production of successive coherent chunks is realized by processing
each chunk on a separate ‘blackboard’ running through a sequence of states (Fig. 2).
These states augment the classical two-phase planning-execution procedure with additional phases in which the production processes of subsequent chunks can interact with one another.
Fig. 2 Blackboards run through a sequence of processing states for incremental production of
multi-modal chunks.
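As a minimal illustration of this incremental scheme, the sketch below lets chunks pass through a sequence of blackboard states; the particular state names and the interaction rule (skipping full retraction when a successor chunk is pending) are assumptions for illustration and do not reproduce ACE's actual state set.

```python
# Sketch of incremental chunk processing on per-chunk blackboards; the state
# names are illustrative, not ACE's actual processing states.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class State(Enum):
    PLANNED = 0     # chunk specified (intonation phrase + gesture phrase)
    PREPARED = 1    # gesture timing constrained by phoneme-level speech timing
    EXECUTING = 2   # both modalities running, stroke aligned with its affiliate
    SUBSIDING = 3   # retraction phase
    DONE = 4

@dataclass
class Chunk:
    intonation_phrase: str
    gesture_phrase: str
    state: State = State.PLANNED

def step(chunk: Chunk, successor: Optional[Chunk] = None) -> None:
    """Advance one chunk's blackboard by one state. The subsiding phase is
    where successive chunks interact: if a successor is already planned,
    full retraction is skipped so its preparation can follow on smoothly."""
    if chunk.state is State.SUBSIDING and successor is not None:
        chunk.state = State.DONE
    elif chunk.state is not State.DONE:
        chunk.state = State(chunk.state.value + 1)

# Example: two successive chunks of one utterance.
c1 = Chunk("This window", "iconic: round shape")
c2 = Chunk("is open", "deictic: pointing")
for _ in range(5):
    step(c1, successor=c2)
print(c1.state)   # State.DONE
```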
This approach to gesture motor control is based on a hierarchical concept: Dur-
ing higher-level planning, the motor planner is provided with timed form features
as described in the MURML specification. This information is then transferred to
independent motor control modules. Such a functional-anatomical decomposition
of motor control aims at breaking down the complex control problem into solvable
sub-problems [18]. ACE [10] provides specific motor planning modules, amongst
others, for the arms, the wrists, and the hands, which instantiate local motor pro-
grams (LMPs). These are used to animate required sub-movements and operate
within a limited set of DOFs and over a designated period of time (Fig. 3). For each
limb’s motion, an abstract motor control program (MCP) coordinates and synchro-
nizes the concurrently running LMPs for an overall solution to the control prob-
lem. The overall control framework, however, does not attend to how such sub-
movements are controlled. To allow for an effective interplay of the LMPs within an MCP, the planning modules arrange them into a controller network which defines
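The hierarchical decomposition described in this section, with one abstract motor control program per limb coordinating several concurrently running local motor programs, might be outlined roughly as follows; class names, methods, and DOF labels are illustrative assumptions, not ACE's actual interfaces.

```python
# Structural sketch of the LMP/MCP hierarchy; names and methods are
# illustrative assumptions, not the interfaces used by ACE.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class LocalMotorProgram:
    """Animates one sub-movement over a limited set of DOFs and a designated
    period of time, as derived from the chunk's timing constraints."""
    dofs: List[str]
    start: float   # seconds
    end: float

    def active(self, t: float) -> bool:
        return self.start <= t <= self.end

    def joint_targets(self, t: float) -> Dict[str, float]:
        # Placeholder: a real LMP would interpolate the planned trajectory.
        return {d: 0.0 for d in self.dofs}

@dataclass
class MotorControlProgram:
    """Per-limb coordinator that synchronizes its concurrently running LMPs
    into an overall solution for that limb's control problem."""
    limb: str
    lmps: List[LocalMotorProgram] = field(default_factory=list)

    def step(self, t: float) -> Dict[str, float]:
        targets: Dict[str, float] = {}
        for lmp in self.lmps:
            if lmp.active(t):
                targets.update(lmp.joint_targets(t))
        return targets   # handed on to lower-level robot control

# Example: one MCP for the right arm with two overlapping LMPs.
right_arm = MotorControlProgram("right_arm", [
    LocalMotorProgram(["shoulder_pitch", "elbow_flex"], 0.0, 0.8),
    LocalMotorProgram(["wrist_flex"], 0.4, 0.8),
])
print(right_arm.step(0.5))
```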

References
Cassell, J., Sullivan, J., Prevost, S., Churchill, E. (eds.): Embodied Conversational Agents. MIT Press, Cambridge, MA (2000)
Cassell, J., Vilhjálmsson, H., Bickmore, T.: BEAT: the Behavior Expression Animation Toolkit. In: Proceedings of ACM SIGGRAPH (2001)
McNeill, D.: Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press, Chicago (1992)
Reiter, E., Dale, R.: Building Natural Language Generation Systems. Cambridge University Press, Cambridge (2000)