
A Clickable World: Behavior Selection Through Pointing and Context
for Mobile Manipulation
Hai Nguyen, Advait Jain, Cressel Anderson, Charles C. Kemp
Abstract We present a new behavior selection system for
human-robot interaction that maps virtual buttons overlaid
on the physical environment to the robot’s behaviors, thereby
creating a clickable world. The user clicks on a virtual button
and activates the associated behavior by briefly illuminating
a corresponding 3D location with an off-the-shelf green laser
pointer. As we have described in previous work, the robot
can detect this click and estimate its 3D location using an
omnidirectional camera and a pan/tilt stereo camera. In this
paper, we show that the robot can select the appropriate
behavior to execute using the 3D location of the click, the
context around this 3D location, and its own state. For this
work, the robot performs this selection process using a cascade
of classifiers.
We demonstrate the efficacy of this approach with an assistive
object-fetching application. Through empirical evaluation, we
show that the 3D location of the click, the state of the robot,
and the surrounding context are sufficient for the robot to choose the correct behavior from a set of behaviors and perform the following tasks: pick up a designated object from a floor or
table, deliver an object to a designated person, place an object
on a designated table, go to a designated location, and touch a
designated location with its end effector.
I. INTRODUCTION
For assistive robots, the ability to correctly interpret user commands is advantageous for performing useful services. Many methods have been proposed for human-robot interaction, but none has so far been adopted extensively.
Interfaces based on the traditional WIMP (windows, icons,
menus, pointers) model are often criticized as being an
unnatural mode for interaction, while natural interfaces based
on speech or gestures are themselves plagued by performance
problems in realistic environments. To cope with these dif-
ficulties, we present a new human-robot interaction system
for which the physical world is viewed as having overlaid
virtual buttons that trigger robotic behaviors when clicked
by the user.
In general, these virtual buttons can be clicked by provid-
ing a 3D location to the robot. For this work, the user clicks
these virtual buttons using an uninstrumented laser pointer.
As we have previously described in [8], our robot El-E has a
laser-pointer interface that detects when a user illuminates a
location in the environment and estimates its 3D location.
We previously validated this approach in the context of
object grasping and a preliminary object-fetching application
[10]. Within this paper we generalize this approach to form
a clickable world interface and demonstrate its efficacy in
Charles C. Kemp is with the Faculty of Biomedical Engineering at
Georgia Tech charlie.kemp@bme.gatech.edu
Fig. 1. A clickable world interface enables a user to trigger appropriate
robotic behaviors by clicking on virtual buttons using a laser-pointer.
the context of a full assistive object-fetching application
designed for motor-impaired individuals.
We first discuss the relationship to previous works in
Section II. Then, in Sections III and IV, we describe our
robot along with details of the clickable world interface as
it applies to assistive robots. To evaluate the effectiveness
of the system at selecting appropriate behaviors, we present
experiments and associated results in Sections V and VI.
Finally, we close with concluding remarks.
II. RELATED WORK
Several other examples of intelligent pointing devices
exist, such as Patel and Abowd’s iCam augmented reality
system [12]. In this work, users could virtually annotate an
environment using a handheld computer containing a laser
pointer, camera, and sensors that determined the computer’s
position relative to a localization system installed in the
environment. The XWand [15] and WorldCursor [14], developed at Microsoft Research, allow people to select locations in the environment. The XWand is a wand-like device that enables the user to point at an object in the environment and control it using gestures and voice commands. For example, lights can be turned on and off by pointing at the switch and saying “turn on” or “turn off”, and a media player can be controlled by pointing at it and giving spoken commands such as “volume up”, “play”, etc. This work is similar in spirit to ours: the object to be acted upon is selected using the XWand, and a simple command specifies what task is to be performed. For our work, having a robot perform tasks avoids the need for specialized, networked, computer-operated, intelligent devices.

[Figure 2: two decision trees branching on Human?, Floor Height? (floor vs. elevated surface), and Table? (elevated vs. vertical surface) into the behaviors Deliver, Move to Point, Place on Table, Pick Up From Floor, Pick Up From Table, Touch, and Future Work.]
Fig. 2. Top: El-E's decision process for mapping from sensory inputs to behaviors to execute. Bottom: Corresponding sensory input to behavior mapping in the state where the robot is holding an object.
A robot also has the potential to interact with any physical interface or object in addition to electronic interfaces. Moreover, unlike these systems, our
clickable world interface is fully portable with the robot
and does not require a model of the environment or any
modifications to the environment.
Torralba [13] describes a method of object recognition based on contextual information. Little [9] discusses learning spatial configurations of objects; for example, cups and plates are typically found next to each other on a table. He discusses how this knowledge, combined with the shape and appearance of objects, can be used for object recognition, and he motivates the importance of connecting spatial and semantic information. Such work in computer vision is relevant to our
system because the identity of an object limits the set of
robot behaviors that are applicable for that particular object.
The motivation behind the clickable world is that the click
allows the user to specify where in the world the robot should
perform a task. The robot can then use context to infer the
task that the user wants the robot to perform.
In robotics, work by Dune [5], [6] describes a visual servoing mechanism for a system that enables users to click on the image of an object in a wide-angle camera and have a camera mounted on a robot arm point at the object. Classic work modeling the mechanisms and structures used for cognition, such as ACT-R [2], is related to our work in that these systems can also be used for determining the correct behavior to execute given some input. However, our approach does not attempt to model human-level cognition or reasoning.
It is conceivable that a clickable world interface could
make use of eye gaze, pointing with the hand, and other
natural gestures. There has been extensive research in these
areas [16], [4]. Some of these systems are designed with sim-
ilar objectives to the clickable world interface, but in contrast
to the laser-pointer interface, these methods currently do not
have the ability to provide a suitably accurate 3D location.
III. CLICKABLE WORLD FRAMEWORK
In the behavior-based robotics framework [3], robot behaviors can be viewed abstractly as mappings from stimuli, S, to motor responses, R:

    β : S ↦ R    (1)

When there are multiple behaviors or sets of behaviors from which to choose, creating a mapping from stimuli to behaviors can become a challenge. With our clickable world interface, we posit that a location in the world can be a powerful cue for user-directed behavior selection. In our clickable world interface, each 3D location, p, provided by the laser-pointer interface is mapped to a behavior, β_i, executable by the robot:

    f : p ↦ β_i    (2)
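To make the notation concrete, the following minimal sketch (ours, not the authors' code) treats a behavior β as a callable from a stimulus to a motor response and f as a selection function that maps a clicked 3D point, together with assumed context cues, to one of the robot's behaviors; the chosen behavior then receives the same point as its parameter.

from typing import Callable

Point3D = tuple[float, float, float]     # the clicked 3D location p
Behavior = Callable[[Point3D], None]     # beta : S -> R, executed for its motor response

def f(p: Point3D, context: dict, behaviors: dict[str, Behavior]) -> Behavior:
    """f : p -> beta_i.  The context keys used here ('surface', 'object_present')
    are assumed stand-ins for the cues described in Section IV."""
    if context.get("surface") == "floor" and context.get("object_present"):
        return behaviors["Grasp_On_Floor"]
    return behaviors["Follow_Laserpoint"]

# The 3D location plays two roles: it selects the behavior and parameterizes it:
#   beta = f(p, context, behaviors)
#   beta(p)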
For the examples we describe in this paper, this 3D
location serves two roles. First, the robot uses the 3D location
and contextual information around the 3D location to select
and execute the appropriate behavior, thereby implementing
the mapping of equation 2. Consequently, by giving a 3D location to the robot, the user commands the robot to execute
a desired behavior. Second, the selected behavior uses the
3D location as a parameter. This 3D location is often critical
to the behavior, such as when it tells the robot where to
move or where to manipulate. Within our implementation,
these two distinct roles of selecting behaviors and providing
parameters to behaviors are intertwined. For example, the
robot often moves towards a location in order to better assess
the surrounding context and thereby distinguish which of
several behaviors to execute.
As we demonstrate in this paper, the power of this
approach as a user interface derives from the intuitive re-
lationship between a location and a mobile manipulation
behavior. When acquiring an object, the location of an
object is sufficient to tell the robot which object to pick
up. Likewise, when delivering an object, the location for
delivery is sufficient to tell the robot where the object should

(a) El-E (b) Camera system
Fig. 3. (a) An image of the entire mobile manipulator with the integrated
interface system. (b) The laser pointer interface is integrated into the robot’s
head. It consists of an omnidirectional camera (bottom half) and a pan/tilt
stereo camera (top half).
be delivered and the manner in which it should be delivered.
Furthermore, if a user wishes to have the robot manipulate a
fixed part of the environment, such as a door handle or a light
switch, the location of the manipulable device is sufficient
to command the robot to reach out and make contact with
it.
Since mobile manipulation activities typically involve
task-relevant locations with which the robot makes contact
either directly or indirectly, we expect that this type of
interface will extend to a wide variety of activities. For
example, when acquiring objects the location specifies the
object with which the end effector should make contact,
and when delivering objects the location specifies where the
object held by the robot should make contact.
Within our system, the user-selected 3D point is given to a behavior selection mechanism that uses a manually constructed cascade of classifiers to decide which behavior to execute (Figure 2). Each module in this cascade results in a difficult-to-reverse change in the robot and world state as the robot attempts to collect more information and thereby disambiguate the command.
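The cascade can be pictured as an ordered sequence of stages, each of which may first move the robot (a change that is hard to undo) to gather more information before testing whether its behavior applies. The sketch below is our own simplified rendering of that structure under assumed names, not the authors' implementation.

from typing import Callable, Optional

# Each stage: (gather, test, behavior_name).  `gather` may move the robot to
# collect more context; `test` decides whether this stage's behavior applies.
Stage = tuple[Callable[[], None], Callable[[], bool], str]

def run_cascade(stages: list[Stage]) -> Optional[str]:
    """Return the name of the first behavior whose stage accepts the click."""
    for gather, test, behavior_name in stages:
        gather()        # e.g. drive closer to the clicked location
        if test():      # e.g. is there a face / table edge / object here?
            return behavior_name
    return None         # no stage matched; the command remains ambiguous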
IV. IMPLEMENTATION
The robot, El-E, is a mobile manipulator with a 5-DoF
Neuronics Katana 6M arm mounted on a linear actuator
which sits on top of a Videre Erratic mobile base (Figure
3(a)). In addition to its head, which is specially designed for detecting laser pointers (Figure 3(b); described more extensively in [8]), El-E has a color camera on its end effector
and a URG laser range finder on the linear actuator carriage
that also contains the manipulator.
Fig. 4. In the starting configuration, users can select object buttons on
the ground and table to be picked up by the robot. Drive-to commands can
be given by clicking on ground buttons in areas where there are no objects
present. In addition, the robot can be instructed to touch a location when
the user selects points on vertical surfaces representing buttons on walls.
The specially designed hardware and software system
forming the laser-pointer interface enables El-E to detect
when a user illuminates a location with a green laser pointer
and estimate the 3D location selected by this point and click.
Given this 3D location and its immediate context obtained
through the robot’s sensors, our system activates the correct
behavior to carry out desired user commands.
We now describe El-E's decision process after the user clicks a button in the world. The set of behaviors from which El-E can choose is first determined by the current state of the robot: either the robot is free to grasp an object (Free To Grasp) or it has an object in its gripper (Object In Gripper).
For each behavior, if the given 3D location is initially further than a threshold distance away, El-E drives towards it but stops before the point is within manipulable range and requests another laser detection. This two-step process was created to reduce the error in the estimated 3D location of the designated point since, for stereo ranging, the triangulation error increases with distance from the camera.
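A minimal sketch of this two-step process follows; it is our own illustration, and the robot and laser-interface methods as well as the threshold value are assumptions, since the paper does not specify them.

import math

APPROACH_THRESHOLD_M = 1.0   # assumed value; the text states only that a threshold is used

def planar_distance(robot_xy, p):
    """Distance from the robot to the click, projected onto the ground plane."""
    return math.hypot(p[0] - robot_xy[0], p[1] - robot_xy[1])

def acquire_click(robot, laser_interface):
    """Drive toward a distant click, then request a second, closer-range detection."""
    p = laser_interface.wait_for_click()                # first 3D estimate
    if planar_distance(robot.position_xy(), p) > APPROACH_THRESHOLD_M:
        robot.drive_toward(p, stop_short=True)          # stop short of manipulable range
        p = laser_interface.wait_for_click()            # stereo error grows with range, so re-detect
    return p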
A. Robot State Free To Grasp
In this state the robot does not have an object in its gripper.
During this mode users are able to click on the following
buttons: objects on the floor, empty locations on the floor,
objects on the table and wall locations. When activated, these
buttons (illustrated in Figure 4 and 5) trigger the following
associated behaviors: Grasp On Floor, Follow Laserpoint,
Grasp On Table, Reachout And Touch .
To determine which button was activated, and thus which behavior to execute, our robot uses the decision process summarized in Figure 2. In more detail, El-E first determines whether the click was on a floor button by checking the height of the laser point. El-E assumes that the
floor is flat and that its base sits on the floor. If the height
of the laser point is less than a threshold height (currently

Fig. 5. Each row shows the user clicking a button and the robot selecting
the appropriate behavior when the robot is in the Free To Grasp state. Top
to bottom: Grasp On Floor, Reachout And Touch, Grasp On Table
30 cm) above the assumed floor height, El-E perceives it
as a floor button. It then drives towards the laser point. If
there is an object in the close vicinity of the click, the robot picks it up. If the human has pointed at an empty location on the floor, the robot moves to the selected location without grasping anything. El-E uses the 3D location of the button and the local context (presence or absence of an object) to select between the Grasp On Floor and Follow Laserpoint behaviors. For the Follow Laserpoint behavior, no further information is necessary, so the robot simply drives towards the user-indicated location.
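This floor-button branch can be summarized as follows (a sketch under our own naming; the 30 cm threshold comes from the text, while the object-presence test is the local-context cue discussed in subsection C below):

from typing import Optional

FLOOR_HEIGHT_THRESHOLD_M = 0.30   # 30 cm above the assumed floor plane, per the text

def free_to_grasp_floor_branch(click_height_m: float, object_present: bool) -> Optional[str]:
    """Free_To_Grasp state: distinguish Grasp_On_Floor from Follow_Laserpoint."""
    if click_height_m >= FLOOR_HEIGHT_THRESHOLD_M:
        return None     # not a floor button; defer to the table/wall classifier
    return "Grasp_On_Floor" if object_present else "Follow_Laserpoint"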
If the height of the button is greater than the threshold
height (again, 30 cm), the robot moves closer to the selected
point and then calls a classifier to determine whether the
selected button is a horizontal surface (table) or a vertical
surface (wall). This classifier works in a manner similar to
the table detector described in [10].
The classifier first takes a rectangle of range readings
around the user-selected 3D location. It then calculates the
differences between horizontal scan lines in this rectangle,
finds the maximum difference, and classifies the input as a table if this maximum difference is above a threshold. More intuitively, if there is a large difference between two range readings at adjacent heights, our classifier considers that location to be a table, since only horizontal surfaces parallel
to the scanning plane of the laser range finder would be
likely to cause such a sudden change in the amount of free
space perceived. The current implementation of the detector
classifies a range rectangle as a vertical surface (wall) if it
is not classified as a horizontal surface (table).
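The scan-line test can be sketched as below; this is our reconstruction from the description above, and the threshold value and the exact shape of the range rectangle are assumptions.

import numpy as np

TABLE_DIFF_THRESHOLD_M = 0.15   # assumed value; the text states only that a threshold is used

def classify_surface(range_rect: np.ndarray) -> str:
    """Classify a rectangle of range readings around the click as 'table' or 'wall'.
    range_rect has one row per scan height (the laser range finder is swept
    vertically by the linear actuator) and one column per bearing."""
    diffs = np.abs(np.diff(range_rect, axis=0))   # differences between adjacent horizontal scan lines
    if diffs.max() > TABLE_DIFF_THRESHOLD_M:
        return "table"   # a sudden jump in perceived free space marks a horizontal surface edge
    return "wall"        # anything not classified as a table is treated as a vertical surface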
If the selected button is classified as a table, El-E selects
Fig. 6. When El-E has an object in its gripper (green circle), users can select
from buttons representing a person (yellow shape), an elevated horizontal
surface (blue shape) or location on the floor (orange circles). Selecting a
person causes the robot to hand over the object to that person. Selecting elevated surfaces causes the robot to drop off the object. Selecting locations
on the floor moves the robot to the laser pointer’s position.
the Grasp On Table behavior. This behavior uses the laser
range finder mounted on a linear actuator to determine
the exact height of the table edge and then grasps the
selected object. More details on both the Grasp On Floor
and Grasp On Table behavior can be found in [10].
Finally, if the classifier reports that the clicked button is a
wall, the robot executes the Reachout And Touch behavior.
In this mode, the robot drives towards the point and orients itself so that it is perpendicular to the wall and facing the given 3D location. With its manipulator, El-E then reaches out to touch the selected point on the wall. To orient the robot, we perform a least-squares line fit to find the line representing the plane of the wall in the ranging information returned by El-E's laser range finders. This behavior serves as a plausible precursor for future behaviors such as operating
a light switch or opening a door, which would require that
the robot reach out and make contact.
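The orientation step can be sketched as a least-squares line fit to the planar range points returned from the wall, from which a heading perpendicular to the wall is derived; this is our illustration, and the frame conventions are assumptions.

import numpy as np

def heading_perpendicular_to_wall(wall_points_xy: np.ndarray) -> float:
    """Fit a line to 2D laser points on the wall and return the heading (radians)
    of the wall normal, i.e. the direction the robot should face."""
    x, y = wall_points_xy[:, 0], wall_points_xy[:, 1]
    # Fit in whichever parameterization is better conditioned; a wall nearly
    # parallel to the y axis would make y = m*x + b degenerate.
    if np.ptp(x) >= np.ptp(y):
        m, _ = np.polyfit(x, y, 1)                    # y = m*x + b
        direction = np.array([1.0, m])
    else:
        m, _ = np.polyfit(y, x, 1)                    # x = m*y + b
        direction = np.array([m, 1.0])
    normal = np.array([-direction[1], direction[0]])  # rotate the wall direction by 90 degrees
    # The sign of the normal may need to be flipped so the robot faces toward the wall.
    return float(np.arctan2(normal[1], normal[0]))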
B. Robot State Object In Gripper
Figure 2 shows the robot’s decision process when it has
an object in its gripper. The set of buttons which the user
can click and their associated behaviors are: directly in front
of a person (Deliver To Person), an empty location on the
floor (Follow Laserpoint), and a table (Deliver To Table).
El-E first determines whether the click was on a human button, and should therefore trigger the Deliver To Person behavior, by checking for a face in a 3D gravity-oriented cylinder around the laser point. To detect the face, we used the Viola-Jones frontal face detector as implemented in OpenCV [11], in a manner similar to Edsinger et al. [7]. If a face is detected, the robot drives towards the person, extends its hand so that the object can be grasped by the person, and releases the object after a preset amount of time.
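A sketch of this face check is given below. It is our own illustration: the Haar cascade is OpenCV's implementation of the Viola-Jones detector named in the text, but the cascade file, the cylinder radius, and the availability of 3D face position estimates (e.g. from the stereo head) are assumptions.

import cv2
import numpy as np

# The cascade file is an assumption; any OpenCV frontal-face Haar cascade would do.
FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(bgr_image: np.ndarray):
    """Run the Viola-Jones frontal face detector; returns (x, y, w, h) boxes."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    return FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)

def face_in_gravity_cylinder(face_positions_xyz, click_xyz, radius_m=0.5) -> bool:
    """True if any estimated 3D face position lies inside a gravity-aligned (vertical)
    cylinder of the given radius centered on the clicked point."""
    cx, cy = click_xyz[0], click_xyz[1]
    return any(np.hypot(fx - cx, fy - cy) <= radius_m
               for fx, fy, _fz in face_positions_xyz)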
If a face is not detected, the robot chooses the Follow Laserpoint behavior if the laser point's height is less than a threshold (30 cm). Finally, if the height of the laser

Fig. 7. Each row shows the user clicking a button and the robot selecting
the appropriate behavior when the robot is in the Object In Gripper state.
Top to bottom: Deliver To Table, Deliver To Person
point is greater than the given threshold, the robot calls the
same classifier as described in the previous sub-section to
distinguish between a table and a vertical flat surface. If
the user had clicked the table button, the robot selects the
Deliver To Table behavior and places the object on the table.
C. Local context in the decision process
In addition to the 3D coordinate of the click and the state
of the robot, the decision processes described in the previous
two sub-sections utilize the local context around the click to
select the appropriate behavior. In a situation with very few
behaviors and buttons, such as Follow Laserpoint, only the 3D location of the click may be required. But as the number of behaviors or the complexity of the buttons increases, local context information must be added to select the correct behavior.
For example, to recognize a click on a human button, El-
E looks to see if there is a face in the vicinity of the click.
Similarly, a click on an object on the floor is recognized by
both the height of the laser point and the presence of an
obstacle near the click (detected using a laser range finder
which can scan across the surface). The buttons which require additional context information are the vertical surface and table buttons. The classifier used to distinguish between a vertical and a horizontal surface requires a three-dimensional depth map of a rectangular region around the click. The depth map is obtained using the laser range finder mounted on the linear actuator.
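As an example of such a local-context cue, the presence of an object near the click can be tested directly on a planar laser scan; the sketch below is ours, with the radius and point-count threshold as assumptions.

import numpy as np

def object_near_click(scan_points_xy: np.ndarray, click_xy,
                      radius_m: float = 0.15, min_points: int = 5) -> bool:
    """True if enough laser returns fall within radius_m of the click's ground-plane
    position, indicating an object (rather than empty floor) at the clicked location."""
    d = np.hypot(scan_points_xy[:, 0] - click_xy[0],
                 scan_points_xy[:, 1] - click_xy[1])
    return int(np.count_nonzero(d <= radius_m)) >= min_points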
D. Physical Extent of World Buttons
Each button that can be activated by the user in our inter-
face has a physical extent that is determined largely by the
classifier that is used to recognize it. The buttons that signify grasping commands (green in Fig. 4) are approximately the size of a circle placed on the floor, centered on the object, with radius t, where t is a threshold specified by the interface
designer. The robot will only pick up an object if the object
is within distance t of the 3D location selected by the user.
If two object buttons overlap, then the robot will pick up the
Fig. 8. Floor buttons that the robot is supposed to drive to. Since the robot
drives so that the front of it just barely touches the location, we placed the
targeting mark outside the square that the robot has to sit on.
closest object, which results in a straight edge separating the
two buttons much like a Voronoi cell.
With the human buttons (yellow in Fig. 4), the system uses a distance threshold to the closest detected face, effectively creating cylindrical buttons with radius t_f. For vertical buttons (red in Fig. 4) and table buttons (blue in Fig. 6), the classifier looks at a square patch from the laser range finder around the selected point. This square patch dictates the effective size of the button.
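The button extents described above can be expressed compactly: a click activates the object button of the nearest object within radius t, so overlapping buttons are separated by a Voronoi-like boundary, and a human button is activated when the nearest detected face lies within radius t_f. A sketch, with the threshold values as assumptions:

import math
from typing import Optional

T_OBJECT_M = 0.15   # radius t of an object button; assumed value
T_FACE_M = 0.50     # radius t_f of a cylindrical human button; assumed value

def nearest_within(click_xy, candidates_xy, radius_m) -> Optional[int]:
    """Index of the nearest candidate within radius_m of the click, or None."""
    best_i, best_d = None, radius_m
    for i, (x, y) in enumerate(candidates_xy):
        d = math.hypot(x - click_xy[0], y - click_xy[1])
        if d <= best_d:
            best_i, best_d = i, d
    return best_i

# nearest_within(click, object_positions, T_OBJECT_M)  -> which object button, if any
# nearest_within(click, face_positions, T_FACE_M)      -> which human button, if any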
V. EXPERIMENTS
To demonstrate that our clickable world interface is able
to robustly support a variety of tasks relevant to assistive
applications, we tested El-E’s ability to select and execute
several mobile manipulation behaviors according to user
intentions. We conducted our study with 5 members of our
lab.
In the first experiment, the robot started out next to the subject with an object in its gripper, with both the subject and the robot facing the same direction. The experimenter would then instruct the subject to command the robot to perform one of its programmed actions: drive to a location, place the object on a table, or hand the object over to a person. The subject was asked to command the robot to perform each action three times. Prior to each command, the robot was returned to its starting position.
For the drive-to-location command, we designated three
locations on the floor that the subject could click on. Figure
8 shows the targets. At each location we marked out a
square on the floor with tape and affixed another smaller
piece of tape nearby. The subject was instructed to click on
the smaller tape patch. If the robot base moved toward the
small patch and stopped such that it covered the square, we
considered the command to have been successfully executed.
With the object placement command, the subject was
asked to click on either a short or a tall table. The robot
was deemed to have successfully carried out this command
if it placed the object stably on the selected surface.

References
R. C. Arkin, Behavior-Based Robotics, MIT Press, 1998.
A. Torralba, "Contextual Priming for Object Detection," International Journal of Computer Vision, 2003.
S. Baluja and D. Pomerleau, "Non-Intrusive Gaze Tracking Using Artificial Neural Networks," Advances in Neural Information Processing Systems, 1994.