
A Clickable World: Behavior Selection Through Pointing and Context
for Mobile Manipulation
Hai Nguyen, Advait Jain, Cressel Anderson, Charles C. Kemp
Abstract We present a new behavior selection system for
human-robot interaction that maps virtual buttons overlaid
on the physical environment to the robot’s behaviors, thereby
creating a clickable world. The user clicks on a virtual button
and activates the associated behavior by briefly illuminating
a corresponding 3D location with an off-the-shelf green laser
pointer. As we have described in previous work, the robot
can detect this click and estimate its 3D location using an
omnidirectional camera and a pan/tilt stereo camera. In this
paper, we show that the robot can select the appropriate
behavior to execute using the 3D location of the click, the
context around this 3D location, and its own state. For this
work, the robot performs this selection process using a cascade
of classifiers.
We demonstrate the efficacy of this approach with an assistive
object-fetching application. Through empirical evaluation, we
show that the 3D location of the click, the state of the robot,
and the surrounding context are sufficient for the robot to choose the correct behavior from a set of behaviors and perform the following tasks: pick up a designated object from a floor or
table, deliver an object to a designated person, place an object
on a designated table, go to a designated location, and touch a
designated location with its end effector.
I. INTRODUCTION
For assistive robots, the ability to correctly interpret user commands is advantageous for performing useful services. Many methods have been proposed for human-robot interaction, but none has so far been adopted extensively.
Interfaces based on the traditional WIMP (windows, icons,
menus, pointers) model are often criticized as being an
unnatural mode for interaction, while natural interfaces based
on speech or gestures are themselves plagued by performance
problems in realistic environments. To cope with these dif-
ficulties, we present a new human-robot interaction system
for which the physical world is viewed as having overlaid
virtual buttons that trigger robotic behaviors when clicked
by the user.
In general, these virtual buttons can be clicked by provid-
ing a 3D location to the robot. For this work, the user clicks
these virtual buttons using an uninstrumented laser pointer.
As we have previously described in [8], our robot El-E has a
laser-pointer interface that detects when a user illuminates a
location in the environment and estimates its 3D location.
We previously validated this approach in the context of
object grasping and a preliminary object-fetching application
[10]. Within this paper we generalize this approach to form
a clickable world interface and demonstrate its efficacy in
Charles C. Kemp is with the Faculty of Biomedical Engineering at
Georgia Tech charlie.kemp@bme.gatech.edu
Fig. 1. A clickable world interface enables a user to trigger appropriate
robotic behaviors by clicking on virtual buttons using a laser-pointer.
the context of a full assistive object-fetching application
designed for motor-impaired individuals.
We first discuss the relationship to previous works in
Section II. Then, in Sections III and IV, we describe our
robot along with details of the clickable world interface as
it applies to assistive robots. To evaluate the effectiveness
of the system at selecting appropriate behaviors, we present
experiments and associated results in Sections V and VI.
Finally, we close with concluding remarks.
II. RELATED WORK
Several other examples of intelligent pointing devices
exist, such as Patel and Abowd’s iCam augmented reality
system [12]. In this work, users could virtually annotate an
environment using a handheld computer containing a laser
pointer, camera, and sensors that determined the computer’s
position relative to a localization system installed in the
environment. The XWand [15] and WorldCursor [14], developed at Microsoft Research, allow people to select locations in the environment. The XWand is a wand-like device that enables the user to point at an object in the environment and control it using gestures and voice commands. For example, lights can be turned on and off by pointing at the switch and saying “turn on” or “turn off”, and a media player can be controlled by pointing at it and giving spoken commands such as “volume up”, “play”, etc. This work is similar in spirit to ours: the object to be acted upon is selected using the XWand, and a simple command specifies what task is to be performed. For our work, having a robot perform tasks avoids the need for specialized, networked, computer-operated, intelligent devices.

[Figure 2: two decision trees branching on Human?, Floor Height? (floor vs. elevated surface), and Table? (elevated vs. vertical surface) into the behaviors Deliver, Move to Point, Place on Table, Pick Up From Floor, Pick Up From Table, Touch, and Future Work.]
Fig. 2. Top: El-E's decision process for mapping from sensory inputs to behaviors to execute. Bottom: Corresponding sensory input to behavior mapping in the state where the robot is holding an object.
A robot also has the potential to interact with any physical interface or object in addition to electronic interfaces. Moreover, unlike these systems, our
clickable world interface is fully portable with the robot
and does not require a model of the environment or any
modifications to the environment.
Torralba [13] describes a method of object recognition based on contextual information. Little [9] discusses learning spatial configurations of objects; for example, cups and plates are typically found next to each other on a table. He discusses how this knowledge, combined with the shape and appearance of objects, can be used for object recognition, and he motivates the importance of connecting spatial and semantic information. Such work in computer vision is relevant to our
system because the identity of an object limits the set of
robot behaviors that are applicable for that particular object.
The motivation behind the clickable world is that the click
allows the user to specify where in the world the robot should
perform a task. The robot can then use context to infer the
task that the user wants the robot to perform.
In robotics, work by Dune [5], [6] describes a visual servoing mechanism for a system that enables users to click on the image of an object in a wide-angle camera and have a camera mounted on a robot arm point at the object. Classic work modeling the mechanisms and structures used for cognition, such as ACT-R [2], is related to our work in that these systems can also be used for determining the correct behavior to execute given some input. However, our approach does not attempt to model human-level cognition or reasoning.
It is conceivable that a clickable world interface could
make use of eye gaze, pointing with the hand, and other
natural gestures. There has been extensive research in these
areas [16], [4]. Some of these systems are designed with sim-
ilar objectives to the clickable world interface, but in contrast
to the laser-pointer interface, these methods currently do not
have the ability to provide a suitably accurate 3D location.
III. CLICKABLE WORLD FRAMEWORK
In the behavior-based robotics framework [3], robot behaviors can be viewed abstractly as mappings from stimuli, S, to motor responses, R:

    β : S ↦ R    (1)

When there are multiple behaviors or sets of behaviors from which to choose, creating a mapping from stimuli to behaviors can become a challenge. With our clickable world interface, we posit that a location in the world can be a powerful cue for user-directed behavior selection. In our clickable world interface, each 3D location, p, provided by the laser-pointer interface is mapped to a behavior, β_i, executable by the robot:

    f : p ↦ β_i    (2)
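To make the notation concrete, the following minimal sketch (ours, not the authors' code) treats a behavior β as a callable from a stimulus to a motor response and f as a selection function that maps a clicked 3D point, together with assumed context cues, to one of the robot's behaviors; the chosen behavior then receives the same point as its parameter.

from typing import Callable

Point3D = tuple[float, float, float]     # the clicked 3D location p
Behavior = Callable[[Point3D], None]     # beta : S -> R, executed for its motor response

def f(p: Point3D, context: dict, behaviors: dict[str, Behavior]) -> Behavior:
    """f : p -> beta_i.  The context keys used here ('surface', 'object_present')
    are assumed stand-ins for the cues described in Section IV."""
    if context.get("surface") == "floor" and context.get("object_present"):
        return behaviors["Grasp_On_Floor"]
    return behaviors["Follow_Laserpoint"]

# The 3D location plays two roles: it selects the behavior and parameterizes it:
#   beta = f(p, context, behaviors)
#   beta(p)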
For the examples we describe in this paper, this 3D
location serves two roles. First, the robot uses the 3D location
and contextual information around the 3D location to select
and execute the appropriate behavior, thereby implementing
the mapping of equation 2. Consequently, by giving a 3D location to the robot, the user commands the robot to execute
a desired behavior. Second, the selected behavior uses the
3D location as a parameter. This 3D location is often critical
to the behavior, such as when it tells the robot where to
move or where to manipulate. Within our implementation,
these two distinct roles of selecting behaviors and providing
parameters to behaviors are intertwined. For example, the
robot often moves towards a location in order to better assess
the surrounding context and thereby distinguish which of
several behaviors to execute.
As we demonstrate in this paper, the power of this
approach as a user interface derives from the intuitive re-
lationship between a location and a mobile manipulation
behavior. When acquiring an object, the location of an
object is sufficient to tell the robot which object to pick
up. Likewise, when delivering an object, the location for
delivery is sufficient to tell the robot where the object should

(a) El-E (b) Camera system
Fig. 3. (a) An image of the entire mobile manipulator with the integrated
interface system. (b) The laser pointer interface is integrated into the robot’s
head. It consists of an omnidirectional camera (bottom half) and a pan/tilt
stereo camera (top half).
be delivered and the manner in which it should be delivered.
Furthermore, if a user wishes to have the robot manipulate a
fixed part of the environment, such as a door handle or a light
switch, the location of the manipulable device is sufficient
to command the robot to reach out and make contact with
it.
Since mobile manipulation activities typically involve
task-relevant locations with which the robot makes contact
either directly or indirectly, we expect that this type of
interface will extend to a wide variety of activities. For
example, when acquiring objects the location specifies the
object with which the end effector should make contact,
and when delivering objects the location specifies where the
object held by the robot should make contact.
Within our system, the user-selected 3D point is given to a behavior selection mechanism that uses a manually constructed cascade of classifiers to decide which behavior to execute (Figure 2). Each module in this cascade results in a difficult-to-reverse change in the robot and world state as the robot attempts to collect more information and thereby disambiguate the command.
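The cascade can be pictured as an ordered sequence of stages, each of which may first move the robot (a change that is hard to undo) to gather more information before testing whether its behavior applies. The sketch below is our own simplified rendering of that structure under assumed names, not the authors' implementation.

from typing import Callable, Optional

# Each stage: (gather, test, behavior_name).  `gather` may move the robot to
# collect more context; `test` decides whether this stage's behavior applies.
Stage = tuple[Callable[[], None], Callable[[], bool], str]

def run_cascade(stages: list[Stage]) -> Optional[str]:
    """Return the name of the first behavior whose stage accepts the click."""
    for gather, test, behavior_name in stages:
        gather()        # e.g. drive closer to the clicked location
        if test():      # e.g. is there a face / table edge / object here?
            return behavior_name
    return None         # no stage matched; the command remains ambiguous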
IV. IMPLEMENTATION
The robot, El-E, is a mobile manipulator with a 5-DoF
Neuronics Katana 6M arm mounted on a linear actuator
which sits on top of a Videre Erratic mobile base (Figure
3(a)). In addition to its head, which is specially designed for detecting laser pointers (Figure 3(b); described more extensively in [8]), El-E has a color camera on its end effector
and a URG laser range finder on the linear actuator carriage
that also contains the manipulator.
Fig. 4. In the starting configuration, users can select object buttons on
the ground and table to be picked up by the robot. Drive-to commands can
be given by clicking on ground buttons in areas where there are no objects
present. In addition, the robot can be instructed to touch a location when
the user selects points on vertical surfaces representing buttons on walls.
The specially designed hardware and software system
forming the laser-pointer interface enables El-E to detect
when a user illuminates a location with a green laser pointer
and estimate the 3D location selected by this point and click.
Given this 3D location and its immediate context obtained
through the robot’s sensors, our system activates the correct
behavior to carry out desired user commands.
We now describe El-E's decision process after the user clicks a button in the world. The set of behaviors from which El-E can choose is first determined by the current state of the robot: either the robot is free to grasp an object (Free To Grasp) or it has an object in its gripper (Object In Gripper).
For each behavior, if the given 3D location is initially further than a threshold distance away, El-E drives towards it but stops before the point is within manipulable range and requests another laser detection. This two-step process was created to reduce the error in the estimated 3D location of the designated point since, for stereo ranging, the triangulation error increases with distance from the camera.
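A minimal sketch of this two-step process follows; it is our own illustration, and the robot and laser-interface methods as well as the threshold value are assumptions, since the paper does not specify them.

import math

APPROACH_THRESHOLD_M = 1.0   # assumed value; the text states only that a threshold is used

def planar_distance(robot_xy, p):
    """Distance from the robot to the click, projected onto the ground plane."""
    return math.hypot(p[0] - robot_xy[0], p[1] - robot_xy[1])

def acquire_click(robot, laser_interface):
    """Drive toward a distant click, then request a second, closer-range detection."""
    p = laser_interface.wait_for_click()                # first 3D estimate
    if planar_distance(robot.position_xy(), p) > APPROACH_THRESHOLD_M:
        robot.drive_toward(p, stop_short=True)          # stop short of manipulable range
        p = laser_interface.wait_for_click()            # stereo error grows with range, so re-detect
    return p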
A. Robot State Free To Grasp
In this state the robot does not have an object in its gripper.
During this mode users are able to click on the following
buttons: objects on the floor, empty locations on the floor,
objects on the table and wall locations. When activated, these
buttons (illustrated in Figure 4 and 5) trigger the following
associated behaviors: Grasp On Floor, Follow Laserpoint,
Grasp On Table, Reachout And Touch .
To determine which button was activated, and thus which behavior to execute, our robot uses the decision process summarized in Figure 2. In more detail, El-E first determines whether the click was on a floor button by checking the height of the laser point. El-E assumes that the
floor is flat and that its base sits on the floor. If the height
of the laser point is less than a threshold height (currently

Fig. 5. Each row shows the user clicking a button and the robot selecting
the appropriate behavior when the robot is in the Free To Grasp state. Top
to bottom: Grasp On Floor, Reachout And Touch, Grasp On Table
30 cm) above the assumed floor height, El-E perceives it
as a floor button. It then drives towards the laser point. If
there is an object in the close vicinity of the click, the robot picks it up. If the human has pointed at an empty location on the floor, the robot moves to the selected location without grasping anything. El-E uses the 3D location of the button and the local context (presence or absence of an object) to select between the Grasp On Floor and Follow Laserpoint behaviors. For the Follow Laserpoint behavior, no further information is necessary, so the robot simply drives towards the user-indicated location.
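This floor-button branch can be summarized as follows (a sketch under our own naming; the 30 cm threshold comes from the text, while the object-presence test is the local-context cue discussed in subsection C below):

from typing import Optional

FLOOR_HEIGHT_THRESHOLD_M = 0.30   # 30 cm above the assumed floor plane, per the text

def free_to_grasp_floor_branch(click_height_m: float, object_present: bool) -> Optional[str]:
    """Free_To_Grasp state: distinguish Grasp_On_Floor from Follow_Laserpoint."""
    if click_height_m >= FLOOR_HEIGHT_THRESHOLD_M:
        return None     # not a floor button; defer to the table/wall classifier
    return "Grasp_On_Floor" if object_present else "Follow_Laserpoint"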
If the height of the button is greater than the threshold
height (again, 30 cm), the robot moves closer to the selected
point and then calls a classifier to determine whether the
selected button is a horizontal surface (table) or a vertical
surface (wall). This classifier works in a manner similar to
the table detector described in [10].
The classifier first takes a rectangle of range readings
around the user-selected 3D location. It then calculates the
differences between horizontal scan lines in this rectangle,
finds the maximum difference, and classifies the input as a table if this maximum difference is above a threshold. More intuitively, if there is a large difference between two range readings at adjacent heights, our classifier considers that location to be a table, since only horizontal surfaces parallel
to the scanning plane of the laser range finder would be
likely to cause such a sudden change in the amount of free
space perceived. The current implementation of the detector
classifies a range rectangle as a vertical surface (wall) if it
is not classified as a horizontal surface (table).
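The scan-line test can be sketched as below; this is our reconstruction from the description above, and the threshold value and the exact shape of the range rectangle are assumptions.

import numpy as np

TABLE_DIFF_THRESHOLD_M = 0.15   # assumed value; the text states only that a threshold is used

def classify_surface(range_rect: np.ndarray) -> str:
    """Classify a rectangle of range readings around the click as 'table' or 'wall'.
    range_rect has one row per scan height (the laser range finder is swept
    vertically by the linear actuator) and one column per bearing."""
    diffs = np.abs(np.diff(range_rect, axis=0))   # differences between adjacent horizontal scan lines
    if diffs.max() > TABLE_DIFF_THRESHOLD_M:
        return "table"   # a sudden jump in perceived free space marks a horizontal surface edge
    return "wall"        # anything not classified as a table is treated as a vertical surface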
If the selected button is classified as a table, El-E selects
Fig. 6. When El-E has an object in its gripper (green circle), users can select
from buttons representing a person (yellow shape), an elevated horizontal
surface (blue shape) or location on the floor (orange circles). Selecting a
person causes the robot to hand over the object to that person. Selecting elevated surfaces causes the robot to drop off the object. Selecting locations
on the floor moves the robot to the laser pointer’s position.
the Grasp On Table behavior. This behavior uses the laser
range finder mounted on a linear actuator to determine
the exact height of the table edge and then grasps the
selected object. More details on both the Grasp On Floor
and Grasp On Table behavior can be found in [10].
Finally, if the classifier reports that the clicked button is a
wall, the robot executes the Reachout And Touch behavior.
In this mode, the robot drives towards the point and orients itself so that it is perpendicular to the wall and facing the given 3D location. With its manipulator, El-E then reaches out to touch the selected point on the wall. To orient the robot, we perform a least-squares line fit to find the line representing the plane of the wall in the ranging information returned by El-E's laser range finders. This behavior serves as a plausible precursor for future behaviors such as operating
a light switch or opening a door, which would require that
the robot reach out and make contact.
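The orientation step can be sketched as a least-squares line fit to the planar range points returned from the wall, from which a heading perpendicular to the wall is derived; this is our illustration, and the frame conventions are assumptions.

import numpy as np

def heading_perpendicular_to_wall(wall_points_xy: np.ndarray) -> float:
    """Fit a line to 2D laser points on the wall and return the heading (radians)
    of the wall normal, i.e. the direction the robot should face."""
    x, y = wall_points_xy[:, 0], wall_points_xy[:, 1]
    # Fit in whichever parameterization is better conditioned; a wall nearly
    # parallel to the y axis would make y = m*x + b degenerate.
    if np.ptp(x) >= np.ptp(y):
        m, _ = np.polyfit(x, y, 1)                    # y = m*x + b
        direction = np.array([1.0, m])
    else:
        m, _ = np.polyfit(y, x, 1)                    # x = m*y + b
        direction = np.array([m, 1.0])
    normal = np.array([-direction[1], direction[0]])  # rotate the wall direction by 90 degrees
    # The sign of the normal may need to be flipped so the robot faces toward the wall.
    return float(np.arctan2(normal[1], normal[0]))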
B. Robot State Object In Gripper
Figure 2 shows the robot’s decision process when it has
an object in its gripper. The set of buttons which the user
can click and their associated behaviors are: directly in front
of a person (Deliver To Person), an empty location on the
floor (Follow Laserpoint), and a table (Deliver To Table).
El-E first determines whether the click was on a human button, and should therefore trigger the Deliver To Person behavior, by checking for a face in a 3D gravity-oriented cylinder around the laser point. To detect the face, we used the Viola-Jones frontal face detector as implemented in OpenCV [11], in a manner similar to Edsinger et al. [7]. If a face is detected, the robot drives towards the person, extends its hand so that the object can be grasped by the person, and releases the object after a preset amount of time.
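A sketch of this face check is given below. It is our own illustration: the Haar cascade is OpenCV's implementation of the Viola-Jones detector named in the text, but the cascade file, the cylinder radius, and the availability of 3D face position estimates (e.g. from the stereo head) are assumptions.

import cv2
import numpy as np

# The cascade file is an assumption; any OpenCV frontal-face Haar cascade would do.
FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(bgr_image: np.ndarray):
    """Run the Viola-Jones frontal face detector; returns (x, y, w, h) boxes."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    return FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)

def face_in_gravity_cylinder(face_positions_xyz, click_xyz, radius_m=0.5) -> bool:
    """True if any estimated 3D face position lies inside a gravity-aligned (vertical)
    cylinder of the given radius centered on the clicked point."""
    cx, cy = click_xyz[0], click_xyz[1]
    return any(np.hypot(fx - cx, fy - cy) <= radius_m
               for fx, fy, _fz in face_positions_xyz)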
If a face is not detected, the robot chooses the Follow Laserpoint behavior if the laser point's height is less than a threshold (30 cm). Finally, if the height of the laser

Fig. 7. Each row shows the user clicking a button and the robot selecting
the appropriate behavior when the robot is in the Object In Gripper state.
Top to bottom: Deliver To Table, Deliver To Person
point is greater than the given threshold, the robot calls the
same classifier as described in the previous sub-section to
distinguish between a table and a vertical flat surface. If
the user had clicked the table button, the robot selects the
Deliver To Table behavior and places the object on the table.
C. Local context in the decision process
In addition to the 3D coordinate of the click and the state
of the robot, the decision processes described in the previous
two sub-sections utilize the local context around the click to
select the appropriate behavior. In a situation with very few
behaviors and buttons, such as Follow Laserpoint, only the 3D location of the click may be required. But as the number of behaviors or the complexity of the buttons increases, local context information must be added to select the correct behavior.
For example, to recognize a click on a human button, El-
E looks to see if there is a face in the vicinity of the click.
Similarly, a click on an object on the floor is recognized by
both the height of the laser point and the presence of an
obstacle near the click (detected using a laser range finder
which can scan across the surface). The buttons which require additional context information are the vertical surface and table buttons. The classifier used to distinguish between a vertical and a horizontal surface requires a three-dimensional depth map of a rectangular region around the click. The depth map is obtained using the laser range finder mounted on the linear actuator.
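As an example of such a local-context cue, the presence of an object near the click can be tested directly on a planar laser scan; the sketch below is ours, with the radius and point-count threshold as assumptions.

import numpy as np

def object_near_click(scan_points_xy: np.ndarray, click_xy,
                      radius_m: float = 0.15, min_points: int = 5) -> bool:
    """True if enough laser returns fall within radius_m of the click's ground-plane
    position, indicating an object (rather than empty floor) at the clicked location."""
    d = np.hypot(scan_points_xy[:, 0] - click_xy[0],
                 scan_points_xy[:, 1] - click_xy[1])
    return int(np.count_nonzero(d <= radius_m)) >= min_points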
D. Physical Extent of World Buttons
Each button that can be activated by the user in our inter-
face has a physical extent that is determined largely by the
classifier that is used to recognize it. The buttons that signify grasping commands (green in Fig. 4) are approximately the size of a circle placed on the floor, centered on the object, with radius t, where t is a threshold specified by the interface
designer. The robot will only pick up an object if the object
is within distance t of the 3D location selected by the user.
If two object buttons overlap, then the robot will pick up the
Fig. 8. Floor buttons that the robot is supposed to drive to. Since the robot
drives so that the front of it just barely touches the location, we placed the
targeting mark outside the square that the robot has to sit on.
closest object, which results in a straight edge separating the
two buttons much like a Voronoi cell.
With the human buttons (yellow in Fig. 4), the system uses a distance threshold to the closest detected face, effectively creating cylindrical buttons with radius t_f. For vertical buttons (red in Fig. 4) and table buttons (blue in Fig. 6), the classifier looks at a square patch from the laser range finder around the selected point. This square patch dictates the effective size of the button.
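The button extents described above can be expressed compactly: a click activates the object button of the nearest object within radius t, so overlapping buttons are separated by a Voronoi-like boundary, and a human button is activated when the nearest detected face lies within radius t_f. A sketch, with the threshold values as assumptions:

import math
from typing import Optional

T_OBJECT_M = 0.15   # radius t of an object button; assumed value
T_FACE_M = 0.50     # radius t_f of a cylindrical human button; assumed value

def nearest_within(click_xy, candidates_xy, radius_m) -> Optional[int]:
    """Index of the nearest candidate within radius_m of the click, or None."""
    best_i, best_d = None, radius_m
    for i, (x, y) in enumerate(candidates_xy):
        d = math.hypot(x - click_xy[0], y - click_xy[1])
        if d <= best_d:
            best_i, best_d = i, d
    return best_i

# nearest_within(click, object_positions, T_OBJECT_M)  -> which object button, if any
# nearest_within(click, face_positions, T_FACE_M)      -> which human button, if any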
V. EXPERIMENTS
To demonstrate that our clickable world interface is able
to robustly support a variety of tasks relevant to assistive
applications, we tested El-E’s ability to select and execute
several mobile manipulation behaviors according to user
intentions. We conducted our study with 5 members of our
lab.
In the first experiment, the robot started out next to the subject with an object in its gripper, with both the subject and the robot facing the same direction. The experimenter would then instruct the subject to command the robot to perform one of its programmed actions: drive to a location, place the object on a table, or hand the object over to a person. The subject was asked to command the robot to perform each action three times. Prior to each command, the robot was returned to its starting position.
For the drive-to-location command, we designated three
locations on the floor that the subject could click on. Figure
8 shows the targets. At each location we marked out a
square on the floor with tape and affixed another smaller
piece of tape nearby. The subject was instructed to click on
the smaller tape patch. If the robot base moved toward the
small patch and stopped such that it covered the square, we
considered the command to have been successfully executed.
With the object placement command, the subject was
asked to click on either a short or a tall table. The robot
was deemed to have successfully carried out this command
if it placed the object stably on the selected surface.

References
R. C. Arkin, Behavior-Based Robotics, MIT Press, 1998.
A. Torralba, "Contextual Priming for Object Detection," International Journal of Computer Vision, 2003.
S. Baluja and D. Pomerleau, "Non-Intrusive Gaze Tracking Using Artificial Neural Networks," Advances in Neural Information Processing Systems, 1994.