Developing a Gesture-based Interface
Namita Gupta* (Dept of Maths, IIT Delhi, New Delhi 110016)
Pooja Mittal† (Dept of Maths, IIT Delhi, New Delhi 110016)
Sumantra Dutta Roy (Dept of EE, IIT Bombay, Mumbai 400076)
Santanu Chaudhury (Dept of EE, IIT Delhi, New Delhi 110016)
Subhashis Banerjee (Dept of CSE, IIT Delhi, New Delhi 110016)
namitag@microsoft.com, pmittal@amazon.com, sumantra@ee.iitb.ac.in,
santanuc@{ee,cse}.iitd.ac.in, suban@cse.iitd.ac.in
* Current affiliation and address: 10/2449 Microsoft Corporation, One Microsoft Way, WA 98052, USA.
† Current affiliation and address: 454B Amazon.com, 605 5th Avenue S, Seattle, WA 98104, USA.
Abstract
A gesture-based interface involves tracking a moving hand
across frames, and extracting the semantic interpretation
corresponding to the gesture. This is a difficult task, since
there is a change in both the position as well as the appear-
ance of the hand. Further, such a system should be robust
to the speed at which the gesture is performed. This paper
presents a novel attempt at developing a hand gesture-based
interface. We propose an on-line predictive EigenTracker
for the moving hand. Our tracker can learn the eigenspace
on the fly. We propose a new state-based representation
scheme for hand gestures, based on the eigenspace recon-
struction error. This makes the system independent of the
speed of performing the gesture. We use learning for adapt-
ing the gesture recognition system to individual require-
ments. We show results of successful operation of our sys-
tem even in cases of background clutter and other moving
objects.
1. Introduction
The use of hand gestures provides an attractive alternative
to cumbersome interface devices for Human-Computer In-
teraction (HCI) [10]. Hand gesture analysis involves both
spatial and temporal processing of image frames. Two
important components of the above task are the tracking of
the moving hand across frames, and extracting the seman-
tic interpretation corresponding to the gesture. Each one is
a difficult task. There is a loss of information due to the
projection of the 3-D human hand to the 2-D image plane.
Elaborate 3-D models have prohibitively high-dimensional
parameter spaces. Further, estimating 3-D parameters from
2-D images is also very difficult [10]. The tracker also has
to handle changing shapes, other moving objects, and noise
(as in Figure 1).

Figure 1: A set of representative frames from a hand gesture analysis sequence (frames 002, 053, 071, 078, 096, 102, 114, 126, 140, in row-major order; details in text).

The difficulty of the second task is compounded by several factors: hand shapes and sizes vary
from individual to individual. Thus, this is a serious prob-
lem for recognizing static hand gestures alone. For a dy-
namic hand gesture, different people may perform the same
gesture in different periods of time.
Pavlovic, Sharma and Huang [10] present an extensive
review of hand gesture interpretation techniques. Bobick
and Wilson [3] propose a state-based technique for repre-
sentation and recognition of gestures in which they define a
gesture as a sequence of states in a measurement or configu-
ration space. An HMM is a possible tool for modeling the spatial
and temporal nature of a gesture [10], [1], [11]. Yeasin and
Chaudhuri [12] model the temporal signature of a hand ges-
ture as a finite state machine.
In this paper, we propose a hand gesture-based interface system which uses hand tracking and changes in hand shapes to associate semantics with the gesture. Isard and Blake [6] propose the CONDEN-
SATION algorithm (a predictive tracker more general than
a Kalman tracker) for tracking moving objects, including
hands, in clutter, using the conditional density propagation of
state density over time. An EigenTracker [2] has an advan-
tage over traditional feature-based tracking algorithms: the ability to track objects which simultaneously undergo affine
image motions and changes in view (the Appendix gives
salient features of CONDENSATION and EigenTracking).
An important lacuna of EigenTracking is the absence of a predictive framework; this paper removes that restriction. We develop a novel predictive EigenTracker with efficient eigenspace update methods: it can learn the eigenspace representation on the fly. We have an automatic initialization process for the tracker: it does not need to be bootstrapped. This learning-and-tracking of
changing hand shapes fits in with our gesture recognition
framework. We express a gesture as a combination of dif-
ferent epochs, corresponding to eigenspace representations
of static hand shapes, and their temporal relationships. The
system goes through the same set of states, whether the
gesture is performed slowly or fast. We use a shape-based state identification scheme. The identification scheme
makes use of hand shapes of individuals (corresponding to
different states) learnt a priori.
The rest of the paper is organized as follows. Sec-
tion 2 presents our predictive EigenTracker, with on-line
eigenspace updates, and automatic initialization. We use
this predictive EigenTracker to track the motion of the hand.
Next, we discuss our gesture recognition framework. This
framework uses information from the predictive Eigen-
Tracker. In each case, we present results of experimentation
with our system.
2. A Predictive EigenTracker for Hand
Gestures
One of the main reasons for the inefficiency of the Eigen-
Tracking algorithm is the absence of a predictive frame-
work. An EigenTracker simply updates the eigenspace and
affine coefficients after each frame, requiring a good seed
value for the non-linear optimization in each case. We use
a predictive framework to speed up the EigenTracker. We
incorporate a prediction of the position of the object being
tracked, using a CONDENSATION-based algorithm. We
describe our model for the system state, dynamics and mea-
surement (observation) as follows.
The hand motion between frames has effects such as rotation, translation, scaling and shear, which can be accounted for by an affine model. The shape of the bounding window for the hand will then be a parallelogram. This is consistent with the affine motion model. Further, a parallelogram offers a tighter fit to the object being tracked (further reducing the effect of the background): an important consideration for an eigenspace-based method. A 6-element state vector characterizes affine motion. One can use the coordinates of three image points (any three non-collinear image points form a 2-D affine basis). The affine parameters represent the parallelogram bounding the hand shape in each frame. Alternatively, the 6 affine coefficients $(a_1, a_2, a_3, a_4, t_x, t_y)$ themselves can serve as elements of the state vector; in other words, $\mathbf{X}_t = [a_1\; a_2\; a_3\; a_4\; t_x\; t_y]^T$. These affine coefficients represent the transformation of the current bounding window to the original one. A commonly used model for state dynamics is a second-order AR process: $\mathbf{X}_t = A_1\mathbf{X}_{t-1} + A_2\mathbf{X}_{t-2} + \mathbf{w}_t$, where $\mathbf{w}_t$ is a zero-mean, white Gaussian random vector. The particular form of the model will depend on the application: constant velocity model, random walk model, etc.

The measurement $\mathbf{Z}_t$ is the set of 6 affine parameters estimated from the image. Similar to [6], the observation model has Gaussian peaks around each observation, and constant density otherwise. We use a large number of representative sequences to estimate the covariances of the affine parameters obtained with a non-predictive EigenTracker. These serve as the covariances of the above Gaussians.
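For concreteness, the following is a minimal NumPy sketch of one CONDENSATION iteration under the model above (second-order AR dynamics on the 6 affine parameters, Gaussian observation peaks over a constant clutter floor). The matrices A1 and A2, the dynamics noise covariance and the observation covariance are placeholders; in our system the observation covariance is estimated from representative sequences as described above.

import numpy as np

DIM = 6  # affine state: (a1, a2, a3, a4, tx, ty)

def predict(particles_t1, particles_t2, A1, A2, noise_cov):
    # Second-order AR dynamics: X_t = A1 X_{t-1} + A2 X_{t-2} + w_t,
    # with w_t a zero-mean white Gaussian random vector.
    w = np.random.multivariate_normal(np.zeros(DIM), noise_cov,
                                      size=len(particles_t1))
    return particles_t1 @ A1.T + particles_t2 @ A2.T + w

def weight(particles, z, obs_cov, clutter=1e-3):
    # Observation density: a Gaussian peak around the measured affine
    # parameters z, plus a constant density floor, as in [6].
    d = particles - z
    mahal = np.einsum('ij,jk,ik->i', d, np.linalg.inv(obs_cov), d)
    lik = np.exp(-0.5 * mahal) + clutter
    return lik / lik.sum()

def resample(particles, weights):
    # Factored sampling: draw particles proportionally to their weights.
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]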
We use a pyramidal approach for the predictive CONDENSATION-based EigenTracker. The measurements are made at each level of the pyramid. We start at the coarsest level. Using the predicted state and the measurement at this level, we obtain the state estimate for this level. The affine parameter estimate at this level goes as input to the next level of the pyramid. From the estimates at the finest level, we predict the affine parameters for the next frame.
It is not feasible to learn the multitude of poses corre-
sponding to hand gestures, even for one particular person.
One needs to learn and update the relevant eigenspaces, on
the fly. We discuss this in the following section.
2.1. On-line Eigenspace Updates
In a hand gesture, the appearance of the hand often changes considerably. One needs to build and update the eigenspace representation efficiently, on-line. A naive algorithm that rebuilds the eigenspace from scratch over $n$ images having $d$ pixels each is computationally inefficient. In particular, one needs efficient incremental SVD update algorithms to update the eigenspace at each frame. For our case, we use a scale-space variant of the algorithm of Chandrasekaran et al. [4], which takes $O(ndk)$ time for the $k$ most significant singular values.
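The following sketch illustrates the flavour of such an update (a rank-k sequential Karhunen-Loeve step; this is a simplification, not the exact scale-space variant of [4] that we use). Given the current d x k basis U with singular values s, it folds in one new flattened image x by solving a small (k+1)-dimensional SVD instead of recomputing a full d x n decomposition.

import numpy as np

def update_eigenspace(U, s, x, k):
    # U: d x k orthonormal basis; s: k singular values; x: new image (d,).
    proj = U.T @ x                 # coefficients of x in the current basis
    resid = x - U @ proj           # component of x orthogonal to the basis
    rnorm = np.linalg.norm(resid)
    if rnorm < 1e-10:              # x already lies in the span of U
        return U, s
    q = resid / rnorm
    # Small (k+1) x (k+1) SVD instead of a full d x n one.
    K = np.block([[np.diag(s), proj[:, None]],
                  [np.zeros((1, len(s))), np.array([[rnorm]])]])
    Uk, sk, _ = np.linalg.svd(K)
    U_new = np.hstack([U, q[:, None]]) @ Uk
    return U_new[:, :k], sk[:k]    # keep the k largest singular values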
ALGORITHM PREDICTIVE EIGENTRACKER
A. Delineate moving hand
B. REPEAT FOR ALL frames:
1. Obtain image MEASUREMENT by optimizing
affine parameters and reconstruction coefficients
2. ESTIMATE new affine parameters
from step 1 output (PREDICTION)
3. IF reconstruction error exceeds threshold $\epsilon_1$
THEN update eigenspace
4. IF reconstruction error very large (exceeds $\epsilon_2$)
THEN construct eigenspace afresh

Figure 2: Our Predictive EigenTracker for Hand Gestures: An Overview
2.2. Tracker Initialization
Initializing a tracker is a difficult problem because of multi-
ple moving objects, and background clutter. In other words,
one needs to segment out the moving region of interest from
the possibly cluttered background in the frames. Our hand
gesture tracker performs fully automatic initialization. We use a combination of motion cues (dominant motion detection [5]) as well as skin colour cues [7], [8], [9] to identify the region of interest in each frame.
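An illustrative sketch of such an initialization cue follows (using OpenCV). The HSV skin bounds and the motion threshold are placeholder values, and the actual system uses dominant motion detection [5] and the skin colour models of [7], [8], [9] rather than this simple frame difference.

import cv2
import numpy as np

def init_hand_region(prev_bgr, curr_bgr):
    # Intersect a skin-colour mask with a frame-difference motion mask,
    # then keep the largest connected component as the initial window.
    hsv = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2HSV)
    skin = cv2.inRange(hsv, (0, 40, 60), (25, 180, 255))  # placeholder bounds
    diff = cv2.absdiff(cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY),
                       cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY))
    motion = (diff > 15).astype(np.uint8) * 255            # placeholder threshold
    mask = cv2.bitwise_and(skin, motion)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n < 2:
        return None                                        # no moving skin region
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    x, y, w, h = stats[largest, :4]
    return (x, y, w, h)    # initial bounding window for the tracker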
2.3. The Overall Tracking Scheme
We now present an overview of our overall predictive
EigenTracker for hand gestures (Figure 2 outlines the main
steps). For the first few frames, we segment out the moving hand (Section 2.2). We then predict the affine parameters, i.e., a parallelogram bounding box, for the next frame (Step 2 in Figure 2, details in Section 2). The next step is obtaining measurements (of the affine parameters) from the image: an optimization over the affine parameters and the eigenspace reconstruction coefficients (Step 1 in Figure 2; Appendix). Depending on the reconstruction error (Equation 1, Appendix), the system decides whether or not to perform an eigenspace update (Section 2.1). If the reconstruction error is very large, this indicates a new view of the object. The algorithm then recomputes a new bounding box and starts rebuilding the eigenspace (Step 4 in Figure 2). This cue indicates an epoch change (Section 3). It then repeats the above steps for the next frame.
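The per-frame decision described above can be summarized by a small piece of logic; a sketch, with the two thresholds $\epsilon_1$ and $\epsilon_2$ of Figure 2 as parameters:

def eigenspace_decision(recon_error, eps1, eps2):
    # Two-threshold policy of Figure 2: small errors mean the hand shape
    # is unchanged; moderate errors trigger an eigenspace update; large
    # errors signal a new view, i.e. an epoch change (Section 3).
    if recon_error < eps1:
        return "SAME_STATE"           # same hand shape as the previous frame
    elif recon_error < eps2:
        return "UPDATE_EIGENSPACE"    # same shape class, refine the basis
    else:
        return "EPOCH_CHANGE"         # rebuild the eigenspace afresh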
Figure 1 shows the result of an experiment on a typical
hand gesture sequence. Our tracker can successfully track
the moving hand in a variety of changing poses, in spite of
background clutter, as well as other moving objects present
in the scene.
Figure 3: Tracking results with (a) a simple (non-predictive) EigenTracker, compared with (b) our predictive EigenTracker: representative frames 002, 010, 020, 030, 040, 050. The hand is not properly tracked by the former.

Figure 3(b) compares results obtained using the predictive EigenTracker with those of a non-predictive version (Figure 3(a)). The average number of iterations (for the optimization) improves from 3.5 to 2.9.
A comparison of a non-predictive EigenTracker with a pre-
dictive one for the sequence in Figure 4 shows a drastic im-
provement in the average number of iterations from 7.44
to 4.67.
2.4. Synergistic Conjunction with Other
Trackers: Restricted Affine Motion
A simple variant of our EigenTracking framework has an
on-line EigenTracker working in conjunction with another
tracker. We can thus take advantage of a tracker tracking
the same object, using a different measurement process, or
tracking principle. The EigenTracker works synergistically
with the other tracker, using it to get its affine parameters.
It then optimizes these parameters, and proceeds with the
EigenTracking. Such a synergistic combination endows the combined tracker with the benefits of both the EigenTracker and the other one: tracking the view changes of an object in a predictive manner.
We have experimented with using an SVD update-based
multi-resolution EigenTracker with a skin colour-based
Figure 4: Non-predictive EigenTracking (a) versus our predictive framework (b): another example, representative frames 082, 090, 119, 125, 127, 160.
CONDENSATION tracker [8], [9] for cases of restricted affine motions. The latter considers the parameters of a rectangular window bounding the moving hand as state vector elements: its centroid, height and width. The observation is also a 4-element vector, consisting of the rectangular window parameters of the largest skin blob. The
state dynamics considers a constant velocity model for the
centroid position, and a constant position one for the other
two parameters. We use the tracking parameters obtained
from the CONDENSATION skin tracker for each frame,
to estimate the affine parameters for the appearance
tracker. The appearance tracker then does the fine adjust-
ments of the affine parameters and computes the reconstruc-
tion error. We first consider a restricted case of affine transformations: scaling and translation alone (Figure 5). The processing time per frame is 100–180 ms when it can track at the coarsest level itself, and 600–900 ms when it goes to the finest level (image size 320×240). This experiment
shows that having even a very simple restricted affine model overcomes an inherent limitation of the EigenTracker: on its own, it can only track motions of up to a few pixels between frames.
We extend the previous scheme to cover rotations as
well. We first compute the principal axis of the pixel dis-
tribution of the best fitting blob. We align the principal
axis with the vertical $y$-axis and compute the new width, height and centroid.

Figure 5: A simple combination of a CONDENSATION skin tracker with an on-line EigenTracker: scaling and translation, representative frames 002, 070, 072, 120, 167, 200. Details in Section 2.4.

These parameters give us the restricted
affine matrix (scaling, rotation, translation), of the form

$\begin{bmatrix} s_x\cos\theta & -s_y\sin\theta & t_x \\ s_x\sin\theta & s_y\cos\theta & t_y \end{bmatrix}$,

with rotation $\theta$, scales $(s_x, s_y)$ and translation $(t_x, t_y)$. When applied to the current image, these parameters take it to the first bounding window of the CONDENSATION skin tracker. In Figure 6 we show results of
this approach.

Figure 6: Using an on-line EigenTracker in conjunction with a skin colour-based CONDENSATION tracker: rotation, translation, scaling (representative frames 001, 060, 090, 180, 230, 235). Details in Section 2.4.

This scheme allows tracking of large rotations (as evident in Figure 6). We get a better-fitting window and fewer background pixels, leading to a lower eigenspace reconstruction error. The average processing time per frame is 900 ms.
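A minimal sketch of the principal-axis computation described above (a simplification; the actual system obtains the blob from the CONDENSATION skin tracker's window, and the alignment conventions here are illustrative):

import numpy as np

def restricted_affine_from_blob(pixels_xy):
    # pixels_xy: (n, 2) array of skin-blob pixel coordinates.
    # Returns centroid, principal-axis angle, and the window width/height
    # measured after aligning the principal axis with a coordinate axis.
    c = pixels_xy.mean(axis=0)
    cov = np.cov((pixels_xy - c).T)
    evals, evecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    major = evecs[:, -1]                      # principal axis of the blob
    theta = np.arctan2(major[1], major[0])    # its orientation
    R = np.array([[np.cos(-theta), -np.sin(-theta)],
                  [np.sin(-theta),  np.cos(-theta)]])
    aligned = (pixels_xy - c) @ R.T           # rotate principal axis onto an axis
    w, h = aligned.max(axis=0) - aligned.min(axis=0)
    return c, theta, w, h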
3. Gesture Recognition
We propose a novel methodology for a gesture recognition
system. We use our predictive EigenTracker (Section 2) to
track hand motion across frames. Our predictive Eigen-
Tracking mechanism fits seamlessly into our gesture rep-
resentation and recognition framework. We represent each
gesture as a finite state machine (Figure 7). The states in the
FSM correspond to different static hand shapes. In our sys-
tem, a fixed stationary hand shape is taken as the start shape: a stationary open palm.

Figure 7: A very simple example of the representation of a gesture. We represent a gesture as composed of a particular temporal sequence of epochs, and transitions between them (details in text).

A stop state (signifying the end of
the gesture) is a hand position that has not changed for at least a particular number of frames. A
gesture is composed of a fixed temporal order of transitions
between states.
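As an illustration, a minimal sketch of this state-machine representation (a hypothetical encoding; the state names and matching logic are illustrative, not our exact implementation):

from dataclasses import dataclass

@dataclass
class GestureFSM:
    # A gesture as a fixed temporal order of hand-shape states (cf. Figure 7).
    states: list          # e.g. ["OPEN_PALM", "CLOSED_HAND", "THUMBS_UP"]
    current: int = 0

    def on_epoch_change(self, recognized_shape):
        # Advance only if the newly recognized static hand shape matches
        # the next state in the predefined sequence.
        if (self.current + 1 < len(self.states)
                and recognized_shape == self.states[self.current + 1]):
            self.current += 1
        # The gesture is recognized when the final state is reached.
        return self.current == len(self.states) - 1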
The system stores an eigenspace representation corre-
sponding to each static hand shape. It is important to note
that the segmentation of all hand appearances during a ges-
ture is done automatically, based on the eigenspace re-
construction coefficients. The tracker works on the basis
of the eigenspace reconstruction error (Section 2). If the
eigenspace reconstruction error is less than the threshold $\epsilon_1$ (as defined in Figure 2), then we take the hand shape to be the same as that corresponding to the previous frame. This means that the system is in the same state as it was for the previous frame. If the error lies between $\epsilon_1$ and $\epsilon_2$, we update the eigenspace representation corresponding to this hand shape. Only when the error exceeds $\epsilon_2$ does the EigenTracker signal an epoch change. This epoch change
corresponds to a drastic change in hand shape, and hence,
a new state of the FSM. The system searches hypothesized
transitions from the current state, based on a predefined set
of gestures. Such a state-based representation imparts ro-
bustness to the speed at which a hand gesture is performed: it will always correspond to the same set of states.
Our current set of gestures explores the idea of having a
static hand shape represent a state, or an epoch. Since we
have a predictive EigenTracker, we have information about
temporal and spatial changes as well. Hence, an extension
of our scheme will also include this information for the
case when the shape of the hand does not change signif-
icantly, but the position of the hand changes significantly
with time.
The static hand-shapes corresponding to the individual
states can vary from person to person (e.g., an open hand
shape can be different for different people). We propose a
personalized gesture recognition system. Hand shapes of
individuals with fixed semantics are learnt a priori. Our
system uses these learnt shapes for identification of states.
We exploit hand tracking, epoch changes and state identifi-
cation for gesture recognition.

Figure 8: Contour-based verification of static hand shapes (see text for details).

A change in epochs (or the gesture itself) can switch the system to a different task. The
system tries to recognize a particular hand shape when it
detects an epoch change. For our system, we use a contour-
based shape recognition strategy [11] for verifying particu-
lar hand shapes. Figure 8 depicts the system verifying two
particular hand shapes: an open hand, and a closed hand.
We present some preliminary results with our
eigenspace-based gesture recognition system. In the
sequence of Figure 1, the system starts with the eigenspace
corresponding to an open hand. The eigenspace recon-
struction error starts changing drastically at frame number
75 (crossing the upper threshold $\epsilon_2$), and doesn't
change much thereafter. This represents a transition from
the open hand to the closed hand. Figure 5 shows the result
with another gesture, performed by another individual.
The gesture starts from the same start shape, goes to the
closed hand pose, and ends up at the thumbs-up sign. Here
again, each epoch is triggered by a drastic change in the
eigenspace reconstruction error. The system re-initializes
itself with a fresh eigenspace corresponding to the closest
static shape in its database during each such change.
3.1. A Simple Application: A 3-D Mouse
This section describes a simple application of some of the
ideas presented in the preceding sections: a 3-D mouse. The motivation behind this is to have a hand (moving in 3-D
space) substituting for a mouse (without extracting any 3-D
positional information). The first image in Figure 9 shows
an example of such a setup: a camera is looking down on
a table, where the user moves his or her hand. The other
frames of Figure 9 show screen snaps of the program in ex-
ecution. The left window shows what the camera sees. On
the right, we show the hand, segmented out from the im-
age. For each such segmented out hand, we use the follow-
ing heuristic to compute the position of the virtual mouse
pointer. We consider the two eigenvalues corresponding to
the hand shape, and find the principal axis of the hand
shape. We use this to compute the position of the extreme
tip of the fingers. This is where we place the virtual mouse
pointer. If the ratio of the eigenvalues is greater than a
threshold, we consider it to be a pointing gesture, and in-
terpret it to be a mouse click. This system is on-line, and
implemented on a 700 MHz machine running Linux.
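For illustration, a minimal NumPy sketch of this pointer/click heuristic (the eigenvalue-ratio threshold and the function name are placeholders):

import numpy as np

def virtual_mouse(hand_pixels_xy, click_ratio=4.0):
    # Place the pointer at the fingertip, the extreme point along the
    # principal axis of the segmented hand pixels; report a click when
    # the eigenvalue ratio indicates an elongated (pointing) hand.
    c = hand_pixels_xy.mean(axis=0)
    cov = np.cov((hand_pixels_xy - c).T)
    evals, evecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    axis = evecs[:, -1]                      # principal axis
    proj = (hand_pixels_xy - c) @ axis
    tip = hand_pixels_xy[np.argmax(np.abs(proj))]   # fingertip estimate
    is_click = evals[-1] / max(evals[0], 1e-9) > click_ratio
    return tip, is_click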
References

[2] M. J. Black and A. D. Jepson. EigenTracking: Robust Matching and Tracking of Articulated Objects Using a View-Based Representation. International Journal of Computer Vision, 26(1):63–84, 1998.
[3] A. F. Bobick and A. D. Wilson. A State-Based Approach to the Representation and Recognition of Gesture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(12):1325–1337, 1997.
[4] S. Chandrasekaran, B. S. Manjunath, Y. F. Wang, J. Winkeler and H. Zhang. An Eigenspace Update Algorithm for Image Analysis. Graphical Models and Image Processing, 59(5):321–332, 1997.
[5] M. Irani, B. Rousso and S. Peleg. Computing Occluding and Transparent Motions. International Journal of Computer Vision, 12(1):5–16, 1994.
[6] M. Isard and A. Blake. CONDENSATION - Conditional Density Propagation for Visual Tracking. International Journal of Computer Vision, 29(1):5–28, 1998.
[10] V. I. Pavlovic, R. Sharma and T. S. Huang. Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):677–695, 1997.