
Showing papers by "Trevor Darrell" published in 2003


Proceedings ArticleDOI
18 Jun 2003
TL;DR: This work presents a method for online rigid object tracking using an adaptive view-based appearance model that has bounded drift and can track objects undergoing large motion for long periods of time when the object's pose trajectory crosses itself.
Abstract: We present a method for online rigid object tracking using an adaptive view-based appearance model. When the object's pose trajectory crosses itself, our tracker has bounded drift and can track objects undergoing large motion for long periods of time. Our tracker registers each incoming frame against the views of the appearance model using a two-frame registration algorithm. Using a linear Gaussian filter, we simultaneously estimate the pose of the object and adjust the view-based model as pose-changes are recovered from the registration algorithm. The adaptive view-based model is populated online with views of the object as it undergoes different orientations in pose space, allowing us to capture non-Lambertian effects. We tested our approach on a real-time rigid object tracking task using stereo cameras and observed an RMS error within the accuracy limit of an attached inertial sensor.

129 citations
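
As a rough illustration of the idea (a sketch under assumptions, not the paper's implementation, with hypothetical names throughout), the following Python fragment maintains a set of key views with pose estimates, registers each incoming frame against the stored view nearest in pose space via a user-supplied two-frame registration routine, and fuses the resulting pose measurement with a linear Gaussian (Kalman) update:

```python
import numpy as np

class ViewBasedTracker:
    """Toy adaptive view-based appearance model: key frames are stored
    with pose estimates, and each incoming frame is registered against
    the stored view nearest in pose space, so drift stays bounded
    whenever the pose trajectory revisits earlier viewpoints."""

    def __init__(self, register_fn, pose_dim=6, new_view_dist=0.3):
        # register_fn(ref_frame, frame) -> (pose_change, covariance);
        # any 6-DOF two-frame registration method could stand in here.
        self.register = register_fn
        self.views = []                      # list of (frame, pose)
        self.pose = np.zeros(pose_dim)
        self.cov = np.eye(pose_dim)
        self.new_view_dist = new_view_dist

    def update(self, frame):
        if not self.views:
            self.views.append((frame, self.pose.copy()))
            return self.pose
        # Register against the stored view closest in pose space.
        ref_frame, ref_pose = min(
            self.views, key=lambda v: np.linalg.norm(v[1] - self.pose))
        delta, meas_cov = self.register(ref_frame, frame)
        z = ref_pose + delta                 # pose measurement
        # Linear Gaussian (Kalman) fusion of prediction and measurement.
        K = self.cov @ np.linalg.inv(self.cov + meas_cov)
        self.pose = self.pose + K @ (z - self.pose)
        self.cov = (np.eye(len(self.pose)) - K) @ self.cov
        # Populate the model online when the pose is sufficiently novel.
        if min(np.linalg.norm(p - self.pose)
               for _, p in self.views) > self.new_view_dist:
            self.views.append((frame, self.pose.copy()))
        return self.pose
```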


Book ChapterDOI
12 Oct 2003
TL;DR: In this paper, the authors discuss the concept of activity zones and suggest that such zones can be used to trigger application actions, retrieve information based on previous context, and present information to users.
Abstract: Location is a primary cue in many context-aware computing systems, and is often represented as a global coordinate, room number, or a set of Euclidean distances to various landmarks. A user’s concept of location, however, is often defined in terms of regions in which similar activities occur. We discuss the concept of such regions, which we call activity zones, and suggest that such zones can be used to trigger application actions, retrieve information based on previous context, and present information to users. We show how to semi-automatically partition a space into activity zones based on patterns of observed user location and motion. We describe our system and two implemented example applications whose behavior is controlled by users’ entry, exit, and presence in the zones.

121 citations
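
A minimal sketch of the zone-finding step, assuming tracked samples of position and velocity and substituting off-the-shelf k-means for whatever clustering the paper actually uses (all names below are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

def find_activity_zones(tracks, n_zones=5):
    """tracks: (N, 4) array of [x, y, vx, vy] samples from a person
    tracker. Clustering location-plus-motion features groups the
    observations into candidate activity zones; a user can then merge
    or label them by hand (the 'semi-automatic' step)."""
    labels = KMeans(n_clusters=n_zones, n_init=10).fit_predict(tracks)
    zones = [tracks[labels == k, :2] for k in range(n_zones)]
    # Represent each zone by its spatial centroid and extent.
    return [(z.mean(axis=0), z.std(axis=0)) for z in zones]

def zone_of(position, zones, n_std=2.0):
    """Return the index of the zone containing `position`, or None.
    Zone entry/exit events can then trigger application actions."""
    for i, (center, spread) in enumerate(zones):
        if np.all(np.abs(position - center) <= n_std * spread):
            return i
    return None
```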


Proceedings ArticleDOI
18 Jun 2003
TL;DR: It is shown how the use of a class-specific prior in a visual hull reconstruction can reduce the effect of segmentation errors from the silhouette extraction process.
Abstract: We present a Bayesian approach to image-based visual hull reconstruction. The 3D (three-dimensional) shape of an object of a known class is represented by sets of silhouette views simultaneously observed from multiple cameras. We show how the use of a class-specific prior in a visual hull reconstruction can reduce the effect of segmentation errors from the silhouette extraction process. In our representation, 3D information is implicit in the joint observations of multiple contours from known viewpoints. We model the prior density using a probabilistic principal components analysis-based technique and estimate a maximum a posteriori reconstruction of multi-view contours. The proposed method is applied to a dataset of pedestrian images, and improvements in the approximate 3D models under various noise conditions are shown.

88 citations
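
The MAP step can be sketched with standard Gaussian conditioning: under a PPCA prior x ~ N(mu, W Wᵀ + σ²I) and an observation y = x + noise, the posterior mean pulls the noisy contours toward the learned subspace. A minimal version, assuming the parameters come from a PPCA fit on training contour vectors (not the authors' code):

```python
import numpy as np

def ppca_map_reconstruct(y, mu, W, sigma2, noise_var):
    """MAP estimate of a clean multi-view contour vector x given a
    noisy observation y = x + e, e ~ N(0, noise_var * I), under a
    PPCA prior x ~ N(mu, W W^T + sigma2 * I). Shrinking y toward the
    learned subspace suppresses silhouette segmentation errors."""
    d = len(mu)
    C = W @ W.T + sigma2 * np.eye(d)          # PPCA prior covariance
    gain = C @ np.linalg.inv(C + noise_var * np.eye(d))
    return mu + gain @ (y - mu)
```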


Proceedings ArticleDOI
17 Oct 2003
TL;DR: This work presents a method for estimating the absolute pose of a rigid object based on intensity and depth view-based eigenspaces, built across multiple views of example objects of the same class.
Abstract: We present a method for estimating the absolute pose of a rigid object based on intensity and depth view-based eigenspaces, built across multiple views of example objects of the same class. Given an initial frame of an object with unknown pose, we reconstruct a prior model for all views represented in the eigenspaces. For each new frame, we compute the pose-changes between every view of the reconstructed prior model and the new frame. The resulting pose-changes are then combined and used in a Kalman filter update. This approach for pose estimation is user-independent and the prior model can be initialized automatically from any viewpoint of the view-based eigenspaces. To track more robustly over time, we present an extension of this pose estimation technique where we integrate our prior model approach with an adaptive differential tracker. We demonstrate the accuracy of our approach on face pose tracking using stereo cameras.

74 citations
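
One plausible reading of the fusion step, sketched below under assumptions: each view of the reconstructed prior model yields a pose-change measurement with its own covariance, and the measurements are combined by precision weighting before being handed to a standard Kalman filter update:

```python
import numpy as np

def fuse_pose_changes(deltas, covs):
    """Precision-weighted combination of pose-change estimates obtained
    by registering the new frame against each view of the reconstructed
    prior model. Returns the fused estimate and its covariance, which
    can then drive a Kalman filter update."""
    info = sum(np.linalg.inv(S) for S in covs)          # total precision
    mean = np.linalg.solve(info, sum(np.linalg.inv(S) @ d
                                     for d, S in zip(deltas, covs)))
    return mean, np.linalg.inv(info)
```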


Proceedings Article
13 Oct 2003
TL;DR: In this article, a probabilistic shape+structure model is proposed to estimate the 3D locations of 19 joints on the body based on observed silhouette contours from real images.
Abstract: We present an image-based approach to infer 3D structure parameters using a probabilistic "shape+structure" model. The 3D shape of an object class is represented by sets of contours from silhouette views simultaneously observed from multiple calibrated cameras, while structural features of interest on the object are denoted by a number of 3D locations. A prior density over the multi-view shape and corresponding structure is constructed with a mixture of probabilistic principal components analyzers. Given a novel set of contours, we infer the unknown structure parameters from the new shape's Bayesian reconstruction. Model matching and parameter inference are done entirely in the image domain and require no explicit 3D construction. Our shape model enables accurate estimation of structure despite segmentation errors or missing views in the input silhouettes, and it works even with only a single input view. Using a training set of thousands of pedestrian images generated from a synthetic model, we can accurately infer the 3D locations of 19 joints on the body based on observed silhouette contours from real images.

46 citations
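
For a single mixture component, inferring structure from an observed shape reduces to conditioning a joint Gaussian over the stacked [shape; structure] vector. A minimal sketch (variable names are hypothetical):

```python
import numpy as np

def infer_structure(shape_obs, mu, Sigma, n_shape):
    """Conditional Gaussian inference of structure (e.g., 3D joint
    locations) from an observed multi-view contour vector, using the
    joint mean/covariance of one component of the learned
    shape+structure mixture density."""
    mu_s, mu_t = mu[:n_shape], mu[n_shape:]
    S_ss = Sigma[:n_shape, :n_shape]       # shape-shape covariance
    S_ts = Sigma[n_shape:, :n_shape]       # structure-shape covariance
    return mu_t + S_ts @ np.linalg.solve(S_ss, shape_obs - mu_s)
```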


Proceedings ArticleDOI
16 Jun 2003
TL;DR: A probabilistic tracking framework that combines sound and vision to achieve more robust and accurate tracking of multiple objects and accurately reflects the number of people present is presented.
Abstract: In this paper, we present a probabilistic tracking framework that combines sound and vision to achieve more robust and accurate tracking of multiple objects. In a cluttered or noisy scene, our measurements have a non-Gaussian, multi-modal distribution. We apply a particle filter to track multiple people using combined audio and video observations. We have applied our algorithm to the domain of tracking people with a stereo-based visual foreground detection algorithm and audio localization using a beamforming technique. Our model also accurately reflects the number of people present. We test the efficacy of our system on a sequence of multiple people moving and speaking in an indoor environment.

40 citations
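
A bare-bones version of one filter step, assuming independent audio and video likelihood functions over candidate positions (a sketch, not the authors' implementation):

```python
import numpy as np

def particle_filter_step(particles, weights, audio_lik, video_lik,
                         motion_std=0.05, rng=np.random):
    """One step of a particle filter over person positions: diffuse
    particles with a random-walk motion model, reweight by the product
    of (assumed independent) audio and video likelihoods, and resample.
    audio_lik/video_lik map an (N, d) particle array to N likelihood
    values, e.g. from beamforming and stereo foreground detection."""
    particles = particles + motion_std * rng.standard_normal(particles.shape)
    weights = weights * audio_lik(particles) * video_lik(particles)
    weights = weights / weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```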


Proceedings ArticleDOI
05 Apr 2003
TL;DR: New algorithms for passive, real-time articulated tracking with standard cameras and personal computers are presented, and several different interaction styles are compared, based on an analysis of the space of possible perceptual interface abstractions for full-body navigation and the results of a wizard-of-oz study of user preferences.
Abstract: Navigating virtual environments usually requires a wired interface, game console, or keyboard. The advent of perceptual interface techniques allows a new option: the passive and untethered sensing of users' pose and gesture to allow them to maneuver through virtual worlds. We show new algorithms for passive, real-time articulated tracking with standard cameras and personal computers. Several different interaction styles are compared, based on an analysis of the space of possible perceptual interface abstractions for full-body navigation and the results of a wizard-of-oz study of user preferences. In this demo we show our prototype system with users guiding avatars through a series of 3-D virtual game worlds.

39 citations


Proceedings ArticleDOI
05 Nov 2003
TL;DR: A simple probabilistic framework that combines multiple cues derived from both audio and video information that provides a more robust solution than using any single cue alone is presented.
Abstract: This paper presents a multi-modal approach to locate a speaker in a scene and determine to whom he or she is speaking. We present a simple probabilistic framework that combines multiple cues derived from both audio and video information. A purely visual cue is obtained using a head tracker to identify possible speakers in a scene and provide both their 3-D positions and orientations. In addition, estimates of the audio signal's direction of arrival are obtained with the help of a two-element microphone array. A third cue measures the association between the audio and the tracked regions in the video. Integrating these cues provides a more robust solution than using any single cue alone. The usefulness of our approach is shown in our results for video sequences with two or more people in a prototype interactive kiosk environment.

37 citations
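
A toy version of the cue combination, assuming each cue supplies a per-candidate log-likelihood (function names are hypothetical; the paper's actual model may weight the cues differently):

```python
import numpy as np

def pick_speaker(tracked_heads, doa_loglik, assoc_loglik, visual_loglik):
    """Naive-Bayes-style cue fusion: for each tracked head, sum the
    log-likelihoods that the audio direction of arrival, the
    audio/video association score, and the visual cue each assign to
    that person being the current speaker, then pick the
    best-supported candidate."""
    scores = [doa_loglik(h) + assoc_loglik(h) + visual_loglik(h)
              for h in tracked_heads]
    return int(np.argmax(scores)), scores
```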


Journal ArticleDOI
TL;DR: This component-based architecture creates presence applications using perceptual user interface widgets that automatically convey user states to a remote location or application without user input.
Abstract: Perceptive presence systems automatically convey user states to a remote location or application without user input. Our component-based architecture creates presence applications using perceptual user interface widgets.

16 citations
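
One way such a widget interface might look, as a purely hypothetical Python sketch (the abstract does not give the architecture's real API):

```python
class PresenceWidget:
    """Hypothetical perceptual UI widget: wraps one perceptual cue
    (e.g., face presence, gaze, motion) and pushes state changes to
    subscribers, so applications compose presence behavior from
    widgets rather than talking to sensors directly."""

    def __init__(self):
        self._callbacks = []
        self._state = None

    def on_change(self, callback):
        self._callbacks.append(callback)

    def publish(self, new_state):
        # A sensing pipeline would call this when the user state
        # (present/absent, attentive, etc.) changes.
        if new_state != self._state:
            self._state = new_state
            for cb in self._callbacks:
                cb(new_state)

# Usage (hypothetical): forward desk presence to a remote peer.
# widget = PresenceWidget()
# widget.on_change(lambda state: send_to_remote("desk_presence", state))
```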


Proceedings ArticleDOI
16 Jun 2003
TL;DR: This work describes new algorithms for interacting with 3-D environments using real-time articulated body tracking with standard cameras and personal computers; the method is based on rigid stereo-motion estimation algorithms and uses a linear technique for enforcing articulation constraints.
Abstract: Navigating virtual environments usually requires a wired interface, game console, or keyboard. The advent of perceptual interface techniques allows a new option: the passive and untethered sensing of users' pose and gesture to allow them to maneuver through and manipulate virtual worlds. We describe new algorithms for interacting with 3-D environments using real-time articulated body tracking with standard cameras and personal computers. Our method is based on rigid stereo-motion estimation algorithms and uses a linear technique for enforcing articulation constraints. With our tracking system users can navigate virtual environments using 3-D gesture and body poses. We analyze the space of possible perceptual interface abstractions for full-body navigation, and present a prototype system based on these results. We finally describe an initial evaluation of our prototype system with users guiding avatars through a series of 3-D virtual game worlds.

12 citations
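
The articulation-constraint step admits a simple linear reading: independently estimated per-limb twists are projected onto the subspace where connected limbs predict the same velocity at their shared joint. A sketch under that assumption, with twists expressed in a common frame (not necessarily the paper's exact formulation):

```python
import numpy as np

def skew(w):
    """Cross-product matrix: skew(w) @ p == np.cross(w, p)."""
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def enforce_joint_constraints(twists, pairs, joints):
    """Project independently estimated per-limb twists (v, w) onto the
    subspace where connected limbs agree at their shared joints, i.e.
    v_a + w_a x p == v_b + w_b x p for each joint p linking limbs
    (a, b). Solved as min ||x - x_hat||^2 s.t. A x = 0 via the KKT
    system."""
    n = len(twists)
    x_hat = np.concatenate([np.concatenate(t) for t in twists])  # 6n
    rows = []
    for (a, b), p in zip(pairs, joints):
        # velocity of limb i at p: v_i + w_i x p = v_i - skew(p) @ w_i
        row = np.zeros((3, 6 * n))
        row[:, 6*a:6*a+3] = np.eye(3)
        row[:, 6*a+3:6*a+6] = -skew(p)
        row[:, 6*b:6*b+3] = -np.eye(3)
        row[:, 6*b+3:6*b+6] = skew(p)
        rows.append(row)
    A = np.vstack(rows)
    lam = np.linalg.solve(A @ A.T, A @ x_hat)
    return (x_hat - A.T @ lam).reshape(n, 6)
```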


Proceedings ArticleDOI
05 Nov 2003
TL;DR: This research focuses on a module that provides parameterized gesture recognition using various machine learning techniques: a support vector classifier is trained to model the boundary of the space of possible gestures, and hidden Markov models are trained on specific gestures.
Abstract: Humans use a combination of gesture and speech to convey meaning, and usually do so without holding a device or pointer. We present a system that incorporates body tracking and gesture recognition for an untethered human-computer interface. This research focuses on a module that provides parameterized gesture recognition, using various machine learning techniques. We train a support vector classifier to model the boundary of the space of possible gestures, and train hidden Markov models on specific gestures. Given a sequence, we can find the start and end of various gestures using the support vector classifier, and find gesture likelihoods and parameters with an HMM. Finally, multimodal recognition is performed using rank-order fusion to merge speech and vision hypotheses.
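
A minimal sketch of the described pipeline using off-the-shelf components (scikit-learn's one-class SVM for the gesture boundary and hmmlearn for the per-gesture models; the data variables are assumed inputs):

```python
import numpy as np
from sklearn.svm import OneClassSVM
from hmmlearn.hmm import GaussianHMM

def train_models(gesture_frames, training_sequences):
    """gesture_frames: (N, d) array of feature vectors drawn from real
    gestures; training_sequences: dict mapping gesture name to a list
    of (T_i, d) feature sequences."""
    # Boundary model: flags which frames of a new sequence look
    # gesture-like, yielding candidate start/end points.
    boundary = OneClassSVM(nu=0.1, gamma="scale").fit(gesture_frames)
    # One HMM per gesture class for likelihoods and parameters.
    hmms = {name: GaussianHMM(n_components=4).fit(
                np.vstack(seqs), [len(s) for s in seqs])
            for name, seqs in training_sequences.items()}
    return boundary, hmms

def classify_segment(segment, hmms):
    """Score a candidate segment under every gesture HMM and return
    the best label with its log-likelihood."""
    scores = {name: h.score(segment) for name, h in hmms.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```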

Proceedings ArticleDOI
06 Jul 2003
TL;DR: A method that learns a joint audio-visual appearance model of a moving subject without hand initialization, using only the associated audio signal to "decide" which object to model and track, demonstrated on a human speaker moving in a scene.
Abstract: Objects of interest are rarely silent or invisible. Analysis of multi-modal signal generation from a single object represents a rich and challenging area for smart sensor arrays. We consider the problem of simultaneously learning an audio and visual appearance model of a moving subject. We present a method which successfully learns such a model without benefit of hand initialization, using only the associated audio signal to "decide" which object to model and track. We are interested in particular in modeling joint audio and video variation, such as produced by a speaking face. We present an algorithm and experimental results for a human speaker moving in a scene.
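
The abstract does not give the algorithm's details; one hedged guess at the audio-driven selection step is to score candidate moving regions by how well their appearance change correlates with the audio energy over time:

```python
import numpy as np

def pick_av_object(region_signals, audio_energy):
    """Audio-guided model selection: among candidate moving regions,
    choose the one whose appearance-change signal correlates best with
    audio energy over time, i.e. let the audio 'decide' which object
    to model and track. region_signals: (n_regions, T) array of
    per-frame pixel-change energy; audio_energy: (T,) array."""
    a = (audio_energy - audio_energy.mean()) / audio_energy.std()
    corrs = []
    for r in region_signals:
        rz = (r - r.mean()) / r.std()
        corrs.append(np.mean(rz * a))
    return int(np.argmax(corrs)), corrs
```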