
Showing papers by "Trevor Darrell" published in 2000


Journal ArticleDOI
TL;DR: This work combines stereo, color, and face detection modules into a single robust system, shows an initial application in an interactive, face-responsive display, and discusses the failure modes of each individual module.
Abstract: We present an approach to real-time person tracking in crowded and/or unknown environments using integration of multiple visual modalities. We combine stereo, color, and face detection modules into a single robust system, and show an initial application in an interactive, face-responsive display. Dense, real-time stereo processing is used to isolate users from other objects and people in the background. Skin-hue classification identifies and tracks likely body parts within the silhouette of a user. Face pattern detection discriminates and localizes the face within the identified body parts. Faces and bodies of users are tracked over several temporal scales: short-term (user stays within the field of view), medium-term (user exits/reenters within minutes), and long term (user returns after hours or days). Short-term tracking is performed using simple region position and size correspondences, while medium and long-term tracking are based on statistics of user appearance. We discuss the failure modes of each individual module, describe our integration method, and report results with the complete system in trials with thousands of users.
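As an illustration of one stage of this pipeline, below is a minimal sketch (not the authors' implementation) of skin-hue classification restricted to a stereo-derived user silhouette. The rg-chromaticity skin model and its thresholds are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of skin-hue classification inside a stereo-derived user
# silhouette. The rg-chromaticity box and thresholds are illustrative
# assumptions, not the paper's trained skin model.
import numpy as np

def skin_mask(rgb: np.ndarray, silhouette: np.ndarray) -> np.ndarray:
    """Return a boolean mask of likely skin pixels within the silhouette.

    rgb        -- H x W x 3 uint8 image
    silhouette -- H x W boolean mask from dense stereo segmentation
    """
    img = rgb.astype(np.float32)
    total = img.sum(axis=2) + 1e-6           # avoid divide-by-zero
    r = img[..., 0] / total                  # normalized red chromaticity
    g = img[..., 1] / total                  # normalized green chromaticity
    # Illustrative skin-hue box in rg-chromaticity space.
    skin = (r > 0.35) & (r < 0.55) & (g > 0.25) & (g < 0.40)
    return skin & silhouette
```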

435 citations


Proceedings Article
01 Jan 2000
TL;DR: First, the data is projected into a maximally informative, low-dimensional subspace, suitable for density estimation, and the complicated stochastic relationships between the signals are modeled using a nonparametric density estimator.
Abstract: People can understand complex auditory and visual information, often using one to disambiguate the other. Automated analysis, even at a low-level, faces severe challenges, including the lack of accurate statistical models for the signals, and their high-dimensionality and varied sampling rates. Previous approaches [6] assumed simple parametric models for the joint distribution which, while tractable, cannot capture the complex signal relationships. We learn the joint distribution of the visual and auditory signals using a non-parametric approach. First, we project the data into a maximally informative, low-dimensional subspace, suitable for density estimation. We then model the complicated stochastic relationships between the signals using a nonparametric density estimator. These learned densities allow processing across signal modalities. We demonstrate, on synthetic and real signals, localization in video of the face that is speaking in audio, and, conversely, audio enhancement of a particular speaker selected from the video.
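The two-stage idea (project each modality to a low-dimensional subspace, then fit a nonparametric joint density) can be sketched as follows. The paper learns maximally informative projections; purely for illustration, a simple covariance-maximizing projection stands in here, and scipy's Gaussian kernel density estimator plays the role of the nonparametric model. All names are illustrative.

```python
# A minimal sketch of the two-stage idea: project each modality to 1-D,
# then fit a nonparametric (kernel) estimate of the joint density.
# The projection below maximizes cross-covariance, not the paper's
# mutual-information criterion; it is a stand-in for illustration.
import numpy as np
from scipy.stats import gaussian_kde

def fit_joint_density(audio_feats, video_feats):
    """audio_feats: (N, Da), video_feats: (N, Dv) time-aligned features."""
    A = audio_feats - audio_feats.mean(0)
    V = video_feats - video_feats.mean(0)
    u, _, vt = np.linalg.svd(A.T @ V, full_matrices=False)
    a = A @ u[:, 0]                          # 1-D audio projection
    v = V @ vt[0, :]                         # 1-D video projection
    return gaussian_kde(np.vstack([a, v]))   # nonparametric joint density

# Usage: kde = fit_joint_density(audio, video); kde(points) with points of
# shape (2, M) evaluates the estimated joint density p(a, v).
```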

226 citations


Book ChapterDOI
14 Oct 2000
TL;DR: It is shown how audio utterances from several speakers recorded with a single microphone can be separated into constituent streams, and how the method can help reduce the effect of noise in automatic speech recognition.
Abstract: Audio-based interfaces usually suffer when noise or other acoustic sources are present in the environment. For robust audio recognition, a single source must first be isolated. Existing solutions to this problem generally require special microphone configurations, and often assume prior knowledge of the spurious sources. We have developed new algorithms for segmenting streams of audio-visual information into their constituent sources by exploiting the mutual information present between audio and visual tracks. Automatic face recognition and image motion analysis methods are used to generate visual features for a particular user; empirically these features have high mutual information with audio recorded from that user. We show how audio utterances from several speakers recorded with a single microphone can be separated into constituent streams; we also show how the method can help reduce the effect of noise in automatic speech recognition.
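A minimal sketch of the association step follows: score each on-screen face by the mutual information between its visual motion signal and the audio envelope, and assign the audio to the best-scoring face. The paper's estimator is nonparametric; a closed-form joint-Gaussian score, I(a; v) = -0.5 * log(1 - rho^2), is used here purely for illustration, and feature extraction is assumed to have been done.

```python
# A minimal sketch of audio-to-face association by mutual information.
# The joint-Gaussian MI formula below is a simplifying assumption; the
# paper uses nonparametric density estimates instead.
import numpy as np

def gaussian_mi(a: np.ndarray, v: np.ndarray) -> float:
    """MI (nats) between two 1-D signals under a joint-Gaussian model."""
    rho = np.corrcoef(a, v)[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2 + 1e-12)

def pick_speaker(audio_envelope, motion_per_face):
    """motion_per_face: list of 1-D motion-energy signals, one per face."""
    scores = [gaussian_mi(audio_envelope, m) for m in motion_per_face]
    return int(np.argmax(scores))  # index of the face matching the audio
```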

54 citations


Proceedings ArticleDOI
01 Jun 2000
TL;DR: This paper addresses several important issues in the formation of the constraint equations, including updating the body rotation matrix without using a first-order matrix approximation and removing the coupling between the rotation and translation updates.
Abstract: This paper explores several approaches for articulated-pose estimation, assuming that video-rate depth information is available, from either stereo cameras or other sensors. We use these depth measurements in the traditional linear brightness constraint equation, as well as in a depth constraint equation. To capture the joint constraints, we combine the brightness and depth constraints with twist mathematics. We address several important issues in the formation of the constraint equations, including updating the body rotation matrix without using a first-order matrix approximation and removing the coupling between the rotation and translation updates. The resulting constraint equations are linear in a modified parameter set. After solving these linear constraints, a single closed-form non-linear transformation returns the updates to the original pose parameters. We show results for tracking body pose in oblique views of synthetic walking sequences and in moving-camera views of synthetic jumping-jack sequences. We also show results for tracking body pose in side views of a real walking sequence.
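The exact rotation update the abstract alludes to can be made concrete. A minimal sketch, assuming an axis-angle increment w has been estimated from the linearized constraints (variable names are illustrative, not from the paper): the update applies exp([w]_x) via Rodrigues' formula rather than the first-order approximation I + [w]_x.

```python
# A minimal sketch of an exact rotation update: R <- exp([w]_x) R computed
# with Rodrigues' formula, avoiding the first-order I + [w]_x approximation.
import numpy as np

def hat(w: np.ndarray) -> np.ndarray:
    """Skew-symmetric matrix [w]_x such that [w]_x @ v = w x v."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def update_rotation(R: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Exact update of rotation matrix R by axis-angle increment w."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return R
    K = hat(w / theta)
    expw = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    return expw @ R
```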

30 citations


01 Jan 2000
TL;DR: A tracking system will need to automatically initialize a model from the video, track persons for long periods of time (possibly tens of minutes), and be able to recover if it loses track.
Abstract: Motivation: The ability to recover this information (articulated body pose) would be extremely useful in applications such as virtual reality, remote human identification (gait analysis), non-intrusive medical diagnostics, and others. The common way to model the human body for this purpose is a kinematic tree parametrized by the sizes of the limbs and the joint angles. The tracking system will need to automatically initialize a model from the video, track persons for long periods of time (possibly tens of minutes), and be able to recover if it loses track.
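A minimal sketch of such a kinematic-tree body model, parametrized by limb lengths and joint angles, with forward kinematics chaining transforms from a root. Planar (2-D) joints are an illustrative simplification of the full model described above.

```python
# A minimal sketch of a kinematic tree: each limb carries a length and a
# joint angle relative to its parent; forward kinematics chains 2-D
# transforms from the root. Planar joints simplify the full 3-D model.
import numpy as np

class Limb:
    def __init__(self, length, angle, children=()):
        self.length = length      # limb size parameter
        self.angle = angle        # joint angle (radians, relative to parent)
        self.children = list(children)

def joint_positions(limb, origin=np.zeros(2), parent_angle=0.0, out=None):
    """Collect world-frame endpoint positions of every limb in the tree."""
    if out is None:
        out = []
    a = parent_angle + limb.angle
    end = origin + limb.length * np.array([np.cos(a), np.sin(a)])
    out.append(end)
    for child in limb.children:
        joint_positions(child, end, a, out)
    return out

# Example: a torso with two single-segment arms.
body = Limb(0.5, np.pi / 2, [Limb(0.3, +2.0), Limb(0.3, -2.0)])
```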

01 Jan 2000
TL;DR: The system demonstrates the capabilities of a solely vision-based approach to tracking and understanding people and their movements.
Abstract: Motivation: Systems that can track and understand people have a wide variety of commercial applications. Computers of the future are predicted to interact more naturally with humans than they do now: instead of the desktop paradigm, in which humans communicate by typing, they will understand human speech and movements. Our system demonstrates the capabilities of a solely vision-based system for these ends.