Showing papers by "Trevor Darrell" published in 2004


Proceedings Article
01 Dec 2004
TL;DR: An extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition is proposed, which allows the assumption of conditional independence of the observed data to be relaxed.
Abstract: We present a discriminative part-based approach for the recognition of object classes from unsegmented cluttered scenes. Objects are modeled as flexible constellations of parts conditioned on local observations found by an interest operator. For each object class the probability of a given assignment of parts to local features is modeled by a Conditional Random Field (CRF). We propose an extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition. The parameters of the CRF are estimated in a maximum likelihood framework and recognition proceeds by finding the most likely class under our model. The main advantage of the proposed CRF framework is that it allows us to relax the assumption of conditional independence of the observed data (i.e. local features) often used in generative approaches, an assumption that might be too restrictive for a considerable number of object classes.

428 citations
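
The marginalization at the heart of this model is easy to sketch. Below is a minimal, brute-force rendering of a hidden-variable CRF class posterior in Python, assuming toy dimensions and a simple linear part/feature compatibility score; the paper's actual potentials and structured inference are not reproduced here.

```python
import itertools
import numpy as np

def class_posterior(features, theta_unary, theta_class):
    """Toy hidden-variable CRF: P(y | x) = (1/Z) * sum_h exp(score(y, h, x)).

    features:    (n_local, d) local feature descriptors found by an
                 interest operator
    theta_unary: (n_classes, n_parts, d) part/feature compatibility weights
    theta_class: (n_classes,) class bias terms
    The hidden variable h assigns each local feature to one of n_parts
    parts; marginalizing over h is brute-force here for clarity.
    """
    n_classes, n_parts, _ = theta_unary.shape
    n_local = features.shape[0]
    scores = np.zeros(n_classes)
    for y in range(n_classes):
        total = 0.0
        for h in itertools.product(range(n_parts), repeat=n_local):
            s = theta_class[y] + sum(
                theta_unary[y, h[i]] @ features[i] for i in range(n_local))
            total += np.exp(s)
        scores[y] = total
    return scores / scores.sum()
```

Recognition picks the class with the highest posterior; because h is marginalized out rather than treated as independent per feature, dependent local observations can be scored jointly.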


Patent
22 Jan 2004
TL;DR: In this paper, a mobile deixis device includes a camera to capture an image and a wireless handheld device coupled to the camera and to a wireless network to communicate the image with existing databases to find similar images.
Abstract: A mobile deixis device includes a camera to capture an image and a wireless handheld device, coupled to the camera and to a wireless network, to communicate the image with existing databases to find similar images. The mobile deixis device further includes a processor, coupled to the device, to process found database records related to similar images and a display to view found database records that include web pages including images. With such an arrangement, users can specify a location of interest simply by pointing a camera-equipped cellular phone at the object of interest; by searching an image database or relevant web resources, they can quickly identify good matches from several close ones to find the object of interest.

410 citations


Proceedings ArticleDOI
19 Jul 2004
TL;DR: A method for simultaneously recovering the trajectory of a target and the external calibration parameters of non-overlapping cameras in a multi-camera system with a network of indoor wireless cameras is described.
Abstract: We describe a method for simultaneously recovering the trajectory of a target and the external calibration parameters of non-overlapping cameras in a multi-camera system. Each camera is assumed to measure the location of a moving target within its field of view with respect to the camera's ground-plane coordinate system. Calibrating the network of cameras requires aligning each camera's ground-plane coordinate system with a global ground-plane coordinate system. Prior knowledge about the target's dynamics can compensate for the lack of overlap between the camera fields of view. The target is allowed to move freely with varying speed and direction. We demonstrate the idea with a network of indoor wireless cameras.

231 citations
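
As a rough illustration of the alignment step, the sketch below fits a 2-D rigid transform between one camera's local ground-plane track and a globally predicted trajectory using a least-squares (Kabsch) fit. This assumes time-synchronized point pairs and stands in for the paper's joint estimation of trajectory and calibration.

```python
import numpy as np

def align_camera_to_global(local_pts, global_pts):
    """Estimate the 2-D rigid transform (R, t) mapping a camera's
    ground-plane track onto the globally predicted trajectory.
    local_pts, global_pts: (n, 2) arrays of time-matched positions."""
    mu_l, mu_g = local_pts.mean(axis=0), global_pts.mean(axis=0)
    H = (local_pts - mu_l).T @ (global_pts - mu_g)   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against reflection
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = mu_g - R @ mu_l
    return R, t   # global ~= R @ local + t
```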


Proceedings ArticleDOI
19 Jul 2004
TL;DR: This work presents a contour matching algorithm that quickly computes the minimum weight matching between sets of descriptive local features using a recently introduced low-distortion embedding of the earth mover's distance (EMD) into a normed space.
Abstract: Weighted graph matching is a good way to align a pair of shapes represented by a set of descriptive local features; the set of correspondences produced by the minimum cost matching between two shapes' features often reveals how similar the shapes are. However, due to the complexity of computing the exact minimum cost matching, previous algorithms could only run efficiently when using a limited number of features per shape, and could not scale to perform retrievals from large databases. We present a contour matching algorithm that quickly computes the minimum weight matching between sets of descriptive local features using a recently introduced low-distortion embedding of the earth mover's distance (EMD) into a normed space. Given a novel embedded contour, the nearest neighbors in a database of embedded contours are retrieved in sublinear time via approximate nearest neighbors search with locality-sensitive hashing (LSH). We demonstrate our shape matching method on a database of 136,500 images of human figures. Our method achieves a speedup of four orders of magnitude over the exact method, at the cost of only a 4% reduction in accuracy.

203 citations
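
The embedding that makes this fast can be sketched compactly. Below is a simplified randomly-shifted-grid embedding in the spirit of Indyk and Thorup's construction, under which L1 distance between vectors approximates EMD between point sets (up to the embedding's known distortion); the parameters are illustrative, and the random shift must be shared across all shapes being compared.

```python
import numpy as np

def embed_emd(points, shift, levels=5, extent=1.0):
    """Embed a 2-D point set (e.g. sampled contour points scaled into
    [0, extent)^2) so that L1 distance between embeddings approximates
    EMD.  Counts in randomly shifted grid cells are weighted by cell
    side length; `shift` (a length-2 array in [0, extent)) must be the
    same for every embedded shape."""
    vecs = []
    for lvl in range(levels):
        side = extent / (2 ** lvl)
        n_cells = 2 ** (lvl + 1)                 # room for the shift
        idx = np.floor((points + shift) / side).astype(int)
        hist = np.zeros((n_cells, n_cells))
        for i, j in idx:
            hist[i, j] += 1
        vecs.append(side * hist.ravel())         # finer cells weigh less
    return np.concatenate(vecs)
```

Embedded contours can then be indexed with any L1 nearest-neighbor structure, such as locality-sensitive hashing, which is what yields the sublinear retrieval described above.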


Proceedings ArticleDOI
27 Jun 2004
TL;DR: This work demonstrates the usefulness of common image search metrics applied to images captured with a camera-equipped mobile device to find matching images on the World Wide Web or other general-purpose databases.
Abstract: We describe an approach to recognizing location from mobile devices using image-based Web search. We demonstrate the usefulness of common image search metrics applied to images captured with a camera-equipped mobile device to find matching images on the World Wide Web or other general-purpose databases. Searching the entire Web can be computationally overwhelming, so we devise a hybrid image-and-keyword searching technique. First, an image search is performed over images and links to their source Web pages in a database that indexes only a small fraction of the Web. Then, relevant keywords on these Web pages are automatically identified and submitted to an existing text-based search engine (e.g. Google) that indexes a much larger portion of the Web. Finally, the resulting image set is filtered to retain images close to the original query. It is thus possible to efficiently search hundreds of millions of images that are not only textually related but also visually relevant. We demonstrate our approach on an application allowing users to browse Web pages matching the image of a nearby location.

166 citations
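
The three-stage pipeline reads naturally as pseudocode. Every name below (image_search, extract_keywords, text_engine.search, visual_distance) is a hypothetical stand-in introduced for illustration, not an API from the paper.

```python
def locate_from_photo(query_img, small_index, text_engine, k=10, tau=0.5):
    """Hypothetical sketch of the hybrid image-and-keyword search."""
    # 1. Content-based search over a small indexed fraction of the Web.
    seed_hits = small_index.image_search(query_img, top_k=k)

    # 2. Mine salient keywords from the seed hits' source pages and
    #    query a large text-based engine (e.g. Google) with them.
    keywords = extract_keywords(page for _, page in seed_hits)
    candidate_pages = text_engine.search(keywords)

    # 3. Filter: keep only candidate images visually close to the query.
    return [(img, page)
            for page in candidate_pages
            for img in page.images
            if visual_distance(query_img, img) < tau]
```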



Journal ArticleDOI
TL;DR: A probabilistic multimodal generation model is introduced and used to derive an information theoretic measure of cross-modal correspondence; nonparametric statistical density modeling techniques can characterize the mutual information between signals from different domains.
Abstract: Audio and visual signals arriving from a common source are detected using a signal-level fusion technique. A probabilistic multimodal generation model is introduced and used to derive an information theoretic measure of cross-modal correspondence. Nonparametric statistical density modeling techniques can characterize the mutual information between signals from different domains. By comparing the mutual information between different pairs of signals, it is possible to identify which person is speaking a given utterance and discount errant motion or audio from other utterances or nonspeech events.

125 citations
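
A minimal nonparametric stand-in for this correspondence measure is a histogram-based mutual-information estimate between matched audio and visual feature streams; the paper's density models are richer, but the comparison logic is the same.

```python
import numpy as np

def mutual_information(audio_feat, visual_feat, bins=16):
    """Histogram-based (nonparametric) estimate of I(A; V) in nats
    between 1-D feature streams sampled at matched times, e.g. audio
    energy vs. pixel-motion magnitude for one person's face region."""
    joint, _, _ = np.histogram2d(audio_feat, visual_feat, bins=bins)
    p = joint / joint.sum()
    pa = p.sum(axis=1, keepdims=True)        # marginal over audio bins
    pv = p.sum(axis=0, keepdims=True)        # marginal over visual bins
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (pa @ pv)[nz])).sum())
```

Scoring the audio track against each candidate's motion signal and taking the maximum identifies the likely speaker, while errant motion or off-screen audio yields low scores.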


Proceedings ArticleDOI
17 May 2004
TL;DR: A system that combines sound and vision to track multiple people, using a particle filter with audio and video state components and deriving observation likelihood methods based on both audio and video measurements.
Abstract: In this paper, we present a system that combines sound and vision to track multiple people. In a cluttered or noisy scene, multi-person tracking estimates have a distinctly non-Gaussian distribution. We apply a particle filter with audio and video state components, and derive observation likelihood methods based on both audio and video measurements. Our state includes the number of people present, their positions, and whether each person is talking. We show experiments in an environment with sparse microphones and monocular cameras. Our results show that our system can accurately track the locations and speech activity of a varying number of people.

88 citations
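
One filter step with fused likelihoods can be sketched as follows, assuming a toy 1-D position state and independent Gaussian observation models per modality; the actual system's state also carries the number of people and per-person speech activity.

```python
import numpy as np

def pf_step(particles, audio_obs, video_obs, rng,
            step=0.1, sigma_a=0.3, sigma_v=0.2):
    """One audio-visual particle filter update for a 1-D position.
    The fused weight is the product of per-modality likelihoods."""
    n = len(particles)
    particles = particles + rng.normal(0.0, step, n)   # random-walk dynamics
    w = (np.exp(-0.5 * ((audio_obs - particles) / sigma_a) ** 2) *
         np.exp(-0.5 * ((video_obs - particles) / sigma_v) ** 2))
    w /= w.sum()
    idx = rng.choice(n, size=n, p=w)                   # multinomial resampling
    return particles[idx]
```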


Proceedings ArticleDOI
13 Oct 2004
TL;DR: A novel approach to visual speech modeling, based on articulatory features, has potential benefits under visually challenging conditions and is evaluated in a preliminary experiment on a small audio-visual database.
Abstract: Visual information has been shown to improve the performance of speech recognition systems in noisy acoustic environments. However, most audio-visual speech recognizers rely on a clean visual signal. In this paper, we explore a novel approach to visual speech modeling, based on articulatory features, which has potential benefits under visually challenging conditions. The idea is to use a set of parallel classifiers to extract different articulatory attributes from the input images, and then combine their decisions to obtain higher-level units, such as visemes or words. We evaluate our approach in a preliminary experiment on a small audio-visual database, using several image noise conditions, and compare it to the standard viseme-based modeling approach.

47 citations
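
The decision-fusion step can be illustrated with a toy attribute inventory. The attributes, viseme definitions, and independence assumption below are hypothetical simplifications, not the paper's actual feature set.

```python
# Hypothetical inventory: each viseme is a combination of two
# articulatory attributes predicted by separate classifiers.
VISEMES = {               # viseme -> (lip opening, lip rounding)
    "p":  ("closed", "none"),
    "aa": ("wide",   "none"),
    "uw": ("narrow", "round"),
}

def viseme_posteriors(p_open, p_round):
    """Combine per-attribute classifier outputs (dicts mapping attribute
    value -> probability) into normalized viseme scores by multiplying
    the matching attribute posteriors."""
    scores = {v: p_open[o] * p_round[r] for v, (o, r) in VISEMES.items()}
    z = sum(scores.values())
    return {v: s / z for v, s in scores.items()}

# Example: confident "closed" lips and "none" rounding favor /p/.
print(viseme_posteriors({"closed": 0.7, "wide": 0.2, "narrow": 0.1},
                        {"none": 0.6, "round": 0.4}))
```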


Proceedings ArticleDOI
24 Apr 2004
TL;DR: This work introduces a point-by-photograph paradigm, where users can specify a location simply by taking pictures, and uses content-based image retrieval methods to search the web or other databases for matching images and their source pages to find relevant location-based information.
Abstract: We demonstrate an image-based approach to specifying location and finding location-based information from camera-equipped mobile devices. We introduce a point-by-photograph paradigm, where users can specify a location simply by taking pictures. Our technique uses content-based image retrieval methods to search the web or other databases for matching images and their source pages to find relevant location-based information. In contrast to conventional approaches to location detection, our method can refer to distant locations and does not require any physical infrastructure beyond mobile internet service. We have developed a prototype on a camera phone and conducted user studies to demonstrate the efficacy of our approach compared to other alternatives.

39 citations


Book ChapterDOI
13 Sep 2004
TL;DR: This paper introduces a point-by-photograph paradigm, where users can specify a location simply by taking pictures, and uses content-based image retrieval methods to search the web or other databases for matching images and their source pages to find relevant location-based information.
Abstract: In this paper we describe an image-based approach to finding location-based information from camera-equipped mobile devices. We introduce a point-by-photograph paradigm, where users can specify a location simply by taking pictures. Our technique uses content-based image retrieval methods to search the web or other databases for matching images and their source pages to find relevant location-based information. In contrast to conventional approaches to location detection, our method can refer to distant locations and does not require any physical infrastructure beyond mobile internet service. We have developed a prototype on a camera phone and conducted user studies to demonstrate the efficacy of our approach.

Proceedings ArticleDOI
24 Apr 2004
TL;DR: This demo describes the ongoing efforts to build a robot that can collaborate with a person in hosting activities and reports on extensions to the robot's existing gestural abilities that enable it to recognize nodding in conversations.
Abstract: In this demo we describe our ongoing efforts to build a robot that can collaborate with a person in hosting activities. We illustrate our current robot's conversations, which include gestures of various types, and report on extensions to the robot's existing gestural abilities that enable it to recognize nodding in conversations.

Proceedings ArticleDOI
13 Oct 2004
TL;DR: The design of a module that detects head pose and gesture cues is presented and examples of its integration in three different conversational agents with varying degrees of discourse model complexity are shown.
Abstract: Head pose and gesture offer several key conversational grounding cues and are used extensively in face-to-face interaction among people. While the machine interpretation of these cues has previously been limited to output modalities, recent advances in face-pose tracking allow for systems which are robust and accurate enough to sense natural grounding gestures. We present the design of a module that detects these cues and show examples of its integration in three different conversational agents with varying degrees of discourse model complexity. Using a scripted discourse model and off-the-shelf animation and speech-recognition components, we demonstrate the use of this module in a novel "conversational tooltip" task, where additional information is spontaneously provided by an animated character when users attend to various physical objects or characters in the environment. We further describe the integration of our module in two systems where animated and robotic characters interact with users based on rich discourse and semantic models.
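
A nod detector over face-pose tracker output can be as simple as the heuristic below, which looks for repeated sign flips of pitch velocity with non-trivial amplitude inside a short window; the thresholds are illustrative, not the module's actual logic.

```python
import numpy as np

def is_nod(pitch_deg, min_amp=3.0, min_flips=4):
    """Heuristic nod test over a short window of head-pitch samples
    (degrees) from a face-pose tracker: nodding produces several
    reversals of pitch velocity with a non-trivial pitch range."""
    v = np.diff(pitch_deg)
    flips = int(np.sum(np.signbit(v[1:]) != np.signbit(v[:-1])))
    return flips >= min_flips and np.ptp(pitch_deg) >= min_amp
```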

Book ChapterDOI
11 May 2004
TL;DR: This work shows that if information about the dynamics of the target is available, the trajectory of the target can be estimated without visible ground planes or overlapping cameras.
Abstract: Recent techniques for multi-camera tracking have relied on either overlap between the fields of view of the cameras or on a visible ground plane. We show that if information about the dynamics of the target is available, we can estimate the trajectory of the target without visible ground planes or overlapping cameras.

Book ChapterDOI
11 May 2004
TL;DR: A novel 3D appearance model using image-based rendering techniques, which can represent complex lighting conditions, structures, and surfaces; it overcomes the limitations of polygonal-based appearance models and uses light fields that are acquired in real-time.
Abstract: Statistical shape and texture appearance models are powerful image representations, but previously had been restricted to 2D or 3D shapes with smooth surfaces and Lambertian reflectance. In this paper we present a novel 3D appearance model using image-based rendering techniques, which can represent complex lighting conditions, structures, and surfaces. We construct a light field manifold capturing the multi-view appearance of an object class and extend the direct search algorithm of Cootes and Taylor to match new light fields or 2D images of an object to a point on this manifold. When matching to a 2D image the reconstructed light field can be used to render unseen views of the object. Our technique differs from previous view-based active appearance models in that model coefficients between views are explicitly linked, and that we do not model any pose variation within the shape model at a single view. It overcomes the limitations of polygonal-based appearance models and uses light fields that are acquired in real-time.

Proceedings ArticleDOI
23 Oct 2004
TL;DR: The space of possible perceptual interface abstractions for full-body navigation of 3-D environments using real-time articulated body tracking with standard cameras and personal computers is analyzed, and a prototype system based on these results is presented.
Abstract: Interacting with and navigating virtual environments usually requires a wired interface, game console, or keyboard. The advent of perceptual interface techniques allows a new option: the passive and untethered sensing of users' pose and gesture to allow them to maneuver through and manipulate virtual worlds. We describe new algorithms for interacting with 3-D environments using real-time articulated body tracking with standard cameras and personal computers. Our method is based on rigid stereo-motion estimation algorithms and can accurately track upper body pose in real-time. With our tracking system users can navigate virtual environments using 3-D gesture and body poses. We analyze the space of possible perceptual interface abstractions for full-body navigation, and present a prototype system based on these results. Finally, we describe an initial evaluation of our prototype system with users guiding avatars through a series of 3-D virtual game worlds.
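
One concrete point in this abstraction space maps tracked upper-body pose to discrete navigation commands. The field names and thresholds below are hypothetical placeholders, not the paper's actual interface.

```python
def navigation_command(pose, lean_thresh=0.15, arm_thresh=0.25):
    """Map an upper-body pose estimate to a navigation command.
    pose: dict with 'torso_lean' (forward lean, meters) and 'arm_dx'
    (lateral hand offset from the shoulder, meters); both hypothetical."""
    if pose["torso_lean"] > lean_thresh:
        return "move_forward"
    if pose["arm_dx"] > arm_thresh:
        return "turn_right"
    if pose["arm_dx"] < -arm_thresh:
        return "turn_left"
    return "idle"
```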

01 Dec 2004
TL;DR: In this article, a nonparametric density model of object shape is learned for the given object class by collecting multi-view silhouette examples from calibrated, though possibly varied, camera rigs.
Abstract: We present a method for estimating the 3D visual hull of an object from a known class given a single silhouette or sequence of silhouettes observed from an unknown viewpoint. A non-parametric density model of object shape is learned for the given object class by collecting multi-view silhouette examples from calibrated, though possibly varied, camera rigs. To infer a 3D shape from a single input silhouette, we search for 3D shapes which maximize the posterior given the observed contour. The input is matched to component single views of the multi-view training examples. A set of viewpoint-aligned virtual views are generated from the visual hulls corresponding to these examples. The most likely visual hull for the input is then found by interpolating between the contours of these aligned views. When the underlying shape is ambiguous given a single view silhouette, we produce multiple visual hull hypotheses; if a sequence of input images is available, a dynamic programming approach is applied to find the maximum likelihood path through the feasible hypotheses over time. We show results of our algorithm on real and synthetic images of people.
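
The temporal disambiguation step is a standard dynamic program. The sketch below runs a Viterbi-style maximization over per-frame hypothesis scores and pairwise continuity scores, assuming for brevity the same number of hypotheses at every frame.

```python
import numpy as np

def best_hypothesis_path(log_lik, log_trans):
    """Maximum-likelihood path through visual-hull hypotheses over time.
    log_lik:   (T, H) score of hypothesis h at frame t given the silhouette
    log_trans: (H, H) shape-continuity score between consecutive hypotheses
    Returns one hypothesis index per frame."""
    T, H = log_lik.shape
    score = log_lik[0].copy()
    back = np.zeros((T, H), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # cand[i, j]: i at t-1 -> j at t
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_lik[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```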

Book ChapterDOI
16 May 2004
TL;DR: A method for estimating the 3D visual hull of an object from a known class given a single silhouette or sequence of silhouettes observed from an unknown viewpoint and results are shown on real and synthetic images of people.
Abstract: We present a method for estimating the 3D visual hull of an object from a known class given a single silhouette or sequence of silhouettes observed from an unknown viewpoint. A non-parametric density model of object shape is learned for the given object class by collecting multi-view silhouette examples from calibrated, though possibly varied, camera rigs. To infer a 3D shape from a single input silhouette, we search for 3D shapes which maximize the posterior given the observed contour. The input is matched to component single views of the multi-view training examples. A set of viewpoint-aligned virtual views are generated from the visual hulls corresponding to these examples. The most likely visual hull for the input is then found by interpolating between the contours of these aligned views. When the underlying shape is ambiguous given a single view silhouette, we produce multiple visual hull hypotheses; if a sequence of input images is available, a dynamic programming approach is applied to find the maximum likelihood path through the feasible hypotheses over time. We show results of our algorithm on real and synthetic images of people.

Book ChapterDOI
16 May 2004
TL;DR: This work combines constituent dynamical systems in a manner similar to a Product of HMMs model, and presents an approximate non-loopy filtering algorithm based on sequential application of Belief Propagation to acyclic subgraphs of the model.
Abstract: Stochastic tracking of structured models in monolithic state spaces often requires modeling complex distributions that are difficult to represent with either parametric or sample-based approaches. We show that if redundant representations are available, the individual state estimates may be improved by combining simpler dynamical systems, each of which captures some aspect of the complex behavior. For example, human body parts may be robustly tracked individually, but the resulting pose combinations may not satisfy articulation constraints. Conversely, the results produced by full-body trackers satisfy such constraints, but such trackers are usually fragile due to the presence of clutter. We combine constituent dynamical systems in a manner similar to a Product of HMMs model. Hidden variables are introduced to represent system appearance. While the resulting model contains loops, making inference hard in general, we present an approximate non-loopy filtering algorithm based on sequential application of Belief Propagation to acyclic subgraphs of the model.
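
In Product-of-HMMs style, the combined filtering distribution is, up to normalization, a product of the constituent filters' posteriors. A sketch of that combination rule in assumed notation (ignoring the hidden appearance variables the paper introduces):

```latex
% K constituent dynamical systems, each filtering the shared state x_t
% from its own observation stream z^k_{1:t}; notation assumed here.
p\bigl(x_t \mid z^{1}_{1:t},\dots,z^{K}_{1:t}\bigr)
  \;\propto\; \prod_{k=1}^{K} p_k\bigl(x_t \mid z^{k}_{1:t}\bigr)
```

Each factor can be a simple tracker (e.g. one body part) while the product enforces agreement, such as articulation constraints, across them; the paper's approximate filter realizes this with Belief Propagation on acyclic subgraphs.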

Patent
19 May 2004
TL;DR: In this paper, a mobile deixis device includes a camera to capture an image and a wireless handheld device coupled to the camera and to a wireless network to communicate the image with existing databases to find similar images.
Abstract: A mobile deixis device includes a camera to capture an image and a wireless handheld device, coupled to the camera and to a wireless network, to communicate the image with existing databases to find similar images. The mobile deixis device further includes a processor, coupled to the device, to process found database records related to similar images and a display to view found database records that include web pages including images. With such an arrangement, users can specify a location of interest simply by pointing a camera-equipped cellular phone at the object of interest; by searching an image database or relevant web resources, they can quickly identify good matches from several close ones to find the object of interest.

28 Jan 2004
TL;DR: This work presents a method for generating a "virtual visual hull": an estimate of the 3D shape of an object from a known class, given a single silhouette observed from an unknown viewpoint.
Abstract: Recovering a volumetric model of a person, car, or other object of interest from a single snapshot would be useful for many computer graphics applications. 3D model estimation in general is hard, and currently requires active sensors, multiple views, or integration over time. For a known object class, however, 3D shape can be successfully inferred from a single snapshot. We present a method for generating a "virtual visual hull": an estimate of the 3D shape of an object from a known class, given a single silhouette observed from an unknown viewpoint. For a given class, a large database of multi-view silhouette examples from calibrated, though possibly varied, camera rigs is collected. To infer a novel single view input silhouette's virtual visual hull, we search for 3D shapes in the database which are most consistent with the observed contour. The input is matched to component single views of the multi-view training examples. A set of viewpoint-aligned virtual views are generated from the visual hulls corresponding to these examples. The 3D shape estimate for the input is then found by interpolating between the contours of these aligned views. When the underlying shape is ambiguous given a single view silhouette, we produce multiple visual hull hypotheses; if a sequence of input images is available, a dynamic programming approach is applied to find the maximum likelihood path through the feasible hypotheses over time. We show results of our algorithm on real and synthetic images of people.


Proceedings ArticleDOI
13 Oct 2004
TL;DR: An audio-visual tracking system that can estimate the location of multiple people, detect the current speaker and build a model of interaction between people in a meeting is demonstrated.
Abstract: We demonstrate an audio-visual tracking system for meeting analysis. A stereo camera and a microphone array are used to track multiple people and their speech activity in real-time. Our system can estimate the location of multiple people, detect the current speaker and build a model of interaction between people in a meeting.