Showing papers by "Trevor Darrell" published in 2004


Proceedings Article
01 Dec 2004
TL;DR: An extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition is proposed, which allows the assumption of conditional independence of the observed data to be relaxed.
Abstract: We present a discriminative part-based approach for the recognition of object classes from unsegmented cluttered scenes. Objects are modeled as flexible constellations of parts conditioned on local observations found by an interest operator. For each object class the probability of a given assignment of parts to local features is modeled by a Conditional Random Field (CRF). We propose an extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition. The parameters of the CRF are estimated in a maximum likelihood framework and recognition proceeds by finding the most likely class under our model. The main advantage of the proposed CRF framework is that it allows us to relax the assumption of conditional independence of the observed data (i.e. local features) often used in generative approaches, an assumption that might be too restrictive for a considerable number of object classes.

428 citations
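
The marginalization at the heart of this model is easy to sketch. Below is a minimal, brute-force rendering of a hidden-variable CRF class posterior in Python, assuming toy dimensions and a simple linear part/feature compatibility score; the paper's actual potentials and structured inference are not reproduced here.

```python
import itertools
import numpy as np

def class_posterior(features, theta_unary, theta_class):
    """Toy hidden-variable CRF: P(y | x) = (1/Z) * sum_h exp(score(y, h, x)).

    features:    (n_local, d) local feature descriptors found by an
                 interest operator
    theta_unary: (n_classes, n_parts, d) part/feature compatibility weights
    theta_class: (n_classes,) class bias terms
    The hidden variable h assigns each local feature to one of n_parts
    parts; marginalizing over h is brute-force here for clarity.
    """
    n_classes, n_parts, _ = theta_unary.shape
    n_local = features.shape[0]
    scores = np.zeros(n_classes)
    for y in range(n_classes):
        total = 0.0
        for h in itertools.product(range(n_parts), repeat=n_local):
            s = theta_class[y] + sum(
                theta_unary[y, h[i]] @ features[i] for i in range(n_local))
            total += np.exp(s)
        scores[y] = total
    return scores / scores.sum()
```

Recognition picks the class with the highest posterior; because h is marginalized out rather than treated as independent per feature, dependent local observations can be scored jointly.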


Patent
22 Jan 2004
TL;DR: In this paper, a mobile deixis device includes a camera to capture an image and a wireless handheld device coupled to the camera and to a wireless network to communicate the image with existing databases to find similar images.
Abstract: A mobile deixis device includes a camera to capture an image and a wireless handheld device, coupled to the camera and to a wireless network, to communicate the image with existing databases to find similar images. The mobile deixis device further includes a processor, coupled to the device, to process found database records related to similar images and a display to view found database records that include web pages including images. With such an arrangement, users can specify a location of interest simply by pointing a camera-equipped cellular phone at the object of interest; by searching an image database or relevant web resources, they can quickly identify good matches from several close ones to find the object of interest.

410 citations


Proceedings ArticleDOI
19 Jul 2004
TL;DR: A method for simultaneously recovering the trajectory of a target and the external calibration parameters of non-overlapping cameras in a multi-camera system with a network of indoor wireless cameras is described.
Abstract: We describe a method for simultaneously recovering the trajectory of a target and the external calibration parameters of non-overlapping cameras in a multi-camera system. Each camera is assumed to measure the location of a moving target within its field of view with respect to the camera's ground-plane coordinate system. Calibrating the network of cameras requires aligning each camera's ground-plane coordinate system with a global ground-plane coordinate system. Prior knowledge about the target's dynamics can compensate for the lack of overlap between the camera fields of view. The target is allowed to move freely with varying speed and direction. We demonstrate the idea with a network of indoor wireless cameras.

231 citations
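
As a rough illustration of the alignment step, the sketch below fits a 2-D rigid transform between one camera's local ground-plane track and a globally predicted trajectory using a least-squares (Kabsch) fit. This assumes time-synchronized point pairs and stands in for the paper's joint estimation of trajectory and calibration.

```python
import numpy as np

def align_camera_to_global(local_pts, global_pts):
    """Estimate the 2-D rigid transform (R, t) mapping a camera's
    ground-plane track onto the globally predicted trajectory.
    local_pts, global_pts: (n, 2) arrays of time-matched positions."""
    mu_l, mu_g = local_pts.mean(axis=0), global_pts.mean(axis=0)
    H = (local_pts - mu_l).T @ (global_pts - mu_g)   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against reflection
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = mu_g - R @ mu_l
    return R, t   # global ~= R @ local + t
```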


Proceedings ArticleDOI
19 Jul 2004
TL;DR: This work presents a contour matching algorithm that quickly computes the minimum weight matching between sets of descriptive local features using a recently introduced low-distortion embedding of the earth mover's distance (EMD) into a normed space.
Abstract: Weighted graph matching is a good way to align a pair of shapes represented by a set of descriptive local features; the set of correspondences produced by the minimum cost matching between two shapes' features often reveals how similar the shapes are. However, due to the complexity of computing the exact minimum cost matching, previous algorithms could only run efficiently when using a limited number of features per shape, and could not scale to perform retrievals from large databases. We present a contour matching algorithm that quickly computes the minimum weight matching between sets of descriptive local features using a recently introduced low-distortion embedding of the earth mover's distance (EMD) into a normed space. Given a novel embedded contour, the nearest neighbors in a database of embedded contours are retrieved in sublinear time via approximate nearest neighbors search with locality-sensitive hashing (LSH). We demonstrate our shape matching method on a database of 136,500 images of human figures. Our method achieves a speedup of four orders of magnitude over the exact method, at the cost of only a 4% reduction in accuracy.

203 citations
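
The embedding that makes this fast can be sketched compactly. Below is a simplified randomly-shifted-grid embedding in the spirit of Indyk and Thorup's construction, under which L1 distance between vectors approximates EMD between point sets (up to the embedding's known distortion); the parameters are illustrative, and the random shift must be shared across all shapes being compared.

```python
import numpy as np

def embed_emd(points, shift, levels=5, extent=1.0):
    """Embed a 2-D point set (e.g. sampled contour points scaled into
    [0, extent)^2) so that L1 distance between embeddings approximates
    EMD.  Counts in randomly shifted grid cells are weighted by cell
    side length; `shift` (a length-2 array in [0, extent)) must be the
    same for every embedded shape."""
    vecs = []
    for lvl in range(levels):
        side = extent / (2 ** lvl)
        n_cells = 2 ** (lvl + 1)                 # room for the shift
        idx = np.floor((points + shift) / side).astype(int)
        hist = np.zeros((n_cells, n_cells))
        for i, j in idx:
            hist[i, j] += 1
        vecs.append(side * hist.ravel())         # finer cells weigh less
    return np.concatenate(vecs)
```

Embedded contours can then be indexed with any L1 nearest-neighbor structure, such as locality-sensitive hashing, which is what yields the sublinear retrieval described above.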


Proceedings ArticleDOI
27 Jun 2004
TL;DR: This work demonstrates the usefulness of common image search metrics applied to images captured with a camera-equipped mobile device to find matching images on the World Wide Web or other general-purpose databases.
Abstract: We describe an approach to recognizing location from mobile devices using image-based Web search. We demonstrate the usefulness of common image search metrics applied to images captured with a camera-equipped mobile device to find matching images on the World Wide Web or other general-purpose databases. Searching the entire Web can be computationally overwhelming, so we devise a hybrid image-and-keyword searching technique. First, an image search is performed over images and links to their source Web pages in a database that indexes only a small fraction of the Web. Then, relevant keywords on these Web pages are automatically identified and submitted to an existing text-based search engine (e.g. Google) that indexes a much larger portion of the Web. Finally, the resulting image set is filtered to retain images close to the original query. It is thus possible to efficiently search hundreds of millions of images that are not only textually related but also visually relevant. We demonstrate our approach on an application allowing users to browse Web pages matching the image of a nearby location.

166 citations
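
The three-stage pipeline reads naturally as pseudocode. Every name below (image_search, extract_keywords, text_engine.search, visual_distance) is a hypothetical stand-in introduced for illustration, not an API from the paper.

```python
def locate_from_photo(query_img, small_index, text_engine, k=10, tau=0.5):
    """Hypothetical sketch of the hybrid image-and-keyword search."""
    # 1. Content-based search over a small indexed fraction of the Web.
    seed_hits = small_index.image_search(query_img, top_k=k)

    # 2. Mine salient keywords from the seed hits' source pages and
    #    query a large text-based engine (e.g. Google) with them.
    keywords = extract_keywords(page for _, page in seed_hits)
    candidate_pages = text_engine.search(keywords)

    # 3. Filter: keep only candidate images visually close to the query.
    return [(img, page)
            for page in candidate_pages
            for img in page.images
            if visual_distance(query_img, img) < tau]
```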



Journal ArticleDOI
TL;DR: A probabilistic multimodal generation model is introduced and used to derive an information theoretic measure of cross-modal correspondence; nonparametric statistical density modeling techniques can characterize the mutual information between signals from different domains.
Abstract: Audio and visual signals arriving from a common source are detected using a signal-level fusion technique. A probabilistic multimodal generation model is introduced and used to derive an information theoretic measure of cross-modal correspondence. Nonparametric statistical density modeling techniques can characterize the mutual information between signals from different domains. By comparing the mutual information between different pairs of signals, it is possible to identify which person is speaking a given utterance and discount errant motion or audio from other utterances or nonspeech events.

125 citations
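
A minimal nonparametric stand-in for this correspondence measure is a histogram-based mutual-information estimate between matched audio and visual feature streams; the paper's density models are richer, but the comparison logic is the same.

```python
import numpy as np

def mutual_information(audio_feat, visual_feat, bins=16):
    """Histogram-based (nonparametric) estimate of I(A; V) in nats
    between 1-D feature streams sampled at matched times, e.g. audio
    energy vs. pixel-motion magnitude for one person's face region."""
    joint, _, _ = np.histogram2d(audio_feat, visual_feat, bins=bins)
    p = joint / joint.sum()
    pa = p.sum(axis=1, keepdims=True)        # marginal over audio bins
    pv = p.sum(axis=0, keepdims=True)        # marginal over visual bins
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (pa @ pv)[nz])).sum())
```

Scoring the audio track against each candidate's motion signal and taking the maximum identifies the likely speaker, while errant motion or off-screen audio yields low scores.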


Proceedings ArticleDOI
17 May 2004
TL;DR: A system that combines sound and vision to track multiple people, using a particle filter with audio and video state components and deriving observation likelihood methods based on both audio and video measurements.
Abstract: In this paper, we present a system that combines sound and vision to track multiple people. In a cluttered or noisy scene, multi-person tracking estimates have a distinctly non-Gaussian distribution. We apply a particle filter with audio and video state components, and derive observation likelihood methods based on both audio and video measurements. Our state includes the number of people present, their positions, and whether each person is talking. We show experiments in an environment with sparse microphones and monocular cameras. Our results show that our system can accurately track the locations and speech activity of a varying number of people.

88 citations
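
One filter step with fused likelihoods can be sketched as follows, assuming a toy 1-D position state and independent Gaussian observation models per modality; the actual system's state also carries the number of people and per-person speech activity.

```python
import numpy as np

def pf_step(particles, audio_obs, video_obs, rng,
            step=0.1, sigma_a=0.3, sigma_v=0.2):
    """One audio-visual particle filter update for a 1-D position.
    The fused weight is the product of per-modality likelihoods."""
    n = len(particles)
    particles = particles + rng.normal(0.0, step, n)   # random-walk dynamics
    w = (np.exp(-0.5 * ((audio_obs - particles) / sigma_a) ** 2) *
         np.exp(-0.5 * ((video_obs - particles) / sigma_v) ** 2))
    w /= w.sum()
    idx = rng.choice(n, size=n, p=w)                   # multinomial resampling
    return particles[idx]
```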


Proceedings ArticleDOI
13 Oct 2004
TL;DR: A novel approach to visual speech modeling, based on articulatory features, has potential benefits under visually challenging conditions and is evaluated in a preliminary experiment on a small audio-visual database.
Abstract: Visual information has been shown to improve the performance of speech recognition systems in noisy acoustic environments. However, most audio-visual speech recognizers rely on a clean visual signal. In this paper, we explore a novel approach to visual speech modeling, based on articulatory features, which has potential benefits under visually challenging conditions. The idea is to use a set of parallel classifiers to extract different articulatory attributes from the input images, and then combine their decisions to obtain higher-level units, such as visemes or words. We evaluate our approach in a preliminary experiment on a small audio-visual database, using several image noise conditions, and compare it to the standard viseme-based modeling approach.

47 citations
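
The decision-fusion step can be illustrated with a toy attribute inventory. The attributes, viseme definitions, and independence assumption below are hypothetical simplifications, not the paper's actual feature set.

```python
# Hypothetical inventory: each viseme is a combination of two
# articulatory attributes predicted by separate classifiers.
VISEMES = {               # viseme -> (lip opening, lip rounding)
    "p":  ("closed", "none"),
    "aa": ("wide",   "none"),
    "uw": ("narrow", "round"),
}

def viseme_posteriors(p_open, p_round):
    """Combine per-attribute classifier outputs (dicts mapping attribute
    value -> probability) into normalized viseme scores by multiplying
    the matching attribute posteriors."""
    scores = {v: p_open[o] * p_round[r] for v, (o, r) in VISEMES.items()}
    z = sum(scores.values())
    return {v: s / z for v, s in scores.items()}

# Example: confident "closed" lips and "none" rounding favor /p/.
print(viseme_posteriors({"closed": 0.7, "wide": 0.2, "narrow": 0.1},
                        {"none": 0.6, "round": 0.4}))
```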


Proceedings ArticleDOI
24 Apr 2004
TL;DR: This work introduces a point-by-photograph paradigm, where users can specify a location simply by taking pictures, and uses content-based image retrieval methods to search the web or other databases for matching images and their source pages to find relevant location-based information.
Abstract: We demonstrate an image-based approach to specifying location and finding location-based information from camera-equipped mobile devices. We introduce a point-by-photograph paradigm, where users can specify a location simply by taking pictures. Our technique uses content-based image retrieval methods to search the web or other databases for matching images and their source pages to find relevant location-based information. In contrast to conventional approaches to location detection, our method can refer to distant locations and does not require any physical infrastructure beyond mobile internet service. We have developed a prototype on a camera phone and conducted user studies to demonstrate the efficacy of our approach compared to other alternatives.

39 citations


Book ChapterDOI
13 Sep 2004
TL;DR: This paper introduces a point-by-photograph paradigm, where users can specify a location simply by taking pictures, and uses content-based image retrieval methods to search the web or other databases for matching images and their source pages to find relevant location-based information.
Abstract: In this paper we describe an image-based approach to finding location-based information from camera-equipped mobile devices. We introduce a point-by-photograph paradigm, where users can specify a location simply by taking pictures. Our technique uses content-based image retrieval methods to search the web or other databases for matching images and their source pages to find relevant location-based information. In contrast to conventional approaches to location detection, our method can refer to distant locations and does not require any physical infrastructure beyond mobile internet service. We have developed a prototype on a camera phone and conducted user studies to demonstrate the efficacy of our approach.

Proceedings ArticleDOI
24 Apr 2004
TL;DR: This demo describes the ongoing efforts to build a robot that can collaborate with a person in hosting activities and reports on extensions to the robot's existing gestural abilities that enable it to recognize nodding in conversations.
Abstract: In this demo we describe our ongoing efforts to build a robot that can collaborate with a person in hosting activities. We illustrate our current robot's conversations, which include gestures of various types, and report on extensions to the robot's existing gestural abilities that enable it to recognize nodding in conversations.

Proceedings ArticleDOI
13 Oct 2004
TL;DR: The design of a module that detects head pose and gesture cues is presented and examples of its integration in three different conversational agents with varying degrees of discourse model complexity are shown.
Abstract: Head pose and gesture offer several key conversational grounding cues and are used extensively in face-to-face interaction among people. While the machine interpretation of these cues has previously been limited to output modalities, recent advances in face-pose tracking allow for systems which are robust and accurate enough to sense natural grounding gestures. We present the design of a module that detects these cues and show examples of its integration in three different conversational agents with varying degrees of discourse model complexity. Using a scripted discourse model and off-the-shelf animation and speech-recognition components, we demonstrate the use of this module in a novel "conversational tooltip" task, where additional information is spontaneously provided by an animated character when users attend to various physical objects or characters in the environment. We further describe the integration of our module in two systems where animated and robotic characters interact with users based on rich discourse and semantic models.
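
A nod detector over face-pose tracker output can be as simple as the heuristic below, which looks for repeated sign flips of pitch velocity with non-trivial amplitude inside a short window; the thresholds are illustrative, not the module's actual logic.

```python
import numpy as np

def is_nod(pitch_deg, min_amp=3.0, min_flips=4):
    """Heuristic nod test over a short window of head-pitch samples
    (degrees) from a face-pose tracker: nodding produces several
    reversals of pitch velocity with a non-trivial pitch range."""
    v = np.diff(pitch_deg)
    flips = int(np.sum(np.signbit(v[1:]) != np.signbit(v[:-1])))
    return flips >= min_flips and np.ptp(pitch_deg) >= min_amp
```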

Book ChapterDOI
11 May 2004
TL;DR: This work shows that if information about the dynamics of the target is available, the trajectory of the target can be estimated without visible ground planes or overlapping cameras.
Abstract: Recent techniques for multi-camera tracking have relied on either overlap between the fields of view of the cameras or on a visible ground plane. We show that if information about the dynamics of the target is available, we can estimate the trajectory of the target without visible ground planes or overlapping cameras.

Book ChapterDOI
11 May 2004
TL;DR: A novel 3D appearance model using image-based rendering techniques, which can represent complex lighting conditions, structures, and surfaces; it overcomes the limitations of polygonal-based appearance models and uses light fields that are acquired in real-time.
Abstract: Statistical shape and texture appearance models are powerful image representations, but previously had been restricted to 2D or 3D shapes with smooth surfaces and Lambertian reflectance. In this paper we present a novel 3D appearance model using image-based rendering techniques, which can represent complex lighting conditions, structures, and surfaces. We construct a light field manifold capturing the multi-view appearance of an object class and extend the direct search algorithm of Cootes and Taylor to match new light fields or 2D images of an object to a point on this manifold. When matching to a 2D image the reconstructed light field can be used to render unseen views of the object. Our technique differs from previous view-based active appearance models in that model coefficients between views are explicitly linked, and that we do not model any pose variation within the shape model at a single view. It overcomes the limitations of polygonal-based appearance models and uses light fields that are acquired in real-time.

Proceedings ArticleDOI
23 Oct 2004
TL;DR: The space of possible perceptual interface abstractions for full-body navigation of 3-D environments using real-time articulated body tracking with standard cameras and personal computers is analyzed, and a prototype system based on these results is presented.
Abstract: Interacting with and navigating virtual environments usually requires a wired interface, game console, or keyboard. The advent of perceptual interface techniques allows a new option: the passive and untethered sensing of users' pose and gesture to allow them to maneuver through and manipulate virtual worlds. We describe new algorithms for interacting with 3-D environments using real-time articulated body tracking with standard cameras and personal computers. Our method is based on rigid stereo-motion estimation algorithms and can accurately track upper body pose in real-time. With our tracking system users can navigate virtual environments using 3-D gesture and body poses. We analyze the space of possible perceptual interface abstractions for full-body navigation, and present a prototype system based on these results. Finally, we describe an initial evaluation of our prototype system with users guiding avatars through a series of 3-D virtual game worlds.
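
One concrete point in this abstraction space maps tracked upper-body pose to discrete navigation commands. The field names and thresholds below are hypothetical placeholders, not the paper's actual interface.

```python
def navigation_command(pose, lean_thresh=0.15, arm_thresh=0.25):
    """Map an upper-body pose estimate to a navigation command.
    pose: dict with 'torso_lean' (forward lean, meters) and 'arm_dx'
    (lateral hand offset from the shoulder, meters); both hypothetical."""
    if pose["torso_lean"] > lean_thresh:
        return "move_forward"
    if pose["arm_dx"] > arm_thresh:
        return "turn_right"
    if pose["arm_dx"] < -arm_thresh:
        return "turn_left"
    return "idle"
```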

01 Dec 2004
TL;DR: In this article, a nonparametric density model of object shape is learned for the given object class by collecting multi-view silhouette examples from calibrated, though possibly varied, camera rigs.
Abstract: We present a method for estimating the 3D visual hull of an object from a known class given a single silhouette or sequence of silhouettes observed from an unknown viewpoint. A non-parametric density model of object shape is learned for the given object class by collecting multi-view silhouette examples from calibrated, though possibly varied, camera rigs. To infer a 3D shape from a single input silhouette, we search for 3D shapes which maximize the posterior given the observed contour. The input is matched to component single views of the multi-view training examples. A set of viewpoint-aligned virtual views are generated from the visual hulls corresponding to these examples. The most likely visual hull for the input is then found by interpolating between the contours of these aligned views. When the underlying shape is ambiguous given a single view silhouette, we produce multiple visual hull hypotheses; if a sequence of input images is available, a dynamic programming approach is applied to find the maximum likelihood path through the feasible hypotheses over time. We show results of our algorithm on real and synthetic images of people.
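
The temporal disambiguation step is a standard dynamic program. The sketch below runs a Viterbi-style maximization over per-frame hypothesis scores and pairwise continuity scores, assuming for brevity the same number of hypotheses at every frame.

```python
import numpy as np

def best_hypothesis_path(log_lik, log_trans):
    """Maximum-likelihood path through visual-hull hypotheses over time.
    log_lik:   (T, H) score of hypothesis h at frame t given the silhouette
    log_trans: (H, H) shape-continuity score between consecutive hypotheses
    Returns one hypothesis index per frame."""
    T, H = log_lik.shape
    score = log_lik[0].copy()
    back = np.zeros((T, H), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # cand[i, j]: i at t-1 -> j at t
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_lik[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```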

Book ChapterDOI
16 May 2004
TL;DR: A method for estimating the 3D visual hull of an object from a known class given a single silhouette or sequence of silhouettes observed from an unknown viewpoint and results are shown on real and synthetic images of people.
Abstract: We present a method for estimating the 3D visual hull of an object from a known class given a single silhouette or sequence of silhouettes observed from an unknown viewpoint. A non-parametric density model of object shape is learned for the given object class by collecting multi-view silhouette examples from calibrated, though possibly varied, camera rigs. To infer a 3D shape from a single input silhouette, we search for 3D shapes which maximize the posterior given the observed contour. The input is matched to component single views of the multi-view training examples. A set of viewpoint-aligned virtual views are generated from the visual hulls corresponding to these examples. The most likely visual hull for the input is then found by interpolating between the contours of these aligned views. When the underlying shape is ambiguous given a single view silhouette, we produce multiple visual hull hypotheses; if a sequence of input images is available, a dynamic programming approach is applied to find the maximum likelihood path through the feasible hypotheses over time. We show results of our algorithm on real and synthetic images of people.

Book ChapterDOI
16 May 2004
TL;DR: This work combines constituent dynamical systems in a manner similar to a Product of HMMs model, and presents an approximate non-loopy filtering algorithm based on sequential application of Belief Propagation to acyclic subgraphs of the model.
Abstract: Stochastic tracking of structured models in monolithic state spaces often requires modeling complex distributions that are difficult to represent with either parametric or sample-based approaches. We show that if redundant representations are available, the individual state estimates may be improved by combining simpler dynamical systems, each of which captures some aspect of the complex behavior. For example, human body parts may be robustly tracked individually, but the resulting pose combinations may not satisfy articulation constraints. Conversely, the results produced by full-body trackers satisfy such constraints, but such trackers are usually fragile due to the presence of clutter. We combine constituent dynamical systems in a manner similar to a Product of HMMs model. Hidden variables are introduced to represent system appearance. While the resulting model contains loops, making inference hard in general, we present an approximate non-loopy filtering algorithm based on sequential application of Belief Propagation to acyclic subgraphs of the model.
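
In Product-of-HMMs style, the combined filtering distribution is, up to normalization, a product of the constituent filters' posteriors. A sketch of that combination rule in assumed notation (ignoring the hidden appearance variables the paper introduces):

```latex
% K constituent dynamical systems, each filtering the shared state x_t
% from its own observation stream z^k_{1:t}; notation assumed here.
p\bigl(x_t \mid z^{1}_{1:t},\dots,z^{K}_{1:t}\bigr)
  \;\propto\; \prod_{k=1}^{K} p_k\bigl(x_t \mid z^{k}_{1:t}\bigr)
```

Each factor can be a simple tracker (e.g. one body part) while the product enforces agreement, such as articulation constraints, across them; the paper's approximate filter realizes this with Belief Propagation on acyclic subgraphs.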

Patent
19 May 2004
TL;DR: In this paper, a mobile deixis device includes a camera to capture an image and a wireless handheld device coupled to the camera and to a wireless network to communicate the image with existing databases to find similar images.
Abstract: A mobile deixis device includes a camera to capture an image and a wireless handheld device, coupled to the camera and to a wireless network, to communicate the image with existing databases to find similar images. The mobile deixis device further includes a processor, coupled to the device, to process found database records related to similar images and a display to view found database records that include web pages including images. With such an arrangement, users can specify a location of interest simply by pointing a camera-equipped cellular phone at the object of interest; by searching an image database or relevant web resources, they can quickly identify good matches from several close ones to find the object of interest.

28 Jan 2004
TL;DR: This work presents a method for generating a "virtual visual hull": an estimate of the 3D shape of an object from a known class, given a single silhouette observed from an unknown viewpoint.
Abstract: Recovering a volumetric model of a person, car, or other object of interest from a single snapshot would be useful for many computer graphics applications. 3D model estimation in general is hard, and currently requires active sensors, multiple views, or integration over time. For a known object class, however, 3D shape can be successfully inferred from a single snapshot. We present a method for generating a "virtual visual hull": an estimate of the 3D shape of an object from a known class, given a single silhouette observed from an unknown viewpoint. For a given class, a large database of multi-view silhouette examples from calibrated, though possibly varied, camera rigs is collected. To infer a novel single view input silhouette's virtual visual hull, we search for 3D shapes in the database which are most consistent with the observed contour. The input is matched to component single views of the multi-view training examples. A set of viewpoint-aligned virtual views are generated from the visual hulls corresponding to these examples. The 3D shape estimate for the input is then found by interpolating between the contours of these aligned views. When the underlying shape is ambiguous given a single view silhouette, we produce multiple visual hull hypotheses; if a sequence of input images is available, a dynamic programming approach is applied to find the maximum likelihood path through the feasible hypotheses over time. We show results of our algorithm on real and synthetic images of people.


Proceedings ArticleDOI
13 Oct 2004
TL;DR: An audio-visual tracking system that can estimate the location of multiple people, detect the current speaker and build a model of interaction between people in a meeting is demonstrated.
Abstract: We demonstrate an audio-visual tracking system for meeting analysis. A stereo camera and a microphone array are used to track multiple people and their speech activity in real-time. Our system can estimate the location of multiple people, detect the current speaker and build a model of interaction between people in a meeting.