
Showing papers by Trevor Darrell published in 2007


Journal ArticleDOI
TL;DR: A discriminative latent variable model for classification problems in structured domains where inputs can be represented by a graph of local observations and a hidden-state conditional random field framework learns a set of latent variables conditioned on local features.
Abstract: We present a discriminative latent variable model for classification problems in structured domains where inputs can be represented by a graph of local observations. A hidden-state conditional random field framework learns a set of latent variables conditioned on local features. Observations need not be independent and may overlap in space and time.
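A rough sketch of the form such a hidden-state model takes (notation assumed here rather than copied from the paper): the class posterior marginalizes over the latent part assignments h,

P(y \mid x; \theta) \;=\; \sum_{\mathbf{h}} P(y, \mathbf{h} \mid x; \theta)
\;=\; \frac{\sum_{\mathbf{h}} \exp\big(\Psi(y, \mathbf{h}, x; \theta)\big)}
           {\sum_{y'} \sum_{\mathbf{h}} \exp\big(\Psi(y', \mathbf{h}, x; \theta)\big)},

where \Psi is a potential function defined over the graph of local observations with parameters \theta.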

578 citations


Proceedings ArticleDOI
17 Jun 2007
TL;DR: A discriminative framework for simultaneous sequence segmentation and labeling which can capture both intrinsic and extrinsic class dynamics and incorporates hidden state variables which model the sub-structure of a class sequence and learn dynamics between class labels.
Abstract: Many problems in vision involve the prediction of a class label for each frame in an unsegmented sequence. In this paper, we develop a discriminative framework for simultaneous sequence segmentation and labeling which can capture both intrinsic and extrinsic class dynamics. Our approach incorporates hidden state variables which model the sub-structure of a class sequence and learn dynamics between class labels. Each class label has a disjoint set of associated hidden states, which enables efficient training and inference in our model. We evaluated our method on the task of recognizing human gestures from unsegmented video streams and performed experiments on three different datasets of head and eye gestures. Our results demonstrate that our model compares favorably to Support Vector Machines, Hidden Markov Models, and Conditional Random Fields on visual gesture recognition tasks.
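As a sketch of the latent-dynamic formulation (notation assumed, not quoted from the paper): each frame label y_j owns a disjoint set of hidden states H_{y_j}, and the model marginalizes a chain CRF over the hidden sequences consistent with the labeling,

P(\mathbf{y} \mid \mathbf{x}; \theta) \;=\; \sum_{\mathbf{h}:\, h_j \in H_{y_j} \,\forall j} P(\mathbf{h} \mid \mathbf{x}; \theta),
\qquad
P(\mathbf{h} \mid \mathbf{x}; \theta) \;=\; \frac{\exp\big(\Psi(\mathbf{h}, \mathbf{x}; \theta)\big)}{\sum_{\mathbf{h}'} \exp\big(\Psi(\mathbf{h}', \mathbf{x}; \theta)\big)}.

The disjointness of the hidden-state sets is what keeps training and inference efficient.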

424 citations


Proceedings ArticleDOI
26 Dec 2007
TL;DR: This work derives a novel active category learning method based on the probabilistic regression model, and shows that a significant boost in classification performance is possible, especially when the amount of training data for a category is ultimately very small.
Abstract: Discriminative methods for visual object category recognition are typically non-probabilistic, predicting class labels but not directly providing an estimate of uncertainty. Gaussian Processes (GPs) are powerful regression techniques with explicit uncertainty models; we show here how Gaussian Processes with covariance functions defined based on a Pyramid Match Kernel (PMK) can be used for probabilistic object category recognition. The uncertainty model provided by GPs offers confidence estimates at test points, and naturally allows for an active learning paradigm in which points are optimally selected for interactive labeling. We derive a novel active category learning method based on our probabilistic regression model, and show that a significant boost in classification performance is possible, especially when the amount of training data for a category is ultimately very small.
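A minimal numpy sketch of the active-learning loop this abstract describes, assuming a precomputed kernel matrix (e.g., a Pyramid Match Kernel) is available. The GP regression formulas are standard; the variance-based selection rule is an assumption here and may differ from the paper's exact criterion.

import numpy as np

def gp_posterior(K_train, y_train, K_cross, K_test_diag, noise=1e-2):
    """Standard GP regression posterior mean and variance with a
    precomputed kernel (e.g., a Pyramid Match Kernel matrix)."""
    n = K_train.shape[0]
    L = np.linalg.cholesky(K_train + noise * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_cross.T @ alpha                      # predictive mean
    v = np.linalg.solve(L, K_cross)
    var = K_test_diag - np.sum(v * v, axis=0)     # predictive variance
    return mean, var

def next_query(K, labeled, unlabeled, y):
    """Pick the unlabeled point with the largest predictive variance
    (one plausible active-selection rule, not necessarily the paper's)."""
    K_train = K[np.ix_(labeled, labeled)]
    K_cross = K[np.ix_(labeled, unlabeled)]
    K_diag = K[unlabeled, unlabeled]
    _, var = gp_posterior(K_train, y[labeled], K_cross, K_diag)
    return unlabeled[int(np.argmax(var))]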

400 citations


Journal Article
TL;DR: The pyramid match maps unordered feature sets to multi-resolution histograms and computes a weighted histogram intersection in order to find implicit correspondences based on the finest resolution histogram cell where a matched pair first appears.
Abstract: In numerous domains it is useful to represent a single example by the set of the local features or parts that comprise it. However, this representation poses a challenge to many conventional machine learning techniques, since sets may vary in cardinality and elements lack a meaningful ordering. Kernel methods can learn complex functions, but a kernel over unordered set inputs must somehow solve for correspondences---generally a computationally expensive task that becomes impractical for large set sizes. We present a new fast kernel function called the pyramid match that measures partial match similarity in time linear in the number of features. The pyramid match maps unordered feature sets to multi-resolution histograms and computes a weighted histogram intersection in order to find implicit correspondences based on the finest resolution histogram cell where a matched pair first appears. We show the pyramid match yields a Mercer kernel, and we prove bounds on its error relative to the optimal partial matching cost. We demonstrate our algorithm on both classification and regression tasks, including object recognition, 3-D human pose inference, and time of publication estimation for documents, and we show that the proposed method is accurate and significantly more efficient than current approaches.
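A toy, hedged sketch of the pyramid-match idea on 1-D features: histogram both sets at progressively coarser resolutions, count the new matches found at each level via histogram intersection, and down-weight matches made in coarser cells. Set-size normalization and the other details of the actual Mercer kernel are omitted.

import numpy as np

def pyramid_match(X, Y, num_levels=5):
    """Toy 1-D pyramid match between two feature sets X and Y."""
    lo = min(X.min(), Y.min())
    hi = max(X.max(), Y.max()) + 1e-9
    score, prev_inter = 0.0, 0.0
    for level in range(num_levels):
        width = (hi - lo) / float(2 ** (num_levels - level))  # finest bins first
        bins = np.arange(lo, hi + width, width)
        hx, _ = np.histogram(X, bins=bins)
        hy, _ = np.histogram(Y, bins=bins)
        inter = np.minimum(hx, hy).sum()          # matches visible at this level
        new = inter - prev_inter                  # matches that first appear here
        score += new / (2.0 ** level)             # finer-level matches count more
        prev_inter = inter
    return score

# e.g. pyramid_match(np.random.rand(40), np.random.rand(55))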

383 citations


Patent
15 Mar 2007
TL;DR: In this patent, a method for classifying or comparing objects includes detecting points of interest within two objects, computing feature descriptors at those points of interest, forming a multi-resolution histogram over the feature descriptors for each object, and computing a weighted intersection of the multi-resolution histograms of the two objects.
Abstract: A method for classifying or comparing objects includes detecting points of interest within two objects, computing feature descriptors at said points of interest, forming a multi-resolution histogram over feature descriptors for each object and computing a weighted intersection of multi-resolution histogram for each object. An alternative embodiment includes a method for matching objects by defining a plurality of bins for multi-resolution histograms having various levels and a plurality of cluster groups, each group having a center, for each point of interest, calculating a bin index, a bin count and a maximal distance to the bin center and providing a path vector indicative of the bins chosen at each level. Still another embodiment includes a method for matching objects comprising creating a set of feature vectors for each object of interest, mapping each set of feature vectors to a single high-dimensional vector to create an embedding vector and encoding each embedding vector with a binary hash string.

218 citations


Proceedings ArticleDOI
20 Jun 2007
TL;DR: This work introduces a method for Gaussian Process Classification using latent variable models trained with discriminative priors over the latent space, which can learn a discriminative latent space from a small training set.
Abstract: Supervised learning is difficult with high dimensional input spaces and very small training sets, but accurate classification may be possible if the data lie on a low-dimensional manifold. Gaussian Process Latent Variable Models can discover low dimensional manifolds given only a small number of examples, but learn a latent space without regard for class labels. Existing methods for discriminative manifold learning (e.g., LDA, GDA) do constrain the class distribution in the latent space, but are generally deterministic and may not generalize well with limited training data. We introduce a method for Gaussian Process Classification using latent variable models trained with discriminative priors over the latent space, which can learn a discriminative latent space from a small training set.

157 citations


Proceedings ArticleDOI
17 Jun 2007
TL;DR: This paper describes a method for learning representations from large quantities of unlabeled images that have associated captions; the method significantly outperforms a fully-supervised baseline model and a model that ignores the captions and learns a visual representation by performing PCA on the unlabeled images alone.
Abstract: Current methods for learning visual categories work well when a large amount of labeled data is available, but can run into severe difficulties when the number of labeled examples is small. When labeled data is scarce it may be beneficial to use unlabeled data to learn an image representation that is low-dimensional, but nevertheless captures the information required to discriminate between image categories. This paper describes a method for learning representations from large quantities of unlabeled images which have associated captions; the goal is to improve learning in future image classification problems. Experiments show that our method significantly outperforms (1) a fully-supervised baseline model, (2) a model that ignores the captions and learns a visual representation by performing PCA on the unlabeled images alone and (3) a model that uses the output of word classifiers trained using captions and unlabeled data. Our current work concentrates on captions as the source of meta-data, but more generally other types of meta-data could be used.

111 citations


Journal ArticleDOI
TL;DR: Using a discriminative approach to contextual prediction and multi-modal integration, head gesture detection performance was improved with context features even when the topic of the test set differed significantly from that of the training set.

101 citations


Proceedings ArticleDOI
17 Jun 2007
TL;DR: A bounded approximate similarity search algorithm that finds (1 + ε)-approximate nearest neighbor images in O(N^(1/(1+ε))) time for a database containing N images represented by (varying numbers of) local features.

Abstract: Matching local features across images is often useful when comparing or recognizing objects or scenes, and efficient techniques for obtaining image-to-image correspondences have been developed [4, 11, 16]. However, given a query image, searching a very large image database with such measures remains impractical. We introduce a sub-linear time randomized hashing algorithm for indexing sets of feature vectors under their partial correspondences. We develop an efficient embedding function for the normalized partial matching similarity between sets, and show how to exploit random hyperplane properties to construct hash functions that satisfy locality-sensitive constraints. The result is a bounded approximate similarity search algorithm that finds (1 + ε)-approximate nearest neighbor images in O(N^(1/(1+ε))) time for a database containing N images represented by (varying numbers of) local features. We demonstrate our approach applied to image retrieval for images represented by sets of local appearance features, and show that searching over correspondences is now scalable to large image databases.
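A hedged numpy sketch of the random-hyperplane hashing ingredient mentioned above, applied to already-embedded set representations; the specific partial-match embedding from the paper is not reproduced, and a real LSH index would bucket the codes to achieve sub-linear search rather than scanning them.

import numpy as np

rng = np.random.default_rng(0)

def hash_bits(v, hyperplanes):
    """Locality-sensitive bits from random hyperplanes: bit k is the sign of
    <v, r_k>; collision probability grows with cosine similarity."""
    return (hyperplanes @ v > 0).astype(np.uint8)

def build_index(embedded_db, n_bits=32):
    d = embedded_db.shape[1]
    hyperplanes = rng.standard_normal((n_bits, d))
    codes = (embedded_db @ hyperplanes.T > 0).astype(np.uint8)
    return hyperplanes, codes

def query(v, hyperplanes, codes, k=5):
    """Approximate nearest neighbors by Hamming distance over hash codes."""
    q = hash_bits(v, hyperplanes)
    dists = np.count_nonzero(codes != q, axis=1)
    return np.argsort(dists)[:k]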

68 citations


Proceedings ArticleDOI
26 Dec 2007
TL;DR: A new efficient algorithm is presented to incrementally compute set-of-trees (forest) vocabulary representations, and it is shown that adaptive vocabularies offer significant performance advantages in comparison to a single, fixed vocabulary.
Abstract: Histogram pyramid representations computed from a vocabulary tree of visual words have proven valuable for a range of image indexing and recognition tasks; however, they have only used a single, fixed partition of feature space. We present a new efficient algorithm to incrementally compute set-of-trees (forest) vocabulary representations, and show that they improve recognition and indexing performance in methods which use histogram pyramids. Our algorithm incrementally adapts a vocabulary forest with an inverted file system at the leaf nodes and automatically keeps existing histogram pyramid database entries up-to-date in a forward file system. It is possible not only to apply vocabulary tree indexing algorithms directly, but also to compute pyramid match kernel values efficiently. On dynamic recognition tasks where categories or objects under consideration may change over time, we show that adaptive vocabularies offer significant performance advantages in comparison to a single, fixed vocabulary.
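A minimal sketch of the inverted-file bookkeeping at the leaf nodes that the abstract mentions (a hypothetical structure for illustration; the incremental forest-adaptation logic itself is not shown).

from collections import defaultdict

class InvertedFile:
    """Toy inverted file over quantized visual words: for each leaf word,
    keep per-image counts so a query only touches images that share at
    least one word with it."""
    def __init__(self):
        self.postings = defaultdict(dict)   # word_id -> {image_id: count}

    def add_image(self, image_id, word_ids):
        for w in word_ids:
            self.postings[w][image_id] = self.postings[w].get(image_id, 0) + 1

    def candidates(self, query_word_ids):
        scores = defaultdict(int)
        for w in set(query_word_ids):
            for image_id, count in self.postings[w].items():
                scores[image_id] += count
        return sorted(scores, key=scores.get, reverse=True)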

61 citations


Journal ArticleDOI
TL;DR: A semisupervised regression algorithm that learns to transform one time series into another given examples of the transformation; applied to tracking, the approach is closely related to nonlinear system identification and manifold learning techniques.

Abstract: We describe a semisupervised regression algorithm that learns to transform one time series into another time series given examples of the transformation. This algorithm is applied to tracking, where a time series of observations from sensors is transformed to a time series describing the pose of a target. Instead of defining and implementing such transformations for each tracking task separately, our algorithm learns a memoryless transformation of time series from a few example input-output mappings. The algorithm searches for a smooth function that fits the training examples and, when applied to the input time series, produces a time series that evolves according to assumed dynamics. The learning procedure is fast and lends itself to a closed-form solution. It is closely related to nonlinear system identification and manifold learning techniques. We demonstrate our algorithm on the tasks of tracking RFID tags from signal strength measurements, and recovering the pose of rigid objects, deformable bodies, and articulated bodies from video sequences. For these tasks, this algorithm requires significantly fewer examples compared to fully supervised regression algorithms or semisupervised learning algorithms that do not take the dynamics of the output time series into account.
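A rough numpy sketch of the flavor of objective described above, assuming a linear map and a first-difference smoothness penalty standing in for the assumed output dynamics; the paper's actual dynamics model and kernelized formulation are not reproduced here.

import numpy as np

def fit_memoryless_map(X_lab, Y_lab, X_seq, lam=1.0, ridge=1e-6):
    """Closed-form linear map W that fits the labeled pairs (X_lab, Y_lab)
    while making the outputs of the unlabeled input sequence X_seq vary
    smoothly in time (first differences as a crude dynamics stand-in)."""
    n_seq, d = X_seq.shape
    D = np.eye(n_seq, k=1)[:-1] - np.eye(n_seq)[:-1]   # first-difference operator
    A = X_lab.T @ X_lab + lam * X_seq.T @ (D.T @ D) @ X_seq + ridge * np.eye(d)
    W = np.linalg.solve(A, X_lab.T @ Y_lab)
    return W   # predict with X_new @ W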

Book ChapterDOI
28 Jun 2007
TL;DR: This paper addresses the problem of fusing visual and acoustic information to predict object categories, when an image of the object and speech input from the user is available to the HRI system, and shows improved classification rates on a dataset containing a wide variety of object categories.
Abstract: Multimodal scene understanding is an integral part of human-robot interaction (HRI) in situated environments. Especially useful is category-level recognition, where the system can recognize classes of objects or scenes rather than specific instances (e.g., any chair vs. this particular chair.) Humans use multiple modalities to understand which object category is being referred to, simultaneously interpreting gesture, speech and visual appearance, and using one modality to disambiguate the information contained in the others. In this paper, we address the problem of fusing visual and acoustic information to predict object categories, when an image of the object and speech input from the user are available to the HRI system. Using probabilistic decision fusion, we show improved classification rates on a dataset containing a wide variety of object categories, compared to using either modality alone.
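A hedged sketch of probabilistic decision fusion over per-category posteriors from independent visual and acoustic classifiers; a weighted log-linear (product) rule is shown for illustration, and the paper's exact fusion scheme may differ.

import numpy as np

def fuse_posteriors(p_visual, p_speech, w=0.5):
    """Fuse per-category posteriors from two modalities under an assumed
    conditional-independence, weighted product rule; renormalize to sum to one."""
    log_p = w * np.log(p_visual + 1e-12) + (1.0 - w) * np.log(p_speech + 1e-12)
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

# e.g. fuse_posteriors(np.array([0.7, 0.2, 0.1]), np.array([0.3, 0.6, 0.1]))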

Book ChapterDOI
28 Jun 2007
TL;DR: A new framework for contextual recognition based on Latent-Dynamic Conditional Random Field (LDCRF) models to learn the sub-structure and external dynamics of contextual cues is proposed.
Abstract: Eye gaze and gesture form key conversational grounding cues that are used extensively in face-to-face interaction among people. To accurately recognize visual feedback during interaction, people often use contextual knowledge from previous and current events to anticipate when feedback is most likely to occur. In this paper, we investigate how dialog context from an embodied conversational agent (ECA) can improve visual recognition of eye gestures. We propose a new framework for contextual recognition based on Latent-Dynamic Conditional Random Field (LDCRF) models to learn the sub-structure and external dynamics of contextual cues. Our experiments show that adding contextual information improves visual recognition of eye gestures and demonstrate that the LDCRF model for context-based recognition of gaze aversion gestures outperforms Support Vector Machines, Hidden Markov Models, and Conditional Random Fields.

Journal ArticleDOI
TL;DR: This work proposes a probabilistic framework for incorporating global dynamics knowledge into the local feature extraction processes and shows the utility of this framework for improving feature tracking and thus shape and motion estimates in a batch factorization algorithm.

Proceedings ArticleDOI
15 Nov 2007
TL;DR: A method for learning about physical objects found in a situated environment based on visual and spoken input provided by the user, which operates on generic databases of labeled object images and transcribed speech data, plus unlabeled audio and images of a user referring to objects in the environment.
Abstract: Object recognition is an important part of human-computer interaction in situated environments, such as a home or an office. Especially useful is category-level recognition (e.g., recognizing the class of chairs, as opposed to a particular chair.) While humans can employ multimodal cues for categorizing objects during situated conversational interactions, most computer algorithms currently rely on vision-only or speech-only recognition. We are developing a method for learning about physical objects found in a situated environment based on visual and spoken input provided by the user. The algorithm operates on generic databases of labeled object images and transcribed speech data, plus unlabeled audio and images of a user referring to objects in the environment. By exploiting the generic labeled databases, the algorithm would associate probable object-referring words with probable visual representations of those objects, and use both modalities to determine the object label. The first advantage of this approach over visual-only or speech-only recognition is the ability to disambiguate object categories using complementary information sources. The second advantage is that, using the additional unlabeled data gathered during the interaction, the system can potentially improve its recognition of new category instances in the physical environment in which it is situated, as well as of new utterances spoken by the same user, compared to a system that uses only the generic labeled databases. It can achieve this by adapting its generic object classifiers and its generic speech and language models.

Proceedings Article
01 Jan 2007
TL;DR: Speech recognition systems are now used in a wide variety of domains and have recently been introduced in cars for hands-free control of radio, cell-phone, and navigation applications.
Abstract: Speech recognition systems are now used in a wide variety of domains. They have recently been introduced in cars for hands-free control of radio, cell-phone and navigation applications. However, due ...

Proceedings ArticleDOI
12 Nov 2007
TL;DR: Preliminary experiments have demonstrated that the speaker's visual information during the system's reply is potentially useful and that the accuracy of automatic detection is close to human performance.
Abstract: Automatic detection of communication errors in conversational systems has been explored extensively in the speech community. However, most previous studies have used only acoustic cues. Visual information has also been used by the speech community to improve speech recognition in dialogue systems, but this visual information is only used when the speaker is communicating vocally. A recent perceptual study indicated that human observers can detect communication problems when they see the visual footage of the speaker during the system's reply. In this paper, we present work in progress towards the development of a communication error detector that exploits this visual cue. In datasets we collected or acquired, facial motion features and head poses were estimated while users were listening to the system response and passed to a classifier for detecting a communication error. Preliminary experiments have demonstrated that the speaker's visual information during the system's reply is potentially useful and accuracy of automatic detection is close to human performance.

Book ChapterDOI
31 Oct 2007
TL;DR: A discriminative approach to contextual prediction and multi-modal integration was able to improve the performance of head gesture detection even when the topic of the test set was significantly different from that of the training set.
Abstract: Head pose and gesture offer several key conversational grounding cues and are used extensively in face-to-face interaction among people. When recognizing visual feedback, people use more than their visual perception. Knowledge about the current topic and expectations from previous utterances help guide our visual perception in recognizing nonverbal cues. In this chapter, we investigate how dialogue context from an embodied conversational agent (ECA) can improve visual recognition of user gestures. We present a recognition framework which (1) extracts contextual features from an ECA’s dialogue manager, (2) computes a prediction of head nod and head shakes, and (3) integrates the contextual predictions with the visual observation of a vision-based head gesture recognizer. We found a subset of lexical, prosodic, timing and gesture features that are easily available in most ECA architectures and can be used to learn how to predict user feedback. Using a discriminative approach to contextual prediction and multi-modal integration, we were able to improve the performance of head gesture detection even when the topic of the test set was significantly different than the training set.