
Showing papers by Trevor Darrell published in 2009


Proceedings Article
07 Dec 2009
TL;DR: An algorithm for learning hash functions based on explicitly minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding binary embeddings is developed.
Abstract: Fast retrieval methods are increasingly critical for many large-scale analysis tasks, and there have been several recent methods that attempt to learn hash functions for fast and accurate nearest neighbor searches. In this paper, we develop an algorithm for learning hash functions based on explicitly minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding binary embeddings. We develop a scalable coordinate-descent algorithm for our proposed hashing objective that is able to efficiently learn hash functions in a variety of settings. Unlike existing methods such as semantic hashing and spectral hashing, our method is easily kernelized and does not require restrictive assumptions about the underlying distribution of the data. We present results over several domains to demonstrate that our method outperforms existing state-of-the-art techniques.

914 citations
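
As a rough illustration of the objective described above, here is a minimal NumPy sketch that scores a linear hash h(x) = sign(Wx) by the squared gap between scaled Euclidean distances and normalized Hamming distances over all pairs. The unit-norm inputs, code length, and random data are assumptions for the demo; the paper's scalable coordinate-descent optimizer is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def hash_codes(X, W):
    """Linear hash h(x) = sign(Wx), mapped to {0, 1} codes."""
    return (X @ W.T > 0).astype(np.int8)

def reconstruction_error(X, W):
    """Sum over pairs of (scaled Euclidean distance - normalized Hamming
    distance)^2 -- the kind of objective the abstract describes. Assumes
    rows of X are unit norm so 0.5 * ||xi - xj||^2 lies in [0, 1]."""
    H, b = hash_codes(X, W), W.shape[0]
    D = 0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    R = np.sum(H[:, None, :] != H[None, :, :], axis=-1) / b
    iu = np.triu_indices(len(X), k=1)          # each pair counted once
    return np.sum((D[iu] - R[iu]) ** 2)

X = rng.normal(size=(50, 16))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs (assumed)
W = rng.normal(size=(8, 16))                   # 8-bit code (assumed)
print(reconstruction_error(X, W))
```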


Proceedings ArticleDOI
14 Jun 2009
TL;DR: A simple and effective projected gradient method for optimizing l1,∞-regularized problems is derived, with results showing that it is effective in discovering jointly sparse solutions.
Abstract: In recent years the l1,∞ norm has been proposed for joint regularization. In essence, this type of regularization aims to extend the l1 framework for learning sparse models to a setting where the goal is to learn a set of jointly sparse models. In this paper we derive a simple and effective projected gradient method for optimization of l1,∞-regularized problems. The main challenge in developing such a method lies in computing efficient projections onto the l1,∞ ball. We present an algorithm that works in O(n log n) time and O(n) memory, where n is the number of parameters. We test our algorithm on a multi-task image annotation problem. Our results show that l1,∞ leads to better performance than both l2 and l1 regularization and that it is effective in discovering jointly sparse solutions.

136 citations
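
The heart of the method is a fast projection step. As a hedged sketch, the snippet below shows the generic projected-gradient pattern with the well-known O(n log n) sort-and-threshold projection onto the plain l1 ball (Duchi et al.); the paper's projection onto the l1,∞ ball extends the same idea to the per-feature maxima of a task-by-feature weight matrix. The step size and iteration count are illustrative assumptions.

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto the l1 ball via sort-and-threshold,
    O(n log n). The paper's l1,inf projection applies the same idea to
    per-feature maxima of a weight matrix."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                     # sorted magnitudes
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)        # shrinkage threshold
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def projected_gradient(grad_fn, w0, radius, step=0.1, iters=200):
    """Generic projected-gradient loop: descend, then project onto the ball."""
    w = w0.copy()
    for _ in range(iters):
        w = project_l1_ball(w - step * grad_fn(w), radius)
    return w

# Toy usage: least squares toward a dense target, constrained to the l1 ball.
target = np.array([3.0, -1.0, 0.5, 0.0, 0.2])
w = projected_gradient(lambda w: w - target, np.zeros(5), radius=1.0)
```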


Proceedings ArticleDOI
20 Jun 2009
TL;DR: This paper proposes an efficient method for concurrent object localization and recognition based on a data-dependent multi-class branch-and-bound formalism, and develops two extensions to non-rectangular bounding regions that achieve higher recognition scores than traditional rectangular bounding boxes.
Abstract: Object localization and recognition are important problems in computer vision. However, in many applications, exhaustive search over all object models and image locations is computationally prohibitive. While several methods have been proposed to make either recognition or localization more efficient, few have dealt with both tasks simultaneously. This paper proposes an efficient method for concurrent object localization and recognition based on a data-dependent multi-class branch-and-bound formalism. Existing bag-of-features recognition techniques that can be expressed as weighted combinations of feature counts can be readily adapted to our method. We present experimental results that demonstrate the merit of our algorithm in terms of recognition accuracy, localization accuracy, and speed, compared to baseline approaches including exhaustive search, the implicit shape model (ISM), and efficient subwindow search (ESS). Moreover, we develop two extensions to consider non-rectangular bounding regions (composite boxes and polygons) and demonstrate their ability to achieve higher recognition scores compared to traditional rectangular bounding boxes.

76 citations
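
For context, the efficient subwindow search (ESS) baseline mentioned above can be sketched compactly: best-first branch-and-bound over sets of rectangles, with an upper bound that counts positive feature weights over the largest box in the set and negative weights over the smallest. This is a single-class sketch under assumed per-pixel weights; the paper's contribution adds the multi-class, data-dependent bound and non-rectangular regions.

```python
import heapq
import numpy as np

def ess(weights):
    """Single-class efficient subwindow search: branch-and-bound over
    rectangles for the best-scoring box, given per-pixel feature weights."""
    H, W = weights.shape
    pos, neg = np.maximum(weights, 0.0), np.minimum(weights, 0.0)
    ip = np.pad(pos.cumsum(0).cumsum(1), ((1, 0), (1, 0)))  # integral images
    im = np.pad(neg.cumsum(0).cumsum(1), ((1, 0), (1, 0)))

    def box_sum(ii, t, b, l, r):          # inclusive pixel coordinates
        if t > b or l > r:
            return 0.0
        return ii[b + 1, r + 1] - ii[t, r + 1] - ii[b + 1, l] + ii[t, l]

    def bound(s):                         # admissible bound over a box set
        t1, t2, b1, b2, l1, l2, r1, r2 = s
        # positives over the largest box, negatives over the smallest
        return box_sum(ip, t1, b2, l1, r2) + box_sum(im, t2, b1, l2, r1)

    s = (0, H - 1, 0, H - 1, 0, W - 1, 0, W - 1)   # all possible boxes
    heap = [(-bound(s), s)]
    while True:
        key, s = heapq.heappop(heap)
        spans = [(s[2 * i], s[2 * i + 1]) for i in range(4)]
        widths = [hi - lo for lo, hi in spans]
        if max(widths) == 0:              # singleton set: bound is exact
            return -key, (s[0], s[2], s[4], s[6])  # score, (t, b, l, r)
        i = int(np.argmax(widths))        # split the widest interval
        lo, hi = spans[i]
        for piece in ((lo, (lo + hi) // 2), ((lo + hi) // 2 + 1, hi)):
            c = list(s)
            c[2 * i], c[2 * i + 1] = piece
            if c[0] <= c[3] and c[4] <= c[7]:      # region set non-empty
                heapq.heappush(heap, (-bound(tuple(c)), tuple(c)))

# Toy usage: find the best box in a random weight map.
score, box = ess(np.random.default_rng(0).normal(size=(24, 32)))
```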


Proceedings Article
07 Dec 2009
TL;DR: This work implements a novel LDA-SIFT formulation which performs LDA prior to any vector quantization step, discovering latent topics that are characteristic of particular transparent patches and quantizing the SIFT space into transparent visual words according to the latent topic dimensions.
Abstract: Existing methods for visual recognition based on quantized local features can perform poorly when local features exist on transparent surfaces, such as glass or plastic objects. There are characteristic patterns to the local appearance of transparent objects, but they may not be well captured by distances to individual examples or by a local pattern codebook obtained by vector quantization. The appearance of a transparent patch is determined in part by the refraction of a background pattern through a transparent medium: the energy from the background usually dominates the patch appearance. We model transparent local patch appearance using an additive model of latent factors: background factors due to scene content, and factors which capture a local edge energy distribution characteristic of the refraction. We implement our method using a novel LDA-SIFT formulation which performs LDA prior to any vector quantization step; we discover latent topics which are characteristic of particular transparent patches and quantize the SIFT space into transparent visual words according to the latent topic dimensions. No knowledge of the background scene is required at test time; we show examples recognizing transparent glasses in a domestic environment.

69 citations
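
A hedged sketch of the "LDA before quantization" idea: treat each non-negative 128-bin SIFT descriptor as a tiny document over gradient-orientation "words", fit LDA, and quantize each descriptor by its dominant topic. The synthetic descriptors and topic count are assumptions; the paper's actual pipeline and training data differ.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical stand-in data: rows are 128-d SIFT descriptors, whose bins
# are non-negative counts and can therefore be read as word counts.
rng = np.random.default_rng(0)
sift = rng.integers(0, 50, size=(1000, 128))

# LDA before any vector quantization: topics play the role of
# transparent visual words.
lda = LatentDirichletAllocation(n_components=20, random_state=0)
theta = lda.fit_transform(sift)          # per-descriptor topic proportions
words = theta.argmax(axis=1)             # quantize by dominant latent topic
```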



Proceedings ArticleDOI
20 Jun 2009
TL;DR: This paper proposes a probabilistic heteroscedastic approach to co-training that discovers the amount of noise on a per-sample basis while solving the classification task, yielding high performance in the presence of occlusion or other complex observation noise processes.
Abstract: Many perception problems involve datasets that are naturally composed of multiple streams or modalities for which supervised training data is only sparsely available. In cases where there is a degree of conditional independence between such views, a class of semi-supervised learning techniques based on maximizing view agreement over unlabeled data has proven successful in a wide range of machine learning domains. However, these 'co-training' or 'multi-view' learning methods have had relatively limited application in vision, due in part to the assumption of constant per-channel noise models. In this paper we propose a probabilistic heteroscedastic approach to co-training that simultaneously discovers the amount of noise on a per-sample basis while solving the classification task. This results in high performance in the presence of occlusion or other complex observation noise processes. We demonstrate our approach in two domains: multi-view object recognition from low-fidelity sensor networks, and audio-visual classification.

42 citations
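
The core intuition, that per-sample noise estimates let a view down-weight itself where it is occluded, can be illustrated with a simple precision-weighted fusion of per-view predictions. This is not the paper's model (which learns the noise jointly with the classifier); the variances here are hand-set for the demo.

```python
import numpy as np

def fuse_views(mu, var):
    """Precision-weighted fusion of per-view predictions.
    mu, var: (n_views, n_samples) per-sample means and noise variances.
    A view that is occluded on a given sample gets a large variance there
    and is effectively ignored -- the heteroscedastic-noise intuition."""
    prec = 1.0 / var
    return (prec * mu).sum(axis=0) / prec.sum(axis=0)

# Toy demo: view 0 is clean everywhere; view 1 is occluded on the second half.
rng = np.random.default_rng(0)
n = 100
truth = np.sign(np.linspace(-1, 1, n))
mu = np.stack([truth + 0.1 * rng.normal(size=n), truth.copy()])
mu[1, n // 2:] = 0.0                           # occluded: uninformative
var = np.ones((2, n)); var[1, n // 2:] = 1e3   # ...and flagged as high-noise
fused = fuse_views(mu, var)                    # tracks the clean view
```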


01 Jan 2009
TL;DR: This work focuses on multiple kernel learning approaches to multi-view learning, which have recently become very popular since they can easily combine information from multiple views, e.g., by adding or multiplying kernels.
Abstract: Multiple kernel learning approaches to multi-view learning [1, 11, 7] have recently become very popular since they can easily combine information from multiple views, e.g., by adding or multiplying kernels. They are particularly effective when the views are class conditionally independent, since the errors committed by each view can be corrected by the other views. Most methods assume that a single set of kernel weights is sufficient for accurate classification; however, one can expect that the set of features important for discriminating between different examples will vary locally. As a result, the performance of such global techniques can degrade in the presence of complex noise processes, e.g., heteroscedastic noise or missing data, or when the discriminative properties vary across the input space.

40 citations
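
A minimal sketch of the additive kernel combination described above: a fixed weighted sum of a linear and an RBF kernel fed to an SVM with a precomputed Gram matrix. The weight and data are hand-set assumptions; learning the weights (globally, or locally as the passage argues for) is the multiple kernel learning problem itself.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + np.sin(X[:, 1]) > 0).astype(int)   # toy labels

w = 0.5                                            # fixed kernel weight (assumed)
K = w * linear_kernel(X) + (1 - w) * rbf_kernel(X, gamma=0.5)
clf = SVC(kernel="precomputed").fit(K, y)          # SVM on the combined Gram
```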


Proceedings ArticleDOI
20 Oct 2009
TL;DR: The classical problem of object recognition in low-power, low-bandwidth distributed camera networks is studied, and it is shown that, across a network of cameras, high-dimensional SIFT histograms share a joint sparse pattern corresponding to a set of common features in 3-D.
Abstract: In this paper, we study the classical problem of object recognition in low-power, low-bandwidth distributed camera networks. The ability to perform robust object recognition is crucial for applications such as visual surveillance, in order to track and identify objects of interest and compensate for visual nuisances such as occlusion and pose variation between multiple camera views. We propose an effective framework for distributed object recognition using a network of smart cameras and a computer as the base station. Due to the limited bandwidth between the cameras and the computer, the method utilizes the available computational power on the smart sensors to locally extract and compress SIFT-type image features to represent individual camera views. In particular, we show that, across a network of cameras, high-dimensional SIFT histograms share a joint sparse pattern corresponding to a set of common features in 3-D. Such joint sparse patterns can be explicitly exploited to accurately encode the distributed signal via random projection, which is unsupervised and independent of the sensor modality. On the base station, we study multiple decoding schemes to simultaneously recover the multiple-view object features based on distributed compressive sensing theory. The system has been implemented on the Berkeley CITRIC smart camera platform. The efficacy of the algorithm is validated through extensive simulation and experiments.

39 citations
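
The encode/decode pattern is easy to sketch: each camera multiplies its sparse feature histogram by a shared random matrix (unsupervised and independent of the sensor modality), and the base station recovers it by l1 minimization. The dimensions, sparsity level, and use of scikit-learn's Lasso as the single-view decoder are assumptions; the paper's joint decoders additionally share support across camera views.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, m, k = 1000, 200, 15                   # histogram dim, measurements, sparsity
x = np.zeros(d)
x[rng.choice(d, size=k, replace=False)] = rng.random(k) + 0.5  # sparse histogram
A = rng.normal(size=(m, d)) / np.sqrt(m)  # shared random projection (encoder)
y = A @ x                                 # what one camera transmits

dec = Lasso(alpha=1e-3, fit_intercept=False, max_iter=50000)
x_hat = dec.fit(A, y).coef_               # base-station l1 recovery
```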


Journal ArticleDOI
TL;DR: It is shown that articulatory feature-based models outperform baseline models, and several aspects of the models are studied, such as the effects of allowing articulatory asynchrony, of using dictionary-based versus whole-word models, and of incorporating classifier outputs via virtual evidence versus alternative observation models.
Abstract: We study the problem of automatic visual speech recognition (VSR) using dynamic Bayesian network (DBN)-based models consisting of multiple sequences of hidden states, each corresponding to an articulatory feature (AF) such as lip opening (LO) or lip rounding (LR). A bank of discriminative articulatory feature classifiers provides input to the DBN, in the form of either virtual evidence (VE) (scaled likelihoods) or raw classifier margin outputs. We present experiments on two tasks, a medium-vocabulary word-ranking task and a small-vocabulary phrase recognition task. We show that articulatory feature-based models outperform baseline models, and we study several aspects of the models, such as the effects of allowing articulatory asynchrony, of using dictionary-based versus whole-word models, and of incorporating classifier outputs via virtual evidence versus alternative observation models.

33 citations
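
A minimal sketch of the virtual-evidence idea: an HMM-style forward pass where each frame's emission term is replaced by scaled likelihoods from external classifiers. The paper's DBNs have multiple coupled articulatory-feature streams; this single-stream toy only shows how classifier outputs enter the inference.

```python
import numpy as np

def forward_with_virtual_evidence(init, trans, ve):
    """HMM-style forward pass in which each frame's emission term is
    'virtual evidence': scaled likelihoods from external classifiers.
    init: (S,) prior; trans: (S, S) transition matrix; ve: (T, S),
    one row of scaled likelihoods per frame."""
    alpha = init * ve[0]
    alpha /= alpha.sum()
    for row in ve[1:]:
        alpha = (alpha @ trans) * row
        alpha /= alpha.sum()              # renormalize for numerical stability
    return alpha                          # filtered state posterior at frame T

# Toy usage: two states (e.g. lip open / closed), five frames.
init = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.2, 0.8]])
ve = np.array([[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.2, 0.8], [0.1, 0.9]])
print(forward_with_virtual_evidence(init, trans, ve))
```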


Proceedings ArticleDOI
20 Jun 2009
TL;DR: A prior over the dimensionality of the latent space that penalizes high-dimensional spaces is introduced, and the resulting method can outperform graph-based dimensionality reduction techniques as well as previously suggested initialization strategies.
Abstract: Discovering the underlying low-dimensional latent structure in high-dimensional perceptual observations (e.g., images, video) can, in many cases, greatly improve performance in recognition and tracking. However, non-linear dimensionality reduction methods are often susceptible to local minima and perform poorly when initialized far from the global optimum, even when the intrinsic dimensionality is known a priori. In this work we introduce a prior over the dimensionality of the latent space that penalizes high dimensional spaces, and simultaneously optimize both the latent space and its intrinsic dimensionality in a continuous fashion. Ad-hoc initialization schemes are unnecessary with our approach; we initialize the latent space to the observation space and automatically infer the latent dimensionality. We report results applying our prior to various probabilistic non-linear dimensionality reduction tasks, and show that our method can outperform graph-based dimensionality reduction techniques as well as previously suggested initialization strategies. We demonstrate the effectiveness of our approach when tracking and classifying human motion.

30 citations
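
A toy, linear version of the "penalize dimensionality, but keep it continuous" idea: put an L1-style penalty on per-dimension scales and let optimization switch dimensions off, which for a linear model reduces to soft-thresholding singular values. The penalty strength is an assumption, and the paper's model and prior are probabilistic and nonlinear; this only conveys the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 10))  # intrinsic dim = 3
Y += 0.01 * rng.normal(size=Y.shape)

# Start with a latent space as large as the data (cf. initializing the latent
# space to the observation space), then let an L1 penalty on per-dimension
# scales switch dimensions off.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
lam = 1.0                                  # penalty strength (assumed)
s_pen = np.maximum(s - lam, 0.0)           # L1 prior => soft thresholding
latent_dim = int((s_pen > 0).sum())        # inferred dimensionality (3 here)
```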


Proceedings ArticleDOI
30 Mar 2009
TL;DR: This work explores the problem of resolving the second person English pronoun you in multi-party dialogue, using a combination of linguistic and visual features, and shows that a multimodal system is often preferable to a unimodal one.
Abstract: We explore the problem of resolving the second person English pronoun you in multi-party dialogue, using a combination of linguistic and visual features. First, we distinguish generic and referential uses, then we classify the referential uses as either plural or singular, and finally, for the latter cases, we identify the addressee. In our first set of experiments, the linguistic and visual features are derived from manual transcriptions and annotations, but in the second set, they are generated through entirely automatic means. Results show that a multimodal system is often preferable to a unimodal one.
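
The pipeline above is a three-stage classifier cascade, which the sketch below mimics with made-up features and labels; scikit-learn logistic regressions stand in for whatever classifiers and fused linguistic/visual features the paper actually used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                    # fused linguistic + visual
y_generic = rng.integers(0, 2, size=300)         # stage 1: generic vs. referential
y_plural = rng.integers(0, 2, size=300)          # stage 2: plural vs. singular
y_addr = rng.integers(0, 3, size=300)            # stage 3: which listener

stage1 = LogisticRegression().fit(X, y_generic)
ref = y_generic == 1                             # only referential uses go on
stage2 = LogisticRegression().fit(X[ref], y_plural[ref])
sing = ref & (y_plural == 0)                     # only singular uses go on
stage3 = LogisticRegression().fit(X[sing], y_addr[sing])
```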

Proceedings Article
07 Dec 2009
TL;DR: An unsupervised method that, given a word, automatically selects non-abstract senses of that word from an online ontology and generates images depicting the corresponding entities and takes as input only the name of an object category.
Abstract: We propose an unsupervised method that, given a word, automatically selects non-abstract senses of that word from an online ontology and generates images depicting the corresponding entities. When faced with the task of learning a visual model based only on the name of an object, a common approach is to find images on the web that are associated with the object name and train a visual classifier from the search results. As words are generally polysemous, this approach can lead to relatively noisy models if many examples from outlier senses are added to the model. We argue that images associated with an abstract word sense should be excluded when training a visual classifier to learn a model of a physical object. While image clustering can group together visually coherent sets of returned images, it can be difficult to distinguish whether an image cluster relates to the desired object or to an abstract sense of the word. We propose a method that uses both image features and the text associated with the images to relate latent topics to particular senses. Our model requires no human supervision and takes as input only the name of an object category. We show results of retrieving concrete-sense images from two available multimodal, multi-sense databases, and we experiment with object classifiers trained on concrete-sense images returned by our method for a set of ten common office objects.
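
One simple, hedged rendering of "select non-abstract senses from an online ontology" uses WordNet's lexicographer files to keep physical-entity noun senses. It assumes the NLTK WordNet data is installed, and the lexname whitelist is an assumption; the paper's topic-model-based selection over image and text features is considerably richer.

```python
from nltk.corpus import wordnet as wn

def concrete_senses(word):
    """Keep noun senses whose WordNet lexicographer file suggests a
    physical entity. A rough stand-in for ontology-based sense selection;
    the whitelist below is an assumption."""
    physical = {"noun.artifact", "noun.object", "noun.animal",
                "noun.plant", "noun.food", "noun.body"}
    return [s for s in wn.synsets(word, pos=wn.NOUN)
            if s.lexname() in physical]

# e.g. concrete_senses("mouse") keeps the rodent and the pointing device,
# while abstract senses like "timid person" (noun.person) are dropped.
```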

Dissertation
01 Jan 2009
TL;DR: A joint sparsity transfer algorithm for image classification, based on the observation that related categories might be learnable using only a small subset of shared relevant features, together with an optimization algorithm whose time and memory complexity is O(n log n), where n is the number of parameters of the joint model.
Abstract: An ideal image classifier should be able to exploit complex high-dimensional feature representations even when only a few labeled examples are available for training. To achieve this goal we develop transfer learning algorithms that: (1) leverage unlabeled data annotated with meta-data; and (2) exploit labeled data from related categories. In the first part of this thesis we show how to use the structure learning framework (Ando and Zhang, 2005) to learn efficient image representations from unlabeled images annotated with meta-data. In the second part we present a joint sparsity transfer algorithm for image classification. Our algorithm is based on the observation that related categories might be learnable using only a small subset of shared relevant features. To find these features we propose to train classifiers jointly with a shared regularization penalty that minimizes the total number of features involved in the approximation. To solve the joint sparse approximation problem we develop an optimization algorithm whose time and memory complexity is O(n log n), with n being the number of parameters of the joint model. We conduct experiments on news-topic and keyword prediction image classification tasks. We test our method in two settings, transfer learning and multitask learning, and show that in both cases leveraging knowledge from related categories can improve performance when training data per category is scarce. Furthermore, our results demonstrate that our model can successfully recover jointly sparse solutions.


01 Jan 2009
TL;DR: This paper describes hierarchical discriminative probabilistic techniques for learning visual object category models; using a Gaussian Process based framework, the method recovers a nested set of object categories with chosen kernel combinations for discrimination at each level of the tree.
Abstract: Recognition of general visual categories requires a diverse set of feature types, but not all are equally relevant to individual categories; efficient recognition arises from learning the potentially sparse features for each class and understanding the relationship between features common to related classes. This paper describes hierarchical discriminative probabilistic techniques for learning visual object category models. Our method recovers a nested set of object categories with chosen kernel combinations for discrimination at each level of the tree. We use a Gaussian Process based framework, with a parameterized sparsity penalty to favor compact classification hierarchies. We exploit structural properties of Gaussian Processes in a multi-class setting to gain computational efficiency and employ evidence maximization to optimally infer kernel weights from training data. Experiments on benchmark datasets show that our hierarchical probabilistic kernel combination scheme offers a benefit in both computational efficiency and performance: we report a significant improvement in accuracy compared to the current best whole-image kernel combination schemes on Caltech 101, as well as a two-order-of-magnitude improvement in efficiency.
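
The evidence-maximization step can be sketched in isolation: compute the Gaussian Process marginal log-likelihood under a weighted sum of base kernels and pick the weight that maximizes it. The regression targets, single-weight grid search, and noise level are assumptions for the demo; the paper optimizes many weights with a sparsity penalty inside a multi-class hierarchy.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

def log_evidence(K, y, noise=1e-2):
    """GP marginal log-likelihood of targets y under Gram matrix K."""
    n = len(y)
    L = np.linalg.cholesky(K + noise * np.eye(n))
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ a - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)   # regression stand-in
Kl, Kr = linear_kernel(X), rbf_kernel(X, gamma=0.5)

# Evidence maximization over one combination weight, on a grid.
best_w = max(np.linspace(0, 1, 21),
             key=lambda w: log_evidence(w * Kl + (1 - w) * Kr, y))
```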