
Showing papers by "Trevor Darrell published in 2010"


Book ChapterDOI
05 Sep 2010
TL;DR: This paper introduces a method that adapts object models acquired in a particular visual domain to new imaging conditions by learning a transformation that minimizes the effect of domain-induced changes in the feature distribution.
Abstract: Domain adaptation is an important emerging topic in computer vision. In this paper, we present one of the first studies of domain shift in the context of object recognition. We introduce a method that adapts object models acquired in a particular visual domain to new imaging conditions by learning a transformation that minimizes the effect of domain-induced changes in the feature distribution. The transformation is learned in a supervised manner and can be applied to categories for which there are no labeled examples in the new domain. While we focus our evaluation on object recognition tasks, the transform-based adaptation technique we develop is general and could be applied to non-image data. Another contribution is a new multi-domain object database, freely available for download. We experimentally demonstrate the ability of our method to improve recognition on categories with few or no target domain labels and moderate to large changes in the imaging conditions.

2,624 citations
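
A minimal sketch of the transform-learning idea, assuming a contrastive-style objective over cross-domain pairs (the pair sampling, hinge margin, and step size here are illustrative, not the paper's exact algorithm):

```python
# Hedged sketch: learn a linear map W so that same-class source/target pairs
# are close and different-class pairs are far under the metric ||W(x_s - x_t)||.
import numpy as np

def learn_transform(Xs, ys, Xt, yt, lr=1e-2, margin=1.0, steps=2000, seed=0):
    """Xs, ys: labeled source features/labels; Xt, yt: the few labeled target examples."""
    rng = np.random.default_rng(seed)
    W = np.eye(Xs.shape[1])
    for _ in range(steps):
        i, j = rng.integers(len(Xs)), rng.integers(len(Xt))
        v = Xs[i] - Xt[j]
        diff = W @ v
        if ys[i] == yt[j]:                       # similar cross-domain pair: pull together
            W -= lr * 2 * np.outer(diff, v)
        elif diff @ diff < margin:               # dissimilar pair inside the margin: push apart
            W += lr * 2 * np.outer(diff, v)
    return W

# The learned metric ||W(x_s - x_t)|| can then drive, e.g., a cross-domain
# nearest-neighbor classifier, including for categories unseen during adaptation.
```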


Proceedings Article
06 Dec 2010
TL;DR: This paper shows that structured sparsity allows the multi-view learning problem to be addressed by alternately solving two convex optimization problems, and that the resulting factorized latent spaces generalize existing approaches by allowing latent dimensions to be shared between any subset of the views rather than only across all of them.
Abstract: Recent approaches to multi-view learning have shown that factorizing the information into parts that are shared across all views and parts that are private to each view could effectively account for the dependencies and independencies between the different input modalities. Unfortunately, these approaches involve minimizing non-convex objective functions. In this paper, we propose an approach to learning such factorized representations inspired by sparse coding techniques. In particular, we show that structured sparsity allows us to address the multi-view learning problem by alternately solving two convex optimization problems. Furthermore, the resulting factorized latent spaces generalize existing approaches in that they allow latent dimensions to be shared between any subset of the views rather than only across all of them. We show that our approach outperforms state-of-the-art methods on the task of human pose estimation.

207 citations
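
The alternating scheme can be sketched as follows; this is a simplification under assumed squared-error data terms and a row-wise group penalty, not the paper's exact formulation:

```python
# Two views X1, X2 are factorized into shared latent codes A and per-view
# dictionaries D1, D2; the group penalty zeroes out whole rows of a dictionary,
# so each latent dimension ends up shared, private to one view, or unused.
import numpy as np

def row_group_shrink(D, lam):
    """Proximal operator of lam * (sum of row L2 norms)."""
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    return D * np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))

def factorize(X1, X2, k=10, lam=0.1, outer_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    A = 0.1 * rng.standard_normal((X1.shape[0], k))      # shared latent codes
    D1 = 0.1 * rng.standard_normal((k, X1.shape[1]))
    D2 = 0.1 * rng.standard_normal((k, X2.shape[1]))
    for _ in range(outer_iters):
        # (1) dictionaries fixed: ridge-regularized least squares for the codes (convex)
        D, X = np.hstack([D1, D2]), np.hstack([X1, X2])
        A = X @ D.T @ np.linalg.inv(D @ D.T + 1e-3 * np.eye(k))
        # (2) codes fixed: proximal-gradient steps on each dictionary (convex)
        step = 1.0 / (np.linalg.norm(A, 2) ** 2 + 1e-6)
        for _ in range(20):
            D1 = row_group_shrink(D1 - step * A.T @ (A @ D1 - X1), step * lam)
            D2 = row_group_shrink(D2 - step * A.T @ (A @ D2 - X2), step * lam)
    return A, D1, D2
```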


Journal ArticleDOI
TL;DR: This work shows that with an appropriate combination of kernels a significant boost in classification performance is possible, and indicates the utility of active learning with probabilistic predictive models, especially when the amount of training data labels that may be sought for a category is ultimately very small.
Abstract: Discriminative methods for visual object category recognition are typically non-probabilistic, predicting class labels but not directly providing an estimate of uncertainty. Gaussian Processes (GPs) provide a framework for deriving regression techniques with explicit uncertainty models; we show here how Gaussian Processes with covariance functions defined based on a Pyramid Match Kernel (PMK) can be used for probabilistic object category recognition. Our probabilistic formulation provides a principled way to learn hyperparameters, which we utilize to learn an optimal combination of multiple covariance functions. It also offers confidence estimates at test points, and naturally allows for an active learning paradigm in which points are optimally selected for interactive labeling. We show that with an appropriate combination of kernels a significant boost in classification performance is possible. Further, our experiments indicate the utility of active learning with probabilistic predictive models, especially when the amount of training data labels that may be sought for a category is ultimately very small.

202 citations
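
The GP machinery itself is standard; a hedged sketch of the recipe (the Pyramid Match Kernel is not implemented here, and the kernels are assumed to be supplied as precomputed matrices) looks like this:

```python
# GP regression on a (possibly weighted-sum) precomputed kernel, plus an
# active-learning step that queries the unlabeled point with the largest
# predictive variance.
import numpy as np

def gp_posterior(K_train, y, K_cross, k_test_diag, noise=1e-2):
    """Standard GP regression equations with precomputed kernel blocks."""
    L = np.linalg.cholesky(K_train + noise * np.eye(len(y)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_cross.T @ alpha
    v = np.linalg.solve(L, K_cross)
    var = k_test_diag - np.sum(v * v, axis=0)
    return mean, var

def next_query(K, labeled, unlabeled, y_labeled):
    """K: kernel matrix over all points (e.g., a weighted sum of per-kernel matrices);
    labeled/unlabeled: index arrays. Returns the index of the point to label next."""
    _, var = gp_posterior(K[np.ix_(labeled, labeled)], y_labeled,
                          K[np.ix_(labeled, unlabeled)], K[unlabeled, unlabeled])
    return unlabeled[int(np.argmax(var))]
```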


Proceedings Article
31 Mar 2010
TL;DR: This paper proposes a robust approach to factorizing the latent space into shared and private spaces by introducing orthogonality constraints, which penalize redundant latent representations.
Abstract: Existing approaches to multi-view learning are particularly effective when the views are either independent (i.e., multi-kernel approaches) or fully dependent (i.e., shared latent spaces). However, in real scenarios, these assumptions are almost never truly satisfied. Recently, two methods have attempted to tackle this problem by factorizing the information and learning separate latent spaces for modeling the shared (i.e., correlated) and private (i.e., independent) parts of the data. However, these approaches are very sensitive to parameter settings or initialization. In this paper, we propose a robust approach to factorizing the latent space into shared and private spaces by introducing orthogonality constraints, which penalize redundant latent representations. Furthermore, unlike previous approaches, we simultaneously learn the structure and dimensionality of the latent spaces by relying on a regularizer that encourages the latent space of each data stream to be low dimensional. To demonstrate the benefits of our approach, we apply it to two existing shared latent space models that assume full dependence of the views, the sGPLVM and the sKIE, and show that our constraints improve the performance of these models on the task of pose estimation from monocular images.

117 citations
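
The two constraints can be written down compactly; the penalty below is a paraphrase of the idea (an orthogonality term between shared and private latent coordinates plus a column-wise penalty encouraging low dimensionality), not the exact energy added to the sGPLVM/sKIE objectives:

```python
# Regularizer sketch: penalize redundancy between shared and private latent
# coordinates, and penalize the number of active latent dimensions per space.
import numpy as np

def factorization_penalties(Z_shared, Z_private_views, alpha=1.0, beta=0.1):
    """Z_shared: (n, d_s) shared latent coords; Z_private_views: list of (n, d_p) arrays."""
    ortho = 0.0
    low_dim = np.sum(np.linalg.norm(Z_shared, axis=0))        # L1/L2 norm over columns
    for Zp in Z_private_views:
        ortho += np.linalg.norm(Z_shared.T @ Zp, 'fro') ** 2  # shared vs. private overlap
        low_dim += np.sum(np.linalg.norm(Zp, axis=0))
    return alpha * ortho + beta * low_dim
```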


Journal ArticleDOI
24 May 2010
TL;DR: In this paper, the authors argue that social network context may be the key for large-scale face recognition to succeed, and they leverage the resources and structure of such social networks to improve face recognition rates on the images shared.
Abstract: Personal photographs are being captured in digital form at an accelerating rate, and our computational tools for searching, browsing, and sharing these photos are struggling to keep pace. One promising approach is automatic face recognition, which would allow photos to be organized by the identities of the individuals they contain. However, achieving accurate recognition at the scale of the Web requires discriminating among hundreds of millions of individuals and would seem to be a daunting task. This paper argues that social network context may be the key for large-scale face recognition to succeed. Many personal photographs are shared on the Web through online social network sites, and we can leverage the resources and structure of such social networks to improve face recognition rates on the images shared. Drawing upon real photo collections from volunteers who are members of a popular online social network, we assess the availability of resources to improve face recognition and discuss techniques for applying these resources.

87 citations


Proceedings ArticleDOI
25 Oct 2010
TL;DR: The idea of growing multimodal location estimation as a research field in the multimedia community is described and a multimedia approach to leverage cues from the visual and the acoustic portions of a video as well as from given metadata is proposed.
Abstract: In this article we define a multimedia content analysis problem, which we call multimodal location estimation: Given a video/image/audio file, the task is to determine where it was recorded. A single indication, such as a unique landmark, might already pinpoint a location precisely. In most cases, however, a combination of evidence from the visual and the acoustic domain will only narrow down the set of possible answers. Therefore, approaches to tackle this task should be inherently multimedia. While the task is hard, in fact sometimes unsolvable, training data can be leveraged from the Internet in large amounts. Moreover, even partially successful automatic estimation of location opens up new possibilities in video content matching, archiving, and organization. It could revolutionize law enforcement and computer-aided intelligence agency work, especially since both semi-automatic and fully automatic approaches would be possible. In this article, we describe our idea of growing multimodal location estimation as a research field in the multimedia community. Based on examples and scenarios, we propose a multimedia approach to leverage cues from the visual and the acoustic portions of a video as well as from given metadata. We also describe experiments to estimate the amount of available training data that could potentially be used as publicly available infrastructure for research in this field. Finally, we present an initial set of results based on acoustic and visual cues and discuss the massive challenges involved and some possible paths to solutions.

42 citations
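
One plausible form of the cue combination (a toy late-fusion sketch, not a method from the article) is to let each modality score the candidate locations and take the best weighted log-score:

```python
import numpy as np

def fuse_location_scores(scores_by_modality, weights=None):
    """scores_by_modality: e.g. {'visual': p_v, 'acoustic': p_a, 'metadata': p_m},
    each a (num_locations,) probability vector over the same candidate list."""
    weights = weights or {m: 1.0 for m in scores_by_modality}
    log_score = sum(weights[m] * np.log(p + 1e-12) for m, p in scores_by_modality.items())
    return int(np.argmax(log_score))   # index of the estimated location
```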


Book ChapterDOI
05 Sep 2010
TL;DR: This paper investigates the problem of exploiting multiple sources of information for object recognition tasks when additional modalities that are not present in the labeled training set are available for inference and makes use of the unlabeled data to learn a mapping from the existing modalities to the new ones.
Abstract: In this paper we investigate the problem of exploiting multiple sources of information for object recognition tasks when additional modalities that are not present in the labeled training set are available for inference. This scenario is common to many robotics sensing applications and is in contrast with the assumption made by existing approaches that require at least some labeled examples for each modality. To leverage the previously unseen features, we make use of the unlabeled data to learn a mapping from the existing modalities to the new ones. This allows us to predict the missing data for the labeled examples and exploit all modalities using multiple kernel learning. We demonstrate the effectiveness of our approach on several multi-modal tasks including object recognition from multi-resolution imagery, grayscale and color images, as well as images and text. Our approach outperforms multiple kernel learning on the original modalities, as well as nearest-neighbor and bootstrapping schemes.

34 citations
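
A hedged sketch of the pipeline, with a fixed kernel weight standing in for the multiple kernel learning step and Ridge regression as one possible choice of mapping:

```python
# Learn a map from the existing modality to the new one on unlabeled paired
# data, fill in the missing modality for the labeled set, and train on a
# combination of one kernel per modality.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def train_with_hallucinated_modality(X_old_lab, y_lab, X_old_unlab, X_new_unlab, w=0.5):
    mapper = Ridge(alpha=1.0).fit(X_old_unlab, X_new_unlab)   # old modality -> new modality
    X_new_lab = mapper.predict(X_old_lab)                     # predicted (missing) features
    K = w * rbf_kernel(X_old_lab) + (1 - w) * rbf_kernel(X_new_lab)
    clf = SVC(kernel='precomputed').fit(K, y_lab)
    return mapper, clf
# At test time, the same weighted cross-kernel between test and training
# examples must be built before calling clf.predict.
```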


Proceedings Article
06 Dec 2010
TL;DR: An efficient metric branch-and-bound algorithm is developed for the search task, imposing 3-D size constraints as part of an optimal search for a set of features which indicate the presence of a category.
Abstract: Metric constraints are known to be highly discriminative for many objects, but if training is limited to data captured from a particular 3-D sensor the quantity of training data may be severely limited. In this paper, we show how a crucial aspect of 3-D information, absolute object and feature size, can be added to models learned from commonly available online imagery, without use of any 3-D sensing or reconstruction at training time. Such models can be utilized at test time together with explicit 3-D sensing to perform robust search. Our model uses a "2.1D" local feature, which combines traditional appearance gradient statistics with an estimate of average absolute depth within the local window. We show how category size information can be obtained from online images by exploiting relatively ubiquitous metadata fields specifying camera intrinsics. We develop an efficient metric branch-and-bound algorithm for our search task, imposing 3-D size constraints as part of an optimal search for a set of features which indicate the presence of a category. Experiments on test scenes captured with a traditional stereo rig are shown, exploiting training data from purely monocular sources with associated EXIF metadata.

17 citations
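
The size-from-metadata step reduces to the pinhole camera model; a worked example with hypothetical values:

```python
# EXIF focal length plus sensor width gives the focal length in pixels;
# combined with a depth estimate, an object's pixel extent becomes metric size.

def focal_length_pixels(focal_mm, sensor_width_mm, image_width_px):
    return focal_mm / sensor_width_mm * image_width_px

def metric_width(pixel_width, depth_m, focal_px):
    """Pinhole model: real size = pixel extent * depth / focal length (in pixels)."""
    return pixel_width * depth_m / focal_px

f_px = focal_length_pixels(focal_mm=35.0, sensor_width_mm=23.6, image_width_px=4288)
print(metric_width(pixel_width=800, depth_m=2.0, focal_px=f_px))   # about 0.25 m
```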


01 Jan 2010
TL;DR: This paper argues that social network context may be the key for large-scale face recognition to succeed, assesses the availability of resources to improve face recognition, and discusses techniques for applying these resources.
Abstract: Personal photographs are being captured in digital form at an accelerating rate, and our computational tools for searching, browsing, and sharing these photos are struggling to keep pace. One promising approach is automatic face recognition, which would allow photos to be organized by the identities of the individuals they contain. However, achieving accurate recognition at the scale of the Web requires discriminating among hundreds of millions of individuals and would seem to be a daunting task. This paper argues that social network context may be the key for large-scale face recognition to succeed. Many personal photographs are shared on the Web through online social network sites, and we can leverage the resources and structure of such social networks to improve face recognition rates on the images shared. Drawing upon real photo collections from volunteers who are members of a popular online social network, we assess the availability of resources to improve face recognition and discuss techniques for applying these resources.

16 citations


01 Jan 2010
TL;DR: This work learns a representation which minimizes the effect of shifting between source and target domains using a novel metric learning approach and demonstrates the ability of the adaptation method to improve performance of classifiers on new domains that have very little labeled data.
Abstract: We propose a method to perform adaptive transfer of visual category knowledge from labeled datasets acquired in one image domain to other environments. We learn a representation which minimizes the effect of shifting between source and target domains using a novel metric learning approach. The key idea of our approach to domain adaptation is to learn a metric that compensates for the transformation of the object representation that occurred due to the domain shift. In addition to being one of the first studies of domain adaptation for object recognition, this work develops a general adaptation technique that could be applied to non-image data. Another contribution is a new image database for studying the effects of visual domain shift on object recognition. We demonstrate the ability of our adaptation method to improve performance of classifiers on new domains that have very little labeled data.

5 citations


01 Jan 2010
TL;DR: This work introduces a method that adapts object models acquired in a particular visual domain to new imaging conditions by learning a transformation which minimizes the effect of domain-induced changes in the feature distribution, and proves that the resulting model may be kernelized to learn non-linear transformations under a variety of regularizers.
Abstract: We introduce a method that adapts object models acquired in a particular visual domain to new imaging conditions by learning a transformation which minimizes the effect of domain-induced changes in the feature distribution. The transformation is learned in a supervised manner, and can be applied to categories unseen at training time. We prove that the resulting model may be kernelized to learn non-linear transformations under a variety of regularizers. In addition to being one of the first studies of domain adaptation for object recognition, this work develops a general theoretical framework for adaptation that could be applied to non-image data. We present a new image database for studying the effects of visual domain shift on object recognition, and demonstrate the ability of our method to improve recognition on categories with few or no target domain labels, moderate to large changes in the imaging conditions, and even changes in the feature representation.

01 Jan 2010
TL;DR: This paper proposes a method to learn shared and private latent spaces that are inherently disjoint by introducing orthogonality constraints, and shows significant performance improvement over the original models, as well as over the existing shared-private factorizations in the context of pose estimation.
Abstract: Many machine learning problems inherently involve multiple views. Kernel combination approaches to multi-view learning [1] are particularly effective when the views are independent. In contrast, other methods take advantage of the dependencies in the data. The best-known example is Canonical Correlation Analysis (CCA), which learns latent representations of the views whose correlation is maximal. Unfortunately, this can result in trivial solutions in the presence of highly correlated noise. Recently, non-linear shared latent variable models that do not suffer from this problem have been proposed: the shared Gaussian process latent variable model (sGPLVM) [4], and the shared kernel information embedding (sKIE) [5]. However, in real scenarios, information in the views is typically neither fully independent nor fully correlated. The few approaches that have tried to factorize the information into shared and private components [2, 3] are typically initialized with CCA, and thus suffer from its inherent weaknesses. In this paper, we propose a method to learn shared and private latent spaces that are inherently disjoint by introducing orthogonality constraints. Furthermore, we discover the structure and dimensionality of the latent representation of each data stream by encouraging it to be low dimensional, while still allowing the data to be generated. Together, these constraints encourage finding factorized latent spaces that are non-redundant, and that can capture the shared-private separation of the data. We demonstrate the effectiveness of our approach by applying it to two existing models, the sGPLVM [4] and the sKIE [5], and show significant performance improvement over the original models, as well as over the existing shared-private factorizations [2, 3] in the context of pose estimation.

01 Jan 2010
TL;DR: In this article, a Content-Based Image Retrieval search using attributes and user feedback is proposed to speed up the reunification of children with their families should they get separated in a disaster.
Abstract: During a disaster, children may be quickly wrenched from their families. Research shows that children in such circumstances are often unable or unwilling to give their names or other identifying information. Currently in the US, there is no existing system in the public health infrastructure that exploits image-based analysis to effectively expedite reunification when children can't be identified. Working with the Children's Hospital Boston, we have engineered a system to speed reunification of children with their families, should they get separated in a disaster. Our system is based on a Content-Based Image Retrieval search using attributes and user feedback. In this thesis we will describe the system and a series of evaluations, including a realistic disaster drill set up and run jointly with the Children's Hospital.
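
A minimal sketch of attribute-based retrieval with relevance feedback, as an illustration of the kind of search described above rather than the deployed system (the weighting scheme and the Rocchio-style update are assumptions):

```python
import numpy as np

def rank(query_attrs, db_attrs, weights):
    """Rank database entries by weighted distance to the query in attribute space."""
    d = np.sqrt((((db_attrs - query_attrs) ** 2) * weights).sum(axis=1))
    return np.argsort(d)                        # best matches first

def update_query(query_attrs, db_attrs, relevant_idx, alpha=0.5):
    """Shift the query toward the results the user marked as relevant."""
    if len(relevant_idx) == 0:
        return query_attrs
    return (1 - alpha) * query_attrs + alpha * db_attrs[relevant_idx].mean(axis=0)
```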