
Showing papers by "Trevor Darrell" published in 2012


Proceedings ArticleDOI
16 Jun 2012
TL;DR: This paper shows that learning more adaptive receptive fields increases performance even with a significantly smaller codebook size at the coding layer, and adopts the idea of over-completeness to learn the optimal pooling parameters.
Abstract: In this paper we examine the effect of receptive field designs on classification accuracy in the commonly adopted pipeline of image classification. While existing algorithms usually use manually defined spatial regions for pooling, we show that learning more adaptive receptive fields increases performance even with a significantly smaller codebook size at the coding layer. To learn the optimal pooling parameters, we adopt the idea of over-completeness by starting with a large number of receptive field candidates, and train a classifier with structured sparsity to only use a sparse subset of all the features. An efficient algorithm based on incremental feature selection and retraining is proposed for fast learning. With this method, we achieve the best published performance on the CIFAR-10 dataset, using a much lower dimensional feature space than previous methods.
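
The fast learning procedure the abstract mentions is built on incremental feature selection with retraining. A minimal Python sketch of that loop follows; the max-pooling features, candidate-region format, and logistic-regression stand-in for the structured-sparsity classifier are all illustrative assumptions, not the authors' pipeline.

```python
# Sketch of greedy receptive-field selection with retraining. Candidate
# pooling regions, features, and the classifier are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pool_region(codes, region):
    """Max-pool coded features over one candidate spatial region.
    codes: (n_images, H, W, K) activations; region: (y0, y1, x0, x1)."""
    y0, y1, x0, x1 = region
    return codes[:, y0:y1, x0:x1, :].max(axis=(1, 2))

def select_receptive_fields(codes, y, candidates, n_select=8):
    """Greedily pick pooling regions, retraining after each addition."""
    pooled = [pool_region(codes, r) for r in candidates]  # precompute once
    chosen, best_feats = [], None
    for _ in range(n_select):
        scores = []
        for i, f in enumerate(pooled):
            if i in chosen:
                scores.append(-np.inf)
                continue
            X = f if best_feats is None else np.hstack([best_feats, f])
            clf = LogisticRegression(max_iter=200).fit(X, y)
            scores.append(clf.score(X, y))  # use held-out data in practice
        best = int(np.argmax(scores))
        chosen.append(best)
        best_feats = pooled[best] if best_feats is None else np.hstack(
            [best_feats, pooled[best]])
    return chosen
```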

284 citations


Journal ArticleDOI
TL;DR: An algorithm is presented which, given a 2D cloth polygon and a desired sequence of folds, outputs a motion plan for executing the corresponding manipulations, deemed g-folds, on a minimal number of robot grippers.
Abstract: We consider the problem of autonomous robotic laundry folding, and propose a solution to the perception and manipulation challenges inherent to the task. At the core of our approach is a quasi-static cloth model which allows us to neglect the complex dynamics of cloth under significant parts of the state space, allowing us to reason instead in terms of simple geometry. We present an algorithm which, given a 2D cloth polygon and a desired sequence of folds, outputs a motion plan for executing the corresponding manipulations, deemed g-folds, on a minimal number of robot grippers. We define parametrized fold sequences for four clothing categories: towels, pants, short-sleeved shirts, and long-sleeved shirts, each represented as polygons. We then devise a model-based optimization approach for visually inferring the class and pose of a spread-out or folded clothing article from a single image, such that the resulting polygon provides a parse suitable for these folding primitives. We test the manipulation and perception tasks individually, and combine them to implement an autonomous folding system on the Willow Garage PR2. This enables the PR2 to identify a clothing article spread out on a table, execute the computed folding sequence, and visually track its progress over successive folds.
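
The geometric core of a g-fold is reflecting the part of the cloth polygon on one side of a fold line across that line. Below is a minimal sketch of that single step, under an assumed (N, 2) vertex-array representation; the full planner, gripper assignment, and quasi-static checks are not reproduced.

```python
# Minimal geometry sketch: apply one fold to a cloth polygon by reflecting
# every vertex on one side of the fold line across it.
import numpy as np

def fold(polygon, p, q):
    """Reflect vertices of `polygon` lying left of the directed line p->q.
    polygon: (N, 2) array of 2D vertices; p, q: points on the fold line."""
    d = (q - p) / np.linalg.norm(q - p)
    out = []
    for v in polygon:
        r = v - p
        if d[0] * r[1] - d[1] * r[0] > 0:    # vertex on the folded side
            along = np.dot(r, d) * d          # component along the fold line
            v = p + 2 * along - r             # mirror across the line
        out.append(v)
    return np.array(out)

towel = np.array([[0., 0.], [1., 0.], [1., 2.], [0., 2.]])
# Fold the towel in half along the horizontal midline y = 1.
print(fold(towel, p=np.array([0., 1.]), q=np.array([1., 1.])))
```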

272 citations


Book ChapterDOI
07 Oct 2012
TL;DR: This paper presents both a novel domain transform mixture model which outperforms a single transform model when multiple domains are present, and a novel constrained clustering method that successfully discovers latent domains.
Abstract: Recent domain adaptation methods successfully learn cross-domain transforms to map points between source and target domains. Yet, these methods are either restricted to a single training domain, or assume that the separation into source domains is known a priori. However, most available training data contains multiple unknown domains. In this paper, we present both a novel domain transform mixture model which outperforms a single transform model when multiple domains are present, and a novel constrained clustering method that successfully discovers latent domains. Our discovery method is based on a novel hierarchical clustering technique that uses available object category information to constrain the set of feasible domain separations. To illustrate the effectiveness of our approach we present experiments on two commonly available image datasets with and without known domain labels: in both cases our method outperforms baseline techniques which use no domain adaptation or domain adaptation methods that presume a single underlying domain shift.
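
As a rough illustration of the transform-mixture idea (not the constrained hierarchical clustering), here is a toy EM-style loop that alternates between assigning paired source/target points to the best-fitting linear transform and refitting each transform by least squares. The linear least-squares form and the initialization are assumptions.

```python
# Toy sketch of a mixture of K linear domain transforms: assign each
# (source, target) pair to the transform that maps it best, then refit.
import numpy as np

def fit_transform_mixture(S, T, K=2, iters=10, seed=0):
    """S, T: (n, d) paired source/target features; returns K (d, d) maps."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(K, size=len(S))
    Ws = [np.eye(S.shape[1]) for _ in range(K)]
    for _ in range(iters):
        for k in range(K):                    # M-step: refit each transform
            idx = assign == k
            if idx.sum() >= S.shape[1]:
                Ws[k], *_ = np.linalg.lstsq(S[idx], T[idx], rcond=None)
        resid = np.stack([np.linalg.norm(S @ W - T, axis=1) for W in Ws])
        assign = resid.argmin(axis=0)         # E-step: reassign pairs
    return Ws, assign
```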

212 citations


Book ChapterDOI
07 Oct 2012
TL;DR: An intermediate representation for deformable part models is developed and it is shown that this representation has favorable performance characteristics for multi-class problems when the number of classes is high and is well suited to a parallel implementation.
Abstract: We develop an intermediate representation for deformable part models and show that this representation has favorable performance characteristics for multi-class problems when the number of classes is high. Our model uses sparse coding of part filters to represent each filter as a sparse linear combination of shared dictionary elements. This leads to a universal set of parts that are shared among all object classes. Reconstruction of the original part filter responses via sparse matrix-vector product reduces computation relative to conventional part filter convolutions. Our model is well suited to a parallel implementation, and we report a new GPU DPM implementation that takes advantage of sparse coding of part filters. The speed-up offered by our intermediate representation and parallel computation enable real-time DPM detection of 20 different object classes on a laptop computer.
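
The computational trick is that if each part filter is a sparse combination of dictionary elements, all filter responses can be recovered from the dictionary's responses by a sparse matrix product. A small sketch with made-up filter and dictionary sizes:

```python
# Encode filters over a small dictionary, convolve with the dictionary once,
# then recover every filter response by a sparse matrix product.
import numpy as np
from sklearn.decomposition import sparse_encode

n_filters, d, n_atoms = 200, 64, 32
filters = np.random.randn(n_filters, d)         # stand-in part filters
dictionary = np.random.randn(n_atoms, d)        # learned in the real system

# Sparse codes alpha: each filter ~ alpha @ dictionary, few nonzeros per row.
alpha = sparse_encode(filters, dictionary, algorithm='omp',
                      n_nonzero_coefs=4)        # (n_filters, n_atoms)

X = np.random.randn(d, 10000)                   # features at sliding windows
dict_responses = dictionary @ X                 # only n_atoms "convolutions"
approx = alpha @ dict_responses                 # all n_filters responses
exact = (alpha @ dictionary) @ X
print(np.allclose(approx, exact))               # reconstruction identity
```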

97 citations


Proceedings Article
03 Dec 2012
TL;DR: This method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time and is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning.
Abstract: In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains 66% better AP than a random ordering, and 14% better performance than an intelligent baseline. On the timeliness measure, our method obtains at least 11% better performance. Our method is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning.
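
The timeliness measure described here is simply the area under the AP-vs-time curve between the start time and the deadline. A minimal sketch, with made-up sample values:

```python
# Area under AP(t) between a start time and a deadline, normalized to [0, 1].
import numpy as np

def timeliness(times, aps, t_start, t_deadline):
    t = np.clip(times, t_start, t_deadline)
    return np.trapz(aps, t) / (t_deadline - t_start)

times = np.array([0.0, 0.2, 0.5, 0.8, 1.0])    # seconds elapsed (toy)
aps = np.array([0.0, 0.21, 0.34, 0.40, 0.42])  # AP after each detector run
print(timeliness(times, aps, t_start=0.0, t_deadline=1.0))
```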

64 citations


Proceedings Article
03 Dec 2012
TL;DR: A deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element, which scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs.
Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous (often more complicated) methods on several vision and speech benchmarks.
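
A minimal sketch of the stacking recursion: each layer trains a linear SVM, and its predictions are pushed through a random projection and folded back into the input of the next layer. The exact recursion used here (adding the projected predictions to the original features), the layer count, and the projection scale are assumptions.

```python
# Stack of linear SVMs connected by random projections of weak predictions.
import numpy as np
from sklearn.svm import LinearSVC

def train_stack(X, y, n_layers=3, seed=0):
    rng = np.random.default_rng(seed)
    layers, feats = [], X
    for _ in range(n_layers):
        svm = LinearSVC(C=1.0, max_iter=5000).fit(feats, y)
        pred = svm.decision_function(feats)
        if pred.ndim == 1:                      # binary case: single score
            pred = pred[:, None]
        R = rng.standard_normal((pred.shape[1], X.shape[1]))
        feats = X + pred @ R                    # randomly project predictions
        layers.append((svm, R))
    return layers                               # replay transforms at test time
```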

62 citations


Proceedings ArticleDOI
16 Jun 2012
TL;DR: This work uses a non-parametric Bayesian regression technique - local Gaussian process regression - to learn for each pixel's narrow-gamut color a probability distribution over the scene colors that could have created it, and shows that these distributions are effective in simple probabilistic adaptations of two popular applications: multi-exposure imaging and photometric stereo.
Abstract: Consumer digital cameras use tone-mapping to produce compact, narrow-gamut images that are nonetheless visually pleasing. In doing so, they discard or distort substantial radiometric signal that could otherwise be used for computer vision. Existing methods attempt to undo these effects through deterministic maps that de-render the reported narrow-gamut colors back to their original wide-gamut sensor measurements. Deterministic approaches are unreliable, however, because the reverse narrow-to-wide mapping is one-to-many and has inherent uncertainty. Our solution is to use probabilistic maps, providing uncertainty estimates useful to many applications. We use a non-parametric Bayesian regression technique — local Gaussian process regression — to learn for each pixel's narrow-gamut color a probability distribution over the scene colors that could have created it. Using a variety of consumer cameras we show that these distributions, once learned from training data, are effective in simple probabilistic adaptations of two popular applications: multi-exposure imaging and photometric stereo. Our results on these applications are better than those of corresponding deterministic approaches, especially for saturated and out-of-gamut colors.
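
A rough sketch of the probabilistic de-rendering step: regress wide-gamut values from narrow-gamut ones with a Gaussian process and report the predictive mean and standard deviation. Fitting one GP on a k-nearest neighborhood stands in for the paper's local GP regression; the toy tone curve and kernel hyperparameters are assumptions.

```python
# Map a narrow-gamut color to a distribution over wide-gamut sensor values.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
narrow = rng.random((500, 3))                  # rendered JPEG colors (toy)
wide = narrow ** 2.2 + 0.01 * rng.standard_normal((500, 3))  # toy "RAW"

def derender(query, k=50):
    """Predictive mean and std of the wide-gamut color for `query`."""
    idx = np.argsort(np.linalg.norm(narrow - query, axis=1))[:k]  # local set
    gp = GaussianProcessRegressor(kernel=RBF(0.1) + WhiteKernel(1e-3))
    gp.fit(narrow[idx], wide[idx])
    mean, std = gp.predict(query[None], return_std=True)
    return mean[0], std[0]

mu, sigma = derender(np.array([0.8, 0.5, 0.3]))
```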

41 citations


Proceedings ArticleDOI
29 Oct 2012
TL;DR: In this paper, the authors propose an image representation called Detection Bank, based on the detection images from a large number of windowed object detectors, in which an image is represented by statistics derived from these detections; the representation is extended to video by aggregating key-frame-level image representations through mean and max pooling.
Abstract: While low-level image features have proven to be effective representations for visual recognition tasks such as object recognition and scene classification, they are inadequate to capture the complex semantic meaning required to solve high-level visual tasks such as multimedia event detection and recognition. Recognition or retrieval of events and activities can be improved if specific discriminative objects are detected in a video sequence. In this paper, we propose an image representation, called Detection Bank, based on the detection images from a large number of windowed object detectors, where an image is represented by different statistics derived from these detections. This representation is extended to video by aggregating the key frame level image representations through mean and max pooling. We empirically show that it captures complementary information to state-of-the-art representations such as Spatial Pyramid Matching and Object Bank. These descriptors combined with our Detection Bank representation significantly outperform any of the representations alone on TRECVID MED 2011 data.
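
A minimal sketch of the Detection Bank construction: per-frame statistics of each detector's scores, then mean and max pooling across key frames. The particular statistics and array shapes are illustrative.

```python
# Per-frame detector-score statistics pooled into a video-level vector.
import numpy as np

def frame_bank(det_scores, thresh=0.0):
    """det_scores: (n_detectors, n_windows) scores for one key frame.
    Returns per-detector [max score, mean score, count above threshold]."""
    return np.concatenate([det_scores.max(axis=1),
                           det_scores.mean(axis=1),
                           (det_scores > thresh).sum(axis=1)])

def video_bank(frames):
    """frames: list of per-frame (n_detectors, n_windows) score arrays."""
    F = np.stack([frame_bank(f) for f in frames])
    return np.concatenate([F.mean(axis=0), F.max(axis=0)])  # mean + max pool
```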

35 citations


Posted Content
TL;DR: In this paper, the authors combine the two approaches by presenting a novel HDP-based topic model that automatically learns both shared and private topics, which is shown to be especially useful for querying the contents of one domain given samples of the other.
Abstract: Multi-modal data collections, such as corpora of paired images and text snippets, require analysis methods beyond single-view component and topic models. For continuous observations the current dominant approach is based on extensions of canonical correlation analysis, factorizing the variation into components shared by the different modalities and those private to each of them. For count data, multiple variants of topic models attempting to tie the modalities together have been presented. All of these, however, lack the ability to learn components private to one modality, and consequently will try to force dependencies even between minimally correlating modalities. In this work we combine the two approaches by presenting a novel HDP-based topic model that automatically learns both shared and private topics. The model is shown to be especially useful for querying the contents of one domain given samples of the other.
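
The HDP model itself is beyond a short sketch, but the shared-versus-private distinction can be illustrated with a crude stand-in: tag tokens by modality, fit a plain LDA on the joint documents, and flag topics whose probability mass concentrates in one modality as "private". The toy corpus, tags, and skew threshold are all assumptions, not the paper's model.

```python
# Heuristic stand-in for shared vs. private topics over a two-modality corpus.
import numpy as np
from gensim import corpora, models

docs = [['img:sky', 'img:grass', 'txt:hiking', 'txt:trail'],
        ['img:sky', 'img:water', 'txt:sailing'],
        ['img:indoor', 'txt:recipe', 'txt:oven']]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=4, passes=50)

topic_word = lda.get_topics()                     # (n_topics, n_terms)
is_img = np.array([dictionary[i].startswith('img:')
                   for i in range(len(dictionary))])
img_mass = topic_word[:, is_img].sum(axis=1)      # per-topic image share
for k, m in enumerate(img_mass):
    kind = 'shared' if 0.2 < m < 0.8 else 'private'
    print(f'topic {k}: image mass {m:.2f} -> {kind}')
```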

28 citations


01 Jan 2012
TL;DR: In this article, the authors describe the evaluation results for the TRECVID 2012 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks as part of the SRI-Sarnoff AURORA system developed under the IARPA ALADDIN program.
Abstract: In this paper, we describe the evaluation results for the TRECVID 2012 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks as a part of the SRI-Sarnoff AURORA system developed under the IARPA ALADDIN program. In the AURORA system, we incorporated various low-level features that capture color, appearance, motion, and audio information in videos. Based on these low-level features, we developed Fixed-Pattern and Object-Oriented spatial feature pooling, which resulted in significant performance improvements to our system. In addition, we collected more than 1800 concepts and designed a set of concept pooling approaches to build the Concept Based Event Representation (CBER, i.e., high-level features). We submitted six runs exploring various fusions of low-level features, high-level features, and ASR/OCR features for the MED task. All runs achieved satisfactory results; in particular, the two EK10Ex runs for both pre-specified events (PS-Events) and ad-hoc events (AH-Events) obtained comparatively strong results. For the MER task, we developed an approach that provides a breakdown of the evidence for why a MED decision was made by examining the SVM-based event detector. Furthermore, we designed evidence-specific verification and detection to reduce uncertainty and improve key-evidence discovery. Our MER evaluation results for MER-to-Event are very good.
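
The MER evidence breakdown described above examines the SVM-based event detector; for a linear SVM over pooled concept scores, a natural reading is to rank per-concept contributions w_i * x_i. A sketch under that assumption, with made-up concepts and weights:

```python
# Rank per-concept contributions to a linear SVM's event-detection score.
import numpy as np

def top_evidence(w, x, concept_names, k=5):
    """Return the k concepts contributing most to the SVM decision score."""
    contrib = w * x                      # per-dimension contribution to w.x
    order = np.argsort(contrib)[::-1][:k]
    return [(concept_names[i], float(contrib[i])) for i in order]

names = ['person', 'bicycle', 'wheel', 'street', 'crowd']
w = np.array([0.2, 1.3, 0.9, 0.1, -0.4])     # event detector weights (toy)
x = np.array([0.8, 0.7, 0.9, 0.3, 0.5])      # video's pooled concept scores
print(top_evidence(w, x, names, k=3))
```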

27 citations


Posted Content
TL;DR: In this article, a conditional entropy criterion is used to detect view disagreement in a multi-view learning approach and samples with view disagreement are filtered and standard multiview learning methods can be successfully applied to the remaining samples.
Abstract: Traditional multi-view learning approaches suffer in the presence of view disagreement, i.e., when samples in each view do not belong to the same class due to view corruption, occlusion, or other noise processes. In this paper we present a multi-view learning approach that uses a conditional entropy criterion to detect view disagreement. Once detected, samples with view disagreement are filtered and standard multi-view learning methods can be successfully applied to the remaining samples. Experimental evaluation on synthetic and audio-visual databases demonstrates that the detection and filtering of view disagreement considerably increases the performance of traditional multi-view learning approaches.
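
Below is a simplified stand-in for the detection-and-filtering step (a proxy, not the paper's conditional-entropy criterion): train one classifier per view and drop the samples whose view posteriors diverge most.

```python
# Filter samples whose two views yield strongly conflicting class posteriors.
import numpy as np
from sklearn.linear_model import LogisticRegression

def filter_view_disagreement(X1, X2, y, keep_frac=0.9):
    p1 = LogisticRegression(max_iter=500).fit(X1, y).predict_proba(X1)
    p2 = LogisticRegression(max_iter=500).fit(X2, y).predict_proba(X2)
    m = 0.5 * (p1 + p2)
    def kl(p, q):  # per-sample KL divergence between posteriors
        return np.sum(p * np.log((p + 1e-12) / (q + 1e-12)), axis=1)
    # Jensen-Shannon-style divergence as the disagreement score.
    disagreement = 0.5 * kl(p1, m) + 0.5 * kl(p2, m)
    keep = disagreement < np.quantile(disagreement, keep_frac)
    return keep                  # train multi-view models on the kept samples
```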

Journal ArticleDOI
TL;DR: This novel image-based reunification system reduced the number of images reviewed before parents identified their children and could be further developed to assist future family reunifications in a disaster.
Abstract: Objectives Reuniting children with their families after a disaster poses unique challenges. The objective was to pilot test the ability of a novel image-based tool to assist a parent in identifying a picture of his or her children. Methods A previously developed image-based indexing and retrieval tool that employs two advanced vision search algorithms was used. One algorithm, Feature-Attribute-Matching, extracts facial features (skin color, eye color, and age) of a photograph and then matches according to parental input. The other algorithm, User-Feedback, allows parents to choose children on the screen that appear similar to theirs and then reprioritizes the images in the database. This was piloted in a convenience sample of parent-child pairs in a pediatric tertiary care hospital. A photograph of each participating child was added to a preexisting image database. A double-blind randomized crossover trial was performed to measure the percentage of database reviewed and time using the Feature-Attribute-Matching-plus-User-Feedback strategy or User-Feedback strategy only. Search results were compared to a theoretical random search. Afterward, parents completed a survey evaluating satisfaction. Results Fifty-one parent-child pairs completed the study. The Feature-Attribute-Matching-plus-User-Feedback strategy was superior to the User-Feedback strategy in decreasing the percentage of database reviewed (mean ± SD = 24.1 ± 20.1% vs. 35.6 ± 27.2%; mean difference = -11.5%; 95% confidence interval [CI] = -21.5% to -1.4%; p = 0.03). Both were superior to the random search (p < 0.001). Time for both searches was similar despite fewer images reviewed in the Feature-Attribute-Matching-plus-User-Feedback strategy. Sixty-eight percent of parents were satisfied with the search and 87% felt that this tool would be very or extremely helpful in a disaster. Conclusions This novel image-based reunification system reduced the number of images reviewed before parents identified their children. This technology could be further developed to assist future family reunifications in a disaster.
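
A rough sketch of the two search strategies: attribute matching ranks the database against parental input, and user feedback boosts images similar to the ones the parent selects. The attribute encoding, similarity measure, and update rule are illustrative assumptions.

```python
# Attribute-based ranking plus a relevance-feedback reprioritization step.
import numpy as np

def attribute_rank(db_attrs, query):
    """db_attrs: (n, d) attribute vectors (e.g., skin/eye color, age bins);
    query: (d,) parental input. Rank by number of matching attributes."""
    return np.argsort(-(db_attrs == query).sum(axis=1))

def feedback_rerank(db_feats, scores, clicked, lr=1.0):
    """Boost scores of images visually similar to the clicked ones.
    db_feats: (n, d) unit-norm visual features; clicked: indices chosen."""
    sim = db_feats @ db_feats[clicked].T          # cosine similarities
    return scores + lr * sim.mean(axis=1)
```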

ReportDOI
14 Nov 2012
TL;DR: It is argued that mid-level representations can bridge the gap between existing low-level models, which are incapable of capturing the structure of interactive verbs, and contemporary high-level schemes, which rely on the output of potentially brittle intermediate detectors and trackers.
Abstract: We argue that mid-level representations can bridge the gap between existing low-level models, which are incapable of capturing the structure of interactive verbs, and contemporary high-level schemes, which rely on the output of potentially brittle intermediate detectors and trackers. We develop a novel descriptor based on generic object foreground segments; our representation forms a histogram-of-gradient representation that is grounded to the frame of detected key-segments. Importantly, our method does not require objects to be identified reliably in order to compute a robust representation. We evaluate an integrated system including novel key-segment activity descriptors on a large-scale video dataset containing 48 common verbs, for which we present a comprehensive evaluation protocol. Our results confirm that a descriptor defined on mid-level primitives, operating at a higher level than local spatio-temporal features but at a lower level than trajectories of detected objects, can provide a substantial improvement relative to either alone or to their combination.
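
A minimal sketch of grounding a histogram-of-gradients descriptor to a key-segment's frame: crop to the segment's bounding box, normalize its size, and compute HOG there. Cell, block, and crop sizes are assumptions.

```python
# HOG descriptor anchored to a foreground key-segment, not the full frame.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def keysegment_hog(frame_gray, mask):
    """frame_gray: (H, W) grayscale frame; mask: (H, W) boolean segment."""
    ys, xs = np.nonzero(mask)
    crop = frame_gray[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    crop = resize(crop, (64, 64))      # normalize the segment's frame of reference
    return hog(crop, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))
```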


01 Mar 2012
TL;DR: This paper presents results from the experimental evaluation for the TRECVID 2011 MED11 (Multimedia Event Detection) task as a part of Team SRI-Sarnoff's AURORA system being developed under the IARPA ALADDIN Program.
Abstract: In this paper, we present results from the experimental evaluation for the TRECVID 2011 MED11 (Multimedia Event Detection) task as a part of Team SRI-Sarnoff's AURORA system being developed under the IARPA ALADDIN Program. Our approach employs two classes of content descriptions for describing videos depicting diverse events: (1) low-level features and their aggregates, and (2) semantic concepts that capture scenes, objects, and atomic actions that are local in space-time. In this presentation we summarize our system design and the content descriptions used. We also present the four MED11 experiments that we submitted, discuss the results, and summarize the lessons learned.

01 Jan 2012
TL;DR: This work presents a system for learning nouns directly from images, using probabilistic predictions generated by visual classifiers as the input to Bayesian word learning, and compares this system to human performance in an automated, large-scale experiment.
Abstract: Learning the meaning of a novel noun from a few labeled objects is one of the simplest aspects of learning a language, but approximating human performance on this task is still a significant challenge for current machine learning systems. Current methods typically fail to find the appropriate level of generalization in a concept hierarchy for a given visual stimulus. Recent work in cognitive science on Bayesian models of word learning partially addresses this challenge, but it assumes that the labels of objects are given (hence no object recognition) and it has only been evaluated in small domains. We present a system for learning nouns directly from images, using probabilistic predictions generated by visual classifiers as the input to Bayesian word learning, and compare this system to human performance in an automated, large-scale experiment. The system captures a significant proportion of the variance in human responses. Combining the uncertain outputs of the visual classifiers with the ability to identify an appropriate level of abstraction that comes from Bayesian word learning allows the system to outperform alternatives that either cannot deal with visual stimuli or use a more conventional computer vision approach.