
Papers by Antonio Torralba published in 2006


Journal ArticleDOI
TL;DR: An original approach to attentional guidance by global scene context is presented that combines bottom-up saliency, scene context, and top-down mechanisms at an early stage of visual processing and predicts the image regions likely to be fixated by human observers performing natural search tasks in real-world scenes.

Abstract: Many experiments have shown that the human visual system makes extensive use of contextual information for facilitating object search in natural scenes. However, the question of how to formally model contextual influences is still open. On the basis of a Bayesian framework, the authors present an original approach to attentional guidance by global scene context. The model comprises two parallel pathways: one computes local features (saliency), the other global (scene-centered) features. The contextual guidance model of attention combines bottom-up saliency, scene context, and top-down mechanisms at an early stage of visual processing and predicts the image regions likely to be fixated by human observers performing natural search tasks in real-world scenes.
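The combination the model describes can be sketched directly: a bottom-up saliency map is modulated pointwise by a context prior over locations predicted from global scene features. A minimal sketch, assuming toy maps and an illustrative saliency exponent (the paper derives the actual weighting from its Bayesian formulation):

```python
import numpy as np

def contextual_guidance(saliency, context_prior, gamma=0.5):
    """Combine bottom-up saliency with a scene-context prior.

    saliency      : 2D array of local-feature saliency values.
    context_prior : 2D array, prior over target location given global
                    scene features.
    gamma         : exponent that down-weights saliency (value here is
                    illustrative, not the paper's fitted parameter).
    """
    combined = (saliency ** gamma) * context_prior
    return combined / combined.sum()          # normalize to a distribution

# Toy example: a random saliency map plus a prior favoring the lower half
# of the image (e.g., searching for pedestrians in a street scene).
rng = np.random.default_rng(0)
sal = rng.random((8, 8))
prior = np.vstack([np.full((4, 8), 0.2), np.full((4, 8), 0.8)])
fixation_map = contextual_guidance(sal, prior)
print(fixation_map.argmax())  # flat index of the most likely fixation
```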

1,613 citations


Book ChapterDOI
TL;DR: It is shown that the structure of a scene image can be estimated by the mean of global image features, providing a statistical summary of the spatial layout properties (Spatial Envelope representation) of the scene.
Abstract: Humans can recognize the gist of a novel image in a single glance, independent of its complexity. How is this remarkable feat accomplished? On the basis of behavioral and computational evidence, this paper describes a formal approach to the representation and the mechanism of scene gist understanding, based on scene-centered, rather than object-centered primitives. We show that the structure of a scene image can be estimated by the mean of global image features, providing a statistical summary of the spatial layout properties (Spatial Envelope representation) of the scene. Global features are based on configurations of spatial scales and are estimated without invoking segmentation or grouping operations. The scene-centered approach is not an alternative to local image analysis but would serve as a feed-forward and parallel pathway of visual processing, able to quickly constrain local feature analysis and enhance object recognition in cluttered natural scenes.
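Global features of this kind are typically computed by pooling oriented filter energy over a coarse spatial grid, with no segmentation or grouping. A minimal sketch in that spirit, assuming a small real Gabor bank and a 4x4 pooling grid (the paper's descriptor uses a tuned multiscale filter bank):

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(theta, freq=0.25, sigma=3.0, size=15):
    """Real-valued Gabor kernel at orientation theta (radians)."""
    ax = np.arange(size) - size // 2
    x, y = np.meshgrid(ax, ax)
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * freq * xr)

def global_features(image, n_orientations=4, grid=4):
    """Pool oriented filter energy over a coarse grid: no segmentation,
    no grouping, just a statistical summary of the spatial layout."""
    h, w = image.shape
    feats = []
    for i in range(n_orientations):
        theta = i * np.pi / n_orientations
        energy = np.abs(fftconvolve(image, gabor_kernel(theta), mode="same"))
        # Average the energy inside each cell of a grid x grid partition.
        cells = energy[: h - h % grid, : w - w % grid]
        cells = cells.reshape(grid, h // grid, grid, w // grid).mean(axis=(1, 3))
        feats.append(cells.ravel())
    return np.concatenate(feats)   # n_orientations * grid * grid values

img = np.random.default_rng(1).random((64, 64))
print(global_features(img).shape)  # (64,)
```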

1,468 citations


Book ChapterDOI
TL;DR: Current datasets are lacking in several respects, and this paper discusses some of the lessons learned from existing efforts, as well as innovative ways to obtain very large and diverse annotated datasets.
Abstract: Appropriate datasets are required at all stages of object recognition research, including learning visual models of object and scene categories, detecting and localizing instances of these models in images, and evaluating the performance of recognition algorithms. Current datasets are lacking in several respects, and this paper discusses some of the lessons learned from existing efforts, as well as innovative ways to obtain very large and diverse annotated datasets. It also suggests a few criteria for gathering future datasets.

250 citations


Book ChapterDOI
TL;DR: This work shows that combining local and global features of the image yields significantly improved detection rates, and since the gist is much cheaper to compute than most local detectors, it can also bring a large increase in speed.
Abstract: Traditional approaches to object detection only look at local pieces of the image, whether it be within a sliding window or the regions around an interest point detector. However, such local pieces can be ambiguous, especially when the object of interest is small, or imaging conditions are otherwise unfavorable. This ambiguity can be reduced by using global features of the image — which we call the “gist” of the scene — as an additional source of evidence. We show that by combining local and global features, we get significantly improved detection rates. In addition, since the gist is much cheaper to compute than most local detectors, we can potentially gain a large increase in speed as well.
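One simple way to realize this combination is to rescore each candidate window by a location prior predicted from the gist. The sketch below assumes a logistic combination with hand-set weights; in the paper the coupling between gist and detector output is learned from data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rescore_detections(local_scores, ys, gist_prior, w_local=1.0, w_gist=1.0, b=0.0):
    """Rescore candidate windows by combining a local detector score
    with a gist-based prior over vertical position.

    local_scores : per-window scores from any sliding-window detector.
    ys           : normalized vertical center (0..1) of each window.
    gist_prior   : callable mapping y -> prior that the object appears
                   at that height, predicted from global scene features.
    Weights and bias are illustrative; they would be learned.
    """
    prior = np.array([gist_prior(y) for y in ys])
    return sigmoid(w_local * local_scores + w_gist * np.log(prior + 1e-9) + b)

# Toy usage: the gist says "cars sit in the lower part of this street scene".
scores = np.array([0.8, 0.8])          # two equally confident local detections
heights = np.array([0.2, 0.75])        # one near the top, one near the bottom
prior = lambda y: 0.9 if y > 0.5 else 0.1
print(rescore_detections(scores, heights, prior))  # the bottom window wins
```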

171 citations


Journal ArticleDOI
01 Jul 2006
TL;DR: It is shown that, by taking perceptual grouping mechanisms into account, it is possible to build hybrid images with stable percepts at each distance, creating compelling displays in which the image appears to change as the viewing distance changes.
Abstract: We present hybrid images, a technique that produces static images with two interpretations, which change as a function of viewing distance. Hybrid images are based on the multiscale processing of images by the human visual system and are motivated by masking studies in visual perception. These images can be used to create compelling displays in which the image appears to change as the viewing distance changes. We show that by taking into account perceptual grouping mechanisms it is possible to build compelling hybrid images with stable percepts at each distance. We show examples in which hybrid images are used to create textures that become visible only when seen up-close, to generate facial expressions whose interpretation changes with viewing distance, and to visualize changes over time within a single picture.
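The construction itself reduces to summing the low-pass band of one image with the high-pass band of another. A minimal sketch, assuming Gaussian filters and illustrative cutoffs (the paper selects cutoffs from masking data and perceptual grouping considerations):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hybrid_image(far_img, near_img, sigma_low=6.0, sigma_high=3.0):
    """Blend the low frequencies of far_img (dominant from afar) with the
    high frequencies of near_img (dominant up close).

    sigma_low and sigma_high set the two cutoffs; the gap between the
    bands controls how cleanly the two percepts separate.
    """
    low = gaussian_filter(far_img, sigma_low)                 # low-pass
    high = near_img - gaussian_filter(near_img, sigma_high)   # high-pass
    return np.clip(low + high, 0.0, 1.0)   # assumes inputs scaled to [0, 1]

rng = np.random.default_rng(2)
a, b = rng.random((128, 128)), rng.random((128, 128))
print(hybrid_image(a, b).shape)  # (128, 128)
```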

132 citations


02 Sep 2006
TL;DR: In this article, a random lens is defined as one for which the function relating the input light ray to the output sensor location is pseudo-random, and two machine learning methods are compared for both camera calibration and image reconstruction.
Abstract: We call a random lens one for which the function relating the input light ray to the output sensor location is pseudo-random. Imaging systems with random lenses can expand the space of possible camera designs, allowing new trade-offs in optical design and potentially adding new imaging capabilities. Machine learning methods are critical for both camera calibration and image reconstruction from the sensor data. We develop the theory and compare two different methods for calibration and reconstruction: an MAP approach, and basis pursuit from compressive sensing [5]. We show proof-of-concept experimental results from a random lens made from a multi-faceted mirror, showing successful calibration and image reconstruction. We illustrate the potential for super-resolution and 3D imaging.
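With the measurement matrix of the random lens calibrated, reconstruction is a sparse linear inverse problem y = Ax. The sketch below solves the basis-pursuit-style objective with ISTA, a standard l1 solver; the matrix sizes, step size, and penalty are assumptions for illustration:

```python
import numpy as np

def ista(A, y, lam=0.1, n_iter=200):
    """Iterative shrinkage-thresholding for min_x 0.5*||Ax - y||^2 + lam*||x||_1.

    A : calibrated measurement matrix of the random lens (scene -> sensor).
    y : observed sensor readings.
    """
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = A.T @ (A @ x - y)              # gradient of the quadratic term
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

# Toy example: recover a sparse 100-pixel "scene" from 40 random measurements.
rng = np.random.default_rng(3)
A = rng.standard_normal((40, 100)) / np.sqrt(40)
x_true = np.zeros(100)
x_true[[5, 42, 77]] = [1.0, -0.5, 0.8]
x_hat = ista(A, A @ x_true, lam=0.01)
print(np.argsort(np.abs(x_hat))[-3:])  # largest entries (ideally 5, 42, 77)
```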

108 citations


Proceedings ArticleDOI
17 Jun 2006
TL;DR: An integrated, probabilistic model is presented for the appearance and three-dimensional geometry of cluttered scenes; a robust likelihood model accounts for outliers in matched stereo features, allowing effective learning of 3D object structure from partial 2D segmentations.
Abstract: We develop an integrated, probabilistic model for the appearance and three-dimensional geometry of cluttered scenes. Object categories are modeled via distributions over the 3D location and appearance of visual features. Uncertainty in the number of object instances depicted in a particular image is then achieved via a transformed Dirichlet process. In contrast with image-based approaches to object recognition, we model scale variations as the perspective projection of objects in different 3D poses. To calibrate the underlying geometry, we incorporate binocular stereo images into the training process. A robust likelihood model accounts for outliers in matched stereo features, allowing effective learning of 3D object structure from partial 2D segmentations. Applied to a dataset of office scenes, our model detects objects at multiple scales via a coarse reconstruction of the corresponding 3D geometry.
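One self-contained piece of the model is the robust likelihood for stereo matches: inlier depths follow a Gaussian around the predicted object depth, while outliers are absorbed by a broad uniform component. A minimal sketch of such a mixture likelihood, with an assumed outlier rate and depth range (the full model embeds this inside a transformed Dirichlet process):

```python
import numpy as np

def robust_log_likelihood(depths, mu, sigma, eps=0.1, depth_range=10.0):
    """Log-likelihood of stereo depth measurements under an
    inlier-Gaussian / outlier-uniform mixture.

    depths      : observed depths of matched stereo features.
    mu, sigma   : predicted object depth and its uncertainty.
    eps         : outlier probability (illustrative value).
    depth_range : support of the uniform outlier component.
    """
    gauss = np.exp(-0.5 * ((depths - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    uniform = 1.0 / depth_range
    return np.log((1 - eps) * gauss + eps * uniform).sum()

obs = np.array([2.1, 2.0, 1.9, 7.5])   # three inliers, one bad stereo match
print(robust_log_likelihood(obs, mu=2.0, sigma=0.1))
```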

80 citations


Book ChapterDOI
TL;DR: This work presents a learning procedure, based on boosted decision stumps, that reduces the computational and sample complexity by finding common features that can be shared across the classes (and/or views).

Abstract: We consider the problem of detecting a large number of different classes of objects in cluttered scenes. We present a learning procedure, based on boosted decision stumps, that reduces the computational and sample complexity by finding common features that can be shared across the classes (and/or views). Shared features emerge in a model of object recognition trained to detect many object classes efficiently and robustly, and are preferred over class-specific features. Although class-specific features achieve a more compact representation for a single category, the whole set of shared features provides more efficient and robust representations when the system is trained to detect many object classes. Classifiers based on shared features need less training data, since many classes share similar features (e.g., computer screens and posters can both be distinguished from the background by looking for the feature “edges in a rectangular arrangement”).
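The sharing mechanism can be sketched in a few functions: each boosting round fits one regression stump, greedily growing the set of classes that share it, while classes left out are fit by per-class constants. This is a simplified rendering of the paper's joint boosting; the feature and threshold search here is deliberately small:

```python
import numpy as np

def fit_shared_stump(X, Y, W, subset, n_thresh=8):
    """Fit one regression stump shared by the classes in `subset`,
    minimizing summed weighted squared error to the +/-1 labels.
    Returns (error, feature, threshold, a, b)."""
    best = (np.inf, 0, 0.0, 0.0, 0.0)
    for f in range(X.shape[1]):
        for t in np.quantile(X[:, f], np.linspace(0.1, 0.9, n_thresh)):
            above = X[:, f] > t
            w, y = W[:, subset], Y[:, subset]
            # Weighted least-squares optimal outputs on each side of the
            # split, pooled over every class that shares this stump.
            a = (w[above] * y[above]).sum() / (w[above].sum() + 1e-12)
            b = (w[~above] * y[~above]).sum() / (w[~above].sum() + 1e-12)
            pred = np.where(above, a, b)[:, None]
            err = (w * (y - pred) ** 2).sum()
            if err < best[0]:
                best = (err, f, t, a, b)
    return best

def subset_error(X, Y, W, subset):
    """Total error: shared stump on `subset`, plus a per-class constant
    k_c for every class left out of the sharing set."""
    err, f, t, a, b = fit_shared_stump(X, Y, W, subset)
    for c in range(Y.shape[1]):
        if c not in subset:
            k = (W[:, c] * Y[:, c]).sum() / (W[:, c].sum() + 1e-12)
            err += (W[:, c] * (Y[:, c] - k) ** 2).sum()
    return err, (f, t, a, b)

def boosting_round(X, Y, W):
    """Greedy forward search over class subsets for one boosting round."""
    remaining, subset = list(range(Y.shape[1])), []
    best_err, best_params = np.inf, None
    while remaining:
        trials = [(subset_error(X, Y, W, subset + [c]), c) for c in remaining]
        (err, params), c = min(trials, key=lambda r: r[0][0])
        if err >= best_err:
            break                  # sharing with more classes stops helping
        subset.append(c)
        remaining.remove(c)
        best_err, best_params = err, params
    return best_params, subset

# Toy usage: two classes that depend on the same feature should share a stump.
rng = np.random.default_rng(4)
X = rng.random((200, 5))
Y = np.where(X[:, [0, 0]] > 0.5, 1.0, -1.0)  # both classes keyed to feature 0
W = np.ones_like(Y)
(f, t, a, b), shared = boosting_round(X, Y, W)
print(f, sorted(shared))                     # expect: 0 [0, 1]
```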

24 citations