
Showing papers by "Antonio Torralba published in 2012"


13 Jan 2012
TL;DR: A benchmark data set containing 300 natural images with eye-tracking data from 39 observers is proposed for comparing model performance, and it is shown that human performance increases with the number of observers, up to a limit.
Abstract: Many computational models of visual attention have been created from a wide variety of different approaches to predict where people look in images. Each model is usually introduced by demonstrating performances on new images, and it is hard to make immediate comparisons between models. To alleviate this problem, we propose a benchmark data set containing 300 natural images with eye tracking data from 39 observers to compare model performances. We calculate the performance of 10 models at predicting ground truth fixations using three different metrics. We provide a way for people to submit new models for evaluation online. We find that the Judd et al. and Graph-based visual saliency models perform best. In general, models with blurrier maps and models that include a center bias perform well. We add and optimize a blur and center bias for each model and show improvements. We compare performances to baseline models of chance, center and human performance. We show that human performance increases with the number of humans to a limit. We analyze the similarity of different models using multidimensional scaling and explore the relationship between model performance and fixation consistency. Finally, we offer observations about how to improve saliency models in the future.
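
The post-processing the benchmark optimizes is straightforward to prototype. Below is a minimal sketch, not the benchmark's released code, of blurring a saliency map, blending in a center prior, and scoring the result against fixations with a simple ROC-AUC metric; the function names, the Gaussian center prior, and all parameter values are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the benchmark's official code).
import numpy as np
from scipy.ndimage import gaussian_filter

def center_prior(h, w):
    """Isotropic Gaussian bump centered in the image, a common center bias."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    d2 = ((ys - cy) / h) ** 2 + ((xs - cx) / w) ** 2
    return np.exp(-d2 / 0.05)

def postprocess(saliency, blur_sigma=8.0, center_weight=0.3):
    """Blur the map and blend it with a center prior, as the benchmark optimizes."""
    s = gaussian_filter(saliency.astype(float), blur_sigma)
    s = (s - s.min()) / (s.max() - s.min() + 1e-12)
    return (1 - center_weight) * s + center_weight * center_prior(*s.shape)

def auc_fixations(saliency, fix_rows, fix_cols, n_thresh=100):
    """Simple AUC: fixated pixels are positives, all other pixels negatives.
    Assumes the saliency map is normalized to [0, 1]."""
    pos = saliency[fix_rows, fix_cols]
    neg = saliency.ravel()
    thresholds = np.linspace(0, 1, n_thresh)
    tpr = [(pos >= t).mean() for t in thresholds]
    fpr = [(neg >= t).mean() for t in thresholds]
    return -np.trapz(tpr, fpr)  # negate because fpr decreases with the threshold
```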

564 citations


Book ChapterDOI
07 Oct 2012
TL;DR: Overall, this work finds that it is beneficial to explicitly account for bias when combining multiple datasets, and proposes a discriminative framework that directly exploits dataset bias during training.
Abstract: The presence of bias in existing object recognition datasets is now well-known in the computer vision community. While it remains in question whether creating an unbiased dataset is possible given limited resources, in this work we propose a discriminative framework that directly exploits dataset bias during training. In particular, our model learns two sets of weights: (1) bias vectors associated with each individual dataset, and (2) visual world weights that are common to all datasets, which are learned by undoing the associated bias from each dataset. The visual world weights are expected to be our best possible approximation to the object model trained on an unbiased dataset, and thus tend to have good generalization ability. We demonstrate the effectiveness of our model by applying the learned weights to a novel, unseen dataset, and report superior results for both classification and detection tasks compared to a classical SVM that does not account for the presence of bias. Overall, we find that it is beneficial to explicitly account for bias when combining multiple datasets.
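
As a rough illustration of the "bias vector plus visual world weights" idea, the sketch below trains per-dataset weights w_i = w_vw + delta_i with hinge losses and a subgradient loop. It is a simplified reconstruction under my own assumptions; the regularization terms, learning rate, and training loop are not the paper's exact formulation.

```python
# Hedged sketch of learning shared "visual world" weights plus per-dataset bias vectors.
import numpy as np

def hinge_grad(w, X, y, C):
    """Subgradient of C * sum(max(0, 1 - y * X w)) with respect to w."""
    margins = y * (X @ w)
    active = margins < 1
    return -C * (X[active] * y[active, None]).sum(axis=0)

def train_undo_bias(datasets, dim, lam=1.0, C=1.0, lr=1e-3, epochs=200):
    """datasets: list of (X_i, y_i) pairs, labels y in {-1, +1}."""
    w_vw = np.zeros(dim)                         # shared visual-world weights
    deltas = [np.zeros(dim) for _ in datasets]   # per-dataset bias vectors
    for _ in range(epochs):
        g_vw = w_vw.copy()                       # gradient of ||w_vw||^2 / 2
        g_d = [lam * d for d in deltas]          # gradient of lam * ||delta_i||^2 / 2
        for i, (X, y) in enumerate(datasets):
            g = hinge_grad(w_vw + deltas[i], X, y, C)  # dataset-specific hinge loss
            g_vw += g
            g_d[i] += g
        w_vw -= lr * g_vw
        deltas = [d - lr * g for d, g in zip(deltas, g_d)]
    return w_vw, deltas   # w_vw is the part expected to generalize to unseen datasets
```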

539 citations


Proceedings ArticleDOI
16 Jun 2012
TL;DR: A database of 360° panoramic images organized into 26 place categories is constructed, and the problem of scene viewpoint recognition is introduced, to classify the type of place shown in a photo, and also recognize the observer's viewpoint within that category of place.
Abstract: We introduce the problem of scene viewpoint recognition, the goal of which is to classify the type of place shown in a photo, and also recognize the observer's viewpoint within that category of place. We construct a database of 360° panoramic images organized into 26 place categories. For each category, our algorithm automatically aligns the panoramas to build a full-view representation of the surrounding place. We also study the symmetry properties and canonical viewpoint of each place category. At test time, given a photo of a scene, the model can recognize the place category, produce a compass-like indication of the observer's most likely viewpoint within that place, and use this information to extrapolate beyond the available view, filling in the probable visual layout that would appear beyond the boundary of the photo.
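
A toy sketch of the viewpoint-recognition step described above: compare the query photo's descriptor against descriptors of crops taken at evenly spaced headings around an aligned panorama and report the best-matching compass angle. The descriptor representation and the number of crops are assumptions for illustration, not the paper's model.

```python
# Illustrative sketch: nearest-crop matching against an aligned 360-degree panorama.
import numpy as np

def viewpoint_angle(query_desc, panorama_crop_descs):
    """panorama_crop_descs: (n_crops, d) descriptors of crops at evenly spaced headings."""
    n_crops = panorama_crop_descs.shape[0]
    dists = np.linalg.norm(panorama_crop_descs - query_desc, axis=1)
    best = int(np.argmin(dists))
    return 360.0 * best / n_crops   # compass heading of the most likely viewpoint
```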

306 citations


Book ChapterDOI
07 Oct 2012
TL;DR: Experiments show that MDSH outperforms the state of the art, especially in the challenging regime of small distance thresholds, and a "kernel trick" is introduced that allows computing with an exponentially large number of bits using only memory and computation that grow linearly with dimension.
Abstract: With the growing availability of very large image databases, there has been a surge of interest in methods based on "semantic hashing", i.e. compact binary codes of data-points so that the Hamming distance between codewords correlates with similarity. In reviewing and comparing existing methods, we show that their relative performance can change drastically depending on the definition of ground-truth neighbors. Motivated by this finding, we propose a new formulation for learning binary codes which seeks to reconstruct the affinity between datapoints, rather than their distances. We show that this criterion is intractable to solve exactly, but a spectral relaxation gives an algorithm where the bits correspond to thresholded eigenvectors of the affinity matrix, and as the number of datapoints goes to infinity these eigenvectors converge to eigenfunctions of Laplace-Beltrami operators, similar to the recently proposed Spectral Hashing (SH) method. Unlike SH, whose performance may degrade as the number of bits increases, the optimal code using our formulation is guaranteed to faithfully reproduce the affinities as the number of bits increases. We show that the number of eigenfunctions needed may increase exponentially with dimension, but introduce a "kernel trick" that allows us to compute with an exponentially large number of bits using only memory and computation that grow linearly with dimension. Experiments show that MDSH outperforms the state of the art, especially in the challenging regime of small distance thresholds.
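
The spectral relaxation mentioned in the abstract can be illustrated with a toy version: threshold the top eigenvectors of a Gaussian affinity matrix over the training points to obtain binary codes. This is not the full MDSH algorithm (no separable eigenfunctions and no out-of-sample extension); the kernel width and thresholding rule are my own illustrative choices.

```python
# Toy sketch of the spectral relaxation behind affinity-preserving binary codes.
import numpy as np

def affinity_codes(X, n_bits=8, sigma=1.0):
    """X: (n, d) data matrix. Returns an (n, n_bits) array of {0, 1} codes."""
    # Gaussian affinity W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    # Eigenvectors of the affinity matrix, largest eigenvalues first
    # (skip the leading, roughly constant eigenvector, as in spectral hashing).
    eigvals, eigvecs = np.linalg.eigh(W)
    top = eigvecs[:, ::-1][:, 1:n_bits + 1]
    # Threshold each eigenvector at its median to get balanced bits.
    return (top > np.median(top, axis=0, keepdims=True)).astype(np.uint8)

codes = affinity_codes(np.random.randn(200, 16), n_bits=8)
hamming = (codes[:, None, :] != codes[None, :, :]).sum(-1)  # pairwise Hamming distances
```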

181 citations


Journal ArticleDOI
TL;DR: It is demonstrated that the context model improves object recognition performance and provides a coherent interpretation of a scene, which enables a reliable image querying system by multiple object categories and can be applied to scene understanding tasks that local detectors alone cannot solve.
Abstract: There has been a growing interest in exploiting contextual information in addition to local features to detect and localize multiple object categories in an image. A context model can rule out some unlikely combinations or locations of objects and guide detectors to produce a semantically coherent interpretation of a scene. However, the performance benefit of context models has been limited because most of the previous methods were tested on data sets with only a few object categories, in which most images contain one or two object categories. In this paper, we introduce a new data set with images that contain many instances of different object categories, and propose an efficient model that captures the contextual information among more than a hundred object categories using a tree structure. Our model incorporates global image features, dependencies between object categories, and outputs of local detectors into one probabilistic framework. We demonstrate that our context model improves object recognition performance and provides a coherent interpretation of a scene, which enables a reliable image querying system by multiple object categories. In addition, our model can be applied to scene understanding tasks that local detectors alone cannot solve, such as detecting objects out of context or querying for the most typical and the least typical scenes in a data set.
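
One ingredient of such a tree-structured context model can be sketched generically: learn a Chow-Liu-style tree over binary object-presence variables by taking a maximum spanning tree on pairwise mutual information. This is a hedged reconstruction of the general technique, not the authors' released code, and it ignores the global image features and detector outputs that the full model also integrates.

```python
# Generic sketch: Chow-Liu tree over object co-occurrence statistics.
import numpy as np
from itertools import combinations
from scipy.sparse.csgraph import minimum_spanning_tree

def mutual_information(a, b):
    """MI between two binary vectors of object-presence labels."""
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(presence):
    """presence: (n_images, n_categories) binary matrix -> list of tree edges."""
    n_cat = presence.shape[1]
    W = np.zeros((n_cat, n_cat))
    for i, j in combinations(range(n_cat), 2):
        W[i, j] = W[j, i] = mutual_information(presence[:, i], presence[:, j])
    # Maximum spanning tree = minimum spanning tree on negated weights.
    mst = minimum_spanning_tree(-W).toarray()
    return [(i, j) for i in range(n_cat) for j in range(n_cat) if mst[i, j] != 0]
```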

166 citations


Journal ArticleDOI
TL;DR: It is shown that physical support relationships between objects can provide useful contextual information for both object recognition and out-of-context detection.

129 citations


Proceedings Article
03 Dec 2012
TL;DR: This work proposes a probabilistic framework that models how and which local regions from an image may be forgotten, using a data-driven approach that combines local and global image features.
Abstract: While long term human visual memory can store a remarkable amount of visual information, it tends to degrade over time. Recent works have shown that image memorability is an intrinsic property of an image that can be reliably estimated using state-of-the-art image features and machine learning algorithms. However, the class of features and image information that is forgotten has not been explored yet. In this work, we propose a probabilistic framework that models how and which local regions from an image may be forgotten using a data-driven approach that combines local and global image features. The model automatically discovers memorability maps of individual images without any human annotation. We incorporate multiple image region attributes in our algorithm, leading to improved memorability prediction of images as compared to previous works.
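
A highly simplified sketch of how per-region predictions could be assembled into a memorability map: score each segmented region with a learned regressor and paint the score back into the image plane. The regressor, features, and training data below are placeholders under my own assumptions, not the paper's probabilistic model.

```python
# Illustrative sketch: painting per-region memorability scores into a map.
import numpy as np
from sklearn.svm import SVR

def memorability_map(region_masks, region_features, regressor):
    """region_masks: list of (H, W) boolean masks; region_features: (n_regions, d)."""
    h, w = region_masks[0].shape
    mem_map = np.zeros((h, w))
    scores = regressor.predict(region_features)
    for mask, score in zip(region_masks, scores):
        mem_map[mask] = score   # each region gets its predicted memorability
    return mem_map

# Illustrative training on made-up data: region features and per-region scores.
X_train = np.random.randn(500, 32)
y_train = np.random.rand(500)
reg = SVR(kernel="rbf").fit(X_train, y_train)
```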

127 citations


Proceedings Article
03 Dec 2012
TL;DR: This work builds a discriminative parts-based detector that models the appearance of the cuboid corners and internal edges while enforcing consistency with a 3D cuboid model, and it outperforms baseline detectors that use 2D constraints alone on the task of localizing cuboid corners.
Abstract: In this paper we seek to detect rectangular cuboids and localize their corners in uncalibrated single-view images depicting everyday scenes. In contrast to recent approaches that rely on detecting vanishing points of the scene and grouping line segments to form cuboids, we build a discriminative parts-based detector that models the appearance of the cuboid corners and internal edges while enforcing consistency with a 3D cuboid model. Our model copes with different 3D viewpoints and aspect ratios and is able to detect cuboids across many different object categories. We introduce a database of images with cuboid annotations that spans a variety of indoor and outdoor scenes and show qualitative and quantitative results on our collected database. Our model outperforms baseline detectors that use 2D constraints alone on the task of localizing cuboid corners.
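
The scoring used by such a parts-based detector can be sketched schematically: sum appearance scores at hypothesized corner locations and subtract a penalty for deviating from the 2D projection of an ideal 3D cuboid. The score maps, projected corners, and deformation weight below are assumptions for illustration, not the paper's trained model.

```python
# Schematic sketch of scoring one cuboid hypothesis.
import numpy as np

def score_cuboid(corner_score_maps, corner_locs, projected_corners, deform_weight=0.1):
    """corner_score_maps: list of (H, W) appearance score maps, one per corner.
    corner_locs: (k, 2) integer (row, col) hypothesized corner positions.
    projected_corners: (k, 2) corners of the projected 3D cuboid model."""
    appearance = sum(m[r, c] for m, (r, c) in zip(corner_score_maps, corner_locs))
    deformation = np.sum((np.asarray(corner_locs) - np.asarray(projected_corners)) ** 2)
    return appearance - deform_weight * deformation
```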

126 citations


Proceedings Article
27 Jun 2012
TL;DR: A hierarchical Bayesian model is developed that learns categories from single training examples, transferring acquired knowledge from previously learned categories to a novel category in the form of a prior over category means and variances.
Abstract: We develop a hierarchical Bayesian model that learns categories from single training examples. The model transfers acquired knowledge from previously learned categories to a novel category, in the form of a prior over category means and variances. The model discovers how to group categories into meaningful super-categories that express different priors for new classes. Given a single example of a novel category, we can efficiently infer which super-category the novel category belongs to, and thereby estimate not only the new category's mean but also an appropriate similarity metric based on parameters inherited from the super-category. On MNIST and MSR Cambridge image datasets the model learns useful representations of novel categories based on just a single training example, and performs significantly better than simpler hierarchical Bayesian approaches. It can also discover new categories in a completely unsupervised fashion, given just one or a few examples.
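
The transfer mechanism can be illustrated with a toy per-dimension Normal-Inverse-Gamma model: the super-category supplies prior hyperparameters over the new category's mean and variance, and a single example updates them with standard conjugate formulas. The hyperparameter values below are illustrative, not the paper's learned ones.

```python
# Toy sketch: one-shot conjugate update under a super-category prior.
import numpy as np

def one_shot_posterior(x, mu0, kappa0, alpha0, beta0):
    """Posterior over the new category's mean/variance after one example x.
    Standard Normal-Inverse-Gamma conjugate updates with n = 1."""
    kappa_n = kappa0 + 1
    mu_n = (kappa0 * mu0 + x) / kappa_n
    alpha_n = alpha0 + 0.5
    beta_n = beta0 + 0.5 * kappa0 * (x - mu0) ** 2 / kappa_n
    return mu_n, kappa_n, alpha_n, beta_n

# The super-category's statistics act as the prior; x is the single example.
mu0 = np.zeros(64)
x = np.random.randn(64)
mu_n, kappa_n, alpha_n, beta_n = one_shot_posterior(x, mu0, kappa0=5.0, alpha0=2.0, beta0=1.0)
expected_var = beta_n / (alpha_n - 1)   # posterior expected variance per dimension
```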

118 citations


Proceedings ArticleDOI
16 Jun 2012
TL;DR: Two types of “accidental” images that can be formed in scenes are identified and studied, including an accidental pinhole camera image that can reveal structures outside a room, or the unseen shape of the light aperture into the room.
Abstract: We identify and study two types of “accidental” images that can be formed in scenes. The first is an accidental pinhole camera image. These images are often mistaken for shadows, but can reveal structures outside a room, or the unseen shape of the light aperture into the room. The second class of accidental images are “inverse” pinhole camera images, formed by subtracting an image with a small occluder present from a reference image without the occluder. The reference image can be an earlier frame of a video sequence. Both types of accidental images happen in a variety of different situations (an indoor scene illuminated by natural light, a street with a person walking under the shadow of a building, etc.). Accidental cameras can reveal information about the scene outside the image, the lighting conditions, or the aperture by which light enters the scene.
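
The "inverse" pinhole computation lends itself to a very short sketch: subtract a frame containing a small occluder from an occluder-free reference frame, keep the non-negative difference, and flip it, since pinhole images are inverted. This is a minimal sketch of the subtraction step only, under the assumption that the two frames are registered.

```python
# Minimal sketch of the inverse-pinhole subtraction.
import numpy as np

def inverse_pinhole(reference, occluded):
    """reference, occluded: float images of the same scene, without/with a small occluder."""
    diff = np.clip(reference - occluded, 0, None)   # light blocked by the occluder
    return diff[::-1, ::-1]                          # pinhole images are inverted; flip to view upright
```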

62 citations


Proceedings ArticleDOI
28 Nov 2012
TL;DR: The notion of image memorability and the elements that make it memorable are discussed and evidence for the phenomenon of visual inception is introduced: can the authors make people believe they have seen an image they have not?
Abstract: When glancing at a magazine, or browsing the Internet, we are continuously being exposed to photographs. However, not all images are equal in memory; some stick in our minds, while others are forgotten. In this paper we discuss the notion of image memorability and the elements that make an image memorable. Our recent works have shown that image memorability is a stable and intrinsic property of images that is shared across different viewers. Given that this is the case, we discuss the possibility of modifying the memorability of images by identifying the memorability of image regions. Further, we introduce and provide evidence for the phenomenon of visual inception: can we make people believe they have seen an image they have not?

Posted Content
TL;DR: An expert image annotator relates her experience on segmenting and labeling tens of thousands of images and the notes she took try to highlight the difficulties encountered, the solutions adopted, and the decisions made in order to get a consistent set of annotations.
Abstract: We are under the illusion that seeing is effortless, but frequently the visual system is lazy and makes us believe that we understand something when in fact we don't. Labeling a picture forces us to become aware of the difficulties underlying scene understanding. Suddenly, the act of seeing is not effortless anymore. We have to make an effort in order to understand parts of the picture that we neglected at first glance. In this report, an expert image annotator relates her experience on segmenting and labeling tens of thousands of images. During this process, the notes she took try to highlight the difficulties encountered, the solutions adopted, and the decisions made in order to get a consistent set of annotations. Those annotations constitute the SUN database.

Journal ArticleDOI
TL;DR: This work uses a feature-selection scheme with desirable explaining-away properties to determine a compact set of attributes that characterizes the memorability of any individual image, and finds that images of enclosed spaces containing people with visible faces are memorable, while images of vistas and peaceful scenes are not.

Posted Content
TL;DR: Four algorithms to visualize feature spaces commonly used in object detection are described, with different trade-offs in speed, accuracy, and scalability; the most successful algorithm uses ideas from sparse coding to learn a pair of dictionaries that enable regression between HOG features and natural images, and can invert features at interactive rates.
Abstract: This paper presents methods to visualize feature spaces commonly used in object detection. The tools in this paper allow a human to put on feature space glasses and see the visual world as a computer might see it. We found that these glasses allow us to gain insight into the behavior of computer vision systems. We show a variety of experiments with our visualizations, such as examining the linear separability of recognition in HOG space, generating high scoring super objects for an object detector, and diagnosing false positives. We pose the visualization problem as one of feature inversion, i.e. recovering the natural image that generated a feature descriptor. We describe four algorithms to tackle this task, with different trade-offs in speed, accuracy, and scalability. Our most successful algorithm uses ideas from sparse coding to learn a pair of dictionaries that enable regression between HOG features and natural images, and can invert features at interactive rates. We believe these visualizations are useful tools to add to an object detector researcher's toolbox, and code is available.
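
The paired-dictionary approach can be approximated in a few lines: learn a sparse dictionary on HOG descriptors, reuse the resulting sparse codes to fit an image-patch dictionary by least squares, then invert a new descriptor by coding it and decoding with the image dictionary. This is a rough sketch under my own assumptions, not the authors' released code, and the training data below are random placeholders.

```python
# Rough sketch of HOG-to-image inversion via a pair of dictionaries.
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

# hog_train: (n, d_hog) HOG descriptors; img_train: (n, d_img) matching image patches.
hog_train = np.random.randn(1000, 128)   # placeholder training data
img_train = np.random.randn(1000, 256)

dl = DictionaryLearning(n_components=64, alpha=1.0, max_iter=20).fit(hog_train)
codes = dl.transform(hog_train)                                   # shared sparse codes A
img_dict, *_ = np.linalg.lstsq(codes, img_train, rcond=None)      # V such that A V ~= images

def invert_hog(hog_desc):
    """Code a HOG descriptor with the HOG dictionary, decode with the image dictionary."""
    a = sparse_encode(hog_desc[None, :], dl.components_, alpha=1.0)
    return (a @ img_dict)[0]   # approximate image patch for this descriptor
```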

Proceedings ArticleDOI
28 Nov 2012
TL;DR: Steps toward a unified 3D parsing of everyday scenes are described, including the ability to localize geometric primitives in images, such as cuboids and cylinders, which often comprise many everyday objects.
Abstract: An early goal of computer vision was to build a system that could automatically understand a 3D scene just by looking. This requires not only the ability to extract 3D information from image information alone, but also to handle the large variety of different environments that comprise our visual world. This paper summarizes our recent efforts toward these goals. First, we describe the SUN database, which is a collection of annotated images spanning 908 different scene categories. This database allows us to systematically study the space of possible everyday scenes and to establish a benchmark for scene and object recognition. We also explore ways of coping with the variety of viewpoints within these scenes. For this, we have introduced a database of 360° panoramic images for many of the scene categories in the SUN database and have explored viewpoint recognition within the environments. Finally, we describe steps toward a unified 3D parsing of everyday scenes: (i) the ability to localize geometric primitives in images, such as cuboids and cylinders, which often comprise many everyday objects, and (ii) an integrated system to extract the 3D structure of the scene and objects depicted in an image.

Dissertation
01 Jan 2012
TL;DR: This dissertation proposes a system that can predict what lies just beyond the boundaries of the image using a large photo collection of images of the same class, but not from the same location in the real world, thus creating a photorealistic virtual space from real world images.
Abstract: Image features are widely used in computer vision applications from stereo matching to panorama stitching to object and scene recognition. They exploit image regularities to capture structure in images both locally, using a patch around an interest point, and globally, over the entire image. Image features need to be distinctive and robust to variations in scene content, camera viewpoint, and illumination conditions. Common tasks are matching local features across images and finding semantically meaningful matches amongst a large set of images. If there is enough structure or regularity in the images, we should be able not only to find good matches but also to predict parts of the objects or the scene that were not directly captured by the camera. One of the difficulties in evaluating the performance of image features in both the prediction and matching tasks is the availability of ground truth data. In this dissertation, we take two different approaches. First, we propose using a photorealistic virtual world for evaluating local feature descriptors and learning new feature detectors. Acquiring ground truth data, and in particular pixel-to-pixel correspondences between images, in complex 3D scenes under different viewpoint and illumination conditions in a controlled way is nearly impossible in a real world setting. Instead, we use a high-resolution 3D model of a city to gain complete and repeatable control of the environment. We calibrate our virtual world evaluations by comparing against feature rankings made from photographic data of the same subject matter (the Statue of Liberty). We then use our virtual world to study the effects on descriptor performance of controlled changes in viewpoint and illumination. We further employ machine learning techniques to train a model that would recognize visually rich interest points and optimize the performance of a given descriptor. In the latter part of the thesis, we take advantage of the large amounts of image data available on the Internet to explore the regularities in outdoor scenes and, more specifically, the matching and prediction tasks in street level images. Generally, people are very adept at predicting what they might encounter as they navigate through the world. They use all of their prior experience to make such predictions even when placed in an unfamiliar environment. We propose a system that can predict what lies just beyond the boundaries of the image using a large photo collection of images of the same class, but not from the same location in the real world. We evaluate the performance of the system using different global or quantized densely extracted local features. We demonstrate how to build seamless transitions between the query and prediction images, thus creating a photorealistic virtual space from real world images.

01 Jan 2012
TL;DR: This thesis emulates the prior knowledge base of humans by creating a large, heterogeneous database and annotation tool for videos depicting real-world scenes, and adopts a data-driven framework powered by scene-matching techniques that retrieves the videos nearest to a query clip and integrates their motion information.
Abstract: As humans, we can say many things about the scenes surrounding us. For instance, we can tell what type of scene and location an image depicts, describe what objects live in it, their material properties, or their spatial arrangement. These comprise descriptions of a scene and are major areas of study in computer vision. This thesis, however, hypothesizes that observers have an inherent prior knowledge that can be applied to the scene at hand. This prior knowledge can be translated into the cognisance of which objects move, or of the trajectories and velocities to expect. Conversely, when faced with unusual events such as car accidents, humans are very well tuned to identify them regardless of having observed the scene a priori. This is, in part, due to prior observations that we have for scenes with similar configurations to the current one. This thesis emulates the prior knowledge base of humans by creating a large and heterogeneous database and annotation tool for videos depicting real world scenes. The first application of this thesis is in the area of unusual event detection. Given a short clip, the task is to identify the moving portions of the scene that depict abnormal events. We adopt a data-driven framework powered by scene matching techniques to retrieve the videos nearest to the query clip and integrate the motion information in the nearest videos. The result is a final clip with localized annotations for unusual activity. The second application lies in the area of event prediction. Given a static image, we adapt our framework to compile a prediction of motions to expect in the image. This result is crafted by integrating the knowledge of videos depicting scenes similar to the query image. With the help of scene matching, only scenes relevant to the queries are considered, resulting in reliable predictions. Our dataset, experimentation, and proposed model introduce and explore a new facet of scene understanding in images and videos.

Book ChapterDOI
01 Apr 2012
TL;DR: This work proposes a latent variable ranking model that induces both the latent classes and the weights of the linear combination for each class from ranking triplets, and has a clear computational advantage since it does not need to be retrained for each test query.
Abstract: Since their introduction, ranking SVM models [11] have become a powerful tool for training content-based retrieval systems. All we need for training a model are retrieval examples in the form of triplet constraints, i.e. examples specifying that relative to some query, a database item a should be ranked higher than database item b. These types of constraints could be obtained from feedback of users of the retrieval system. Most previous ranking models learn either a global combination of elementary similarity functions or a combination defined with respect to a single database item. Instead, we propose a "coarse to fine" ranking model where, given a query, we first compute a distribution over "coarse" classes and then use the linear combination that has been optimized for queries of that class. These coarse classes are hidden and need to be induced by the training algorithm. We propose a latent variable ranking model that induces both the latent classes and the weights of the linear combination for each class from ranking triplets. Our experiments over two large image datasets and a text retrieval dataset show the advantages of our model over learning a global combination as well as a combination for each test point (i.e. the transductive setting). Furthermore, compared to the transductive approach, our model has a clear computational advantage since it does not need to be retrained for each test query.
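
The training signal can be sketched as follows: triplet constraints of the form "for query q, item a should outrank item b", a hinge loss on a weighted combination of elementary similarity functions, and a soft assignment of each query to latent coarse classes. The shapes, the simple subgradient loop, and the decision to update only the per-class combination weights (leaving the class-assignment model fixed) are my own simplifications, not the paper's optimization.

```python
# Condensed sketch of triplet-based training with latent coarse classes.
import numpy as np

def train_latent_ranker(S, triplets, query_feats, n_classes=4, lr=1e-2, epochs=100):
    """S: (n_queries, n_items, n_sims) elementary similarity values.
    triplets: list of (q, a, b) meaning item a should outrank item b for query q."""
    n_sims = S.shape[2]
    W = np.random.randn(n_classes, n_sims) * 0.01                 # per-class combination weights
    P = np.random.randn(n_classes, query_feats.shape[1]) * 0.01   # class-assignment model (kept fixed here)
    for _ in range(epochs):
        for q, a, b in triplets:
            # Soft distribution over latent coarse classes for this query.
            logits = P @ query_feats[q]
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            w_q = probs @ W                                # query-specific similarity combination
            margin = w_q @ (S[q, a] - S[q, b])
            if margin < 1:                                 # hinge: we want a ranked above b
                grad = S[q, a] - S[q, b]
                W += lr * probs[:, None] * grad[None, :]   # only W is updated in this sketch
    return W, P
```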