
Showing papers by "Antonio Torralba published in 2011"


Proceedings ArticleDOI
20 Jun 2011
TL;DR: A comparison study of a set of popular datasets is presented, evaluating them on a number of criteria including relative data bias, cross-dataset generalization, effects of the closed-world assumption, and sample value.
Abstract: Datasets are an integral part of contemporary object recognition research. They have been the chief reason for the considerable progress in the field, not just as a source of large amounts of training data, but also as a means of measuring and comparing the performance of competing algorithms. At the same time, datasets have often been blamed for narrowing the focus of object recognition research, reducing it to a single benchmark performance number. Indeed, some datasets that started out as data capture efforts aimed at representing the visual world have become closed worlds unto themselves (e.g. the Corel world, the Caltech-101 world, the PASCAL VOC world). With the focus on beating the latest benchmark numbers on the latest dataset, have we perhaps lost sight of the original purpose? The goal of this paper is to take stock of the current state of recognition datasets. We present a comparison study using a set of popular datasets, evaluated based on a number of criteria including: relative data bias, cross-dataset generalization, effects of closed-world assumption, and sample value. The experimental results, some rather surprising, suggest directions that can improve dataset collection as well as algorithm evaluation protocols. But more broadly, the hope is to stimulate discussion in the community regarding this very important, but largely neglected, issue.
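As an illustration of the cross-dataset generalization criterion mentioned above, a minimal sketch (not the paper's exact protocol; the classifier, features, and dataset loaders are placeholder assumptions) would train a detector on each dataset and test it on every other, producing a train/test matrix whose off-diagonal entries measure how well models trained on one dataset transfer to another:

```python
# Sketch of a cross-dataset generalization matrix for one object class.
# Dataset contents and the LinearSVC/average-precision choices are illustrative.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score

def cross_dataset_matrix(datasets):
    """datasets: dict name -> (X_train, y_train, X_test, y_test) for one class."""
    names = list(datasets)
    perf = np.zeros((len(names), len(names)))
    for i, train_name in enumerate(names):
        Xtr, ytr, _, _ = datasets[train_name]
        clf = LinearSVC(C=1.0).fit(Xtr, ytr)
        for j, test_name in enumerate(names):
            _, _, Xte, yte = datasets[test_name]
            perf[i, j] = average_precision_score(yte, clf.decision_function(Xte))
    # Diagonal: "self" performance; off-diagonal: cross-dataset generalization.
    return names, perf
```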

2,428 citations


Journal ArticleDOI
TL;DR: SIFT flow is proposed, a method to align an image to its nearest neighbors in a large image corpus containing a variety of scenes, where image information is transferred from the nearest neighbors to a query image according to the dense scene correspondence.
Abstract: While image alignment has been studied in different areas of computer vision for decades, aligning images depicting different scenes remains a challenging problem. Analogous to optical flow, where an image is aligned to its temporally adjacent frame, we propose SIFT flow, a method to align an image to its nearest neighbors in a large image corpus containing a variety of scenes. The SIFT flow algorithm consists of matching densely sampled, pixelwise SIFT features between two images while preserving spatial discontinuities. The SIFT features allow robust matching across different scene/object appearances, whereas the discontinuity-preserving spatial model allows matching of objects located at different parts of the scene. Experiments show that the proposed approach robustly aligns complex scene pairs containing significant spatial differences. Based on SIFT flow, we propose an alignment-based large database framework for image analysis and synthesis, where image information is transferred from the nearest neighbors to a query image according to the dense scene correspondence. This framework is demonstrated through concrete applications such as motion field prediction from a single image, motion synthesis via object transfer, satellite image registration, and face recognition.
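For reference, the SIFT flow objective combines a truncated L1 data term on dense SIFT descriptors, a small-displacement penalty, and a truncated L1 smoothness term on the flow field w(p) = (u(p), v(p)); it is roughly of the following form (constants and notation recalled from the published formulation and may differ slightly):

```latex
E(\mathbf{w}) = \sum_{\mathbf{p}} \min\big( \lVert s_1(\mathbf{p}) - s_2(\mathbf{p}+\mathbf{w}(\mathbf{p})) \rVert_1,\; t \big)
              + \sum_{\mathbf{p}} \eta \big( |u(\mathbf{p})| + |v(\mathbf{p})| \big)
              + \sum_{(\mathbf{p},\mathbf{q}) \in \varepsilon} \Big[ \min\big( \alpha\,|u(\mathbf{p})-u(\mathbf{q})|,\; d \big) + \min\big( \alpha\,|v(\mathbf{p})-v(\mathbf{q})|,\; d \big) \Big]
```

where s_1 and s_2 are the per-pixel SIFT descriptor images and ε is the spatial neighborhood system; the energy is minimized with a coarse-to-fine belief propagation scheme.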

1,726 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: A new large-scale video dataset designed to assess the performance of diverse visual event recognition algorithms with a focus on continuous visual event recognition (CVER) in outdoor areas with wide coverage is introduced.
Abstract: We introduce a new large-scale video dataset designed to assess the performance of diverse visual event recognition algorithms with a focus on continuous visual event recognition (CVER) in outdoor areas with wide coverage. Previous datasets for action recognition are unrealistic for real-world surveillance because they consist of short clips showing one action by one individual [15, 8]. Datasets have been developed for movies [11] and sports [12], but these actions and scene conditions do not apply effectively to surveillance videos. Our dataset consists of many outdoor scenes with actions occurring naturally by non-actors in continuously captured videos of the real world. The dataset includes large numbers of instances for 23 event types distributed throughout 29 hours of video. This data is accompanied by detailed annotations which include both moving object tracks and event examples, which provide a solid basis for large-scale evaluation. Additionally, we propose different types of evaluation modes for visual recognition tasks and evaluation metrics along with our preliminary experimental results. We believe that this dataset will stimulate diverse aspects of computer vision research and help us to advance the CVER tasks in the years ahead.

664 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel, nonparametric approach for object recognition and scene parsing using a new technology the authors name label transfer, which is easy to implement, has few parameters, and embeds contextual information naturally in the retrieval/alignment procedure.
Abstract: While there has been a lot of recent work on object recognition and image understanding, the focus has been on carefully establishing mathematical models for images, scenes, and objects. In this paper, we propose a novel, nonparametric approach for object recognition and scene parsing using a new technology we name label transfer. For an input image, our system first retrieves its nearest neighbors from a large database containing fully annotated images. Then, the system establishes dense correspondences between the input image and each of the nearest neighbors using the dense SIFT flow algorithm [28], which aligns two images based on local image structures. Finally, based on the dense scene correspondences obtained from SIFT flow, our system warps the existing annotations and integrates multiple cues in a Markov random field framework to segment and recognize the query image. Promising experimental results have been achieved by our nonparametric scene parsing system on challenging databases. Compared to existing object recognition approaches that require training classifiers or appearance models for each object category, our system is easy to implement, has few parameters, and embeds contextual information naturally in the retrieval/alignment procedure.
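As a deliberately simplified illustration of the final step above, the sketch below fuses annotation maps that have already been warped into the query frame by SIFT flow, using per-pixel weighted voting; the paper itself integrates multiple cues in a Markov random field rather than plain voting:

```python
# Fuse neighbor annotations warped into the query frame (simplified voting,
# not the paper's MRF formulation).
import numpy as np

def fuse_warped_annotations(warped_labels, weights=None, num_classes=None):
    """warped_labels: list of HxW integer label maps warped from the neighbors.
    weights: optional per-neighbor reliability (e.g. from a matching score)."""
    warped = np.stack(warped_labels)              # (K, H, W)
    K, H, W = warped.shape
    if weights is None:
        weights = np.ones(K)
    if num_classes is None:
        num_classes = int(warped.max()) + 1
    votes = np.zeros((num_classes, H, W))
    for k in range(K):
        for c in range(num_classes):
            votes[c] += weights[k] * (warped[k] == c)
    return votes.argmax(axis=0)                   # per-pixel label for the query
```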

431 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: A hierarchical classification model that allows rare objects to borrow statistical strength from related objects that have many training examples and learns both a hierarchy for sharing visual appearance across 200 object categories and hierarchical parameters is presented.
Abstract: We present a hierarchical classification model that allows rare objects to borrow statistical strength from related objects that have many training examples. Unlike many of the existing object detection and recognition systems that treat different classes as unrelated entities, our model learns both a hierarchy for sharing visual appearance across 200 object categories and hierarchical parameters. Our experimental results on the challenging object localization and detection task demonstrate that the proposed model substantially improves the accuracy of the standard single object detectors that ignore hierarchical structure altogether.

385 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: It is shown that memorability is a stable property of an image that is shared across different viewers, and a database is introduced in which the probability that each picture will be remembered after a single view has been measured.
Abstract: When glancing at a magazine, or browsing the Internet, we are continuously being exposed to photographs. Despite this overflow of visual information, humans are extremely good at remembering thousands of pictures along with some of their visual details. But not all images are equal in memory. Some stick in our minds, and others are forgotten. In this paper we focus on the problem of predicting how memorable an image will be. We show that memorability is a stable property of an image that is shared across different viewers. We introduce a database for which we have measured the probability that each picture will be remembered after a single view. We analyze image features and labels that contribute to making an image memorable, and we train a predictor based on global image descriptors. We find that predicting image memorability is a task that can be addressed with current computer vision techniques. Whereas making memorable images is a challenging task in visualization and photography, this work is a first attempt to quantify this useful quality of images.
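A minimal sketch of the kind of predictor described above: regress each image's measured memorability from a global descriptor and report rank correlation under cross-validation. The use of scikit-learn's SVR and a generic descriptor matrix are illustrative assumptions, not the paper's exact features or learner:

```python
# Train a memorability regressor on global image descriptors (illustrative setup).
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict
from scipy.stats import spearmanr

def train_memorability_predictor(X, y):
    """X: (N, D) global descriptors; y: (N,) measured memorability (hit rates)."""
    model = SVR(kernel="rbf", C=1.0, epsilon=0.05)
    preds = cross_val_predict(model, X, y, cv=5)   # held-out predictions
    rho, _ = spearmanr(preds, y)                   # rank correlation is the usual report
    return model.fit(X, y), rho
```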

358 citations


ReportDOI
12 Dec 2011
TL;DR: In this article, the authors used the publicly available memorability dataset of Isola et al. and augmented object and scene annotations with interpretable spatial, content, and aesthetic image properties to determine a compact set of attributes that characterizes the memorability of any individual image.
Abstract: Artists, advertisers, and photographers are routinely presented with the task of creating an image that a viewer will remember. While it may seem like image memorability is purely subjective, recent work shows that it is not an inexplicable phenomenon: variation in memorability of images is consistent across subjects, suggesting that some images are intrinsically more memorable than others, independent of subjects' contexts and biases. In this paper, we used the publicly available memorability dataset of Isola et al. [13], and augmented the object and scene annotations with interpretable spatial, content, and aesthetic image properties. We used a feature-selection scheme with desirable explaining-away properties to determine a compact set of attributes that characterizes the memorability of any individual image. We find that images of enclosed spaces containing people with visible faces are memorable, while images of vistas and peaceful scenes are not. Contrary to popular belief, unusual or aesthetically pleasing scenes do not tend to be highly memorable. This work represents one of the first attempts at understanding intrinsic image memorability, and opens a new domain of investigation at the interface between human cognition and computer vision.
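The paper's feature-selection scheme is described above only as having explaining-away properties; as one generic stand-in (an assumption, not the authors' method), an L1-regularized regression keeps a compact subset of correlated attributes and zeroes out the rest:

```python
# Generic sparse attribute selection for memorability (stand-in, not the paper's scheme).
import numpy as np
from sklearn.linear_model import LassoCV

def select_attributes(A, y, attribute_names):
    """A: (N, D) attribute matrix; y: (N,) memorability scores."""
    model = LassoCV(cv=5).fit(A, y)
    keep = np.flatnonzero(np.abs(model.coef_) > 1e-6)   # attributes with nonzero weight
    return [(attribute_names[i], model.coef_[i]) for i in keep]
```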

167 citations


Journal ArticleDOI
TL;DR: This article collected about 500 trials per cell per observer for both target-present and target-absent displays in each of three classic search tasks: feature search, with the target defined by color; conjunction search, with the target defined by both color and orientation; and spatial configuration search for a 2 among distractor 5s.
Abstract: Many visual search experiments measure response time (RT) as their primary dependent variable. Analyses typically focus on mean (or median) RT. However, given enough data, the RT distribution can be a rich source of information. For this paper, we collected about 500 trials per cell per observer for both target-present and target-absent displays in each of three classic search tasks: feature search, with the target defined by color; conjunction search, with the target defined by both color and orientation; and spatial configuration search for a 2 among distractor 5s. This large data set allows us to characterize the RT distributions in detail. We present the raw RT distributions and fit several psychologically motivated functions (ex-Gaussian, ex-Wald, Gamma, and Weibull) to the data. We analyze and interpret parameter trends from these four functions within the context of theories of visual search.
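A minimal sketch of fitting one of the distributions mentioned above (the ex-Gaussian) to a vector of response times with SciPy; scipy.stats.exponnorm is the exponentially modified normal, whose (K, loc, scale) parameters map to the conventional (mu, sigma, tau) as mu = loc, sigma = scale, tau = K * scale:

```python
# Fit an ex-Gaussian to response times and return its conventional parameters.
import numpy as np
from scipy import stats

def fit_ex_gaussian(rts):
    """rts: 1-D array of response times (e.g. in ms) for one condition."""
    K, loc, scale = stats.exponnorm.fit(rts)
    mu, sigma, tau = loc, scale, K * scale
    # Log-likelihood allows comparison against other candidates (Gamma, Weibull, ...).
    loglik = np.sum(stats.exponnorm.logpdf(rts, K, loc, scale))
    return {"mu": mu, "sigma": sigma, "tau": tau, "loglik": loglik}

# Example with synthetic positively skewed data:
# params = fit_ex_gaussian(np.random.default_rng(0).gamma(5.0, 100.0, 500))
```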

156 citations


Proceedings Article
12 Dec 2011
TL;DR: This work proposes a novel way of augmenting the training data for each class by borrowing and transforming examples from other classes, and demonstrates that the new object detector improves upon the current state-of-the-art detector on the challenging SUN09 object detection dataset.
Abstract: Despite the recent trend of increasingly large datasets for object detection, there still exist many classes with few training examples. To overcome this lack of training data for certain classes, we propose a novel way of augmenting the training data for each class by borrowing and transforming examples from other classes. Our model learns which training instances from other classes to borrow and how to transform the borrowed examples so that they become more similar to instances from the target class. Our experimental results demonstrate that our new object detector, with borrowed and transformed examples, improves upon the current state-of-the-art detector on the challenging SUN09 object detection dataset.

143 citations


Journal ArticleDOI
TL;DR: In this article, the authors focus on the problem of predicting how memorable an image will be, and find that memorability is a stable property of an image that is shared across different viewers, and introduce a database for which they have measured the probability that each picture will be remembered after a single view.
Abstract: When glancing at a magazine, or browsing the Internet, we are continuously being exposed to photographs. Despite this overflow of visual information, humans are extremely good at remembering thousands of pictures along with some of their visual details. But not all images are equal in memory. Some stick in our minds, and others are forgotten. In this paper we focus on the problem of predicting how memorable an image will be. We show that memorability is a stable property of an image that is shared across different viewers. We introduce a database for which we have measured the probability that each picture will be remembered after a single view. We analyze image features and labels that contribute to making an image memorable, and we train a predictor based on global image descriptors. We find that predicting image memorability is a task that can be addressed with current computer vision techniques. Whereas making memorable images is a challenging task in visualization and photography, this work is a first attempt to quantify this useful quality of images.

130 citations


Proceedings ArticleDOI
30 Aug 2011
TL;DR: A concept for automatic construction site monitoring is presented that takes into account 4D information (3D over time) acquired from highly overlapping digital aerial images, largely supporting automated methods toward full scene understanding.
Abstract: We present a concept for automatic construction site monitoring that takes into account 4D information (3D over time) acquired from highly overlapping digital aerial images. On the one hand, today's maturity of flying micro aerial vehicles (MAVs) enables low-cost and efficient image acquisition of high-quality data that maps construction sites entirely from many varying viewpoints. On the other hand, due to low-noise sensors and high redundancy in the image data, recent developments in 3D reconstruction workflows have benefited the automatic computation of accurate and dense 3D scene information. Having both an inexpensive high-quality image acquisition and an efficient 3D analysis workflow enables monitoring, documentation and visualization of observed sites over time at short intervals. Relating the acquired 4D site observations, composed of color, texture, and geometry over time, largely supports automated methods toward full scene understanding and the acquisition of both the change and the construction site's progress.

02 Jul 2011
TL;DR: In this article, a hierarchical Bayesian model is proposed to transfer knowledge from previously learned categories to a novel category, in the form of a prior over category means and variances, which can discover how to group categories into meaningful super-categories that express different priors for new classes.
Abstract: We develop a hierarchical Bayesian model that learns categories from single training examples. The model transfers acquired knowledge from previously learned categories to a novel category, in the form of a prior over category means and variances. The model discovers how to group categories into meaningful super-categories that express different priors for new classes. Given a single example of a novel category, we can efficiently infer which super-category the novel category belongs to, and thereby estimate not only the new category's mean but also an appropriate similarity metric based on parameters inherited from the super-category. On MNIST and MSR Cambridge image datasets the model learns useful representations of novel categories based on just a single training example, and performs significantly better than simpler hierarchical Bayesian approaches. It can also discover new categories in a completely unsupervised fashion, given just one or a few examples.
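A toy illustration (not the paper's model, which learns the super-category structure itself) of how a single example can borrow a prior from related classes: the novel class mean is a precision-weighted blend of the one observation and the super-category mean, and the super-category's shared within-class variance is inherited as the similarity metric:

```python
# One-shot posterior for a novel class mean under a Gaussian prior borrowed
# from a super-category (toy sketch with known variances).
import numpy as np

def one_shot_posterior(x_new, related_class_means, within_class_var):
    """x_new: (D,) single example of the novel class.
    related_class_means: (M, D) means of classes in the inferred super-category.
    within_class_var: (D,) shared within-class variance (the inherited metric)."""
    prior_mean = related_class_means.mean(axis=0)        # super-category mean
    prior_var = related_class_means.var(axis=0) + 1e-6    # spread of class means
    # Conjugate Gaussian update with a single observation:
    post_var = 1.0 / (1.0 / prior_var + 1.0 / within_class_var)
    post_mean = post_var * (prior_mean / prior_var + x_new / within_class_var)
    return post_mean, within_class_var   # mean estimate + inherited metric
```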

Proceedings ArticleDOI
06 Nov 2011
TL;DR: This work proposes using a photorealistic virtual world to gain complete and repeatable control of the environment in order to evaluate image features, and calibrates the virtual world evaluations by comparing against feature rankings made from photographic data of the same subject matter.
Abstract: Image features are widely used in computer vision applications. They need to be robust to scene changes and image transformations. Designing and comparing feature descriptors requires the ability to evaluate their performance with respect to those transformations. We want to know how robust the descriptors are to changes in the lighting, scene, or viewing conditions. For this, we need ground truth data of different scenes viewed under different camera or lighting conditions in a controlled way. Such data is very difficult to gather in a real-world setting. We propose using a photorealistic virtual world to gain complete and repeatable control of the environment in order to evaluate image features. We calibrate our virtual world evaluations by comparing against feature rankings made from photographic data of the same subject matter (the Statue of Liberty). We find very similar feature rankings between the two datasets. We then use our virtual world to study the effects on descriptor performance of controlled changes in viewpoint and illumination. We also study the effect of augmenting the descriptors with depth information to improve performance.
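A minimal sketch of the controlled-transformation style of evaluation described above, here with a synthetic illumination change applied in image space rather than a rendered virtual world; OpenCV's SIFT and the placeholder file name are assumptions, not the paper's descriptor set:

```python
# Measure descriptor robustness to a controlled transformation via ratio-test matching.
import cv2
import numpy as np

def match_rate(img, transformed):
    sift = cv2.SIFT_create()
    kp1, d1 = sift.detectAndCompute(img, None)
    kp2, d2 = sift.detectAndCompute(transformed, None)
    matches = cv2.BFMatcher().knnMatch(d1, d2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # Lowe ratio test
    return len(good) / max(len(kp1), 1)

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)            # placeholder path
darker = np.clip(img.astype(np.float32) * 0.5, 0, 255).astype(np.uint8)
print("match rate under illumination change:", match_rate(img, darker))
```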

Journal Article
TL;DR: The goal of the current study is to determine the prototypical exemplars that best represent each visual scene category, and to evaluate the performance of state-of-the-art global feature algorithms at classifying different types of exemplars.

Proceedings Article
12 Dec 2011
TL;DR: Efficient learning and inference algorithms for the HDP-DBM model are presented and it is shown that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets.
Abstract: We introduce HD (or "Hierarchical-Deep") models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian models. Specifically we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine (DBM). This compound HDP-DBM model learns to learn novel concepts from very few training examples, by learning low-level generic features, high-level features that capture correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets.

Dissertation
01 Jan 2011
TL;DR: A new way to model where people look, learned from ground-truth eye tracking data with machine learning techniques, is described that outperforms all existing models, and a benchmark data set is provided to quantitatively compare existing and future models.
Abstract: For many applications in graphics, design, and human computer interaction, it is essential to understand where humans look in a scene. This is a challenging task given that no one fully understands how the human visual system works. This thesis explores the way people look at different types of images and provides methods of predicting where they look in new scenes. We describe a new way to model where people look from ground-truth eye tracking data using machine learning techniques; it outperforms all existing models, and we provide a benchmark data set to quantitatively compare existing and future models. In addition we explore how image resolution affects where people look. Our experiments, models, and large eye tracking data sets should help future researchers better understand and predict where people look in order to create more powerful computational vision systems.
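A minimal sketch of learning a fixation-prediction model from eye-tracking data in the spirit of the thesis: sample per-pixel feature vectors at fixated and non-fixated locations and train a linear classifier. The feature stack and the logistic-regression learner are illustrative assumptions, not the thesis's exact pipeline:

```python
# Learn a per-pixel saliency model from fixation maps (illustrative setup).
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_saliency_model(feature_maps, fixation_maps, n_per_image=500, seed=0):
    """feature_maps: list of (H, W, D) per-pixel feature stacks.
    fixation_maps: list of (H, W) binary maps of human fixations."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for F, fix in zip(feature_maps, fixation_maps):
        pos = np.argwhere(fix > 0)
        neg = np.argwhere(fix == 0)
        pos = pos[rng.choice(len(pos), min(n_per_image, len(pos)), replace=False)]
        neg = neg[rng.choice(len(neg), min(n_per_image, len(neg)), replace=False)]
        for (r, c), label in [(p, 1) for p in pos] + [(n, 0) for n in neg]:
            X.append(F[r, c]); y.append(label)
    model = LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))
    return model   # model.predict_proba over all pixels yields a saliency map
```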

Proceedings ArticleDOI
29 Dec 2011
TL;DR: This paper proposes a general framework to simultaneously perform object detection and segmentation on objects of different nature, based on a boosting procedure which automatically decides - according to the object properties - whether it is better to give more weight to the detection or to the segmentation process to improve both results.
Abstract: Numerous approaches to object detection and segmentation have been proposed so far. However, these methods are prone to fail in some general situations due to the nature of the objects themselves. For instance, classical approaches to object detection and segmentation obtain good results for some specific object classes (e.g. detection of pedestrians or segmentation of cars). However, these methods have trouble when detecting or segmenting object classes with different distinctive characteristics (e.g. cars and horses versus sky and road). In this paper, we propose a general framework to simultaneously perform object detection and segmentation on objects of different nature. Our approach is based on a boosting procedure which automatically decides - according to the object properties - whether it is better to give more weight to the detection or to the segmentation process to improve both results. We validate our approach using different object classes from the LabelMe, TUD and Weizmann databases, obtaining competitive detection and segmentation results.

Journal Article
TL;DR: This paper presents the raw RT distributions and fit several psychologically motivated functions (ex-Gaussian, ex-Wald, Gamma, and Weibull) to the data, and analyzes and interpret parameter trends from these four functions within the context of theories of visual search.
Abstract: Many visual search experiments measure response time (RT) as their primary dependent variable. Analyses typically focus on mean (or median) RT. However, given enough data, the RT distribution can be a rich source of information. For this paper, we collected about 500 trials per cell per observer for both target-present and target-absent displays in each of three classic search tasks: feature search, with the target defined by color; conjunction search, with the target defined by both color and orientation; and spatial configuration search for a 2 among distractor 5s. This large data set allows us to characterize the RT distributions in detail. We present the raw RT distributions and fit several psychologically motivated functions (ex-Gaussian, ex-Wald, Gamma, and Weibull) to the data. We analyze and interpret parameter trends from these four functions within the context of theories of visual search.

Journal ArticleDOI
TL;DR: This study identifies the 'medium-blur' condition, images approximately 32 pixels on a side, to be the limit for accurate 3-D shape perception and finds that degradation affects the perceived slant of point-estimates, making images look flatter as degradation increases.
Abstract: How little do we need to perceive 3-D shape in monocular natural images? The shape-from-texture and shape-from-shading perspectives would suggest that 3-D perception vanishes once low-level cues are disrupted. Is this the case in human vision? Or can top-down influences salvage the percept? In this study we probe this question by employing a gauge-figure paradigm similar to that used by Koenderink et al (1992, Perception & Psychophysics 52 487-496). Subjects were presented with degraded natural images and instructed to make local assessments of slant and tilt at various locations, thereby quantifying their internal 3-D percept. Analysis of subjects' responses reveals recognition to be a significant influence, thereby allowing subjects to perceive 3-D shape at high levels of degradation. Specifically, we identify the 'medium-blur' condition, images approximately 32 pixels on a side, to be the limit for accurate 3-D shape perception. In addition, we find that degradation affects the perceived slant of point-estimates, making images look flatter as degradation increases.

01 Jan 2011
TL;DR: This thesis exploits the advantages of tree-structured graphical models and considers modifications to overcome their limitations, and presents methods for learning such models given data at the finest scale by formulating a convex optimization problem.
Abstract: Probabilistic models commonly assume that variables are independent of each other conditioned on a subset of other variables. Graphical models provide a powerful framework for encoding such conditional independence structure of a large collection of random variables. A special class of graphical models with significant theoretical and practical importance is the class of tree-structured graphical models. Tree models have several advantages: they can be easily learned given data, their structures are often intuitive, and inference in tree models is highly efficient. However, tree models make strong conditional independence assumptions, which limit their modeling power substantially. This thesis exploits the advantages of tree-structured graphical models and considers modifications to overcome their limitations. To improve the modeling accuracy of tree models, we consider latent trees in which variables at some nodes represent the original (observed) variables of interest while others represent the latent variables added during the learning procedure. The appeal of such models is clear: the additional latent variables significantly increase the modeling power, and inference on trees is scalable with or without latent variables. We propose two computationally efficient and statistically consistent algorithms for learning latent trees, and compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree models. We exploit the advantages of tree models in the application of modeling contextual information of an image. Object co-occurrences and spatial relationships can be important cues in recognizing and localizing object instances. We develop tree-based context models and demonstrate that their simplicity enables us to integrate many sources of contextual information efficiently. In addition to object recognition, we are interested in using context models to detect objects that are out of their normal context. This task requires precise and careful modeling of object relationships, so we use a latent tree for object co-occurrences. Many of the latent variables can be interpreted as scene categories, capturing higher-order dependencies among object categories. Tree-structured graphical models have been widely used in multi-resolution (MR) modeling. In the last part of the thesis, we move beyond trees and propose a new modeling framework that allows additional dependency structure at each scale of an MR tree model. We mainly focus on MR models with jointly Gaussian variables, and assume that variables at each scale have sparse covariance structure (as opposed to fully-uncorrelated structure in MR trees) conditioned on variables at other scales. We develop efficient inference algorithms that are partly based on inference on the embedded MR tree and partly based on local filtering at each scale. In addition, we present methods for learning such models given data at the finest scale by formulating a convex optimization problem.
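The thesis's latent-tree learning algorithms are more involved; as background on why fully observed tree models "can be easily learned given data", here is a sketch of the classical Chow-Liu procedure (standard material, not the thesis's specific algorithm): estimate pairwise mutual information from samples and take a maximum-weight spanning tree.

```python
# Chow-Liu tree structure learning from samples of D variables.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mutual_information(x, y, bins=8):
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def chow_liu_edges(X, bins=8):
    """X: (N, D) samples. Returns the edge list of the learned tree."""
    D = X.shape[1]
    mi = np.zeros((D, D))
    for i in range(D):
        for j in range(i + 1, D):
            mi[i, j] = mutual_information(X[:, i], X[:, j], bins) + 1e-12
    # Maximum-weight spanning tree == minimum spanning tree on negated weights.
    tree = minimum_spanning_tree(-mi).toarray()
    return [(i, j) for i in range(D) for j in range(D) if tree[i, j] != 0]
```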

01 Aug 2011
TL;DR: It is demonstrated that this new object-level image prior not only scales well to include arbitrary high-order object relationships, but also seamlessly integrates multiple sources of image information such as scene categorization, scene parsing and object detection.
Abstract: Although context is a key component to the success of building an object recognition system, it is difficult to scale and integrate existing formulations of contextual rules to take into account multiple sources of information. In this paper, we propose a generic, object-level image prior to represent rich, complicated contextual relationships. A maximum entropy distribution is learned to model the possible layouts of objects and scenes by placing constraints on the prior distribution. We demonstrate that this new object-level image prior not only scales well to include arbitrary high-order object relationships, but also seamlessly integrates multiple sources of image information such as scene categorization, scene parsing and object detection. The result is a more comprehensive understanding of the image.
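A toy sketch of the maximum-entropy fitting described above: over a handful of binary object-presence variables, choose weights on singleton and pairwise features by gradient ascent so that model expectations match the empirical ones (exact enumeration, so this only scales to a few objects; the paper's feature set and learning procedure are richer):

```python
# Fit a small maximum-entropy (log-linear) prior over binary object-presence vectors.
import itertools
import numpy as np

def fit_maxent(samples, n_iter=500, lr=0.5):
    """samples: (N, D) binary object-presence vectors; features are singletons + pairs."""
    N, D = samples.shape
    pairs = list(itertools.combinations(range(D), 2))
    def feats(x):
        return np.concatenate([x, [x[i] * x[j] for i, j in pairs]])
    target = np.mean([feats(x) for x in samples], axis=0)          # empirical expectations
    states = np.array(list(itertools.product([0, 1], repeat=D)), dtype=float)
    F = np.array([feats(s) for s in states])                        # (2^D, n_features)
    w = np.zeros(F.shape[1])
    for _ in range(n_iter):
        logp = F @ w
        p = np.exp(logp - logp.max()); p /= p.sum()                 # model distribution
        w += lr * (target - p @ F)                                  # match expectations
    return w, states, p   # weights define the learned object-layout prior
```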