
Papers by Antonio Torralba published in 2014


Proceedings Article
08 Dec 2014
TL;DR: A new scene-centric database called Places, with over 7 million labeled pictures of scenes, is introduced, along with new methods for comparing the density and diversity of image datasets; Places is shown to be as dense as other scene datasets while being more diverse.
Abstract: Scene recognition is one of the hallmark tasks of computer vision, allowing definition of a context for object recognition. Whereas the tremendous recent progress in object recognition tasks is due to the availability of large datasets like ImageNet and the rise of Convolutional Neural Networks (CNNs) for learning high-level features, performance at scene recognition has not attained the same level of success. This may be because current deep features trained from ImageNet are not competitive enough for such tasks. Here, we introduce a new scene-centric database called Places with over 7 million labeled pictures of scenes. We propose new methods to compare the density and diversity of image datasets and show that Places is as dense as other scene datasets and has more diversity. Using CNN, we learn deep features for scene recognition tasks, and establish new state-of-the-art results on several scene-centric datasets. A visualization of the CNN layers' responses allows us to show differences in the internal representations of object-centric and scene-centric networks.
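As a rough illustration of the kind of pipeline described above, the sketch below adapts a generic ImageNet-pretrained CNN to scene categories and exposes its penultimate layer as a deep feature extractor. This is not the authors' code: the paper trains an AlexNet-style Places-CNN, and the torchvision ResNet-18, the class count, and the hyperparameters here are stand-ins.

```python
# Minimal sketch (not the authors' code): adapting an ImageNet-pretrained CNN
# to scene classification, in the spirit of the Places-CNN experiments.
# The ResNet-18 backbone and the class count below are placeholder choices.
import torch
import torch.nn as nn
from torchvision import models

NUM_SCENE_CLASSES = 205  # placeholder; set to the number of scene categories used

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_SCENE_CLASSES)  # new scene head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train_step(images, labels):
    """One SGD step on a batch of scene-labeled images of shape (N, 3, 224, 224)."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Deep features for downstream scene-recognition tasks can be taken from the
# penultimate layer of the trained network:
feature_extractor = nn.Sequential(*list(model.children())[:-1])
```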

2,960 citations


Posted Content
TL;DR: In this paper, the authors show that object detectors emerge from training CNNs to perform scene classification, and demonstrate that the same network can perform both scene recognition and object localization in a single forward pass without ever having been explicitly taught the notion of objects.
Abstract: With the success of new computational architectures for visual processing, such as convolutional neural networks (CNN) and access to image databases with millions of labeled examples (e.g., ImageNet, Places), the state of the art in computer vision is advancing rapidly. One important factor for continued progress is to understand the representations that are learned by the inner layers of these deep architectures. Here we show that object detectors emerge from training CNNs to perform scene classification. As scenes are composed of objects, the CNN for scene classification automatically discovers meaningful object detectors, representative of the learned scene categories. With object detectors emerging as a result of learning to recognize scenes, our work demonstrates that the same network can perform both scene recognition and object localization in a single forward-pass, without ever having been explicitly taught the notion of objects.
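A minimal sketch of the localization idea mentioned above: with a CNN trained only for scene classification, one can upsample and threshold a convolutional unit's activation map to obtain a coarse object mask in a single forward pass. This illustrates the general idea only, not the paper's receptive-field analysis; the function name, unit index, and threshold rule are assumptions.

```python
# Minimal sketch (not the paper's procedure): probe where one convolutional
# unit of a scene-classification CNN fires, and threshold the upsampled
# activation map to obtain a coarse localization in one forward pass.
import torch
import torch.nn.functional as F

def coarse_localization(scene_cnn_conv, image, unit_index, keep_ratio=0.2):
    """scene_cnn_conv: convolutional trunk of a scene CNN; image: (3, H, W) tensor.
    Returns a binary mask over the input where unit `unit_index` responds strongly."""
    with torch.no_grad():
        fmap = scene_cnn_conv(image.unsqueeze(0))        # (1, C, h, w)
    act = fmap[0, unit_index]                            # (h, w) activation map
    act = F.interpolate(act[None, None], size=image.shape[1:],
                        mode="bilinear", align_corners=False)[0, 0]
    threshold = act.max() * (1.0 - keep_ratio)           # keep the strongest responses
    return act >= threshold                              # rough "object" mask
```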

874 citations


Journal ArticleDOI
TL;DR: It is shown that memorability is an intrinsic and stable property of an image, shared across different viewers and stable across delays; this work is a first attempt to quantify this useful property of images.
Abstract: When glancing at a magazine, or browsing the Internet, we are continuously exposed to photographs. Despite this overflow of visual information, humans are extremely good at remembering thousands of pictures along with some of their visual details. But not all images are equal in memory. Some stick in our minds while others are quickly forgotten. In this paper, we focus on the problem of predicting how memorable an image will be. We show that memorability is an intrinsic and stable property of an image that is shared across different viewers, and remains stable across delays. We introduce a database for which we have measured the probability that each picture will be recognized after a single view. We analyze a collection of image features, labels, and attributes that contribute to making an image memorable, and we train a predictor based on global image descriptors. We find that predicting image memorability is a task that can be addressed with current computer vision techniques. While making memorable images is a challenging task in visualization, photography, and education, this work is a first attempt to quantify this useful property of images.
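A minimal sketch of the prediction step described above, assuming per-image global descriptors and measured memorability scores are already available: a support vector regressor maps descriptors to scores and is evaluated by rank correlation, in the spirit of the paper. The hyperparameters below are placeholders.

```python
# Minimal sketch (not the authors' code): support vector regression from global
# image descriptors to measured memorability scores, evaluated by rank
# correlation. `descriptors` and `scores` are assumed to be precomputed
# (e.g., GIST-like features and per-image recognition probabilities).
from scipy.stats import spearmanr
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

def train_memorability_predictor(descriptors, scores):
    """descriptors: (N, D) global features; scores: (N,) measured memorability."""
    X_tr, X_te, y_tr, y_te = train_test_split(descriptors, scores,
                                              test_size=0.25, random_state=0)
    predictor = SVR(kernel="rbf", C=1.0, epsilon=0.01).fit(X_tr, y_tr)
    rho, _ = spearmanr(predictor.predict(X_te), y_te)  # rank correlation with humans
    return predictor, rho
```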

254 citations


Book ChapterDOI
06 Sep 2014
TL;DR: A learning-based framework takes steps towards assessing how well people perform actions in videos by training a regression model from spatiotemporal pose features to scores obtained from expert judges; the approach can also provide interpretable feedback on how people can improve their actions.
Abstract: While recent advances in computer vision have provided reliable methods to recognize actions in both images and videos, the problem of assessing how well people perform actions has been largely unexplored in computer vision. Since methods for assessing action quality have many real-world applications in healthcare, sports, and video retrieval, we believe the computer vision community should begin to tackle this challenging problem. To spur progress, we introduce a learning-based framework that takes steps towards assessing how well people perform actions in videos. Our approach works by training a regression model from spatiotemporal pose features to scores obtained from expert judges. Moreover, our approach can provide interpretable feedback on how people can improve their action. We evaluate our method on a new Olympic sports dataset, and our experiments suggest our framework is able to rank the athletes more accurately than a non-expert human. While promising, our method is still a long way from rivaling the performance of expert judges, indicating that there is significant opportunity in computer vision research to improve on this difficult yet important task.
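A minimal sketch of the regression setup described above, under the assumption that per-frame pose estimates are available: joint trajectories are compressed with a low-frequency DCT and regressed against the judges' scores. The descriptor, the number of retained frequencies, and the SVR regressor are illustrative stand-ins, not the authors' exact features.

```python
# Minimal sketch (illustrative, not the authors' implementation): encode joint
# trajectories with a low-frequency DCT and regress expert judges' scores.
import numpy as np
from scipy.fftpack import dct
from sklearn.svm import SVR

def pose_features(joint_trajectories, k=20):
    """joint_trajectories: (T, J, 2) array of J joint positions over T frames.
    Returns a fixed-length spatiotemporal descriptor (assumes T >= k)."""
    T, J, _ = joint_trajectories.shape
    flat = joint_trajectories.reshape(T, J * 2)       # one temporal signal per coordinate
    coeffs = dct(flat, axis=0, norm="ortho")[:k]      # keep the k lowest frequencies
    return coeffs.ravel()

def train_quality_model(videos, judge_scores):
    """videos: list of (T_i, J, 2) pose arrays; judge_scores: expert score per video."""
    X = np.stack([pose_features(v) for v in videos])
    return SVR(kernel="linear").fit(X, judge_scores)
```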

177 citations


Book ChapterDOI
06 Sep 2014
TL;DR: This work proposes to characterize the identity of a city via an attribute analysis of 2 million geo-tagged images from 21 cities over 3 continents, using the scene attributes of these images to build a higher-level set of 7 city attributes, tailored to the form and function of cities.
Abstract: After hundreds of years of human settlement, each city has formed a distinct identity, distinguishing itself from other cities. In this work, we propose to characterize the identity of a city via an attribute analysis of 2 million geo-tagged images from 21 cities over 3 continents. First, we estimate the scene attributes of these images and use this representation to build a higher-level set of 7 city attributes, tailored to the form and function of cities. Then, we conduct the city identity recognition experiments on the geo-tagged images and identify images with salient city identity on each city attribute. Based on the misclassification rate of the city identity recognition, we analyze the visual similarity among different cities. Finally, we discuss the potential application of computer vision to urban planning.
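A minimal sketch of the recognition step described above, assuming per-image scene-attribute vectors have already been computed: a classifier predicts which city an image comes from, and the row-normalized confusion matrix serves as a proxy for visual similarity between cities. The logistic-regression classifier and the train/test split are assumptions, not the paper's setup.

```python
# Minimal sketch (not the paper's pipeline): classify which city an image comes
# from using its scene-attribute vector, then read visual similarity between
# cities off the misclassification rates.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def city_similarity(attribute_vectors, city_labels):
    """attribute_vectors: (N, A) scene-attribute responses per geo-tagged image.
    city_labels: (N,) integer city index per image."""
    X_tr, X_te, y_tr, y_te = train_test_split(attribute_vectors, city_labels,
                                              test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    cm = confusion_matrix(y_te, clf.predict(X_te)).astype(float)
    cm /= cm.sum(axis=1, keepdims=True)  # row-normalized misclassification rates
    return clf, cm                       # large off-diagonal mass = visually similar cities
```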

131 citations


Book ChapterDOI
06 Sep 2014
TL;DR: A novel approach to localizing objects in an image and estimating their fine pose: FPM, a fine-pose parts-based model that combines geometric information, in the form of shared 3D parts in deformable part-based models, with appearance information, in the form of objectness, to achieve both fast and accurate fine-pose estimation.
Abstract: We introduce a novel approach to the problem of localizing objects in an image and estimating their fine-pose. Given exact CAD models, and a few real training images with aligned models, we propose to leverage the geometric information from CAD models and appearance information from real images to learn a model that can accurately estimate fine pose in real images. Specifically, we propose FPM, a fine pose parts-based model, that combines geometric information in the form of shared 3D parts in deformable part based models, and appearance information in the form of objectness to achieve both fast and accurate fine pose estimation. Our method significantly outperforms current state-of-the-art algorithms in both accuracy and speed.

112 citations


Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work proposes to look beyond the visible elements of a scene; it is demonstrated that a scene is not just a collection of objects and their configuration or the labels assigned to its pixels - it is so much more.
Abstract: A common thread that ties together many prior works in scene understanding is their focus on the aspects directly present in a scene, such as its categorical classification or the set of objects. In this work, we propose to look beyond the visible elements of a scene; we demonstrate that a scene is not just a collection of objects and their configuration or the labels assigned to its pixels - it is so much more. From a simple observation of a scene, we can tell a lot about the environment surrounding the scene, such as the potential establishments near it, the potential crime rate in the area, or even the economic climate. Here, we explore several of these aspects from both the human perception and computer vision perspective. Specifically, we show that it is possible to predict the distance of surrounding establishments such as McDonald's or hospitals even by using scenes located far from them. We go a step further to show that both humans and computers perform well at navigating the environment based only on visual cues from scenes. Lastly, we show that it is possible to predict the crime rates in an area simply by looking at a scene without any real-time criminal activity. Simply put, here, we illustrate that it is possible to look beyond the visible scene.
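A minimal sketch of one of the prediction tasks mentioned above, under assumed inputs: ridge regression from per-image scene descriptors to the distance of the nearest establishment. The features, regressor, and evaluation protocol are placeholders for the paper's actual setup.

```python
# Minimal sketch (assumed setup, not the authors' code): regress the distance
# to the nearest establishment (e.g. a McDonald's) from scene feature vectors.
# `scene_features` and `distances_km` are assumed to be computed beforehand
# from the images and their geo-tags.
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def evaluate_distance_regressor(scene_features, distances_km):
    """scene_features: (N, D) per-image descriptors; distances_km: (N,) targets."""
    scores = cross_val_score(Ridge(alpha=1.0), scene_features, distances_km,
                             scoring="neg_mean_absolute_error", cv=5)
    return -scores.mean()  # mean absolute error in kilometres
```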

83 citations


Journal Article
TL;DR: The Scene Understanding database contains 908 distinct scene categories and 131,072 images; the authors analyze co-occurrence statistics and the contextual relationship between scenes and objects, and perform two human experiments to quantify human scene recognition accuracy and to measure how typical each image is of its assigned scene category.
Abstract: Progress in scene understanding requires reasoning about the rich and diverse visual environments that make up our daily experience. To this end, we propose the Scene Understanding database, a nearly exhaustive collection of scenes categorized at the same level of specificity as human discourse. The database contains 908 distinct scene categories and 131,072 images. Given this data with both scene and object labels available, we perform in-depth analysis of co-occurrence statistics and the contextual relationship. To better understand this large scale taxonomy of scene categories, we perform two human experiments: we quantify human scene recognition accuracy, and we measure how typical each image is of its assigned scene category. Next, we perform computational experiments: scene recognition with global image features, indoor versus outdoor classification, and “scene detection,” in which we relax the assumption that one image depicts only one scene category. Finally, we relate human experiments to machine performance and explore the relationship between human and machine recognition errors and the relationship between image “typicality” and machine recognition accuracy.

44 citations


ReportDOI
TL;DR: Recently developed natural language models are used to mine knowledge stored in massive amounts of text, and the results suggest that transferring knowledge from language into vision can help machines understand why a person might be performing an action in an image.
Abstract: Humans have the remarkable capability to infer the motivations of other people's actions, likely due to cognitive skills known in psychophysics as the theory of mind. In this paper, we strive to build a computational model that predicts the motivation behind the actions of people from images. To our knowledge, this challenging problem has not yet been extensively explored in computer vision. We present a novel learning-based framework that uses high-level visual recognition to infer why people are performing an action in images. However, the information in an image alone may not be sufficient to automatically solve this task. Since humans can rely on their own experiences to infer motivation, we propose to give computer vision systems access to some of these experiences by using recently developed natural language models to mine knowledge stored in massive amounts of text. While we are still far away from automatically inferring motivation, our results suggest that transferring knowledge from language into vision can help machines understand why a person might be performing an action in an image.

41 citations


Journal ArticleDOI
TL;DR: In this paper, the authors identify and study two types of "accidental" images that can be formed in scenes: the first is an accidental pinhole camera image, and the second is an "inverse" pinhole camera image, formed by subtracting an image with a small occluder present from a reference image without the occluder.
Abstract: We identify and study two types of "accidental" images that can be formed in scenes. The first is an accidental pinhole camera image. The second class of accidental images are "inverse" pinhole camera images, formed by subtracting an image with a small occluder present from a reference image without the occluder. Both types of accidental cameras happen in a variety of different situations. For example, an indoor scene illuminated by natural light, a street with a person walking under the shadow of a building, etc. The images produced by accidental cameras are often mistaken for shadows or interreflections. However, accidental images can reveal information about the scene outside the image, the lighting conditions, or the aperture by which light enters the scene.
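The inverse-pinhole construction described above can be illustrated with a few lines of image arithmetic: subtracting the frame containing a small occluder from the reference frame isolates the light blocked by the occluder, which behaves like a pinhole image of the scene outside the view. The file names below are placeholders, and real footage would also need alignment and noise handling that this sketch omits.

```python
# Minimal sketch of the "inverse pinhole" construction described above:
# the difference between a reference frame and a frame containing a small
# occluder acts like the image formed by a pinhole at the occluder's position.
import numpy as np
from PIL import Image

reference = np.asarray(Image.open("reference_no_occluder.jpg"), dtype=np.float32)
occluded = np.asarray(Image.open("with_small_occluder.jpg"), dtype=np.float32)

accidental = reference - occluded            # light blocked by the occluder
accidental -= accidental.min()
accidental /= max(float(accidental.max()), 1e-6)  # normalize to [0, 1] for display

Image.fromarray((accidental * 255).astype(np.uint8)).save("accidental_image.png")
```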

27 citations


Posted Content
TL;DR: The problem of predicting why a person has performed an action in images is introduced and results suggest that transferring knowledge from language into vision can help machines understand why people in images might be performing an action.
Abstract: Understanding human actions is a key problem in computer vision. However, recognizing actions is only the first step of understanding what a person is doing. In this paper, we introduce the problem of predicting why a person has performed an action in images. This problem has many applications in human activity understanding, such as anticipating or explaining an action. To study this problem, we introduce a new dataset of people performing actions annotated with likely motivations. However, the information in an image alone may not be sufficient to automatically solve this task. Since humans can rely on their lifetime of experiences to infer motivation, we propose to give computer vision systems access to some of these experiences by using recently developed natural language models to mine knowledge stored in massive amounts of text. While we are still far away from fully understanding motivation, our results suggest that transferring knowledge from language into vision can help machines understand why people in images might be performing an action.
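A crude stand-in for the knowledge-mining idea described above (the paper uses more capable natural language models): score candidate motivations by how often their words co-occur with the recognized action within sentences of a large text corpus. The corpus file, the candidate list, and the scoring rule are all assumptions for illustration.

```python
# Minimal sketch (a crude stand-in for the paper's language models): rank
# candidate motivations by sentence-level co-occurrence with the action word.
import re
from collections import Counter

def cooccurrence_scores(corpus_path, action, candidate_motivations):
    """Assumes one sentence per line in the corpus file (placeholder path)."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            words = set(re.findall(r"[a-z']+", line.lower()))
            if action in words:
                for motive in candidate_motivations:
                    if any(w in words for w in motive.lower().split()):
                        counts[motive] += 1
    total = sum(counts.values()) or 1
    return {m: counts[m] / total for m in candidate_motivations}

# Example: why might a person be "running" in an image?
# scores = cooccurrence_scores("corpus.txt", "running",
#                              ["catch a bus", "exercise", "escape danger"])
```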

ReportDOI
01 Jan 2014
TL;DR: A novel SVM formulation is presented that constrains the orientation of the SVM hyperplane to agree with the human visual system, and the results suggest that transferring this human bias into machines can help object recognition systems generalize across datasets.
Abstract: The human mind can remarkably imagine objects that it has never seen, touched, or heard, all in vivid detail. Motivated by the desire to harness this rich source of information from the human mind, this paper investigates how to extract classifiers from the human visual system and leverage them in a machine. We introduce a method that, inspired by well-known tools in human psychophysics, estimates the classifier that the human visual system might use for recognition, but in computer vision feature spaces. Our experiments are surprising, and suggest that classifiers from the human visual system can be transferred into a machine with some success. Since these classifiers seem to capture favorable biases in the human visual system, we present a novel SVM formulation that constrains the orientation of the SVM hyperplane to agree with the human visual system. Our results suggest that transferring this human bias into machines can help object recognition systems generalize across datasets. Moreover, we found that people's culture may subtly vary the objects that people imagine, which influences this bias. Overall, human imagination can be an interesting resource for future visual recognition systems.

Posted Content
TL;DR: In this paper, a method that estimates the biases that the human visual system might use for recognition, but in computer vision feature spaces, is introduced. And the authors suggest that transferring this human bias into machines may help object recognition systems generalize across datasets and perform better when very little training data is available.
Abstract: Although the human visual system can recognize many concepts under challenging conditions, it still has some biases. In this paper, we investigate whether we can extract these biases and transfer them into a machine recognition system. We introduce a novel method that, inspired by well-known tools in human psychophysics, estimates the biases that the human visual system might use for recognition, but in computer vision feature spaces. Our experiments are surprising, and suggest that classifiers from the human visual system can be transferred into a machine with some success. Since these classifiers seem to capture favorable biases in the human visual system, we further present an SVM formulation that constrains the orientation of the SVM hyperplane to agree with the bias from the human visual system. Our results suggest that transferring this human bias into machines may help object recognition systems generalize across datasets and perform better when very little training data is available.
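One plausible way to encode the constraint described above, shown as a simplified sketch rather than the paper's formulation: train a linear SVM whose weight vector is pulled toward a classifier estimated from human responses by penalizing the distance to that human hyperplane. The estimation of w_human, the subgradient optimizer, and the trade-off weights are assumptions.

```python
# Minimal sketch (a simplified stand-in for the paper's formulation): linear
# SVM with hinge loss plus a penalty ||w - w_human||^2 that pulls the learned
# hyperplane toward a classifier estimated from human psychophysics.
import numpy as np

def biased_svm(X, y, w_human, C=1.0, lam=1.0, lr=1e-3, epochs=200):
    """X: (N, D) features; y: (N,) labels in {-1, +1}; w_human: (D,) human prior."""
    w = w_human.copy()                        # start from the human classifier
    for _ in range(epochs):
        margins = y * (X @ w)
        violated = margins < 1                # examples inside the margin
        grad = 2 * lam * (w - w_human)        # pull toward the human hyperplane
        grad -= C * (y[violated, None] * X[violated]).sum(axis=0)  # hinge subgradient
        w -= lr * grad
    return w

# Predictions on new data: np.sign(X_test @ w)
```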

Proceedings ArticleDOI
24 Mar 2014
TL;DR: This paper introduces the first unsupervised method for automated hierarchical modeling of high-level latent regions using densely sampled, geotagged ground imagery at a global scale and shows the effectiveness of this method for discovering regional distributions at vastly different scales, including the Boston area, the United States, and the World.
Abstract: Densely and regularly sampled ground-level imagery collected from semi-automated vehicle collection platforms (e.g. Street View™ or EarthMine™) is rapidly becoming available on a global scale. Ground-level views of neighborhoods around the world can provide unique regional indicators (e.g. demographics, zoning, socioeconomic health) at a finer resolution than current census data. To this end, this paper introduces the first unsupervised method for automated hierarchical modeling of high-level latent regions using densely sampled, geotagged ground imagery at a global scale. Our unsupervised, nonparametric approach models a multi-scale, regularly sampled grid of panoramic ground-level images (where available), producing a multi-scale, multivariate estimate of the latent types of locations (e.g. commercial street, wooded area, park, country road, suburban street, etc.). Region types model the distributions of location types characteristic of particular neighborhoods, towns, or cities. These latent region types are used to cluster all locations into distinct geographic entities with regional signatures that can be used for comparison. We show the effectiveness of this method for discovering regional distributions at vastly different scales, including the Boston area, the United States, and the World.
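A much-simplified stand-in for the hierarchical model described above: two-stage clustering that first groups panorama descriptors into "location types" and then groups geographic grid cells by their histograms of location types, yielding latent "region types". The fixed cluster counts and k-means are assumptions; the paper infers the structure nonparametrically.

```python
# Minimal sketch (simplified stand-in for the paper's nonparametric model):
# two-stage k-means over panorama descriptors and per-cell location-type
# histograms. Cluster counts are placeholder choices.
import numpy as np
from sklearn.cluster import KMeans

def region_types(image_features, cell_ids, n_location_types=50, n_region_types=10):
    """image_features: (N, D) descriptors of geo-tagged panoramas.
    cell_ids: (N,) index of the geographic grid cell each panorama falls in."""
    loc_model = KMeans(n_clusters=n_location_types, n_init=10).fit(image_features)
    loc_types = loc_model.labels_

    n_cells = int(cell_ids.max()) + 1
    hist = np.zeros((n_cells, n_location_types))
    for cell, t in zip(cell_ids, loc_types):
        hist[cell, t] += 1
    hist /= np.maximum(hist.sum(axis=1, keepdims=True), 1)  # per-cell distribution

    region_model = KMeans(n_clusters=n_region_types, n_init=10).fit(hist)
    return loc_types, region_model.labels_   # per-image and per-cell assignments
```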

Proceedings ArticleDOI
24 Aug 2014
TL;DR: This work presents a non-linear object detector called Exemplar Network that efficiently encodes the space of all possible mixture models, and offers a framework that generalizes recent exemplar-based object detection with monolithic detectors.
Abstract: We present a non-linear object detector called Exemplar Network. Our model efficiently encodes the space of all possible mixture models, and offers a framework that generalizes recent exemplar-based object detection with monolithic detectors. We evaluate our method on a traffic scene dataset that we collected using onboard cameras, and demonstrate orientation estimation. Our model has both the interpretability and accessibility necessary for industrial applications. One can easily apply our method to a variety of applications.