
Papers by Antonio Torralba published in 2014


Proceedings Article
08 Dec 2014
TL;DR: A new scene-centric database called Places, with over 7 million labeled pictures of scenes, is introduced, along with new methods for comparing the density and diversity of image datasets; Places is shown to be as dense as other scene datasets while being more diverse.
Abstract: Scene recognition is one of the hallmark tasks of computer vision, allowing definition of a context for object recognition. Whereas the tremendous recent progress in object recognition tasks is due to the availability of large datasets like ImageNet and the rise of Convolutional Neural Networks (CNNs) for learning high-level features, performance at scene recognition has not attained the same level of success. This may be because current deep features trained from ImageNet are not competitive enough for such tasks. Here, we introduce a new scene-centric database called Places with over 7 million labeled pictures of scenes. We propose new methods to compare the density and diversity of image datasets and show that Places is as dense as other scene datasets and has more diversity. Using CNN, we learn deep features for scene recognition tasks, and establish new state-of-the-art results on several scene-centric datasets. A visualization of the CNN layers' responses allows us to show differences in the internal representations of object-centric and scene-centric networks.
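As a rough illustration of the kind of pipeline described above, the sketch below adapts a generic ImageNet-pretrained CNN to scene categories and exposes its penultimate layer as a deep feature extractor. This is not the authors' code: the paper trains an AlexNet-style Places-CNN, and the torchvision ResNet-18, the class count, and the hyperparameters here are stand-ins.

```python
# Minimal sketch (not the authors' code): adapting an ImageNet-pretrained CNN
# to scene classification, in the spirit of the Places-CNN experiments.
# The ResNet-18 backbone and the class count below are placeholder choices.
import torch
import torch.nn as nn
from torchvision import models

NUM_SCENE_CLASSES = 205  # placeholder; set to the number of scene categories used

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_SCENE_CLASSES)  # new scene head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train_step(images, labels):
    """One SGD step on a batch of scene-labeled images of shape (N, 3, 224, 224)."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Deep features for downstream scene-recognition tasks can be taken from the
# penultimate layer of the trained network:
feature_extractor = nn.Sequential(*list(model.children())[:-1])
```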

2,960 citations


Posted Content
TL;DR: In this paper, the authors show that object detectors emerge from training CNNs to perform scene classification, and demonstrate that the same network can perform both scene recognition and object localization in a single forward pass without ever having been explicitly taught the notion of objects.
Abstract: With the success of new computational architectures for visual processing, such as convolutional neural networks (CNN) and access to image databases with millions of labeled examples (e.g., ImageNet, Places), the state of the art in computer vision is advancing rapidly. One important factor for continued progress is to understand the representations that are learned by the inner layers of these deep architectures. Here we show that object detectors emerge from training CNNs to perform scene classification. As scenes are composed of objects, the CNN for scene classification automatically discovers meaningful object detectors, representative of the learned scene categories. With object detectors emerging as a result of learning to recognize scenes, our work demonstrates that the same network can perform both scene recognition and object localization in a single forward-pass, without ever having been explicitly taught the notion of objects.
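A minimal sketch of the localization idea mentioned above: with a CNN trained only for scene classification, one can upsample and threshold a convolutional unit's activation map to obtain a coarse object mask in a single forward pass. This illustrates the general idea only, not the paper's receptive-field analysis; the function name, unit index, and threshold rule are assumptions.

```python
# Minimal sketch (not the paper's procedure): probe where one convolutional
# unit of a scene-classification CNN fires, and threshold the upsampled
# activation map to obtain a coarse localization in one forward pass.
import torch
import torch.nn.functional as F

def coarse_localization(scene_cnn_conv, image, unit_index, keep_ratio=0.2):
    """scene_cnn_conv: convolutional trunk of a scene CNN; image: (3, H, W) tensor.
    Returns a binary mask over the input where unit `unit_index` responds strongly."""
    with torch.no_grad():
        fmap = scene_cnn_conv(image.unsqueeze(0))        # (1, C, h, w)
    act = fmap[0, unit_index]                            # (h, w) activation map
    act = F.interpolate(act[None, None], size=image.shape[1:],
                        mode="bilinear", align_corners=False)[0, 0]
    threshold = act.max() * (1.0 - keep_ratio)           # keep the strongest responses
    return act >= threshold                              # rough "object" mask
```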

874 citations


Journal ArticleDOI
TL;DR: It is shown that memorability is an intrinsic and stable property of an image, shared across different viewers and stable across delays; this work is a first attempt to quantify this useful property of images.
Abstract: When glancing at a magazine, or browsing the Internet, we are continuously exposed to photographs. Despite this overflow of visual information, humans are extremely good at remembering thousands of pictures along with some of their visual details. But not all images are equal in memory. Some stick in our minds while others are quickly forgotten. In this paper, we focus on the problem of predicting how memorable an image will be. We show that memorability is an intrinsic and stable property of an image that is shared across different viewers, and remains stable across delays. We introduce a database for which we have measured the probability that each picture will be recognized after a single view. We analyze a collection of image features, labels, and attributes that contribute to making an image memorable, and we train a predictor based on global image descriptors. We find that predicting image memorability is a task that can be addressed with current computer vision techniques. While making memorable images is a challenging task in visualization, photography, and education, this work is a first attempt to quantify this useful property of images.
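A minimal sketch of the prediction step described above, assuming per-image global descriptors and measured memorability scores are already available: a support vector regressor maps descriptors to scores and is evaluated by rank correlation, in the spirit of the paper. The hyperparameters below are placeholders.

```python
# Minimal sketch (not the authors' code): support vector regression from global
# image descriptors to measured memorability scores, evaluated by rank
# correlation. `descriptors` and `scores` are assumed to be precomputed
# (e.g., GIST-like features and per-image recognition probabilities).
from scipy.stats import spearmanr
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

def train_memorability_predictor(descriptors, scores):
    """descriptors: (N, D) global features; scores: (N,) measured memorability."""
    X_tr, X_te, y_tr, y_te = train_test_split(descriptors, scores,
                                              test_size=0.25, random_state=0)
    predictor = SVR(kernel="rbf", C=1.0, epsilon=0.01).fit(X_tr, y_tr)
    rho, _ = spearmanr(predictor.predict(X_te), y_te)  # rank correlation with humans
    return predictor, rho
```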

254 citations


Book ChapterDOI
06 Sep 2014
TL;DR: A learning-based framework takes steps towards assessing how well people perform actions in videos by training a regression model from spatiotemporal pose features to scores obtained from expert judges; the approach can also provide interpretable feedback on how people can improve their actions.
Abstract: While recent advances in computer vision have provided reliable methods to recognize actions in both images and videos, the problem of assessing how well people perform actions has been largely unexplored in computer vision. Since methods for assessing action quality have many real-world applications in healthcare, sports, and video retrieval, we believe the computer vision community should begin to tackle this challenging problem. To spur progress, we introduce a learning-based framework that takes steps towards assessing how well people perform actions in videos. Our approach works by training a regression model from spatiotemporal pose features to scores obtained from expert judges. Moreover, our approach can provide interpretable feedback on how people can improve their action. We evaluate our method on a new Olympic sports dataset, and our experiments suggest our framework is able to rank the athletes more accurately than a non-expert human. While promising, our method is still a long way from rivaling the performance of expert judges, indicating that there is significant opportunity in computer vision research to improve on this difficult yet important task.
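A minimal sketch of the regression setup described above, under the assumption that per-frame pose estimates are available: joint trajectories are compressed with a low-frequency DCT and regressed against the judges' scores. The descriptor, the number of retained frequencies, and the SVR regressor are illustrative stand-ins, not the authors' exact features.

```python
# Minimal sketch (illustrative, not the authors' implementation): encode joint
# trajectories with a low-frequency DCT and regress expert judges' scores.
import numpy as np
from scipy.fftpack import dct
from sklearn.svm import SVR

def pose_features(joint_trajectories, k=20):
    """joint_trajectories: (T, J, 2) array of J joint positions over T frames.
    Returns a fixed-length spatiotemporal descriptor (assumes T >= k)."""
    T, J, _ = joint_trajectories.shape
    flat = joint_trajectories.reshape(T, J * 2)       # one temporal signal per coordinate
    coeffs = dct(flat, axis=0, norm="ortho")[:k]      # keep the k lowest frequencies
    return coeffs.ravel()

def train_quality_model(videos, judge_scores):
    """videos: list of (T_i, J, 2) pose arrays; judge_scores: expert score per video."""
    X = np.stack([pose_features(v) for v in videos])
    return SVR(kernel="linear").fit(X, judge_scores)
```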

177 citations


Book ChapterDOI
06 Sep 2014
TL;DR: This work proposes to characterize the identity of a city via an attribute analysis of 2 million geo-tagged images from 21 cities over 3 continents, using the scene attributes of these images to build a higher-level set of 7 city attributes, tailored to the form and function of cities.
Abstract: After hundreds of years of human settlement, each city has formed a distinct identity, distinguishing itself from other cities. In this work, we propose to characterize the identity of a city via an attribute analysis of 2 million geo-tagged images from 21 cities over 3 continents. First, we estimate the scene attributes of these images and use this representation to build a higher-level set of 7 city attributes, tailored to the form and function of cities. Then, we conduct the city identity recognition experiments on the geo-tagged images and identify images with salient city identity on each city attribute. Based on the misclassification rate of the city identity recognition, we analyze the visual similarity among different cities. Finally, we discuss the potential application of computer vision to urban planning.
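A minimal sketch of the recognition step described above, assuming per-image scene-attribute vectors have already been computed: a classifier predicts which city an image comes from, and the row-normalized confusion matrix serves as a proxy for visual similarity between cities. The logistic-regression classifier and the train/test split are assumptions, not the paper's setup.

```python
# Minimal sketch (not the paper's pipeline): classify which city an image comes
# from using its scene-attribute vector, then read visual similarity between
# cities off the misclassification rates.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def city_similarity(attribute_vectors, city_labels):
    """attribute_vectors: (N, A) scene-attribute responses per geo-tagged image.
    city_labels: (N,) integer city index per image."""
    X_tr, X_te, y_tr, y_te = train_test_split(attribute_vectors, city_labels,
                                              test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    cm = confusion_matrix(y_te, clf.predict(X_te)).astype(float)
    cm /= cm.sum(axis=1, keepdims=True)  # row-normalized misclassification rates
    return clf, cm                       # large off-diagonal mass = visually similar cities
```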

131 citations


Book ChapterDOI
06 Sep 2014
TL;DR: A novel approach to localizing objects in an image and estimating their fine pose: FPM, a fine-pose parts-based model that combines geometric information, in the form of shared 3D parts in deformable part-based models, with appearance information, in the form of objectness, to achieve both fast and accurate fine-pose estimation.
Abstract: We introduce a novel approach to the problem of localizing objects in an image and estimating their fine-pose. Given exact CAD models, and a few real training images with aligned models, we propose to leverage the geometric information from CAD models and appearance information from real images to learn a model that can accurately estimate fine pose in real images. Specifically, we propose FPM, a fine pose parts-based model, that combines geometric information in the form of shared 3D parts in deformable part based models, and appearance information in the form of objectness to achieve both fast and accurate fine pose estimation. Our method significantly outperforms current state-of-the-art algorithms in both accuracy and speed.

112 citations


Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work proposes to look beyond the visible elements of a scene; it is demonstrated that a scene is not just a collection of objects and their configuration or the labels assigned to its pixels - it is so much more.
Abstract: A common thread that ties together many prior works in scene understanding is their focus on the aspects directly present in a scene, such as its categorical classification or the set of objects. In this work, we propose to look beyond the visible elements of a scene; we demonstrate that a scene is not just a collection of objects and their configuration or the labels assigned to its pixels - it is so much more. From a simple observation of a scene, we can tell a lot about the environment surrounding the scene, such as the potential establishments near it, the potential crime rate in the area, or even the economic climate. Here, we explore several of these aspects from both the human perception and computer vision perspective. Specifically, we show that it is possible to predict the distance of surrounding establishments such as McDonald's or hospitals even by using scenes located far from them. We go a step further to show that both humans and computers perform well at navigating the environment based only on visual cues from scenes. Lastly, we show that it is possible to predict the crime rates in an area simply by looking at a scene without any real-time criminal activity. Simply put, here, we illustrate that it is possible to look beyond the visible scene.
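A minimal sketch of one of the prediction tasks mentioned above, under assumed inputs: ridge regression from per-image scene descriptors to the distance of the nearest establishment. The features, regressor, and evaluation protocol are placeholders for the paper's actual setup.

```python
# Minimal sketch (assumed setup, not the authors' code): regress the distance
# to the nearest establishment (e.g. a McDonald's) from scene feature vectors.
# `scene_features` and `distances_km` are assumed to be computed beforehand
# from the images and their geo-tags.
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def evaluate_distance_regressor(scene_features, distances_km):
    """scene_features: (N, D) per-image descriptors; distances_km: (N,) targets."""
    scores = cross_val_score(Ridge(alpha=1.0), scene_features, distances_km,
                             scoring="neg_mean_absolute_error", cv=5)
    return -scores.mean()  # mean absolute error in kilometres
```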

83 citations


Journal Article
TL;DR: The Scene Understanding database contains 908 distinct scene categories and 131,072 images; the authors analyze co-occurrence statistics and the contextual relationship between scenes and objects, and perform two human experiments to quantify human scene recognition accuracy and to measure how typical each image is of its assigned scene category.
Abstract: Progress in scene understanding requires reasoning about the rich and diverse visual environments that make up our daily experience. To this end, we propose the Scene Understanding database, a nearly exhaustive collection of scenes categorized at the same level of specificity as human discourse. The database contains 908 distinct scene categories and 131,072 images. Given this data with both scene and object labels available, we perform in-depth analysis of co-occurrence statistics and the contextual relationship. To better understand this large scale taxonomy of scene categories, we perform two human experiments: we quantify human scene recognition accuracy, and we measure how typical each image is of its assigned scene category. Next, we perform computational experiments: scene recognition with global image features, indoor versus outdoor classification, and “scene detection,” in which we relax the assumption that one image depicts only one scene category. Finally, we relate human experiments to machine performance and explore the relationship between human and machine recognition errors and the relationship between image “typicality” and machine recognition accuracy.

44 citations


ReportDOI
TL;DR: Recently developed natural language models are used to mine knowledge stored in massive amounts of text, and the results suggest that transferring knowledge from language into vision can help machines understand why a person might be performing an action in an image.
Abstract: Humans have the remarkable capability to infer the motivations of other people's actions, likely due to cognitive skills known in psychophysics as the theory of mind. In this paper, we strive to build a computational model that predicts the motivation behind the actions of people from images. To our knowledge, this challenging problem has not yet been extensively explored in computer vision. We present a novel learning-based framework that uses high-level visual recognition to infer why people are performing an action in images. However, the information in an image alone may not be sufficient to automatically solve this task. Since humans can rely on their own experiences to infer motivation, we propose to give computer vision systems access to some of these experiences by using recently developed natural language models to mine knowledge stored in massive amounts of text. While we are still far away from automatically inferring motivation, our results suggest that transferring knowledge from language into vision can help machines understand why a person might be performing an action in an image.

41 citations


Journal ArticleDOI
TL;DR: In this paper, the authors identify and study two types of "accidental" images that can be formed in scenes: the first is an accidental pinhole camera image, and the second is an "inverse" pinhole camera image, formed by subtracting an image with a small occluder present from a reference image without the occluder.
Abstract: We identify and study two types of "accidental" images that can be formed in scenes. The first is an accidental pinhole camera image. The second class of accidental images are "inverse" pinhole camera images, formed by subtracting an image with a small occluder present from a reference image without the occluder. Both types of accidental cameras happen in a variety of different situations. For example, an indoor scene illuminated by natural light, a street with a person walking under the shadow of a building, etc. The images produced by accidental cameras are often mistaken for shadows or interreflections. However, accidental images can reveal information about the scene outside the image, the lighting conditions, or the aperture by which light enters the scene.
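The inverse-pinhole construction described above can be illustrated with a few lines of image arithmetic: subtracting the frame containing a small occluder from the reference frame isolates the light blocked by the occluder, which behaves like a pinhole image of the scene outside the view. The file names below are placeholders, and real footage would also need alignment and noise handling that this sketch omits.

```python
# Minimal sketch of the "inverse pinhole" construction described above:
# the difference between a reference frame and a frame containing a small
# occluder acts like the image formed by a pinhole at the occluder's position.
import numpy as np
from PIL import Image

reference = np.asarray(Image.open("reference_no_occluder.jpg"), dtype=np.float32)
occluded = np.asarray(Image.open("with_small_occluder.jpg"), dtype=np.float32)

accidental = reference - occluded            # light blocked by the occluder
accidental -= accidental.min()
accidental /= max(float(accidental.max()), 1e-6)  # normalize to [0, 1] for display

Image.fromarray((accidental * 255).astype(np.uint8)).save("accidental_image.png")
```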

27 citations


Posted Content
TL;DR: The problem of predicting why a person has performed an action in images is introduced and results suggest that transferring knowledge from language into vision can help machines understand why people in images might be performing an action.
Abstract: Understanding human actions is a key problem in computer vision. However, recognizing actions is only the first step of understanding what a person is doing. In this paper, we introduce the problem of predicting why a person has performed an action in images. This problem has many applications in human activity understanding, such as anticipating or explaining an action. To study this problem, we introduce a new dataset of people performing actions annotated with likely motivations. However, the information in an image alone may not be sufficient to automatically solve this task. Since humans can rely on their lifetime of experiences to infer motivation, we propose to give computer vision systems access to some of these experiences by using recently developed natural language models to mine knowledge stored in massive amounts of text. While we are still far away from fully understanding motivation, our results suggest that transferring knowledge from language into vision can help machines understand why people in images might be performing an action.
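A crude stand-in for the knowledge-mining idea described above (the paper uses more capable natural language models): score candidate motivations by how often their words co-occur with the recognized action within sentences of a large text corpus. The corpus file, the candidate list, and the scoring rule are all assumptions for illustration.

```python
# Minimal sketch (a crude stand-in for the paper's language models): rank
# candidate motivations by sentence-level co-occurrence with the action word.
import re
from collections import Counter

def cooccurrence_scores(corpus_path, action, candidate_motivations):
    """Assumes one sentence per line in the corpus file (placeholder path)."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            words = set(re.findall(r"[a-z']+", line.lower()))
            if action in words:
                for motive in candidate_motivations:
                    if any(w in words for w in motive.lower().split()):
                        counts[motive] += 1
    total = sum(counts.values()) or 1
    return {m: counts[m] / total for m in candidate_motivations}

# Example: why might a person be "running" in an image?
# scores = cooccurrence_scores("corpus.txt", "running",
#                              ["catch a bus", "exercise", "escape danger"])
```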

ReportDOI
01 Jan 2014
TL;DR: A novel SVM formulation is presented that constrains the orientation of the SVM hyperplane to agree with the human visual system, and the results suggest that transferring this human bias into machines can help object recognition systems generalize across datasets.
Abstract: The human mind can remarkably imagine objects that it has never seen, touched, or heard, all in vivid detail. Motivated by the desire to harness this rich source of information from the human mind, this paper investigates how to extract classifiers from the human visual system and leverage them in a machine. We introduce a method that, inspired by well-known tools in human psychophysics, estimates the classifier that the human visual system might use for recognition, but in computer vision feature spaces. Our experiments are surprising, and suggest that classifiers from the human visual system can be transferred into a machine with some success. Since these classifiers seem to capture favorable biases in the human visual system, we present a novel SVM formulation that constrains the orientation of the SVM hyperplane to agree with the human visual system. Our results suggest that transferring this human bias into machines can help object recognition systems generalize across datasets. Moreover, we found that people's culture may subtly vary the objects that people imagine, which influences this bias. Overall, human imagination can be an interesting resource for future visual recognition systems.

Posted Content
TL;DR: In this paper, a method that estimates the biases that the human visual system might use for recognition, but in computer vision feature spaces, is introduced. And the authors suggest that transferring this human bias into machines may help object recognition systems generalize across datasets and perform better when very little training data is available.
Abstract: Although the human visual system can recognize many concepts under challenging conditions, it still has some biases. In this paper, we investigate whether we can extract these biases and transfer them into a machine recognition system. We introduce a novel method that, inspired by well-known tools in human psychophysics, estimates the biases that the human visual system might use for recognition, but in computer vision feature spaces. Our experiments are surprising, and suggest that classifiers from the human visual system can be transferred into a machine with some success. Since these classifiers seem to capture favorable biases in the human visual system, we further present an SVM formulation that constrains the orientation of the SVM hyperplane to agree with the bias from the human visual system. Our results suggest that transferring this human bias into machines may help object recognition systems generalize across datasets and perform better when very little training data is available.
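One plausible way to encode the constraint described above, shown as a simplified sketch rather than the paper's formulation: train a linear SVM whose weight vector is pulled toward a classifier estimated from human responses by penalizing the distance to that human hyperplane. The estimation of w_human, the subgradient optimizer, and the trade-off weights are assumptions.

```python
# Minimal sketch (a simplified stand-in for the paper's formulation): linear
# SVM with hinge loss plus a penalty ||w - w_human||^2 that pulls the learned
# hyperplane toward a classifier estimated from human psychophysics.
import numpy as np

def biased_svm(X, y, w_human, C=1.0, lam=1.0, lr=1e-3, epochs=200):
    """X: (N, D) features; y: (N,) labels in {-1, +1}; w_human: (D,) human prior."""
    w = w_human.copy()                        # start from the human classifier
    for _ in range(epochs):
        margins = y * (X @ w)
        violated = margins < 1                # examples inside the margin
        grad = 2 * lam * (w - w_human)        # pull toward the human hyperplane
        grad -= C * (y[violated, None] * X[violated]).sum(axis=0)  # hinge subgradient
        w -= lr * grad
    return w

# Predictions on new data: np.sign(X_test @ w)
```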

Proceedings ArticleDOI
24 Mar 2014
TL;DR: This paper introduces the first unsupervised method for automated hierarchical modeling of high-level latent regions using densely sampled, geotagged ground imagery at a global scale and shows the effectiveness of this method for discovering regional distributions at vastly different scales, including the Boston area, the United States, and the World.
Abstract: Densely and regularly sampled ground-level imagery collected from semi-automated vehicle collection platforms (e.g. Street View™ or EarthMine™) is rapidly becoming available on a global scale. Ground-level views of neighborhoods around the world can provide unique regional indicators (e.g. demographics, zoning, socioeconomic health) at a finer resolution than current census data. To this end, this paper introduces the first unsupervised method for automated hierarchical modeling of high-level latent regions using densely sampled, geotagged ground imagery at a global scale. Our unsupervised, nonparametric approach models a multi-scale, regularly sampled grid of panoramic ground-level images (where available), producing a multi-scale, multivariate estimate of the latent types of locations (e.g. commercial street, wooded area, park, country road, suburban street, etc.). Region types model the distributions of location types characteristic of particular neighborhoods, towns, or cities. These latent region types are used to cluster all locations into distinct geographic entities with regional signatures that can be used for comparison. We show the effectiveness of this method for discovering regional distributions at vastly different scales, including the Boston area, the United States, and the World.
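A much-simplified stand-in for the hierarchical model described above: two-stage clustering that first groups panorama descriptors into "location types" and then groups geographic grid cells by their histograms of location types, yielding latent "region types". The fixed cluster counts and k-means are assumptions; the paper infers the structure nonparametrically.

```python
# Minimal sketch (simplified stand-in for the paper's nonparametric model):
# two-stage k-means over panorama descriptors and per-cell location-type
# histograms. Cluster counts are placeholder choices.
import numpy as np
from sklearn.cluster import KMeans

def region_types(image_features, cell_ids, n_location_types=50, n_region_types=10):
    """image_features: (N, D) descriptors of geo-tagged panoramas.
    cell_ids: (N,) index of the geographic grid cell each panorama falls in."""
    loc_model = KMeans(n_clusters=n_location_types, n_init=10).fit(image_features)
    loc_types = loc_model.labels_

    n_cells = int(cell_ids.max()) + 1
    hist = np.zeros((n_cells, n_location_types))
    for cell, t in zip(cell_ids, loc_types):
        hist[cell, t] += 1
    hist /= np.maximum(hist.sum(axis=1, keepdims=True), 1)  # per-cell distribution

    region_model = KMeans(n_clusters=n_region_types, n_init=10).fit(hist)
    return loc_types, region_model.labels_   # per-image and per-cell assignments
```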

Proceedings ArticleDOI
24 Aug 2014
TL;DR: This work presents a non-linear object detector called Exemplar Network that efficiently encodes the space of all possible mixture models, and offers a framework that generalizes recent exemplar-based object detection with monolithic detectors.
Abstract: We present a non-linear object detector called Exemplar Network. Our model efficiently encodes the space of all possible mixture models, and offers a framework that generalizes recent exemplar-based object detection with monolithic detectors. We evaluate our method on a traffic scene dataset that we collected using onboard cameras, and demonstrate orientation estimation. Our model has both the interpretability and accessibility necessary for industrial applications. One can easily apply our method to a variety of applications.