Showing papers by "Antonio Torralba published in 2015"

PDF

Open Access

Posted Content•

Learning Deep Features for Discriminative Localization

[...]

Bolei Zhou¹, Aditya Khosla¹, Agata Lapedriza¹, Aude Oliva¹, Antonio Torralba¹ - Show less +1 more•Institutions (1)

14 Dec 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, the authors revisited the global average pooling layer and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels.

...read moreread less

Abstract: In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks. Despite the apparent simplicity of global average pooling, we are able to achieve 37.1% top-5 error for object localization on ILSVRC 2014, which is remarkably close to the 34.2% top-5 error achieved by a fully supervised CNN approach. We demonstrate that our network is able to localize the discriminative image regions on a variety of tasks despite not being trained for them

...read moreread less

5,065 citations

Proceedings Article•DOI•

Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books

[...]

Yukun Zhu¹, Ryan Kiros¹, Richard S. Zemel², Ruslan Salakhutdinov¹, Raquel Urtasun¹, Antonio Torralba³, Sanja Fidler¹ - Show less +3 more•Institutions (3)

University of Toronto¹, Canadian Institute for Advanced Research², Massachusetts Institute of Technology³

07 Dec 2015

TL;DR: The authors align books to their movie releases to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in the current datasets, and propose a context-aware CNN to combine information from multiple sources.

...read moreread less

Abstract: Books are a rich source of both fine-grained information, how a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in the current datasets. To align movies and books we propose a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.

...read moreread less

2,105 citations

Proceedings Article•

Skip-thought vectors

[...]

Ryan Kiros¹, Yukun Zhu¹, Ruslan Salakhutdinov², Richard S. Zemel², Antonio Torralba³, Raquel Urtasun¹, Sanja Fidler¹ - Show less +3 more•Institutions (3)

University of Toronto¹, Canadian Institute for Advanced Research², Massachusetts Institute of Technology³

07 Dec 2015

TL;DR: This article used the continuity of text from books to train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage, which can produce highly generic sentence representations that are robust and perform well in practice.

...read moreread less

Abstract: We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice.

...read moreread less

1,802 citations

Posted Content•

Skip-Thought Vectors

[...]

Ryan Kiros¹, Yukun Zhu¹, Ruslan Salakhutdinov², Richard S. Zemel², Antonio Torralba³, Raquel Urtasun¹, Sanja Fidler¹ - Show less +3 more•Institutions (3)

University of Toronto¹, Canadian Institute for Advanced Research², Massachusetts Institute of Technology³

22 Jun 2015-arXiv: Computation and Language

TL;DR: The approach for unsupervised learning of a generic, distributed sentence encoder is described, using the continuity of text from books to train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage.

...read moreread less

1,115 citations

Proceedings Article•

Object Detectors Emerge in Deep Scene CNNs

[...]

Bolei Zhou¹, Aditya Khosla¹, Agata Lapedriza¹, Aude Oliva¹, Antonio Torralba¹ - Show less +1 more•Institutions (1)

Massachusetts Institute of Technology¹

01 May 2015

TL;DR: This work demonstrates that the same network can perform both scene recognition and object localization in a single forward-pass, without ever having been explicitly taught the notion of objects.

...read moreread less

Abstract: With the success of new computational architectures for visual processing, such as convolutional neural networks (CNN) and access to image databases with millions of labeled examples (e.g., ImageNet, Places), the state of the art in computer vision is advancing rapidly. One important factor for continued progress is to understand the representations that are learned by the inner layers of these deep architectures. Here we show that object detectors emerge from training CNNs to perform scene classification. As scenes are composed of objects, the CNN for scene classification automatically discovers meaningful objects detectors, representative of the learned scene categories. With object detectors emerging as a result of learning to recognize scenes, our work demonstrates that the same network can perform both scene recognition and object localization in a single forward-pass, without ever having been explicitly taught the notion of objects.

...read moreread less

649 citations

Posted Content•

Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

[...]

Yukun Zhu¹, Ryan Kiros¹, Richard S. Zemel², Ruslan Salakhutdinov¹, Raquel Urtasun¹, Antonio Torralba³, Sanja Fidler¹ - Show less +3 more•Institutions (3)

University of Toronto¹, Canadian Institute for Advanced Research², Massachusetts Institute of Technology³

22 Jun 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: To align movies and books, a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book are proposed.

...read moreread less

Abstract: Books are a rich source of both fine-grained information, how a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.

...read moreread less

307 citations

Proceedings Article•DOI•

Understanding and Predicting Image Memorability at a Large Scale

[...]

Aditya Khosla¹, Akhil S. Raju¹, Antonio Torralba¹, Aude Oliva¹•Institutions (1)

Massachusetts Institute of Technology¹

07 Dec 2015

TL;DR: LaMem is built, the largest annotated image memorability dataset to date, using Convolutional Neural Networks, to demonstrate that one can now robustly estimate the memorability of images from many different classes, positioning memorability and deep memorability features as prime candidates to estimate the utility of information for cognitive systems.

...read moreread less

Abstract: Progress in estimating visual memorability has been limited by the small scale and lack of variety of benchmark data. Here, we introduce a novel experimental procedure to objectively measure human memory, allowing us to build LaMem, the largest annotated image memorability dataset to date (containing 60,000 images from diverse sources). Using Convolutional Neural Networks (CNNs), we show that fine-tuned deep features outperform all other features by a large margin, reaching a rank correlation of 0.64, near human consistency (0.68). Analysis of the responses of the high-level CNN layers shows which objects and regions are positively, and negatively, correlated with memorability, allowing us to create memorability maps for each image and provide a concrete method to perform image memorability manipulation. This work demonstrates that one can now robustly estimate the memorability of images from many different classes, positioning memorability and deep memorability features as prime candidates to estimate the utility of information for cognitive systems. Our model and data are available at: http://memorability.csail.mit.edu.

...read moreread less

285 citations

Posted Content•

Anticipating Visual Representations from Unlabeled Video

[...]

Carl Vondrick¹, Hamed Pirsiavash², Antonio Torralba¹•Institutions (2)

Massachusetts Institute of Technology¹, University of Maryland, Baltimore County²

29 Apr 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate human actions and objects is presented. But this task is challenging partly because it requires leveraging extensive knowledge of the world that is difficult to write down.

...read moreread less

Abstract: Anticipating actions and objects before they start or appear is a difficult problem in computer vision with several real-world applications. This task is challenging partly because it requires leveraging extensive knowledge of the world that is difficult to write down. We believe that a promising resource for efficiently learning this knowledge is through readily available unlabeled video. We present a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate human actions and objects. The key idea behind our approach is that we can train deep networks to predict the visual representation of images in the future. Visual representations are a promising prediction target because they encode images at a higher semantic level than pixels yet are automatic to compute. We then apply recognition algorithms on our predicted representation to anticipate objects and actions. We experimentally validate this idea on two datasets, anticipating actions one second in the future and objects five seconds in the future.

...read moreread less

282 citations

Journal Article•DOI•

Intrinsic and extrinsic effects on image memorability

[...]

Zoya Bylinskii¹, Phillip Isola¹, Constance M. Bainbridge¹, Antonio Torralba¹, Aude Oliva¹ - Show less +1 more•Institutions (1)

Massachusetts Institute of Technology¹

01 Nov 2015-Vision Research

TL;DR: This work finds that intrinsic differences in memorability exist at a finer-grained scale than previously documented and proposes an information-theoretic model of image distinctiveness that can automatically predict how changes in context change the memorability of natural images.

...read moreread less

187 citations

Posted Content•

MovieQA: Understanding Stories in Movies through Question-Answering

[...]

Makarand Tapaswi¹, Yukun Zhu², Rainer Stiefelhagen¹, Antonio Torralba³, Raquel Urtasun², Sanja Fidler² - Show less +2 more•Institutions (3)

Karlsruhe Institute of Technology¹, University of Toronto², Massachusetts Institute of Technology³

09 Dec 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: The MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text, is introduced and existing QA techniques are extended to show that question-answering with such open-ended semantics is hard.

...read moreread less

Abstract: We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 14,944 questions about 408 movies with high semantic diversity. The questions range from simpler "Who" did "What" to "Whom", to "Why" and "How" certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information -- video clips, plots, subtitles, scripts, and DVS. We analyze our data through various statistics and methods. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We make this data set public along with an evaluation benchmark to encourage inspiring work in this challenging domain.

...read moreread less

181 citations

Proceedings Article•

Where are they looking

[...]

Adrià Recasens¹, Aditya Khosla¹, Carl Vondrick¹, Antonio Torralba¹•Institutions (1)

Massachusetts Institute of Technology¹

07 Dec 2015

TL;DR: A deep neural network-based approach for gaze-following and a new benchmark dataset, GazeFollow, for thorough evaluation are proposed and it is shown that this approach produces reliable results, even when viewing only the back of the head.

...read moreread less

Abstract: Humans have the remarkable ability to follow the gaze of other people to identify what they are looking at. Following eye gaze, or gaze-following, is an important ability that allows us to understand what other people are thinking, the actions they are performing, and even predict what they might do next. Despite the importance of this topic, this problem has only been studied in limited scenarios within the computer vision community. In this paper, we propose a deep neural network-based approach for gaze-following and a new benchmark dataset, GazeFollow, for thorough evaluation. Given an image and the location of a head, our approach follows the gaze of the person and identifies the object being looked at. Our deep network is able to discover how to extract head pose and gaze orientation, and to select objects in the scene that are in the predicted line of sight and likely to be looked at (such as televisions, balls and food). The quantitative evaluation shows that our approach produces reliable results, even when viewing only the back of the head. While our method outperforms several baseline approaches, we are still far from reaching human performance on this task. Overall, we believe that gaze-following is a challenging and important problem that deserves more attention from the community.

...read moreread less

Posted Content•

Anticipating the future by watching unlabeled video.

[...]

Carl Vondrick, Hamed Pirsiavash, Antonio Torralba

29 Apr 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: A large scale framework that capitalizes on temporal structure in unlabeled video to learn to anticipate both actions and objects in the future, and suggests that learning with unlabeling videos significantly helps forecast actions and anticipate objects.

...read moreread less

Abstract: In many computer vision applications, machines will need to reason beyond the present, and predict the future. This task is challenging because it requires leveraging extensive commonsense knowledge of the world that is difficult to write down. We believe that a promising resource for efficiently obtaining this knowledge is through the massive amounts of readily available unlabeled video. In this paper, we present a large scale framework that capitalizes on temporal structure in unlabeled video to learn to anticipate both actions and objects in the future. The key idea behind our approach is that we can train deep networks to predict the visual representation of images in the future. We experimentally validate this idea on two challenging "in the wild" video datasets, and our results suggest that learning with unlabeled videos significantly helps forecast actions and anticipate objects.

...read moreread less

Posted Content•

Visually Indicated Sounds

[...]

Andrew Owens¹, Phillip Isola², Josh H. McDermott¹, Antonio Torralba¹, Edward H. Adelson¹, William T. Freeman¹ - Show less +2 more•Institutions (2)

Massachusetts Institute of Technology¹, University of California, Berkeley²

28 Dec 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, the authors use a recurrent neural network to predict sound features from videos and then produce a waveform from these features with an example-based synthesis procedure, showing that the sounds predicted by their model are realistic enough to fool participants in a "real or fake" psychophysical experiment, and convey significant information about material properties and physical interactions.

...read moreread less

Abstract: Objects make distinctive sounds when they are hit or scratched. These sounds reveal aspects of an object's material properties, as well as the actions that produced them. In this paper, we propose the task of predicting what sound an object makes when struck as a way of studying physical interactions within a visual scene. We present an algorithm that synthesizes sound from silent videos of people hitting and scratching objects with a drumstick. This algorithm uses a recurrent neural network to predict sound features from videos and then produces a waveform from these features with an example-based synthesis procedure. We show that the sounds predicted by our model are realistic enough to fool participants in a "real or fake" psychophysical experiment, and that they convey significant information about material properties and physical interactions.

...read moreread less

Proceedings Article•

Learning visual biases from human imagination

[...]

Carl Vondrick¹, Hamed Pirsiavash², Aude Oliva¹, Antonio Torralba¹•Institutions (2)

Massachusetts Institute of Technology¹, University of Maryland, Baltimore County²

07 Dec 2015

TL;DR: A novel method is introduced that, inspired by well-known tools in human psychophysics, estimates the biases that the human visual system might use for recognition, but in computer vision feature spaces, and presents an SVM formulation that constrains the orientation of the SVM hyperplane to agree with the bias from human visualSystem.

...read moreread less

Abstract: Although the human visual system can recognize many concepts under challenging conditions, it still has some biases. In this paper, we investigate whether we can extract these biases and transfer them into a machine recognition system. We introduce a novel method that, inspired by well-known tools in human psychophysics, estimates the biases that the human visual system might use for recognition, but in computer vision feature spaces. Our experiments are surprising, and suggest that classifiers from the human visual system can be transferred into a machine with some success. Since these classifiers seem to capture favorable biases in the human visual system, we further present an SVM formulation that constrains the orientation of the SVM hyperplane to agree with the bias from human visual system. Our results suggest that transferring this human bias into machines may help object recognition systems generalize across datasets and perform better when very little training data is available.

...read moreread less

Posted Content•

Visualizing Object Detection Features

[...]

Carl Vondrick¹, Aditya Khosla¹, Hamed Pirsiavash², Tomasz Malisiewicz, Antonio Torralba¹ - Show less +1 more•Institutions (2)

Massachusetts Institute of Technology¹, University of Maryland, Baltimore County²

19 Feb 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: Algorithms to visualize feature spaces used by object detectors are introduced to allow for a more intuitive understanding of recognition systems and suggest that creating a better learning algorithm or building bigger datasets is unlikely to correct these errors without improving the features.

...read moreread less

Abstract: We introduce algorithms to visualize feature spaces used by object detectors. Our method works by inverting a visual feature back to multiple natural images. We found that these visualizations allow us to analyze object detection systems in new ways and gain new insight into the detector's failures. For example, when we visualize the features for high scoring false alarms, we discovered that, although they are clearly wrong in image space, they do look deceptively similar to true positives in feature space. This result suggests that many of these false alarms are caused by our choice of feature space, and supports that creating a better learning algorithm or building bigger datasets is unlikely to correct these errors. By visualizing feature spaces, we can gain a more intuitive understanding of recognition systems.

...read moreread less

Journal Article•DOI•

Mapping human visual representations in space and time by neural networks

[...]

Radoslaw Martin Cichy¹, Aditya Khosla¹, Dimitrios Pantazis¹, Antonio Torralba¹, Aude Oliva¹ - Show less +1 more•Institutions (1)

Massachusetts Institute of Technology¹

01 Sep 2015-Journal of Vision

TL;DR: CNNs are a promising formal model of human visual object recognition Combined with fMRI and MEG, they provide an integrated spatiotemporal and algorithmically explicit view of the first few hundred milliseconds of object recognition.

...read moreread less

Abstract: The neural machinery underlying visual object recognition comprises a hierarchy of cortical regions in the ventral visual stream. The spatiotemporal dynamics of information flow in this hierarchy of regions is largely unknown. Here we tested the hypothesis that there is a correspondence between the spatiotemporal neural processes in the human brain and the layer hierarchy of a deep convolutional neural network (CNN). We presented 118 images of real-world objects to human participants (N=15) while we measured their brain activity with functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG). We trained an 8 layer (5 convolutional layers, 3 fully connected layers) CNN to predict 683 object categories with 900K training images from the ImageNet dataset. We obtained layer-specific CNN responses to the same 118 images. To compare brain-imaging data with the CNN in a common framework, we used representational similarity analysis. The key idea is that if two conditions evoke similar patterns in brain imaging data, they should also evoke similar patterns in the computer model. We thus determined 'where' (fMRI) and 'when' (MEG) the CNNs predicted brain activity. We found a correspondence in hierarchy between cortical regions, processing time, and CNN layers. Low CNN layers predicted MEG activity early and high layers relatively later; low CNN layers predicted fMRI activity in early visual regions, and high layers in late visual regions. Surprisingly, the correspondence between CNN layer hierarchy and cortical regions held for the ventral and dorsal visual stream. Results were dependent on amount of training and type of training material. Our results show that CNNs are a promising formal model of human visual object recognition. Combined with fMRI and MEG, they provide an integrated spatiotemporal and algorithmically explicit view of the first few hundred milliseconds of object recognition. Meeting abstract presented at VSS 2015.

...read moreread less