
Showing papers by "Antonio Torralba" published in 2017


Proceedings ArticleDOI
21 Jul 2017
TL;DR: The ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts, is introduced and it is shown that the trained scene parsing networks can lead to applications such as image content removal and scene synthesis.
Abstract: Scene parsing, or recognizing and segmenting objects and stuff in an image, is one of the key problems in computer vision. Despite the community's efforts in data collection, there are still few image datasets covering a wide range of scenes and object categories with dense and detailed annotations for scene parsing. In this paper, we introduce and analyze the ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. A scene parsing benchmark is built upon the ADE20K with 150 object and stuff classes included. Several segmentation baseline models are evaluated on the benchmark. A novel network design called Cascade Segmentation Module is proposed to parse a scene into stuff, objects, and object parts in a cascade and improve over the baselines. We further show that the trained scene parsing networks can lead to applications such as image content removal and scene synthesis.

2,233 citations


Proceedings ArticleDOI
01 Jul 2017
TL;DR: This work uses the proposed Network Dissection method to test the hypothesis that interpretability is an axis-independent property of the representation space, then applies the method to compare the latent representations of various networks when trained to solve different classification problems.
Abstract: We propose a general framework called Network Dissection for quantifying the interpretability of latent representations of CNNs by evaluating the alignment between individual hidden units and a set of semantic concepts. Given any CNN model, the proposed method draws on a data set of concepts to score the semantics of hidden units at each intermediate convolutional layer. The units with semantics are labeled across a broad range of visual concepts including objects, parts, scenes, textures, materials, and colors. We use the proposed method to test the hypothesis that interpretability is an axis-independent property of the representation space, then we apply the method to compare the latent representations of various networks when trained to solve different classification problems. We further analyze the effect of training iterations, compare networks trained with different initializations, and measure the effect of dropout and batch normalization on the interpretability of deep visual representations. We demonstrate that the proposed method can shed light on characteristics of CNN models and training methods that go beyond measurements of their discriminative power.

1,037 citations
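The core measurement behind Network Dissection is the alignment between a hidden unit and a concept, which the abstract describes as scoring each unit against a set of semantic concepts. Below is a minimal, hypothetical NumPy sketch of that kind of IoU scoring over synthetic activation maps and concept masks; the 0.995 activation quantile, array shapes, and concept names are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def unit_concept_iou(activations, concept_masks, quantile=0.995):
    """Score one unit against one concept: binarize the unit's activation
    maps at a high dataset-wide quantile, then compute IoU with the
    concept's binary segmentation masks, accumulated over all images."""
    threshold = np.quantile(activations, quantile)
    unit_masks = activations >= threshold
    intersection = np.logical_and(unit_masks, concept_masks).sum()
    union = np.logical_or(unit_masks, concept_masks).sum()
    return intersection / union if union > 0 else 0.0

# Synthetic stand-ins: one unit's (upsampled) activation maps over 20 images,
# scored against masks for a few candidate concepts.
rng = np.random.default_rng(0)
activations = rng.random((20, 112, 112))
concepts = {name: rng.random((20, 112, 112)) > 0.9
            for name in ["grass", "car", "striped"]}

scores = {name: unit_concept_iou(activations, masks)
          for name, masks in concepts.items()}
best = max(scores, key=scores.get)
print(f"unit label: {best} (IoU = {scores[best]:.3f})")
```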


Proceedings ArticleDOI
21 Jul 2017
TL;DR: This paper introduces Recipe1M, a new large-scale, structured corpus of over 1m cooking recipes and 800k food images, and demonstrates that regularization via the addition of a high-level classification objective both improves retrieval performance to rival that of humans and enables semantic vector arithmetic.
Abstract: In this paper, we introduce Recipe1M, a new large-scale, structured corpus of over 1m cooking recipes and 800k food images. As the largest publicly available collection of recipe data, Recipe1M affords the ability to train high-capacity models on aligned, multi-modal data. Using these data, we train a neural network to find a joint embedding of recipes and images that yields impressive results on an image-recipe retrieval task. Additionally, we demonstrate that regularization via the addition of a high-level classification objective both improves retrieval performance to rival that of humans and enables semantic vector arithmetic. We postulate that these embeddings will provide a basis for further exploration of the Recipe1M dataset and food and cooking in general. Code, data and models are publicly available.

346 citations
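The image-recipe retrieval task mentioned in the abstract reduces to nearest-neighbor search in the learned joint embedding space. The following sketch illustrates that retrieval step with random vectors standing in for learned recipe and image embeddings; the embedding size, gallery size, and the median-rank/recall metrics are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n = 1024, 1000                       # embedding size and gallery size (assumed)
recipe_emb = rng.normal(size=(n, dim))    # stand-ins for learned recipe embeddings
image_emb = recipe_emb + 0.5 * rng.normal(size=(n, dim))  # paired image embeddings

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

recipes = l2_normalize(recipe_emb)
images = l2_normalize(image_emb)

# Image-to-recipe retrieval: rank all recipes by cosine similarity to each image.
similarity = images @ recipes.T                    # (n_images, n_recipes)
ranking = np.argsort(-similarity, axis=1)
ranks = np.array([np.where(ranking[i] == i)[0][0] + 1 for i in range(n)])

print("median rank:", int(np.median(ranks)))
print("recall@10:", float(np.mean(ranks <= 10)))
```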


Proceedings Article
01 Jan 2017
TL;DR: The NPE's compositional representation of the structure in physical interactions improves its ability to predict movement, generalize across variable object count and different scene configurations, and infer latent properties of objects such as mass.
Abstract: We present the Neural Physics Engine (NPE), a framework for learning simulators of intuitive physics that naturally generalize across variable object count and different scene configurations. We propose a factorization of a physical scene into composable object-based representations and a neural network architecture whose compositional structure factorizes object dynamics into pairwise interactions. Like a symbolic physics engine, the NPE is endowed with generic notions of objects and their interactions; realized as a neural network, it can be trained via stochastic gradient descent to adapt to specific object properties and dynamics of different worlds. We evaluate the efficacy of our approach on simple rigid body dynamics in two-dimensional worlds. By comparing to less structured architectures, we show that the NPE's compositional representation of the structure in physical interactions improves its ability to predict movement, generalize across variable object count and different scene configurations, and infer latent properties of objects such as mass.

234 citations
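The NPE's factorization of dynamics into pairwise interactions can be pictured as a pair encoder whose outputs are summed before a decoder predicts the focus object's next state. Here is an untrained PyTorch sketch of that compositional structure; the state and code dimensions and the two small MLPs are placeholders rather than the paper's architecture.

```python
import torch
import torch.nn as nn

state_dim, code_dim = 4, 32            # e.g. (x, y, vx, vy); sizes are assumptions

pair_encoder = nn.Sequential(          # encodes one (focus, neighbor) pair
    nn.Linear(2 * state_dim, code_dim), nn.ReLU(),
    nn.Linear(code_dim, code_dim))
state_decoder = nn.Sequential(         # maps focus state + summed codes to next state
    nn.Linear(state_dim + code_dim, code_dim), nn.ReLU(),
    nn.Linear(code_dim, state_dim))

def predict_next(states):
    """states: (num_objects, state_dim). Predict each object's next state
    from the sum of its pairwise interaction codes."""
    num_objects = states.shape[0]
    next_states = []
    for i in range(num_objects):
        pairs = [torch.cat([states[i], states[j]])
                 for j in range(num_objects) if j != i]
        codes = pair_encoder(torch.stack(pairs)) if pairs else torch.zeros(1, code_dim)
        summed = codes.sum(dim=0)
        next_states.append(state_decoder(torch.cat([states[i], summed])))
    return torch.stack(next_states)

scene = torch.randn(5, state_dim)      # five objects in a 2-D world
print(predict_next(scene).shape)       # torch.Size([5, 4])
```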


Proceedings ArticleDOI
01 Jul 2017
TL;DR: This work presents a model that generates the future by transforming pixels in the past, and explicitly disentangles the model's memory from the prediction, which helps the model learn desirable invariances.
Abstract: We learn models to generate the immediate future in video. This problem has two main challenges. Firstly, since the future is uncertain, models should be multi-modal, which can be difficult to learn. Secondly, since the future is similar to the past, models store low-level details, which complicates learning of high-level semantics. We propose a framework to tackle both of these challenges. We present a model that generates the future by transforming pixels in the past. Our approach explicitly disentangles the model's memory from the prediction, which helps the model learn desirable invariances. Experiments suggest that this model can generate short videos of plausible futures. We believe predictive models have many applications in robotics, health-care, and video understanding.

194 citations
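One concrete reading of "generating the future by transforming pixels in the past" is to predict a dense flow field and warp the most recent frame with it. The sketch below shows such a warp with grid_sample, using a random flow in place of a learned prediction; it is only meant to illustrate the transformation idea, not the paper's model.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """frame: (1, C, H, W); flow: (1, 2, H, W) in pixels.
    Resample the frame at locations shifted by the predicted flow."""
    _, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = (xs + flow[:, 0]) / (w - 1) * 2 - 1       # normalize to [-1, 1]
    grid_y = (ys + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)        # (1, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

last_frame = torch.rand(1, 3, 64, 64)                   # most recent observed frame
predicted_flow = torch.randn(1, 2, 64, 64)              # stand-in for a learned flow
next_frame = warp(last_frame, predicted_flow)
print(next_frame.shape)                                 # torch.Size([1, 3, 64, 64])
```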


Posted Content
TL;DR: In this article, a general framework called Network Dissection is proposed for quantifying the interpretability of latent representations of CNNs by evaluating the alignment between individual hidden units and a set of semantic concepts.
Abstract: We propose a general framework called Network Dissection for quantifying the interpretability of latent representations of CNNs by evaluating the alignment between individual hidden units and a set of semantic concepts. Given any CNN model, the proposed method draws on a broad data set of visual concepts to score the semantics of hidden units at each intermediate convolutional layer. The units with semantics are given labels across a range of objects, parts, scenes, textures, materials, and colors. We use the proposed method to test the hypothesis that interpretability of units is equivalent to random linear combinations of units, then we apply our method to compare the latent representations of various networks when trained to solve different supervised and self-supervised training tasks. We further analyze the effect of training iterations, compare networks trained with different initializations, examine the impact of network depth and width, and measure the effect of dropout and batch normalization on the interpretability of deep visual representations. We demonstrate that the proposed method can shed light on characteristics of CNN models and training methods that go beyond measurements of their discriminative power.

175 citations


Proceedings ArticleDOI
01 Oct 2017
TL;DR: It is shown that walls, and other obstructions with edges, can be exploited as naturally-occurring “cameras” that reveal the hidden scenes beyond them and that adjacent wall edges yield a stereo camera from which the 2-D location of hidden, moving objects can be recovered.
Abstract: We show that walls, and other obstructions with edges, can be exploited as naturally-occurring “cameras” that reveal the hidden scenes beyond them. In particular, we demonstrate methods for using the subtle spatio-temporal radiance variations that arise on the ground at the base of a wall's edge to construct a one-dimensional video of the hidden scene behind the wall. The resulting technique can be used for a variety of applications in diverse physical settings. From standard RGB video recordings, we use edge cameras to recover 1-D videos that reveal the number and trajectories of people moving in an occluded scene. We further show that adjacent wall edges, such as those that arise in the case of an open doorway, yield a stereo camera from which the 2-D location of hidden, moving objects can be recovered. We demonstrate our technique in a number of indoor and outdoor environments involving varied floor surfaces and illumination conditions.

127 citations
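The one-dimensional video described above arises because each angular wedge of floor around the wall's edge integrates light from a different slice of the hidden scene. A simplified NumPy sketch of that idea: bin floor pixels by angle about an assumed corner location and track each wedge's mean intensity over time after removing the static background. The corner position, bin count, and frame sizes are placeholders.

```python
import numpy as np

def edge_camera_1d(video, corner, n_bins=60):
    """video: (T, H, W) grayscale frames of the floor near a wall edge.
    corner: (row, col) of the wall's edge in the image.
    Returns a (T, n_bins) '1-D video': mean intensity per angular wedge,
    with the temporal mean removed to keep only the subtle variations."""
    t, h, w = video.shape
    rows, cols = np.mgrid[0:h, 0:w]
    angles = np.arctan2(rows - corner[0], cols - corner[1])   # angle of each pixel
    bins = np.clip(((angles + np.pi) / (2 * np.pi) * n_bins).astype(int), 0, n_bins - 1)

    one_d = np.zeros((t, n_bins))
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            one_d[:, b] = video[:, mask].mean(axis=1)
    return one_d - one_d.mean(axis=0, keepdims=True)          # drop static background

frames = np.random.rand(100, 120, 160)       # stand-in for a grayscale floor video
signal = edge_camera_1d(frames, corner=(0, 80))
print(signal.shape)                           # (100, 60)
```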


Posted Content
TL;DR: On HACS Segments, the state-of-the-art methods of action proposal generation and action localization are evaluated, and the new challenges posed by the dense temporal annotations are highlighted.
Abstract: This paper presents a new large-scale dataset for recognition and temporal localization of human actions collected from Web videos. We refer to it as HACS (Human Action Clips and Segments). We leverage both consensus and disagreement among visual classifiers to automatically mine candidate short clips from unlabeled videos, which are subsequently validated by human annotators. The resulting dataset is dubbed HACS Clips. Through a separate process we also collect annotations defining action segment boundaries. This resulting dataset is called HACS Segments. Overall, HACS Clips consists of 1.5M annotated clips sampled from 504K untrimmed videos, and HACS Segments contains 139K action segments densely annotated in 50K untrimmed videos spanning 200 action categories. HACS Clips contains more labeled examples than any existing video benchmark. This renders our dataset both a large-scale action recognition benchmark and an excellent source for spatiotemporal feature learning. In our transfer learning experiments on three target datasets, HACS Clips outperforms Kinetics-600, Moments-In-Time and Sports1M as a pretraining source. On HACS Segments, we evaluate state-of-the-art methods of action proposal generation and action localization, and highlight the new challenges posed by our dense temporal annotations.

122 citations
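The clip-mining step relies on consensus and disagreement among classifiers applied to unlabeled videos. A toy sketch of that selection rule follows, with random numbers standing in for two classifiers' per-clip confidences; the thresholds and the action class are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_clips = 10_000
scores_a = rng.random(n_clips)   # classifier A confidence for "climbing" per clip (fake)
scores_b = rng.random(n_clips)   # classifier B confidence for the same clips (fake)

# Consensus: both classifiers are confident, so these are likely positive candidates.
consensus = np.flatnonzero((scores_a > 0.9) & (scores_b > 0.9))
# Disagreement: the classifiers conflict, so these are informative "hard" candidates
# worth sending to human annotators.
disagreement = np.flatnonzero(np.abs(scores_a - scores_b) > 0.8)

print(f"{consensus.size} consensus clips, {disagreement.size} hard clips for annotation")
```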


Posted Content
TL;DR: This work utilizes large amounts of readily-available, synchronous data to learn deep discriminative representations shared across three major natural modalities: vision, sound and language, and jointly trains a deep convolutional network for aligned representation learning.
Abstract: We capitalize on large amounts of readily-available, synchronous data to learn deep discriminative representations shared across three major natural modalities: vision, sound and language. By leveraging over a year of sound from video and millions of sentences paired with images, we jointly train a deep convolutional network for aligned representation learning. Our experiments suggest that this representation is useful for several tasks, such as cross-modal retrieval or transferring classifiers between modalities. Moreover, although our network is only trained with image+text and image+sound pairs, it can transfer between text and sound as well, a transfer the network never observed during training. Visualizations of our representation reveal many hidden units which automatically emerge to detect concepts, independent of the modality.

120 citations


Proceedings ArticleDOI
TL;DR: SegICP couples convolutional neural networks and multi-hypothesis point cloud registration to achieve both robust pixel-wise semantic segmentation as well as accurate and real-time 6-DOF pose estimation for relevant objects.
Abstract: Recent robotic manipulation competitions have highlighted that sophisticated robots still struggle to achieve fast and reliable perception of task-relevant objects in complex, realistic scenarios. To improve these systems' perceptive speed and robustness, we present SegICP, a novel integrated solution to object recognition and pose estimation. SegICP couples convolutional neural networks and multi-hypothesis point cloud registration to achieve both robust pixel-wise semantic segmentation as well as accurate and real-time 6-DOF pose estimation for relevant objects. Our architecture achieves 1 cm position error and < 5° angle error in real time without an initial seed. We evaluate and benchmark SegICP against an annotated dataset generated by motion capture.

110 citations
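SegICP's registration stage aligns a segmented object point cloud to a stored model cloud. The sketch below implements a bare-bones point-to-point ICP loop (nearest neighbors plus an SVD rigid fit) to give a feel for that component; it omits the CNN segmentation and the multi-hypothesis scoring that the paper describes.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    u, _, vt = np.linalg.svd(src_c.T @ dst_c)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = dst.mean(0) - r @ src.mean(0)
    return r, t

def icp(source, model, iterations=30):
    """Align a segmented object cloud to its model cloud."""
    tree = cKDTree(model)
    current = source.copy()
    r_total, t_total = np.eye(3), np.zeros(3)
    for _ in range(iterations):
        _, idx = tree.query(current)               # closest model point per source point
        r, t = best_rigid_transform(current, model[idx])
        current = current @ r.T + t
        r_total, t_total = r @ r_total, r @ t_total + t
    return r_total, t_total

model_cloud = np.random.rand(500, 3)                          # model point cloud
object_cloud = model_cloud + np.array([0.05, -0.02, 0.01])    # shifted "observation"
r_est, t_est = icp(object_cloud, model_cloud)
print("estimated translation:", np.round(t_est, 3))           # roughly -[0.05, -0.02, 0.01]
```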


Journal ArticleDOI
TL;DR: The Places Database is described, a repository of 10 million scene photographs, labeled with scene semantic categories and attributes, comprising a quasi-exhaustive list of the types of environments encountered in the world.
Abstract: The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification at tasks such as object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories and attributes, comprising a quasi-exhaustive list of the types of environments encountered in the world. Using state-of-the-art convolutional neural networks, we provide impressive baseline performances at scene classification. With its high coverage and high diversity of exemplars, the Places Database offers an ecosystem to guide future progress on currently intractable visual recognition problems.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: An approach for following gaze in video by predicting where a person (in the video) is looking even when the object is in a different frame, using VideoGaze, a new dataset which is used as a benchmark to both train and evaluate models.
Abstract: Following the gaze of people inside videos is an important signal for understanding people and their actions. In this paper, we present an approach for following gaze in video by predicting where a person (in the video) is looking even when the object is in a different frame. We collect VideoGaze, a new dataset which we use as a benchmark to both train and evaluate models. Given one frame with a person in it, our model estimates a density for gaze location in every frame and the probability that the person is looking in that particular frame. A key aspect of our approach is an end-to-end model that jointly estimates: saliency, gaze pose, and geometric relationships between views while only using gaze as supervision. Visualizations suggest that the model learns to internally solve these intermediate tasks automatically without additional supervision. Experiments show that our approach follows gaze in video better than existing approaches, enabling a richer understanding of human activities in video.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: In this article, SegICP couples convolutional neural networks and multi-hypothesis point cloud registration to achieve both robust pixel-wise semantic segmentation as well as accurate and real-time 6-DOF pose estimation for relevant objects.
Abstract: Recent robotic manipulation competitions have highlighted that sophisticated robots still struggle to achieve fast and reliable perception of task-relevant objects in complex, realistic scenarios. To improve these systems' perceptive speed and robustness, we present SegICP, a novel integrated solution to object recognition and pose estimation. SegICP couples convolutional neural networks and multi-hypothesis point cloud registration to achieve both robust pixel-wise semantic segmentation as well as accurate and real-time 6-DOF pose estimation for relevant objects. Our architecture achieves 1 cm position error and < 5° angle error in real time without an initial seed. We evaluate and benchmark SegICP against an annotated dataset generated by motion capture.

Proceedings ArticleDOI
03 Apr 2017
TL;DR: This work uses data for 1.9 million images from Instagram from the US to look at systematic differences in how a machine would objectively label an image compared to how a human subjectively does, and shows that this difference, which it calls the "perception gap", relates to a number of health outcomes observed at the county level.
Abstract: Food is an integral part of our life and what and how much we eat crucially affects our health. Our food choices largely depend on how we perceive certain characteristics of food, such as whether it is healthy, delicious or if it qualifies as a salad. But these perceptions differ from person to person and one person's "single lettuce leaf" might be another person's "side salad". Studying how food is perceived in relation to what it actually is typically involves a laboratory setup. Here we propose to use recent advances in image recognition to tackle this problem. Concretely, we use data for 1.9 million images from Instagram from the US to look at systematic differences in how a machine would objectively label an image compared to how a human subjectively does. We show that this difference, which we call the "perception gap", relates to a number of health outcomes observed at the county level. To the best of our knowledge, this is the first time that image recognition is being used to study the "misalignment" of how people describe food images vs. what they actually depict.
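At its core, the analysis aggregates, per county, how often machine labels and user tags disagree and relates that aggregate to a health statistic. The toy NumPy sketch below mimics that pipeline with entirely fabricated data; the salad example, thresholds, and the simulated outcome are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(3)
n_images, n_counties = 50_000, 300
county = rng.integers(0, n_counties, size=n_images)      # county of each image (fake)
machine_says_salad = rng.random(n_images) < 0.10         # machine label (fake)
user_tagged_salad = machine_says_salad ^ (rng.random(n_images) < 0.05)  # human tag (fake)

# Per-county perception gap: how often the human tag and the machine label disagree.
gap = np.array([
    np.mean(user_tagged_salad[county == c] != machine_says_salad[county == c])
    for c in range(n_counties)])

# Fabricated county-level health outcome, just to show the correlation step.
outcome = 0.3 + 0.2 * gap + 0.02 * rng.standard_normal(n_counties)
correlation = np.corrcoef(gap, outcome)[0, 1]
print(f"county-level correlation between perception gap and outcome: {correlation:.2f}")
```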

Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this article, a joint image pixel and word concept embeddings framework is proposed, where word concepts are connected by semantic relations and the trained joint embedding space is further explored to show its interpretability.
Abstract: Recognizing arbitrary objects in the wild has been a challenging problem due to the limitations of existing classification models and datasets. In this paper, we propose a new task that aims at parsing scenes with a large and open vocabulary, and several evaluation metrics are explored for this problem. Our approach is a joint image pixel and word concept embeddings framework, where word concepts are connected by semantic relations. We validate the open vocabulary prediction ability of our framework on ADE20K dataset which covers a wide variety of scenes and objects. We further explore the trained joint embedding space to show its interpretability.
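A joint pixel-word embedding lets every pixel be scored against arbitrary word vectors, which is what enables open-vocabulary prediction. The following sketch labels each pixel with its nearest word concept using random stand-ins for both embeddings; the vocabulary, feature dimensions, and cosine scoring are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
dim, h, w = 64, 32, 32
vocabulary = ["table", "desk", "chair", "lamp", "floor"]     # toy open vocabulary
word_vectors = rng.normal(size=(len(vocabulary), dim))       # stand-in word embeddings
pixel_features = rng.normal(size=(h, w, dim))                # stand-in pixel embeddings

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Score every pixel against every word concept and take the best match.
scores = normalize(pixel_features) @ normalize(word_vectors).T   # (h, w, |V|)
labels = scores.argmax(axis=-1)
print("predicted concept at pixel (0, 0):", vocabulary[labels[0, 0]])
```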

Posted Content
TL;DR: This paper shows how computer vision can be used to infer a person's BMI from social media images and hopes that this tool helps to advance the study of social aspects related to body weight.
Abstract: A person's weight status can have profound implications on their life, ranging from mental health, to longevity, to financial income. At the societal level, "fat shaming" and other forms of "sizeism" are a growing concern, while increasing obesity rates are linked to ever-rising healthcare costs. For these reasons, researchers from a variety of backgrounds are interested in studying obesity from all angles. To obtain data, traditionally, a person would have to accurately self-report their body-mass index (BMI) or would have to see a doctor to have it measured. In this paper, we show how computer vision can be used to infer a person's BMI from social media images. We hope that our tool, which we release, helps to advance the study of social aspects related to body weight.

Posted Content
TL;DR: The proposed procedure dramatically reduces the amount of human labeling by automatically identifying hard clips, i.e., clips that contain coherent actions but lead to prediction disagreement between action classifiers, thus generating labels for highly informative samples at little cost.
Abstract: This paper describes a procedure for the creation of large-scale video datasets for action classification and localization from unconstrained, realistic web data. The scalability of the proposed procedure is demonstrated by building a novel video benchmark, named SLAC (Sparsely Labeled ACtions), consisting of over 520K untrimmed videos and 1.75M clip annotations spanning 200 action categories. Using our proposed framework, annotating a clip takes merely 8.8 seconds on average. This represents a saving in labeling time of over 95% compared to the traditional procedure of manual trimming and localization of actions. Our approach dramatically reduces the amount of human labeling by automatically identifying hard clips, i.e., clips that contain coherent actions but lead to prediction disagreement between action classifiers. A human annotator can disambiguate whether such a clip truly contains the hypothesized action in a handful of seconds, thus generating labels for highly informative samples at little cost. We show that our large-scale dataset can be used to effectively pre-train action recognition models, significantly improving final metrics on smaller-scale benchmarks after fine-tuning. On Kinetics, UCF-101 and HMDB-51, models pre-trained on SLAC outperform baselines trained from scratch, by 2.0%, 20.1% and 35.4% in top-1 accuracy, respectively when RGB input is used. Furthermore, we introduce a simple procedure that leverages the sparse labels in SLAC to pre-train action localization models. On THUMOS14 and ActivityNet-v1.3, our localization model improves the mAP of baseline model by 8.6% and 2.5%, respectively.

Posted Content
TL;DR: A joint image pixel and word concept embeddings framework, where word concepts are connected by semantic relations, and validated the open vocabulary prediction ability of this framework on ADE20K dataset which covers a wide variety of scenes and objects.
Abstract: Recognizing arbitrary objects in the wild has been a challenging problem due to the limitations of existing classification models and datasets. In this paper, we propose a new task that aims at parsing scenes with a large and open vocabulary, and several evaluation metrics are explored for this problem. Our proposed approach to this problem is a joint image pixel and word concept embeddings framework, where word concepts are connected by semantic relations. We validate the open vocabulary prediction ability of our framework on ADE20K dataset which covers a wide variety of scenes and objects. We further explore the trained joint embedding space to show its interpretability.

Posted Content
TL;DR: This paper used image recognition to study the "misalignment" of how people describe food images vs what they actually depict, and showed that this difference relates to a number of health outcomes observed at the county level.
Abstract: Food is an integral part of our life and what and how much we eat crucially affects our health. Our food choices largely depend on how we perceive certain characteristics of food, such as whether it is healthy, delicious or if it qualifies as a salad. But these perceptions differ from person to person and one person's "single lettuce leaf" might be another person's "side salad". Studying how food is perceived in relation to what it actually is typically involves a laboratory setup. Here we propose to use recent advances in image recognition to tackle this problem. Concretely, we use data for 1.9 million images from Instagram from the US to look at systematic differences in how a machine would objectively label an image compared to how a human subjectively does. We show that this difference, which we call the "perception gap", relates to a number of health outcomes observed at the county level. To the best of our knowledge, this is the first time that image recognition is being used to study the "misalignment" of how people describe food images vs. what they actually depict.

Posted Content
TL;DR: Temporal Relation Network (TRN) as mentioned in this paper is designed to learn and reason about temporal dependencies between video frames at multiple time scales, which can learn intuitive and interpretable visual common sense knowledge in videos.
Abstract: Temporal relational reasoning, the ability to link meaningful transformations of objects or entities over time, is a fundamental property of intelligent species. In this paper, we introduce an effective and interpretable network module, the Temporal Relation Network (TRN), designed to learn and reason about temporal dependencies between video frames at multiple time scales. We evaluate TRN-equipped networks on activity recognition tasks using three recent video datasets - Something-Something, Jester, and Charades - which fundamentally depend on temporal relational reasoning. Our results demonstrate that the proposed TRN gives convolutional neural networks a remarkable capacity to discover temporal relations in videos. Through only sparsely sampled video frames, TRN-equipped networks can accurately predict human-object interactions in the Something-Something dataset and identify various human gestures on the Jester dataset with very competitive performance. TRN-equipped networks also outperform two-stream networks and 3D convolution networks in recognizing daily activities in the Charades dataset. Further analyses show that the models learn intuitive and interpretable visual common sense knowledge in videos.
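The TRN module scores relations among small ordered subsets of sparsely sampled frame features at several time scales and sums the results. Below is a compact PyTorch sketch of 2-frame and 3-frame relation heads over precomputed frame features; the feature sizes, number of subsets, and class count are placeholders, and the real module samples subsets rather than taking the first few.

```python
import torch
import torch.nn as nn
from itertools import combinations

feat_dim, hidden, n_classes, n_frames = 256, 128, 10, 8   # assumed sizes

def relation_head(scale):
    return nn.Sequential(nn.Linear(scale * feat_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, n_classes))

heads = {scale: relation_head(scale) for scale in (2, 3)}

def temporal_relations(frame_features, max_subsets=3):
    """frame_features: (n_frames, feat_dim), in temporal order.
    Sum relation scores over a few ordered frame subsets at each scale."""
    logits = torch.zeros(n_classes)
    for scale, head in heads.items():
        subsets = list(combinations(range(frame_features.shape[0]), scale))[:max_subsets]
        for idx in subsets:
            logits = logits + head(frame_features[list(idx)].flatten())
    return logits

features = torch.randn(n_frames, feat_dim)    # e.g. CNN features of 8 sampled frames
print(temporal_relations(features).shape)      # torch.Size([10])
```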

Posted Content
TL;DR: In this article, a convolutional neural network is trained to predict a summary of the sound associated with a video frame, and the network learns a representation that conveys information about objects and scenes.
Abstract: The sound of crashing waves, the roar of fast-moving cars -- sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds. This paper extends an earlier conference paper, Owens et al. 2016, with additional experiments and discussion.

Posted Content
TL;DR: In this paper, a model that exploits Graph Neural Networks to propagate contextual information from the scene in order to perform detailed affordance reasoning about each object is proposed.
Abstract: We address the problem of affordance reasoning in diverse scenes that appear in the real world. Affordances relate the agent's actions to their effects when taken on the surrounding objects. In our work, we take the egocentric view of the scene, and aim to reason about action-object affordances that respect both the physical world as well as the social norms imposed by the society. We also aim to teach artificial agents why some actions should not be taken in certain situations, and what would likely happen if these actions would be taken. We collect a new dataset that builds upon ADE20k, referred to as ADE-Affordance, which contains annotations enabling such rich visual reasoning. We propose a model that exploits Graph Neural Networks to propagate contextual information from the scene in order to perform detailed affordance reasoning about each object. Our model is showcased through various ablation studies, pointing to successes and challenges in this complex task.
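Propagating contextual information with a graph network amounts to repeatedly mixing each object node's features with messages from its neighbors before predicting affordances. The stripped-down message-passing sketch below runs a few rounds of that on a toy scene graph; the adjacency, feature sizes, and GRU update are illustrative choices, not the paper's exact model.

```python
import torch
import torch.nn as nn

n_objects, feat_dim, n_affordances = 4, 32, 5      # toy sizes
node_features = torch.randn(n_objects, feat_dim)    # per-object visual features (stand-in)
adjacency = torch.tensor([[0, 1, 1, 0],             # which objects share scene context
                          [1, 0, 0, 1],
                          [1, 0, 0, 1],
                          [0, 1, 1, 0]], dtype=torch.float32)

message_fn = nn.Linear(feat_dim, feat_dim)
update_fn = nn.GRUCell(feat_dim, feat_dim)
readout = nn.Linear(feat_dim, n_affordances)

h = node_features
for _ in range(3):                                   # a few rounds of propagation
    messages = adjacency @ message_fn(h)             # aggregate neighbor messages
    h = update_fn(messages, h)                       # update each node's state
affordance_logits = readout(h)                       # per-object affordance scores
print(affordance_logits.shape)                       # torch.Size([4, 5])
```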

Proceedings Article
01 Jan 2017
TL;DR: In this article, computer vision can be used to infer a person's body mass index (BMI) from social media images, which can have profound implications on their life, ranging from mental health, to longevity, to financial income.
Abstract: A person's weight status can have profound implications on their life, ranging from mental health, to longevity, to financial income. At the societal level, "fat shaming" and other forms of "sizeism" are a growing concern, while increasing obesity rates are linked to ever-rising healthcare costs. For these reasons, researchers from a variety of backgrounds are interested in studying obesity from all angles. To obtain data, traditionally, a person would have to accurately self-report their body-mass index (BMI) or would have to see a doctor to have it measured. In this paper, we show how computer vision can be used to infer a person's BMI from social media images. We hope that our tool, which we release, helps to advance the study of social aspects related to body weight.

Posted Content
TL;DR: Network Dissection is described, a method that interprets networks by providing meaningful labels to their individual units that reveals that deep representations are more transparent and interpretable than they would be under a random equivalently powerful basis.
Abstract: The success of recent deep convolutional neural networks (CNNs) depends on learning hidden representations that can summarize the important factors of variation behind the data. However, CNNs are often criticized as being black boxes that lack interpretability, since they have millions of unexplained model parameters. In this work, we describe Network Dissection, a method that interprets networks by providing labels for the units of their deep visual representations. The proposed method quantifies the interpretability of CNN representations by evaluating the alignment between individual hidden units and a set of visual semantic concepts. By identifying the best alignments, units are given human interpretable labels across a range of objects, parts, scenes, textures, materials, and colors. The method reveals that deep representations are more transparent and interpretable than expected: we find that representations are significantly more interpretable than they would be under a random equivalently powerful basis. We apply the method to interpret and compare the latent representations of various network architectures trained to solve different supervised and self-supervised training tasks. We then examine factors affecting the network interpretability such as the number of training iterations, regularizations, different initializations, and the network depth and width. Finally we show that the interpreted units can be used to provide explicit explanations of a prediction given by a CNN for an image. Our results highlight that interpretability is an important property of deep neural networks that provides new insights into their hierarchical structure.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this article, several CNN architectures are benchmarked on a set of metrics for object segmentation and pose estimation, which show that metric performance is dependent on the complexity of network architectures.
Abstract: Convolutional neural networks (CNNs), particularly those designed for object segmentation and pose estimation, are now applied to robotics applications involving mobile manipulation. For these robotic applications to be successful, robust and accurate performance from the CNNs is critical. Therefore, in order to develop an understanding of CNN performance, several CNN architectures are benchmarked on a set of metrics for object segmentation and pose estimation. This paper presents these benchmarking results, which show that metric performance is dependent on the complexity of network architectures. These findings can be used to guide and improve the development of CNNs for object segmentation and pose estimation in the future.
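The two metric families involved, segmentation overlap and 6-DOF pose error, are easy to state concretely. The short sketch below computes mean per-class IoU for label maps and translation/rotation error between poses; these are common conventions and may differ in detail from the metrics used in the paper.

```python
import numpy as np

def mean_iou(pred, gt, n_classes):
    """Per-class intersection-over-union for integer label maps, averaged."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union:
            ious.append(inter / union)
    return float(np.mean(ious))

def pose_error(r_est, t_est, r_gt, t_gt):
    """Translation error (same units as t) and rotation error in degrees."""
    t_err = np.linalg.norm(t_est - t_gt)
    cos_angle = (np.trace(r_est @ r_gt.T) - 1) / 2
    r_err = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return t_err, r_err

pred = np.random.randint(0, 3, size=(64, 64))
gt = np.random.randint(0, 3, size=(64, 64))
print("mIoU:", round(mean_iou(pred, gt, 3), 3))
print("pose error:", pose_error(np.eye(3), np.array([0.01, 0, 0]),
                                np.eye(3), np.zeros(3)))
```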

Posted Content
TL;DR: It is demonstrated that the presence of occluders in the hidden scene can obviate the need for collecting time-resolved measurements, and an accompanying analysis for such systems and their generalizations is developed.
Abstract: Active non-line-of-sight imaging systems are of growing interest for diverse applications. The most commonly proposed approaches to date rely on exploiting time-resolved measurements, i.e., measuring the time it takes for short light pulses to transit the scene. This typically requires expensive, specialized, ultrafast lasers and detectors that must be carefully calibrated. We develop an alternative approach that exploits the valuable role that natural occluders in a scene play in enabling accurate and practical image formation in such settings without such hardware complexity. In particular, we demonstrate that the presence of occluders in the hidden scene can obviate the need for collecting time-resolved measurements, and develop an accompanying analysis for such systems and their generalizations. Ultimately, the results suggest the potential to develop increasingly sophisticated future systems that are able to identify and exploit diverse structural features of the environment to reconstruct scenes hidden from view.
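With a known occluder, each observed patch sees a different weighted combination of hidden-scene pixels, so reconstruction can be posed as a regularized linear inverse problem. The toy sketch below uses a random binary visibility matrix as a stand-in for the occluder-induced forward model and recovers the hidden scene with Tikhonov-regularized least squares; this is a gross simplification of the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(5)
n_hidden, n_observed = 64, 256               # hidden-scene pixels, observed floor patches

# Occluder-induced forward model: entry (i, j) is 1 if hidden pixel j is
# visible from observed patch i (a random stand-in for real geometry).
visibility = (rng.random((n_observed, n_hidden)) > 0.5).astype(float)

hidden_scene = rng.random(n_hidden)                        # unknown scene (for simulation)
observed = visibility @ hidden_scene + 0.01 * rng.standard_normal(n_observed)

# Tikhonov-regularized least squares: argmin ||A x - b||^2 + lam ||x||^2
lam = 1e-2
a_t_a = visibility.T @ visibility + lam * np.eye(n_hidden)
reconstruction = np.linalg.solve(a_t_a, visibility.T @ observed)

relative_error = np.linalg.norm(reconstruction - hidden_scene) / np.linalg.norm(hidden_scene)
print("relative reconstruction error:", round(float(relative_error), 3))
```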

Journal ArticleDOI
TL;DR: The papers in this special section were presented at the CVPR 2015 conference that was held in Boston, MA.
Abstract: The papers in this special section were presented at the CVPR 2015 conference that was held in Boston, MA.