
Showing papers by "Antonio Torralba published in 2019"


Journal Article • DOI
TL;DR: The ADE20K dataset as discussed by the authors contains 25k images of complex everyday scenes containing a variety of objects in their natural spatial context; on average there are 19.5 instances and 10.5 object classes per image.
Abstract: Semantic understanding of visual scenes is one of the holy grails of computer vision. Despite efforts of the community in data collection, there are still few image datasets covering a wide range of scenes and object categories with pixel-wise annotations for scene understanding. In this work, we present a densely annotated dataset ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. In total, there are 25k images of complex everyday scenes containing a variety of objects in their natural spatial context. On average there are 19.5 instances and 10.5 object classes per image. Based on ADE20K, we construct benchmarks for scene parsing and instance segmentation. We provide baseline performances on both benchmarks and re-implement state-of-the-art models as open source. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for the semantic segmentation performance. We show that the networks trained on ADE20K are able to segment a wide variety of scenes and objects.

961 citations
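The abstract's finding that synchronized batch normalization with a reasonably large effective batch size matters for segmentation can be made concrete with a minimal PyTorch sketch (not the authors' released code; the segmentation model below is a stand-in):

```python
# Minimal sketch (not the authors' released code): enabling synchronized
# BatchNorm for multi-GPU semantic segmentation training in PyTorch, so that
# batch statistics are computed over the full effective batch rather than
# per-GPU slices. The segmentation model here is a placeholder.
import torch.nn as nn
import torchvision

def build_sync_bn_segmenter(num_classes: int = 150) -> nn.Module:
    # Any segmentation network would do; DeepLabV3-ResNet50 is used as a stand-in.
    model = torchvision.models.segmentation.deeplabv3_resnet50(num_classes=num_classes)
    # Convert every nn.BatchNorm2d layer into a SyncBatchNorm layer.
    return nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Typical usage inside a torch.distributed training script (the process group
# must already be initialized with init_process_group):
#   model = build_sync_bn_segmenter().cuda()
#   model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```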


Journal Article • DOI
01 May 2019 - Nature
TL;DR: Tactile patterns obtained from a scalable sensor-embedded glove and deep convolutional neural networks help to explain how the human hand can identify and grasp individual objects and estimate their weights.
Abstract: Humans can feel, weigh and grasp diverse objects, and simultaneously infer their material properties while applying the right amount of force—a challenging set of tasks for a modern robot1. Mechanoreceptor networks that provide sensory feedback and enable the dexterity of the human grasp2 remain difficult to replicate in robots. Whereas computer-vision-based robot grasping strategies3–5 have progressed substantially with the abundance of visual data and emerging machine-learning tools, there are as yet no equivalent sensing platforms and large-scale datasets with which to probe the use of the tactile information that humans rely on when grasping objects. Studying the mechanics of how humans grasp objects will complement vision-based robotic object handling. Importantly, the inability to record and analyse tactile signals currently limits our understanding of the role of tactile information in the human grasp itself—for example, how tactile maps are used to identify objects and infer their properties is unknown6. Here we use a scalable tactile glove and deep convolutional neural networks to show that sensors uniformly distributed over the hand can be used to identify individual objects, estimate their weight and explore the typical tactile patterns that emerge while grasping objects. The sensor array (548 sensors) is assembled on a knitted glove, and consists of a piezoresistive film connected by a network of conductive thread electrodes that are passively probed. Using a low-cost (about US$10) scalable tactile glove sensor array, we record a large-scale tactile dataset with 135,000 frames, each covering the full hand, while interacting with 26 different objects. This set of interactions with different objects reveals the key correspondences between different regions of a human hand while it is manipulating objects. Insights from the tactile signatures of the human grasp—through the lens of an artificial analogue of the natural mechanoreceptor network—can thus aid the future design of prosthetics7, robot grasping tools and human–robot interactions1,8–10. Tactile patterns obtained from a scalable sensor-embedded glove and deep convolutional neural networks help to explain how the human hand can identify and grasp individual objects and estimate their weights.

623 citations
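As a rough illustration of the classification setup described above, the sketch below maps a single full-hand tactile pressure frame to one of the 26 object classes; the 32x32 input layout and the network architecture are assumptions for illustration, not the paper's model:

```python
# Illustrative sketch only: a small CNN that maps one tactile pressure frame
# to one of 26 object classes. The 32x32 input layout and the architecture
# are assumptions for illustration, not the network used in the paper.
import torch
import torch.nn as nn

class TactileNet(nn.Module):
    def __init__(self, num_objects: int = 26):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                 # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                 # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(64 * 8 * 8, num_objects)

    def forward(self, x):                    # x: (batch, 1, 32, 32) pressure map
        h = self.features(x)
        return self.classifier(h.flatten(1)) # unnormalized class scores

logits = TactileNet()(torch.randn(4, 1, 32, 32))  # -> shape (4, 26)
```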


Journal Article • DOI
TL;DR: This paper provides an analysis of 8 different evaluation metrics and their properties, and makes recommendations for metric selections under specific assumptions and for specific applications.
Abstract: How best to evaluate a saliency model's ability to predict where humans look in images is an open research question. The choice of evaluation metric depends on how saliency is defined and how the ground truth is represented. Metrics differ in how they rank saliency models, and this results from how false positives and false negatives are treated, whether viewing biases are accounted for, whether spatial deviations are factored in, and how the saliency maps are pre-processed. In this paper, we provide an analysis of 8 different evaluation metrics and their properties. With the help of systematic experiments and visualizations of metric computations, we add interpretability to saliency scores and more transparency to the evaluation of saliency models. Building off the differences in metric properties and behaviors, we make recommendations for metric selections under specific assumptions and for specific applications.

526 citations
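One of the commonly analyzed metrics in such comparisons is Normalized Scanpath Saliency (NSS); a minimal NumPy sketch of its standard definition:

```python
# Minimal sketch of one standard saliency metric, Normalized Scanpath
# Saliency (NSS): z-score the predicted saliency map, then average its
# values at the human fixation locations. Higher is better.
import numpy as np

def nss(saliency_map: np.ndarray, fixation_map: np.ndarray) -> float:
    """saliency_map: float predictions; fixation_map: binary fixations (same shape)."""
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    return float(s[fixation_map.astype(bool)].mean())

pred = np.random.rand(480, 640)
fix = np.zeros((480, 640)); fix[240, 320] = 1   # one synthetic fixation
print(nss(pred, fix))
```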


Journal Article • DOI
TL;DR: The authors adapt the image prior learned by GANs to the image statistics of an individual image, so that the model can accurately reconstruct the input image and synthesize new content consistent with the appearance of the original image.
Abstract: Despite the recent success of GANs in synthesizing images conditioned on inputs such as a user sketch, text, or semantic labels, manipulating the high-level attributes of an existing natural photograph with GANs is challenging for two reasons. First, it is hard for GANs to precisely reproduce an input image. Second, after manipulation, the newly synthesized pixels often do not fit the original image. In this paper, we address these issues by adapting the image prior learned by GANs to image statistics of an individual image. Our method can accurately reconstruct the input image and synthesize new content, consistent with the appearance of the input image. We demonstrate our interactive system on several semantic image editing tasks, including synthesizing new objects consistent with background, removing unwanted objects, and changing the appearance of an object. Quantitative and qualitative comparisons against several existing methods demonstrate the effectiveness of our method.

315 citations
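A simplified sketch of the general idea, first fitting a latent code to the photograph and then also fine-tuning the generator weights on that single image (the generator G is a placeholder, and the real system also uses perceptual losses and additional regularization):

```python
# Simplified sketch of image-specific GAN adaptation (not the authors' code):
# 1) optimize a latent code z to approximately reconstruct the target photo,
# 2) then also fine-tune the generator weights on this one image so the
#    reconstruction (and nearby edits) match its statistics.
import torch
import torch.nn.functional as F

def adapt_gan_to_image(G, target, steps_z=500, steps_g=200, z_dim=512, device="cpu"):
    target = target.to(device)                      # (1, 3, H, W) in the generator's range
    z = torch.randn(1, z_dim, device=device, requires_grad=True)

    opt_z = torch.optim.Adam([z], lr=0.05)          # stage 1: latent code only
    for _ in range(steps_z):
        loss = F.mse_loss(G(z), target)
        opt_z.zero_grad(); loss.backward(); opt_z.step()

    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)  # stage 2: generator weights too
    for _ in range(steps_g):
        loss = F.mse_loss(G(z), target)
        opt_g.zero_grad(); loss.backward(); opt_g.step()
    return z.detach(), G
```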


Proceedings Article • DOI
01 Oct 2019
TL;DR: This work visualizes mode collapse at both the distribution level and the instance level, and deploys a semantic segmentation network to compare the distribution of segmented objects in the generated images with the target distribution in the training set.
Abstract: Despite the success of Generative Adversarial Networks (GANs), mode collapse remains a serious issue during GAN training. To date, little work has focused on understanding and quantifying which modes have been dropped by a model. In this work, we visualize mode collapse at both the distribution level and the instance level. First, we deploy a semantic segmentation network to compare the distribution of segmented objects in the generated images with the target distribution in the training set. Differences in statistics reveal object classes that are omitted by a GAN. Second, given the identified omitted object classes, we visualize the GAN's omissions directly. In particular, we compare specific differences between individual photos and their approximate inversions by a GAN. To this end, we relax the problem of inversion and solve the tractable problem of inverting a GAN layer instead of the entire generator. Finally, we use this framework to analyze several recent GANs trained on multiple datasets and identify their typical failure cases.

254 citations
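The distribution-level comparison can be sketched as follows: segment real and generated images with any pre-trained semantic segmenter and compare per-class pixel frequencies (a hedged sketch, not the paper's exact statistics):

```python
# Sketch of the distribution-level check: segment a batch of real and a batch
# of generated images with any pre-trained semantic segmenter, then compare
# how often each object class appears. Classes far below their real-image
# frequency are candidates for modes the GAN has dropped.
import numpy as np

def class_histogram(label_maps: np.ndarray, num_classes: int) -> np.ndarray:
    """label_maps: (N, H, W) integer class ids -> normalized pixel frequency per class."""
    counts = np.bincount(label_maps.ravel(), minlength=num_classes).astype(float)
    return counts / counts.sum()

def dropped_classes(real_labels, fake_labels, num_classes, ratio=0.1):
    real_hist = class_histogram(real_labels, num_classes)
    fake_hist = class_histogram(fake_labels, num_classes)
    # Flag classes whose generated frequency is under `ratio` of the real one.
    return [c for c in range(num_classes)
            if real_hist[c] > 0 and fake_hist[c] < ratio * real_hist[c]]
```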


Proceedings Article • DOI
11 Apr 2019
TL;DR: Quantitative and qualitative evaluations show that, compared to previous models that rely on visual appearance cues, the proposed motion-based system improves performance in separating musical instrument sounds.
Abstract: Sounds originate from object motions and vibrations of surrounding air. Inspired by the fact that humans are capable of interpreting sound sources from how objects move visually, we propose a novel system that explicitly captures such motion cues for the task of sound localization and separation. Our system is composed of an end-to-end learnable model called Deep Dense Trajectory (DDT) and a curriculum learning scheme. It exploits the inherent coherence of audio-visual signals from large quantities of unlabeled videos. Quantitative and qualitative evaluations show that, compared to previous models that rely on visual appearance cues, our motion-based system improves performance in separating musical instrument sounds. Furthermore, it separates sound components from duets of the same category of instruments, a challenging problem that has not been addressed before.

246 citations


Journal Article • DOI
TL;DR: In this paper, the authors quantified the interpretability of CNN representations by evaluating the alignment between individual hidden units and visual semantic concepts, and found that deep representations are more transparent and interpretable than they would be under a random but equally powerful basis.
Abstract: The success of recent deep convolutional neural networks (CNNs) depends on learning hidden representations that can summarize the important factors of variation behind the data. In this work, we describe Network Dissection, a method that interprets networks by providing meaningful labels to their individual units. The proposed method quantifies the interpretability of CNN representations by evaluating the alignment between individual hidden units and visual semantic concepts. By identifying the best alignments, units are given interpretable labels covering colors, materials, textures, parts, objects and scenes. The method reveals that deep representations are more transparent and interpretable than they would be under a random but equally powerful basis. We apply our approach to interpret and compare the latent representations of several network architectures trained to solve a wide range of supervised and self-supervised tasks. We then examine factors affecting network interpretability, such as the number of training iterations, regularization, different initialization parameters, as well as network depth and width. Finally, we show that the interpreted units can be used to provide explicit explanations of a given CNN prediction for an image. Our results highlight that interpretability is an important property of deep neural networks that provides new insights into what their hierarchical structures can learn.

213 citations
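The alignment measurement at the core of Network Dissection can be summarized as an IoU between a unit's thresholded activation map and a concept's segmentation mask; a minimal sketch:

```python
# Minimal sketch of the Network Dissection alignment score: threshold a
# unit's (upsampled) activation maps and compute their IoU with a concept's
# binary segmentation masks, accumulated over a probing dataset.
import numpy as np

def unit_concept_iou(activations, concept_masks, threshold):
    """activations, concept_masks: (N, H, W); threshold: scalar (e.g. the unit's
    top-quantile activation level). Returns the dataset-wide IoU."""
    unit_masks = activations > threshold
    inter = np.logical_and(unit_masks, concept_masks).sum()
    union = np.logical_or(unit_masks, concept_masks).sum()
    return inter / union if union > 0 else 0.0
```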


Posted Content
TL;DR: This work introduces the CoLlision Events for Video REpresentation and Reasoning (CLEVRER), a diagnostic video dataset for systematic evaluation of computational models on a wide range of reasoning tasks, and evaluates various state-of-the-art models for visual reasoning on a benchmark.
Abstract: The ability to reason about temporal and causal events from videos lies at the core of human intelligence. Most video reasoning benchmarks, however, focus on pattern recognition from complex visual and language input, instead of on causal structure. We study the complementary problem, exploring the temporal and causal structures behind videos of objects with simple visual appearance. To this end, we introduce the CoLlision Events for Video REpresentation and Reasoning (CLEVRER), a diagnostic video dataset for systematic evaluation of computational models on a wide range of reasoning tasks. Motivated by the theory of human causal judgment, CLEVRER includes four types of questions: descriptive (e.g., "what color"), explanatory ("what is responsible for"), predictive ("what will happen next"), and counterfactual ("what if"). We evaluate various state-of-the-art models for visual reasoning on our benchmark. While these models thrive on the perception-based task (descriptive), they perform poorly on the causal tasks (explanatory, predictive and counterfactual), suggesting that a principled approach for causal reasoning should incorporate the capability of both perceiving complex visual and language inputs, and understanding the underlying dynamics and causal relations. We also study an oracle model that explicitly combines these components via symbolic representations.

208 citations


Proceedings Article • DOI
01 Oct 2019
TL;DR: Gaze360 as discussed by the authors is a large-scale remote gaze tracking dataset and method for robust 3D gaze estimation in unconstrained images, which consists of 238 subjects in indoor and outdoor environments with labelled three-dimensional (3D) gaze across a wide range of head poses and distances.
Abstract: Understanding where people are looking is an informative social cue. In this work, we present Gaze360, a large-scale remote gaze-tracking dataset and method for robust 3D gaze estimation in unconstrained images. Our dataset consists of 238 subjects in indoor and outdoor environments with labelled 3D gaze across a wide range of head poses and distances. It is the largest publicly available dataset of its kind by both subject and variety, made possible by a simple and efficient collection method. Our proposed 3D gaze model extends existing models to include temporal information and to directly output an estimate of gaze uncertainty. We demonstrate the benefits of our model via an ablation study, and show its generalization performance via a cross-dataset evaluation against other recent gaze benchmark datasets. We furthermore propose a simple self-supervised approach to improve cross-dataset domain adaptation. Finally, we demonstrate an application of our model for estimating customer attention in a supermarket setting. Our dataset and models will be made available at http://gaze360.csail.mit.edu.

194 citations
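One standard way for a regressor to output an uncertainty estimate alongside the gaze direction is quantile (pinball) regression; the sketch below is a generic version of that loss, and the paper's exact formulation may differ:

```python
# Illustrative sketch of a quantile ("pinball") loss, one standard way to let
# a gaze regressor emit an error bound along with its prediction: the network
# outputs the expected gaze angle plus an offset defining an upper/lower
# quantile, and the asymmetric penalty below calibrates that offset.
import torch

def pinball_loss(pred, sigma, target, tau=0.9):
    """pred, sigma, target: (batch, dims); sigma is the predicted quantile offset."""
    upper = target - (pred + sigma)          # residual against the upper quantile
    lower = target - (pred - sigma)          # residual against the lower quantile
    loss_upper = torch.max(tau * upper, (tau - 1.0) * upper)
    loss_lower = torch.max((1.0 - tau) * lower, -tau * lower)
    return (loss_upper + loss_lower).mean()
```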


Proceedings Article • DOI
01 Oct 2019
TL;DR: The Human Action Clips and Segments (HACS) dataset as discussed by the authors is a large-scale dataset for action proposal generation and temporal localization from web videos, which contains 1.5M annotated clips sampled from 504K untrimmed videos, and 139K action segments densely annotated in 50K videos spanning 200 action categories.
Abstract: This paper presents a new large-scale dataset for recognition and temporal localization of human actions collected from Web videos. We refer to it as HACS (Human Action Clips and Segments). We leverage consensus and disagreement among visual classifiers to automatically mine candidate short clips from unlabeled videos, which are subsequently validated by human annotators. The resulting dataset is dubbed HACS Clips. Through a separate process we also collect annotations defining action segment boundaries. This resulting dataset is called HACS Segments. Overall, HACS Clips consists of 1.5M annotated clips sampled from 504K untrimmed videos, and HACS Segments contains 139K action segments densely annotated in 50K untrimmed videos spanning 200 action categories. HACS Clips contains more labeled examples than any existing video benchmark. This renders our dataset both a large-scale action recognition benchmark and an excellent source for spatiotemporal feature learning. In our transfer learning experiments on three target datasets, HACS Clips outperforms Kinetics-600, Moments-In-Time and Sports1M as a pretraining source. On HACS Segments, we evaluate state-of-the-art methods of action proposal generation and action localization, and highlight the new challenges posed by our dense temporal annotations.

155 citations


Posted Content
TL;DR: Meta-Sim as mentioned in this paper learns a generative model of synthetic scenes and obtains images as well as their corresponding ground truth via a graphics engine, and parametrizes its dataset generator with a neural network, which learns to modify attributes of scene graphs obtained from probabilistic scene grammars so as to minimize the distribution gap between its rendered outputs and target data.
Abstract: Training models to high-end performance requires availability of large labeled datasets, which are expensive to obtain. The goal of our work is to automatically synthesize labeled datasets that are relevant for a downstream task. We propose Meta-Sim, which learns a generative model of synthetic scenes and obtains images as well as their corresponding ground truth via a graphics engine. We parametrize our dataset generator with a neural network, which learns to modify attributes of scene graphs obtained from probabilistic scene grammars, so as to minimize the distribution gap between its rendered outputs and target data. If the real dataset comes with a small labeled validation set, we additionally aim to optimize a meta-objective, i.e. downstream task performance. Experiments show that the proposed method can greatly improve content generation quality over a human-engineered probabilistic scene grammar, both qualitatively and quantitatively as measured by performance on a downstream task.

Proceedings Article • DOI
01 Oct 2019
TL;DR: Meta-Sim is proposed, which learns a generative model of synthetic scenes and obtains images as well as their corresponding ground truth via a graphics engine, and can greatly improve content generation quality over a human-engineered probabilistic scene grammar.
Abstract: Training models to high-end performance requires availability of large labeled datasets, which are expensive to obtain. The goal of our work is to automatically synthesize labeled datasets that are relevant for a downstream task. We propose Meta-Sim, which learns a generative model of synthetic scenes and obtains images as well as their corresponding ground truth via a graphics engine. We parametrize our dataset generator with a neural network, which learns to modify attributes of scene graphs obtained from probabilistic scene grammars, so as to minimize the distribution gap between its rendered outputs and target data. If the real dataset comes with a small labeled validation set, we additionally aim to optimize a meta-objective, i.e. downstream task performance. Experiments show that the proposed method can greatly improve content generation quality over a human-engineered probabilistic scene grammar, both qualitatively and quantitatively as measured by performance on a downstream task.
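One standard differentiable choice for the kind of distribution gap Meta-Sim minimizes is the maximum mean discrepancy (MMD) between feature embeddings of rendered and real images; a hedged sketch (the paper's actual objective and features may differ):

```python
# Hedged sketch: one standard differentiable "distribution gap" between
# synthetic and real image features is the maximum mean discrepancy (MMD)
# with an RBF kernel. Meta-Sim minimizes a gap of this flavor; the exact
# objective and features used in the paper may differ.
import torch

def rbf_mmd(x, y, sigma=1.0):
    """x: (n, d) synthetic features, y: (m, d) real features -> scalar MMD^2."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

gap = rbf_mmd(torch.randn(64, 128), torch.randn(64, 128))  # toy features
```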

Proceedings Article • DOI
25 Oct 2019
TL;DR: This work proposes a system that can leverage unlabeled audiovisual data to learn to localize objects (moving vehicles) in a visual reference frame, purely using stereo sound at inference time, and demonstrates that the proposed approach outperforms several baseline approaches.
Abstract: Humans are able to localize objects in the environment using both visual and auditory cues, integrating information from multiple modalities into a common reference frame. We introduce a system that can leverage unlabeled audiovisual data to learn to localize objects (moving vehicles) in a visual reference frame, purely using stereo sound at inference time. Since it is labor-intensive to manually annotate the correspondences between audio and object bounding boxes, we achieve this goal by using the co-occurrence of visual and audio streams in unlabeled videos as a form of self-supervision, without resorting to the collection of ground truth annotations. In particular, we propose a framework that consists of a vision "teacher" network and a stereo-sound "student" network. During training, knowledge embodied in a well-established visual vehicle detection model is transferred to the audio domain using unlabeled videos as a bridge. At test time, the stereo-sound student network can work independently to perform object localization using just stereo audio and camera meta-data, without any visual input. Experimental results on a newly collected Auditory Vehicles Tracking dataset verify that our proposed approach outperforms several baseline approaches. We also demonstrate that our cross-modal auditory localization approach can assist in the visual localization of moving vehicles under poor lighting conditions.
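The teacher-student transfer can be sketched as a simple distillation step in which a frozen vision detector provides pseudo-labels for the stereo-sound network (network definitions are placeholders):

```python
# Sketch of cross-modal teacher/student training on unlabeled video: a frozen
# vision-based vehicle detector provides pseudo-labels (here, a spatial
# heatmap of vehicle locations) that the stereo-sound network learns to
# reproduce from audio alone. Network definitions are placeholders.
import torch
import torch.nn.functional as F

def distillation_step(vision_teacher, audio_student, frames, stereo_audio, optimizer):
    with torch.no_grad():
        target_heatmap = vision_teacher(frames)      # pseudo ground truth from vision
    pred_heatmap = audio_student(stereo_audio)       # prediction from sound only
    loss = F.mse_loss(pred_heatmap, target_heatmap)  # match the teacher's output
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```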

Proceedings Article • DOI
12 May 2019
TL;DR: In this article, a self-supervised neural network model for visual object segmentation and sound source separation is proposed, which disentangles concepts in the network's feature channels so that independent image segmentation and sound source separation are possible after audio-visual training on videos.
Abstract: Segmenting objects in images and separating sound sources in audio are challenging tasks, in part because traditional approaches require large amounts of labeled data. In this paper we develop a neural network model for visual object segmentation and sound source separation that learns from natural videos through self-supervision. The model is an extension of recently proposed work that maps image pixels to sounds [1]. Here, we introduce a learning approach to disentangle concepts in the neural networks, and assign semantic categories to network feature channels to enable independent image segmentation and sound source separation after audio-visual training on videos. Our evaluations show that the disentangled model outperforms several baselines in semantic segmentation and sound source separation.

Posted Content
TL;DR: In this article, a semantic segmentation network is deployed to compare the distribution of segmented objects in the generated images with the target distribution in the training set, revealing object classes that are omitted by a GAN.
Abstract: Despite the success of Generative Adversarial Networks (GANs), mode collapse remains a serious issue during GAN training. To date, little work has focused on understanding and quantifying which modes have been dropped by a model. In this work, we visualize mode collapse at both the distribution level and the instance level. First, we deploy a semantic segmentation network to compare the distribution of segmented objects in the generated images with the target distribution in the training set. Differences in statistics reveal object classes that are omitted by a GAN. Second, given the identified omitted object classes, we visualize the GAN's omissions directly. In particular, we compare specific differences between individual photos and their approximate inversions by a GAN. To this end, we relax the problem of inversion and solve the tractable problem of inverting a GAN layer instead of the entire generator. Finally, we use this framework to analyze several recent GANs trained on multiple datasets and identify their typical failure cases.

Proceedings Article • DOI
20 May 2019
TL;DR: Propagation Networks (PropNet) as discussed by the authors is a differentiable, learnable dynamics model that handles partially observable scenarios and enables instantaneous propagation of signals beyond pairwise interactions; it not only outperforms current learnable physics engines in forward simulation, but also achieves superior performance on various control tasks.
Abstract: There has been an increasing interest in learning dynamics simulators for model-based control. Compared with off-the-shelf physics engines, a learnable simulator can quickly adapt to unseen objects, scenes, and tasks. However, existing models like interaction networks only work for fully observable systems; they also only consider pairwise interactions within a single time step, both restricting their use in practical systems. We introduce Propagation Networks (PropNet), a differentiable, learnable dynamics model that handles partially observable scenarios and enables instantaneous propagation of signals beyond pairwise interactions. With these innovations, our propagation networks not only outperform current learnable physics engines in forward simulation, but also achieve superior performance on various control tasks. Compared with existing deep reinforcement learning algorithms, model-based control with propagation networks is more accurate, efficient, and generalizable to novel, partially observable scenes and tasks.
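The key idea of propagating effects beyond pairwise interactions can be illustrated with a toy message-passing module that runs several propagation rounds per time step (layer sizes and structure are assumptions, not the paper's architecture):

```python
# Hedged sketch of multi-step effect propagation on an object graph (the idea
# behind propagation networks): instead of a single round of pairwise message
# passing per time step, messages are propagated for several rounds so an
# effect can travel along chains of connected objects. MLP sizes are made up.
import torch
import torch.nn as nn

class PropagationStep(nn.Module):
    def __init__(self, state_dim=16, effect_dim=16, rounds=3):
        super().__init__()
        self.rounds, self.effect_dim = rounds, effect_dim
        self.edge_mlp = nn.Sequential(nn.Linear(2 * state_dim + effect_dim, 64),
                                      nn.ReLU(), nn.Linear(64, effect_dim))
        self.node_mlp = nn.Sequential(nn.Linear(state_dim + effect_dim, 64),
                                      nn.ReLU(), nn.Linear(64, state_dim))

    def forward(self, states, edges):
        """states: (num_objects, state_dim); edges: list of (sender, receiver) pairs."""
        effects = torch.zeros(states.size(0), self.effect_dim)
        for _ in range(self.rounds):                      # propagate effects along chains
            new_effects = torch.zeros_like(effects)
            for s, r in edges:
                msg = torch.cat([states[s], states[r], effects[s]])
                new_effects[r] = new_effects[r] + self.edge_mlp(msg)
            effects = new_effects
        return self.node_mlp(torch.cat([states, effects], dim=-1))  # next states
```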

Proceedings Article • DOI
01 Oct 2019
TL;DR: RF-Avatar, a neural network model that can estimate 3D meshes of the human body in the presence of occlusions, baggy clothes, and bad lighting conditions, and even through walls, is presented.
Abstract: This paper presents RF-Avatar, a neural network model that can estimate 3D meshes of the human body in the presence of occlusions, baggy clothes, and bad lighting conditions. We leverage that radio frequency (RF) signals in the WiFi range traverse clothes and occlusions and bounce off the human body. Our model parses such radio signals and recovers 3D body meshes. Our meshes are dynamic and smoothly track the movements of the corresponding people. Further, our model works both in single and multi-person scenarios. Inferring body meshes from radio signals is a highly under-constrained problem. Our model deals with this challenge using: 1) a combination of strong and weak supervision, 2) a multi-headed self-attention mechanism that attends differently to temporal information in the radio signal, and 3) an adversarially trained temporal discriminator that imposes a prior on the dynamics of human motion. Our results show that RF-Avatar accurately recovers dynamic 3D meshes in the presence of occlusions, baggy clothes, bad lighting conditions, and even through walls.

Posted Content
TL;DR: In this article, a cross-modal auditory localization approach is proposed to assist in the visual localization of moving vehicles under poor lighting conditions by using the co-occurrence of visual and audio streams in unlabeled videos.
Abstract: Humans are able to localize objects in the environment using both visual and auditory cues, integrating information from multiple modalities into a common reference frame. We introduce a system that can leverage unlabeled audio-visual data to learn to localize objects (moving vehicles) in a visual reference frame, purely using stereo sound at inference time. Since it is labor-intensive to manually annotate the correspondences between audio and object bounding boxes, we achieve this goal by using the co-occurrence of visual and audio streams in unlabeled videos as a form of self-supervision, without resorting to the collection of ground-truth annotations. In particular, we propose a framework that consists of a vision "teacher" network and a stereo-sound "student" network. During training, knowledge embodied in a well-established visual vehicle detection model is transferred to the audio domain using unlabeled videos as a bridge. At test time, the stereo-sound student network can work independently to perform object localization using just stereo audio and camera meta-data, without any visual input. Experimental results on a newly collected Auditory Vehicle Tracking dataset verify that our proposed approach outperforms several baseline approaches. We also demonstrate that our cross-modal auditory localization approach can assist in the visual localization of moving vehicles under poor lighting conditions.

Posted Content
Yunzhu Li, Hao He, Jiajun Wu, Dina Katabi, Antonio Torralba
TL;DR: This paper proposes to learn compositional Koopman operators, using graph neural networks to encode the state into object-centric embeddings and using a block-wise linear transition matrix to regularize the shared structure across objects.
Abstract: Finding an embedding space for a linear approximation of a nonlinear dynamical system enables efficient system identification and control synthesis. The Koopman operator theory lays the foundation for identifying the nonlinear-to-linear coordinate transformations with data-driven methods. Recently, researchers have proposed to use deep neural networks as a more expressive class of basis functions for calculating the Koopman operators. These approaches, however, assume a fixed dimensional state space; they are therefore not applicable to scenarios with a variable number of objects. In this paper, we propose to learn compositional Koopman operators, using graph neural networks to encode the state into object-centric embeddings and using a block-wise linear transition matrix to regularize the shared structure across objects. The learned dynamics can quickly adapt to new environments of unknown physical parameters and produce control signals to achieve a specified goal. Our experiments on manipulating ropes and controlling soft robots show that the proposed method has better efficiency and generalization ability than existing baselines.
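The linear-dynamics core of the approach can be illustrated with a least-squares fit of a Koopman operator on embedding trajectories; this sketch omits the graph encoder and the block-wise structure the paper imposes across objects:

```python
# Hedged sketch of the Koopman-style linear dynamics at the heart of the
# approach: once states are mapped into an embedding space (by some encoder,
# omitted here), the transition is modeled as a single linear operator K
# fitted by least squares, and prediction is repeated matrix multiplication.
# The paper additionally imposes a block-wise structure on K across objects.
import torch

def fit_koopman_operator(embeddings):
    """embeddings: (T, d) trajectory in embedding space -> (d, d) operator K
    minimizing ||K @ g_t - g_{t+1}|| over consecutive pairs."""
    g_t, g_next = embeddings[:-1], embeddings[1:]
    # Solve g_t @ K^T ~= g_next in the least-squares sense.
    K_T = torch.linalg.lstsq(g_t, g_next).solution
    return K_T.T

def rollout(K, g0, steps):
    g, traj = g0, [g0]
    for _ in range(steps):
        g = K @ g                 # linear prediction in embedding space
        traj.append(g)
    return torch.stack(traj)
```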

Proceedings Article • DOI
01 Jun 2019
TL;DR: In this article, a new conditional adversarial model was proposed to synthesize plausible tactile signals from visual inputs as well as imagine how humans interact with objects given tactile data as input.
Abstract: Humans perceive the world using multi-modal sensory inputs such as vision, audition, and touch. In this work, we investigate the cross-modal connection between vision and touch. The main challenge in this cross-domain modeling task lies in the significant scale discrepancy between the two: while our eyes perceive an entire visual scene at once, humans can only feel a small region of an object at any given moment. To connect vision and touch, we introduce new tasks of synthesizing plausible tactile signals from visual inputs as well as imagining how we interact with objects given tactile data as input. To accomplish our goals, we first equip robots with both visual and tactile sensors and collect a large-scale dataset of corresponding vision and tactile image sequences. To close the scale gap, we present a new conditional adversarial model that incorporates the scale and location information of the touch. Human perceptual studies demonstrate that our model can produce realistic visual images from tactile data and vice versa. Finally, we present both qualitative and quantitative experimental results regarding different system designs, as well as visualizing the learned representations of our model.

Posted Content
TL;DR: In this paper, an end-to-end learnable model called Deep Dense Trajectory (DDT) and a curriculum learning scheme were proposed for sound localization and separation.
Abstract: Sounds originate from object motions and vibrations of surrounding air. Inspired by the fact that humans are capable of interpreting sound sources from how objects move visually, we propose a novel system that explicitly captures such motion cues for the task of sound localization and separation. Our system is composed of an end-to-end learnable model called Deep Dense Trajectory (DDT) and a curriculum learning scheme. It exploits the inherent coherence of audio-visual signals from large quantities of unlabeled videos. Quantitative and qualitative evaluations show that, compared to previous models that rely on visual appearance cues, our motion-based system improves performance in separating musical instrument sounds. Furthermore, it separates sound components from duets of the same category of instruments, a challenging problem that has not been addressed before.

Posted Content
TL;DR: This work investigates the cross-modal connection between vision and touch with a new conditional adversarial model that incorporates the scale and location information of the touch and demonstrates that the model can produce realistic visual images from tactile data and vice versa.
Abstract: Humans perceive the world using multi-modal sensory inputs such as vision, audition, and touch. In this work, we investigate the cross-modal connection between vision and touch. The main challenge in this cross-domain modeling task lies in the significant scale discrepancy between the two: while our eyes perceive an entire visual scene at once, humans can only feel a small region of an object at any given moment. To connect vision and touch, we introduce new tasks of synthesizing plausible tactile signals from visual inputs as well as imagining how we interact with objects given tactile data as input. To accomplish our goals, we first equip robots with both visual and tactile sensors and collect a large-scale dataset of corresponding vision and tactile image sequences. To close the scale gap, we present a new conditional adversarial model that incorporates the scale and location information of the touch. Human perceptual studies demonstrate that our model can produce realistic visual images from tactile data and vice versa. Finally, we present both qualitative and quantitative experimental results regarding different system designs, as well as visualizing the learned representations of our model.

Proceedings Article • DOI
01 Oct 2019
TL;DR: NTG is a sequential generative model parameterized by a neural network that iteratively generates a new node and an edge connecting to an existing node conditioned on the current graph and achieves state-of-the-art performance on the SpaceNet dataset.
Abstract: We propose Neural Turtle Graphics (NTG), a novel generative model for spatial graphs, and demonstrate its applications in modeling city road layouts. Specifically, we represent the road layout using a graph where nodes in the graph represent control points and edges in the graph represent road segments. NTG is a sequential generative model parameterized by a neural network. It iteratively generates a new node and an edge connecting to an existing node conditioned on the current graph. We train NTG on Open Street Map data and show that it outperforms existing approaches using a set of diverse performance metrics. Moreover, our method allows users to control styles of generated road layouts mimicking existing cities as well as to sketch a part of the city road layout to be synthesized. In addition to synthesis, the proposed NTG finds uses in an analytical task of aerial road parsing. Experimental results show that it achieves state-of-the-art performance on the SpaceNet dataset.
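The sequential generation procedure can be sketched as a simple autoregressive loop; all learned components below are placeholders (random stand-ins in the toy usage), not the NTG architecture:

```python
# Hedged skeleton of the sequential generation loop (module internals are
# placeholders, not the NTG architecture): at every step the model encodes the
# current road graph, proposes a new control-point node, connects it to an
# existing node, and decides whether to stop.
import random

def generate_road_graph(encode_graph, propose_node, choose_parent, should_stop,
                        max_nodes=200):
    nodes = [(0.0, 0.0)]          # control points (x, y); start from a seed node
    edges = []                    # road segments as (parent_index, child_index)
    while len(nodes) < max_nodes:
        h = encode_graph(nodes, edges)        # summary of the graph so far
        if should_stop(h):
            break
        parent = choose_parent(h, nodes)      # index of an existing node to extend
        nodes.append(propose_node(h, nodes[parent]))
        edges.append((parent, len(nodes) - 1))
    return nodes, edges

# Toy usage with random stand-ins for the learned components:
nodes, edges = generate_road_graph(
    encode_graph=lambda n, e: len(n),
    propose_node=lambda h, p: (p[0] + random.uniform(-1, 1), p[1] + random.uniform(-1, 1)),
    choose_parent=lambda h, n: random.randrange(len(n)),
    should_stop=lambda h: h > 50)
```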

Posted Content
29 Jan 2019
TL;DR: This work presents an analytic framework to visualize and understand GANs at the unit-, object-, and scene-level, and identifies a group of interpretable units that are closely related to concepts with a segmentation-based network dissection method.
Abstract: Generative Adversarial Networks (GANs) have achieved impressive results for many real-world applications. As an active research topic, many GAN variants have emerged with improvements in sample quality and training stability. However, visualization and understanding of GANs is largely missing. How does a GAN represent our visual world internally? What causes the artifacts in GAN results? How do architectural choices affect GAN learning? Answering such questions could enable us to develop new insights and better models. In this work, we present an analytic framework to visualize and understand GANs at the unit-, object-, and scene-level. We first identify a group of interpretable units that are closely related to concepts with a segmentation-based network dissection method. We quantify the causal effect of interpretable units by measuring the ability of interventions to control objects in the output. Finally, we examine the contextual relationship between these units and their surroundings by inserting the discovered concepts into new images. We show several practical applications enabled by our framework, from comparing internal representations across different layers, models, and datasets, to improving GANs by locating and removing artifact-causing units, to interactively manipulating objects in the scene. We will open source our interactive tools to help researchers and practitioners better understand their models.
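The unit-level intervention can be sketched with a forward hook that ablates chosen channels at one generator layer and measures how much of a target object class a segmenter still finds (generator, layer, and segmenter are placeholders):

```python
# Hedged sketch of a causal unit intervention (in the spirit of GAN
# dissection): ablate a chosen set of channels at one generator layer with a
# forward hook, then measure how much of a target class a segmenter still
# finds in the generated image. Generator, layer, and segmenter are placeholders.
import torch

def measure_unit_effect(G, layer, units, z, segment, target_class):
    def ablate(module, inputs, output):
        output = output.clone()
        output[:, units] = 0.0            # zero the selected channels (units)
        return output

    with torch.no_grad():
        area_before = (segment(G(z)) == target_class).float().mean().item()
        handle = layer.register_forward_hook(ablate)
        area_after = (segment(G(z)) == target_class).float().mean().item()
        handle.remove()
    # A large drop suggests the units causally contribute to the object class.
    return area_before - area_after
```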

Posted Content
TL;DR: This work presents Gaze360, a large-scale remote gaze-tracking dataset and method for robust 3D gaze estimation in unconstrained images, and proposes a simple self-supervised approach to improve cross-dataset domain adaptation.
Abstract: Understanding where people are looking is an informative social cue. In this work, we present Gaze360, a large-scale gaze-tracking dataset and method for robust 3D gaze estimation in unconstrained images. Our dataset consists of 238 subjects in indoor and outdoor environments with labelled 3D gaze across a wide range of head poses and distances. It is the largest publicly available dataset of its kind by both subject and variety, made possible by a simple and efficient collection method. Our proposed 3D gaze model extends existing models to include temporal information and to directly output an estimate of gaze uncertainty. We demonstrate the benefits of our model via an ablation study, and show its generalization performance via a cross-dataset evaluation against other recent gaze benchmark datasets. We furthermore propose a simple self-supervised approach to improve cross-dataset domain adaptation. Finally, we demonstrate an application of our model for estimating customer attention in a supermarket setting. Our dataset and models are available at this http URL .

Proceedings Article
01 Jan 2019
TL;DR: Deep learning models that learn joint multi-modal embeddings in videos where the audio and visual streams are loosely synchronized are explored, and with weak supervision the authors see significant amounts of cross-modal learning.
Abstract: In this paper, we explore deep learning models that learn joint multi-modal embeddings in videos where the audio and visual streams are loosely synchronized. Specifically, we consider cooking show videos from the YouCook2 dataset and a subset of the YouTube-8M dataset. We introduce varying levels of supervision into the learning process to guide the sampling of audio-visual pairs for training the models. This includes (1) a fully-unsupervised approach that samples audio-visual segments uniformly from an entire video, and (2) sampling audio-visual segments using weak supervision from off-the-shelf automatic speech and visual recognition systems. Although these models are preliminary, even with no supervision they are capable of learning cross-modal correlations, and with weak supervision we see significant amounts of cross-modal learning.

Proceedings Article • DOI
15 Jun 2019
TL;DR: This work builds upon VirtualHome to create a new dataset, VirtualHome-Env, in which program sketches are collected to represent activities and programs are matched with environments that can afford them, and proposes RNN-ResActGraph, a network that generates a program from a given sketch and an environment graph and tracks the changes in the environment induced by the program.
Abstract: In order to learn to perform activities from demonstrations or descriptions, agents need to distill what the essence of the given activity is, and how it can be adapted to new environments. In this work, we address the problem of environment-aware program generation. Given a visual demonstration or a description of an activity, we generate program sketches representing the essential instructions and propose a model to flesh these into full programs representing the actions needed to perform the activity under the presented environmental constraints. To this end, we build upon VirtualHome to create a new dataset, VirtualHome-Env, where we collect program sketches to represent activities and match programs with environments that can afford them. Furthermore, we construct a knowledge base to sample realistic environments and another knowledge base to seek out the programs under the sampled environments. Finally, we propose RNN-ResActGraph, a network that generates a program from a given sketch and an environment graph and tracks the changes in the environment induced by the program.

Proceedings Article • DOI
06 Jun 2019
TL;DR: In this article, a generative adversarial network (GAN) is proposed to learn composable module operations that can either add or remove a particular ingredient in a food recipe, which can be seen as a way to change the visual appearance of a dish by adding extra objects or changing the appearance of the existing ones.
Abstract: A food recipe is an ordered set of instructions for preparing a particular dish. From a visual perspective, every instruction step can be seen as a way to change the visual appearance of the dish by adding extra objects (e.g., adding an ingredient) or changing the appearance of the existing ones (e.g., cooking the dish). In this paper, we aim to teach a machine how to make a pizza by building a generative model that mirrors this step-by-step procedure. To do so, we learn composable module operations which are able to either add or remove a particular ingredient. Each operator is designed as a Generative Adversarial Network (GAN). Given only weak image-level supervision, the operators are trained to generate a visual layer that needs to be added to or removed from the existing image. The proposed model is able to decompose an image into an ordered sequence of layers by applying sequentially in the right order the corresponding removing modules. Experimental results on synthetic and real pizza images demonstrate that our proposed model is able to: (1) segment pizza toppings in a weakly-supervised fashion, (2) remove them by revealing what is occluded underneath them (i.e., inpainting), and (3) infer the ordering of the toppings without any depth ordering supervision. Code, data, and models are available online.

Posted Content
TL;DR: In this paper, the authors proposed Neural Turtle Graphics (NTG), a novel generative model for spatial graphs, and demonstrate its applications in modeling city road layouts, where the road layout is represented using a graph where nodes in the graph represent control points and edges in a graph represent road segments, and the model iteratively generates a new node and an edge connecting to an existing node conditioned on the current graph.
Abstract: We propose Neural Turtle Graphics (NTG), a novel generative model for spatial graphs, and demonstrate its applications in modeling city road layouts. Specifically, we represent the road layout using a graph where nodes in the graph represent control points and edges in the graph represent road segments. NTG is a sequential generative model parameterized by a neural network. It iteratively generates a new node and an edge connecting to an existing node conditioned on the current graph. We train NTG on Open Street Map data and show that it outperforms existing approaches using a set of diverse performance metrics. Moreover, our method allows users to control styles of generated road layouts mimicking existing cities as well as to sketch parts of the city road layout to be synthesized. In addition to synthesis, the proposed NTG finds uses in an analytical task of aerial road parsing. Experimental results show that it achieves state-of-the-art performance on the SpaceNet dataset.

Posted Content
TL;DR: This paper aims to teach a machine how to make a pizza by building a generative model that mirrors this step-by-step procedure and learns composable module operations which are able to either add or remove a particular ingredient.
Abstract: A food recipe is an ordered set of instructions for preparing a particular dish. From a visual perspective, every instruction step can be seen as a way to change the visual appearance of the dish by adding extra objects (e.g., adding an ingredient) or changing the appearance of the existing ones (e.g., cooking the dish). In this paper, we aim to teach a machine how to make a pizza by building a generative model that mirrors this step-by-step procedure. To do so, we learn composable module operations which are able to either add or remove a particular ingredient. Each operator is designed as a Generative Adversarial Network (GAN). Given only weak image-level supervision, the operators are trained to generate a visual layer that needs to be added to or removed from the existing image. The proposed model is able to decompose an image into an ordered sequence of layers by applying sequentially in the right order the corresponding removing modules. Experimental results on synthetic and real pizza images demonstrate that our proposed model is able to: (1) segment pizza toppings in a weaklysupervised fashion, (2) remove them by revealing what is occluded underneath them (i.e., inpainting), and (3) infer the ordering of the toppings without any depth ordering supervision. Code, data, and models are available online.