
Showing papers by "Antonio Torralba published in 2021"


Proceedings Article•DOI•
13 Apr 2021
TL;DR: DatasetGAN as discussed by the authors uses GANs to generate high-quality semantically segmented images, which can then be used for training any computer vision architecture just as real datasets are.
Abstract: We introduce DatasetGAN: an automatic procedure to generate massive datasets of high-quality semantically segmented images requiring minimal human effort. Current deep networks are extremely data-hungry, benefiting from training on large-scale datasets, which are time consuming to annotate. Our method relies on the power of recent GANs to generate realistic images. We show how the GAN latent code can be decoded to produce a semantic segmentation of the image. Training the decoder only needs a few labeled examples to generalize to the rest of the latent space, resulting in an infinite annotated dataset generator! These generated datasets can then be used for training any computer vision architecture just as real datasets are. As only a few images need to be manually segmented, it becomes possible to annotate images in extreme detail and generate datasets with rich object and part segmentations. To showcase the power of our approach, we generated datasets for 7 image segmentation tasks which include pixel-level labels for 34 human face parts, and 32 car parts. Our approach outperforms all semi-supervised baselines significantly and is on par with fully supervised methods, which in some cases require as much as 100x more annotated data as our method.

162 citations
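
The key mechanism here is a lightweight decoder from per-pixel GAN features to part labels, trained on only a handful of annotated images. A minimal sketch of that idea in PyTorch, assuming per-pixel feature maps have already been extracted from a pretrained generator (the feature dimension, class count, and data shapes below are illustrative placeholders, not the paper's):

```python
import torch
import torch.nn as nn

class PixelLabelDecoder(nn.Module):
    """Tiny MLP that maps a per-pixel GAN feature vector to a part label."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, feats):            # feats: (B, C, H, W)
        b, c, h, w = feats.shape
        x = feats.permute(0, 2, 3, 1).reshape(-1, c)     # one row per pixel
        return self.mlp(x).view(b, h, w, -1).permute(0, 3, 1, 2)

# Illustrative training loop on a few annotated (feature map, mask) pairs.
feat_dim, num_classes = 512, 35          # hypothetical sizes
decoder = PixelLabelDecoder(feat_dim, num_classes)
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

few_shot_feats = torch.randn(8, feat_dim, 64, 64)        # stand-in GAN features
few_shot_masks = torch.randint(0, num_classes, (8, 64, 64))

for step in range(100):
    logits = decoder(few_shot_feats)
    loss = loss_fn(logits, few_shot_masks)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Once trained, sampling a new latent, extracting its features, and running the
# decoder yields an (image, mask) pair, i.e. an "infinite" annotated dataset.
```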


Journal Article•DOI•
01 Mar 2021
TL;DR: A textile-based tactile learning platform that can be used to record, monitor and learn human–environment interactions and it is shown that the artificial-intelligence-powered sensing textiles can classify humans’ sitting poses, motions and other interactions with the environment.
Abstract: Recording, modelling and understanding tactile interactions is important in the study of human behaviour and in the development of applications in healthcare and robotics. However, such studies remain challenging because existing wearable sensory interfaces are limited in terms of performance, flexibility, scalability and cost. Here, we report a textile-based tactile learning platform that can be used to record, monitor and learn human–environment interactions. The tactile textiles are created via digital machine knitting of inexpensive piezoresistive fibres, and can conform to arbitrary three-dimensional geometries. To ensure that our system is robust against variations in individual sensors, we use machine learning techniques for sensing correction and calibration. Using the platform, we capture diverse human–environment interactions (more than a million tactile frames) and show that the artificial-intelligence-powered sensing textiles can classify humans’ sitting poses, motions and other interactions with the environment. We also show that the platform can recover dynamic whole-body poses, reveal environmental spatial information and discover biomechanical signatures. Large-scale sensing textiles that can conform to arbitrary three-dimensional geometries and are created through digital machine knitting of piezoresistive fibres can be used to record, monitor and learn human–environment interactions.

129 citations


Journal Article•DOI•
TL;DR: The Recipe1M+ dataset as mentioned in this paper is a large-scale, structured corpus of over one million cooking recipes and 13 million food images, which enables the ability to train high-capacity models on aligned, multimodal data.
Abstract: In this paper, we introduce Recipe1M+, a new large-scale, structured corpus of over one million cooking recipes and 13 million food images. As the largest publicly available collection of recipe data, Recipe1M+ affords the ability to train high-capacity models on aligned, multimodal data. Using these data, we train a neural network to learn a joint embedding of recipes and images that yields impressive results on an image-recipe retrieval task. Moreover, we demonstrate that regularization via the addition of a high-level classification objective both improves retrieval performance to rival that of humans and enables semantic vector arithmetic. We postulate that these embeddings will provide a basis for further exploration of the Recipe1M+ dataset and food and cooking in general. Code, data and models are publicly available at http://im2recipe.csail.mit.edu.

105 citations
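
The described setup trains a joint recipe-image embedding for retrieval and regularizes it with a high-level classification objective. A hedged sketch of that combination (the encoders, dimensions, losses, and weights are placeholders, not the released im2recipe code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=1024, embed_dim=512, num_classes=1048):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)    # on top of a CNN image feature
        self.txt_proj = nn.Linear(txt_dim, embed_dim)    # on top of a recipe text feature
        self.classifier = nn.Linear(embed_dim, num_classes)  # high-level semantic classes

    def forward(self, img_feat, txt_feat):
        img_emb = F.normalize(self.img_proj(img_feat), dim=-1)
        txt_emb = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return img_emb, txt_emb

def retrieval_loss(img_emb, txt_emb, margin=0.3):
    """Triplet-style loss: matched image/recipe pairs should be closer than mismatches."""
    sim = img_emb @ txt_emb.t()                 # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)               # matched pairs sit on the diagonal
    mask = ~torch.eye(len(sim), dtype=torch.bool)
    neg = sim[mask].view(len(sim), -1)          # all mismatched pairs per row
    return F.relu(margin + neg - pos).mean()

model = JointEmbedding()
img_feat = torch.randn(32, 2048)                # stand-in pooled CNN features
txt_feat = torch.randn(32, 1024)                # stand-in recipe encoder output
labels = torch.randint(0, 1048, (32,))          # coarse food-class labels

img_emb, txt_emb = model(img_feat, txt_feat)
loss = (retrieval_loss(img_emb, txt_emb)
        + 0.1 * F.cross_entropy(model.classifier(img_emb), labels)
        + 0.1 * F.cross_entropy(model.classifier(txt_emb), labels))
loss.backward()
```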


Proceedings Article•DOI•
12 Apr 2021
TL;DR: In this paper, a generative adversarial network is proposed to capture the joint image-label distribution and is trained efficiently using a large set of unlabeled images supplemented with only a few labeled ones.
Abstract: Training deep networks with limited labeled data while achieving a strong generalization ability is key in the quest to reduce human annotation efforts. This is the goal of semi-supervised learning, which exploits more widely available unlabeled data to complement small labeled data sets. In this paper, we propose a novel framework for discriminative pixel-level tasks using a generative model of both images and labels. Concretely, we learn a generative adversarial network that captures the joint image-label distribution and is trained efficiently using a large set of unlabeled images supplemented with only a few labeled ones. We build our architecture on top of StyleGAN2 [45], augmented with a label synthesis branch. Image labeling at test time is achieved by first embedding the target image into the joint latent space via an encoder network and test-time optimization, and then generating the label from the inferred embedding. We evaluate our approach in two important domains: medical image segmentation and part-based face segmentation. We demonstrate strong in-domain performance compared to several baselines, and are the first to showcase extreme out-of-domain generalization, such as transferring from CT to MRI in medical imaging, and photographs of real faces to paintings, sculptures, and even cartoons and animal faces. Project Page: https://nv-tlabs.github.io/semanticGAN/

103 citations
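
Test-time labeling here works by embedding the target image into the joint latent space (encoder warm start plus optimization) and then decoding the label. A generic sketch of that inference loop, with toy stand-ins for the generator, encoder, and label branch rather than the actual StyleGAN2-based networks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim = 512

# Stand-ins for the trained components (placeholders, not the paper's networks).
generator  = nn.Sequential(nn.Linear(latent_dim, 3 * 32 * 32), nn.Tanh())        # z -> image
label_head = nn.Sequential(nn.Linear(latent_dim, 5 * 32 * 32))                   # z -> label logits
encoder    = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, latent_dim))     # image -> z init

def label_image(target_img, steps=200, lr=0.05):
    """Embed `target_img` into the joint latent space, then decode its label map."""
    z = encoder(target_img).detach().requires_grad_(True)    # encoder gives a warm start
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        recon = generator(z).view_as(target_img)
        loss = F.mse_loss(recon, target_img)       # real systems add perceptual terms
        opt.zero_grad()
        loss.backward()
        opt.step()
    logits = label_head(z).view(-1, 5, 32, 32)
    return logits.argmax(dim=1)                    # predicted segmentation

target = torch.rand(1, 3, 32, 32) * 2 - 1
mask = label_image(target)
print(mask.shape)    # torch.Size([1, 32, 32])
```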


Proceedings Article•DOI•
30 Apr 2021
TL;DR: In this paper, the authors introduce a novel high-quality neural simulator referred to as DriveGAN that achieves controllability by disentangling different components without supervision, including steering control and sampling features of a scene, such as the weather and the location of non-player objects.
Abstract: Realistic simulators are critical for training and verifying robotics systems. While most of the contemporary simulators are hand-crafted, a scalable way to build simulators is to use machine learning to learn how the environment behaves in response to an action, directly from data. In this work, we aim to learn to simulate a dynamic environment directly in pixel-space, by watching unannotated sequences of frames and their associated actions. We introduce a novel high-quality neural simulator referred to as DriveGAN that achieves controllability by disentangling different components without supervision. In addition to steering controls, it also includes controls for sampling features of a scene, such as the weather as well as the location of non-player objects. Since DriveGAN is a fully differentiable simulator, it further allows for re-simulation of a given video sequence, allowing an agent to drive through a recorded scene again, possibly taking different actions. We train DriveGAN on multiple datasets, including 160 hours of real-world driving data. We showcase that our approach greatly surpasses the performance of previous data-driven simulators, and allows for new key features not explored before.

45 citations
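
At its core, a data-driven simulator of this kind rolls a latent scene state forward conditioned on the agent's action and decodes each state to pixels. A toy sketch of that loop (the GRU dynamics and linear decoder are generic placeholders; DriveGAN's actual architecture disentangles scene components and uses a GAN-based decoder):

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Rolls a latent scene state forward given an action (e.g., steering, throttle)."""
    def __init__(self, latent_dim=128, action_dim=2):
        super().__init__()
        self.rnn = nn.GRUCell(action_dim, latent_dim)
        self.decoder = nn.Sequential(           # stand-in for a GAN image decoder
            nn.Linear(latent_dim, 3 * 64 * 64), nn.Tanh())

    def forward(self, state, action):
        next_state = self.rnn(action, state)
        frame = self.decoder(next_state).view(-1, 3, 64, 64)
        return next_state, frame

sim = LatentDynamics()
state = torch.zeros(1, 128)                               # initial scene latent
actions = torch.tensor([[0.1, 0.8]]).repeat(30, 1, 1)     # 30 steps of (steer, throttle)

frames = []
for a in actions:                                          # "drive" through the learned simulator
    state, frame = sim(state, a)
    frames.append(frame)
print(len(frames), frames[0].shape)                        # 30 torch.Size([1, 3, 64, 64])
```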


Journal Article•DOI•
01 Feb 2021
TL;DR: Major topics in computer vision for surgical domains are summarized to enable the surgery community to collectively define well-specified common objectives for automated systems, spur academic research, mobilize industry, and provide benchmarks to track progress.
Abstract: Effectiveness of computer vision techniques has been demonstrated through a number of applications, both within and outside healthcare. The operating room environment specifically is a setting with rich data sources compatible with computational approaches and high potential for direct patient benefit. The aim of this review is to summarize major topics in computer vision for surgical domains. The major capabilities of computer vision are described as an aid to surgical teams to improve performance and contribute to enhanced patient safety. Literature was identified through leading experts in the fields of surgery, computational analysis and modeling in medicine, and computer vision in healthcare. The literature supports the application of computer vision principles to surgery. Potential applications within surgery include operating room vigilance, endoscopic vigilance, and individual and team-wide behavioral analysis. To advance the field, we recommend collecting and publishing carefully annotated datasets. Doing so will enable the surgery community to collectively define well-specified common objectives for automated systems, spur academic research, mobilize industry, and provide benchmarks with which we can track progress. Leveraging computer vision approaches through interdisciplinary collaboration and advanced approaches to data acquisition, modeling, interpretation, and integration promises a powerful impact on patient safety, public health, and financial costs.

27 citations


Proceedings Article•DOI•
01 Jun 2021
TL;DR: In this paper, the authors propose a 3D human pose estimation approach using the pressure maps recorded by a tactile carpet as input, which enables the real-time recordings of human-floor tactile interactions in a seamless manner.
Abstract: Daily human activities, e.g., locomotion, exercises, and resting, are heavily guided by the tactile interactions between the human and the ground. In this work, leveraging such tactile interactions, we propose a 3D human pose estimation approach using the pressure maps recorded by a tactile carpet as input. We build a low-cost, high-density, large-scale intelligent carpet, which enables the real-time recordings of human-floor tactile interactions in a seamless manner. We collect a synchronized tactile and visual dataset on various human activities. Employing a state-of-the-art camera-based pose estimation model as supervision, we design and implement a deep neural network model to infer 3D human poses using only the tactile information. Our pipeline can be further scaled up to multi-person pose estimation. We evaluate our system and demonstrate its potential applications in diverse fields.

24 citations
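
The described pipeline regresses 3D keypoints from carpet pressure maps, with a camera-based pose estimator providing training targets. A hedged sketch of such a regression model (frame size, keypoint count, and network architecture are illustrative):

```python
import torch
import torch.nn as nn

NUM_KEYPOINTS = 21    # hypothetical skeleton size

class TactilePoseNet(nn.Module):
    """Maps a stack of pressure frames to 3D keypoint coordinates."""
    def __init__(self, in_frames=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4),
        )
        self.head = nn.Linear(64 * 4 * 4, NUM_KEYPOINTS * 3)

    def forward(self, pressure):                      # (B, in_frames, H, W)
        x = self.features(pressure).flatten(1)
        return self.head(x).view(-1, NUM_KEYPOINTS, 3)

model = TactilePoseNet()
pressure = torch.rand(8, 5, 96, 96)                   # stand-in carpet pressure maps
target = torch.randn(8, NUM_KEYPOINTS, 3)             # supervision from a camera-based pose model
loss = nn.functional.mse_loss(model(pressure), target)
loss.backward()
```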


Proceedings Article•
03 May 2021
TL;DR: In this paper, the authors exploit GANs as a multi-view data generator to train an inverse graphics network using an off-the-shelf differentiable renderer, and the trained inverse graphics network as a teacher to disentangle the GAN's latent code into interpretable 3D properties.
Abstract: Differentiable rendering has paved the way to training neural networks to perform “inverse graphics” tasks such as predicting 3D geometry from monocular photographs. To train high-performing models, most of the current approaches rely on multi-view imagery, which is not readily available in practice. Recent Generative Adversarial Networks (GANs) that synthesize images, in contrast, seem to acquire 3D knowledge implicitly during training: object viewpoints can be manipulated by simply manipulating the latent codes. However, these latent codes often lack further physical interpretation and thus GANs cannot easily be inverted to perform explicit 3D reasoning. In this paper, we aim to extract and disentangle 3D knowledge learned by generative models by utilizing differentiable renderers. Key to our approach is to exploit GANs as a multi-view data generator to train an inverse graphics network using an off-the-shelf differentiable renderer, and the trained inverse graphics network as a teacher to disentangle the GAN's latent code into interpretable 3D properties. The entire architecture is trained iteratively using cycle consistency losses. We show that our approach significantly outperforms state-of-the-art inverse graphics networks trained on existing datasets, both quantitatively and via user studies. We further showcase the disentangled GAN as a controllable 3D “neural renderer”, complementing traditional graphics renderers.

22 citations


Posted Content•
TL;DR: The text shown for this Ego4D entry is an acknowledgements excerpt: it credits collaborators, notes the commercial software (brighter.ai, Primloc's Secure Redact) used by some universities to de-identify videos, and lists the funding sources of the contributing institutions.
Abstract: We gratefully acknowledge the following colleagues for valuable discussions and support of our project: Aaron Adcock, Andrew Allen, Behrouz Behmardi, Serge Belongie, Mark Broyles, Xiao Chu, Samuel Clapp, Irene D’Ambra, Peter Dodds, Jacob Donley, Ruohan Gao, Tal Hassner, Ethan Henderson, Jiabo Hu, Guillaume Jeanneret, Sanjana Krishnan, Tsung-Yi Lin, Bobby Otillar, Manohar Paluri, Maja Pantic, Lucas Pinto, Vivek Roy, Jerome Pesenti, Joelle Pineau, Luca Sbordone, Rajan Subramanian, Helen Sun, Mary Williamson, and Bill Wu. We also acknowledge Jacob Chalk for setting up the Ego4D AWS backend and Prasanna Sridhar for developing the Ego4D website. Thank you to the Common Visual Data Foundation (CVDF) for hosting the Ego4D dataset. The universities acknowledge the usage of commercial software for de-identification of video. brighter.ai was used for redacting videos by some of the universities. Personal data from the University of Bristol was protected by Primloc’s Secure Redact software suite. UNICT is supported by MIUR AIM - Attrazione e Mobilita Internazionale Linea 1 - AIM1893589 - CUP E64118002540007. Bristol is supported by UKRI Engineering and Physical Sciences Research Council (EPSRC) Doctoral Training Program (DTP), EPSRC Fellowship UMPIRE (EP/T004991/1). KAUST is supported by the KAUST Office of Sponsored Research through the Visual Computing Center (VCC) funding. National University of Singapore is supported by Mike Shou's Start-Up Grant. Georgia Tech is supported in part by NSF award 2033413 and NIH award R01MH114999.

19 citations



Proceedings Article•
13 Apr 2021
TL;DR: In this article, Bundle-Adjusting Neural Radiance Fields (BARF) are proposed for training NeRF from imperfect (or even unknown) camera poses, which enables view synthesis and localization of video sequences from unknown camera poses.
Abstract: Neural Radiance Fields (NeRF) have recently gained a surge of interest within the computer vision community for its power to synthesize photorealistic novel views of real-world scenes. One limitation of NeRF, however, is its requirement of accurate camera poses to learn the scene representations. In this paper, we propose Bundle-Adjusting Neural Radiance Fields (BARF) for training NeRF from imperfect (or even unknown) camera poses -- the joint problem of learning neural 3D representations and registering camera frames. We establish a theoretical connection to classical image alignment and show that coarse-to-fine registration is also applicable to NeRF. Furthermore, we show that naively applying positional encoding in NeRF has a negative impact on registration with a synthesis-based objective. Experiments on synthetic and real-world data show that BARF can effectively optimize the neural scene representations and resolve large camera pose misalignment at the same time. This enables view synthesis and localization of video sequences from unknown camera poses, opening up new avenues for visual localization systems (e.g. SLAM) and potential applications for dense 3D mapping and reconstruction.
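
A central idea in BARF is that full-strength positional encoding hurts registration, so higher-frequency bands are blended in gradually during training. A sketch of that coarse-to-fine weighting, following the smooth-window formulation described in the paper (variable names and the schedule value are illustrative):

```python
import math
import torch

def coarse_to_fine_encoding(x, num_freqs: int, alpha: float):
    """Positional-encode x, smoothly enabling frequency bands as alpha grows 0 -> num_freqs.

    x:      (..., D) input coordinates
    alpha:  training-progress-dependent scalar; band k is off for alpha < k,
            fully on for alpha > k + 1, and cosine-ramped in between.
    """
    feats = [x]
    for k in range(num_freqs):
        w = (1 - math.cos(math.pi * min(max(alpha - k, 0.0), 1.0))) / 2
        for fn in (torch.sin, torch.cos):
            feats.append(w * fn((2 ** k) * math.pi * x))
    return torch.cat(feats, dim=-1)

coords = torch.rand(1024, 3)
early = coarse_to_fine_encoding(coords, num_freqs=10, alpha=1.5)    # mostly low frequencies
late  = coarse_to_fine_encoding(coords, num_freqs=10, alpha=10.0)   # all bands active
print(early.shape, late.shape)    # both (1024, 3 + 3*2*10) = (1024, 63)
```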

Posted Content•
TL;DR: In this article, the authors investigate the problem of zero-shot semantic image painting, which is to point to a location in a synthesized image and apply an arbitrary new concept such as "rustic" or "opulent" or "happy dog".
Abstract: We investigate the problem of zero-shot semantic image painting. Instead of painting modifications into an image using only concrete colors or a finite set of semantic concepts, we ask how to create semantic paint based on open full-text descriptions: our goal is to be able to point to a location in a synthesized image and apply an arbitrary new concept such as "rustic" or "opulent" or "happy dog." To do this, our method combines a state-of-the-art generative model of realistic images with a state-of-the-art text-image semantic similarity network. We find that, to make large changes, it is important to use non-gradient methods to explore latent space, and it is important to relax the computations of the GAN to target changes to a specific region. We conduct user studies to compare our methods to several baselines.

Posted Content•
TL;DR: The ThreeDWorld Transport Challenge as discussed by the authors is a task-and-motion planning benchmark for the task of finding a small set of objects scattered around the house, picking them up, and transporting them to a desired final location.
Abstract: We introduce a visually-guided and physics-driven task-and-motion planning benchmark, which we call the ThreeDWorld Transport Challenge. In this challenge, an embodied agent equipped with two 9-DOF articulated arms is spawned randomly in a simulated physical home environment. The agent is required to find a small set of objects scattered around the house, pick them up, and transport them to a desired final location. We also position containers around the house that can be used as tools to assist with transporting objects efficiently. To complete the task, an embodied agent must plan a sequence of actions to change the state of a large number of objects in the face of realistic physical constraints. We build this benchmark challenge using the ThreeDWorld simulation: a virtual 3D environment where all objects respond to physics, and where the agent can be controlled using a fully physics-driven navigation and interaction API. We evaluate several existing agents on this benchmark. Experimental results suggest that: 1) a pure RL model struggles on this challenge; 2) hierarchical planning-based agents can transport some objects but are still far from solving this task. We anticipate that this benchmark will empower researchers to develop more intelligent physics-driven robots for the physical world.

Posted Content•
TL;DR: In this article, a high-resolution tactile glove is employed to perform four different interactive activities on a diversified set of objects, and the tactile model aims to predict the 3D locations of both the hand and the object purely from the touch data by combining a predictive model and a contrastive learning module.
Abstract: Tactile sensing is critical for humans to perform everyday tasks. While significant progress has been made in analyzing object grasping from vision, it remains unclear how we can utilize tactile sensing to reason about and model the dynamics of hand-object interactions. In this work, we employ a high-resolution tactile glove to perform four different interactive activities on a diversified set of objects. We build our model on a cross-modal learning framework and generate the labels using a visual processing pipeline to supervise the tactile model, which can then be used on its own at test time. The tactile model aims to predict the 3D locations of both the hand and the object purely from the touch data by combining a predictive model and a contrastive learning module. This framework can reason about the interaction patterns from the tactile data, hallucinate the changes in the environment, estimate the uncertainty of the prediction, and generalize to unseen objects. We also provide detailed ablation studies regarding different system designs as well as visualizations of the predicted trajectories. This work takes a step toward dynamics modeling in hand-object interactions from dense tactile sensing, which opens the door for future applications in activity learning, human-computer interactions, and imitation learning for robotics.

Proceedings Article•
03 May 2021
TL;DR: Watch-And-Help (WAH), as discussed by the authors, is a challenge for testing social intelligence in agents, where an AI agent needs to help a human-like agent perform a complex household task efficiently.
Abstract: In this paper, we introduce Watch-And-Help (WAH), a challenge for testing social intelligence in agents. In WAH, an AI agent needs to help a human-like agent perform a complex household task efficiently. To succeed, the AI agent needs to i) understand the underlying goal of the task by watching a single demonstration of the human-like agent performing the same task (social perception), and ii) coordinate with the human-like agent to solve the task in an unseen environment as fast as possible (human-AI collaboration). For this challenge, we build VirtualHome-Social, a multi-agent household environment, and provide a benchmark including both planning- and learning-based baselines. We evaluate the performance of AI agents with the human-like agent as well as with real humans using objective metrics and subjective user ratings. Experimental results demonstrate that our challenge and virtual environment enable a systematic evaluation of the important aspects of machine social intelligence at scale.

Posted Content•
TL;DR: In this paper, a contrastive weakly supervised training loss is introduced to jointly associate spatio-temporal regions in a video with an action and object vocabulary and encourage temporal continuity of the visual appearance of moving objects as a form of self-supervision.
Abstract: We introduce the task of weakly supervised learning for detecting human and object interactions in videos. Our task poses unique challenges as a system does not know what types of human-object interactions are present in a video or the actual spatiotemporal location of the human and the object. To address these challenges, we introduce a contrastive weakly supervised training loss that aims to jointly associate spatiotemporal regions in a video with an action and object vocabulary and encourage temporal continuity of the visual appearance of moving objects as a form of self-supervision. To train our model, we introduce a dataset comprising over 6.5k videos with human-object interaction annotations that have been semi-automatically curated from sentence captions associated with the videos. We demonstrate improved performance over weakly supervised baselines adapted to our task on our video dataset.

Posted Content•
04 Oct 2021
TL;DR: The authors use sparse natural language annotations to guide the discovery of reusable skills for autonomous decision-making; agents can then plan by generating high-level instruction sequences tailored to novel goals.
Abstract: We present a framework for learning hierarchical policies from demonstrations, using sparse natural language annotations to guide the discovery of reusable skills for autonomous decision-making. We formulate a generative model of action sequences in which goals generate sequences of high-level subtask descriptions, and these descriptions generate sequences of low-level actions. We describe how to train this model using primarily unannotated demonstrations by parsing demonstrations into sequences of named high-level subtasks, using only a small number of seed annotations to ground language in action. In trained models, the space of natural language commands indexes a combinatorial library of skills; agents can use these skills to plan by generating high-level instruction sequences tailored to novel goals. We evaluate this approach in the ALFRED household simulation environment, providing natural language annotations for only 10% of demonstrations. It completes more than twice as many tasks as a standard approach to learning from demonstrations, matching the performance of instruction following models with access to ground-truth plans during both training and evaluation.

Posted Content•
TL;DR: DatasetGAN as mentioned in this paper uses GANs to generate high-quality semantically segmented images, which can then be used for training any computer vision architecture just as real datasets are.
Abstract: We introduce DatasetGAN: an automatic procedure to generate massive datasets of high-quality semantically segmented images requiring minimal human effort. Current deep networks are extremely data-hungry, benefiting from training on large-scale datasets, which are time consuming to annotate. Our method relies on the power of recent GANs to generate realistic images. We show how the GAN latent code can be decoded to produce a semantic segmentation of the image. Training the decoder only needs a few labeled examples to generalize to the rest of the latent space, resulting in an infinite annotated dataset generator! These generated datasets can then be used for training any computer vision architecture just as real datasets are. As only a few images need to be manually segmented, it becomes possible to annotate images in extreme detail and generate datasets with rich object and part segmentations. To showcase the power of our approach, we generated datasets for 7 image segmentation tasks which include pixel-level labels for 34 human face parts, and 32 car parts. Our approach outperforms all semi-supervised baselines significantly and is on par with fully supervised methods, which in some cases require as much as 100x more annotated data as our method.

Posted Content•
TL;DR: In this article, the authors combine Neural Radiance Fields (NeRF) and time contrastive learning with an autoencoding framework to learn viewpoint-invariant 3D-aware scene representations.
Abstract: Humans have a strong intuitive understanding of the 3D environment around us. The mental model of the physics in our brain applies to objects of different materials and enables us to perform a wide range of manipulation tasks that are far beyond the reach of current robots. In this work, we desire to learn models for dynamic 3D scenes purely from 2D visual observations. Our model combines Neural Radiance Fields (NeRF) and time contrastive learning with an autoencoding framework, which learns viewpoint-invariant 3D-aware scene representations. We show that a dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks involving both rigid bodies and fluids, where the target is specified in a viewpoint different from what the robot operates on. When coupled with an auto-decoding framework, it can even support goal specification from camera viewpoints that are outside the training distribution. We further demonstrate the richness of the learned 3D dynamics model by performing future prediction and novel view synthesis. Finally, we provide detailed ablation studies regarding different system designs and qualitative analysis of the learned representations.
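
One ingredient above is time contrastive learning: views of the same scene state captured at the same time should embed nearby, with other timesteps serving as negatives. A minimal InfoNCE-style sketch of that objective (the encoder and tensor shapes are placeholders, not the paper's NeRF-based pipeline):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))   # stand-in scene encoder

def time_contrastive_loss(view_a, view_b, temperature=0.1):
    """view_a[i] and view_b[i] show the same scene state at time i from two cameras."""
    za = F.normalize(encoder(view_a), dim=-1)
    zb = F.normalize(encoder(view_b), dim=-1)
    logits = za @ zb.t() / temperature         # (T, T): same-time pairs lie on the diagonal
    targets = torch.arange(len(za))
    return F.cross_entropy(logits, targets)

cam1 = torch.rand(16, 3, 32, 32)    # 16 timesteps from camera 1
cam2 = torch.rand(16, 3, 32, 32)    # the same timesteps from camera 2
loss = time_contrastive_loss(cam1, cam2)
loss.backward()
```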

Posted Content•
TL;DR: In this article, a generative adversarial network (GAN) is proposed to capture the joint image-label distribution and is trained efficiently using a large set of unlabeled images supplemented with only a few labeled ones.
Abstract: Training deep networks with limited labeled data while achieving a strong generalization ability is key in the quest to reduce human annotation efforts. This is the goal of semi-supervised learning, which exploits more widely available unlabeled data to complement small labeled data sets. In this paper, we propose a novel framework for discriminative pixel-level tasks using a generative model of both images and labels. Concretely, we learn a generative adversarial network that captures the joint image-label distribution and is trained efficiently using a large set of unlabeled images supplemented with only a few labeled ones. We build our architecture on top of StyleGAN2, augmented with a label synthesis branch. Image labeling at test time is achieved by first embedding the target image into the joint latent space via an encoder network and test-time optimization, and then generating the label from the inferred embedding. We evaluate our approach in two important domains: medical image segmentation and part-based face segmentation. We demonstrate strong in-domain performance compared to several baselines, and are the first to showcase extreme out-of-domain generalization, such as transferring from CT to MRI in medical imaging, and photographs of real faces to paintings, sculptures, and even cartoons and animal faces. Project Page: \url{this https URL}

Journal Article•
TL;DR: This work proposes a novel type of intrinsic motivation for Reinforcement Learning (RL) that encourages the agent to understand the causal effect of its actions through auditory event prediction and uses the prediction errors as intrinsic rewards to guide RL exploration.
Abstract: Humans integrate multiple sensory modalities (e.g., visual and audio) to build a causal understanding of the physical world. In this work, we propose a novel type of intrinsic motivation for Reinforcement Learning (RL) that encourages the agent to understand the causal effect of its actions through auditory event prediction. First, we allow the agent to collect a small amount of acoustic data and use K-means to discover underlying auditory event clusters. We then train a neural network to predict the auditory events and use the prediction errors as intrinsic rewards to guide RL exploration. We first conduct an in-depth analysis of our module using a set of Atari games. We then apply our model to audio-visual exploration using the Habitat simulator and active learning using the TDW simulator. Experimental results demonstrate the advantages of using audio signals over vision-based models as intrinsic rewards to guide RL explorations.
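
The reward pipeline described here clusters a small amount of audio with K-means, trains a predictor of the cluster ID, and uses its prediction error as the intrinsic reward. A hedged sketch of that loop (feature dimensions and the predictor network are illustrative; K-means comes from scikit-learn):

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

# 1) Discover auditory event clusters from a small warm-up set of audio features.
warmup_audio = np.random.rand(500, 40)               # stand-in audio features (e.g., spectra)
kmeans = KMeans(n_clusters=8, n_init=10).fit(warmup_audio)

# 2) Predictor: given the agent's observation features, predict the auditory cluster.
predictor = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 8))
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def intrinsic_reward(obs_feat, audio_feat):
    """Prediction error on the auditory event serves as the exploration bonus."""
    target = torch.tensor(int(kmeans.predict(audio_feat[None])[0]))
    logits = predictor(torch.as_tensor(obs_feat, dtype=torch.float32))
    loss = F.cross_entropy(logits[None], target[None])
    opt.zero_grad()
    loss.backward()
    opt.step()                                        # keep the predictor up to date online
    return loss.item()                                # surprising sounds => high reward

r = intrinsic_reward(np.random.rand(64), np.random.rand(40))
print(f"intrinsic reward: {r:.3f}")
```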

Journal Article•DOI•
01 Apr 2021
TL;DR: A Correction to this paper has been published: https://doi.org/10.1038/s41928-021-00572-2.
Abstract: A Correction to this paper has been published: https://doi.org/10.1038/s41928-021-00572-2.

Proceedings Article•
06 Dec 2021
TL;DR: In this paper, the authors investigate a suite of image generation models that produce images from simple random processes, which are then used as training data for a visual representation learner with a contrastive loss.
Abstract: Current vision systems are trained on huge datasets, and these datasets come with costs: curation is expensive, they inherit human biases, and there are concerns over privacy and usage rights. To counter these costs, interest has surged in learning from cheaper data sources, such as unlabeled images. In this paper we go a step further and ask if we can do away with real image datasets entirely, instead learning from noise processes. We investigate a suite of image generation models that produce images from simple random processes. These are then used as training data for a visual representation learner with a contrastive loss. We study two types of noise processes, statistical image models and deep generative models under different random initializations. Our findings show that it is important for the noise to capture certain structural properties of real data but that good performance can be achieved even with processes that are far from realistic. We also find that diversity is a key property to learn good representations. Datasets, models, and code are available at this https URL.
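
The study trains a contrastive learner on images sampled from simple random processes rather than real photos. A toy sketch of one such statistical image model (spectrum-shaped noise) together with the two-crop positive-pair construction; the specific noise process and crop sizes are illustrative, not the paper's generators:

```python
import numpy as np
import torch

def spectrum_noise_image(size=128, exponent=2.0):
    """Sample an image whose power spectrum falls off roughly as 1/f^exponent."""
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    radius[0, 0] = 1.0                                    # avoid division by zero at DC
    spectrum = (np.random.randn(size, size)
                + 1j * np.random.randn(size, size)) / radius ** (exponent / 2)
    img = np.fft.ifft2(spectrum).real
    img = (img - img.min()) / (img.max() - img.min())     # normalize to [0, 1]
    return np.stack([img] * 3)                            # fake 3-channel image

def two_crops(img, crop=64):
    """Two random crops of the same noise image form a positive pair for contrastive learning."""
    _, h, w = img.shape
    crops = []
    for _ in range(2):
        y, x = np.random.randint(0, h - crop), np.random.randint(0, w - crop)
        crops.append(torch.tensor(img[:, y:y + crop, x:x + crop], dtype=torch.float32))
    return crops

batch = [two_crops(spectrum_noise_image()) for _ in range(8)]
view1 = torch.stack([a for a, _ in batch])
view2 = torch.stack([b for _, b in batch])
print(view1.shape, view2.shape)   # (8, 3, 64, 64) each; feed to any contrastive loss
```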

Proceedings Article•
04 May 2021
TL;DR: Energy-Based Models (EBMs) have been shown to be adaptable to a more general continual learning setting where the data distribution changes without the notion of explicitly delineated tasks as discussed by the authors.
Abstract: We motivate Energy-Based Models (EBMs) as a promising model class for continual learning problems. Instead of tackling continual learning via the use of external memory, growing models, or regularization, EBMs have a natural way to support a dynamically-growing number of tasks and classes and less interference with old tasks. We show that EBMs are adaptable to a more general continual learning setting where the data distribution changes without the notion of explicitly delineated tasks. We also find that EBMs outperform the baseline methods by a large margin on several continual learning benchmarks. These observations point towards EBMs as a class of models naturally inclined towards the continual learning regime.
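
The appeal of the energy-based formulation is that adding a class never resizes a softmax head: each class gets an embedding, the model scores (input, class) pairs, and prediction is an argmin over the energies of the classes seen so far. A small hedged sketch of that mechanic (architecture, loss, and sizes are illustrative, not the paper's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnergyClassifier(nn.Module):
    """Scores (input, class) pairs; low energy means a compatible pair."""
    def __init__(self, in_dim=32, embed_dim=64, max_classes=100):
        super().__init__()
        self.class_embed = nn.Embedding(max_classes, embed_dim)
        self.net = nn.Sequential(nn.Linear(in_dim + embed_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def energy(self, x, y):
        return self.net(torch.cat([x, self.class_embed(y)], dim=-1)).squeeze(-1)

    def predict(self, x, candidate_classes):
        # Evaluate the energy of x against every currently known class, take the argmin.
        energies = torch.stack(
            [self.energy(x, torch.full((len(x),), c)) for c in candidate_classes], dim=1)
        return torch.tensor(candidate_classes)[energies.argmin(dim=1)]

model = EnergyClassifier()
x = torch.randn(16, 32)
y = torch.randint(0, 5, (16,))

# Contrastive-style step: push true-pair energies below those of a random wrong class.
wrong = (y + torch.randint(1, 5, (16,))) % 5
loss = F.softplus(model.energy(x, y) - model.energy(x, wrong)).mean()
loss.backward()

print(model.predict(x, candidate_classes=[0, 1, 2, 3, 4])[:5])
```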

Posted Content•
TL;DR: The authors leverage the feedback signal provided by the forward process and learn an iterative update model, where at each iteration, the neural network takes the feedback as input and outputs an update on the current estimation.
Abstract: We present an efficient, effective, and generic approach towards solving inverse problems. The key idea is to leverage the feedback signal provided by the forward process and learn an iterative update model. Specifically, at each iteration, the neural network takes the feedback as input and outputs an update on the current estimation. Our approach does not have any restrictions on the forward process; it does not require any prior knowledge either. Through the feedback information, our model not only can produce accurate estimations that are coherent to the input observation but also is capable of recovering from early incorrect predictions. We verify the performance of our approach over a wide range of inverse problems, including 6-DOF pose estimation, illumination estimation, as well as inverse kinematics. Compared to traditional optimization-based methods, we can achieve comparable or better performance while being two to three orders of magnitude faster. Compared to deep learning-based approaches, our model consistently improves the performance on all metrics. Please refer to the project page for videos, animations, supplementary materials, etc.
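
The approach wraps a known forward process in a learned iterative update: at each step the network sees how the current estimate's forward prediction deviates from the observation and proposes a correction. A generic sketch on a toy linear inverse problem (recovering x from y = Ax); the forward model and update network are placeholders, not any of the paper's benchmark tasks:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
A = torch.randn(10, 4)                       # toy known forward process: y = A @ x

def forward_process(x):                      # any simulator works; only its outputs are needed
    return x @ A.t()

class UpdateModel(nn.Module):
    """Takes (current estimate, feedback) and outputs a correction to the estimate."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4 + 10, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, x_est, feedback):
        return self.net(torch.cat([x_est, feedback], dim=-1))

updater = UpdateModel()
opt = torch.optim.Adam(updater.parameters(), lr=1e-3)

for step in range(500):                      # train on randomly generated problems
    x_true = torch.randn(32, 4)
    y_obs = forward_process(x_true)
    x_est = torch.zeros(32, 4)
    loss = 0.0
    for _ in range(5):                       # unroll a few update iterations
        feedback = y_obs - forward_process(x_est)      # how far off the current guess is
        x_est = x_est + updater(x_est, feedback)
        loss = loss + ((x_est - x_true) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final per-iteration loss: {loss.item() / 5:.4f}")
```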

Posted Content•
TL;DR: The authors develop margin-based generalization bounds in which the margins are normalized with optimal transport costs between independent random subsets sampled from the training distribution; this cost can be interpreted as a generalization of variance that captures the structural properties of the learned feature space.
Abstract: Understanding the generalization of deep neural networks is one of the most important tasks in deep learning. Although much progress has been made, theoretical error bounds still often behave disparately from empirical observations. In this work, we develop margin-based generalization bounds, where the margins are normalized with optimal transport costs between independent random subsets sampled from the training distribution. In particular, the optimal transport cost can be interpreted as a generalization of variance which captures the structural properties of the learned feature space. Our bounds robustly predict the generalization error, given training data and network parameters, on large scale datasets. Theoretically, we demonstrate that the concentration and separation of features play crucial roles in generalization, supporting empirical results in the literature. The code is available at \url{this https URL}.

Proceedings Article•
06 Dec 2021
TL;DR: In this paper, the authors propose to represent each relation as an unnormalized density (an energy-based model), enabling them to compose separate relations in a factorized manner, which allows the model to both generate and edit scenes that have multiple sets of relations more faithfully.
Abstract: The visual world around us can be described as a structured set of objects and their associated relations. An image of a room may be conjured given only the description of the underlying objects and their associated relations. While there has been significant work on designing deep neural networks which may compose individual objects together, less work has been done on composing the individual relations between objects. A principal difficulty is that while the placement of objects is mutually independent, their relations are entangled and dependent on each other. To circumvent this issue, existing works primarily compose relations by utilizing a holistic encoder, in the form of text or graphs. In this work, we instead propose to represent each relation as an unnormalized density (an energy-based model), enabling us to compose separate relations in a factorized manner. We show that such a factorized decomposition allows the model to both generate and edit scenes that have multiple sets of relations more faithfully. We further show that decomposition enables our model to effectively understand the underlying relational scene structure. Project page at: https://composevisualrelations.github.io/.
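
Representing each relation as an energy makes composition trivial: energies add, and a scene satisfying all relations is a low-energy configuration of the sum. A toy sketch with objects reduced to 2D positions and two hand-written relation energies, sampled by Langevin-style descent (everything here is illustrative; the paper learns these energies with neural networks over images):

```python
import torch

def left_of(pos, i, j, margin=0.5):
    """Low energy when object i is at least `margin` to the left of object j."""
    return torch.relu(pos[i, 0] - pos[j, 0] + margin)

def above(pos, i, j, margin=0.5):
    """Low energy when object i is at least `margin` above object j (y-up convention)."""
    return torch.relu(pos[j, 1] - pos[i, 1] + margin)

def total_energy(pos):
    # Composition is just a sum of per-relation energies:
    #   "object 0 is left of object 1" AND "object 2 is above object 1".
    return left_of(pos, 0, 1) + above(pos, 2, 1)

pos = torch.randn(3, 2, requires_grad=True)           # three objects, random initial layout
step_size, noise = 0.1, 0.01

for _ in range(200):                                   # Langevin-style sampling / minimization
    energy = total_energy(pos)
    grad, = torch.autograd.grad(energy, pos)
    with torch.no_grad():
        pos -= step_size * grad
        pos += noise * torch.randn_like(pos)

print(total_energy(pos).item())                        # near zero: all relations satisfied
print(pos.detach())
```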

Posted Content•
TL;DR: In this article, Bundle-Adjusting Neural Radiance Fields (BARF) are proposed for training NeRF from imperfect (or even unknown) camera poses, which enables view synthesis and localization of video sequences from unknown camera poses.
Abstract: Neural Radiance Fields (NeRF) have recently gained a surge of interest within the computer vision community for its power to synthesize photorealistic novel views of real-world scenes. One limitation of NeRF, however, is its requirement of accurate camera poses to learn the scene representations. In this paper, we propose Bundle-Adjusting Neural Radiance Fields (BARF) for training NeRF from imperfect (or even unknown) camera poses -- the joint problem of learning neural 3D representations and registering camera frames. We establish a theoretical connection to classical image alignment and show that coarse-to-fine registration is also applicable to NeRF. Furthermore, we show that naively applying positional encoding in NeRF has a negative impact on registration with a synthesis-based objective. Experiments on synthetic and real-world data show that BARF can effectively optimize the neural scene representations and resolve large camera pose misalignment at the same time. This enables view synthesis and localization of video sequences from unknown camera poses, opening up new avenues for visual localization systems (e.g. SLAM) and potential applications for dense 3D mapping and reconstruction.

Posted Content•
TL;DR: In this article, the authors propose to represent each relation as an unnormalized density (an energy-based model), enabling them to compose separate relations in a factorized manner, which allows the model to both generate and edit scenes that have multiple sets of relations more faithfully.
Abstract: The visual world around us can be described as a structured set of objects and their associated relations. An image of a room may be conjured given only the description of the underlying objects and their associated relations. While there has been significant work on designing deep neural networks which may compose individual objects together, less work has been done on composing the individual relations between objects. A principal difficulty is that while the placement of objects is mutually independent, their relations are entangled and dependent on each other. To circumvent this issue, existing works primarily compose relations by utilizing a holistic encoder, in the form of text or graphs. In this work, we instead propose to represent each relation as an unnormalized density (an energy-based model), enabling us to compose separate relations in a factorized manner. We show that such a factorized decomposition allows the model to both generate and edit scenes that have multiple sets of relations more faithfully. We further show that decomposition enables our model to effectively understand the underlying relational scene structure. Project page at: https://composevisualrelations.github.io/.

Posted Content•
TL;DR: In this paper, a hierarchical clustering-based method is proposed for building large datasets with object segmentation masks, which is shown to reduce annotation time by 76x compared to manual annotation.
Abstract: Manually annotating object segmentation masks is very time-consuming. While interactive segmentation methods offer a more efficient alternative, they become unaffordable at a large scale because the cost grows linearly with the number of annotated masks. In this paper, we propose a highly efficient annotation scheme for building large datasets with object segmentation masks. At a large scale, images contain many object instances with similar appearance. We exploit these similarities by using hierarchical clustering on mask predictions made by a segmentation model. We propose a scheme that efficiently searches through the hierarchy of clusters and selects which clusters to annotate. Humans manually verify only a few masks per cluster, and the labels are propagated to the whole cluster. Through a large-scale experiment to populate 1M unlabeled images with object segmentation masks for 80 object classes, we show that (1) we obtain 1M object segmentation masks with a total annotation time of only 290 hours; (2) we reduce annotation time by 76x compared to manual annotation; (3) the segmentation quality of our masks is on par with that of manually annotated datasets. Code, data, and models are available online.
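
The scheme clusters similar predicted masks hierarchically, has humans verify only a few masks per cluster, and propagates the verdict to the rest. A small hedged sketch of that propagation logic using SciPy's hierarchical clustering (the mask embeddings, distance threshold, and verification rule are illustrative stand-ins):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Stand-in: one embedding per predicted mask (in practice, features from the segmentation model).
mask_embeddings = rng.normal(size=(1000, 64))

# 1) Hierarchically cluster the masks by appearance similarity.
Z = linkage(mask_embeddings, method="average", metric="cosine")
clusters = fcluster(Z, t=0.7, criterion="distance")     # threshold chosen for illustration

def human_verifies(mask_index) -> bool:
    """Placeholder for the human-in-the-loop check of a single mask."""
    return rng.random() > 0.1                            # pretend 90% of predictions are good

# 2) Verify a handful of masks per cluster and propagate the decision to the whole cluster.
accepted = np.zeros(len(mask_embeddings), dtype=bool)
num_checks = 0
for c in np.unique(clusters):
    members = np.flatnonzero(clusters == c)
    sample = rng.choice(members, size=min(3, len(members)), replace=False)
    num_checks += len(sample)
    if all(human_verifies(i) for i in sample):           # a few checks stand in for the cluster
        accepted[members] = True

print(f"{accepted.sum()} / {len(accepted)} masks accepted with {num_checks} human verifications")
```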