Showing papers by "Dhruv Batra published in 2021"

PDF

Open Access

Posted Content•

Ego4D: Around the World in 3,000 Hours of Egocentric Video

[...]

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh K. Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Christian Fuegen, Abrham Gebreselasie, Cristina González, James Hillis, Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Yunyi Zhu, Pablo Arbeláez, David J. Crandall, Dima Damen, Giovanni Maria Farinella, Bernard Ghanem, Vamsi K. Ithapu, C. V. Jawahar, Hanbyul Joo, Kris M. Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik - Show less +80 more

13 Oct 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: The Ego4D dataset as mentioned in this paper was used for de-identification of videos by some of the universities, such as the University of Bristol and the National University of Singapore.

...read moreread less

Abstract: We gratefully acknowledge the following colleagues for valuable discussions and support of our project: Aaron Adcock, Andrew Allen, Behrouz Behmardi, Serge Belongie, Mark Broyles, Xiao Chu, Samuel Clapp, Irene D’Ambra, Peter Dodds, Jacob Donley, Ruohan Gao, Tal Hassner, EthanHenderson, Jiabo Hu, Guillaume Jeanneret, Sanjana Krishnan, Tsung-Yi Lin, Bobby Otillar, Manohar Paluri, Maja Pantic, Lucas Pinto, Vivek Roy, Jerome Pesenti, Joelle Pineau, Luca Sbordone, Rajan Subramanian, Helen Sun, Mary Williamson, and Bill Wu. We also acknowledge Jacob Chalk for setting up the Ego4D AWS backend and Prasanna Sridhar for developing the Ego4D website. Thank you to the Common Visual Data Foundation (CVDF) for hosting the Ego4D dataset. The universities acknowledge the usage of commercial software for de-identification of video. brighter.ai was used for redacting videos by some of the universities. Personal data from the University of Bristol was protected by Primloc’s Secure Redact software suite. UNICT is supported by MIUR AIM - Attrazione eMobilitaInternazionale Linea 1 - AIM1893589 - CUP E64118002540007. Bristol is supported by UKRIEngineering and Physical Sciences Research Council (EPSRC) Doctoral Training Program (DTP), EPSRC Fellowship UMPIRE (EP/T004991/1). KAUST is supported by the KAUST Office of Sponsored Research through the Visual Computing Center (VCC) funding. National University of Singapore is supported by Mike Shou’s Start-Up Grant. Georgia Tech is supported in part by NSF award 2033413 and NIH award R01MH114999.

...read moreread less

19 citations

Journal Article•DOI•

Bi-Directional Domain Adaptation for Sim2Real Transfer of Embodied Navigation Agents

[...]

Joanne Truong¹, Sonia Chernova¹, Dhruv Batra¹•Institutions (1)

Georgia Institute of Technology¹

25 Feb 2021

TL;DR: Bi-directional domain adaptation (BDA) as mentioned in this paper is a novel approach to bridge the sim-vs-real gap in both directions, where real2sim is used to bridge visual domain gap and sim2real to bridge dynamics domain gap.

...read moreread less

Abstract: Deep reinforcement learning models are notoriously data hungry, yet real-world data is expensive and time consuming to obtain. The solution that many have turned to is to use simulation for training before deploying the robot in a real environment. Simulation offers the ability to train large numbers of robots in parallel, and offers an abundance of data. However, no simulation is perfect, and robots trained solely in simulation fail to generalize to the real-world, resulting in a “sim-vs-real gap”. How can we overcome the trade-off between the abundance of less accurate, artificial data from simulators and the scarcity of reliable, real-world data? In this letter, we propose Bi-directional Domain Adaptation (BDA), a novel approach to bridge the sim-vs-real gap in both directions– real2sim to bridge the visual domain gap, and sim2real to bridge the dynamics domain gap. We demonstrate the benefits of BDA on the task of PointGoal Navigation. BDA with only 5 k real-world (state, action, next-state) samples matches the performance of a policy fine-tuned with $\sim$ 600 k samples, resulting in a speed-up of $\sim 120\times$ .

...read moreread less

14 citations

Proceedings Article•

Auxiliary Tasks and Exploration Enable ObjectGoal Navigation

[...]

Joel Ye¹, Dhruv Batra², Abhishek Das², Erik Wijmans¹•Institutions (2)

Georgia Institute of Technology¹, Facebook²

01 Jan 2021

14 citations

Posted Content•

Habitat 2.0: Training Home Assistants to Rearrange their Habitat

[...]

Andrew Szot¹, Alexander Clegg², Eric Undersander, Erik Wijmans³, Yili Zhao², John Turner, Noah Maestre, Mustafa Mukadam², Devendra Singh Chaplot², Oleksandr Maksymets², Aaron Gokaslan⁴, Vladimir Vondrus, Sameer Dharur³, Franziska Meier¹, Wojciech Galuba², Angel X. Chang⁵, Zsolt Kira³, Vladlen Koltun⁶, Jitendra Malik², Manolis Savva⁷, Dhruv Batra³ - Show less +17 more•Institutions (7)

University of Southern California¹, Facebook², Georgia Institute of Technology³, Brown University⁴, University of Illinois at Urbana–Champaign⁵, Intel⁶, Simon Fraser University⁷

28 Jun 2021-arXiv: Learning

TL;DR: H2.0 as discussed by the authors is a simulation platform for training virtual robots in interactive 3D environments and complex physics-enabled scenarios, which includes a suite of common tasks for assistive robots (tidy the house, prepare groceries, set the table).

...read moreread less

Abstract: We introduce Habitat 2.0 (H2.0), a simulation platform for training virtual robots in interactive 3D environments and complex physics-enabled scenarios. We make comprehensive contributions to all levels of the embodied AI stack - data, simulation, and benchmark tasks. Specifically, we present: (i) ReplicaCAD: an artist-authored, annotated, reconfigurable 3D dataset of apartments (matching real spaces) with articulated objects (e.g. cabinets and drawers that can open/close); (ii) H2.0: a high-performance physics-enabled 3D simulator with speeds exceeding 25,000 simulation steps per second (850x real-time) on an 8-GPU node, representing 100x speed-ups over prior work; and, (iii) Home Assistant Benchmark (HAB): a suite of common tasks for assistive robots (tidy the house, prepare groceries, set the table) that test a range of mobile manipulation capabilities. These large-scale engineering contributions allow us to systematically compare deep reinforcement learning (RL) at scale and classical sense-plan-act (SPA) pipelines in long-horizon structured tasks, with an emphasis on generalization to new objects, receptacles, and layouts. We find that (1) flat RL policies struggle on HAB compared to hierarchical ones; (2) a hierarchy with independent skills suffers from 'hand-off problems', and (3) SPA pipelines are more brittle than RL policies.

...read moreread less

13 citations

Proceedings Article•

Waypoint Models for Instruction-guided Navigation in Continuous Environments

[...]

Jacob Krantz¹, Aaron Gokaslan², Dhruv Batra³, Stefan Lee⁴, Oleksandr Maksymets⁵ - Show less +1 more•Institutions (5)

Oregon State University¹, Brown University², Georgia Institute of Technology³, Microsoft⁴, Facebook⁵

05 Oct 2021

TL;DR: In this article, a class of language-conditioned waypoint prediction networks was developed to explore the role of action spaces in language-guided visual navigation, either in terms of its effect on navigation success or the efficiency with which a robotic agent could execute the resulting trajectory.

...read moreread less

Abstract: Little inquiry has explicitly addressed the role of action spaces in language-guided visual navigation -- either in terms of its effect on navigation success or the efficiency with which a robotic agent could execute the resulting trajectory. Building on the recently released VLN-CE setting for instruction following in continuous environments, we develop a class of language-conditioned waypoint prediction networks to examine this question. We vary the expressivity of these models to explore a spectrum between low-level actions and continuous waypoint prediction. We measure task performance and estimated execution time on a profiled LoCoBot robot. We find more expressive models result in simpler, faster to execute trajectories, but lower-level actions can achieve better navigation metrics by approximating shortest paths better. Further, our models outperform prior work in VLN-CE and set a new state-of-the-art on the public leaderboard -- increasing success rate by 4% with our best model on this challenging task.

...read moreread less

11 citations

Proceedings Article•

THDA: Treasure Hunt Data Augmentation for Semantic Navigation

[...]

Oleksandr Maksymets¹, Vincent Cartillier², Aaron Gokaslan³, Erik Wijmans², Wojciech Galuba, Stefan Lee⁴, Dhruv Batra¹ - Show less +3 more•Institutions (4)

Facebook¹, Georgia Institute of Technology², Brown University³, Oregon State University⁴

01 Jan 2021

9 citations

Posted Content•

Success Weighted by Completion Time: A Dynamics-Aware Evaluation Criteria for Embodied Navigation.

[...]

Naoki Yokoyama¹, Sehoon Ha¹, Dhruv Batra¹•Institutions (1)

Georgia Institute of Technology¹

14 Mar 2021-arXiv: Robotics

TL;DR: In this paper, the authors present Success weighted by Completion Time (SCT), a new metric for evaluating navigation performance for mobile robots that explicitly takes the agent's dynamics model into consideration, and aims to accurately capture how well the agent has approximated the fastest navigation behavior afforded by its dynamics.

...read moreread less

Abstract: We present Success weighted by Completion Time (SCT), a new metric for evaluating navigation performance for mobile robots. Several related works on navigation have used Success weighted by Path Length (SPL) as the primary method of evaluating the path an agent makes to a goal location, but SPL is limited in its ability to properly evaluate agents with complex dynamics. In contrast, SCT explicitly takes the agent's dynamics model into consideration, and aims to accurately capture how well the agent has approximated the fastest navigation behavior afforded by its dynamics. While several embodied navigation works use point-turn dynamics, we focus on unicycle-cart dynamics for our agent, which better exemplifies the dynamics model of popular mobile robotics platforms (e.g., LoCoBot, TurtleBot, Fetch, etc.). We also present RRT*-Unicycle, an algorithm for unicycle dynamics that estimates the fastest collision-free path and completion time from a starting pose to a goal location in an environment containing obstacles. We experiment with deep reinforcement learning and reward shaping to train and compare the navigation performance of agents with different dynamics models. In evaluating these agents, we show that in contrast to SPL, SCT is able to capture the advantages in navigation speed a unicycle model has over a simpler point-turn model of dynamics. Lastly, we show that we can successfully deploy our trained models and algorithms outside of simulation in the real world. We embody our agents in an real robot to navigate an apartment, and show that they can generalize in a zero-shot manner.

...read moreread less

5 citations

Proceedings Article•

Large Batch Simulation for Deep Reinforcement Learning

[...]

Brennan Shacklett¹, Erik Wijmans², Aleksei Petrenko³, Manolis Savva⁴, Dhruv Batra⁵, Vladlen Koltun⁶, Kayvon Fatahalian⁷ - Show less +3 more•Institutions (7)

Stanford University¹, Georgia Institute of Technology², University of Southern California³, Simon Fraser University⁴, Virginia Tech⁵, Intel⁶, Carnegie Mellon University⁷

03 May 2021

TL;DR: In this article, the authors design a 3D renderer and environment simulator based on the principle of batch simulation, which can accept and execute large batches of requests simultaneously, and demonstrate that PointGoal navigation agents can be trained in complex 3D environments on a single GPU in 1.5 days to 97% of the accuracy of agents trained on a prior state-of-the-art system using a 64-GPU cluster over three days.

...read moreread less

Abstract: We accelerate deep reinforcement learning based training in visually complex 3D environments by two orders of magnitude over prior work, realizing end-to-end training speeds of over 19,000 frames of experience per second on a single GPU (and up to 72,000 frames per second on a single eight-GPU machine). The key idea of our approach is to design a 3D renderer and environment simulator around the principle of “batch simulation”: accepting and executing large batches of requests simultaneously. Beyond exposing large amounts of work at once, batch simulation allows simulator implementations to amortize in-memory storage of scene assets, rendering work, data loading, and synchronization costs across many simulation requests, dramatically improving the number of simulated agents per GPU and overall simulation throughput. To balance DNN inference and training costs with faster simulation, we also build a computationally efficient policy DNN that maintains high task performance, and modify training algorithms to maintain sample efficiency when training with large mini-batches. By combining batch simulation and DNN performance optimizations, we demonstrate that PointGoal navigation agents can be trained in complex 3D environments on a single GPU in 1.5 days to 97% of the accuracy of agents trained on a prior state-of-the-art system using a 64-GPU cluster over three days. We provide open-source reference implementations of our batch 3D renderer and simulator to facilitate incorporation of these ideas into current and future RL systems.

...read moreread less

3 citations

Posted Content•

Auxiliary Tasks and Exploration Enable ObjectNav.

[...]

Joel Ye¹, Dhruv Batra, Abhishek Das, Erik Wijmans•Institutions (1)

Georgia Institute of Technology¹

08 Apr 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, the authors propose to add auxiliary learning tasks and an exploration reward to a generic learned agent to simplify visual inputs so as to smooth their RNN dynamics and reduce overfitting by minimizing effective RNN dimensionality.

...read moreread less

Abstract: ObjectGoal Navigation (ObjectNav) is an embodied task wherein agents are to navigate to an object instance in an unseen environment. Prior works have shown that end-to-end ObjectNav agents that use vanilla visual and recurrent modules, e.g. a CNN+RNN, perform poorly due to overfitting and sample inefficiency. This has motivated current state-of-the-art methods to mix analytic and learned components and operate on explicit spatial maps of the environment. We instead re-enable a generic learned agent by adding auxiliary learning tasks and an exploration reward. Our agents achieve 24.5% success and 8.1% SPL, a 37% and 8% relative improvement over prior state-of-the-art, respectively, on the Habitat ObjectNav Challenge. From our analysis, we propose that agents will act to simplify their visual inputs so as to smooth their RNN dynamics, and that auxiliary tasks reduce overfitting by minimizing effective RNN dimensionality; i.e. a performant ObjectNav agent that must maintain coherent plans over long horizons does so by learning smooth, low-dimensional recurrent dynamics. Site: this https URL

...read moreread less

3 citations

Proceedings Article•

Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views

[...]

Vincent Cartillier¹, Zhile Ren¹, Neha Jain¹, Stefan Lee², Irfan Essa, Dhruv Batra - Show less +2 more•Institutions (2)

Georgia Institute of Technology¹, Oregon State University²

18 May 2021

TL;DR: Semantic MapNet (SMNet) as mentioned in this paper combines the strengths of egocentric visual encoder and spatial memory tensor to produce semantic top-down maps of 3D spaces.

...read moreread less

Abstract: We study the task of semantic mapping – specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment and asked to build an allocentric top-down semantic map (‘what is where?’) from egocentric observations of an RGB-D camera with known pose (via localization sensors). Importantly, our goal is to build neural episodic memories and spatio-semantic representations of 3D spaces that enable the agent to easily learn subsequent tasks in the same space – navigating to objects seen during the tour (‘Find chair’) or answering questions about the space (‘How many chairs did you see in the house?’). Towards this goal, we present Semantic MapNet (SMNet), which consists of: (1) an Egocentric Visual Encoder that encodes each egocentric RGB-D frame, (2) a Feature Projector that projects egocentric features to appropriate locations on a floor-plan, (3) a Spatial Memory Tensor of size floor-plan length×width×feature-dims that learns to accumulate projected egocentric features, and (4) a Map Decoder that uses the memory tensor to produce semantic top-down maps. SMNet combines the strengths of (known) projective camera geometry and neural representation learning. On the task of semantic mapping in the Matterport3D dataset, SMNet significantly outperforms competitive baselines by 4.01−16.81% (absolute) on mean-IoU and 3.81−19.69% (absolute) on Boundary-F1 metrics. Moreover, we show how to use the spatio-semantic allocentric representations build by SMNet for the task of ObjectNav and Embodied Question Answering. Project page: https://vincentcartillier.github.io/smnet.html.

...read moreread less

3 citations

Proceedings Article•

Contrast and Classify: Training Robust VQA Models

[...]

Yash Kant¹, Abhinav Moudgil², Dhruv Batra³, Devi Parikh³, Harsh Agrawal¹ - Show less +1 more•Institutions (3)

Georgia Institute of Technology¹, International Institute of Information Technology, Hyderabad², Facebook³

01 Jan 2021

TL;DR: The authors proposed a contrastive loss to encourage representations to be robust to linguistic variations in questions while the cross-entropy loss preserves the discriminative power of representations for answer prediction and showed that optimizing both losses is key to effective training.

...read moreread less

Abstract: Recent Visual Question Answering (VQA) models have shown impressive performance on the VQA benchmark but remain sensitive to small linguistic variations in input questions. Existing approaches address this by augmenting the dataset with question paraphrases from visual question generation models or adversarial perturbations. These approaches use the combined data to learn an answer classifier by minimizing the standard cross-entropy loss. To more effectively leverage augmented data, we build on the recent success in contrastive learning. We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses. The contrastive loss encourages representations to be robust to linguistic variations in questions while the cross-entropy loss preserves the discriminative power of representations for answer prediction. We find that optimizing both losses -- either alternately or jointly -- is key to effective training. On the VQA-Rephrasings benchmark, which measures the VQA model's answer consistency across human paraphrases of a question, ConClaT improves Consensus Score by 1 .63% over an improved baseline. In addition, on the standard VQA 2.0 benchmark, we improve the VQA accuracy by 0.78% overall. We also show that ConClaT is agnostic to the type of data-augmentation strategy used.

...read moreread less

Proceedings Article•

Habitat 2.0: Training Home Assistants to Rearrange their Habitat

[...]

University of Southern California¹, Facebook², Georgia Institute of Technology³, Brown University⁴, University of Illinois at Urbana–Champaign⁵, Intel⁶, Simon Fraser University⁷

06 Dec 2021

...read moreread less

Posted Content•

Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI.

[...]

Santhosh K. Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X. Chang, Manolis Savva, Yili Zhao, Dhruv Batra - Show less +9 more

16 Sep 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: The Habitat-Matterport 3D (HM3D) dataset as discussed by the authors is a large-scale dataset of 1,000 building-scale 3D reconstructions from a diverse set of real-world locations.

...read moreread less

Abstract: We present the Habitat-Matterport 3D (HM3D) dataset. HM3D is a large-scale dataset of 1,000 building-scale 3D reconstructions from a diverse set of real-world locations. Each scene in the dataset consists of a textured 3D mesh reconstruction of interiors such as multi-floor residences, stores, and other private indoor spaces. HM3D surpasses existing datasets available for academic research in terms of physical scale, completeness of the reconstruction, and visual fidelity. HM3D contains 112.5k m^2 of navigable space, which is 1.4 - 3.7x larger than other building-scale datasets such as MP3D and Gibson. When compared to existing photorealistic 3D datasets such as Replica, MP3D, Gibson, and ScanNet, images rendered from HM3D have 20 - 85% higher visual fidelity w.r.t. counterpart images captured with real cameras, and HM3D meshes have 34 - 91% fewer artifacts due to incomplete surface reconstruction. The increased scale, fidelity, and diversity of HM3D directly impacts the performance of embodied AI agents trained using it. In fact, we find that HM3D is `pareto optimal' in the following sense -- agents trained to perform PointGoal navigation on HM3D achieve the highest performance regardless of whether they are evaluated on HM3D, Gibson, or MP3D. No similar claim can be made about training on other datasets. HM3D-trained PointNav agents achieve 100% performance on Gibson-test dataset, suggesting that it might be time to retire that episode dataset.

...read moreread less

Posted Content•

Memory-Augmented Reinforcement Learning for Image-Goal Navigation

[...]

Lina Mezghani¹, Sainbayar Sukhbaatar¹, Thibaut Lavril¹, Oleksandr Maksymets¹, Dhruv Batra¹, Dhruv Batra², Piotr Bojanowski¹, Karteek Alahari - Show less +4 more•Institutions (2)

Facebook¹, Georgia Institute of Technology²

13 Jan 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, a cross-episode memory is used to learn to navigate in the context of visually-realistic 3D environments, without access to additional sensors such as position or depth.

...read moreread less

Abstract: In this work, we address the problem of image-goal navigation in the context of visually-realistic 3D environments. This task involves navigating to a location indicated by a target image in a previously unseen environment. Earlier attempts, including RL-based and SLAM-based approaches, have either shown poor generalization performance, or are heavily-reliant on pose/depth sensors. We present a novel method that leverages a cross-episode memory to learn to navigate. We first train a state-embedding network in a self-supervised fashion, and then use it to embed previously-visited states into a memory. In order to avoid overfitting, we propose to use data augmentation on the RGB input during training. We validate our approach through extensive evaluations, showing that our data-augmented memory-based model establishes a new state of the art on the image-goal navigation task in the challenging Gibson dataset. We obtain this competitive performance from RGB input only, without access to additional sensors such as position or depth.

...read moreread less

Posted Content•

Large Batch Simulation for Deep Reinforcement Learning.

[...]

Brennan Shacklett¹, Erik Wijmans², Aleksei Petrenko³, Manolis Savva⁴, Dhruv Batra⁵, Vladlen Koltun⁶, Kayvon Fatahalian⁷ - Show less +3 more•Institutions (7)

Stanford University¹, Georgia Institute of Technology², University of Southern California³, Simon Fraser University⁴, Virginia Tech⁵, Intel⁶, Carnegie Mellon University⁷

12 Mar 2021-arXiv: Learning

TL;DR: The authors design a 3D renderer and embodied navigation simulator based on the principle of batch simulation, which can accept and execute large batches of requests simultaneously, and demonstrate that PointGoal navigation agents can be trained in complex 3D environments on a single GPU in 1.5 days to 97% of the accuracy of agents trained on a prior state-of-theart system using a 64-GPU cluster over three days.

...read moreread less

Abstract: We accelerate deep reinforcement learning-based training in visually complex 3D environments by two orders of magnitude over prior work, realizing end-to-end training speeds of over 19,000 frames of experience per second on a single GPU and up to 72,000 frames per second on a single eight-GPU machine. The key idea of our approach is to design a 3D renderer and embodied navigation simulator around the principle of "batch simulation": accepting and executing large batches of requests simultaneously. Beyond exposing large amounts of work at once, batch simulation allows implementations to amortize in-memory storage of scene assets, rendering work, data loading, and synchronization costs across many simulation requests, dramatically improving the number of simulated agents per GPU and overall simulation throughput. To balance DNN inference and training costs with faster simulation, we also build a computationally efficient policy DNN that maintains high task performance, and modify training algorithms to maintain sample efficiency when training with large mini-batches. By combining batch simulation and DNN performance optimizations, we demonstrate that PointGoal navigation agents can be trained in complex 3D environments on a single GPU in 1.5 days to 97% of the accuracy of agents trained on a prior state-of-the-art system using a 64-GPU cluster over three days. We provide open-source reference implementations of our batch 3D renderer and simulator to facilitate incorporation of these ideas into RL systems.

...read moreread less

Posted Content•

The Surprising Effectiveness of Visual Odometry Techniques for Embodied PointGoal Navigation

[...]

Xiaoming Zhao, Harsh Agrawal¹, Dhruv Batra¹, Alexander G. Schwing²•Institutions (2)

Georgia Institute of Technology¹, University of Illinois at Urbana–Champaign²

26 Aug 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, visual odometry is used for point-goal navigation in a realistic setting, without access to GPS and compass sensors, and the state-of-the-art on the popular Habitat PointNav benchmark is achieved.

...read moreread less

Abstract: It is fundamental for personal robots to reliably navigate to a specified goal. To study this task, PointGoal navigation has been introduced in simulated Embodied AI environments. Recent advances solve this PointGoal navigation task with near-perfect accuracy (99.6% success) in photo-realistically simulated environments, assuming noiseless egocentric vision, noiseless actuation, and most importantly, perfect localization. However, under realistic noise models for visual sensors and actuation, and without access to a "GPS and Compass sensor," the 99.6%-success agents for PointGoal navigation only succeed with 0.3%. In this work, we demonstrate the surprising effectiveness of visual odometry for the task of PointGoal navigation in this realistic setting, i.e., with realistic noise models for perception and actuation and without access to GPS and Compass sensors. We show that integrating visual odometry techniques into navigation policies improves the state-of-the-art on the popular Habitat PointNav benchmark by a large margin, improving success from 64.5% to 71.7% while executing 6.4 times faster.

...read moreread less

Posted Content•

Model-Advantage Optimization for Model-Based Reinforcement Learning.

[...]

Nirbhay Modhe¹, Harish K Kamath¹, Dhruv Batra¹, Ashwin Kalyan²•Institutions (2)

Georgia Institute of Technology¹, Allen Institute for Artificial Intelligence²

26 Jun 2021-arXiv: Learning

TL;DR: In this paper, the authors propose a value-aware objective that is an upper bound on the absolute performance difference of a policy across two models, which is a general purpose algorithm that modifies the standard MBRL pipeline to enable learning with value aware objectives.

...read moreread less

Abstract: Model-based Reinforcement Learning (MBRL) algorithms have been traditionally designed with the goal of learning accurate dynamics of the environment. This introduces a mismatch between the objectives of model-learning and the overall learning problem of finding an optimal policy. Value-aware model learning, an alternative model-learning paradigm to maximum likelihood, proposes to inform model-learning through the value function of the learnt policy. While this paradigm is theoretically sound, it does not scale beyond toy settings. In this work, we propose a novel value-aware objective that is an upper bound on the absolute performance difference of a policy across two models. Further, we propose a general purpose algorithm that modifies the standard MBRL pipeline -- enabling learning with value aware objectives. Our proposed objective, in conjunction with this algorithm, is the first successful instantiation of value-aware MBRL on challenging continuous control environments, outperforming previous value-aware objectives and with competitive performance w.r.t. MLE-based MBRL approaches.

...read moreread less

Posted Content•

Learning Robust Agents for Visual Navigation in Dynamic Environments: The Winning Entry of iGibson Challenge 2021.

[...]

Naoki Yokoyama¹, Qian Luo¹, Dhruv Batra¹, Sehoon Ha¹•Institutions (1)

Georgia Institute of Technology¹

22 Sep 2021-arXiv: Robotics

TL;DR: In this paper, the authors present an approach for improving navigation in dynamic and interactive environments, which won the 1st place in the iGibson Interactive Navigation Challenge 2021. And they employ large-scale reinforcement learning by leveraging the Habitat simulator, which supports high performance parallel computing for both simulation and synchronized learning.

...read moreread less

Abstract: This paper presents an approach for improving navigation in dynamic and interactive environments, which won the 1st place in the iGibson Interactive Navigation Challenge 2021. While the last few years have produced impressive progress on PointGoal Navigation in static environments, relatively little effort has been made on more realistic dynamic environments. The iGibson Challenge proposed two new navigation tasks, Interactive Navigation and Social Navigation, which add displaceable obstacles and moving pedestrians into the simulator environment. Our approach to study these problems uses two key ideas. First, we employ large-scale reinforcement learning by leveraging the Habitat simulator, which supports high performance parallel computing for both simulation and synchronized learning. Second, we employ a new data augmentation technique that adds more dynamic objects into the environment, which can also be combined with traditional image-based augmentation techniques to boost the performance further. Lastly, we achieve sim-to-sim transfer from Habitat to the iGibson simulator, and demonstrate that our proposed methods allow us to train robust agents in dynamic environments with interactive objects or moving humans. Video link: this https URL

...read moreread less

Posted Content•

SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation.

[...]

Abhinav Moudgil¹, Arjun Majumdar², Harsh Agrawal², Stefan Lee³, Dhruv Batra² - Show less +1 more•Institutions (3)

International Institute of Information Technology, Hyderabad¹, Georgia Institute of Technology², Microsoft³

27 Oct 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, a transformer-based vision-and-language navigation (VLN) agent uses two different visual encoders, a scene classification network and an object detector, which produce features that match these two distinct types of visual cues.

...read moreread less

Abstract: Natural language instructions for visual navigation often use scene descriptions (e.g., "bedroom") and object references (e.g., "green chairs") to provide a breadcrumb trail to a goal location. This work presents a transformer-based vision-and-language navigation (VLN) agent that uses two different visual encoders -- a scene classification network and an object detector -- which produce features that match these two distinct types of visual cues. In our method, scene features contribute high-level contextual information that supports object-level processing. With this design, our model is able to use vision-and-language pretraining (i.e., learning the alignment between images and text from large-scale web data) to substantially improve performance on the Room-to-Room (R2R) and Room-Across-Room (RxR) benchmarks. Specifically, our approach leads to improvements of 1.8% absolute in SPL on R2R and 3.7% absolute in SR on RxR. Our analysis reveals even larger gains for navigation instructions that contain six or more object references, which further suggests that our approach is better able to use object features and align them to references in the instructions.

...read moreread less

Posted Content•

Waypoint Models for Instruction-guided Navigation in Continuous Environments.

[...]

Jacob Krantz¹, Aaron Gokaslan², Dhruv Batra³, Stefan Lee⁴, Oleksandr Maksymets⁵ - Show less +1 more•Institutions (5)

Oregon State University¹, Brown University², Georgia Institute of Technology³, Microsoft⁴, Facebook⁵

05 Oct 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, a class of language-conditioned waypoint prediction networks was developed to explore the role of action spaces in language-guided visual navigation, either in terms of its effect on navigation success or the efficiency with which a robotic agent could execute the resulting trajectory.

...read moreread less

Posted Content•

Realistic PointGoal Navigation via Auxiliary Losses and Information Bottleneck.

[...]

Guillermo Grande¹, Dhruv Batra¹, Erik Wijmans¹•Institutions (1)

Georgia Institute of Technology¹

17 Sep 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, the authors propose a novel architecture and training paradigm for training realistic Point-Goal Navigation (PWN) without access to ground-truth localization, which enables the agent to first learn navigation and then learn localization instead of conflating these two objectives.

...read moreread less

Abstract: We propose a novel architecture and training paradigm for training realistic PointGoal Navigation -- navigating to a target coordinate in an unseen environment under actuation and sensor noise without access to ground-truth localization. Specifically, we find that the primary challenge under this setting is learning localization -- when stripped of idealized localization, agents fail to stop precisely at the goal despite reliably making progress towards it. To address this we introduce a set of auxiliary losses to help the agent learn localization. Further, we explore the idea of treating the precise location of the agent as privileged information -- it is unavailable during test time, however, it is available during training time in simulation. We grant the agent restricted access to ground-truth localization readings during training via an information bottleneck. Under this setting, the agent incurs a penalty for using this privileged information, encouraging the agent to only leverage this information when it is crucial to learning. This enables the agent to first learn navigation and then learn localization instead of conflating these two objectives in training. We evaluate our proposed method both in a semi-idealized (noiseless simulation without Compass+GPS) and realistic (addition of noisy simulation) settings. Specifically, our method outperforms existing baselines on the semi-idealized setting by 18\%/21\% SPL/Success and by 15\%/20\% SPL in the realistic setting. Our improved Success and SPL metrics indicate our agent's improved ability to accurately self-localize while maintaining a strong navigation policy. Our implementation can be found at this https URL.

...read moreread less

Proceedings Article•DOI•

SOrT-ing VQA Models : Contrastive Gradient Learning for Improved Consistency

[...]

Sameer Dharur¹, Purva Tendulkar¹, Dhruv Batra¹, Devi Parikh², Ramprasaath R. Selvaraju³ - Show less +1 more•Institutions (3)

Georgia Institute of Technology¹, Facebook², Salesforce.com³

01 Jun 2021

TL;DR: This article proposed a contrastive gradient learning based approach called Sub-question Oriented Tuning (SOrT) which encourages models to rank relevant sub-questions higher than irrelevant questions for an pair.

...read moreread less

Abstract: Recent research in Visual Question Answering (VQA) has revealed state-of-the-art models to be inconsistent in their understanding of the world - they answer seemingly difficult questions requiring reasoning correctly but get simpler associated sub-questions wrong. These sub-questions pertain to lower level visual concepts in the image that models ideally should understand to be able to answer the reasoning question correctly. To address this, we first present a gradient-based interpretability approach to determine the questions most strongly correlated with the reasoning question on an image, and use this to evaluate VQA models on their ability to identify the relevant sub-questions needed to answer a reasoning question. Next, we propose a contrastive gradient learning based approach called Sub-question Oriented Tuning (SOrT) which encourages models to rank relevant sub-questions higher than irrelevant questions for an pair. We show that SOrT improves model consistency by up to 6.5% points over existing approaches, while also improving visual grounding and robustness to rephrasings of questions.

...read moreread less