Showing papers by "Dhruv Batra published in 2019"

PDF

Open Access

Posted Content•

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

[...]

Jiasen Lu¹, Dhruv Batra², Devi Parikh², Stefan Lee²•Institutions (2)

Salesforce.com¹, Georgia Institute of Technology²

06 Aug 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language, is presented, extending the popular BERT architecture to a multi-modal two-stream model, pro-cessing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.

...read moreread less

Abstract: We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, pro-cessing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.

...read moreread less

1,241 citations

Proceedings Article•

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

[...]

Jiasen Lu¹, Dhruv Batra², Devi Parikh², Stefan Lee²•Institutions (2)

Salesforce.com¹, Georgia Institute of Technology²

06 Aug 2019

TL;DR: The ViLBERT model as mentioned in this paper extends the BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.

...read moreread less

Abstract: We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.

...read moreread less

1,069 citations

Proceedings Article•DOI•

Habitat: A Platform for Embodied AI Research

[...]

Manolis Savva¹, Jitendra Malik², Devi Parikh³, Dhruv Batra³, Abhishek Kadian⁴, Oleksandr Maksymets⁴, Yili Zhao⁴, Erik Wijmans³, Bhavana Jain⁴, Julian Straub⁴, Jia Liu⁴, Vladlen Koltun⁵ - Show less +8 more•Institutions (5)

Simon Fraser University¹, University of California, Berkeley², Georgia Institute of Technology³, Facebook⁴, Intel⁵

02 Apr 2019

TL;DR: The comparison between learning and SLAM approaches from two recent works are revisited and evidence is found -- that learning outperforms SLAM if scaled to an order of magnitude more experience than previous investigations, and the first cross-dataset generalization experiments are conducted.

...read moreread less

Abstract: We present Habitat, a platform for research in embodied artificial intelligence (AI). Habitat enables training embodied agents (virtual robots) in highly efficient photorealistic 3D simulation. Specifically, Habitat consists of: (i) Habitat-Sim: a flexible, high-performance 3D simulator with configurable agents, sensors, and generic 3D dataset handling. Habitat-Sim is fast -- when rendering a scene from Matterport3D, it achieves several thousand frames per second (fps) running single-threaded, and can reach over 10,000 fps multi-process on a single GPU. (ii) Habitat-API: a modular high-level library for end-to-end development of embodied AI algorithms -- defining tasks (e.g., navigation, instruction following, question answering), configuring, training, and benchmarking embodied agents. These large-scale engineering contributions enable us to answer scientific questions requiring experiments that were till now impracticable or 'merely' impractical. Specifically, in the context of point-goal navigation: (1) we revisit the comparison between learning and SLAM approaches from two recent works and find evidence for the opposite conclusion -- that learning outperforms SLAM if scaled to an order of magnitude more experience than previous investigations, and (2) we conduct the first cross-dataset generalization experiments {train, test} x {Matterport3D, Gibson} for multiple sensors {blind, RGB, RGBD, D} and find that only agents with depth (D) sensors generalize across datasets. We hope that our open-source platform and these findings will advance research in embodied AI.

...read moreread less

839 citations

Journal Article•

Visual Dialog

[...]

Abhishek Das¹, Satwik Kottur², Khushi Gupta², Avi Singh³, Deshraj Yadav¹, Stefan Lee¹, Jose M. F. Moura², Devi Parikh¹, Dhruv Batra¹ - Show less +5 more•Institutions (3)

Georgia Institute of Technology¹, Carnegie Mellon University², University of California, Berkeley³

01 May 2019-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: The authors introduced the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content, given an image, a dialog history and a question about the image, the agent has to ground the question in image, infer context from history, and answer the question accurately.

...read moreread less

Abstract: We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in image, infer context from history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task so as to serve as a general test of machine intelligence, while being sufficiently grounded in vision to allow objective evaluation of individual responses and benchmark progress. We develop a novel two-person real-time chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). VisDial v0.9 has been released and consists of $\sim$ ∼ 1.2M dialog question-answer pairs from 10-round, human-human dialogs grounded in $\sim$ ∼ 120k images from the COCO dataset. We introduce a family of neural encoder-decoder models for Visual Dialog with 3 encoders—Late Fusion, Hierarchical Recurrent Encoder and Memory Network (optionally with attention over image features)—and 2 decoders (generative and discriminative), which outperform a number of sophisticated baselines. We propose a retrieval-based evaluation protocol for Visual Dialog where the AI agent is asked to sort a set of candidate answers and evaluated on metrics such as mean-reciprocal-rank and recall $@k$ @ k of human response. We quantify the gap between machine and human performance on the Visual Dialog task via human studies. Putting it all together, we demonstrate the first ‘visual chatbot’! Our dataset, code, pretrained models and visual chatbot are available on https://visualdialog.org .

...read moreread less

484 citations

Proceedings Article•DOI•

Towards VQA Models That Can Read

[...]

Amanpreet Singh¹, Vivek T. Natarajan¹, Meet Shah¹, Yu Jiang¹, Xinlei Chen¹, Dhruv Batra², Devi Parikh², Marcus Rohrbach¹ - Show less +4 more•Institutions (2)

Facebook¹, Georgia Institute of Technology²

15 Jun 2019

TL;DR: A novel model architecture is introduced that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer which might be a deduction based on the text and the image or composed of the strings found in the images.

...read moreread less

Abstract: Studies have shown that a dominant class of questions asked by visually impaired users on images of their surroundings involves reading text in the image. But today’s VQA models can not read! Our paper takes a first step towards addressing this problem. First, we introduce a new “TextVQA” dataset to facilitate progress on this important problem. Existing datasets either have a small proportion of questions about text (e.g., the VQA dataset) or are too small (e.g., the VizWiz dataset). TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. Second, we introduce a novel model architecture that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer which might be a deduction based on the text and the image or composed of the strings found in the image. Consequently, we call our approach Look, Read, Reason & Answer (LoRRA). We show that LoRRA outperforms existing state-of-the-art VQA models on our TextVQA dataset. We find that the gap between human performance and machine performance is significantly larger on TextVQA than on VQA 2.0, suggesting that TextVQA is well-suited to benchmark progress along directions complementary to VQA 2.0.

...read moreread less

363 citations

Proceedings Article•

Counterfactual Visual Explanations

[...]

Yash Goyal¹, Ziyan Wu², Jan Ernst², Dhruv Batra¹, Devi Parikh¹, Stefan Lee¹ - Show less +2 more•Institutions (2)

Georgia Institute of Technology¹, Siemens²

24 May 2019

TL;DR: In this article, a technique to produce counterfactual visual explanations was developed for fine-grained bird classification, where a visual explanation identifies how a query image could change such that the system would output a different specified class.

...read moreread less

Abstract: In this work, we develop a technique to produce counterfactual visual explanations. Given a 'query' image $I$ for which a vision system predicts class $c$, a counterfactual visual explanation identifies how $I$ could change such that the system would output a different specified class $c'$. To do this, we select a 'distractor' image $I'$ that the system predicts as class $c'$ and identify spatial regions in $I$ and $I'$ such that replacing the identified region in $I$ with the identified region in $I'$ would push the system towards classifying $I$ as $c'$. We apply our approach to multiple image classification datasets generating qualitative results showcasing the interpretability and discriminativeness of our counterfactual explanations. To explore the effectiveness of our explanations in teaching humans, we present machine teaching experiments for the task of fine-grained bird classification. We find that users trained to distinguish bird species fare better when given access to counterfactual explanations in addition to training examples.

...read moreread less

319 citations

Posted Content•

The Replica Dataset: A Digital Replica of Indoor Spaces.

[...]

13 Jun 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: Replica, a dataset of 18 highly photo-realistic 3D indoor scene reconstructions at room and building scale, is introduced to enable machine learning (ML) research that relies on visually, geometrically, and semantically realistic generative models of the world.

...read moreread less

Abstract: We introduce Replica, a dataset of 18 highly photo-realistic 3D indoor scene reconstructions at room and building scale. Each scene consists of a dense mesh, high-resolution high-dynamic-range (HDR) textures, per-primitive semantic class and instance information, and planar mirror and glass reflectors. The goal of Replica is to enable machine learning (ML) research that relies on visually, geometrically, and semantically realistic generative models of the world - for instance, egocentric computer vision, semantic segmentation in 2D and 3D, geometric inference, and the development of embodied agents (virtual robots) performing navigation, instruction following, and question answering. Due to the high level of realism of the renderings from Replica, there is hope that ML systems trained on Replica may transfer directly to real world image and video data. Together with the data, we are releasing a minimal C++ SDK as a starting point for working with the Replica dataset. In addition, Replica is `Habitat-compatible', i.e. can be natively used with AI Habitat for training and testing embodied agents.

...read moreread less

299 citations

Posted Content•

Habitat: A Platform for Embodied AI Research

[...]

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, Dhruv Batra - Show less +8 more

02 Apr 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: Habitat as discussed by the authors is a platform for research in embodied artificial intelligence (AI) that enables training embodied agents (virtual robots) in highly efficient photorealistic 3D simulation.

...read moreread less

294 citations

Proceedings Article•

TarMAC: Targeted Multi-Agent Communication

[...]

Abhishek Das¹, Théophile Gervet, Joshua Romoff², Dhruv Batra¹, Devi Parikh¹, Michael G. Rabbat², Joelle Pineau² - Show less +3 more•Institutions (2)

Georgia Institute of Technology¹, McGill University²

24 May 2019

TL;DR: This work proposes a targeted communication architecture for multi-agent reinforcement learning, where agents learn both what messages to send and whom to address them to while performing cooperative tasks in partially-observable environments, and augment this with a multi-round communication approach.

...read moreread less

Abstract: We propose a targeted communication architecture for multi-agent reinforcement learning, where agents learn both what messages to send and whom to address them to while performing cooperative tasks in partially-observable environments. This targeting behavior is learnt solely from downstream task-specific reward without any communication supervision. We additionally augment this with a multi-round communication approach where agents coordinate via multiple rounds of communication before taking actions in the environment. We evaluate our approach on a diverse set of cooperative multi-agent tasks, of varying difficulties, with varying number of agents, in a variety of environments ranging from 2D grid layouts of shapes and simulated traffic junctions to 3D indoor environments, and demonstrate the benefits of targeted and multi-round communication. Moreover, we show that the targeted communication strategies learned by agents are interpretable and intuitive. Finally, we show that our architecture can be easily extended to mixed and competitive environments, leading to improved performance and sample complexity over recent state-of-the-art approaches.

...read moreread less

205 citations

Posted Content•

DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames

[...]

Erik Wijmans¹, Abhishek Kadian², Ari S. Morcos³, Stefan Lee⁴, Irfan Essa⁵, Devi Parikh¹, Manolis Savva⁶, Dhruv Batra¹ - Show less +4 more•Institutions (6)

Georgia Institute of Technology¹, Facebook², Harvard University³, Oregon State University⁴, Google⁵, Simon Fraser University⁶

01 Nov 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: Decentralized Distributed Proximal Policy Optimization (DD-PPO) as discussed by the authors uses multiple machines, lacks a centralized server, and is synchronous, making it conceptually simple and easy to implement.

...read moreread less

Abstract: We present Decentralized Distributed Proximal Policy Optimization (DD-PPO), a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever stale), making it conceptually simple and easy to implement. In our experiments on training virtual robots to navigate in Habitat-Sim, DD-PPO exhibits near-linear scaling -- achieving a speedup of 107x on 128 GPUs over a serial implementation. We leverage this scaling to train an agent for 2.5 Billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs. This massive-scale training not only sets the state of art on Habitat Autonomous Navigation Challenge 2019, but essentially solves the task --near-perfect autonomous navigation in an unseen environment without access to a map, directly from an RGB-D camera and a GPS+Compass sensor. Fortuitously, error vs computation exhibits a power-law-like distribution; thus, 90% of peak performance is obtained relatively early (at 100 million steps) and relatively cheaply (under 1 day with 8 GPUs). Finally, we show that the scene understanding and navigation policies learned can be transferred to other navigation tasks -- the analog of ImageNet pre-training + task-specific fine-tuning for embodied AI. Our model outperforms ImageNet pre-trained CNNs on these transfer tasks and can serve as a universal resource (all models and code are publicly available).

...read moreread less

177 citations

Proceedings Article•DOI•

Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

[...]

Ramprasaath R. Selvaraju¹, Stefan Lee¹, Yilin Shen², Hongxia Jin², Shalini Ghosh², Larry Heck², Dhruv Batra¹, Devi Parikh¹ - Show less +4 more•Institutions (2)

Georgia Institute of Technology¹, Samsung²

01 Oct 2019

TL;DR: In this article, the alignment between human attention maps and gradient-based network importance is optimized to encourage deep networks to be sensitive to the same input regions as humans to improve visual grounding.

...read moreread less

Abstract: Many vision and language models suffer from poor visual grounding -- often falling back on easy-to-learn language priors rather than basing their decisions on visual concepts in the image. In this work, we propose a generic approach called Human Importance-aware Network Tuning (HINT) that effectively leverages human demonstrations to improve visual grounding. HINT encourages deep networks to be sensitive to the same input regions as humans. Our approach optimizes the alignment between human attention maps and gradient-based network importances -- ensuring that models learn not just to look at but rather rely on visual concepts that humans found relevant for a task when making predictions. We apply HINT to Visual Question Answering and Image Captioning tasks, outperforming top approaches on splits that penalize over-reliance on language priors (VQA-CP and robust captioning) using human attention demonstrations for just 6% of the training data.

...read moreread less

Proceedings Article•DOI•

End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features

[...]

Chiori Hori¹, Huda Alamri², Jue Wang¹, Gordon Wichern¹, Takaaki Hori¹, Anoop Cherian¹, Tim K. Marks¹, Vincent Cartillier², Raphael Gontijo Lopes², Abhishek Das², Irfan Essa², Dhruv Batra², Devi Parikh² - Show less +9 more•Institutions (2)

Mitsubishi Electric Research Laboratories¹, Georgia Institute of Technology²

12 May 2019

TL;DR: This paper introduces a new data set of dialogs about videos of human behaviors, as well as an end-to-end Audio Visual Scene-Aware Dialog (AVSD) model, trained using thisnew data set, that generates responses in a dialog about a video.

...read moreread less

Abstract: In order for machines interacting with the real world to have conversations with users about the objects and events around them, they need to understand dynamic audiovisual scenes. The recent revolution of neural network models allows us to combine various modules into a single end-to-end differentiable network. As a result, Audio Visual Scene-Aware Dialog (AVSD) systems for real-world applications can be developed by integrating state-of-the-art technologies from multiple research areas, including end-to-end dialog technologies, visual question answering (VQA) technologies, and video description technologies. In this paper, we introduce a new data set of dialogs about videos of human behaviors, as well as an end-to-end Audio Visual Scene-Aware Dialog (AVSD) model, trained using this new data set, that generates responses in a dialog about a video. By using features that were developed for multimodal attention-based video description, our system improves the quality of generated dialog about dynamic video scenes.

...read moreread less

Proceedings Article•DOI•

Audio Visual Scene-Aware Dialog

[...]

Huda Alamri¹, Peter Anderson¹, Stefan Lee¹, Devi Parikh¹, Vincent Cartillier¹, Abhishek Das¹, Jue Wang², Anoop Cherian, Irfan Essa¹, Dhruv Batra¹, Tim K. Marks², Chiori Hori² - Show less +8 more•Institutions (2)

Georgia Institute of Technology¹, Mitsubishi Electric Research Laboratories²

15 Jun 2019

TL;DR: The authors introduced the Audio Visual Scene-Aware Dialog (AVSD) dataset, which contains a dialog about the video, plus a final summary of the video by one of the dialog participants.

...read moreread less

Abstract: We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. To answer successfully, agents must ground concepts from the question in the video while leveraging contextual cues from the dialog history. To benchmark this task, we introduce the Audio Visual Scene-Aware Dialog (AVSD) Dataset. For each of more than 11,000 videos of human actions from the Charades dataset, our dataset contains a dialog about the video, plus a final summary of the video by one of the dialog participants. We train several baseline systems for this task and evaluate the performance of the trained models using both qualitative and quantitative metrics. Our results indicate that models must utilize all the available inputs (video, audio, question, and dialog history) to perform best on this dataset.

...read moreread less

Journal Article•DOI•

Sim2Real Predictivity: Does Evaluation in Simulation Predict Real-World Performance?

[...]

Abhishek Kadian¹, Joanne Truong², Aaron Gokaslan¹, Alexander Clegg¹, Erik Wijmans², Stefan Lee³, Manolis Savva⁴, Sonia Chernova², Dhruv Batra² - Show less +5 more•Institutions (4)

Facebook¹, Georgia Institute of Technology², Oregon State University³, Simon Fraser University⁴

13 Dec 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: The experiments show that it is possible to tune simulation parameters to improve sim2real predictivity (e.g. improving SRCC from 0.18 to 0.844) – increasing confidence that in-simulation comparisons will translate to deployed systems in reality.

...read moreread less

Abstract: Does progress in simulation translate to progress on robots? If one method outperforms another in simulation, how likely is that trend to hold in reality on a robot? We examine this question for embodied PointGoal navigation, developing engineering tools and a research paradigm for evaluating a simulator by its sim2real predictivity. First, we develop Habitat-PyRobot Bridge (HaPy), a library for seamless execution of identical code on simulated agents and robots, transferring simulation-trained agents to a LoCoBot platform with a one-line code change. Second, we investigate the sim2real predictivity of Habitat-Sim for PointGoal navigation. We 3D-scan a physical lab space to create a virtualized replica, and run parallel tests of 9 different models in reality and simulation. We present a new metric called Sim-vs-Real Correlation Coefficient (SRCC) to quantify predictivity. We find that SRCC for Habitat as used for the CVPR19 challenge is low (0.18 for the success metric), suggesting that performance differences in this simulator-based challenge do not persist after physical deployment. This gap is largely due to AI agents learning to exploit simulator imperfections, abusing collision dynamics to 'slide' along walls, leading to shortcuts through otherwise non-navigable space. Naturally, such exploits do not work in the real world. Our experiments show that it is possible to tune simulation parameters to improve sim2real predictivity (e.g. improving $SRCC_{Succ}$ from 0.18 to 0.844), increasing confidence that in-simulation comparisons will translate to deployed systems in reality.

...read moreread less

Proceedings Article•DOI•

nocaps: novel object captioning at scale

[...]

Harsh Agrawal¹, Peter Anderson¹, Karan Desai¹, Yufei Wang², Xinlei Chen³, Rishabh Jain¹, Mark Johnson², Dhruv Batra¹, Devi Parikh¹, Stefan Lee¹ - Show less +6 more•Institutions (3)

Georgia Institute of Technology¹, Macquarie University², Facebook³

01 Oct 2019

TL;DR: The nocaps benchmark as discussed by the authors is a large-scale benchmark for object captioning, which consists of 166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets.

...read moreread less

Abstract: Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to ever function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale benchmark for this task. Dubbed ‘nocaps’, for novel object captioning at scale, our benchmark consists of 166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets. The associated training data consists of COCO image-caption pairs, plus Open Images image-level labels and object bounding boxes. Since Open Images contains many more classes than COCO, nearly 400 object classes seen in test images have no or very few associated training captions (hence, nocaps). We extend existing novel object captioning models to establish strong baselines for this benchmark and provide analysis to guide future work.

...read moreread less

Proceedings Article•DOI•

Embodied Question Answering in Photorealistic Environments With Point Cloud Perception

[...]

Erik Wijmans¹, Samyak Datta¹, Oleksandr Maksymets², Abhishek Das¹, Georgia Gkioxari², Stefan Lee¹, Irfan Essa¹, Devi Parikh¹, Dhruv Batra¹ - Show less +5 more•Institutions (2)

Georgia Institute of Technology¹, Facebook²

15 Jun 2019

TL;DR: In this article, a large-scale navigation task for embodied question answering in photo-realistic environments (Matterport 3D) is presented, where 3D point clouds, RGB images, or their combination are used.

...read moreread less

Abstract: To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception we instantiate a large-scale navigation task -- Embodied Question Answering [1] in photo-realistic environments (Matterport 3D). We thoroughly study navigation policies that utilize 3D point clouds, RGB images, or their combination. Our analysis of these models reveals several key findings. We find that two seemingly naive navigation baselines, forward-only and random, are strong navigators and challenging to outperform, due to the specific choice of the evaluation setting presented by [1]. We find a novel loss-weighting scheme we call Inflection Weighting to be important when training recurrent models for navigation with behavior cloning and are able to out perform the baselines with this technique. We find that point clouds provide a richer signal than RGB images for learning obstacle avoidance, motivating the use (and continued study) of 3D deep learning models for embodied navigation.

...read moreread less

Posted Content•

Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded.

[...]

Ramprasaath R. Selvaraju¹, Stefan Lee¹, Yilin Shen², Hongxia Jin², Shalini Ghosh², Larry Heck², Dhruv Batra¹, Devi Parikh¹ - Show less +4 more•Institutions (2)

Georgia Institute of Technology¹, Samsung²

11 Feb 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work proposes a generic approach called Human Importance-aware Network Tuning (HINT), which effectively leverages human demonstrations to improve visual grounding and encourages deep networks to be sensitive to the same input regions as humans.

...read moreread less

Abstract: Many vision and language models suffer from poor visual grounding - often falling back on easy-to-learn language priors rather than basing their decisions on visual concepts in the image. In this work, we propose a generic approach called Human Importance-aware Network Tuning (HINT) that effectively leverages human demonstrations to improve visual grounding. HINT encourages deep networks to be sensitive to the same input regions as humans. Our approach optimizes the alignment between human attention maps and gradient-based network importances - ensuring that models learn not just to look at but rather rely on visual concepts that humans found relevant for a task when making predictions. We apply HINT to Visual Question Answering and Image Captioning tasks, outperforming top approaches on splits that penalize over-reliance on language priors (VQA-CP and robust captioning) using human attention demonstrations for just 6% of the training data.

...read moreread less

Journal Article•DOI•

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

[...]

Yash Goyal¹, Tejas Khot², Aishwarya Agrawal¹, Douglas Summers-Stay³, Dhruv Batra¹, Dhruv Batra⁴, Devi Parikh¹, Devi Parikh⁴ - Show less +4 more•Institutions (4)

Georgia Institute of Technology¹, Carnegie Mellon University², United States Army Research Laboratory³, Facebook⁴

01 Apr 2019-International Journal of Computer Vision

TL;DR: This work balances the popular VQA dataset by collecting complementary images such that every question in the authors' balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.

...read moreread less

Abstract: The problem of visual question answering (VQA) is of significant importance both as a challenging research question and for the rich set of applications it enables. In this context, however, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in VQA models that ignore visual information, leading to an inflated sense of their capability. We propose to counter these language priors for the task of VQA and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset (Antol et al., in: ICCV, 2015) by collecting complementary images such that every question in our balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question. Our dataset is by construction more balanced than the original VQA dataset and has approximately twice the number of image-question pairs. Our complete balanced dataset is publicly available at http://visualqa.org/ as part of the 2nd iteration of the VQA Dataset and Challenge (VQA v2.0). We further benchmark a number of state-of-art VQA models on our balanced dataset. All models perform significantly worse on our balanced dataset, suggesting that these models have indeed learned to exploit language priors. This finding provides the first concrete empirical evidence for what seems to be a qualitative sense among practitioners. We also present interesting insights from analysis of the participant entries in VQA Challenge 2017, organized by us on the proposed VQA v2.0 dataset. The results of the challenge were announced in the 2nd VQA Challenge Workshop at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017. Finally, our data collection protocol for identifying complementary images enables us to develop a novel interpretable model, which in addition to providing an answer to the given (image, question) pair, also provides a counter-example based explanation. Specifically, it identifies an image that is similar to the original image, but it believes has a different answer to the same question. This can help in building trust for machines among their users.

...read moreread less

Proceedings Article•DOI•

CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication

[...]

Jin-Hwa Kim¹, Nikita Kitaev², Xinlei Chen³, Marcus Rohrbach³, Byoung-Tak Zhang⁴, Yuandong Tian³, Dhruv Batra⁵, Devi Parikh⁵ - Show less +4 more•Institutions (5)

Naver Corporation¹, University of California, Berkeley², Facebook³, Seoul National University⁴, Georgia Institute of Technology⁵

01 Jul 2019

TL;DR: This work develops a Collaborative image-Drawing game between two agents, called CoDraw, which is grounded in a virtual world that contains movable clip art objects and presents models for the task and benchmark them using both fully automated evaluation and by having them play the game live with humans.

...read moreread less

Abstract: In this work, we propose a goal-driven collaborative task that combines language, perception, and action. Specifically, we develop a Collaborative image-Drawing game between two agents, called CoDraw. Our game is grounded in a virtual world that contains movable clip art objects. The game involves two players: a Teller and a Drawer. The Teller sees an abstract scene containing multiple clip art pieces in a semantically meaningful configuration, while the Drawer tries to reconstruct the scene on an empty canvas using available clip art pieces. The two players communicate with each other using natural language. We collect the CoDraw dataset of ~10K dialogs consisting of ~138K messages exchanged between human players. We define protocols and metrics to evaluate learned agents in this testbed, highlighting the need for a novel “crosstalk” evaluation condition which pairs agents trained independently on disjoint subsets of the training data. We present models for our task and benchmark them using both fully automated evaluation and by having them play the game live with humans.

...read moreread less

Posted Content•

CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog

[...]

Satwik Kottur¹, Jose M. F. Moura¹, Devi Parikh², Dhruv Batra³, Marcus Rohrbach³ - Show less +1 more•Institutions (3)

Carnegie Mellon University¹, Georgia Institute of Technology², Facebook³

07 Mar 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work develops CLEVR-Dialog, a large diagnostic dataset for studying multi-round reasoning in visual dialog, and constructs a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset, resulting in a dataset where all aspects of the visual dialog are fully annotated.

...read moreread less

Abstract: Visual Dialog is a multimodal task of answering a sequence of questions grounded in an image, using the conversation history as context. It entails challenges in vision, language, reasoning, and grounding. However, studying these subtasks in isolation on large, real datasets is infeasible as it requires prohibitively-expensive complete annotation of the 'state' of all images and dialogs. We develop CLEVR-Dialog, a large diagnostic dataset for studying multi-round reasoning in visual dialog. Specifically, we construct a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset. This combination results in a dataset where all aspects of the visual dialog are fully annotated. In total, CLEVR-Dialog contains 5 instances of 10-round dialogs for about 85k CLEVR images, totaling to 4.25M question-answer pairs. We use CLEVR-Dialog to benchmark performance of standard visual dialog models; in particular, on visual coreference resolution (as a function of the coreference distance). This is the first analysis of its kind for visual dialog models that was not possible without this dataset. We hope the findings from CLEVR-Dialog will help inform the development of future models for visual dialog. Our dataset and code are publicly available.

...read moreread less

Proceedings Article•DOI•

Multi-Target Embodied Question Answering

[...]

Licheng Yu¹, Xinlei Chen², Georgia Gkioxari², Mohit Bansal¹, Tamara L. Berg, Dhruv Batra³ - Show less +2 more•Institutions (3)

University of North Carolina at Chapel Hill¹, Facebook², Georgia Institute of Technology³

15 Jun 2019

TL;DR: This work presents a generalization of EQA -- Multi-Target EQA (MT-EQA), and proposes a modular architecture composed of a program generator, a controller, a navigator, and a VQA module that can outperform previous methods and strong baselines by a significant margin.

...read moreread less

Abstract: Embodied Question Answering (EQA) is a relatively new task where an agent is asked to answer questions about its environment from egocentric perception. EQA as introduced in [8] makes the fundamental assumption that every question, e.g., ``what color is the car?", has exactly one target (``car") being inquired about. This assumption puts a direct limitation on the abilities of the agent. We present a generalization of EQA -- Multi-Target EQA (MT-EQA). Specifically, we study questions that have multiple targets in them, such as ``Is the dresser in the bedroom bigger than the oven in the kitchen?", where the agent has to navigate to multiple locations (``dresser in bedroom", ``oven in kitchen") and perform comparative reasoning (``dresser" bigger than ``oven") before it can answer a question. Such questions require the development of entirely new modules or components in the agent. To address this, we propose a modular architecture composed of a program generator, a controller, a navigator, and a VQA module. The program generator converts the given question into sequential executable sub-programs; the navigator guides the agent to multiple locations pertinent to the navigation-related sub-programs; and the controller learns to select relevant observations along its path. These observations are then fed to the VQA module to predict the answer. We perform detailed analysis for each of the model components and show that our joint model can outperform previous methods and strong baselines by a significant margin.

...read moreread less

Posted Content•

Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline

[...]

Vishvak Murahari¹, Dhruv Batra¹, Devi Parikh¹, Abhishek Das¹•Institutions (1)

Georgia Institute of Technology¹

05 Dec 2019-arXiv: Learning

TL;DR: This work adapts the recently proposed ViLBERT model for multi-turn visually-grounded conversations and finds that additional finetuning using "dense" annotations in VisDial leads to even higher NDCG but hurts MRR, highlighting a trade-off between the two primary metrics.

...read moreread less

Abstract: Prior work in visual dialog has focused on training deep neural models on VisDial in isolation. Instead, we present an approach to leverage pretraining on related vision-language datasets before transferring to visual dialog. We adapt the recently proposed ViLBERT (Lu et al., 2019) model for multi-turn visually-grounded conversations. Our model is pretrained on the Conceptual Captions and Visual Question Answering datasets, and finetuned on VisDial. Our best single model outperforms prior published work (including model ensembles) by more than 1% absolute on NDCG and MRR. Next, we find that additional finetuning using "dense" annotations in VisDial leads to even higher NDCG -- more than 10% over our base model -- but hurts MRR -- more than 17% below our base model! This highlights a trade-off between the two primary metrics -- NDCG and MRR -- which we find is due to dense annotations not correlating well with the original ground-truth answers to questions.

...read moreread less

Posted Content•

Audio-Visual Scene-Aware Dialog

[...]

Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh - Show less +8 more

25 Jan 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: The task of scene-aware dialog is introduced and results indicate that models must utilize all the available inputs (video, audio, question, and dialog history) to perform best on this dataset.

...read moreread less

Posted Content•

EvalAI: Towards Better Evaluation Systems for AI Agents.

[...]

Deshraj Yadav, Rishabh Jain, Harsh Agrawal, Prithvijit Chattopadhyay, Taranjeet Singh, Akash Jain, Shivkaran Singh, Stefan Lee, Dhruv Batra - Show less +5 more

10 Feb 2019-arXiv: Artificial Intelligence

TL;DR: EvalAI is built to provide a scalable solution to the research community to fulfill the critical need of evaluating machine learning models and agents acting in an environment against annotations or with a human-in-the-loop.

...read moreread less

Abstract: We introduce EvalAI, an open source platform for evaluating and comparing machine learning (ML) and artificial intelligence algorithms (AI) at scale. EvalAI is built to provide a scalable solution to the research community to fulfill the critical need of evaluating machine learning models and agents acting in an environment against annotations or with a human-in-the-loop. This will help researchers, students, and data scientists to create, collaborate, and participate in AI challenges organized around the globe. By simplifying and standardizing the process of benchmarking these models, EvalAI seeks to lower the barrier to entry for participating in the global scientific effort to push the frontiers of machine learning and artificial intelligence, thereby increasing the rate of measurable progress in this domain.

...read moreread less

Posted Content•

SplitNet: Sim2Sim and Task2Task Transfer for Embodied Visual Navigation

[...]

Daniel Gordon¹, Abhishek Kadian², Devi Parikh³, Judy Hoffman³, Dhruv Batra³ - Show less +1 more•Institutions (3)

University of Washington¹, Facebook², Georgia Institute of Technology³

18 May 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: SplitNet is proposed, a method for decoupling visual perception and policy learning by incorporating auxiliary tasks and selective learning of portions of the model that explicitly decompose the learning objectives for visual navigation into perceiving the world and acting on that perception.

...read moreread less

Abstract: We propose SplitNet, a method for decoupling visual perception and policy learning. By incorporating auxiliary tasks and selective learning of portions of the model, we explicitly decompose the learning objectives for visual navigation into perceiving the world and acting on that perception. We show dramatic improvements over baseline models on transferring between simulators, an encouraging step towards Sim2Real. Additionally, SplitNet generalizes better to unseen environments from the same simulator and transfers faster and more effectively to novel embodied navigation tasks. Further, given only a small sample from a target domain, SplitNet can match the performance of traditional end-to-end pipelines which receive the entire dataset. Code is available this https URL

...read moreread less

Posted Content•

Embodied Question Answering in Photorealistic Environments with Point Cloud Perception

[...]

Erik Wijmans¹, Samyak Datta¹, Oleksandr Maksymets², Abhishek Das¹, Georgia Gkioxari², Stefan Lee¹, Irfan Essa¹, Devi Parikh¹, Dhruv Batra¹ - Show less +5 more•Institutions (2)

Georgia Institute of Technology¹, Facebook²

06 Apr 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: It is found that point clouds provide a richer signal than RGB images for learning obstacle avoidance, motivating the use (and continued study) of 3D deep learning models for embodied navigation.

...read moreread less

Posted Content•

Emergence of Compositional Language with Deep Generational Transmission

[...]

Michael Cogswell, Jiasen Lu, Stefan Lee, Devi Parikh, Dhruv Batra - Show less +1 more

19 Apr 2019-arXiv: Learning

TL;DR: It is shown that this implicit cultural transmission encourages the resulting languages to exhibit better compositional generalization and suggest how elements of cultural dynamics can be further integrated into populations of deep agents.

...read moreread less

Abstract: Recent work has studied the emergence of language among deep reinforcement learning agents that must collaborate to solve a task. Of particular interest are the factors that cause language to be compositional -- i.e., express meaning by combining words which themselves have meaning. Evolutionary linguists have found that in addition to structural priors like those already studied in deep learning, the dynamics of transmitting language from generation to generation contribute significantly to the emergence of compositionality. In this paper, we introduce these cultural evolutionary dynamics into language emergence by periodically replacing agents in a population to create a knowledge gap, implicitly inducing cultural transmission of language. We show that this implicit cultural transmission encourages the resulting languages to exhibit better compositional generalization.

...read moreread less

Posted Content•

Are We Making Real Progress in Simulated Environments? Measuring the Sim2Real Gap in Embodied Visual Navigation

[...]

Abhishek Kadian, Joanne Truong, Aaron Gokaslan, Alexander Clegg, Erik Wijmans, Stefan Lee, Manolis Savva, Sonia Chernova, Dhruv Batra - Show less +5 more

13 Dec 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: The Habitat-PyRobot Bridge is developed, a library for seamless execution of identical code on a simulated agent and a physical robot and a new metric called Sim-vs-Real Correlation Coefficient (SRCC) is presented to quantify sim2real predictivity, which is largely due to AI agents learning to 'cheat' by exploiting simulator imperfections.

...read moreread less

Abstract: Does progress in simulation translate to progress in robotics? Specifically, if method A outperforms method B in simulation, how likely is the trend to hold in reality on a robot? We examine this question for embodied (PointGoal) navigation, developing engineering tools and a research paradigm for evaluating a simulator by its sim2real predictivity, revealing surprising findings about prior work. First, we develop Habitat-PyRobot Bridge (HaPy), a library for seamless execution of identical code on a simulated agent and a physical robot. Habitat-to-Locobot transfer with HaPy involves just one line change in config, essentially treating reality as just another simulator! Second, we investigate sim2real predictivity of Habitat-Sim for PointGoal navigation. We 3D-scan a physical lab space to create a virtualized replica, and run parallel tests of 9 different models in reality and simulation. We present a new metric called Sim-vs-Real Correlation Coefficient (SRCC) to quantify sim2real predictivity. Our analysis reveals several important findings. We find that SRCC for Habitat as used for the CVPR19 challenge is low (0.18 for the success metric), which suggests that performance improvements for this simulator-based challenge would not transfer well to a physical robot. We find that this gap is largely due to AI agents learning to 'cheat' by exploiting simulator imperfections: specifically, the way Habitat allows for 'sliding' along walls on collision. Essentially, the virtual robot is capable of cutting corners, leading to unrealistic shortcuts through non-navigable spaces. Naturally, such exploits do not work in the real world where the robot stops on contact with walls. Our experiments show that it is possible to optimize simulation parameters to enable robots trained in imperfect simulators to generalize learned skills to reality (e.g. improving $SRCC_{Succ}$ from 0.18 to 0.844).

...read moreread less

Proceedings Article•

Probabilistic Neural-Symbolic Models for Interpretable Visual Question Answering

[...]

Ramakrishna Vedantam¹, Karan Desai¹, Stefan Lee², Marcus Rohrbach¹, Dhruv Batra¹, Devi Parikh² - Show less +2 more•Institutions (2)

Facebook¹, Georgia Institute of Technology²

21 Feb 2019

TL;DR: A new class of probabilistic neural-symbolic models, that have symbolic functional programs as a latent, stochastic variable, that are more understandable while requiring lesser number of teaching examples for VQA is proposed.

...read moreread less

Abstract: We propose a new class of probabilistic neural-symbolic models, that have symbolic functional programs as a latent, stochastic variable. Instantiated in the context of visual question answering, our probabilistic formulation offers two key conceptual advantages over prior neural-symbolic models for VQA. Firstly, the programs generated by our model are more understandable while requiring lesser number of teaching examples. Secondly, we show that one can pose counterfactual scenarios to the model, to probe its beliefs on the programs that could lead to a specified answer given an image. Our results on the CLEVR and SHAPES datasets verify our hypotheses, showing that the model gets better program (and answer) prediction accuracy even in the low data regime, and allows one to probe the coherence and consistency of reasoning performed.

...read moreread less

Proceedings Article•DOI•

Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning

[...]

Jyoti Aneja¹, Harsh Agrawal², Dhruv Batra², Alexander G. Schwing¹•Institutions (2)

University of Illinois at Urbana–Champaign¹, Georgia Institute of Technology²

01 Oct 2019

TL;DR: This paper proposed Seq-CVAE which learns a latent space for every word to capture the "intention" about how to complete the sentence by mimicking a representation which summarizes the future.

...read moreread less

Abstract: Diverse and accurate vision+language modeling is an important goal to retain creative freedom and maintain user engagement. However, adequately capturing the intricacies of diversity in language models is challenging. Recent works commonly resort to latent variable models augmented with more or less supervision from object detectors or part-of-speech tags. In common to all those methods is the fact that the latent variable either only initializes the sentence generation process or is identical across the steps of generation. Both methods offer no fine-grained control. To address this concern, we propose Seq-CVAE which learns a latent space for every word. We encourage this temporal latent space to capture the 'intention' about how to complete the sentence by mimicking a representation which summarizes the future. We illustrate the efficacy of the proposed approach on the challenging MSCOCO dataset, significantly improving diversity metrics compared to baselines while performing on par w.r.t. sentence quality.

...read moreread less