Open Access · Proceedings Article (DOI)

Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

TL;DR
This paper presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery, and uses it to provide the Room-to-Room (R2R) dataset, the first benchmark for visually-grounded natural language navigation in real buildings.
Abstract
A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. It is a dream that remains stubbornly distant. However, recent advances in vision and language methods have made incredible progress in closely related areas. This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering. Both tasks can be interpreted as visually grounded sequence-to-sequence translation problems, and many of the same methods are applicable. To enable and encourage the application of vision and language methods to the problem of interpreting visually-grounded navigation instructions, we present the Matterport3D Simulator - a large-scale reinforcement learning environment based on real imagery [11]. Using this simulator, which can in future support a range of embodied vision and language tasks, we provide the first benchmark dataset for visually-grounded natural language navigation in real buildings - the Room-to-Room (R2R) dataset.
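
As a rough sketch of the sequence-to-sequence framing described in the abstract, the following illustrates an instruction-following agent: an encoder LSTM reads the instruction, and a decoder predicts a discrete action at each step from the current image feature. It assumes PyTorch; the dimensions, the 6-way action space, and all names are illustrative, not the authors' actual model.

# Minimal sketch of the seq2seq framing: an LSTM encodes the instruction,
# and a decoder LSTM predicts one of a small set of discrete actions at
# each step, conditioned on the current image feature. Assumes PyTorch;
# dimensions and the 6-way action space are illustrative assumptions.
import torch
import torch.nn as nn

class Seq2SeqNavAgent(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512,
                 img_feat_dim=2048, num_actions=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTMCell(img_feat_dim, hidden_dim)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, instruction_tokens, img_feats):
        # instruction_tokens: (batch, seq_len); img_feats: (batch, steps, img_feat_dim)
        _, (h, c) = self.encoder(self.embed(instruction_tokens))
        h, c = h.squeeze(0), c.squeeze(0)  # init decoder from instruction encoding
        logits = []
        for t in range(img_feats.size(1)):
            h, c = self.decoder(img_feats[:, t], (h, c))
            logits.append(self.action_head(h))
        return torch.stack(logits, dim=1)  # (batch, steps, num_actions)

agent = Seq2SeqNavAgent(vocab_size=1000)
out = agent(torch.randint(0, 1000, (2, 12)), torch.randn(2, 5, 2048))
print(out.shape)  # torch.Size([2, 5, 6])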

Citations
Posted Content

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

TL;DR: ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language, is presented; it extends the popular BERT architecture to a multi-modal two-stream model, processing visual and textual inputs in separate streams that interact through co-attentional transformer layers.
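
A minimal sketch of the co-attention idea described in this summary, assuming PyTorch: each stream attends to the other stream's keys and values. Dimensions and names are illustrative assumptions, not ViLBERT's actual layer implementation.

# Minimal sketch of co-attention: each stream queries the other stream's
# keys/values. Assumes PyTorch; an illustration of the mechanism only,
# not ViLBERT's actual layer.
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.txt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, txt, img):
        # txt: (batch, n_tokens, dim); img: (batch, n_regions, dim)
        txt_out, _ = self.txt_to_img(query=txt, key=img, value=img)
        img_out, _ = self.img_to_txt(query=img, key=txt, value=txt)
        return txt_out, img_out

layer = CoAttentionLayer()
t, v = layer(torch.randn(2, 20, 768), torch.randn(2, 36, 768))
print(t.shape, v.shape)  # (2, 20, 768) (2, 36, 768)
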
Proceedings Article (DOI)

Habitat: A Platform for Embodied AI Research

TL;DR: The comparison between learning and SLAM approaches from two recent works is revisited, and evidence is found that learning outperforms SLAM when scaled to an order of magnitude more experience than in previous investigations; the first cross-dataset generalization experiments are also conducted.
Journal Article (DOI)

Cognitive Mapping and Planning for Visual Navigation

TL;DR: The Cognitive Mapper and Planner is based on a unified joint architecture for mapping and planning, in which mapping is driven by the needs of the task, and on a spatial memory with the ability to plan given an incomplete set of observations about the world.
Posted Content

The Replica Dataset: A Digital Replica of Indoor Spaces

TL;DR: Replica, a dataset of 18 highly photo-realistic 3D indoor scene reconstructions at room and building scale, is introduced to enable machine learning (ML) research that relies on visually, geometrically, and semantically realistic generative models of the world.
References
Proceedings Article (DOI)

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; their entry won 1st place in the ILSVRC 2015 classification task.
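
A minimal sketch of a residual block in this spirit, assuming PyTorch: the stacked layers learn a residual function F(x) that is added back to the identity shortcut x. Channel counts are illustrative assumptions.

# Minimal sketch of a residual block: the stacked layers learn a residual
# F(x) that is added back to the identity shortcut x. Assumes PyTorch;
# channel counts are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # identity shortcut: output = F(x) + x

block = ResidualBlock()
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
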
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
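
A worked sketch of the Adam update described here, using the paper's default hyperparameters: exponential moving averages estimate the first and second moments of the gradient, with bias correction for their zero initialization.

# Minimal sketch of the Adam update: moving averages of the gradient
# (first moment) and squared gradient (second moment), bias-corrected.
# Hyperparameter defaults follow the paper.
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Example: minimize f(x) = x^2 from x = 5 (gradient is 2x)
x, m, v = np.array(5.0), 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.05)
print(round(float(x), 4))  # close to 0.0
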
Journal Article (DOI)

Long Short-Term Memory

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
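
A minimal sketch of one LSTM step (the standard modern variant with a forget gate, which was a later addition to the 1997 architecture): the additive cell-state update is what keeps error flow roughly constant. Weights here are random, for illustration only.

# Minimal sketch of one LSTM step: gated updates keep the cell state's
# error flow roughly constant (the "constant error carousel").
# Modern variant with a forget gate; weights are random for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    # W: (4*hidden, input+hidden); all four gates computed in one matmul
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c + i * g          # additive cell update preserves gradient flow
    h = o * np.tanh(c)
    return h, c

hidden, inp = 8, 4
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4 * hidden, inp + hidden)), np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.normal(size=inp), h, c, W, b)
print(h.shape, c.shape)  # (8,) (8,)
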
Journal Article (DOI)

ImageNet Large Scale Visual Recognition Challenge

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images; it has been run annually from 2010 to the present, attracting participation from more than fifty institutions.
Proceedings Article

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of the basic encoder-decoder architecture, and it is proposed to extend it by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts into a hard segment explicitly.
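
A minimal sketch of the proposed soft-search (additive attention), with random weight matrices for illustration: each source annotation is scored against the current decoder state, the scores are softmaxed into weights, and the context vector is their weighted sum.

# Minimal sketch of soft attention: score each source annotation against
# the decoder state, softmax into weights, take the weighted sum as the
# context vector. Additive scoring; weights are random for illustration.
import numpy as np

def soft_attention(dec_state, annotations, W_s, W_a, v):
    # annotations: (src_len, dim); dec_state: (dim,)
    scores = np.tanh(annotations @ W_a.T + dec_state @ W_s.T) @ v  # (src_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over source positions
    context = weights @ annotations             # (dim,) weighted sum
    return context, weights

dim, src_len = 16, 7
rng = np.random.default_rng(1)
W_s, W_a, v = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim)), rng.normal(size=dim)
ctx, w = soft_attention(rng.normal(size=dim), rng.normal(size=(src_len, dim)), W_s, W_a, v)
print(ctx.shape, w.sum())  # (16,) 1.0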