Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, Anton van den Hengel
- pp. 3674-3683
TL;DR: The Room-to-Room (R2R) dataset as mentioned in this paper provides a large-scale reinforcement learning environment based on real imagery for visually-grounded natural language navigation in real buildings.
Citations
Posted Content
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
TL;DR: ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language, is presented, extending the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.
Proceedings ArticleDOI
Habitat: A Platform for Embodied AI Research
Manolis Savva, Jitendra Malik, Devi Parikh, Dhruv Batra, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun
TL;DR: The comparison between learning and SLAM approaches from two recent works is revisited, and evidence is found that learning outperforms SLAM if scaled to an order of magnitude more experience than in previous investigations; the first cross-dataset generalization experiments are also conducted.
Journal ArticleDOI
Cognitive Mapping and Planning for Visual Navigation
Saurabh Gupta, Varun Tolani, James Davidson, Sergey Levine, Rahul Sukthankar, Jitendra Malik
TL;DR: The Cognitive Mapper and Planner is based on a unified joint architecture for mapping and planning, such that mapping is driven by the needs of the task, and uses a spatial memory with the ability to plan given an incomplete set of observations about the world.
Posted Content
The Replica Dataset: A Digital Replica of Indoor Spaces.
Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob Engel, Raul Mur-Artal, Carl Yuheng Ren, Shobhit Verma, Anton Clarkson, Yan Mingfei, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, Hauke Strasdat, Renzo De Nardi, Michael Goesele, Steven Lovegrove, Richard Newcombe
TL;DR: Replica, a dataset of 18 highly photo-realistic 3D indoor scene reconstructions at room and building scale, is introduced to enable machine learning (ML) research that relies on visually, geometrically, and semantically realistic generative models of the world.
Posted Content
Habitat: A Platform for Embodied AI Research
Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, Dhruv Batra
TL;DR: Habitat as discussed by the authors is a platform for research in embodied artificial intelligence (AI) that enables training embodied agents (virtual robots) in highly efficient photorealistic 3D simulation.
References
Proceedings ArticleDOI
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
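The core idea summarized above, letting layers learn a residual F(x) on top of an identity skip connection, can be shown in a minimal sketch. This is not the paper's convolutional architecture; `transform` here stands in for the learned layers and the toy function used is purely illustrative:

```python
def residual_block(x, transform):
    """A residual block: output = F(x) + x (elementwise).

    The skip connection means the layers only need to learn the
    residual F(x) = H(x) - x rather than the full mapping H(x);
    if the identity is already optimal, F can be driven to zero,
    which eases optimization of very deep stacks.
    """
    return [f + xi for f, xi in zip(transform(x), x)]


# Hypothetical toy transform: a small elementwise scaling.
out = residual_block([1.0, 2.0], lambda v: [0.1 * t for t in v])
# out is approximately [1.1, 2.2]: input plus a small residual
```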
Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
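The "adaptive estimates of lower-order moments" in the summary above are exponential moving averages of the gradient and its square, with bias correction for their zero initialization. A minimal scalar sketch of one Adam update follows (vectorized and framework versions differ in detail; hyperparameter defaults shown are the commonly cited ones):

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    m: moving average of the gradient (first moment).
    v: moving average of the squared gradient (second moment).
    t: 1-based step count, used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Minimize f(x) = x**2 (gradient 2x) starting from x = 1.0.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 1001):
    x, m, v = adam_step(x, 2 * x, m, v, t)
# x has moved close to the minimum at 0
```

Note that because m_hat / sqrt(v_hat) is roughly unit-scale, the per-step movement is bounded near lr regardless of the raw gradient magnitude, which is what makes Adam relatively insensitive to gradient rescaling.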
Journal ArticleDOI
Long short-term memory
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Journal ArticleDOI
ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, Li Fei-Fei
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Proceedings Article
Neural Machine Translation by Jointly Learning to Align and Translate
TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
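The soft-search described above can be sketched as scoring each encoder state against the decoder query, normalizing the scores with a softmax, and returning the weighted sum as a context vector. This is a minimal dot-product variant for illustration; the original model scores alignments with a small feed-forward network rather than a dot product:

```python
import math

def soft_attention(query, keys, values):
    """Soft-search over source positions.

    Scores each key against the query (dot product here), turns the
    scores into a probability distribution with a numerically stable
    softmax, and returns (weights, context) where context is the
    weight-averaged value vector.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return weights, context


# The query aligns with the first key, so most weight goes there.
w, ctx = soft_attention([1.0, 0.0],
                        [[1.0, 0.0], [0.0, 1.0]],
                        [[10.0, 0.0], [0.0, 10.0]])
```

Because every source position contributes with a differentiable weight, the alignment is learned jointly with translation by ordinary backpropagation, with no hard segmentation of the source sentence.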