Open Access · Posted Content

Neural Baby Talk

TL;DR
The authors propose generating a sentence template whose slot locations are explicitly tied to specific image regions; the slots are then filled with visual concepts identified in those regions by object detectors, achieving state-of-the-art performance on both standard image captioning and novel object captioning.
Abstract
We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image. Our approach reconciles classical slot filling approaches (that are generally better grounded in images) with modern neural captioning approaches (that are generally more natural sounding and accurate). Our approach first generates a sentence 'template' with slot locations explicitly tied to specific image regions. These slots are then filled in by visual concepts identified in the regions by object detectors. The entire architecture (sentence template generation and slot filling with object detectors) is end-to-end differentiable. We verify the effectiveness of our proposed model on different image captioning tasks. On standard image captioning and novel object captioning, our model reaches state-of-the-art on both COCO and Flickr30k datasets. We also demonstrate that our model has unique advantages when the train and test distributions of scene compositions -- and hence language priors of associated captions -- are different. Code has been made available at: this https URL
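The slot-filling idea can be made concrete with a short sketch. The Python snippet below is purely illustrative (the Detection and fill_template names are hypothetical, not the authors' code): a decoder emits a template whose slot tokens point at image regions, and detector labels fill those slots. In the actual model, the template and the slot/region pointers are generated jointly by a neural decoder and the whole pipeline is trained end-to-end; only the final composition step is shown here.

```python
# Illustrative sketch of the slot-filling step from the abstract -- not the
# authors' implementation; all names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Detection:
    """One region proposal from an object detector."""
    label: str      # visual concept, e.g. "puppy"
    score: float    # detector confidence
    box: tuple      # (x1, y1, x2, y2)

def fill_template(template, detections):
    """Replace slot tokens ("<slot>", region_idx) with detector labels."""
    words = []
    for token in template:
        if isinstance(token, tuple) and token[0] == "<slot>":
            words.append(detections[token[1]].label)  # word grounded in a region
        else:
            words.append(token)
    return " ".join(words)

detections = [Detection("puppy", 0.92, (30, 40, 200, 220)),
              Detection("cake", 0.88, (210, 150, 330, 260))]
template = ["a", ("<slot>", 0), "sitting", "next", "to", "a", ("<slot>", 1)]
print(fill_template(template, detections))  # a puppy sitting next to a cake
```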


Citations
Posted Content

VideoBERT: A Joint Model for Video and Language Representation Learning.

TL;DR: This article proposes a joint visual-linguistic model that, inspired by the recent success of BERT in language modeling, learns high-level features without any explicit supervision; it outperforms the state of the art on video captioning, and quantitative results verify that the model learns high-level semantic features.
Book Chapter (DOI)

Graph R-CNN for Scene Graph Generation

TL;DR: A novel scene graph generation model called Graph R-CNN, which is both effective and efficient at detecting objects and their relations in images, is proposed, along with a new evaluation metric that is more holistic and realistic than existing ones.
Posted Content

Auto-Encoding Scene Graphs for Image Captioning

TL;DR: The Scene Graph Auto-Encoder (SGAE) as discussed by the authors incorporates the language inductive bias into the encoder-decoder image captioning framework for more human-like captions.
Proceedings Article (DOI)

12-in-1: Multi-Task Vision and Language Representation Learning

TL;DR: This paper investigates the relationships between vision-and-language tasks by developing a large-scale multi-task model, trained jointly on 12 datasets from four broad categories of tasks: visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification.
Posted Content

Graph R-CNN for Scene Graph Generation

TL;DR: Graph R-CNN, as presented in this paper, proposes a relation proposal network (RePN) that efficiently deals with the quadratic number of potential relations between objects in an image, and an attentional graph convolutional network (aGCN) that effectively captures contextual information between objects and relations.
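As a rough illustration of the RePN idea (a hypothetical sketch, not the paper's code): score every ordered pair of detected objects with a cheap factored function and keep only the top-K pairs, so the expensive downstream relation classifier never sees the full quadratic set.

```python
# Hypothetical sketch of relation-proposal pruning in the spirit of RePN.
# Scores all ordered (subject, object) pairs, then keeps the top-K.
import numpy as np

rng = np.random.default_rng(0)
n_objects, feat_dim, top_k = 6, 8, 5

obj_feats = rng.normal(size=(n_objects, feat_dim))   # per-object detector features
W_subj = rng.normal(size=(feat_dim, feat_dim))       # learned in practice
W_obj = rng.normal(size=(feat_dim, feat_dim))

# Factored pair score <W_subj s_i, W_obj o_j>: projecting features first costs
# O(n d^2) + O(n^2 d) rather than O(n^2 d^2) for a per-pair bilinear form.
scores = (obj_feats @ W_subj) @ (obj_feats @ W_obj).T
np.fill_diagonal(scores, -np.inf)                    # no self-relations

flat = np.argsort(scores, axis=None)[::-1][:top_k]   # highest scores first
pairs = np.column_stack(np.unravel_index(flat, scores.shape))
print(pairs)  # top-K (subject_idx, object_idx) candidates for the aGCN stage
```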
References
Proceedings Article (DOI)

Deep Residual Learning for Image Recognition

TL;DR: The authors propose a residual learning framework to ease the training of networks substantially deeper than those used previously; an ensemble of these residual networks won first place in the ILSVRC 2015 classification task.
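For reference, a minimal PyTorch sketch of the basic residual block (identity shortcut, stride 1, equal channel counts; the layer layout is illustrative):

```python
# Minimal residual block: the stacked layers learn F(x), the block outputs
# F(x) + x, so gradients also flow unchanged through the identity shortcut.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + x)   # identity shortcut

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```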
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
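The update rule itself is compact enough to transcribe directly. Below is one Adam step in plain NumPy (hyperparameter defaults follow the paper; the toy run raises the step size to 1e-2 so the quadratic converges quickly):

```python
# One Adam step: biased moment estimates, bias correction, parameter update.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad**2     # second moment (uncentered variance)
    m_hat = m / (1 - b1**t)             # bias correction for zero-initialized m
    v_hat = v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta); v = np.zeros_like(theta)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=1e-2)
print(theta)  # close to [0, 0]
```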
Journal Article (DOI)

Long Short-Term Memory

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through "constant error carousels" within special units.
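A single cell step makes the "constant error carousel" concrete. The NumPy sketch below uses the modern formulation with a forget gate (the original 1997 cell lacked one); the additive cell-state update is what preserves error flow across long lags:

```python
# One LSTM cell step (modern variant with a forget gate).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    z = W @ x + U @ h + b                  # all four gates in one affine map
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)                         # candidate cell values
    c = f * c + i * g                      # additive update: the error carousel
    h = o * np.tanh(c)                     # gated output / new hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for _ in range(5):                         # unroll a few timesteps
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```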
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small (3x3) convolution filters, and shows that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
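The "very small filters" design has a simple parameter-count argument behind it: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, with fewer parameters and one extra nonlinearity. A quick check, assuming C input and output channels:

```python
# Parameter count: one 5x5 conv layer vs. two stacked 3x3 conv layers,
# both mapping C channels to C channels (biases ignored).
C = 256
params_5x5 = 5 * 5 * C * C            # 1,638,400
params_two_3x3 = 2 * (3 * 3 * C * C)  # 1,179,648  (~28% fewer)
print(params_5x5, params_two_3x3)
```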