Open Access · Posted Content

Neural Baby Talk

TL;DR
The authors propose generating a sentence template whose slot locations are explicitly tied to specific image regions; the slots are then filled with visual concepts identified in those regions by object detectors, achieving state-of-the-art performance on both standard image captioning and novel object captioning.
Abstract
We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image. Our approach reconciles classical slot filling approaches (that are generally better grounded in images) with modern neural captioning approaches (that are generally more natural sounding and accurate). Our approach first generates a sentence 'template' with slot locations explicitly tied to specific image regions. These slots are then filled in by visual concepts identified in the regions by object detectors. The entire architecture (sentence template generation and slot filling with object detectors) is end-to-end differentiable. We verify the effectiveness of our proposed model on different image captioning tasks. On standard image captioning and novel object captioning, our model reaches state-of-the-art on both COCO and Flickr30k datasets. We also demonstrate that our model has unique advantages when the train and test distributions of scene compositions -- and hence language priors of associated captions -- are different. Code has been made available at: this https URL
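The slot-filling idea can be made concrete with a short sketch. The Python snippet below is purely illustrative (the Detection and fill_template names are hypothetical, not the authors' code): a decoder emits a template whose slot tokens point at image regions, and detector labels fill those slots. In the actual model, the template and the slot/region pointers are generated jointly by a neural decoder and the whole pipeline is trained end-to-end; only the final composition step is shown here.

```python
# Illustrative sketch of the slot-filling step from the abstract -- not the
# authors' implementation; all names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Detection:
    """One region proposal from an object detector."""
    label: str      # visual concept, e.g. "puppy"
    score: float    # detector confidence
    box: tuple      # (x1, y1, x2, y2)

def fill_template(template, detections):
    """Replace slot tokens ("<slot>", region_idx) with detector labels."""
    words = []
    for token in template:
        if isinstance(token, tuple) and token[0] == "<slot>":
            words.append(detections[token[1]].label)  # word grounded in a region
        else:
            words.append(token)
    return " ".join(words)

detections = [Detection("puppy", 0.92, (30, 40, 200, 220)),
              Detection("cake", 0.88, (210, 150, 330, 260))]
template = ["a", ("<slot>", 0), "sitting", "next", "to", "a", ("<slot>", 1)]
print(fill_template(template, detections))  # a puppy sitting next to a cake
```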


Citations
Posted Content

VideoBERT: A Joint Model for Video and Language Representation Learning.

TL;DR: This article proposes a joint visual-linguistic model that, inspired by the recent success of BERT in language modeling, learns high-level features without any explicit supervision; it outperforms the state of the art on video captioning, and quantitative results verify that the model learns high-level semantic features.
Book Chapter (DOI)

Graph R-CNN for Scene Graph Generation

TL;DR: A novel scene graph generation model called Graph R-CNN, which is both effective and efficient at detecting objects and their relations in images, is proposed, along with a new evaluation metric that is more holistic and realistic than existing ones.
Posted Content

Auto-Encoding Scene Graphs for Image Captioning

TL;DR: The Scene Graph Auto-Encoder (SGAE) as discussed by the authors incorporates the language inductive bias into the encoder-decoder image captioning framework for more human-like captions.
Proceedings Article (DOI)

12-in-1: Multi-Task Vision and Language Representation Learning

TL;DR: This paper investigates the relationships between vision-and-language tasks by developing a large-scale multi-task model, trained jointly on 12 datasets from four broad categories of tasks: visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification.
Posted Content

Graph R-CNN for Scene Graph Generation

TL;DR: Graph R-CNN, as presented in this paper, proposes a relation proposal network (RePN) that efficiently deals with the quadratic number of potential relations between objects in an image, and an attentional graph convolutional network (aGCN) that effectively captures contextual information between objects and relations.
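As a rough illustration of the RePN idea (a hypothetical sketch, not the paper's code): score every ordered pair of detected objects with a cheap factored function and keep only the top-K pairs, so the expensive downstream relation classifier never sees the full quadratic set.

```python
# Hypothetical sketch of relation-proposal pruning in the spirit of RePN.
# Scores all ordered (subject, object) pairs, then keeps the top-K.
import numpy as np

rng = np.random.default_rng(0)
n_objects, feat_dim, top_k = 6, 8, 5

obj_feats = rng.normal(size=(n_objects, feat_dim))   # per-object detector features
W_subj = rng.normal(size=(feat_dim, feat_dim))       # learned in practice
W_obj = rng.normal(size=(feat_dim, feat_dim))

# Factored pair score <W_subj s_i, W_obj o_j>: projecting features first costs
# O(n d^2) + O(n^2 d) rather than O(n^2 d^2) for a per-pair bilinear form.
scores = (obj_feats @ W_subj) @ (obj_feats @ W_obj).T
np.fill_diagonal(scores, -np.inf)                    # no self-relations

flat = np.argsort(scores, axis=None)[::-1][:top_k]   # highest scores first
pairs = np.column_stack(np.unravel_index(flat, scores.shape))
print(pairs)  # top-K (subject_idx, object_idx) candidates for the aGCN stage
```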
References
Proceedings Article (DOI)

Deep Residual Learning for Image Recognition

TL;DR: The authors propose a residual learning framework to ease the training of networks substantially deeper than those used previously; an ensemble of these residual networks won first place in the ILSVRC 2015 classification task.
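For reference, a minimal PyTorch sketch of the basic residual block (identity shortcut, stride 1, equal channel counts; the layer layout is illustrative):

```python
# Minimal residual block: the stacked layers learn F(x), the block outputs
# F(x) + x, so gradients also flow unchanged through the identity shortcut.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + x)   # identity shortcut

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```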
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
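The update rule itself is compact enough to transcribe directly. Below is one Adam step in plain NumPy (hyperparameter defaults follow the paper; the toy run raises the step size to 1e-2 so the quadratic converges quickly):

```python
# One Adam step: biased moment estimates, bias correction, parameter update.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad**2     # second moment (uncentered variance)
    m_hat = m / (1 - b1**t)             # bias correction for zero-initialized m
    v_hat = v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta); v = np.zeros_like(theta)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=1e-2)
print(theta)  # close to [0, 0]
```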
Journal Article (DOI)

Long Short-Term Memory

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through "constant error carousels" within special units.
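A single cell step makes the "constant error carousel" concrete. The NumPy sketch below uses the modern formulation with a forget gate (the original 1997 cell lacked one); the additive cell-state update is what preserves error flow across long lags:

```python
# One LSTM cell step (modern variant with a forget gate).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    z = W @ x + U @ h + b                  # all four gates in one affine map
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)                         # candidate cell values
    c = f * c + i * g                      # additive update: the error carousel
    h = o * np.tanh(c)                     # gated output / new hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for _ in range(5):                         # unroll a few timesteps
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```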
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small (3x3) convolution filters, and shows that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
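The "very small filters" design has a simple parameter-count argument behind it: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, with fewer parameters and one extra nonlinearity. A quick check, assuming C input and output channels:

```python
# Parameter count: one 5x5 conv layer vs. two stacked 3x3 conv layers,
# both mapping C channels to C channels (biases ignored).
C = 256
params_5x5 = 5 * 5 * C * C            # 1,638,400
params_two_3x3 = 2 * (3 * 3 * C * C)  # 1,179,648  (~28% fewer)
print(params_5x5, params_two_3x3)
```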