Open Access · Journal Article

Aligning Where to See and What to Tell: Image Captioning with Region-Based Attention and Scene-Specific Contexts

TLDR
This paper proposes an image captioning system that exploits the parallel structures between images and sentences, and introduces scene-specific contexts that capture higher-level semantic information encoded in an image.
Abstract
Recent progress on automatic generation of image captions has shown that it is possible to describe the most salient information conveyed by images with accurate and meaningful sentences. In this paper, we propose an image captioning system that exploits the parallel structures between images and sentences. In our model, the process of generating the next word, given the previously generated ones, is aligned with the visual perception experience, where attention shifts among the visual regions; such transitions impose a thread of ordering in visual perception. This alignment characterizes the flow of latent meaning, which encodes what is semantically shared by both the visual scene and the text description. Our system also makes a second modeling contribution by introducing scene-specific contexts that capture higher-level semantic information encoded in an image. These contexts adapt the language models for word generation to specific scene types. We benchmark our system against published results on several popular datasets, using both automatic evaluation metrics and human evaluation. We show that adding either region-based attention or scene-specific contexts improves over systems lacking those components, and that combining the two attains state-of-the-art performance.
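To make the two ingredients concrete, here is a minimal decoding-step sketch in PyTorch. It is a hypothetical illustration, not the authors' implementation: the class, names such as `region_feats` and `scene_ctx`, and all dimensions are assumptions, and the scene context is simply concatenated into the decoder input as a stand-in for the paper's scene-adapted language models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttentionDecoder(nn.Module):
    """Hypothetical sketch: at each step, attend over region features and
    condition word generation on the attended context plus a scene vector."""
    def __init__(self, vocab_size, feat_dim=512, hidden_dim=512, scene_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)   # attention scorer
        self.lstm = nn.LSTMCell(hidden_dim + feat_dim + scene_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word, region_feats, scene_ctx, state):
        # word: LongTensor of shape (1,); region_feats: (num_regions, feat_dim)
        # scene_ctx: (scene_dim,); state: (h, c), each (1, hidden_dim)
        h, c = state
        scores = self.attn(torch.cat(
            [region_feats, h.expand(region_feats.size(0), -1)], dim=1))
        alpha = F.softmax(scores, dim=0)              # where to see
        context = (alpha * region_feats).sum(dim=0)   # attended region summary
        x = torch.cat([self.embed(word).squeeze(0), context, scene_ctx])
        h, c = self.lstm(x.unsqueeze(0), (h, c))      # what to tell
        return self.out(h), (h, c)
```

At each step, the attention weights `alpha` decide where to see, and the LSTM state, conditioned on the attended context and the scene vector, decides what to tell.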


Citations
Book Chapter

Exploring Visual Relationship for Image Captioning

TL;DR: Proposes GCN-LSTM, which uses graph convolutional networks with an attention mechanism to explore the connections between objects for image captioning under the attention-based encoder-decoder framework.
Posted Content

Exploring Visual Relationship for Image Captioning.

TL;DR: Introduces a new design that explores the connections between objects for image captioning under the attention-based encoder-decoder framework, integrating both semantic and spatial object relationships into the image encoder.
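The core idea, propagating information between detected objects along a relationship graph before decoding, can be illustrated with a toy graph-convolution layer. This is a hypothetical sketch, not the paper's GCN-LSTM: the class name, dimensions, and the simple row-normalized aggregation are all assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class RelationGCNLayer(nn.Module):
    """Toy graph convolution over detected object features: each region
    aggregates features from related regions before decoding."""
    def __init__(self, dim=512):
        super().__init__()
        self.w_self = nn.Linear(dim, dim)
        self.w_neigh = nn.Linear(dim, dim)

    def forward(self, obj_feats, adj):
        # obj_feats: (num_objects, dim); adj: (num_objects, num_objects)
        # Row-normalize the relationship graph so messages are averaged.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        messages = (adj / deg) @ obj_feats
        return F.relu(self.w_self(obj_feats) + self.w_neigh(messages))
```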
Journal Article

A survey on automatic image caption generation

TL;DR: Presents a survey of advances in image captioning research, discussing the mainly retrieval- and template-based methods of early work before turning to neural network-based methods.
Journal Article

Deep multi-path convolutional neural network joint with salient region attention for facial expression recognition

TL;DR: Proposes a novel model, the Deep Attentive Multi-path Convolutional Neural Network (DAM-CNN), which can automatically locate expression-related regions in an expressional image and yield a robust image representation for facial expression recognition (FER).
Posted Content

A Survey of the Usages of Deep Learning in Natural Language Processing

TL;DR: Provides an introduction to the field and a brief overview of deep learning architectures and methods, followed by a discussion of the current state of the art and recommendations for future research.
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: The authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously; an ensemble of these residual nets won first place on the ILSVRC 2015 classification task.
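A minimal sketch of the basic residual block, assuming PyTorch; the published architecture also uses projection shortcuts and bottleneck variants, which are omitted here.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: learn F(x) and output F(x) + x, so very deep
    stacks only need to learn perturbations around the identity."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # identity shortcut
```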
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
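The update rule is compact enough to sketch in NumPy; parameter names follow the paper's notation, with the default hyperparameters it suggests.

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: adaptive estimates of the first moment (m) and
    second raw moment (v) of the gradient, with bias correction."""
    m = b1 * m + (1 - b1) * grads        # first-moment estimate
    v = b2 * v + (1 - b2) * grads ** 2   # second-moment estimate
    m_hat = m / (1 - b1 ** t)            # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v
```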
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: Presents a deep convolutional neural network, consisting of five convolutional layers (some followed by max-pooling layers) and three fully-connected layers with a final 1000-way softmax, that achieved state-of-the-art performance on ImageNet classification.
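The layer layout described above can be sketched as follows, assuming PyTorch; details of the original network such as local response normalization, dropout, and the two-GPU split are omitted.

```python
import torch.nn as nn

# Layout sketch for 3x227x227 input: 5 conv layers (some followed by
# max pooling), then 3 fully-connected layers ending in a 1000-way softmax.
alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),  # logits; softmax is applied in the loss
)
```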
Journal Article

Long short-term memory

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
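A sketch of one LSTM step, assuming PyTorch tensors; the cell state plays the role of the constant error carousel.

```python
import torch

def lstm_cell(x, h, c, W, U, b):
    """One LSTM step. The cell state c is the 'constant error carousel':
    it is updated only by elementwise gating, so error can flow across
    many time steps without vanishing.
    Shapes: x (in_dim,), h and c (hidden,), W (in_dim, 4*hidden),
    U (hidden, 4*hidden), b (4*hidden,)."""
    gates = x @ W + h @ U + b
    i, f, g, o = gates.chunk(4, dim=-1)          # input, forget, cell, output
    i, f, o = i.sigmoid(), f.sigmoid(), o.sigmoid()
    c = f * c + i * g.tanh()                     # gated carousel update
    h = o * c.tanh()
    return h, c
```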
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: The authors investigate the effect of convolutional network depth on accuracy in the large-scale image recognition setting, showing that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
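The configurations in the paper are stacks of small 3x3 convolutions; a VGG-16-style feature extractor can be sketched as follows (PyTorch assumed, channel numbers from the 16-layer configuration).

```python
import torch.nn as nn

# Depth comes from stacking 3x3 convolutions. Numbers are output
# channels; 'M' marks a 2x2 max-pooling layer.
cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
       512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg, in_ch=3):
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, 3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)
```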