Open Access · Book Chapter · DOI

“Factual” or “Emotional”: Stylized Image Captioning with Adaptive Learning and Attention

TL;DR: This paper proposes a novel stylized image captioning model that effectively takes both factual and stylized knowledge into consideration and outperforms state-of-the-art approaches without using extra ground-truth supervision.
Abstract
Generating stylized captions for an image is an emerging topic in image captioning. Given an image as input, the system must generate a caption that has a specific style (e.g., humorous, romantic, positive, or negative) while describing the image content accurately. In this paper, we propose a novel stylized image captioning model that effectively takes both requirements into consideration. To this end, we first devise a new variant of LSTM, named style-factual LSTM, as the building block of our model. It uses two groups of matrices to capture factual and stylized knowledge, respectively, and automatically learns word-level weights between the two groups based on the previous context. In addition, when training the model to capture stylized elements, we propose an adaptive learning approach based on a reference factual model, which provides factual knowledge to the model as it learns from stylized caption labels and adaptively computes how much information to supply at each time step. We evaluate our model on two stylized image captioning datasets, which contain humorous/romantic captions and positive/negative captions, respectively. Experiments show that our proposed model outperforms state-of-the-art approaches, without using extra ground-truth supervision.
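The core idea of the style-factual LSTM can be sketched as follows: two groups of weight matrices (factual and stylized) produce separate gate preactivations, and a learned word-level weight mixes them based on the previous hidden state. This is a minimal illustrative sketch in NumPy, not the authors' implementation; all names (`W_f`, `W_s`, `w_g`, etc.) and the scalar form of the mixing gate are assumptions for clarity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class StyleFactualCell:
    """Sketch of one style-factual LSTM step: factual (W_f, U_f) and
    stylized (W_s, U_s) matrix groups are mixed per word by a weight
    g computed from the previous context h_{t-1}."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        d = 4 * hidden_dim  # stacked i, f, o, c~ gate preactivations
        self.W_f = rng.standard_normal((d, input_dim)) * 0.1
        self.U_f = rng.standard_normal((d, hidden_dim)) * 0.1
        self.W_s = rng.standard_normal((d, input_dim)) * 0.1
        self.U_s = rng.standard_normal((d, hidden_dim)) * 0.1
        self.b = np.zeros(d)
        self.w_g = rng.standard_normal(hidden_dim) * 0.1  # mixing-gate params
        self.hidden_dim = hidden_dim

    def step(self, x_t, h_prev, c_prev):
        # Word-level weight between the factual and stylized groups,
        # computed from the previous hidden state (previous context).
        g = sigmoid(self.w_g @ h_prev)  # scalar in (0, 1)
        z_fact = self.W_f @ x_t + self.U_f @ h_prev
        z_style = self.W_s @ x_t + self.U_s @ h_prev
        z = g * z_fact + (1.0 - g) * z_style + self.b
        H = self.hidden_dim
        i = sigmoid(z[:H])          # input gate
        f = sigmoid(z[H:2 * H])     # forget gate
        o = sigmoid(z[2 * H:3 * H]) # output gate
        c_hat = np.tanh(z[3 * H:])  # candidate cell state
        c = f * c_prev + i * c_hat
        h = o * np.tanh(c)
        return h, c, g

# Usage: one decoding step with a toy word embedding.
cell = StyleFactualCell(input_dim=8, hidden_dim=16)
h, c = np.zeros(16), np.zeros(16)
h, c, g = cell.step(np.ones(8), h, c)
```

When `g` is near 1 the cell behaves like a purely factual LSTM; as `g` decreases, the stylized matrices dominate, which is how a per-word balance between content and style can emerge during decoding.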



Citations
Posted Content

Evaluation of Text Generation: A Survey

TL;DR: This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.
Proceedings Article · DOI

Reasoning Visual Dialogs With Structural and Partial Observations

TL;DR: This paper introduces an Expectation Maximization algorithm to infer both the underlying dialog structures and the missing node values (desired answers) and proposes a differentiable graph neural network (GNN) solution that approximates this process.
Proceedings Article · DOI

MSCap: Multi-Style Image Captioning With Unpaired Stylized Text

TL;DR: An adversarial learning network is proposed for the task of multi-style image captioning (MSCap) with a standard factual image caption dataset and a multi-stylized language corpus without paired images to enable more natural and human-like captions.
Proceedings Article · DOI

Human-like Controllable Image Captioning with Verb-specific Semantic Roles

TL;DR: In this article, the authors propose a new control signal for CIC, Verb-specific Semantic Roles (VSR), which consists of a verb and some semantic roles, which represents a targeted activity and the roles of entities involved in this activity.
Journal Article · DOI

MemCap: Memorizing Style Knowledge for Image Captioning

TL;DR: This paper proposes MemCap, a novel stylized image captioning method that explicitly encodes knowledge about linguistic styles with a memory mechanism, extracts content-relevant style knowledge from the memory module via attention, and incorporates the extracted knowledge into a language model.
References
Proceedings Article · DOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Posted Content

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: In this paper, the authors propose to use a soft-searching model to find the parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Proceedings Article

Sequence to Sequence Learning with Neural Networks

TL;DR: The authors used a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector.
Book ChapterDOI

Perceptual Losses for Real-Time Style Transfer and Super-Resolution

TL;DR: In this paper, the authors combine the benefits of both approaches and propose the use of perceptual loss functions for training feed-forward networks for image style transfer, where a feed-forward network is trained to solve, in real time, the optimization problem proposed by Gatys et al.