Open Access · Book Chapter · DOI

“Factual” or “Emotional”: Stylized Image Captioning with Adaptive Learning and Attention

TL;DR: This paper proposes a novel stylized image captioning model that effectively takes both factual and stylized knowledge into consideration and outperforms state-of-the-art approaches without using extra ground-truth supervision.
Abstract
Generating stylized captions for an image is an emerging topic in image captioning. Given an image as input, the system must generate a caption that has a specific style (e.g., humorous, romantic, positive, or negative) while describing the image content accurately. In this paper, we propose a novel stylized image captioning model that effectively takes both requirements into consideration. To this end, we first devise a new variant of LSTM, named style-factual LSTM, as the building block of our model. It uses two groups of matrices to capture factual and stylized knowledge, respectively, and automatically learns word-level weights between the two groups based on the previous context. In addition, when training the model to capture stylized elements, we propose an adaptive learning approach based on a reference factual model, which provides factual knowledge to the model as it learns from stylized caption labels and adaptively computes how much information to supply at each time step. We evaluate our model on two stylized image captioning datasets, which contain humorous/romantic captions and positive/negative captions, respectively. Experiments show that our proposed model outperforms state-of-the-art approaches, without using extra ground-truth supervision.
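The core idea of the style-factual LSTM can be sketched as follows: two groups of weight matrices (factual and stylized) produce separate gate preactivations, and a learned word-level weight mixes them based on the previous hidden state. This is a minimal illustrative sketch in NumPy, not the authors' implementation; all names (`W_f`, `W_s`, `w_g`, etc.) and the scalar form of the mixing gate are assumptions for clarity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class StyleFactualCell:
    """Sketch of one style-factual LSTM step: factual (W_f, U_f) and
    stylized (W_s, U_s) matrix groups are mixed per word by a weight
    g computed from the previous context h_{t-1}."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        d = 4 * hidden_dim  # stacked i, f, o, c~ gate preactivations
        self.W_f = rng.standard_normal((d, input_dim)) * 0.1
        self.U_f = rng.standard_normal((d, hidden_dim)) * 0.1
        self.W_s = rng.standard_normal((d, input_dim)) * 0.1
        self.U_s = rng.standard_normal((d, hidden_dim)) * 0.1
        self.b = np.zeros(d)
        self.w_g = rng.standard_normal(hidden_dim) * 0.1  # mixing-gate params
        self.hidden_dim = hidden_dim

    def step(self, x_t, h_prev, c_prev):
        # Word-level weight between the factual and stylized groups,
        # computed from the previous hidden state (previous context).
        g = sigmoid(self.w_g @ h_prev)  # scalar in (0, 1)
        z_fact = self.W_f @ x_t + self.U_f @ h_prev
        z_style = self.W_s @ x_t + self.U_s @ h_prev
        z = g * z_fact + (1.0 - g) * z_style + self.b
        H = self.hidden_dim
        i = sigmoid(z[:H])          # input gate
        f = sigmoid(z[H:2 * H])     # forget gate
        o = sigmoid(z[2 * H:3 * H]) # output gate
        c_hat = np.tanh(z[3 * H:])  # candidate cell state
        c = f * c_prev + i * c_hat
        h = o * np.tanh(c)
        return h, c, g

# Usage: one decoding step with a toy word embedding.
cell = StyleFactualCell(input_dim=8, hidden_dim=16)
h, c = np.zeros(16), np.zeros(16)
h, c, g = cell.step(np.ones(8), h, c)
```

When `g` is near 1 the cell behaves like a purely factual LSTM; as `g` decreases, the stylized matrices dominate, which is how a per-word balance between content and style can emerge during decoding.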



Citations
Posted Content

Evaluation of Text Generation: A Survey

TL;DR: This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.
Proceedings Article · DOI

Reasoning Visual Dialogs With Structural and Partial Observations

TL;DR: This paper introduces an Expectation Maximization algorithm to infer both the underlying dialog structures and the missing node values (desired answers) and proposes a differentiable graph neural network (GNN) solution that approximates this process.
Proceedings Article · DOI

MSCap: Multi-Style Image Captioning With Unpaired Stylized Text

TL;DR: An adversarial learning network is proposed for the task of multi-style image captioning (MSCap) with a standard factual image caption dataset and a multi-stylized language corpus without paired images to enable more natural and human-like captions.
Proceedings Article · DOI

Human-like Controllable Image Captioning with Verb-specific Semantic Roles

TL;DR: In this article, the authors propose a new control signal for CIC, Verb-specific Semantic Roles (VSR), which consists of a verb and some semantic roles, which represents a targeted activity and the roles of entities involved in this activity.
Journal Article · DOI

MemCap: Memorizing Style Knowledge for Image Captioning

TL;DR: This paper proposes MemCap, a novel stylized image captioning method that explicitly encodes knowledge about linguistic styles with a memory mechanism, extracts content-relevant style knowledge from the memory module via attention, and incorporates the extracted knowledge into a language model.
References
Proceedings Article · DOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Posted Content

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: In this paper, the authors propose to use a soft-searching model to find the parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Proceedings Article

Sequence to Sequence Learning with Neural Networks

TL;DR: The authors used a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector.
Book ChapterDOI

Perceptual Losses for Real-Time Style Transfer and Super-Resolution

TL;DR: In this paper, the authors combine the benefits of both approaches and propose the use of perceptual loss functions for training feed-forward networks for image style transfer, where a feed-forward network is trained to solve, in real time, the optimization problem proposed by Gatys et al.