Journal ArticleDOI

Vocabulary-Wide Credit Assignment for Training Image Captioning Models

TLDR
Vocabulary-Critical Sequence Training (VCST) is proposed, which assigns every word in the vocabulary an appropriate credit at each generation step and can be incorporated into existing RL methods for training image captioning models to achieve better results.
Abstract
Reinforcement learning (RL) algorithms have been shown to be efficient in training image captioning models. A critical step in RL algorithms is to assign credits to appropriate actions. Existing RL methods for image captioning mainly use two classes of credit assignment: assigning a single credit to the whole sentence, or assigning a credit to every word in the sentence. In this article, we propose a new credit assignment method that is orthogonal to the above two: it assigns every word in the vocabulary an appropriate credit at each generation step, and is therefore called vocabulary-wide credit assignment. Based on this, we propose Vocabulary-Critical Sequence Training (VCST). VCST can be incorporated into existing RL methods for training image captioning models to achieve better results. Extensive experiments with many popular models validated the effectiveness of VCST.
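To make the distinction concrete, here is a minimal numpy sketch contrasting the shapes of the three credit-assignment schemes. The abstract does not specify how VCST computes its credits, so `vocab_credits` below is a random placeholder and all variable names are hypothetical; only the shape of the credit signal is the point.

```python
import numpy as np

# Hypothetical sizes: T generation steps, V vocabulary words.
T, V = 3, 5
rng = np.random.default_rng(0)
sampled = rng.integers(0, V, size=T)  # word ids of one sampled caption

# 1) Sentence-level credit (e.g., SCST-style): one scalar for the whole
#    caption, applied only to the sampled word at each step.
sentence_credit = 0.7
credit_sentence = np.zeros((T, V))
credit_sentence[np.arange(T), sampled] = sentence_credit

# 2) Word-level credit: one scalar per sampled word.
word_credits = rng.normal(size=T)
credit_word = np.zeros((T, V))
credit_word[np.arange(T), sampled] = word_credits

# 3) Vocabulary-wide credit (the VCST idea, schematically): a dense (T, V)
#    matrix, so every vocabulary word receives a signal at every step.
vocab_credits = rng.normal(size=(T, V))  # placeholder for the paper's credits

for name, c in [("sentence", credit_sentence),
                ("word", credit_word),
                ("vocab", vocab_credits)]:
    print(f"{name}-level: {np.count_nonzero(c, axis=1)} nonzero credits per step")
```

The first two schemes only ever touch the T sampled words, while the vocabulary-wide scheme provides a training signal for all T x V step-word pairs, which is what makes it orthogonal to, and composable with, the other two.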


Citations
Proceedings ArticleDOI

SGNet: A Super-class Guided Network for Image Classification and Object Detection

TL;DR: SGNet proposes a super-class guided network that integrates high-level semantic information into the network to improve inference performance; it takes two-level class annotations containing both super-class and finer class labels.
Journal ArticleDOI

Visual Cluster Grounding for Image Captioning

TL;DR: Zhang et al. proposed a novel grounding model that implicitly links words to their visual evidence in the image, encouraging the captioner to focus on informative object regions, which may be either discriminative parts or full object content.
Journal ArticleDOI

Vision-Enhanced and Consensus-Aware Transformer for Image Captioning

TL;DR: A Vision-enhanced and Consensus-aware Transformer (VCT) is proposed to exploit both visual information and consensus knowledge for image captioning, with three key components: a vision-enhanced encoder, a consensus-aware knowledge representation generator, and a consensus-aware decoder.
Journal ArticleDOI

I²Transformer: Intra- and Inter-Relation Embedding Transformer for TV Show Captioning

TL;DR: Wu et al. proposed an Intra- and Inter-relation Embedding Transformer (I²Transformer) consisting of an intra-relation embedding block (IAE) and an inter-relation embedding block (IEE) under the framework of the Transformer.
References
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: The authors propose a residual learning framework to ease the training of networks substantially deeper than those used previously; the resulting residual networks won 1st place on the ILSVRC 2015 classification task.
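As a reminder of the core idea, here is a minimal numpy sketch of one residual block computing y = F(x) + x, where the identity shortcut lets the stacked layers fit a residual mapping instead of the full transformation. The helper name and toy weights are illustrative, not the paper's architecture.

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = relu(F(x) + x): the stacked layers learn the residual F,
    while the identity shortcut carries x through unchanged."""
    h = np.maximum(0.0, W1 @ x)          # first layer + ReLU
    return np.maximum(0.0, W2 @ h + x)   # second layer + shortcut + ReLU

rng = np.random.default_rng(0)
D = 8
x = rng.normal(size=D)
W1 = 0.1 * rng.normal(size=(D, D))
W2 = 0.1 * rng.normal(size=(D, D))
print(residual_block(x, W1, W2))
```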
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
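The update rule behind the "adaptive estimates of lower-order moments" in the TL;DR is compact enough to sketch directly; the moment updates and bias correction below follow the published algorithm, while the function name and the toy quadratic objective are illustrative.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update from first (m) and second (v) moment estimates."""
    m = beta1 * m + (1 - beta1) * grad       # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(x) = x^2 (gradient 2x) starting from x = 5.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.05)
print(theta)  # close to the minimum at 0
```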
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
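A single cell step makes the "constant error carousel" concrete: the cell state is updated additively under gate control, which is what keeps error flow stable over long lags. Note this sketch uses the now-standard formulation with a forget gate, which the original 1997 paper did not include; all names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; the additive update of c is the error carousel."""
    H = h.shape[0]
    z = W @ x + U @ h + b            # all four pre-activations, shape (4H,)
    i = sigmoid(z[0:H])              # input gate
    f = sigmoid(z[H:2 * H])          # forget gate
    o = sigmoid(z[2 * H:3 * H])      # output gate
    g = np.tanh(z[3 * H:4 * H])      # candidate cell update
    c = f * c + i * g                # additive, gated cell-state update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
D, H = 4, 3
W, U, b = rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for _ in range(5):
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
print(h)
```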
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes the Transformer, a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieving state-of-the-art performance on English-to-French translation.
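The building block the title refers to is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Here is a minimal single-head numpy sketch (no masking, batching, or multi-head projections).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (n_q, d_v) weighted values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))   # 2 queries
K = rng.normal(size=(5, 8))   # 5 keys
V = rng.normal(size=(5, 8))   # 5 values
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 8)
```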
Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.