Journal ArticleDOI
Vocabulary-Wide Credit Assignment for Training Image Captioning Models
TLDR
Vocabulary-Critical Sequence Training (VCST) is proposed, which assigns every word in the vocabulary an appropriate credit at each generation step and can be incorporated into existing RL methods for training image captioning models to achieve better results.
Abstract:
Reinforcement learning (RL) algorithms have been shown to be efficient in training image captioning models. A critical step in RL algorithms is to assign credit to appropriate actions. Existing RL methods for image captioning mainly use two classes of credit assignment: assigning a single credit to the whole sentence, and assigning a credit to every word in the sentence. In this article, we propose a new credit assignment method that is orthogonal to the above two: it assigns every word in the vocabulary an appropriate credit at each generation step, and is therefore called vocabulary-wide credit assignment. Based on this, we propose Vocabulary-Critical Sequence Training (VCST). VCST can be incorporated into existing RL methods for training image captioning models to achieve better results. Extensive experiments with many popular models validated the effectiveness of VCST.
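The three granularities of credit assignment described in the abstract can be contrasted in a minimal NumPy sketch. The values below are random placeholders standing in for learned estimates, not the paper's actual estimator; only the shapes of the three credit structures are the point.

```python
import numpy as np

# Illustrative sketch: credit assignment for a generated caption of
# T words drawn from a vocabulary of size V.
T, V = 4, 10
rng = np.random.default_rng(0)

# 1) Sentence-level: one scalar credit (e.g. a sequence reward minus a
#    baseline) broadcast to every word actually generated.
sentence_credit = np.full(T, 0.7)

# 2) Word-level: one credit per generated word in the sentence.
word_credit = rng.uniform(0.0, 1.0, size=T)

# 3) Vocabulary-wide (the idea behind VCST): a credit for EVERY
#    vocabulary entry at EVERY generation step, not only the words
#    that were actually sampled.
vocab_credit = rng.uniform(-1.0, 1.0, size=(T, V))

print(sentence_credit.shape, word_credit.shape, vocab_credit.shape)
```

The vocabulary-wide structure is a T-by-V matrix rather than a length-T vector, which is what makes it orthogonal to (and combinable with) the other two schemes.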
Citations
Proceedings ArticleDOI
SGNet: A Super-class Guided Network for Image Classification and Object Detection
TL;DR: SGNet is a super-class guided network that integrates high-level semantic information into the network to increase its inference performance; it takes two-level class annotations that contain both super-class and finer-class labels.
Journal ArticleDOI
Visual Cluster Grounding for Image Captioning
TL;DR: Zhang et al. propose a novel grounding model that implicitly links words to evidence in the image, encouraging the captioner to focus on informative object regions, which can be either discriminative parts or the full object content.
Journal ArticleDOI
Vision-Enhanced and Consensus-Aware Transformer for Image Captioning
TL;DR: A Vision-enhanced and Consensus-aware Transformer (VCT) is proposed to exploit both visual information and consensus knowledge for image captioning, with three key components: a vision-enhancing encoder, a consensus-aware knowledge representation generator, and a consensus-aware decoder.
Journal ArticleDOI
I²Transformer: Intra- and Inter-Relation Embedding Transformer for TV Show Captioning
TL;DR: Wu et al. propose an Intra- and Inter-relation Embedding Transformer (I²Transformer), consisting of an intra-relation embedding block (IAE) and an inter-relation embedding block (IEE) under the Transformer framework.
References
Proceedings ArticleDOI
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
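The residual idea summarized above, output = F(x) + x, so the stacked layers only need to learn the residual F, can be sketched in a toy NumPy block. The shapes and the two-layer form are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def residual_block(x, w1, w2):
    """Toy residual block: relu(F(x) + x), where F is a small
    two-layer transformation. The skip connection lets the block
    default to (near-)identity when F contributes little."""
    h = np.maximum(0.0, x @ w1)            # first layer + ReLU
    return np.maximum(0.0, h @ w2 + x)     # second layer plus skip connection

x = np.ones((1, 8))
w1 = np.zeros((8, 8))
w2 = np.zeros((8, 8))
# With zero weights the residual F(x) is zero, so the block passes
# the (non-negative) input straight through.
out = residual_block(x, w1, w2)
print(out)
```

This pass-through behavior at F ≈ 0 is what eases optimization of very deep stacks: each block starts close to the identity mapping.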
Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
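The Adam update described above (adaptive estimates of the first and second gradient moments, with bias correction) can be written in a few lines. This is a plain scalar sketch of the published update rule, applied to a toy quadratic objective.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update on a scalar parameter.

    m, v: running estimates of the first and second gradient moments;
    t: 1-based step count, used for bias correction."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)              # bias-corrected second moment
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta**2 (gradient 2*theta) from theta = 1.0.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.05)
print(theta)
```

Note the effective step size is roughly lr times the sign of the (smoothed) gradient while gradients are consistent, which is why a single learning rate works across very differently scaled parameters.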
Journal ArticleDOI
Long short-term memory
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
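The gated, additive cell-state update at the heart of the LSTM can be sketched as a single-step cell in NumPy. This is a generic textbook formulation with illustrative shapes, not the paper's exact constant-error-carousel construction; the key point is that c is updated additively through the gates, which is what preserves error flow over long lags.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h, c, W):
    """One step of a minimal LSTM cell (biases omitted for brevity)."""
    z = np.concatenate([x, h])     # concatenated input and previous hidden state
    i = sigmoid(W["i"] @ z)        # input gate
    f = sigmoid(W["f"] @ z)        # forget gate
    o = sigmoid(W["o"] @ z)        # output gate
    g = np.tanh(W["g"] @ z)        # candidate cell update
    c = f * c + i * g              # additive cell-state update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d_x, d_h = 3, 2                    # illustrative input / hidden sizes
W = {k: 0.1 * rng.normal(size=(d_h, d_x + d_h)) for k in "ifog"}
h, c = np.zeros(d_h), np.zeros(d_h)
for _ in range(5):                 # run a few time steps
    h, c = lstm_cell(rng.normal(size=d_x), h, c, W)
print(h.shape, c.shape)
```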
Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieving state-of-the-art performance on English-to-French translation.
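The core attention operation the architecture is built on, scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, can be sketched directly. The shapes below are illustrative; multi-head projections and masking are omitted.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))   # 6 key positions
V = rng.normal(size=(6, 8))   # one value vector per key
out = attention(Q, K, V)
print(out.shape)   # (4, 8)
```

Each output row is a convex combination of the value rows, weighted by how well the corresponding query matches each key.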
Proceedings ArticleDOI
ImageNet: A large-scale hierarchical image database
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.