Journal ArticleDOI
Vocabulary-Wide Credit Assignment for Training Image Captioning Models
TLDR
Vocabulary-Critical Sequence Training (VCST) is proposed, which assigns every word in the vocabulary an appropriate credit at each generation step and can be incorporated into existing RL methods for training image captioning models to achieve better results.
Abstract:
Reinforcement learning (RL) algorithms have been shown to be efficient in training image captioning models. A critical step in RL algorithms is to assign credit to appropriate actions. Existing RL methods for image captioning mainly use two classes of credit assignment: assigning a single credit to the whole sentence, and assigning a credit to every word in the sentence. In this article, we propose a new credit assignment method that is orthogonal to the above two: it assigns every word in the vocabulary an appropriate credit at each generation step, and is therefore called vocabulary-wide credit assignment. Based on this, we propose Vocabulary-Critical Sequence Training (VCST). VCST can be incorporated into existing RL methods for training image captioning models to achieve better results. Extensive experiments with many popular models validated the effectiveness of VCST.
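The three granularities of credit assignment described in the abstract can be contrasted in a minimal NumPy sketch. The values below are random placeholders standing in for learned estimates, not the paper's actual estimator; only the shapes of the three credit structures are the point.

```python
import numpy as np

# Illustrative sketch: credit assignment for a generated caption of
# T words drawn from a vocabulary of size V.
T, V = 4, 10
rng = np.random.default_rng(0)

# 1) Sentence-level: one scalar credit (e.g. a sequence reward minus a
#    baseline) broadcast to every word actually generated.
sentence_credit = np.full(T, 0.7)

# 2) Word-level: one credit per generated word in the sentence.
word_credit = rng.uniform(0.0, 1.0, size=T)

# 3) Vocabulary-wide (the idea behind VCST): a credit for EVERY
#    vocabulary entry at EVERY generation step, not only the words
#    that were actually sampled.
vocab_credit = rng.uniform(-1.0, 1.0, size=(T, V))

print(sentence_credit.shape, word_credit.shape, vocab_credit.shape)
```

The vocabulary-wide structure is a T-by-V matrix rather than a length-T vector, which is what makes it orthogonal to (and combinable with) the other two schemes.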
Citations
Proceedings ArticleDOI
SGNet: A Super-class Guided Network for Image Classification and Object Detection
TL;DR: SGNet is a super-class guided network that integrates high-level semantic information into the network to increase its inference performance; it takes two-level class annotations that contain both super-class and finer-class labels.
Journal ArticleDOI
Visual Cluster Grounding for Image Captioning
TL;DR: Zhang et al. propose a novel grounding model that implicitly links words to evidence in the image, encouraging the captioner to focus on informative object regions, which can be either discriminative parts or the full object content.
Journal ArticleDOI
Vision-Enhanced and Consensus-Aware Transformer for Image Captioning
TL;DR: A Vision-enhanced and Consensus-aware Transformer (VCT) is proposed to exploit both visual information and consensus knowledge for image captioning, with three key components: a vision-enhancing encoder, a consensus-aware knowledge representation generator, and a consensus-aware decoder.
Journal ArticleDOI
I²Transformer: Intra- and Inter-Relation Embedding Transformer for TV Show Captioning
TL;DR: Wu et al. propose an Intra- and Inter-relation Embedding Transformer (I²Transformer), consisting of an intra-relation embedding block (IAE) and an inter-relation embedding block (IEE) under the Transformer framework.
References
Proceedings ArticleDOI
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
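The residual idea summarized above, output = F(x) + x, so the stacked layers only need to learn the residual F, can be sketched in a toy NumPy block. The shapes and the two-layer form are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def residual_block(x, w1, w2):
    """Toy residual block: relu(F(x) + x), where F is a small
    two-layer transformation. The skip connection lets the block
    default to (near-)identity when F contributes little."""
    h = np.maximum(0.0, x @ w1)            # first layer + ReLU
    return np.maximum(0.0, h @ w2 + x)     # second layer plus skip connection

x = np.ones((1, 8))
w1 = np.zeros((8, 8))
w2 = np.zeros((8, 8))
# With zero weights the residual F(x) is zero, so the block passes
# the (non-negative) input straight through.
out = residual_block(x, w1, w2)
print(out)
```

This pass-through behavior at F ≈ 0 is what eases optimization of very deep stacks: each block starts close to the identity mapping.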
Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
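The Adam update described above (adaptive estimates of the first and second gradient moments, with bias correction) can be written in a few lines. This is a plain scalar sketch of the published update rule, applied to a toy quadratic objective.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update on a scalar parameter.

    m, v: running estimates of the first and second gradient moments;
    t: 1-based step count, used for bias correction."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)              # bias-corrected second moment
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta**2 (gradient 2*theta) from theta = 1.0.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.05)
print(theta)
```

Note the effective step size is roughly lr times the sign of the (smoothed) gradient while gradients are consistent, which is why a single learning rate works across very differently scaled parameters.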
Journal ArticleDOI
Long short-term memory
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
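The gated, additive cell-state update at the heart of the LSTM can be sketched as a single-step cell in NumPy. This is a generic textbook formulation with illustrative shapes, not the paper's exact constant-error-carousel construction; the key point is that c is updated additively through the gates, which is what preserves error flow over long lags.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h, c, W):
    """One step of a minimal LSTM cell (biases omitted for brevity)."""
    z = np.concatenate([x, h])     # concatenated input and previous hidden state
    i = sigmoid(W["i"] @ z)        # input gate
    f = sigmoid(W["f"] @ z)        # forget gate
    o = sigmoid(W["o"] @ z)        # output gate
    g = np.tanh(W["g"] @ z)        # candidate cell update
    c = f * c + i * g              # additive cell-state update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d_x, d_h = 3, 2                    # illustrative input / hidden sizes
W = {k: 0.1 * rng.normal(size=(d_h, d_x + d_h)) for k in "ifog"}
h, c = np.zeros(d_h), np.zeros(d_h)
for _ in range(5):                 # run a few time steps
    h, c = lstm_cell(rng.normal(size=d_x), h, c, W)
print(h.shape, c.shape)
```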
Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieving state-of-the-art performance on English-to-French translation.
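The core attention operation the architecture is built on, scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, can be sketched directly. The shapes below are illustrative; multi-head projections and masking are omitted.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))   # 6 key positions
V = rng.normal(size=(6, 8))   # one value vector per key
out = attention(Q, K, V)
print(out.shape)   # (4, 8)
```

Each output row is a convex combination of the value rows, weighted by how well the corresponding query matches each key.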
Proceedings ArticleDOI
ImageNet: A large-scale hierarchical image database
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.