Topic

Closed captioning

About: Closed captioning is a research topic. Over its lifetime, 3,011 publications have been published within this topic, receiving 64,494 citations. The topic is also known as: CC.

Papers published on a yearly basis

Papers

Journal Article

Multi-Level Policy and Reward-Based Deep Reinforcement Learning Framework for Image Captioning

Ning Xu¹, Hanwang Zhang², An-An Liu¹, Weizhi Nie¹, Yuting Su¹, Jie Nie³, Yongdong Zhang⁴

Tianjin University¹, Nanyang Technological University², Ocean University of China³, University of Science and Technology of China⁴

01 May 2020-IEEE Transactions on Multimedia

TL;DR: A novel multi-level policy and reward RL framework for image captioning that can be easily integrated with RNN-based captioning models, language metrics, or visual-semantic functions for optimization and achieves competitive performances on a variety of evaluation metrics.

Abstract: Image captioning is one of the most challenging tasks in AI because it requires an understanding of both complex visuals and natural language. Because image captioning is essentially a sequential prediction task, recent advances in image captioning have used reinforcement learning (RL) to better explore the dynamics of word-by-word generation. However, the existing RL-based image captioning methods rely primarily on a single policy network and reward function—an approach that is not well matched to the multi-level (word and sentence) and multi-modal (vision and language) nature of the task. To solve this problem, we propose a novel multi-level policy and reward RL framework for image captioning that can be easily integrated with RNN-based captioning models, language metrics, or visual-semantic functions for optimization. Specifically, the proposed framework includes two modules: 1) a multi-level policy network that jointly updates the word- and sentence-level policies for word generation; and 2) a multi-level reward function that collaboratively leverages both a vision-language reward and a language-language reward to guide the policy. Furthermore, we propose a guidance term to bridge the policy and the reward for RL optimization. The extensive experiments on the MSCOCO and Flickr30k datasets and the analyses show that the proposed framework achieves competitive performances on a variety of evaluation metrics. In addition, we conduct ablation studies on multiple variants of the proposed framework and explore several representative image captioning models and metrics for the word-level policy network and the language-language reward function to evaluate the generalization ability of the proposed framework.

71 citations

Posted Content

Watch What You Just Said: Image Captioning with Text-Conditional Attention

Luowei Zhou¹, Chenliang Xu², Parker A. Koch¹, Jason J. Corso¹

University of Michigan¹, University of Rochester²

15 Jun 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: The experimental results show that the proposed novel attention mechanism, called text-conditional attention, outperforms state-of-the-art captioning methods on various quantitative metrics as well as in human evaluation, which supports the use of text-conditional attention in image captioning.

Abstract: Attention mechanisms have attracted considerable interest in image captioning due to their strong performance. However, existing methods use only visual content as attention, and whether textual context can improve attention in image captioning remains an open question. To explore this problem, we propose a novel attention mechanism, called text-conditional attention, which allows the caption generator to focus on certain image features given previously generated text. To obtain text-related image features for our attention model, we adopt the guiding Long Short-Term Memory (gLSTM) captioning architecture with CNN fine-tuning. Our proposed method allows joint learning of the image embedding, text embedding, text-conditional attention and language model with one network architecture in an end-to-end manner. We perform extensive experiments on the MS-COCO dataset. The experimental results show that our method outperforms state-of-the-art captioning methods on various quantitative metrics as well as in human evaluation, which supports the use of our text-conditional attention in image captioning.

71 citations

Posted Content

Stack-Captioning: Coarse-to-Fine Learning for Image Captioning

Jiuxiang Gu¹, Jianfei Cai¹, Gang Wang², Tsuhan Chen¹

Nanyang Technological University¹, Alibaba Group²

11 Sep 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: The authors proposed a coarse-to-fine multi-stage prediction framework for image captioning, composed of multiple decoders each of which operates on the output of the previous stage, producing increasingly refined image descriptions.

Abstract: Existing image captioning approaches typically train a one-stage sentence decoder, which struggles to generate rich, fine-grained descriptions. On the other hand, multi-stage image captioning models are hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multi-stage prediction framework for image captioning, composed of multiple decoders, each of which operates on the output of the previous stage, producing increasingly refined image descriptions. Our proposed learning approach addresses the difficulty of vanishing gradients during training by providing a learning objective function that enforces intermediate supervision. In particular, we optimize our model with a reinforcement learning approach that uses the output of each intermediate decoder's test-time inference algorithm, as well as the output of its preceding decoder, to normalize the rewards, which simultaneously addresses the well-known exposure bias problem and the loss-evaluation mismatch problem. We extensively evaluate the proposed approach on MSCOCO and show that it can achieve state-of-the-art performance.
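
The reward normalization mentioned in the abstract can be pictured, very roughly, as subtracting baseline rewards from the reward of a sampled caption before a policy-gradient update. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and averaging the two baselines is an assumption, since the abstract does not say how the test-time output and the preceding decoder's output are combined.

```python
def normalized_reward(sampled_reward, greedy_reward, previous_stage_reward):
    """Advantage of a sampled caption relative to two baseline rewards.

    Assumption: the two baselines are averaged; the paper may combine
    them differently.
    """
    baseline = 0.5 * (greedy_reward + previous_stage_reward)
    return sampled_reward - baseline


def policy_gradient_loss(log_probs, advantage):
    """REINFORCE-style loss for one caption: -advantage * sum of word log-probs."""
    return -advantage * sum(log_probs)


# Hypothetical scores: the sampled caption beats both baselines,
# so its advantage is positive and its words are reinforced.
advantage = normalized_reward(1.0, 0.8, 0.6)
loss = policy_gradient_loss([-0.5, -0.25], advantage)
```

A caption that scores below the baseline gets a negative advantage, so minimizing the loss lowers the probability of its words.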

70 citations

Journal Article

More is Better: Precise and Detailed Image Captioning Using Online Positive Recall and Missing Concepts Mining

Ming-Xing Zhang¹, Yang Yang¹, Hanwang Zhang², Yanli Ji¹, Heng Tao Shen¹, Tat-Seng Chua³

University of Electronic Science and Technology of China¹, Nanyang Technological University², National University of Singapore³

01 Jan 2019-IEEE Transactions on Image Processing

TL;DR: This paper adaptively re-weights the loss of different samples according to their predictions for online positive recall and uses a two-stage optimization strategy for missing concepts mining, which achieves superior image captioning performance compared with other competitive methods.

Abstract: Recently, great progress in automatic image captioning has been achieved by using semantic concepts detected from the image. However, we argue that the existing concepts-to-caption framework, in which the concept detector is trained using image-caption pairs to minimize the vocabulary discrepancy, suffers from insufficient concepts. The reasons are two-fold: 1) the extreme imbalance between the numbers of positive and negative samples of each concept and 2) the incomplete labeling in training captions caused by biased annotation and the use of synonyms. In this paper, we propose a method, termed online positive recall and missing concepts mining, to overcome these problems. Our method adaptively re-weights the loss of different samples according to their predictions for online positive recall and uses a two-stage optimization strategy for missing concepts mining. In this way, more semantic concepts can be detected and high accuracy can be expected. In the caption generation stage, we explore an element-wise selection process to automatically choose the most suitable concepts at each time step. Thus, our method can generate more precise and detailed captions to describe the image. We conduct extensive experiments on the MSCOCO image captioning data set and the MSCOCO online test server, which show that our method achieves superior image captioning performance compared with other competitive methods.
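
To make the idea of re-weighting a sample's loss according to its prediction more concrete, here is a minimal sketch for a single concept. It is not the paper's formulation: the weighting rule `1 + alpha * (1 - prob)`, the parameter `alpha`, and the function name are all assumptions chosen for illustration. The sketch simply raises the loss on positive samples that the detector currently scores low, which is one plausible way to favor positive recall.

```python
import math


def reweighted_bce(prob, label, alpha=1.0):
    """Binary cross-entropy for one concept with prediction-dependent weighting.

    Assumption: positives get weight 1 + alpha * (1 - prob), so positives
    the detector currently misses contribute more; negatives keep weight 1.
    """
    eps = 1e-12  # avoids log(0)
    if label == 1:
        weight = 1.0 + alpha * (1.0 - prob)
        return -weight * math.log(prob + eps)
    return -math.log(1.0 - prob + eps)


# A missed positive (low predicted probability) is penalized more heavily
# than a confidently detected one.
missed = reweighted_bce(0.1, 1)
detected = reweighted_bce(0.9, 1)
```

Setting `alpha=0` recovers plain binary cross-entropy, which makes it easy to compare against the unweighted baseline.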

70 citations

Proceedings Article

Learning to Collocate Neural Modules for Image Captioning

Xu Yang¹, Hanwang Zhang¹, Jianfei Cai¹

Nanyang Technological University¹

01 Oct 2019

TL;DR: The authors proposed learning to Collocate Neural Modules (CNM) to generate the "inner pattern" connecting the visual encoder and the language decoder, which achieved state-of-the-art image captioning performance.

Abstract: We do not speak word by word from scratch; our brain quickly structures a pattern like "sth do sth at someplace" and then fills in the detailed description. To endow existing encoder-decoder image captioners with such human-like reasoning, we propose a novel framework: learning to Collocate Neural Modules (CNM), to generate the "inner pattern" connecting the visual encoder and the language decoder. Unlike the widely used neural module networks in visual Q&A, where the language (i.e., the question) is fully observable, CNM for captioning is more challenging because the language is being generated and is thus only partially observable. To this end, we make the following technical contributions for CNM training: 1) a compact module design: one module for function words and three for visual content words (e.g., noun, adjective, and verb); 2) soft module fusion and multi-step module execution, which make the visual reasoning robust under partial observation; and 3) a linguistic loss that keeps the module controller faithful to part-of-speech collocations (e.g., an adjective comes before a noun). Extensive experiments on the challenging MS-COCO image captioning benchmark validate the effectiveness of our CNM image captioner. In particular, CNM achieves a new state-of-the-art 127.9 CIDEr-D on the Karpathy split and a single-model 126.0 c40 on the official server. CNM is also robust to few training samples; e.g., by training on only one sentence per image, CNM can halve the performance loss compared to a strong baseline.

69 citations

Network Information

Performance

Metrics

Papers: 4,575

Citations: 96,790

No. of papers in the topic in previous years
Year	Papers
2023	536
2022	1,030
2021	504
2020	530
2019	448
2018	334