Showing papers on "Closed captioning published in 2020"

PDF

Open Access

Proceedings Article•DOI•

Meshed-Memory Transformer for Image Captioning

[...]

Marcella Cornia¹, Matteo Stefanini¹, Lorenzo Baraldi¹, Rita Cucchiara¹•Institutions (1)

14 Jun 2020

TL;DR: The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions integrating learned a priori knowledge, and uses a mesh-like connectivity at decoding stage to exploit low- and high-level features.

...read moreread less

Abstract: Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored. With the aim of filling this gap, we present M² - a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions integrating learned a priori knowledge, and uses a mesh-like connectivity at decoding stage to exploit low- and high-level features. Experimentally, we investigate the performance of the M² Transformer and different fully-attentive models in comparison with recurrent ones. When tested on COCO, our proposal achieves a new state of the art in single-model and ensemble configurations on the "Karpathy" test split and on the online test server. We also assess its performances when describing objects unseen in the training set. Trained models and code for reproducing the experiments are publicly available at: https://github.com/aimagelab/meshed-memory-transformer.

...read moreread less

660 citations

Journal Article•DOI•

Unified Vision-Language Pre-Training for Image Captioning and VQA

[...]

Luowei Zhou¹, Hamid Palangi², Lei Zhang², Houdong Hu², Jason J. Corso¹, Jianfeng Gao² - Show less +2 more•Institutions (2)

University of Michigan¹, Microsoft²

03 Apr 2020

TL;DR: VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.

...read moreread less

Abstract: This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.

...read moreread less

636 citations

Proceedings Article•DOI•

X-Linear Attention Networks for Image Captioning

[...]

Yingwei Pan, Ting Yao, Yehao Li, Tao Mei

14 Jun 2020

TL;DR: A unified attention block --- X-Linear attention block, that fully employs bilinear pooling to selectively capitalize on visual information or perform multi-modal reasoning is introduced.

...read moreread less

Abstract: Recent progress on fine-grained visual recognition and visual question answering has featured Bilinear Pooling, which effectively models the 2nd order interactions across multi-modal inputs. Nevertheless, there has not been evidence in support of building such interactions concurrently with attention mechanism for image captioning. In this paper, we introduce a unified attention block --- X-Linear attention block, that fully employs bilinear pooling to selectively capitalize on visual information or perform multi-modal reasoning. Technically, X-Linear attention block simultaneously exploits both the spatial and channel-wise bilinear attention distributions to capture the 2$^{nd}$ order interactions between the input single-modal or multi-modal features. Higher and even infinity order feature interactions are readily modeled through stacking multiple X-Linear attention blocks and equipping the block with Exponential Linear Unit (ELU) in a parameter-free fashion, respectively. Furthermore, we present X-Linear Attention Networks (dubbed as X-LAN) that novelly integrates X-Linear attention block(s) into image encoder and sentence decoder of image captioning model to leverage higher order intra- and inter-modal interactions. The experiments on COCO benchmark demonstrate that our X-LAN obtains to-date the best published CIDEr performance of 132.0% on COCO Karpathy test split. When further endowing Transformer with X-Linear attention blocks, CIDEr is boosted up to 132.8%. Source code is available at https://github.com/Panda-Peter/image-captioning.

...read moreread less

401 citations

Proceedings Article•DOI•

ActBERT: Learning Global-Local Video-Text Representations

[...]

Linchao Zhu¹, Yi Yang¹•Institutions (1)

University of Technology, Sydney¹

14 Jun 2020

TL;DR: This paper introduces ActBERT for self-supervised learning of joint video-text representations from unlabeled data and introduces an ENtangled Transformer block to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions.

...read moreread less

Abstract: In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze the mutual interactions between linguistic texts and local regional objects. It uncovers global and local visual clues from paired video sequences and text descriptions for detailed visual and text relation modeling. Second, we introduce an ENtangled Transformer block (ENT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered via judicious clues extraction from contextual information. It enforces the joint videotext representation to be aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and language tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. ActBERT significantly outperform the state-of-the-arts, demonstrating its superiority in video-text representation learning.

...read moreread less

353 citations

Proceedings Article•DOI•

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

[...]

Linjie Li¹, Yen-Chun Chen¹, Yu Cheng¹, Zhe Gan¹, Licheng Yu¹, Jingjing Liu¹ - Show less +2 more•Institutions (1)

Microsoft¹

01 May 2020

TL;DR: HELP, a novel framework for large-scale video+language omni-representation learning that achieves new state of the art on multiple benchmarks over Text-based Video/Video-moment Retrieval, Video Question Answering (QA), Video-and-language Inference and Video Captioning tasks across different domains is presented.

...read moreread less

Abstract: We present HERO, a novel framework for large-scale video+language omni-representation learning. HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer via multimodal fusion, and global video context is captured by a Temporal Transformer. In addition to standard Masked Language Modeling (MLM) and Masked Frame Modeling (MFM) objectives, we design two new pre-training tasks: (i) Video-Subtitle Matching (VSM), where the model predicts both global and local temporal alignment; and (ii) Frame Order Modeling (FOM), where the model predicts the right order of shuffled video frames. HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions. Comprehensive experiments demonstrate that HERO achieves new state of the art on multiple benchmarks over Text-based Video/Video-moment Retrieval, Video Question Answering (QA), Video-and-language Inference and Video Captioning tasks across different domains. We also introduce two new challenging benchmarks How2QA and How2R for Video QA and Retrieval, collected from diverse video content over multimodalities.

...read moreread less

302 citations

Journal Article•DOI•

STAT: Spatial-Temporal Attention Mechanism for Video Captioning

[...]

Chenggang Yan¹, Yunbin Tu², Wang Xingzheng³, Yongbing Zhang⁴, Xinhong Hao⁵, Yongdong Zhang¹, Qionghai Dai⁴ - Show less +3 more•Institutions (5)

University of Science and Technology of China¹, Hangzhou Dianzi University², Shenzhen University³, Tsinghua University⁴, Beijing Institute of Technology⁵

01 Jan 2020-IEEE Transactions on Multimedia

TL;DR: The proposed spatial-temporal attention mechanism (STAT) within an encoder-decoder neural network for video captioning successfully takes into account both the spatial and temporal structures in a video, so it makes the decoder to automatically select the significant regions in the most relevant temporal segments for word prediction.

...read moreread less

Abstract: Video captioning refers to automatic generate natural language sentences, which summarize the video contents. Inspired by the visual attention mechanism of human beings, temporal attention mechanism has been widely used in video description to selectively focus on important frames. However, most existing methods based on temporal attention mechanism suffer from the problems of recognition error and detail missing, because temporal attention mechanism cannot further catch significant regions in frames. In order to address above problems, we propose the use of a novel spatial-temporal attention mechanism (STAT) within an encoder-decoder neural network for video captioning. The proposed STAT successfully takes into account both the spatial and temporal structures in a video, so it makes the decoder to automatically select the significant regions in the most relevant temporal segments for word prediction. We evaluate our STAT on two well-known benchmarks: MSVD and MSR-VTT-10K. Experimental results show that our proposed STAT achieves the state-of-the-art performance with several popular evaluation metrics: BLEU-4, METEOR, and CIDEr.

...read moreread less

251 citations

Proceedings Article•DOI•

Object Relational Graph With Teacher-Recommended Learning for Video Captioning

[...]

Ziqi Zhang¹, Yaya Shi², Chunfeng Yuan, Bing Li, Peijin Wang¹, Weiming Hu¹, Zheng-Jun Zha² - Show less +3 more•Institutions (2)

Chinese Academy of Sciences¹, University of Science and Technology of China²

14 Jun 2020

TL;DR: Zhang et al. as mentioned in this paper proposed an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation, and designed a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.

...read moreread less

Abstract: Taking full advantage of the information from both vision and language is critical for the video captioning task. Existing models lack adequate visual representation due to the neglect of interaction between object, and sufficient training for content-related words due to long-tailed problems. In this paper, we propose a complete video captioning system including both a novel model and an effective training strategy. Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation. Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model. The ELM generates more semantically similar word proposals which extend the groundtruth words used for training to deal with the long-tailed problem. Experimental evaluations on three benchmarks: MSVD, MSR-VTT and VATEX show the proposed ORG-TRL system achieves state-of-the-art performance. Extensive ablation studies and visualizations illustrate the effectiveness of our system.

...read moreread less

225 citations

Journal Article•DOI•

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

[...]

Jun Yu¹, Jing Li¹, Zhou Yu¹, Qingming Huang²•Institutions (2)

Hangzhou Dianzi University¹, Chinese Academy of Sciences²

01 Dec 2020-IEEE Transactions on Circuits and Systems for Video Technology

TL;DR: Inspired by the success of the Transformer model in machine translation, this work extends it to a Multimodal Transformer (MT) model for image captioning that significantly outperforms the previous state-of-the-art methods.

...read moreread less

Abstract: Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolution neural network (CNN)-based image encoder that extracts region-based visual features from the input image, and an recurrent neural network (RNN) based caption decoder that generates the output caption words based on the visual features with the attention mechanism. Despite the success of existing studies, current methods only model the co-attention that characterizes the inter-modal interactions while neglecting the self-attention that characterizes the intra-modal interactions. Inspired by the success of the Transformer model in machine translation, here we extend it to a Multimodal Transformer (MT) model for image captioning. Compared to existing image captioning approaches, the MT model simultaneously captures intra- and inter-modal interactions in a unified attention block. Due to the in-depth modular composition of such attention blocks, the MT model can perform complex multimodal reasoning and output accurate captions. Moreover, to further improve the image captioning performance, multi-view visual features are seamlessly introduced into the MT model. We quantitatively and qualitatively evaluate our approach using the benchmark MSCOCO image captioning dataset and conduct extensive ablation studies to investigate the reasons behind its effectiveness. The experimental results show that our method significantly outperforms the previous state-of-the-art methods. With an ensemble of seven models, our solution ranks the 1st place on the real-time leaderboard of the MSCOCO image captioning challenge at the time of the writing of this paper.

...read moreread less

206 citations

Proceedings Article•DOI•

Spatio-Temporal Graph for Video Captioning With Knowledge Distillation

[...]

Boxiao Pan¹, Haoye Cai¹, De-An Huang¹, Kuan-Hui Lee², Adrien Gaidon², Ehsan Adeli¹, Juan Carlos Niebles¹ - Show less +3 more•Institutions (2)

Stanford University¹, Toyota²

14 Jun 2020

TL;DR: This paper proposed a spatio-temporal graph model for video captioning that exploits object interactions in space and time to build interpretable links and is able to provide explicit visual grounding, and further proposed an object-aware knowledge distillation mechanism, in which local object information is used to regularize global scene features.

...read moreread less

Abstract: Video captioning is a challenging task that requires a deep understanding of visual scenes. State-of-the-art methods generate captions using either scene-level or object-level information but without explicitly modeling object interactions. Thus, they often fail to make visually grounded predictions, and are sensitive to spurious correlations. In this paper, we propose a novel spatio-temporal graph model for video captioning that exploits object interactions in space and time. Our model builds interpretable links and is able to provide explicit visual grounding. To avoid unstable performance caused by the variable number of objects, we further propose an object-aware knowledge distillation mechanism, in which local object information is used to regularize global scene features. We demonstrate the efficacy of our approach through extensive experiments on two benchmarks, showing our approach yields competitive performance with interpretable predictions.

...read moreread less

201 citations

Journal Article•DOI•

Hierarchical LSTMs with Adaptive Attention for Visual Captioning

[...]

Lianli Gao¹, Xiangpeng Li¹, Jingkuan Song¹, Heng Tao Shen¹•Institutions (1)

University of Electronic Science and Technology of China¹

01 May 2020-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: A hierarchical LSTM with adaptive attention (hLSTMat) approach for image and video captioning that utilizes the spatial or temporal attention for selecting specific regions or frames to predict the related words, while the adaptive attention is for deciding whether to depend on the visual information or the language context information.

...read moreread less

Abstract: Recent progress has been made in using attention based encoder-decoder framework for image and video captioning. Most existing decoders apply the attention mechanism to every generated word including both visual words (e.g., “gun” and “shooting”) and non-visual words (e.g., “the”, “a”). However, these non-visual words can be easily predicted using natural language model without considering visual signals or attention. Imposing attention mechanism on non-visual words could mislead and decrease the overall performance of visual captioning. Furthermore, the hierarchy of LSTMs enables more complex representation of visual data, capturing information at different scales. Considering these issues, we propose a hierarchical LSTM with adaptive attention (hLSTMat) approach for image and video captioning. Specifically, the proposed framework utilizes the spatial or temporal attention for selecting specific regions or frames to predict the related words, while the adaptive attention is for deciding whether to depend on the visual information or the language context information. Also, a hierarchical LSTMs is designed to simultaneously consider both low-level visual information and high-level language context information to support the caption generation. We design the hLSTMat model as a general framework, and we first instantiate it for the task of video captioning. Then, we further instantiate our hLSTMarefine it and apply it to the imioning task. To demonstrate the effectiveness of our proposed framework, we test our method on both video and image captioning tasks. Experimental results show that our approach achieves the state-of-the-art performance for most of the evaluation metrics on both tasks. The effect of important components is also well exploited in the ablation study.

...read moreread less

192 citations

Journal Article•DOI•

Stimulus-driven and concept-driven analysis for image caption generation

[...]

Songtao Ding, Shiru Qu¹, Yuling Xi¹, Shaohua Wan²•Institutions (2)

Northwestern Polytechnical University¹, Zhongnan University of Economics and Law²

20 Jul 2020-Neurocomputing

TL;DR: This work introduces the theory of attention in psychology to image caption generation with a combination of convolutional neural network over images and long-short term memory network over sentences.

...read moreread less

Proceedings Article•DOI•

Clotho: an Audio Captioning Dataset

[...]

Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen

04 May 2020

TL;DR: Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results, is presented.

...read moreread less

Abstract: Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online 1.

...read moreread less

Proceedings Article•DOI•

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

[...]

Jie Lei¹, Liwei Wang², Yelong Shen³, Dong Yu⁴, Tamara L. Berg¹, Mohit Bansal¹ - Show less +2 more•Institutions (4)

University of North Carolina at Chapel Hill¹, The Chinese University of Hong Kong², Microsoft³, Tencent⁴

01 Jul 2020

TL;DR: Memory-Augmented Recurrent Transformer (MART) uses a memory module to augment the transformer architecture to generate a highly summarized memory state from the video segments and the sentence history so as to help better prediction of the next sentence (w.r.t. coreference and repetition aspects).

...read moreread less

Abstract: Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks due to its high requirements for not only visual relevance but also discourse-based coherence across the sentences in the paragraph. Towards this goal, we propose a new approach called Memory-Augmented Recurrent Transformer (MART), which uses a memory module to augment the transformer architecture. The memory module generates a highly summarized memory state from the video segments and the sentence history so as to help better prediction of the next sentence (w.r.t. coreference and repetition aspects), thus encouraging coherent paragraph generation. Extensive experiments, human evaluations, and qualitative analyses on two popular datasets ActivityNet Captions and YouCookII show that MART generates more coherent and less repetitive paragraph captions than baseline methods, while maintaining relevance to the input video events.

...read moreread less

Journal Article•DOI•

When Radiology Report Generation Meets Knowledge Graph

[...]

Yixiao Zhang¹, Xiaosong Wang², Ziyue Xu², Qihang Yu¹, Alan L. Yuille¹, Daguang Xu² - Show less +2 more•Institutions (2)

Johns Hopkins University¹, Nvidia²

03 Apr 2020

TL;DR: Experimental results demonstrate the superior performance of the methods integrated with the proposed graph embedding module on a publicly accessible dataset (IU-RR) of chest radiographs compared with previous approaches using both the conventional evaluation metrics commonly adopted for image captioning and the proposed ones.

...read moreread less

Abstract: Automatic radiology report generation has been an attracting research problem towards computer-aided diagnosis to alleviate the workload of doctors in recent years. Deep learning techniques for natural image captioning are successfully adapted to generating radiology reports. However, radiology image reporting is different from the natural image captioning task in two aspects: 1) the accuracy of positive disease keyword mentions is critical in radiology image reporting in comparison to the equivalent importance of every single word in a natural image caption; 2) the evaluation of reporting quality should focus more on matching the disease keywords and their associated attributes instead of counting the occurrence of N-gram. Based on these concerns, we propose to utilize a pre-constructed graph embedding module (modeled with a graph convolutional neural network) on multiple disease findings to assist the generation of reports in this work. The incorporation of knowledge graph allows for dedicated feature learning for each disease finding and the relationship modeling between them. In addition, we proposed a new evaluation metric for radiology image reporting with the assistance of the same composed graph. Experimental results demonstrate the superior performance of the methods integrated with the proposed graph embedding module on a publicly accessible dataset (IU-RR) of chest radiographs compared with previous approaches using both the conventional evaluation metrics commonly adopted for image captioning and our proposed ones.

...read moreread less

Journal Article•DOI•

Learning visual relationship and context-aware attention for image captioning

[...]

Junbo Wang¹, Wei Wang¹, Liang Wang¹, Liang Wang², Zhiyong Wang³, David Dagan Feng³, Tieniu Tan¹, Tieniu Tan² - Show less +4 more•Institutions (3)

Chinese Academy of Sciences¹, Center for Excellence in Education², University of Sydney³

01 Feb 2020-Pattern Recognition

TL;DR: A novel method to implicitly model the relationship among regions of interest in an image with a graph neural network, as well as a novel context-aware attention mechanism to guide attention selection by fully memorizing previously attended visual content are proposed.

...read moreread less

Proceedings Article•DOI•

Normalized and Geometry-Aware Self-Attention Network for Image Captioning

[...]

Longteng Guo¹, Jing Liu¹, Xinxin Zhu¹, Peng Yao², Shichen Lu³, Hanqing Lu¹ - Show less +2 more•Institutions (3)

Chinese Academy of Sciences¹, University of Science and Technology Beijing², Wuhan University³

14 Jun 2020

TL;DR: In this article, Zhang et al. proposed a Geometry-aware self-attention (GSA) to explicitly and efficiently consider the relative geometry relations between the objects in the image.

...read moreread less

Abstract: Self-attention (SA) network has shown profound value in image captioning. In this paper, we improve SA from two aspects to promote the performance of image captioning. First, we propose Normalized Self-Attention (NSA), a reparameterization of SA that brings the benefits of normalization inside SA. While normalization is previously only applied outside SA, we introduce a novel normalization method and demonstrate that it is both possible and beneficial to perform it on the hidden activations inside SA. Second, to compensate for the major limit of Transformer that it fails to model the geometry structure of the input objects, we propose a class of Geometry-aware Self-Attention (GSA) that extends SA to explicitly and efficiently consider the relative geometry relations between the objects in the image. To construct our image captioning model, we combine the two modules and apply it to the vanilla self-attention network. We extensively evaluate our proposals on MS-COCO image captioning dataset and superior results are achieved when comparing to state-of-the-art approaches. Further experiments on three challenging tasks, i.e. video captioning, machine translation, and visual question answering, show the generality of our methods.

...read moreread less

Journal Article•DOI•

Context-Aware Visual Policy Network for Fine-Grained Image Captioning.

[...]

Zheng-Jun Zha¹, Daqing Liu¹, Hanwang Zhang², Yongdong Zhang¹, Feng Wu¹ - Show less +1 more•Institutions (2)

University of Science and Technology of China¹, Nanyang Technological University²

01 Jan 2020-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: A Context-Aware Visual Policy network (CAVP) is proposed for fine-grained image-to-language generation: image sentence captioning and image paragraph captioning, which can attend to complex visual compositions over time.

...read moreread less

Abstract: With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained and free-form language, i.e., the task of image captioning. In particular, we are interested in generating longer, richer and more fine-grained sentences and paragraphs as image descriptions. Image captioning can be translated to the task of sequential language prediction given visual content, where the output sequence forms natural language description with plausible grammar. However, existing image captioning methods focus only on language policy while not visual policy, and thus fail to capture visual context that are crucial for compositional reasoning such as object relationships (e.g., "man riding horse") and visual comparisons (e.g., "small(er) cat"). This issue is especially severe when generating longer sequences such as a paragraph. To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for fine-grained image-to-language generation: image sentence captioning and image paragraph captioning. During captioning, CAVP explicitly considers the previous visual attentions as context, and decides whether the context is used for the current word/sentence generation given the current visual attention. Compared against traditional visual attention mechanism that only fixes a single visual region at each step, CAVP can attend to complex visual compositions over time. The whole image captioning model-CAVP and its subsequent language policy network - can be efficiently optimized end-to-end by using an actor-critic policy gradient method. We have demonstrated the effectiveness of CAVP by state-of-the-art performances on MS-COCO and Stanford captioning datasets, using various metrics and sensible visualizations of qualitative visual context.

...read moreread less

Proceedings Article•DOI•

More Grounded Image Captioning by Distilling Image-Text Matching Model

[...]

Yuanen Zhou¹, Meng Wang¹, Daqing Liu², Zhenzhen Hu¹, Hanwang Zhang³ - Show less +1 more•Institutions (3)

Hefei University of Technology¹, University of Science and Technology of China², Nanyang Technological University³

14 Jun 2020

TL;DR: A Part-of-Speech (POS) enhanced image-text matching model (SCAN) is proposed, which serves as a word-region alignment regularization for the captioner's visual attention module and it is demonstrated that conventional image captioners equipped with POS-SCAN can significantly improve the grounding accuracy without strong supervision.

...read moreread less

Abstract: Visual attention not only improves the performance of image captioners, but also serves as a visual interpretation to qualitatively measure the caption rationality and model transparency. Specifically, we expect that a captioner can fix its attentive gaze on the correct objects while generating the corresponding words. This ability is also known as grounded image captioning. However, the grounding accuracy of existing captioners is far from satisfactory.To improve the grounding accuracy while retaining the captioning quality, it is expensive to collect the word-region alignment as strong supervision.To this end, we propose a Part-of-Speech (POS) enhanced image-text matching model (SCAN): POS-SCAN, as the effective knowledge distillation for more grounded image captioning. The benefits are two-fold: 1) given a sentence and an image, POS-SCAN can ground the objects more accurately than SCAN; 2) POS-SCAN serves as a word-region alignment regularization for the captioner's visual attention module. By showing benchmark experimental results, we demonstrate that conventional image captioners equipped with POS-SCAN can significantly improve the grounding accuracy without strong supervision. Last but not the least, we explore the indispensable Self-Critical Sequence Training (SCST) in the context of grounded image captioning and show that the image-text matching score can serve as a reward for more grounded captioning.

...read moreread less

Book Chapter•DOI•

Comprehensive Image Captioning via Scene Graph Decomposition

[...]

Yiwu Zhong¹, Liwei Wang², Jianshu Chen², Dong Yu², Yin Li¹ - Show less +1 more•Institutions (2)

University of Wisconsin-Madison¹, Tencent²

23 Aug 2020

TL;DR: This work addresses the challenging problem of image captioning by revisiting the representation of image scene graph by designing a deep model to select important sub-graphs, and to decode each selected sub- graph into a single target sentence.

...read moreread less

Abstract: We address the challenging problem of image captioning by revisiting the representation of image scene graph. At the core of our method lies the decomposition of a scene graph into a set of sub-graphs, with each sub-graph capturing a semantic component of the input image. We design a deep model to select important sub-graphs, and to decode each selected sub-graph into a single target sentence. By using sub-graphs, our model is able to attend to different components of the image. Our method thus accounts for accurate, diverse, grounded and controllable captioning at the same time. We present extensive experiments to demonstrate the benefits of our comprehensive captioning model. Our method establishes new state-of-the-art results in caption diversity, grounding, and controllability, and compares favourably to latest methods in caption quality. Our project website can be found at http://pages.cs.wisc.edu/~yiwuzhong/Sub-GC.html.

...read moreread less

Journal Article•DOI•

Corrections to “STAT: Spatial-Temporal Attention Mechanism for Video Captioning”

[...]

Chenggang Yan¹, Yunbin Tu², Wang Xingzheng³, Yongbing Zhang⁴, Xinhong Hao⁵, Yongdong Zhang¹, Qionghai Dai⁴ - Show less +3 more•Institutions (5)

University of Science and Technology of China¹, Hangzhou Dianzi University², Shenzhen University³, Tsinghua University⁴, Beijing Institute of Technology⁵

21 Feb 2020-IEEE Transactions on Multimedia

Posted Content•

Deconfounded Image Captioning: A Causal Retrospect

[...]

Xu Yang¹, Hanwang Zhang¹, Jianfei Cai²•Institutions (2)

Nanyang Technological University¹, Monash University²

09 Mar 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper presents a novel perspective: Deconfounded Image Captioning (DIC), to find out the answer of this question, then retrospect modern neural image captioners, and proposes a DIC framework: DICv1.0 to alleviate the negative effects brought by dataset bias.

...read moreread less

Abstract: The dataset bias in vision-language tasks is becoming one of the main problems that hinder the progress of our community. However, recent studies lack a principled analysis of the bias. In this paper, we present a novel perspective: Deconfounded Image Captioning (DIC), to find out the cause of the bias in image captioning, then retrospect modern neural image captioners, and finally propose a DIC framework: DICv1.0. DIC is based on causal inference, whose two principles: the backdoor and front-door adjustments, help us to review previous works and design the effective models. In particular, we showcase that DICv1.0 can strengthen two prevailing captioning models and achieves a single-model 130.7 CIDEr-D and 128.4 c40 CIDEr-D on Karpathy split and online split of the challenging MS-COCO dataset, respectively. Last but not least, DICv1.0 is merely a natural derivation from our causal retrospect, which opens a promising direction for image captioning.

...read moreread less

Posted Content•

Normalized and Geometry-Aware Self-Attention Network for Image Captioning

[...]

Longteng Guo¹, Jing Liu¹, Xinxin Zhu¹, Peng Yao², Shichen Lu³, Hanqing Lu¹ - Show less +2 more•Institutions (3)

Chinese Academy of Sciences¹, University of Science and Technology Beijing², Wuhan University³

19 Mar 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes Normalized Self-Attention (NSA), a reparameterization of SA that brings the benefits of normalization inside SA and introduces a novel normalization method and demonstrates that it is both possible and beneficial to perform it on the hidden activations inside SA.

...read moreread less

TRECVID 2018: Benchmarking video activity detection, video captioning and matching, video storytelling linking and video search

[...]

George Awad, Asad A. Butt, Keith Curtis, Yooyoung Lee, Jonathan G. Fiscus, Afzal Godil, David Joy, Andrew Delgado, Alan F. Smeaton, Yvette Graham, Wessel Kraaij, Georges Quénot, João Magalhães, David Semedo, Saverio Blasi - Show less +11 more

01 Jan 2020

TL;DR: The TREC Video Retrieval Evaluation (TRECVID) 2018 was a TREC-style video analysis and retrieval evaluation, the goal of which remains to promote progress in research and development of contentbased exploitation and retrieval of information from digital video via open, metrics-based evaluation.

...read moreread less

Posted Content•

Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network

[...]

Jiayi Ji¹, Yunpeng Luo¹, Xiaoshuai Sun¹, Fuhai Chen¹, Gen Luo¹, Yongjian Wu², Yue Gao³, Rongrong Ji¹ - Show less +4 more•Institutions (3)

Xiamen University¹, Tencent², Tsinghua University³

13 Dec 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper introduces a Global Enhanced Transformer (termed GET) to enable the extraction of a more comprehensive global representation, and then adaptively guide the decoder to generate high-quality captions.

...read moreread less

Abstract: Transformer-based architectures have shown great success in image captioning, where object regions are encoded and then attended into the vectorial representations to guide the caption decoding. However, such vectorial representations only contain region-level information without considering the global information reflecting the entire image, which fails to expand the capability of complex multi-modal reasoning in image captioning. In this paper, we introduce a Global Enhanced Transformer (termed GET) to enable the extraction of a more comprehensive global representation, and then adaptively guide the decoder to generate high-quality captions. In GET, a Global Enhanced Encoder is designed for the embedding of the global feature, and a Global Adaptive Decoder are designed for the guidance of the caption generation. The former models intra- and inter-layer global representation by taking advantage of the proposed Global Enhanced Attention and a layer-wise fusion module. The latter contains a Global Adaptive Controller that can adaptively fuse the global information into the decoder to guide the caption generation. Extensive experiments on MS COCO dataset demonstrate the superiority of our GET over many state-of-the-arts.

...read moreread less

Journal Article•DOI•

Multi-Level Policy and Reward-Based Deep Reinforcement Learning Framework for Image Captioning

[...]

Ning Xu¹, Hanwang Zhang², An-An Liu¹, Weizhi Nie¹, Yuting Su¹, Jie Nie³, Yongdong Zhang⁴ - Show less +3 more•Institutions (4)

Tianjin University¹, Nanyang Technological University², Ocean University of China³, University of Science and Technology of China⁴

01 May 2020-IEEE Transactions on Multimedia

TL;DR: A novel multi-level policy and reward RL framework for image captioning that can be easily integrated with RNN-based captioning models, language metrics, or visual-semantic functions for optimization and achieves competitive performances on a variety of evaluation metrics.

...read moreread less

Abstract: Image captioning is one of the most challenging tasks in AI because it requires an understanding of both complex visuals and natural language. Because image captioning is essentially a sequential prediction task, recent advances in image captioning have used reinforcement learning (RL) to better explore the dynamics of word-by-word generation. However, the existing RL-based image captioning methods rely primarily on a single policy network and reward function—an approach that is not well matched to the multi-level (word and sentence) and multi-modal (vision and language) nature of the task. To solve this problem, we propose a novel multi-level policy and reward RL framework for image captioning that can be easily integrated with RNN-based captioning models, language metrics, or visual-semantic functions for optimization. Specifically, the proposed framework includes two modules: 1) a multi-level policy network that jointly updates the word- and sentence-level policies for word generation; and 2) a multi-level reward function that collaboratively leverages both a vision-language reward and a language-language reward to guide the policy. Furthermore, we propose a guidance term to bridge the policy and the reward for RL optimization. The extensive experiments on the MSCOCO and Flickr30k datasets and the analyses show that the proposed framework achieves competitive performances on a variety of evaluation metrics. In addition, we conduct ablation studies on multiple variants of the proposed framework and explore several representative image captioning models and metrics for the word-level policy network and the language-language reward function to evaluate the generalization ability of the proposed framework.

...read moreread less

Journal Article•DOI•

Federated Learning for Vision-and-Language Grounding Problems

[...]

Fenglin Liu¹, Xian Wu², Shen Ge², Wei Fan², Yuexian Zou¹ - Show less +1 more•Institutions (2)

Peking University¹, Tencent²

03 Apr 2020

TL;DR: This work proposes a federated learning framework to obtain various types of image representations from different tasks, which are then fused together to form fine-grained image representations that are much more powerful than the original representations alone in individual tasks.

...read moreread less

Abstract: Recently, vision-and-language grounding problems, e.g., image captioning and visual question answering (VQA), has attracted extensive interests from both academic and industrial worlds. However, given the similarity of these tasks, the efforts to obtain better results by combining the merits of their algorithms are not well studied. Inspired by the recent success of federated learning, we propose a federated learning framework to obtain various types of image representations from different tasks, which are then fused together to form fine-grained image representations. The representations merge useful features from different vision-and-language grounding problems, and are thus much more powerful than the original representations alone in individual tasks. To learn such image representations, we propose the Aligning, Integrating and Mapping Network (aimNet). The aimNet is validated on three federated learning settings, which include horizontal federated learning, vertical federated learning, and federated transfer learning. Experiments of aimNet-based federated learning framework on two representative tasks, i.e., image captioning and VQA, demonstrate the effective and universal improvements of all metrics over the baselines. In image captioning, we are able to get 14% and 13% relative gain on the task-specific metrics CIDEr and SPICE, respectively. In VQA, we could also boost the performance of strong baselines by up to 3%.

...read moreread less

Posted Content•

A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer.

[...]

Vladimir Iashin, Esa Rahtu¹•Institutions (1)

Tampere University of Technology¹

17 May 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper shows the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task.

...read moreread less

Abstract: Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting only visual features, while completely neglecting the audio track. Only a few prior works have utilized both modalities, yet they show poor results or demonstrate the importance on a dataset with a specific domain. In this paper, we introduce Bi-modal Transformer which generalizes the Transformer architecture for a bi-modal input. We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task. We also show that the pre-trained bi-modal encoder as a part of the bi-modal transformer can be used as a feature extractor for a simple proposal generation module. The performance is demonstrated on a challenging ActivityNet Captions dataset where our model achieves outstanding performance. The code is available: this http URL

...read moreread less

Posted Content•

Captioning Images Taken by People Who Are Blind

[...]

Danna Gurari¹, Yinan Zhao¹, Meng Zhang¹, Nilavra Bhattacharya¹•Institutions (1)

University of Texas at Austin¹

20 Feb 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work introduces the first image captioning dataset to represent this real use case, which consists of over 39,000 images originating from people who are blind that are each paired with five captions and analyzes modern image Captioning algorithms to identify what makes this new dataset challenging for the vision community.

...read moreread less

Abstract: While an important problem in the vision community is to design algorithms that can automatically caption images, few publicly-available datasets for algorithm development directly address the interests of real users. Observing that people who are blind have relied on (human-based) image captioning services to learn about images they take for nearly a decade, we introduce the first image captioning dataset to represent this real use case. This new dataset, which we call VizWiz-Captions, consists of over 39,000 images originating from people who are blind that are each paired with five captions. We analyze this dataset to (1) characterize the typical captions, (2) characterize the diversity of content found in the images, and (3) compare its content to that found in eight popular vision datasets. We also analyze modern image captioning algorithms to identify what makes this new dataset challenging for the vision community. We publicly-share the dataset with captioning challenge instructions at this https URL

...read moreread less

Book Chapter•DOI•

Captioning Images Taken by People Who Are Blind.

[...]

Danna Gurari¹, Yinan Zhao¹, Meng Zhang¹, Nilavra Bhattacharya¹•Institutions (1)

University of Texas at Austin¹

20 Feb 2020

TL;DR: The VizWiz-Captions dataset as mentioned in this paper consists of over 39,000 images originating from people who are blind that are each paired with five captions, which is the first publicly available dataset for image captioning.

...read moreread less

Posted Content•

TextCaps: a Dataset for Image Captioning with Reading Comprehension

[...]

Oleksii Sidorov¹, Ronghang Hu¹, Marcus Rohrbach¹, Amanpreet Singh²•Institutions (2)

Facebook¹, University of California, Berkeley²

24 Mar 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: A novel dataset, TextCaps, with 145k captions for 28k images, challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects.

...read moreread less

Abstract: Image descriptions can help visually impaired people to quickly understand the image content. While we made significant progress in automatically describing images and optical character recognition, current approaches are unable to include written text in their descriptions, although text is omnipresent in human environments and frequently critical to understand our surroundings. To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images. Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects. We study baselines and adapt existing approaches to this new task, which we refer to as image captioning with reading comprehension. Our analysis with automatic and human studies shows that our new TextCaps dataset provides many new technical challenges over previous datasets.

...read moreread less

Collapse