Topic

Closed captioning

About: Closed captioning is a research topic. To date, 3,011 publications on this topic have received 64,494 citations. The topic is also known as CC.

Proceedings Article

Contextualized Keyword Representations for Multi-modal Retinal Image Captioning

Jia-Hong Huang¹, Ting-Wei Wu², Marcel Worring¹

University of Amsterdam¹, Georgia Institute of Technology²

24 Aug 2021

TL;DR: In this article, a new end-to-end deep multi-modal medical image captioning model is proposed, which uses contextualized keyword representations, textual feature reinforcement, and masked self-attention.

Abstract: Medical image captioning automatically generates a medical description to describe the content of a given medical image. Traditional medical image captioning models create a medical description based on a single medical image input only. Hence, an abstract medical description or concept is hard to be generated based on the traditional approach. Such a method limits the effectiveness of medical image captioning. Multi-modal medical image captioning is one of the approaches utilized to address this problem. In multi-modal medical image captioning, textual input, e.g., expert-defined keywords, is considered as one of the main drivers of medical description generation. Thus, encoding the textual input and the medical image effectively are both important for the task of multi-modal medical image captioning. In this work, a new end-to-end deep multi-modal medical image captioning model is proposed. Contextualized keyword representations, textual feature reinforcement, and masked self-attention are used to develop the proposed approach. Based on the evaluation of an existing multi-modal medical image captioning dataset, experimental results show that the proposed model is effective with an increase of +53.2% in BLEU-avg and +18.6% in CIDEr, compared with the state-of-the-art method. https://github.com/Jhhuangkay/Contextualized-Keyword-Representations-for-Multi-modal-Retinal-Image-Captioning

16 citations

Journal Article

Evaluation of automatic video captioning using direct assessment

Yvette Graham¹, George Awad², Alan F. Smeaton¹

Dublin City University¹, National Institute of Standards and Technology²

04 Sep 2018-PLOS ONE

TL;DR: In this paper, a method for manually assessing the quality of automatically-generated captions for video is presented, which brings human assessment into the evaluation by crowd sourcing how well a caption describes a video.

Abstract: We present Direct Assessment, a method for manually assessing the quality of automatically-generated captions for video. Evaluating the accuracy of video captions is particularly difficult because for any given video clip there is no definitive ground truth or correct answer against which to measure. Metrics for comparing automatic video captions against a manual caption such as BLEU and METEOR, drawn from techniques used in evaluating machine translation, were used in the TRECVid video captioning task in 2016 but these are shown to have weaknesses. The work presented here brings human assessment into the evaluation by crowd sourcing how well a caption describes a video. We automatically degrade the quality of some sample captions which are assessed manually and from this we are able to rate the quality of the human assessors, a factor we take into account in the evaluation. Using data from the TRECVid video-to-text task in 2016, we show how our direct assessment method is replicable and robust and scales to where there are many caption-generation techniques to be evaluated including the TRECVid video-to-text task in 2017.
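
The quality-control idea described in this abstract (degrade some captions, then check whether each assessor rates the degraded versions lower) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function names, the numeric rating scale, and the mean-difference threshold `min_gap` are all hypothetical, and the paper may rely on a different statistical test.

```python
from statistics import mean


def assessor_is_reliable(pairs, min_gap=10.0):
    """Decide whether one assessor separates original from degraded captions.

    pairs: list of (score_for_original, score_for_degraded) from one assessor.
    Hypothetical rule: the assessor counts as reliable when original captions
    score at least `min_gap` points higher on average.
    """
    if not pairs:
        return False
    gaps = [original - degraded for original, degraded in pairs]
    return mean(gaps) >= min_gap


def system_scores(ratings, reliable_assessors):
    """Average the scores per captioning system, using reliable assessors only.

    ratings: list of (assessor_id, system_id, score).
    """
    per_system = {}
    for assessor, system, score in ratings:
        if assessor in reliable_assessors:
            per_system.setdefault(system, []).append(score)
    return {system: mean(scores) for system, scores in per_system.items()}
```

Filtering assessors before averaging keeps careless ratings from shifting the system-level scores; the threshold itself is an arbitrary choice for this sketch.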

16 citations

Posted Content

An Attempt towards Interpretable Audio-Visual Video Captioning

Yapeng Tian, Chenxiao Guan, Justin Goodman, Marc Moore, Chenliang Xu

07 Dec 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes a multimodal convolutional neural network-based audio-visual video captioning framework and introduces a modality-aware module for exploring modality selection during sentence generation and shows that the model can still achieve comparable performance with recent state-of-the-art methods.

Abstract: Automatically generating a natural language sentence to describe the content of an input video is a very challenging problem. It is an essential multimodal task in which auditory and visual contents are equally important. Although audio information has been exploited to improve video captioning in previous works, it is usually regarded as an additional feature fed into a black box fusion machine. How are the words in the generated sentences associated with the auditory and visual modalities? The problem is still not investigated. In this paper, we make the first attempt to design an interpretable audio-visual video captioning network to discover the association between words in sentences and audio-visual sequences. To achieve this, we propose a multimodal convolutional neural network-based audio-visual video captioning framework and introduce a modality-aware module for exploring modality selection during sentence generation. Besides, we collect new audio captioning and visual captioning datasets for further exploring the interactions between auditory and visual modalities for high-level video understanding. Extensive experiments demonstrate that the modality-aware module makes our model interpretable on modality selection during sentence generation. Even with the added interpretability, our video captioning network can still achieve comparable performance with recent state-of-the-art methods.

16 citations

Journal Article

Towards Generating and Evaluating Iconographic Image Captions of Artworks

Eva Cetinic

23 Jul 2021-Journal of Imaging

TL;DR: The overall results suggest that the model can generate meaningful captions that indicate a stronger relevance to the art historical context, particularly in comparison to captions obtained from models trained only on natural image datasets.

Abstract: To automatically generate accurate and meaningful textual descriptions of images is an ongoing research challenge. Recently, a lot of progress has been made by adopting multimodal deep learning approaches for integrating vision and language. However, the task of developing image captioning models is most commonly addressed using datasets of natural images, while not many contributions have been made in the domain of artwork images. One of the main reasons for that is the lack of large-scale art datasets of adequate image-text pairs. Another reason is the fact that generating accurate descriptions of artwork images is particularly challenging because descriptions of artworks are more complex and can include multiple levels of interpretation. It is therefore also especially difficult to effectively evaluate generated captions of artwork images. The aim of this work is to address some of those challenges by utilizing a large-scale dataset of artwork images annotated with concepts from the Iconclass classification system. Using this dataset, a captioning model is developed by fine-tuning a transformer-based vision-language pretrained model. Due to the complex relations between image and text pairs in the domain of artwork images, the generated captions are evaluated using several quantitative and qualitative approaches. The performance is assessed using standard image captioning metrics and a recently introduced reference-free metric. The quality of the generated captions and the model’s capacity to generalize to new data is explored by employing the model to another art dataset to compare the relation between commonly generated captions and the genre of artworks. The overall results suggest that the model can generate meaningful captions that indicate a stronger relevance to the art historical context, particularly in comparison to captions obtained from models trained only on natural image datasets.

16 citations

Journal Article

Vocabulary-Wide Credit Assignment for Training Image Captioning Models

Han Liu¹, Shifeng Zhang¹, Ke Lin², Jing Wen¹, Jianmin Li¹, Xiaolin Hu¹

Tsinghua University¹, Samsung²

20 Jan 2021-IEEE Transactions on Image Processing

TL;DR: A Vocabulary-Critical Sequence Training (VCST) is proposed, which assigns every word in vocabulary an appropriate credit at each generation step and can be incorporated into existing RL methods for training image captioning models to achieve better results.

Abstract: Reinforcement learning (RL) algorithms have been shown to be efficient in training image captioning models. A critical step in RL algorithms is to assign credits to appropriate actions. There are mainly two classes of credit assignment methods in existing RL methods for image captioning, assigning a single credit for the whole sentence and assigning a credit to every word in the sentence. In this article, we propose a new credit assignment method which is orthogonal to the above two. It assigns every word in vocabulary an appropriate credit at each generation step. It is called vocabulary-wide credit assignment. Based on this we propose a Vocabulary-Critical Sequence Training (VCST). VCST can be incorporated into existing RL methods for training image captioning models to achieve better results. Extensive experiments with many popular models validated the effectiveness of VCST.
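
The three granularities of credit assignment named in this abstract (one credit per sentence, one per word, one per vocabulary entry at each step) can be contrasted in a rough sketch. The code below is an assumption-laden illustration, not the VCST algorithm: the abstract does not say how vocabulary-wide credits are computed or which objective is optimized, so the credits are passed in by the caller and the expected-credit loss in `vocabulary_wide_loss` is a hypothetical choice. All function names are invented for this example.

```python
def sentence_level_loss(log_probs, reward, baseline=0.0):
    """One scalar credit (reward - baseline) shared by every sampled word.

    log_probs: log-probability of each sampled word, one per generation step.
    """
    return -(reward - baseline) * sum(log_probs)


def word_level_loss(log_probs, credits):
    """One credit per sampled word, i.e. per generation step."""
    return -sum(credit * lp for credit, lp in zip(credits, log_probs))


def vocabulary_wide_loss(step_probs, step_credits):
    """One credit per vocabulary word at every generation step.

    step_probs[t][v]:   model probability of word v at step t.
    step_credits[t][v]: credit for word v at step t (supplied by the caller;
                        how to compute it is not specified in the abstract).
    Assumed objective: maximize the expected credit under the model's
    distribution at each step, so the loss is its negative.
    """
    return -sum(
        prob * credit
        for probs, credits in zip(step_probs, step_credits)
        for prob, credit in zip(probs, credits)
    )
```

The difference lies in how much signal each step receives: a single number for the whole sentence, one number per step, or a full vector over the vocabulary per step.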

16 citations

Metrics

Papers: 4,575
Citations: 96,790

No. of papers in the topic in previous years
Year	Papers
2023	536
2022	1,030
2021	504
2020	530
2019	448
2018	334