Topic

Closed captioning

About: Closed captioning is a research topic. Over its lifetime, 3,011 publications have been published within this topic, receiving 64,494 citations. The topic is also known as: CC.


Papers
Proceedings ArticleDOI
01 Jul 2019
TL;DR: By combining text and image information, a machine learning approach can be built that accurately distinguishes between the text-image relationship types and can be applied directly in end-user applications to optimize screen real estate.
Abstract: Text in social media posts is frequently accompanied by images in order to provide content, supply context, or to express feelings. This paper studies how the meaning of the entire tweet is composed through the relationship between its textual content and its image. We build and release a data set of image tweets annotated with four classes which express whether the text or the image provides additional information to the other modality. We show that by combining the text and image information, we can build a machine learning approach that accurately distinguishes between the relationship types. Further, we derive insights into how these relationships are materialized through text and image content analysis and how they are impacted by user demographic traits. These methods can be used in several downstream applications, including pre-training image tagging models and collecting distantly supervised data for image captioning, and they can be applied directly in end-user applications to optimize screen real estate.
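The fusion described in the abstract (combining text and image features to classify the relationship type) can be pictured with a minimal late-fusion classifier. The sketch below is an illustration under assumed feature dimensions and a simple concatenation-plus-MLP design, not the authors' released model; the four output classes correspond to the annotation scheme mentioned above.

```python
# Minimal sketch (not the paper's code): a late-fusion classifier over the
# four text-image relationship classes described in the abstract.
# The encoder dimensions and the concatenation-based fusion are assumptions.
import torch
import torch.nn as nn

class TextImageRelationClassifier(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=512, num_classes=4):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),  # fuse both modalities
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes),           # 4 relationship types
        )

    def forward(self, text_feat, image_feat):
        # text_feat: (batch, text_dim) pooled tweet-text features
        # image_feat: (batch, image_dim) pooled image features
        return self.fusion(torch.cat([text_feat, image_feat], dim=-1))

# Usage with random features standing in for real encoder outputs.
model = TextImageRelationClassifier()
logits = model(torch.randn(8, 768), torch.randn(8, 2048))
print(logits.shape)  # torch.Size([8, 4])
```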

42 citations

Proceedings ArticleDOI
01 Jun 2021
TL;DR: A Sketch, Ground, and Refine (SGR) model is proposed that first generates a paragraph from a global view and then grounds each event description to a video segment for detailed refinement.
Abstract: The dense video captioning task aims to detect and describe a sequence of events in a video for detailed and coherent storytelling. Previous works mainly adopt a "detect-then-describe" framework, which first detects event proposals in the video and then generates descriptions for the detected events. However, the definitions of events are diverse: an event could be as simple as a single action or as complex as a set of events, depending on the semantic context. Therefore, directly detecting events based on video information is ill-defined and hurts the coherency and accuracy of generated dense captions. In this work, we reverse the predominant "detect-then-describe" fashion, proposing a top-down way to first generate paragraphs from a global view and then ground each event description to a video segment for detailed refinement. It is formulated as a Sketch, Ground, and Refine process (SGR). The sketch stage first generates a coarse-grained multi-sentence paragraph to describe the whole video, where each sentence is treated as an event and gets localised in the grounding stage. In the refining stage, we improve captioning quality via refinement-enhanced training and dual-path cross attention on both coarse-grained event captions and aligned event segments. The updated event caption can further adjust its segment boundaries. Our SGR model outperforms state-of-the-art methods on the ActivityNet Captioning benchmark under traditional and story-oriented dense caption evaluations. Code will be released at github.com/bearcatt/SGR.
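To make the grounding stage concrete, the sketch below shows one simple way a sentence from the coarse paragraph could be localised to a video segment via sentence-to-frame attention. It is an illustrative placeholder, not the released SGR code (github.com/bearcatt/SGR); the feature dimensions and the (center, width) segment parameterisation are assumptions.

```python
# Illustrative sketch of the grounding step in a top-down
# "sketch, then ground" pipeline; module internals are simplified placeholders.
import torch
import torch.nn as nn

class Grounder(nn.Module):
    """Localise each sentence of the sketch paragraph to a video segment."""
    def __init__(self, dim=512):
        super().__init__()
        self.boundary = nn.Linear(dim, 2)  # predict (center, width) in [0, 1]

    def forward(self, sent_feats, frame_feats):
        # sent_feats: (num_sents, dim); frame_feats: (num_frames, dim)
        attn = torch.softmax(sent_feats @ frame_feats.T, dim=-1)  # sentence-to-frame attention
        context = attn @ frame_feats                              # attended video context
        return torch.sigmoid(self.boundary(context))              # segment per sentence

# Toy features standing in for sketch-stage sentence states and video frames.
sent_feats = torch.randn(3, 512)     # 3 coarse event sentences
frame_feats = torch.randn(100, 512)  # 100 video frames
segments = Grounder()(sent_feats, frame_feats)
print(segments)  # one (center, width) pair per event sentence
```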

41 citations

Proceedings ArticleDOI
14 Jun 2020
TL;DR: A caption-editing model is proposed consisting of two sub-modules: EditNet, a language module with an adaptive copy mechanism (Copy-LSTM) and a Selective Copy Memory Attention mechanism (SCMA), and DCNet, an LSTM-based denoising auto-encoder.
Abstract: Most image captioning frameworks generate captions directly from images, learning a mapping from visual features to natural language. However, editing existing captions can be easier than generating new ones from scratch. Intuitively, when editing captions, a model is not required to learn information that is already present in the caption (i.e. sentence structure), enabling it to focus on fixing details (e.g. replacing repetitive words). This paper proposes a novel approach to image captioning based on iterative adaptive refinement of an existing caption. Specifically, our caption-editing model consists of two sub-modules: (1) EditNet, a language module with an adaptive copy mechanism (Copy-LSTM) and a Selective Copy Memory Attention mechanism (SCMA), and (2) DCNet, an LSTM-based denoising auto-encoder. These components enable our model to directly copy from and modify existing captions. Experiments demonstrate that our new approach achieves state-of-the-art performance on the MS COCO dataset both with and without sequence-level training.
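The copy idea at the heart of the editing model can be illustrated with a generic pointer-style gate that mixes a generated vocabulary distribution with a distribution copied from the existing caption. This is a minimal sketch of that general mechanism, not the EditNet/Copy-LSTM or SCMA implementation; all dimensions are placeholders.

```python
# Minimal pointer/copy-style gate, sketched to illustrate "copy from an
# existing caption"; this is not the paper's EditNet/Copy-LSTM code.
import torch
import torch.nn as nn

class CopyGate(nn.Module):
    def __init__(self, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.generate = nn.Linear(hidden_dim, vocab_size)  # ordinary vocabulary head
        self.gate = nn.Linear(hidden_dim, 1)               # mix generation vs. copying

    def forward(self, hidden, old_caption_ids, copy_attn):
        # hidden: (batch, hidden_dim) decoder state
        # old_caption_ids: (batch, src_len) token ids of the caption being edited
        # copy_attn: (batch, src_len) attention over the old caption's tokens
        p_gen = torch.softmax(self.generate(hidden), dim=-1)
        p_copy = torch.zeros_like(p_gen).scatter_add(-1, old_caption_ids, copy_attn)
        g = torch.sigmoid(self.gate(hidden))               # adaptive copy gate
        return g * p_gen + (1 - g) * p_copy                # final word distribution

gate = CopyGate()
probs = gate(torch.randn(2, 512),
             torch.randint(0, 10000, (2, 12)),
             torch.softmax(torch.randn(2, 12), dim=-1))
print(probs.sum(-1))  # each row sums to ~1
```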

41 citations

Proceedings ArticleDOI
Fuhai Chen, Rongrong Ji, Jinsong Su, Yongjian Wu, Yunsheng Wu
19 Oct 2017
TL;DR: The proposed StructCap model parses a given image into key entities and their relations organized in a visual parsing tree, which is transformed and embedded under an encoder-decoder framework via visual attention.
Abstract: Image captioning has attracted ever-increasing research attention in multimedia and computer vision. To encode the visual content, existing approaches typically utilize an off-the-shelf deep Convolutional Neural Network (CNN) model to extract visual features, which are sent to Recurrent Neural Network (RNN) based textual generators to output word sequences. More recently, some methods encode visual objects and scene information with attention mechanisms. Despite the promising progress, one distinct disadvantage lies in distinguishing and modeling key semantic entities and their relations, which are widely regarded as important cues for describing image content. In this paper, we propose a novel image captioning model, termed StructCap. It parses a given image into key entities and their relations organized in a visual parsing tree, which is transformed and embedded under an encoder-decoder framework via visual attention. We give an end-to-end formulation to facilitate joint training of the visual tree parser, structured semantic attention and RNN-based captioning modules. Experimental results on two public benchmarks, Microsoft COCO and Flickr30K, show that the proposed StructCap model outperforms the state-of-the-art approaches under various standard evaluation metrics.
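As a rough illustration of encoding entities and relations organised in a tree, the sketch below composes node features bottom-up into a single vector that a caption decoder could attend over. The composition function, node features, and dimensions are assumptions for illustration; this is not the StructCap parser or its structured attention module.

```python
# Toy sketch: encode a visual parsing tree bottom-up into one vector.
# Node features and the composition function are illustrative assumptions.
import torch
import torch.nn as nn

class TreeNode:
    def __init__(self, feat, children=()):
        self.feat = feat              # (dim,) visual feature for this entity/relation
        self.children = list(children)

class TreeEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.compose = nn.Linear(2 * dim, dim)  # merge a node with its children summary

    def forward(self, node):
        if not node.children:
            return node.feat
        child_summary = torch.stack([self(c) for c in node.children]).mean(0)
        return torch.tanh(self.compose(torch.cat([node.feat, child_summary])))

dim = 256
# e.g. two entity nodes attached under one relation node
tree = TreeNode(torch.randn(dim), [TreeNode(torch.randn(dim)),
                                   TreeNode(torch.randn(dim))])
print(TreeEncoder(dim)(tree).shape)  # torch.Size([256])
```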

41 citations

Proceedings ArticleDOI
06 Nov 2017
TL;DR: A dual learning mechanism with a policy-gradient method that generates highly rewarded captions is introduced; it consistently outperforms previous methods for cross-domain image captioning.
Abstract: Recent AI research has witnessed increasing interest in automatically generating image descriptions in text, which is coined as the image captioning problem. Significant progress has been made in domains where plenty of labeled training data (i.e. image-text pairs) are readily available or collected. However, obtaining rich annotated data is a time-consuming and expensive process, creating a substantial barrier for applying image captioning methods to a new domain. In this paper, we propose a cross-domain image captioning approach that uses a novel dual learning mechanism to overcome this barrier. First, we model the alignment between the neural representations of images and those of natural language in the source domain, where one can access sufficient labeled data. Second, we adjust the pre-trained model based on examining limited data (or unpaired data) in the target domain. In particular, we introduce a dual learning mechanism with a policy gradient method that generates highly rewarded captions. The mechanism simultaneously optimizes two coupled objectives: generating image descriptions in text and generating plausible images from text descriptions, with the hope that by explicitly exploiting their coupled relation, one can safeguard the performance of image captioning in the target domain. To verify the effectiveness of our model, we use the MSCOCO dataset as the source domain and two other datasets (Oxford-102 and Flickr30k) as the target domains. The experimental results show that our model consistently outperforms previous methods for cross-domain image captioning.
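The policy-gradient part of the mechanism can be pictured with a one-step REINFORCE-style update: sample tokens from the captioning policy, score them with a reward, and weight the log-likelihood by that reward. The toy policy and random reward below are placeholders; in the paper's dual-learning setting the reward would reflect caption quality together with the coupled text-to-image objective.

```python
# REINFORCE-style update sketch for the "highly rewarded captions" idea.
# The toy policy and the random reward are placeholders, not the paper's model.
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 128
policy = nn.Linear(hidden, vocab_size)          # stands in for the caption decoder
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

state = torch.randn(4, hidden)                  # e.g. image/decoder features
dist = torch.distributions.Categorical(logits=policy(state))
actions = dist.sample()                         # sampled caption tokens (one step, for brevity)

# In the dual-learning setting the reward could combine caption quality and how
# well a text-to-image model reconstructs the input; here it is just random.
reward = torch.rand(4)

optimizer.zero_grad()
loss = -(reward * dist.log_prob(actions)).mean()  # policy-gradient objective
loss.backward()
optimizer.step()
print(float(loss))
```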

41 citations


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations, 83% related
Object detection: 46.1K papers, 1.3M citations, 82% related
Convolutional neural network: 74.7K papers, 2M citations, 82% related
Deep learning: 79.8K papers, 2.1M citations, 82% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Performance Metrics
No. of papers in the topic in previous years:
2023: 536
2022: 1,030
2021: 504
2020: 530
2019: 448
2018: 334