Topic

Closed captioning

About: Closed captioning is a research topic. Over its lifetime, 3,011 publications have been published on this topic, receiving 64,494 citations. The topic is also known as: CC.


Papers
Patent
Kuansan Wang
11 May 2004
TL;DR: A speech input mode dynamically reports partial semantic parses while audio captioning is still in progress, a significant departure from the turn-taking nature of a spoken dialogue.
Abstract: A method and system provide a speech input mode which dynamically reports partial semantic parses while audio captioning is still in progress. The semantic parses can be evaluated with an outcome immediately reported back to the user. The net effect is that tasks conventionally performed in the system turn are now carried out in the midst of the user turn, a significant departure from the turn-taking nature of a spoken dialogue.
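To make the idea concrete, here is a minimal Python sketch of reporting partial semantic parses while speech input is still streaming in. The recognizer hypotheses, the toy frame grammar, and the `report` callback are all illustrative assumptions, since the patent does not specify an API.

```python
# Hypothetical sketch: evaluate partial recognition hypotheses as they
# arrive and report semantic parses mid-utterance, instead of waiting
# for the user's turn to end. The grammar and names are illustrative.

import re
from typing import Callable, Optional


def parse_partial(transcript: str) -> Optional[dict]:
    """Extract a partial semantic frame from an in-progress transcript.

    A toy grammar: look for a command verb and, optionally, a target.
    """
    match = re.search(r"\b(play|stop|record)\b(?:\s+(.+))?", transcript)
    if not match:
        return None
    return {"action": match.group(1), "target": match.group(2)}


def feed_hypotheses(hypotheses: list, report: Callable[[dict], None]) -> None:
    """Parse each partial hypothesis and report the outcome immediately,
    while the user is still speaking."""
    for transcript in hypotheses:
        frame = parse_partial(transcript)
        if frame is not None:
            report(frame)  # outcome reported back in the midst of the user turn


# Partial hypotheses as they might stream in from a recognizer:
feed_hypotheses(
    ["pl", "play", "play the news", "play the news channel"],
    report=lambda frame: print("partial parse:", frame),
)
```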

174 citations

Posted Content
TL;DR: This work proposes an end-to-end transformer model whose self-attention mechanism enables an efficient non-recurrent structure during encoding and leads to performance improvements.
Abstract: Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods on dense video captioning tackle this problem by building two models, i.e., an event proposal model and a captioning model, for these two sub-problems. The models are either trained separately or in alternation. This prevents direct influence of the language description on the event proposal, which is important for generating accurate descriptions. To address this problem, we propose an end-to-end transformer model for dense video captioning. The encoder encodes the video into appropriate representations. The proposal decoder decodes from the encoding with different anchors to form video event proposals. The captioning decoder employs a masking network to restrict its attention to the proposal event over the encoding feature. This masking network converts the event proposal into a differentiable mask, which ensures consistency between the proposal and captioning during training. In addition, our model employs a self-attention mechanism, which enables the use of an efficient non-recurrent structure during encoding and leads to performance improvements. We demonstrate the effectiveness of this end-to-end model on the ActivityNet Captions and YouCookII datasets, achieving METEOR scores of 10.12 and 6.58, respectively.
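The differentiable mask is the key coupling between the proposal and captioning decoders. Below is a minimal numpy sketch of one common construction, turning a proposed interval into a soft mask over frame positions; the two-sigmoid form and the `sharpness` parameter are assumptions, not necessarily the paper's exact formulation.

```python
# Sketch: a soft, differentiable indicator of a proposed event interval,
# so caption attention can be restricted to the proposal while gradients
# still flow back to the proposal parameters.

import numpy as np


def differentiable_mask(start: float, end: float, num_frames: int,
                        sharpness: float = 10.0) -> np.ndarray:
    """Soft indicator of positions inside [start, end) (fractions of the
    video length). The product of two sigmoids approaches a hard 0/1 mask
    as `sharpness` grows, yet stays differentiable in `start` and `end`."""
    t = np.linspace(0.0, 1.0, num_frames)
    rise = 1.0 / (1.0 + np.exp(-sharpness * (t - start)))
    fall = 1.0 / (1.0 + np.exp(-sharpness * (end - t)))
    return rise * fall


mask = differentiable_mask(start=0.3, end=0.6, num_frames=20)
# Attention scores over the encoding are weighted by the mask, so the
# caption attends (almost) only within the proposed event:
scores = np.random.rand(20)
masked_attention = mask * scores / np.sum(mask * scores)
print(np.round(mask, 2))
```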

171 citations

Proceedings ArticleDOI
19 Oct 2017
TL;DR: This work presents a novel deep framework to boost video captioning by learning Multimodal Attention Long Short-Term Memory networks (MA-LSTM), and designs a novel child-sum fusion unit in the MA-LSTM to effectively combine different encoded modalities into the initial decoding states.
Abstract: Automatic generation of video captions is a challenging task, as video is an information-intensive medium with complex variations. Most existing methods, whether based on language templates or sequence learning, have treated video as a flat data sequence while ignoring its intrinsic multimodal nature. Observing that different modalities (e.g., frame, motion, and audio streams), as well as the elements within each modality, contribute differently to sentence generation, we present a novel deep framework to boost video captioning by learning Multimodal Attention Long Short-Term Memory networks (MA-LSTM). Our proposed MA-LSTM fully exploits both multimodal streams and temporal attention to selectively focus on specific elements during sentence generation. Moreover, we design a novel child-sum fusion unit in the MA-LSTM to effectively combine different encoded modalities into the initial decoding states. Different from existing approaches that employ the same LSTM structure for different modalities, we train modality-specific LSTMs to capture the intrinsic representations of individual modalities. Experiments on two benchmark datasets (MSVD and MSR-VTT) show that our MA-LSTM significantly outperforms state-of-the-art methods, reaching 52.3 BLEU@4 and 70.4 CIDEr-D on the MSVD dataset.
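The child-sum fusion unit can be pictured as gating each modality's encoded state separately and summing the gated contributions, in the spirit of the child-sum Tree-LSTM. The numpy sketch below illustrates that shape; the dimensions, weights, and gating details are assumptions rather than the exact MA-LSTM equations.

```python
# Sketch of child-sum style fusion: each modality's encoder state gets
# its own gate, and the gated states are summed to initialize the
# decoder, rather than concatenated or forced through one shared LSTM.

import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def child_sum_fusion(modality_states: list, gate_weights: list) -> np.ndarray:
    """Fuse per-modality encoder states (e.g., frame, motion, audio)
    into one initial decoding state."""
    fused = np.zeros_like(modality_states[0])
    for h, w in zip(modality_states, gate_weights):
        gate = sigmoid(w @ h)   # modality-specific gate
        fused += gate * h       # gated contribution, then child-sum
    return fused


rng = np.random.default_rng(0)
dim = 8
states = [rng.standard_normal(dim) for _ in range(3)]        # frame, motion, audio
weights = [rng.standard_normal((dim, dim)) for _ in range(3)]
print(child_sum_fusion(states, weights))
```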

171 citations

Patent
David H. Sloo
21 May 2002
TL;DR: Closed captioning streams of textual data are extracted from video signals received by a client device and searched for occurrences of text that match one or more search terms.
Abstract: In some implementations, closed captioning streams of textual data are extracted from video signals received by a client device. The streams may be searched for occurrences of textual data that match one or more search terms. When the number of matches between the search terms and a particular closed captioning stream exceeds a threshold, a notification may be sent indicating that content programming determined to be of interest to a viewer has been located, and/or the content programming may be recorded.
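The matching logic described above is simple enough to sketch directly. The Python fragment below is an illustrative reading of the claim, not the patent's implementation; the function name and threshold semantics are assumptions.

```python
# Sketch: scan extracted closed-captioning text for search terms and
# trigger a notification once matches exceed a threshold.

from collections import Counter


def scan_caption_stream(caption_text: str, search_terms: set,
                        threshold: int) -> bool:
    """Return True (i.e., notify and/or record) when the number of
    search-term matches in the closed captioning stream exceeds the
    threshold."""
    words = Counter(caption_text.lower().split())
    matches = sum(words[term] for term in search_terms)
    return matches > threshold


stream = "breaking news on the election results election coverage continues"
if scan_caption_stream(stream, {"election", "results"}, threshold=2):
    print("content of interest located: notify viewer and start recording")
```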

166 citations

Journal ArticleDOI
TL;DR: This work introduces the psychological theory of attention to image caption generation, combining a convolutional neural network over images with a long short-term memory network over sentences.
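A minimal sketch of the attention step this describes: at each decoding step, the LSTM hidden state scores the CNN's spatial feature vectors, and the caption generator consumes their weighted average. The shapes and dot-product scoring are assumptions; the paper's exact attention form may differ.

```python
# Sketch: soft attention over CNN region features, conditioned on the
# decoder LSTM's hidden state.

import numpy as np


def attend(features: np.ndarray, hidden: np.ndarray) -> np.ndarray:
    """features: (num_regions, dim) CNN feature map; hidden: (dim,) LSTM
    state. Returns the attention-weighted context vector of shape (dim,)."""
    scores = features @ hidden                 # relevance of each region
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over regions
    return weights @ features                  # soft-attended context


rng = np.random.default_rng(1)
context = attend(rng.standard_normal((49, 16)), rng.standard_normal(16))
print(context.shape)  # (16,)
```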

164 citations


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations (83% related)
Object detection: 46.1K papers, 1.3M citations (82% related)
Convolutional neural network: 74.7K papers, 2M citations (82% related)
Deep learning: 79.8K papers, 2.1M citations (82% related)
Unsupervised learning: 22.7K papers, 1M citations (81% related)
Performance Metrics
No. of papers in the topic in previous years:
2023: 536
2022: 1,030
2021: 504
2020: 530
2019: 448
2018: 334