Topic

Closed captioning

About: Closed captioning is a research topic. Over its lifetime, 3,011 publications have been published within this topic, receiving 64,494 citations. The topic is also known as CC.


Papers
Proceedings Article
15 Oct 2018
TL;DR: A new framework to automatically spot salient regions in each video frame and simultaneously learn a discriminative spatio-temporal representation for video captioning is proposed.
Abstract: Towards an interpretable video captioning process, we target to locate salient regions of video objects along with the sequentially uttering words. This paper proposes a new framework to automatically spot salient regions in each video frame and simultaneously learn a discriminative spatio-temporal representation for video captioning. First, in a Spot Module, we automatically learn the saliency value of each location to separate salient regions from video content as the foreground and the rest as background by two operations of 'hard separation' and 'soft separation', respectively. Then, in an Aggregate Module, to aggregate the foreground/background descriptors into a discriminative spatio-temporal representation, we devise a trainable video VLAD process to learn the aggregation parameters. Finally, we utilize the attention mechanism to decode the spatio-temporal representations of different regions into video descriptions. Experiments on two benchmark datasets demonstrate our method outperforms most of the state-of-the-art methods in terms of Bleu@4, METEOR and CIDEr metrics for the task of video captioning. Also examples demonstrate our method can successfully utter words to sequentially salient regions of video objects.

24 citations

Proceedings Article
Junbo Wang, Wei Wang, Yan Huang, Liang Wang, Tieniu Tan
15 Oct 2018
TL;DR: A Hierarchical Memory Model (HMM) is proposed - a novel deep video captioning architecture which unifies a textual memory, a visual memory and an attribute memory in a hierarchical way and can largely reduce the semantic discrepancy between video and sentence.
Abstract: Translating videos into natural language sentences has drawn much attention recently. The framework of combining visual attention with Long Short-Term Memory (LSTM) based text decoder has achieved much progress. However, the vision-language translation still remains unsolved due to the semantic gap and misalignment between video content and described semantic concept. In this paper, we propose a Hierarchical Memory Model (HMM) - a novel deep video captioning architecture which unifies a textual memory, a visual memory and an attribute memory in a hierarchical way. These memories can guide attention for efficient video representation extraction and semantic attribute selection in addition to modelling the long-term dependency for video sequence and sentences, respectively. Compared with traditional vision-based text decoder, the proposed attribute-based text decoder can largely reduce the semantic discrepancy between video and sentence. To prove the effectiveness of the proposed model, we perform extensive experiments on two public benchmark datasets: MSVD and MSR-VTT. Experiments show that our model not only can discover appropriate video representation and semantic attributes but also can achieve comparable or superior performances than state-of-the-art methods on these datasets.

24 citations

Posted Content
TL;DR: In this article, the authors present a dataset consisting of eye movements and verbal descriptions recorded synchronously over images, and study the differences in human attention during free-viewing and image captioning tasks.
Abstract: In this work, we present a novel dataset consisting of eye movements and verbal descriptions recorded synchronously over images. Using this data, we study the differences in human attention during free-viewing and image captioning tasks. We look into the relationship between human attention and language constructs during perception and sentence articulation. We also analyse attention deployment mechanisms in the top-down soft attention approach that is argued to mimic human attention in captioning tasks, and investigate whether visual saliency can help image captioning. Our study reveals that (1) human attention behaviour differs in free-viewing and image description tasks. Humans tend to fixate on a greater variety of regions under the latter task, (2) there is a strong relationship between described objects and attended objects ($97\%$ of the described objects are being attended), (3) a convolutional neural network as feature encoder accounts for human-attended regions during image captioning to a great extent (around $78\%$), (4) soft-attention mechanism differs from human attention, both spatially and temporally, and there is low correlation between caption scores and attention consistency scores. These indicate a large gap between humans and machines in regards to top-down attention, and (5) by integrating the soft attention model with image saliency, we can significantly improve the model's performance on Flickr30k and MSCOCO benchmarks. The dataset can be found at: this https URL.

24 citations

Patent
27 Sep 2010
TL;DR: A method for converting speech to text that analyzes multimedia content for closed captioning data, indexes that data when it is present, and otherwise extracts the audio, runs several speech-to-text conversions, and indexes an amalgamated transcript built from them.
Abstract: Methods and systems for converting speech to text are disclosed. One method includes analyzing multimedia content to determine the presence of closed captioning data. The method includes, upon detecting closed captioning data, indexing the closed captioning data as associated with the multimedia content. The method also includes, upon failure to detect closed captioning data in the multimedia content, extracting audio data from multimedia content, the audio data including speech data, performing a plurality of speech to text conversions on the speech data to create a plurality of transcripts of the speech data, selecting text from one or more of the plurality of transcripts to form an amalgamated transcript, and indexing the amalgamated transcript as associated with the multimedia content.

24 citations

Patent
31 Dec 2012
TL;DR: A captioning evaluation system that accepts captioning data and determines the number of errors in it, as well as the words per minute both across the entire corresponding event and within time intervals of the event.
Abstract: A captioning evaluation system. The system accepts captioning data and determines a number of errors in the captioning data, as well as the number of words per minute across the entirety of an event corresponding to the captioning data and time intervals of the event. The errors may be used to determine the accuracy of the captioning and the words per minute, both for the entire event and the time intervals, used to determine a cadence and/or rhythm for the captioning. The accuracy and cadence may be used to score the captioning data and captioner.

24 citations
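
The captioning evaluation abstract above mentions two measurements: an error count used for accuracy, and words per minute over the whole event and over time intervals used for cadence. A minimal Python sketch of those two calculations follows. It is an illustration only, not the patented method: the `Caption` type, the function names, the formula `1 - errors / total_words`, and the assumption that the error count is supplied by some other step are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Caption:
    start_s: float  # caption start time in seconds (assumed field)
    text: str       # caption text


def words_per_minute(captions: List[Caption], start_s: float, end_s: float) -> float:
    """Words per minute for captions whose start time falls in [start_s, end_s)."""
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    words = sum(len(c.text.split()) for c in captions if start_s <= c.start_s < end_s)
    return words / ((end_s - start_s) / 60.0)


def interval_wpm(captions: List[Caption], event_len_s: float, interval_s: float) -> List[float]:
    """Words per minute for each consecutive interval of the event."""
    rates = []
    start = 0.0
    while start < event_len_s:
        end = min(start + interval_s, event_len_s)
        rates.append(words_per_minute(captions, start, end))
        start = end
    return rates


def accuracy(total_words: int, error_count: int) -> float:
    """Assumed accuracy formula: fraction of words not counted as errors."""
    if total_words == 0:
        return 0.0
    return max(0.0, 1.0 - error_count / total_words)


captions = [
    Caption(0.0, "hello world"),
    Caption(30.0, "this is a test"),
    Caption(70.0, "more words here"),
]
whole_event = words_per_minute(captions, 0.0, 120.0)  # rate over the entire event
per_minute = interval_wpm(captions, 120.0, 60.0)      # rate per 60-second interval
```

Comparing the per-interval rates against the whole-event rate is one simple way to look at cadence; how the accuracy and cadence figures are combined into a score is not specified in the abstract and is left out here.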


Network Information
Related Topics (5)
Feature vector
48.8K papers, 954.4K citations
83% related
Object detection
46.1K papers, 1.3M citations
82% related
Convolutional neural network
74.7K papers, 2M citations
82% related
Deep learning
79.8K papers, 2.1M citations
82% related
Unsupervised learning
22.7K papers, 1M citations
81% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    536
2022    1,030
2021    504
2020    530
2019    448
2018    334