Topic

Closed captioning

About: Closed captioning is a research topic. Over its lifetime, 3,011 publications have been published within this topic, receiving 64,494 citations. The topic is also known as: CC.


Papers
Posted Content
TL;DR: This article shows that audio signals can carry a surprising amount of information for high-level visual-lingual tasks such as weakly-supervised dense event captioning in videos.
Abstract: Multi-modal learning, particularly between imaging and linguistic modalities, has made amazing strides in many high-level fundamental visual understanding problems, ranging from language grounding to dense event captioning. However, much of the research has been limited to approaches that either do not take audio corresponding to video into account at all, or that model audio-visual correlations only in service of sound or sound source localization. In this paper, we present evidence that audio signals can carry a surprising amount of information when it comes to high-level visual-lingual tasks. Specifically, we focus on the problem of weakly-supervised dense event captioning in videos and show that audio on its own can nearly rival the performance of a state-of-the-art visual model and, combined with video, can improve on state-of-the-art performance. Extensive experiments on the ActivityNet Captions dataset show that our proposed multi-modal approach outperforms state-of-the-art unimodal methods and validate our specific feature representation and architecture design choices.
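
A minimal sketch of the kind of audio-visual fusion this abstract describes, assuming a simple late-fusion design; the class name, feature dimensions, and PyTorch framing are illustrative assumptions rather than the authors' actual architecture.

```python
# Hypothetical late fusion of audio and visual features for event
# captioning; dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=2048, hidden_dim=512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Fuse per-segment audio and visual features before the captioning head.
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, segments, audio_dim)
        # visual_feats: (batch, segments, visual_dim)
        a = self.audio_proj(audio_feats)
        v = self.visual_proj(visual_feats)
        return self.fusion(torch.cat([a, v], dim=-1))  # (batch, segments, hidden_dim)
```

A captioning decoder could then consume the fused segment features in place of visual-only features.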

45 citations

Posted Content
TL;DR: Defines a new task, Personality-Captions, where the goal is to be as engaging to humans as possible by incorporating controllable style and personality traits, and builds models that combine existing work on sentence representations and image representations.
Abstract: Standard image captioning tasks such as COCO and Flickr30k are factual, neutral in tone, and (to a human) state the obvious (e.g., "a man playing a guitar"). While such tasks are useful for verifying that a machine understands the content of an image, they are not engaging to humans as captions. With this in mind we define a new task, Personality-Captions, where the goal is to be as engaging to humans as possible by incorporating controllable style and personality traits. We collect and release a large dataset of 201,858 such captions conditioned on 215 possible traits. We build models that combine existing work on (i) sentence representations (Mazare et al., 2018), with Transformers trained on 1.7 billion dialogue examples; and (ii) image representations (Mahajan et al., 2018), with ResNets trained on 3.5 billion social media images. We obtain state-of-the-art performance on Flickr30k and COCO, and strong performance on our new task. Finally, online evaluations confirm that our task and models are engaging to humans, with our best model close to human performance.
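
A rough sketch, under stated assumptions, of the kind of combination the abstract describes: an image representation and a personality-trait embedding jointly scored against candidate caption representations. Encoder choices, dimensions, and the scoring rule are assumptions, not the paper's exact model.

```python
# Illustrative trait-conditioned caption scorer; all dimensions and the
# dot-product scoring rule are assumptions made for the sake of the sketch.
import torch
import torch.nn as nn

class TraitConditionedScorer(nn.Module):
    def __init__(self, num_traits=215, image_dim=2048, text_dim=512, joint_dim=512):
        super().__init__()
        self.trait_emb = nn.Embedding(num_traits, joint_dim)
        self.image_proj = nn.Linear(image_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, image_feats, trait_ids, caption_feats):
        # image_feats:   (batch, image_dim)   e.g. pooled CNN features
        # trait_ids:     (batch,)             index of the conditioning trait
        # caption_feats: (batch, text_dim)    e.g. pooled sentence encodings
        context = self.image_proj(image_feats) + self.trait_emb(trait_ids)
        captions = self.text_proj(caption_feats)
        # Higher dot product = better (image, trait) / caption match.
        return (context * captions).sum(dim=-1)
```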

44 citations

Patent
25 Jul 2006
TL;DR: In this article, a decoder device decodes the closed captioning information to determine the position of the speaker within the video data, and the time index to correlate the captioning text and positioning information to a specific frame of video data.
Abstract: Closed captioning information is provided regarding the location of a speaker, and when the text is spoken. An audio/video signal includes video data and the closed captioning information. The closed captioning information includes a time index, a closed captioning text, and positioning information. The positioning information indicates a position within a frame of the video data, and is associated with the closed captioning text for a given time index. The position corresponds to the speaker who is speaking the associated closed captioning text. A decoder device decodes the closed captioning information to determine the position of the speaker within the video data, and the time index to correlate the closed captioning text and positioning information to a specific frame of video data. The video data is preferably scaled to provide a less than full screen video. The scaled video is appropriately positioned on a display screen and talk bubbles, which provide a visual link between the closed captioning text and the speaker, are preferably displayed off the scaled video. Alternatively, the video is not scaled and the talk bubbles are superimposed on the full screen video in a blended fashion.
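
A minimal sketch of the caption record this patent describes (a time index, the caption text, and a position locating the speaker within the frame) and of correlating records to a specific frame. Field names and the matching tolerance are assumptions for illustration, not the patent's exact encoding.

```python
# Hypothetical positioned-caption record and frame lookup; field names and
# the time-matching rule are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class PositionedCaption:
    time_index: float  # seconds into the video
    text: str          # closed captioning text for this time index
    x: float           # normalized horizontal speaker position in the frame (0-1)
    y: float           # normalized vertical speaker position in the frame (0-1)

def captions_for_frame(captions: List[PositionedCaption],
                       frame_time: float,
                       tolerance: float = 0.5) -> List[PositionedCaption]:
    """Return records whose time index matches the given frame, so a renderer
    can draw a talk bubble linking the caption text to the speaker."""
    return [c for c in captions if abs(c.time_index - frame_time) <= tolerance]
```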

44 citations

Proceedings ArticleDOI
01 Oct 2019
TL;DR: Presents evidence that audio signals can carry a surprising amount of information when it comes to high-level visual-lingual tasks; the proposed multi-modal approach outperforms state-of-the-art unimodal methods and validates specific feature representation and architecture design choices.
Abstract: Multi-modal learning, particularly between imaging and linguistic modalities, has made amazing strides in many high-level fundamental visual understanding problems, ranging from language grounding to dense event captioning. However, much of the research has been limited to approaches that either do not take audio corresponding to video into account at all, or that model audio-visual correlations only in service of sound or sound source localization. In this paper, we present evidence that audio signals can carry a surprising amount of information when it comes to high-level visual-lingual tasks. Specifically, we focus on the problem of weakly-supervised dense event captioning in videos and show that audio on its own can nearly rival the performance of a state-of-the-art visual model and, combined with video, can improve on state-of-the-art performance. Extensive experiments on the ActivityNet Captions dataset show that our proposed multi-modal approach outperforms state-of-the-art unimodal methods and validate our specific feature representation and architecture design choices.

44 citations

Patent
07 Jan 2002
TL;DR: A method and apparatus for home television receivers that processes an electronic signal using audio and video processors; the audio information, including digital representations thereof, is analyzed and compared against words and phrases stored in electronic memory so that undesirable words or phrases can be eliminated from audible or visible representations of the audio, with options for replacing undesirable words with acceptable words.
Abstract: A method and apparatus for use in connection with home television receivers involves processing an electronic signal using audio and video processors. The audio information, including digital representations thereof, is analyzed and modified: words and phrases represented in the audio information are compared with words and phrases stored in electronic memory so that undesirable words or phrases can be eliminated from audible or visible representations of the audio, with options for replacing undesirable words with acceptable words. The options include varying degrees of selectivity in specifying words as undesirable and control over the substitute words used to replace them. The options for controlling the language-filtering method and apparatus are selectable from an on-screen menu through use of a conventional television remote transmitter. Full capability of the method and apparatus depends only on the presence of closed caption or similar digitally-encoded language information being received with the television signal. Special instructions transmitted with a television signal may also be responded to, for activating particular language libraries, customizing a library for the program material, and for unrelated viewer information and control functions.
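
A rough sketch of the filtering step this patent describes: words in the closed caption text are compared against a stored list and replaced with acceptable substitutes. The replacement table below is a placeholder; the patent's actual language libraries and menu-driven selectivity options are not modeled.

```python
# Illustrative word filter over caption text; the replacement table is a
# placeholder, not content from the patent.
import re

REPLACEMENTS = {
    "darn": "oh no",      # placeholder entries for illustration only
    "heck": "somewhere",
}

def filter_caption(text: str, replacements: dict = REPLACEMENTS) -> str:
    def substitute(match: re.Match) -> str:
        word = match.group(0)
        return replacements.get(word.lower(), word)
    # Replace whole words only, leaving the rest of the caption untouched.
    return re.sub(r"[A-Za-z']+", substitute, text)

# e.g. filter_caption("Where the heck is it") -> "Where the somewhere is it"
```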

44 citations


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations, 83% related
Object detection: 46.1K papers, 1.3M citations, 82% related
Convolutional neural network: 74.7K papers, 2M citations, 82% related
Deep learning: 79.8K papers, 2.1M citations, 82% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Performance
Metrics
No. of papers in the topic in previous years:
Year: Papers
2023: 536
2022: 1,030
2021: 504
2020: 530
2019: 448
2018: 334