Topic

Closed captioning

About: Closed captioning is a research topic. Over the lifetime, 3011 publications have been published within this topic receiving 64494 citations. The topic is also known as: CC.


Papers
Proceedings Article
15 Jun 2019
TL;DR: Zhang et al. design an end-to-end context and attribute grounded dense captioning framework consisting of a contextual visual mining module and a multi-level attribute grounded description generation module.
Abstract: Dense captioning aims at simultaneously localizing semantic regions and describing these regions-of-interest (ROIs) with short phrases or sentences in natural language. Previous studies have shown remarkable progress, but they are often vulnerable to the aperture problem, in which a caption generated from the features inside one ROI lacks contextual coherence with its surrounding context in the input image. In this work, we investigate contextual reasoning based on multi-scale message propagation from neighboring contents to the target ROIs. To this end, we design a novel end-to-end context and attribute grounded dense captioning framework consisting of 1) a contextual visual mining module and 2) a multi-level attribute grounded description generation module. Knowing that captions often co-occur with linguistic attributes (such as who, what and where), we also incorporate auxiliary supervision from hierarchical linguistic attributes to augment the distinctiveness of the learned captions. Extensive experiments and ablation studies on the Visual Genome dataset demonstrate the superiority of the proposed model in comparison to state-of-the-art methods.
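
To make the contextual reasoning idea concrete, here is a minimal sketch of how a target ROI feature could be fused with attention-pooled messages from its neighbors before captioning. The function names, the dot-product attention form, and the concatenation fusion are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of context-grounded ROI features, loosely inspired by the
# contextual visual mining idea above. All names and the attention form are
# assumptions for illustration, not the authors' implementation.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def contextual_roi_feature(roi_feat, neighbor_feats):
    """Fuse a target ROI feature with its surrounding context.

    roi_feat:       (d,) feature of the target region
    neighbor_feats: (n, d) features of neighboring regions
    Returns a (2d,) context-grounded feature: the ROI feature
    concatenated with an attention-weighted sum of its neighbors.
    """
    scores = neighbor_feats @ roi_feat          # similarity to each neighbor
    weights = softmax(scores)                   # normalize to attention weights
    context = weights @ neighbor_feats          # (d,) pooled context message
    return np.concatenate([roi_feat, context])  # feed this to the captioner

# Example: one ROI with three neighbors, 4-d features
rng = np.random.default_rng(0)
fused = contextual_roi_feature(rng.normal(size=4), rng.normal(size=(3, 4)))
print(fused.shape)  # (8,)
```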

29 citations

Patent
07 Nov 1996
TL;DR: This patent describes a television system in which at least program title information for programs to be transmitted in the future is transmitted in advance to form a channel guide listing.
Abstract: In a television system in which at least program title information for programs to be transmitted in the future is transmitted in advance to form a channel guide listing, apparatus is provided for acquiring one of the title information and the current date, and for generating a display signal comprising data representing a text screen containing one of the title information and the current date, for recording a user-viewable screen display on a video tape ahead of the television program signal. The title or date information acts as a leader to the following television program. In a second embodiment of the invention, in those instances where descriptive text accompanies the program listing, apparatus of the invention records the descriptive text relating to the title, the star, the director, or the content of the program.
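
As a rough illustration of the leader idea, the sketch below composes a text screen from a title (or, absent one, the current date) plus optional descriptive text, as might be recorded ahead of the program. All layout choices here (32-character width, centering) are invented for the example and are not taken from the patent.

```python
# Hypothetical composition of the "leader" text screen described above.
from datetime import date

def leader_screen(title=None, descriptive_text=None, width=32):
    # Per the claim: use the title if available, else the current date
    header = title if title else date.today().isoformat()
    lines = [header.center(width)]
    if descriptive_text:
        # Wrap the optional description to the text-screen width
        words, line = descriptive_text.split(), ""
        for w in words:
            if len(line) + len(w) + 1 > width:
                lines.append(line.center(width))
                line = w
            else:
                line = (line + " " + w).strip()
        lines.append(line.center(width))
    return "\n".join(lines)

print(leader_screen("Evening News", "Anchored by J. Doe. Live at six."))
```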

29 citations

Book Chapter
11 Sep 2006
TL;DR: The authors describe an LVCSR system for automatic online subtitling (closed captioning) of TV transmissions of Czech Parliament meetings, based on Hidden Markov Models, lexical trees and a bigram language model.
Abstract: This paper describes an LVCSR system for automatic online subtitling (closed captioning) of TV transmissions of Czech Parliament meetings. The recognition system is based on Hidden Markov Models, lexical trees and a bigram language model. The acoustic model is trained on 40 hours of parliament speech and the language model on more than 10M tokens of parliament speech transcriptions. The first part of the article focuses on text normalization and class-based language model preparation. The second part describes the recognition network and its decoding with respect to real-time operation demands, using a vocabulary of up to 100k words. The third part outlines the application framework allowing generation and display of subtitles for any audio/video source. Finally, experimental results obtained on parliament speeches, with recognition accuracy varying from 80 to 95% (depending on the topic discussed), are reported and discussed.
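
The bigram language model is the easiest component above to illustrate in isolation. The toy sketch below trains add-one-smoothed bigram probabilities from a few normalized sentences; a real decoder would combine such scores with HMM acoustic likelihoods over a lexical tree, which is omitted here. Everything in the snippet is an assumption for illustration.

```python
# Toy bigram language model with add-one (Laplace) smoothing, so unseen
# bigrams keep a nonzero probability during decoding.
from collections import Counter

def train_bigram(corpus_sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams, vocab_size):
    # P(w | w_prev) with add-one smoothing
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

uni, bi = train_bigram(["the meeting is open", "the meeting is closed"])
v = len(uni)
print(bigram_prob("meeting", "is", uni, bi, v))    # seen bigram: high
print(bigram_prob("meeting", "open", uni, bi, v))  # unseen bigram: smoothed
```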

29 citations

Posted Content
Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, Jingjing Liu
TL;DR: HERO is a novel framework for large-scale video+language omni-representation learning, where the local context of a video frame is captured by a Cross-modal Transformer via multimodal fusion, and global video context is captured by a Temporal Transformer.
Abstract: We present HERO, a novel framework for large-scale video+language omni-representation learning. HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer via multimodal fusion, and global video context is captured by a Temporal Transformer. In addition to standard Masked Language Modeling (MLM) and Masked Frame Modeling (MFM) objectives, we design two new pre-training tasks: (i) Video-Subtitle Matching (VSM), where the model predicts both global and local temporal alignment; and (ii) Frame Order Modeling (FOM), where the model predicts the right order of shuffled video frames. HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions. Comprehensive experiments demonstrate that HERO achieves new state of the art on multiple benchmarks spanning Text-based Video/Video-moment Retrieval, Video Question Answering (QA), Video-and-language Inference and Video Captioning tasks across different domains. We also introduce two new challenging benchmarks, How2QA and How2R, for Video QA and Retrieval, collected from diverse multimodal video content.
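
Of the two new pre-training tasks, Frame Order Modeling is simple to sketch at the data level: shuffle a fraction of a clip's frames and keep the original indices as prediction targets. The function below is a hypothetical preprocessing step; the shuffle ratio and interface are assumptions for illustration, not HERO's released code.

```python
# Sketch of building a Frame Order Modeling (FOM) training example:
# shuffle a subset of frames and record their original positions.
import random

def make_fom_example(frames, shuffle_ratio=0.15, seed=None):
    """Shuffle a subset of frames; return (shuffled_frames, target_order).

    target_order[i] gives the original position of shuffled_frames[i],
    which the model is trained to predict from context.
    """
    rng = random.Random(seed)
    n = len(frames)
    k = max(2, int(n * shuffle_ratio))          # need >= 2 frames to reorder
    picked = sorted(rng.sample(range(n), k))    # positions to shuffle
    permuted = picked[:]
    rng.shuffle(permuted)
    order = list(range(n))
    shuffled = list(frames)
    for dst, src in zip(picked, permuted):
        shuffled[dst] = frames[src]
        order[dst] = src
    return shuffled, order

frames = [f"frame_{i}" for i in range(10)]
shuffled, order = make_fom_example(frames, seed=1)
print(order)  # the model must recover these original indices
```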

29 citations

Patent
07 Aug 2008
TL;DR: This patent proposes extracting caption data from the input video stream, translating it into at least one output caption format, and packaging the translated caption data into data packets for insertion into a video stream.
Abstract: Methods of preserving captioning information in an input video stream through transcoding of the input video stream include extracting caption data from the input video stream, translating the caption data into at least one output caption format, packaging the translated caption data into data packets for insertion into a video stream, synchronizing the packaged caption data with a transcoded version of the input video stream, receiving a preliminary output video stream that is a transcoded version of the input video stream, and combining the packaged caption data with the preliminary output video stream to form an output video stream. Related systems and computer program products are also disclosed.
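
The claim enumerates a pipeline of discrete steps, which can be sketched as a skeleton where each step is a stub. Every function and type below is a placeholder standing in for real caption handling (e.g. CEA-608/708 parsing and stream muxing); none of it is the patent's implementation.

```python
# Skeleton of a caption-preserving transcode flow, one stub per claimed step.
from dataclasses import dataclass

@dataclass
class CaptionPacket:
    pts: float      # presentation timestamp, seconds
    payload: bytes  # caption bytes in the output format

def extract_captions(input_stream):
    # Placeholder: pull caption data out of the input video stream
    return [CaptionPacket(pts=0.0, payload=b"HELLO")]

def translate_captions(packets, target_format="CEA-708"):
    # Placeholder: convert payloads into the output caption format
    return packets

def synchronize(packets, pts_offset):
    # Re-stamp caption timing to match the transcoded video's timeline
    return [CaptionPacket(p.pts + pts_offset, p.payload) for p in packets]

def transcode_preserving_captions(input_stream, transcode, pts_offset=0.0):
    captions = extract_captions(input_stream)
    captions = translate_captions(captions)
    video = transcode(input_stream)        # preliminary output video stream
    captions = synchronize(captions, pts_offset)
    return video, captions                 # combined downstream into one stream

video, caps = transcode_preserving_captions(b"...", transcode=lambda s: s)
print(caps[0].pts, caps[0].payload)
```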

29 citations


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations, 83% related
Object detection: 46.1K papers, 1.3M citations, 82% related
Convolutional neural network: 74.7K papers, 2M citations, 82% related
Deep learning: 79.8K papers, 2.1M citations, 82% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    536
2022    1,030
2021    504
2020    530
2019    448
2018    334