Topic

Closed captioning

About: Closed captioning is a research topic. Over its lifetime, 3,011 publications have been published within this topic, receiving 64,494 citations. The topic is also known as: CC.


Papers
Journal ArticleDOI
TL;DR: Experiments on two datasets demonstrate that the ruminant decoding method brings significant improvements over traditional single-pass decoding models and achieves state-of-the-art performance.
Abstract: The encoder-decoder framework has been the basis of popular image captioning models, which typically predict the target sentence from the encoded source image one word at a time. However, such a single-pass decoding framework encounters two problems. First, mistakes in the predicted words cannot be corrected and may propagate to the entire sentence. Second, because the single-pass decoder cannot access the following, not-yet-generated words, it can only perform local planning, choosing each word according to the preceding words, and lacks the global planning ability needed to maintain the semantic consistency and fluency of the whole sentence. To address these two problems, we design a ruminant captioning framework that contains an image encoder, a base decoder, and a ruminant decoder. Specifically, the outputs of the former (base) decoder are used as global information to guide the word prediction of the latter (ruminant) decoder, in an attempt to mimic the human polishing process. We enable joint training of the whole framework and overcome the non-differentiability of discrete words by designing a novel reinforcement-learning-based optimization algorithm. Experiments on two datasets (MS COCO and Flickr30k) demonstrate that our ruminant decoding method brings significant improvements over traditional single-pass decoding models and achieves state-of-the-art performance.
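
The abstract above describes a two-pass decoding architecture. As a rough illustration, the sketch below drafts a caption with a base decoder and then re-decodes while conditioning on a global summary of the draft; the module names, sizes, and greedy decoding are assumptions for illustration, not the paper's actual implementation (which additionally uses a reinforcement-learning objective).

# Minimal sketch of two-pass ("ruminant") decoding: a base decoder drafts a
# caption, then a second decoder re-generates it conditioned on a global
# summary of the draft. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class TwoPassCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, feat_dim=2048, hid=512, max_len=20):
        super().__init__()
        self.max_len = max_len
        self.embed = nn.Embedding(vocab_size, hid)
        self.img_proj = nn.Linear(feat_dim, hid)        # image encoder output -> hidden
        self.base_rnn = nn.GRUCell(hid, hid)            # first-pass (base) decoder
        self.rum_rnn = nn.GRUCell(hid * 2, hid)         # second-pass (ruminant) decoder
        self.out = nn.Linear(hid, vocab_size)

    def decode(self, rnn, img_h, extra=None):
        """Greedy decoding; `extra` is a per-step global context vector."""
        B = img_h.size(0)
        h = img_h
        tok = torch.zeros(B, dtype=torch.long)          # assume index 0 = <bos>
        logits, hiddens = [], []
        for _ in range(self.max_len):
            x = self.embed(tok)
            if extra is not None:
                x = torch.cat([x, extra], dim=-1)
            h = rnn(x, h)
            step = self.out(h)
            tok = step.argmax(dim=-1)
            logits.append(step)
            hiddens.append(h)
        return torch.stack(logits, 1), torch.stack(hiddens, 1)

    def forward(self, img_feat):
        img_h = torch.tanh(self.img_proj(img_feat))
        # Pass 1: draft caption from the base decoder.
        base_logits, base_h = self.decode(self.base_rnn, img_h)
        # A global summary of the draft guides the second, "polishing" pass.
        global_ctx = base_h.mean(dim=1)
        rum_logits, _ = self.decode(self.rum_rnn, img_h, extra=global_ctx)
        return base_logits, rum_logits

caps = TwoPassCaptioner()
draft, refined = caps(torch.randn(2, 2048))              # two pooled CNN features
print(draft.shape, refined.shape)                        # (2, 20, 10000) each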

44 citations

Journal ArticleDOI
TL;DR: This paper constructs spatio-temporal graphs to depict objects and their relations, where the temporal graphs represent objects’ inter-frame dynamics and the spatial graphs represent objects’ intra-frame interactive relationships, and it achieves state-of-the-art performance in terms of the BLEU@4, METEOR and CIDEr metrics.
Abstract: Video captioning is a significant and challenging task in computer vision and natural language processing, aiming to automatically describe video content with natural language sentences. Comprehensive understanding of video is the key to accurate video captioning, which needs not only to capture the global content and salient objects in video, but also to understand the spatio-temporal relations of objects, including their temporal trajectories and spatial relationships. Thus, it is important for video captioning to capture the objects’ relationships both within and across frames. Therefore, in this paper, we propose an object-aware spatio-temporal graph (OSTG) approach for video captioning. It constructs spatio-temporal graphs to depict objects and their relations, where the temporal graphs represent objects’ inter-frame dynamics and the spatial graphs represent objects’ intra-frame interactive relationships. The main novelties and advantages are: (1) Bidirectional temporal alignment: a bidirectional temporal graph is constructed along and against the temporal order to perform bidirectional temporal alignment for objects across different frames, providing complementary clues to capture the inter-frame temporal trajectory of each salient object. (2) Graph-based spatial relation learning: a spatial relation graph is constructed among objects in each frame by considering their relative spatial locations and semantic correlations, and it is exploited to learn relation features that encode intra-frame relationships of salient objects. (3) Object-aware feature aggregation: trainable VLAD (vector of locally aggregated descriptors) models perform object-aware feature aggregation on objects’ local features, learning discriminative aggregated representations for better video captioning. A hierarchical attention mechanism is also developed to distinguish the contributions of different object instances. Experiments on two widely used datasets, MSR-VTT and MSVD, demonstrate that our proposed approach achieves state-of-the-art performance in terms of the BLEU@4, METEOR and CIDEr metrics.
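
As a loose illustration of the two graph constructions mentioned in the abstract, the sketch below builds a spatial relation graph from object boxes within a frame and performs a simple bidirectional temporal alignment of objects across adjacent frames by feature similarity; the distance kernel, feature sizes, and mean-style aggregation are assumptions rather than the OSTG model itself.

# Rough sketch of the two graph ideas: a spatial relation graph from object
# boxes within a frame, and a bidirectional temporal alignment of objects
# across adjacent frames via feature similarity. Details are assumptions.
import torch

def spatial_graph(boxes, feats, sigma=0.5):
    """boxes: (N,4) xyxy; feats: (N,D). Edge weight decays with center distance."""
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2
    dist = torch.cdist(centers, centers)                  # (N, N) pairwise distances
    adj = torch.exp(-dist / (sigma * dist.mean() + 1e-6))
    adj = adj / adj.sum(dim=1, keepdim=True)              # row-normalize the graph
    return adj @ feats                                     # relation-aware features

def temporal_align(feats_t, feats_t1):
    """Match objects in frame t to frame t+1 (and back) by cosine similarity."""
    a = torch.nn.functional.normalize(feats_t, dim=1)
    b = torch.nn.functional.normalize(feats_t1, dim=1)
    sim = a @ b.T                                           # (N_t, N_t1)
    fwd = sim.argmax(dim=1)                                 # alignment t -> t+1
    bwd = sim.argmax(dim=0)                                 # alignment t+1 -> t
    return fwd, bwd

# Toy example: 3 objects per frame with 256-d appearance features.
boxes = torch.tensor([[0., 0., 10., 10.], [5., 5., 20., 20.], [30., 30., 40., 40.]])
feats_t, feats_t1 = torch.randn(3, 256), torch.randn(3, 256)
print(spatial_graph(boxes, feats_t).shape)                 # torch.Size([3, 256])
print(temporal_align(feats_t, feats_t1))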

44 citations

Patent
Dale Malik1
19 Dec 2005
TL;DR: In this article, the authors propose a system that allows a user to watch television and view both television content and Internet content simultaneously without the need to access another device, such as a laptop computer, personal digital assistant, or the like.
Abstract: Content from a source, such as the Internet, may be displayed on a window on a television, such as a crawl screen, closed caption area, or picture-in-picture (PIP) area. This may allow a user to watch television and view both television content and Internet content simultaneously without the need to access another device, such as a laptop computer, personal digital assistant, or the like. For example, a user may connect a data processing system, such as a computer, to a video control module, e.g., digital video recorder (DVR) or other box used to control a television. The user may then access a tool bar, for example, provided with an Internet browser that runs on the data processing system to identify portions of one or more Web sites to be displayed on the television through the video control module. The video control module may also be configured to allow the user to interact with the Internet content displayed on the television through use of the television remote, for example.
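
Purely as an illustration of the kind of message such a setup might pass from the browser toolbar to the video control module, the sketch below models an overlay request and a remote-key handler; every name, field, and region here is a hypothetical stand-in, not the patent's actual design.

# Hypothetical data model for an "Internet content on TV" overlay request.
from dataclasses import dataclass
from enum import Enum

class Region(Enum):
    CRAWL = "crawl"                 # ticker strip along the bottom of the screen
    CAPTION_AREA = "caption_area"   # closed-caption display area
    PIP = "pip"                     # picture-in-picture window

@dataclass
class OverlayRequest:
    source_url: str                 # portion of a Web site chosen via the toolbar
    region: Region                  # where on the TV the content is rendered
    refresh_seconds: int = 60       # how often the module refetches the content

def handle_remote_key(request: OverlayRequest, key: str) -> OverlayRequest:
    """Map a TV-remote key press to a change in the displayed overlay."""
    if key == "INFO":
        # Toggle between the small PIP window and the crawl strip.
        new_region = Region.CRAWL if request.region is Region.PIP else Region.PIP
        return OverlayRequest(request.source_url, new_region, request.refresh_seconds)
    return request

req = OverlayRequest("https://example.com/headlines", Region.PIP)
print(handle_remote_key(req, "INFO").region)               # Region.CRAWL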

43 citations

Patent
14 Jun 1999
TL;DR: In this paper, a method for compiling and searching the word content of multiple television news programs using closed captioning text encoded in the television signal is presented, where a remotely located server extracts a closed captioned text stream from a television signal and writes the text stream to a file according to parameters pre-programmed by the system administrator.
Abstract: A method for compiling and searching the word content of multiple television news programs uses closed captioning text encoded in the television signal. A remotely located server, using a closed captioning decoder, extracts a closed captioned text stream from a television signal and writes the text stream to a file according to parameters pre-programmed by the system administrator. A completed text file moves, via the Internet, from the remote server to a central server. At the central server, the text file encounters a series of pre-index processes designed to impose consistency across all sources. Text files are parsed into smaller files by story or segment. The resulting multiple files are indexed by content and origination information. An interface for submitting search parameters is provided through a World Wide Web site. The site provides web pages to allow search parameters to be specified by a user. The results of a search are web pages displaying sentences and their citations that meet the specified search parameters. Associated with each cited sentence is an option allowing the user to view the full text of the newscast story or segment from which the cited sentence is derived. The web site also provides a feature called Auto Alert that allows the user to pre-specify search parameters for ongoing, automatic submission to the index server each time a new text file is received. The results of Auto Alert are e-mailed to the user, either instantly or according to a user-defined schedule.
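
To make the pipeline concrete, here is a minimal sketch of the segment-and-index idea: caption text is split into story segments, indexed by word, and standing Auto Alert queries are matched as new segments arrive. The ">>>" story marker, the in-memory index, and the callback in place of e-mail delivery are simplifying assumptions.

# Minimal sketch of indexing closed-caption text by story segment and matching
# standing alert queries as new text arrives. Details are assumptions.
import re
from collections import defaultdict

class CaptionIndex:
    def __init__(self):
        self.segments = []                       # (station, timestamp, text)
        self.inverted = defaultdict(set)         # word -> set of segment ids
        self.alerts = []                         # (query_words, callback)

    def add_file(self, station, timestamp, caption_text):
        # Many caption feeds mark story boundaries with ">>>"; assumed here.
        for segment in re.split(r">>>", caption_text):
            segment = segment.strip()
            if not segment:
                continue
            seg_id = len(self.segments)
            self.segments.append((station, timestamp, segment))
            for word in re.findall(r"[a-z']+", segment.lower()):
                self.inverted[word].add(seg_id)
            self._check_alerts(seg_id, segment)

    def search(self, query):
        words = query.lower().split()
        hits = set.intersection(*(self.inverted.get(w, set()) for w in words))
        return [self.segments[i] for i in sorted(hits)]

    def register_alert(self, query, callback):
        self.alerts.append((set(query.lower().split()), callback))

    def _check_alerts(self, seg_id, segment):
        words = set(re.findall(r"[a-z']+", segment.lower()))
        for query_words, callback in self.alerts:
            if query_words <= words:
                callback(self.segments[seg_id])   # stand-in for sending an e-mail

index = CaptionIndex()
index.register_alert("school budget", lambda seg: print("ALERT:", seg[2][:40]))
index.add_file("WXYZ", "1999-06-14T18:00",
               ">>> City council votes on school budget. >>> Weather next.")
print(index.search("budget"))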

43 citations

Proceedings ArticleDOI
01 Jul 2015
TL;DR: The core idea of the method is to translate the given visual query into a distributional semantics based form, which is generated by the average of the sentence vectors extracted from the captions of images visually similar to the input image.
Abstract: In this paper, we propose a novel query expansion approach for improving transfer-based automatic image captioning. The core idea of our method is to translate the given visual query into a distributional semantics based form, which is generated by the average of the sentence vectors extracted from the captions of images visually similar to the input image. Using three image captioning benchmark datasets, we show that our approach provides more accurate results compared to the state-of-the-art data-driven methods in terms of both automatic metrics and subjective evaluation.
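
A hedged sketch of the query-expansion idea described above: the visual query is mapped into sentence-embedding space by averaging the caption vectors of its visually nearest training images, and the averaged vector then scores candidate captions for transfer. The toy features, embedding dimensions, and k are assumptions, not the paper's setup.

# Query expansion for transfer-based captioning: average the caption
# embeddings of the k visually most similar images, then pick the candidate
# caption closest to that averaged vector. Toy data, illustrative only.
import numpy as np

def expand_query(query_feat, train_feats, train_caption_vecs, k=5):
    """Average the caption embeddings of the k visually most similar images."""
    sims = train_feats @ query_feat / (
        np.linalg.norm(train_feats, axis=1) * np.linalg.norm(query_feat) + 1e-8)
    nearest = np.argsort(-sims)[:k]
    return train_caption_vecs[nearest].mean(axis=0)

def transfer_caption(query_vec, candidate_vecs, candidate_texts):
    """Pick the candidate caption whose embedding is closest to the expanded query."""
    sims = candidate_vecs @ query_vec / (
        np.linalg.norm(candidate_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    return candidate_texts[int(np.argmax(sims))]

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(100, 512))        # visual features of training images
train_caption_vecs = rng.normal(size=(100, 300)) # sentence vectors of their captions
query = rng.normal(size=512)                     # visual feature of the query image
expanded = expand_query(query, train_feats, train_caption_vecs, k=5)
print(transfer_caption(expanded, train_caption_vecs,
                       [f"caption {i}" for i in range(100)]))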

43 citations


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations, 83% related
Object detection: 46.1K papers, 1.3M citations, 82% related
Convolutional neural network: 74.7K papers, 2M citations, 82% related
Deep learning: 79.8K papers, 2.1M citations, 82% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Performance
Metrics
No. of papers in the topic in previous years

Year    Papers
2023    536
2022    1,030
2021    504
2020    530
2019    448
2018    334