Topic

Closed captioning

About: Closed captioning is a research topic. Over its lifetime, 3,011 publications have been published within this topic, receiving 64,494 citations. The topic is also known as: CC.


Papers
Patent
21 Mar 2005
TL;DR: In this paper, a subscription-based system for delivering transcribed audio information to one or more mobile devices is presented; it includes a subscription gateway configured for live/current transfer of the transcribed data to the mobile devices.
Abstract: A subscription-based system provides transcribed audio information to one or more mobile devices. Some techniques feature a system for providing subscription services for currently-generated (e.g., not stored) information (e.g., caption information, transcribed audio) to one or more mobile devices for a live/current audio event. There can be a communication network for communicating with the one or more mobile devices and a transcriber configured for transcribing the event to generate information (e.g., caption information, transcribed audio). Caption data includes transcribed data and control code data. The system includes a subscription gateway configured for live/current transfer of the transcribed data to the one or more mobile devices and for providing those devices access to the transcribed data. User preferences for subscribers can be set and/or updated by mobile device users and/or GPS-capable mobile devices to receive feeds for the live/current audio event.
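To make the described data flow concrete, below is a minimal Python sketch of the kind of subscription gateway the patent outlines: a transcriber publishes caption packets (transcribed text plus control codes) and the gateway fans them out live to subscribed mobile devices. All class and method names here are hypothetical illustrations, not the patent's implementation.

```python
# Minimal sketch of a subscription gateway for live caption feeds.
# Class and method names are hypothetical illustrations, not the
# patent's actual implementation.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CaptionPacket:
    """Caption data = transcribed text plus control code data (e.g., positioning)."""
    text: str
    control_codes: List[str] = field(default_factory=list)


class SubscriptionGateway:
    """Routes live caption packets to subscribed mobile devices."""

    def __init__(self) -> None:
        # device_id -> callback that delivers a packet to that device
        self._subscribers: Dict[str, Callable[[CaptionPacket], None]] = {}

    def subscribe(self, device_id: str, deliver: Callable[[CaptionPacket], None]) -> None:
        self._subscribers[device_id] = deliver

    def unsubscribe(self, device_id: str) -> None:
        self._subscribers.pop(device_id, None)

    def publish(self, packet: CaptionPacket) -> None:
        # Live/current transfer: fan the packet out as soon as it is transcribed.
        for deliver in self._subscribers.values():
            deliver(packet)


# Usage: a transcriber pushes packets to subscribers as the live event unfolds.
gateway = SubscriptionGateway()
gateway.subscribe("device-001", lambda p: print(f"device-001 <- {p.text}"))
gateway.publish(CaptionPacket(text="Welcome to the keynote.", control_codes=["ROLL_UP"]))
```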

53 citations

Journal ArticleDOI
TL;DR: A mechanism of fine-grained, semantic-guided visual attention is created that accurately links the relevant visual information with each semantic meaning inside the text; the resulting model significantly outperforms all other methods that use VGG-based CNN encoders without fine-tuning.
Abstract: The soft-attention mechanism is regarded as one of the representative methods for image captioning. Based on the end-to-end convolutional neural network (CNN)-long short term memory (LSTM) framework, the soft-attention mechanism attempts to link the semantic representation in text (i.e., captioning) with relevant visual information in the image for the first time. Motivated by this approach, several state-of-the-art attention methods are proposed. However, due to the constraints of CNN architecture, the given image is only segmented into a fixed-resolution grid at a coarse level. The visual feature extracted from each grid cell indiscriminately fuses all of the objects and/or object portions inside it. There is no semantic link between grid cells. In addition, the large area “stuff” (e.g., the sky or a beach) cannot be represented using the current methods. To address these problems, this paper proposes a new model based on the fully convolutional network (FCN)-LSTM framework, which can generate an attention map at a fine-grained grid-wise resolution. Moreover, the visual feature of each grid cell is contributed only by the principal object. By adopting the grid-wise labels (i.e., semantic segmentation), the visual representations of different grid cells are correlated to each other. With the ability to attend to large area “stuff,” our method can further summarize an additional semantic context from semantic labels. This method can provide comprehensive context information to the language LSTM decoder. In this way, a mechanism of fine-grained and semantic-guided visual attention is created, which can accurately link the relevant visual information with each semantic meaning inside the text. As demonstrated by three experiments including both qualitative and quantitative analyses, our model can generate captions of high quality, with high levels of accuracy, completeness, and diversity. Moreover, our model significantly outperforms all other methods that use VGG-based CNN encoders without fine-tuning.
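For readers who want the gist in code, the following is a minimal PyTorch sketch of grid-wise soft attention over a fine-grained feature map feeding an LSTM captioning decoder, in the spirit of the FCN-LSTM model above. Layer sizes, the fusion scheme, and the use of a plain mean-style attention context are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of grid-wise soft attention over a fine-grained feature map
# feeding an LSTM captioning decoder. Illustrative only; sizes and fusion
# choices are assumptions, not the paper's exact FCN-LSTM model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GridAttentionDecoder(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, feats, state):
        # feats: (B, N, feat_dim) -- one vector per fine-grained grid cell
        h, c = state
        # Attention: score each grid cell against the current hidden state.
        scores = self.att_score(torch.tanh(self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)              # (B, N, 1) attention map
        context = (alpha * feats).sum(dim=1)          # attended visual context
        h, c = self.lstm(torch.cat([self.embed(word_ids), context], dim=1), (h, c))
        return self.out(h), (h, c), alpha


# Usage with random stand-ins for a fine-grained feature map (e.g., a 28x28 grid = 784 cells).
B, N, D, H = 2, 784, 512, 512
decoder = GridAttentionDecoder()
feats = torch.randn(B, N, D)
state = (torch.zeros(B, H), torch.zeros(B, H))
logits, state, alpha = decoder.step(torch.tensor([1, 2]), feats, state)
```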

53 citations

Proceedings ArticleDOI
09 Jul 2020
TL;DR: In this article, the authors propose a visual reasoning approach for video captioning, named Reasoning Module Networks (RMN), to equip the existing encoder-decoder framework with the above reasoning capacity.
Abstract: Generating natural language descriptions for videos, i.e., video captioning, essentially requires step-by-step reasoning along the generation process. For example, to generate the sentence "a man is shooting a basketball", we need to first locate and describe the subject "man", next reason out that the man is "shooting", and then describe the object of the shooting, "basketball". However, existing visual reasoning methods designed for visual question answering are not appropriate for video captioning, because it requires more complex visual reasoning on videos over both space and time, as well as dynamic module composition along the generation process. In this paper, we propose a novel visual reasoning approach for video captioning, named Reasoning Module Networks (RMN), to equip the existing encoder-decoder framework with the above reasoning capacity. Specifically, our RMN employs 1) three sophisticated spatio-temporal reasoning modules, and 2) a dynamic and discrete module selector trained by a linguistic loss with a Gumbel approximation. Extensive experiments on the MSVD and MSR-VTT datasets demonstrate that the proposed RMN outperforms the state-of-the-art methods while providing an explicit and explainable generation process. Our code is available at this https URL.
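The discrete module selection described above can be illustrated with a Gumbel-Softmax approximation, as in the minimal PyTorch sketch below. The three "reasoning modules" are placeholder linear layers standing in for the paper's spatio-temporal modules; names and sizes are assumptions, not the RMN implementation.

```python
# Minimal sketch of discrete module selection with a straight-through
# Gumbel-Softmax approximation, in the spirit of module-selection approaches
# like RMN. The "reasoning modules" are placeholder linear layers.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModuleSelector(nn.Module):
    def __init__(self, hidden_dim=512, n_modules=3):
        super().__init__()
        # Placeholder modules standing in for spatio-temporal reasoning modules.
        self.modules_list = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(n_modules)]
        )
        self.selector = nn.Linear(hidden_dim, n_modules)

    def forward(self, h, tau=1.0):
        # h: (B, hidden_dim) decoder state at the current generation step.
        logits = self.selector(h)
        # hard=True gives a discrete (one-hot) choice in the forward pass while
        # keeping gradients via the straight-through estimator.
        weights = F.gumbel_softmax(logits, tau=tau, hard=True)        # (B, n_modules)
        outputs = torch.stack([m(h) for m in self.modules_list], 1)   # (B, n_modules, hidden_dim)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1), weights


selector = ModuleSelector()
h = torch.randn(2, 512)
fused, choice = selector(h)   # `choice` shows which module each sample used at this step
```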

53 citations

Proceedings ArticleDOI
01 Oct 2016
TL;DR: Wang et al. propose a novel video captioning framework, termed Bidirectional Long-Short Term Memory (BiLSTM), which deeply captures bidirectional global temporal structure in video.
Abstract: Video captioning has been attracting broad research attention in the multimedia community. However, most existing approaches either ignore temporal information among video frames or employ only local contextual temporal knowledge. In this work, we propose a novel video captioning framework, termed Bidirectional Long-Short Term Memory (BiLSTM), which deeply captures bidirectional global temporal structure in video. Specifically, we first devise a joint visual modelling approach to encode video data by combining a forward LSTM pass and a backward LSTM pass together with visual features from Convolutional Neural Networks (CNNs). Then, we inject the derived video representation into the subsequent language model for initialization. The benefits are two-fold: 1) comprehensively preserving sequential and visual information; and 2) adaptively learning dense visual features and sparse semantic representations for videos and sentences, respectively. We verify the effectiveness of our proposed video captioning framework on a commonly-used benchmark, i.e., the Microsoft Video Description (MSVD) corpus, and the experimental results demonstrate the superiority of the proposed approach compared to several state-of-the-art methods.
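As a rough illustration of the encoding step, the sketch below runs a bidirectional LSTM over per-frame CNN features and pools the result into a vector used to initialize the language decoder. Feature dimensions and the pooling/initialization scheme are illustrative assumptions rather than the exact BiLSTM model from the paper.

```python
# Minimal sketch of bidirectional temporal encoding of per-frame CNN features,
# with the pooled video representation used to initialize a language LSTM.
# Dimensions and the pooling scheme are illustrative assumptions.
import torch
import torch.nn as nn


class BiLSTMVideoEncoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        # Forward and backward passes over the frame sequence.
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.to_init = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) -- e.g., CNN features for T sampled frames.
        outputs, _ = self.bilstm(frame_feats)        # (B, T, 2*hidden_dim)
        video_repr = outputs.mean(dim=1)             # pool over time
        return torch.tanh(self.to_init(video_repr))  # used to initialize the language LSTM


encoder = BiLSTMVideoEncoder()
frames = torch.randn(4, 30, 2048)                    # 4 videos, 30 frames each
h0 = encoder(frames)                                 # (4, 512) initial decoder state
```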

53 citations

Posted Content
TL;DR: Two different models are proposed, which employ different schemes for injecting sentiments into image captions, and the experimental results show that the proposed models outperform the state-of-the-art models in generating sentimental (i.e., sentiment-bearing) image captions.
Abstract: Automatic image captioning has recently approached human-level performance due to the latest advances in computer vision and natural language understanding. However, most of the current models can only generate plain factual descriptions about the content of a given image. For human beings, in contrast, image caption writing is quite flexible and diverse: additional language dimensions, such as emotion, humor, and language style, are often incorporated to produce diverse, emotional, or appealing captions. In particular, we are interested in generating sentiment-conveying image descriptions, which has received little attention. The main challenge is how to effectively inject sentiments into the generated captions without altering the semantic matching between the visual content and the generated descriptions. In this work, we propose two different models, which employ different schemes for injecting sentiments into image captions. Compared with the few existing approaches, the proposed models are much simpler and yet more effective. The experimental results show that our models outperform the state-of-the-art models in generating sentimental (i.e., sentiment-bearing) image captions. In addition, we can easily manipulate the model by assigning different sentiments to the testing image to generate captions with the corresponding sentiments.
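One simple way to inject a sentiment signal into a caption decoder, sketched below, is to concatenate a learned sentiment embedding to the word embedding at every decoding step. This is only an illustrative scheme; it is not claimed to be either of the two models proposed in the paper.

```python
# Minimal sketch of sentiment injection into a caption decoder by concatenating
# a learned sentiment embedding to the word embedding at each step.
# An illustrative scheme, not the paper's proposed models.
import torch
import torch.nn as nn


class SentimentCaptionDecoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, sent_dim=32, hidden_dim=512, n_sentiments=3):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.sent_embed = nn.Embedding(n_sentiments, sent_dim)   # e.g., negative / neutral / positive
        self.lstm = nn.LSTMCell(embed_dim + sent_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, sentiment_ids, state):
        # The sentiment embedding conditions every decoding step on the requested sentiment.
        x = torch.cat([self.word_embed(word_ids), self.sent_embed(sentiment_ids)], dim=1)
        h, c = self.lstm(x, state)
        return self.out(h), (h, c)


decoder = SentimentCaptionDecoder()
state = (torch.zeros(2, 512), torch.zeros(2, 512))
# Ask for a positive caption for sample 0 and a negative one for sample 1.
logits, state = decoder.step(torch.tensor([1, 1]), torch.tensor([2, 0]), state)
```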

53 citations


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations, 83% related
Object detection: 46.1K papers, 1.3M citations, 82% related
Convolutional neural network: 74.7K papers, 2M citations, 82% related
Deep learning: 79.8K papers, 2.1M citations, 82% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Performance

Metrics

No. of papers in the topic in previous years:

Year    Papers
2023    536
2022    1,030
2021    504
2020    530
2019    448
2018    334