Topic

Closed captioning

About: Closed captioning is a research topic. Over its lifetime, 3,011 publications have been published within this topic, receiving 64,494 citations. The topic is also known as: CC.


Papers
Proceedings ArticleDOI
16 Apr 2012
TL;DR: This approach couples the use of off-the-shelf ASR (Automatic Speech Recognition) software with a novel caption alignment mechanism that introduces unique audio markups into the audio stream before giving it to the ASR, and then transforms the plain transcript produced by the ASR into a timecoded transcript.
Abstract: The simple act of listening or of taking notes while attending a lesson may represent an insuperable burden for millions of people with some form of disabilities (e.g., hearing impaired, dyslexic and ESL students). In this paper, we propose an architecture that aims at automatically creating captions for video lessons by exploiting advances in speech recognition technologies. Our approach couples the usage of off-the-shelf ASR (Automatic Speech Recognition) software with a novel caption alignment mechanism that smartly introduces unique audio markups into the audio stream before giving it to the ASR and transforms the plain transcript produced by the ASR into a timecoded transcript.
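
The alignment step can be sketched roughly as follows. This is a minimal illustration only, assuming one distinctive marker clip is spliced into the audio every 30 seconds and that the ASR reliably transcribes each marker as a known token ("zulu" here); the function, token, and interval are hypothetical placeholders, not the authors' actual tooling.

    def timecode_transcript(words, marker_token="zulu", interval=30.0):
        """Turn a plain ASR word list into (approx_start_seconds, word) pairs
        by anchoring on the recognized markers, whose true positions in the
        audio are known, and spreading the words between anchors evenly."""
        # (index in transcript, true time in audio) for every recognized marker,
        # plus a synthetic anchor at the very start of the stream.
        anchors = [(-1, 0.0)]
        for idx, word in enumerate(words):
            if word == marker_token:
                anchors.append((idx, len(anchors) * interval))

        timed = []
        for (i0, t0), (i1, t1) in zip(anchors, anchors[1:]):
            span = words[i0 + 1:i1]                  # words between two markers
            step = (t1 - t0) / max(len(span), 1)
            timed.extend((t0 + k * step, w) for k, w in enumerate(span))

        # Words after the last marker: assume roughly 0.3 s per word as a fallback.
        last_idx, last_t = anchors[-1]
        timed.extend((last_t + k * 0.3, w)
                     for k, w in enumerate(words[last_idx + 1:], start=1))
        return timed

Because the true audio position of every marker is known in advance, the words recognized between two consecutive markers can be assigned interpolated timestamps, which is what turns the ASR's plain transcript into a timecoded one suitable for captions.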

36 citations

Journal ArticleDOI
Chen Chen, Shuai Mu, Wanpeng Xiao, Ye Zexiong, Liesi Wu, Fuming Ma, Qi Ju
TL;DR: A novel conditional-generative-adversarial-nets-based image captioning framework is proposed as an extension of the traditional reinforcement-learning (RL)-based encoder-decoder architecture, designed to deal with the inconsistent evaluation problem among different objective language metrics.
Abstract: In this paper, we propose a novel conditional-generative-adversarial-nets-based image captioning framework as an extension of the traditional reinforcement-learning (RL)-based encoder-decoder architecture. To deal with the inconsistent evaluation problem among different objective language metrics, we are motivated to design "discriminator" networks that automatically and progressively determine whether a generated caption is human-described or machine-generated. Two kinds of discriminator architectures (CNN- and RNN-based structures) are introduced, since each has its own advantages. The proposed algorithm is generic, so it can enhance any existing RL-based image captioning framework, and we show that the conventional RL training method is just a special case of our approach. Empirically, we show consistent improvements over all language evaluation metrics for different state-of-the-art image captioning models. In addition, the well-trained discriminators can also be viewed as objective image captioning evaluators.
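
One common way to realize this idea is to mix an objective language-metric reward with the discriminator's "looks human-written" score when computing the policy-gradient update. The sketch below is illustrative only: the `cider_score` and `discriminator` callables and the 0.5 weighting are assumptions, not the authors' released code.

    import torch

    LAMBDA = 0.5   # assumed weight between metric reward and discriminator reward

    def mixed_reward(image_feats, sampled_captions, references,
                     discriminator, cider_score, lam=LAMBDA):
        # Objective language-metric reward (e.g. CIDEr against the references).
        metric_r = torch.tensor([cider_score(c, refs)
                                 for c, refs in zip(sampled_captions, references)])
        # Discriminator reward: probability the caption looks human-written.
        with torch.no_grad():
            disc_r = discriminator(image_feats, sampled_captions)  # values in [0, 1]
        return lam * metric_r + (1.0 - lam) * disc_r

    def policy_gradient_loss(log_probs, reward, baseline):
        # Standard REINFORCE-with-baseline step used by RL-based captioners:
        # samples rewarded above the baseline are reinforced, the rest suppressed.
        advantage = (reward - baseline).detach()
        return -(advantage * log_probs.sum(dim=1)).mean()

Under this reading, a plain metric-only reward (lam = 1.0) recovers conventional RL training, which is consistent with the abstract's claim that the usual RL method is a special case of the proposed approach.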

36 citations

Posted Content
TL;DR: In this paper, a new training method called Image-Text-Image (I2T2I) was proposed, which integrates text-to-image and image-to-text (image captioning) synthesis to improve the performance of text-to-image synthesis.
Abstract: Translating information between text and image is a fundamental problem in artificial intelligence that connects natural language processing and computer vision. In the past few years, performance in image caption generation has seen significant improvement through the adoption of recurrent neural networks (RNN). Meanwhile, text-to-image generation has begun to produce plausible images using datasets of specific categories like birds and flowers. We have even seen image generation from multi-category datasets such as the Microsoft Common Objects in Context (MSCOCO) dataset through the use of generative adversarial networks (GANs). Synthesizing objects with a complex shape, however, is still challenging. For example, animals and humans have many degrees of freedom, which means that they can take on many complex shapes. We propose a new training method called Image-Text-Image (I2T2I), which integrates text-to-image and image-to-text (image captioning) synthesis to improve the performance of text-to-image synthesis. We demonstrate that I2T2I can generate better multi-category images using MSCOCO than the state-of-the-art. We also demonstrate that I2T2I can achieve transfer learning by using a pre-trained image captioning module to generate human images on the MPII Human Pose dataset.
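
The text-image-text coupling can be pictured as a generator objective that adds a captioning-consistency term to the usual conditional GAN loss. This is a rough sketch under assumed module names (text_encoder, generator, discriminator, captioner) and an assumed weighting; it is not the paper's released implementation.

    import torch
    import torch.nn.functional as F

    def i2t2i_generator_loss(sentence_ids, noise, text_encoder, generator,
                             discriminator, captioner, consistency_weight=1.0):
        text_emb = text_encoder(sentence_ids)
        fake_img = generator(noise, text_emb)

        # Adversarial term: the generated image should fool the (image, text)
        # discriminator into scoring it as real and matching the sentence.
        adv_loss = -torch.log(discriminator(fake_img, text_emb) + 1e-8).mean()

        # Image-to-text consistency term: captioning the generated image should
        # reproduce the input sentence (teacher-forced cross-entropy here).
        caption_logits = captioner(fake_img, sentence_ids[:, :-1])
        consistency = F.cross_entropy(
            caption_logits.reshape(-1, caption_logits.size(-1)),
            sentence_ids[:, 1:].reshape(-1))

        return adv_loss + consistency_weight * consistency

Keeping the captioner pre-trained and frozen is what allows the same consistency signal to be reused for transfer to a new image domain, as the abstract describes for the MPII Human Pose dataset.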

36 citations

Journal ArticleDOI
TL;DR: Neither near-verbatim captioning nor edited captioning was found to be better at facilitating comprehension; however, several issues emerged that provide specific directions for future research on edited captions.
Abstract: The study assessed the effects of near-verbatim captioning versus edited captioning on a comprehension task performed by 15 children, ages 7-11 years, who were deaf or hard of hearing. The children's animated television series Arthur was chosen as the content for the study. The researchers began the data collection procedure by asking participants to watch videotapes of the program. Researchers signed or spoke (or signed and spoke) 12 comprehension questions from a script to each participant. The sessions were videotaped, and a checklist was used to ensure consistency of the question-asking procedure across participants and sessions. Responses were coded as correct or incorrect, and the dependent variable was reported as the number of correct answers. Neither near-verbatim captioning nor edited captioning was found to be better at facilitating comprehension; however, several issues emerged that provide specific directions for future research on edited captions.

36 citations

Posted Content
TL;DR: Wang et al. propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning.
Abstract: In this paper, the problem of describing the visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work that mainly exploits the cues of video contents to produce a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning. Specifically, the encoder-decoder makes use of the forward flow to produce the sentence description based on the encoded video semantic features. Two types of reconstructors are customized to employ the backward flow and reproduce the video features based on the hidden state sequence generated by the decoder. The generation loss yielded by the encoder-decoder and the reconstruction loss introduced by the reconstructor are jointly used to train the proposed RecNet in an end-to-end fashion. Experimental results on benchmark datasets demonstrate that the proposed reconstructor can boost the encoder-decoder models and lead to significant gains in video captioning accuracy.
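
The joint objective described above can be sketched as a weighted sum of the forward generation loss and the backward reconstruction loss. Module names, the mean-squared-error choice, and the 0.2 weight below are illustrative assumptions rather than the authors' code.

    import torch
    import torch.nn.functional as F

    def recnet_loss(video_feats, caption_ids, encoder, decoder, reconstructor,
                    recon_weight=0.2):
        enc = encoder(video_feats)

        # Forward flow (video to sentence): teacher-forced cross-entropy.
        logits, hidden_states = decoder(enc, caption_ids[:, :-1])
        gen_loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            caption_ids[:, 1:].reshape(-1))

        # Backward flow (sentence to video): reproduce the video features from
        # the decoder's hidden-state sequence (MSE stands in for the paper's loss).
        recon_loss = F.mse_loss(reconstructor(hidden_states), video_feats)

        return gen_loss + recon_weight * recon_loss

Training both terms end-to-end lets the reconstruction signal act as a regularizer on the decoder's hidden states, which is the mechanism the abstract credits for the accuracy gains.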

36 citations


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations (83% related)
Object detection: 46.1K papers, 1.3M citations (82% related)
Convolutional neural network: 74.7K papers, 2M citations (82% related)
Deep learning: 79.8K papers, 2.1M citations (82% related)
Unsupervised learning: 22.7K papers, 1M citations (81% related)
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    536
2022    1,030
2021    504
2020    530
2019    448
2018    334