Topic
Closed captioning
About: Closed captioning is a research topic. Over its lifetime, 3011 publications have been published within this topic, receiving 64494 citations. The topic is also known as: CC.
Papers
16 Apr 2012
TL;DR: The approach couples off-the-shelf ASR (Automatic Speech Recognition) software with a novel caption alignment mechanism that inserts unique audio markups into the audio stream before passing it to the ASR, then transforms the plain transcript produced by the ASR into a timecoded transcript.
Abstract: The simple act of listening or of taking notes while attending a lesson may represent an insuperable burden for millions of people with some form of disabilities (e.g., hearing impaired, dyslexic and ESL students). In this paper, we propose an architecture that aims at automatically creating captions for video lessons by exploiting advances in speech recognition technologies. Our approach couples the usage of off-the-shelf ASR (Automatic Speech Recognition) software with a novel caption alignment mechanism that smartly introduces unique audio markups into the audio stream before giving it to the ASR and transforms the plain transcript produced by the ASR into a timecoded transcript.
36 citations
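The marker-based alignment idea can be sketched in a few lines: uniquely identifiable audio markups are injected at known times, and the marker tokens that surface in the ASR output are used to cut the plain transcript into timecoded segments. The function and the `MARK_n` token format below are illustrative assumptions, not the paper's actual implementation:

```python
def timecode_transcript(transcript_tokens, marker_times):
    """Split a plain ASR transcript into timecoded segments.

    transcript_tokens: ASR output containing sentinel tokens like 'MARK_0'
                       (hypothetical format for the injected audio markups)
    marker_times: the second at which each markup was injected into the audio
    """
    segments, current, start = [], [], 0.0
    for tok in transcript_tokens:
        if tok.startswith("MARK_"):
            # A marker in the transcript maps back to a known audio time.
            idx = int(tok.split("_")[1])
            end = marker_times[idx]
            if current:
                segments.append((start, end, " ".join(current)))
            current, start = [], end
        else:
            current.append(tok)
    if current:
        # Trailing text after the last marker: end time unknown.
        segments.append((start, None, " ".join(current)))
    return segments

caps = timecode_transcript(
    "hello class MARK_0 today we study MARK_1 captions".split(),
    [2.5, 5.0],
)
```

Each resulting tuple `(start, end, text)` is ready to be serialized into a caption format such as WebVTT.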
TL;DR: A novel conditional-generative-adversarial-nets-based image captioning framework is proposed as an extension of the traditional reinforcement-learning (RL)-based encoder-decoder architecture, designed to deal with the inconsistent evaluation problem among different objective language metrics.
Abstract: In this paper, we propose a novel conditional-generative-adversarial-nets-based image captioning framework as an extension of the traditional reinforcement-learning (RL)-based encoder-decoder architecture. To deal with the inconsistent evaluation problem among different objective language metrics, we design "discriminator" networks that automatically and progressively determine whether a generated caption is human-described or machine-generated. Two kinds of discriminator architectures (CNN- and RNN-based structures) are introduced, since each has its own advantages. The proposed algorithm is generic, so it can enhance any existing RL-based image captioning framework, and we show that the conventional RL training method is just a special case of our approach. Empirically, we show consistent improvements over all language evaluation metrics for different state-of-the-art image captioning models. In addition, the well-trained discriminators can also be viewed as objective image captioning evaluators.
36 citations
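The discriminator's role as a training signal can be illustrated with a toy reward function: the probability that a caption looks human-written is mixed with an objective language metric to form the RL reward. The discriminator below is a trivial stand-in (the paper uses learned CNN/RNN classifiers), and the mixing scheme and weight are assumptions for illustration only:

```python
def discriminator_prob(caption):
    # Toy stand-in: a real discriminator is a learned CNN- or RNN-based
    # classifier. Here, longer captions simply score as "more human".
    return min(len(caption.split()) / 10.0, 1.0)

def rl_reward(caption, metric_score, lam=0.5):
    # Blend an objective language metric (e.g. a CIDEr score) with the
    # discriminator probability; `lam` is an illustrative trade-off weight.
    return lam * metric_score + (1 - lam) * discriminator_prob(caption)

r = rl_reward("a man riding a horse on a beach", 0.8)
```

In the actual framework this reward would drive policy-gradient updates of the caption generator, while the discriminator is trained in alternation to tell human captions from generated ones.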
TL;DR: In this paper, a new training method called Image-Text-Image (I2T2I) was proposed, which integrates text-to-image and image-to-text (image captioning) synthesis to improve the performance of text-to-image synthesis.
Abstract: Translating information between text and image is a fundamental problem in artificial intelligence that connects natural language processing and computer vision. In the past few years, performance in image caption generation has seen significant improvement through the adoption of recurrent neural networks (RNN). Meanwhile, text-to-image generation has begun to produce plausible images using datasets of specific categories like birds and flowers. We've even seen image generation from multi-category datasets such as the Microsoft Common Objects in Context (MSCOCO) through the use of generative adversarial networks (GANs). Synthesizing objects with a complex shape, however, is still challenging. For example, animals and humans have many degrees of freedom, which means that they can take on many complex shapes. We propose a new training method called Image-Text-Image (I2T2I) which integrates text-to-image and image-to-text (image captioning) synthesis to improve the performance of text-to-image synthesis. We demonstrate that I2T2I can generate better multi-category images using MSCOCO than the state-of-the-art. We also demonstrate that I2T2I can achieve transfer learning by using a pre-trained image captioning module to generate human images on the MPII Human Pose dataset.
36 citations
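The I2T2I cycle (text → image → caption) can be illustrated with trivial stand-ins for the generator and the pre-trained captioner: a cycle score of this kind hints at how captioning feedback can supervise text-to-image synthesis. Every component and name here is a toy assumption, not the paper's method:

```python
def generate_image(text):
    # Stand-in "generator": an image representation that happens to
    # preserve the word multiset of the input sentence.
    return sorted(text.lower().split())

def caption_image(image):
    # Stand-in for the pre-trained image captioning module.
    return " ".join(image)

def cycle_overlap(text):
    # Fraction of the original words recovered after text -> image -> text.
    # In real I2T2I training, a loss on this round trip pushes the generator
    # to produce images whose captions match the input sentence.
    back = set(caption_image(generate_image(text)).split())
    orig = set(text.lower().split())
    return len(back & orig) / len(orig)

score = cycle_overlap("a bird on a branch")
```

With these perfect stand-ins the round trip always recovers everything; the interesting case is a learned generator, where a low cycle score signals that the image lost sentence content.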
TL;DR: Neither near-verbatim captioning nor edited captioning was found to be better at facilitating comprehension; however, several issues emerged that provide specific directions for future research on edited captions.
Abstract: The study assessed the effects of near-verbatim captioning versus edited captioning on a comprehension task performed by 15 children, ages 7-11 years, who were deaf or hard of hearing. The children's animated television series Arthur was chosen as the content for the study. The researchers began the data collection procedure by asking participants to watch videotapes of the program. Researchers signed or spoke (or signed and spoke) 12 comprehension questions from a script to each participant. The sessions were videotaped, and a checklist was used to ensure consistency of the question-asking procedure across participants and sessions. Responses were coded as correct or incorrect, and the dependent variable was reported as the number of correct answers. Neither near-verbatim captioning nor edited captioning was found to be better at facilitating comprehension; however, several issues emerged that provide specific directions for future research on edited captions.
36 citations
TL;DR: Wang et al. propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning.
Abstract: In this paper, the problem of describing the visual content of a video sequence with natural language is addressed. Unlike previous video captioning work, which mainly exploits cues from the video content to produce a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning. Specifically, the encoder-decoder makes use of the forward flow to produce the sentence description based on the encoded video semantic features. Two types of reconstructors are customized to employ the backward flow and reproduce the video features based on the hidden state sequence generated by the decoder. The generation loss yielded by the encoder-decoder and the reconstruction loss introduced by the reconstructor are jointly used to train the proposed RecNet in an end-to-end fashion. Experimental results on benchmark datasets demonstrate that the proposed reconstructor can boost the encoder-decoder models and lead to significant gains in video captioning accuracy.
36 citations
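The joint objective described above, generation loss plus reconstruction loss trained together end-to-end, can be sketched as a single weighted sum; the function name and the trade-off weight are illustrative assumptions, not the paper's exact formulation:

```python
def recnet_loss(generation_loss, reconstruction_loss, lam=0.2):
    """Joint RecNet-style objective: L = L_gen + lambda * L_rec.

    generation_loss:     forward-flow loss from the encoder-decoder
                         (video -> sentence)
    reconstruction_loss: backward-flow loss from the reconstructor
                         (decoder hidden states -> video features)
    lam:                 illustrative trade-off weight
    """
    return generation_loss + lam * reconstruction_loss

loss = recnet_loss(1.5, 0.5)  # 1.5 + 0.2 * 0.5 = 1.6
```

Minimizing the combined scalar lets gradients from the reconstruction term flow back through the decoder, which is how the backward flow regularizes the forward captioning path.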