Topic
Closed captioning
About: Closed captioning is a research topic. Over its lifetime, 3,011 publications have been published within this topic, receiving 64,494 citations. The topic is also known as: CC.
Papers published on a yearly basis
Papers
TL;DR: A Hierarchical & Multimodal Video Caption (HMVC) model is proposed that jointly learns the dynamics within both the visual and textual modalities for the video captioning task, inferring a sentence of arbitrary length from an input video with an arbitrary number of frames.
37 citations
11 Jul 2006
TL;DR: The development of a system that enables editors to correct errors in the captions as they are created by Automatic Speech Recognition is described.
Abstract: Deaf and hard of hearing people can find it difficult to follow speech through hearing alone or to take notes when lip-reading or watching a sign-language interpreter. Notetakers summarise what is being said, while qualified sign language interpreters with a good understanding of the relevant higher education subject content are in very scarce supply. Real time captioning/transcription is not normally available in UK higher education because of the shortage of real time stenographers. Lectures can be digitally recorded and replayed to provide multimedia revision material for students who attended the class and a substitute learning experience for students unable to attend. Automatic Speech Recognition can provide real time captioning directly from lecturers' speech in classrooms, but it is difficult to obtain accuracy comparable to stenography. This paper describes the development of a system that enables editors to correct errors in the captions as they are created by Automatic Speech Recognition.
37 citations
TL;DR: This work proposes an end-to-end pipeline named Fused GRU with Semantic-Temporal Attention (STA-FG), which can explicitly incorporate high-level visual concepts into the generation of semantic-temporal attention for video captioning.
37 citations
24 Apr 2018
TL;DR: A DNN for fine-grained action classification and video captioning that performs much better than the existing classification benchmark for Something-Something, with impressive fine-grained results, and it yields a strong baseline on the new Something-Something captioning task.
Abstract: We describe a DNN for video classification and captioning, trained end-to-end, with shared features, to solve tasks at different levels of granularity, exploring the link between granularity in a source task and the quality of learned features for transfer learning. For solving the new task domain in transfer learning, we freeze the trained encoder and fine-tune a neural net on the target domain. We train on the Something-Something dataset with over 220,000 videos, and multiple levels of target granularity, including 50 action groups, 174 fine-grained action categories and captions. Classification and captioning with Something-Something are challenging because of the subtle differences between actions, applied to thousands of different object classes, and the diversity of captions penned by crowd actors. Our model performs better than existing classification baselines for Something-Something, with impressive fine-grained results, and it yields a strong baseline on the new Something-Something captioning task. Experiments reveal that training with more fine-grained tasks tends to produce better features for transfer learning.
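The transfer-learning recipe the abstract describes (freeze the trained encoder, fine-tune only a head on the target domain) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the "encoder" here is a stand-in frozen random projection rather than a deep video network, and the head is a toy logistic-regression classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "pretrained" encoder: a frozen random projection + nonlinearity.
# (Hypothetical toy; the paper's encoder is a deep video network.)
W_enc = rng.normal(size=(8, 4))          # frozen: never updated below

def encode(x):
    return np.tanh(x @ W_enc)            # fixed features from the frozen encoder

# Trainable head: logistic regression on top of the frozen features.
W_head = np.zeros(4)
b_head = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary "target domain" data.
X = rng.normal(size=(64, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def loss():
    p = sigmoid(encode(X) @ W_head + b_head)
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

loss_before = loss()
for _ in range(200):                     # gradient steps on the head ONLY
    H = encode(X)
    p = sigmoid(H @ W_head + b_head)
    g = p - y                            # dL/dlogits for logistic loss
    W_head -= 0.5 * (H.T @ g) / len(y)   # W_enc is untouched: encoder stays frozen
    b_head -= 0.5 * g.mean()
loss_after = loss()
```

Only `W_head` and `b_head` change during fine-tuning; the encoder parameters are excluded from every update, which is the essence of the frozen-encoder transfer setup.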
37 citations
19 Oct 2017
TL;DR: This work designs a key control unit, termed the visual gate, to adaptively decide "when" and "what" the language generator attends to during the word generation process, and employs a bottom-up workflow to learn a pool of semantic attributes that serve as the propositional attention resources.
Abstract: Visual content description has been attracting broad research attention in the multimedia community because it deeply uncovers the intrinsic semantic facets of visual data. Most existing approaches formulate visual captioning as a machine translation task (i.e., from vision to language) via a top-down paradigm with global attention, which fails to distinguish visual from non-visual parts during word generation. In this work, we propose a novel adaptive attention strategy for visual captioning, which can selectively attend to salient visual content based on linguistic knowledge. Specifically, we design a key control unit, termed the visual gate, to adaptively decide "when" and "what" the language generator attends to during the word generation process. We map all the preceding outputs of the language generator into a latent space to derive a representation of sentence structure, which assists the visual gate in choosing appropriate attention timing. Meanwhile, we employ a bottom-up workflow to learn a pool of semantic attributes that serve as the propositional attention resources. We evaluate the proposed approach on two commonly used benchmarks, i.e., MSCOCO and MSVD. The experimental results demonstrate the superiority of our proposed approach compared to several state-of-the-art methods.
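The "visual gate" idea described above can be sketched as one decoding step of sentinel-style adaptive attention: score the visual regions plus one extra non-visual (sentinel) slot, and read the gate off the sentinel's attention weight. This is a minimal NumPy illustration under assumed shapes and a simplified scoring function, not the authors' exact architecture.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def visual_gate_step(V, h, s, w):
    """One decoding step of sentinel-style adaptive attention.

    V : (K, d) visual region features
    h : (d,)   current language-generator hidden state
    s : (d,)   sentinel vector summarizing non-visual (linguistic) context
    w : (d,)   scoring vector (a stand-in for the learned attention MLP)
    """
    # Score K visual slots plus one extra sentinel slot.
    scores = np.concatenate([np.tanh(V + h) @ w, [np.tanh(s + h) @ w]])
    alpha = softmax(scores)              # attention over K regions + sentinel
    beta = alpha[-1]                     # the "visual gate": weight on non-visual info
    # Mixed context: attended visual content plus gated sentinel.
    ctx = alpha[:-1] @ V + beta * s
    return ctx, beta, alpha

# Demo with random features (hypothetical dimensions).
rng = np.random.default_rng(1)
V = rng.normal(size=(5, 6))
h = rng.normal(size=6)
s = rng.normal(size=6)
w = rng.normal(size=6)
ctx, beta, alpha = visual_gate_step(V, h, s, w)
```

A `beta` near 1 means the generator is producing a non-visual word (e.g. "of", "the") from linguistic context alone; a `beta` near 0 means it is grounding the word in the attended visual regions.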
37 citations