Topic

Closed captioning

About: Closed captioning is a research topic. Over its lifetime, 3,011 publications have been published within this topic, receiving 64,494 citations. The topic is also known as: CC.


Papers
Journal ArticleDOI
TL;DR: A novel CNN-based encoder-decoder framework for video captioning that first appends inter-frame differences to each CNN-extracted frame feature to obtain a more discriminative representation, then encodes each frame into a more compact feature via a one-layer convolutional mapping, which can be viewed as a reconstruction network.
Abstract: Recent advances in video captioning mainly follow an encoder-decoder (sequence-to-sequence) framework and generate captions via a recurrent neural network (RNN). However, employing an RNN as the decoder (generator) is prone to diluting long-term information, which weakens its ability to capture long-term dependencies. Recently, some work has demonstrated that the convolutional neural network (CNN) can be used to model sequential information. Despite its strengths in representation ability and computational efficiency, the CNN has not been well exploited in video captioning, partly because of the difficulty of modeling multi-modal sequences with CNNs. In this paper, we devise a novel CNN-based encoder-decoder framework for video captioning. In particular, we first append inter-frame differences to each CNN-extracted frame feature to obtain a more discriminative representation; then, with that as the input, we encode each frame into a more compact feature by a one-layer convolutional mapping, which can be viewed as a reconstruction network. In the decoding stage, we first fuse visual and lexical features; then we stack multiple dilated convolutional layers to form a hierarchical decoder. Because long-term dependencies can be captured along a shorter path in the hierarchical structure, the decoder alleviates the loss of long-term information. Experiments on two benchmark datasets show that our method obtains state-of-the-art performance.
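Below is a minimal PyTorch sketch of the pipeline this abstract describes: inter-frame differences are appended to frame features, a one-layer convolution compresses them, and stacked dilated convolutions form a hierarchical decoder. Layer sizes are illustrative assumptions and the visual-lexical fusion step is omitted, so this is not the authors' exact configuration.

```python
# Hedged sketch: CNN encoder-decoder for video captioning with frame differences
# and dilated convolutions. Dimensions and structure are illustrative assumptions.
import torch
import torch.nn as nn

class ConvCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hid_dim=512, vocab_size=10000):
        super().__init__()
        # One-layer convolutional mapping over [frame feature ; inter-frame difference].
        self.encoder = nn.Conv1d(feat_dim * 2, hid_dim, kernel_size=1)
        # Hierarchical decoder: dilated convolutions widen the receptive field,
        # so distant time steps are reachable along a short path.
        self.decoder = nn.Sequential(
            nn.Conv1d(hid_dim, hid_dim, kernel_size=3, dilation=1, padding=1), nn.ReLU(),
            nn.Conv1d(hid_dim, hid_dim, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
            nn.Conv1d(hid_dim, hid_dim, kernel_size=3, dilation=4, padding=4), nn.ReLU(),
        )
        self.word_head = nn.Linear(hid_dim, vocab_size)

    def forward(self, frame_feats):                     # frame_feats: (B, T, feat_dim)
        prev = torch.cat([frame_feats[:, :1], frame_feats[:, :-1]], dim=1)
        diffs = frame_feats - prev                      # inter-frame difference (zero for frame 0)
        x = torch.cat([frame_feats, diffs], dim=-1).transpose(1, 2)   # (B, 2*feat_dim, T)
        enc = self.encoder(x)                           # compact per-frame representation
        dec = self.decoder(enc)                         # lexical fusion omitted for brevity
        return self.word_head(dec.transpose(1, 2))      # (B, T, vocab) word logits

model = ConvCaptioner()
logits = model(torch.randn(2, 16, 2048))                # 2 clips of 16 frames
print(logits.shape)                                     # torch.Size([2, 16, 10000])
```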

25 citations

Journal ArticleDOI
TL;DR: This paper proposes a novel method, Image-Text Surgery, to synthesize pseudo image-sentence pairs, and introduces adaptive visual replacement, which adaptively filters unnecessary visual features in the pseudo data with an attention mechanism.
Abstract: Image captioning aims to generate natural language sentences that describe the salient parts of a given image. Although neural networks have recently achieved promising results, a key problem is that they can only describe concepts seen in the training image-sentence pairs. Efficient learning of novel concepts has thus become a topic of recent interest, as it alleviates the expensive manual effort of labeling data. In this paper, we propose a novel method, Image-Text Surgery, to synthesize pseudo image-sentence pairs. The pseudo pairs are generated under the guidance of a knowledge base, with syntax from a seed data set (i.e., MSCOCO) and visual information from an existing large-scale image base (i.e., ImageNet). Via the pseudo data, the captioning model learns novel concepts without any corresponding human-labeled pairs. We further introduce adaptive visual replacement, which adaptively filters unnecessary visual features in the pseudo data with an attention mechanism. We evaluate our approach on a held-out subset of the MSCOCO data set. The experimental results demonstrate that the proposed approach provides significant performance improvements over state-of-the-art methods in terms of F1 score and sentence quality. An ablation study and the qualitative results further validate the effectiveness of our approach.
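As a rough illustration of the pseudo-pair synthesis idea (not the authors' implementation), the sketch below swaps a seed concept in a caption for a novel concept drawn from an assumed knowledge-base mapping and pairs the rewritten sentence with an image of the novel concept. SEED_TO_NOVEL and novel_image_pool are hypothetical placeholders.

```python
# Hedged sketch of knowledge-base-guided pseudo image-sentence pair synthesis.
# The mapping and image pool below are invented for illustration only.
import random

SEED_TO_NOVEL = {"dog": ["wolf", "fox"], "cat": ["lynx"]}           # assumed KB mapping
novel_image_pool = {"wolf": ["wolf_001.jpg"], "fox": ["fox_003.jpg"],
                    "lynx": ["lynx_010.jpg"]}                        # e.g. ImageNet images

def synthesize_pseudo_pairs(seed_caption, seed_concept):
    """Rewrite a seed caption to each novel concept and attach a matching image."""
    pairs = []
    for novel in SEED_TO_NOVEL.get(seed_concept, []):
        sentence = seed_caption.replace(seed_concept, novel)         # text "surgery"
        image = random.choice(novel_image_pool[novel])               # visual counterpart
        pairs.append((image, sentence))
    return pairs

print(synthesize_pseudo_pairs("a dog runs across the grass", "dog"))
```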

25 citations

Proceedings ArticleDOI
09 Jul 2020
TL;DR: This paper proposes a novel memory-based network rather than a GAN, named Recurrent Relational Memory Network ($R^2M$), which encodes visual context through unsupervised training on images while enabling the memory to learn from an irrelevant textual corpus in a supervised fashion.
Abstract: Unsupervised image captioning with no annotations is an emerging challenge in computer vision, where existing approaches usually adopt GAN (Generative Adversarial Network) models. In this paper, we propose a novel memory-based network rather than a GAN, named Recurrent Relational Memory Network ($R^2M$). Unlike complicated and sensitive adversarial learning, which performs poorly for long sentence generation, $R^2M$ implements a concepts-to-sentence memory translator through two-stage memory mechanisms: fusion and recurrent memories, correlating the relational reasoning between common visual concepts and the generated words over long periods. $R^2M$ encodes visual context through unsupervised training on images, while enabling the memory to learn from an irrelevant textual corpus in a supervised fashion. Our solution has fewer learnable parameters and higher computational efficiency than GAN-based methods, which suffer heavily from parameter sensitivity. We experimentally validate the superiority of $R^2M$ over the state of the art on all benchmark datasets.
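The sketch below shows, under stated assumptions, what a two-stage memory translator could look like in PyTorch: a fusion stage attends from the generated-word history to the detected visual concepts, and a recurrent stage carries the fused state across decoding steps. It is a loose illustration of the memory idea, not the $R^2M$ architecture itself.

```python
# Hedged sketch: fusion memory (attention over concepts) + recurrent memory (GRU cell).
# All dimensions and the overall wiring are illustrative assumptions.
import torch
import torch.nn as nn

class TwoStageMemory(nn.Module):
    def __init__(self, dim=512, vocab_size=10000):
        super().__init__()
        self.fusion = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.recurrent = nn.GRUCell(dim, dim)          # memory carried across time steps
        self.word_head = nn.Linear(dim, vocab_size)

    def step(self, word_hist, concept_feats, memory):
        # Fusion memory: relate the generated-word history to the visual concepts.
        fused, _ = self.fusion(word_hist, concept_feats, concept_feats)
        # Recurrent memory: update the long-term state from the latest fused vector.
        memory = self.recurrent(fused[:, -1], memory)
        return self.word_head(memory), memory           # next-word logits, updated memory

model = TwoStageMemory()
words = torch.randn(2, 5, 512)        # embeddings of the words generated so far
concepts = torch.randn(2, 10, 512)    # embeddings of detected visual concepts
mem = torch.zeros(2, 512)
logits, mem = model.step(words, concepts, mem)
print(logits.shape)                   # torch.Size([2, 10000])
```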

25 citations

Proceedings ArticleDOI
22 Oct 2012
TL;DR: This study asked 48 deaf and hearing readers to evaluate transcripts produced by a professional captionist, ASR software, and a crowd captioning system, and found that the readers preferred crowd captions over both professional captions and ASR.
Abstract: Deaf and hard of hearing individuals need accommodations that transform aural information into visual information, such as captions generated in real time to enhance their access to spoken information in lectures and other live events. The captions produced by professional captionists work well in general events such as community or legal meetings, but are often unsatisfactory in specialized-content events such as higher education classrooms. In addition, it is hard to hire professional captionists, especially those with experience in specialized content areas, as they are scarce and expensive. The captions produced by commercial automatic speech recognition (ASR) software are far cheaper, but are often perceived as unreadable due to ASR's sensitivity to accents and background noise and its slow response time. We ran a study to evaluate the readability of captions generated by a new crowd captioning approach versus professional captionists and ASR. In this approach, captions are typed by classmates into a system that aligns and merges the multiple incomplete caption streams into a single, comprehensive real-time transcript. Our study asked 48 deaf and hearing readers to evaluate transcripts produced by a professional captionist, ASR, and crowd captioning software, and found that the readers preferred crowd captions over professional captions and ASR.
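A toy sketch of the stream-merging idea follows: each volunteer captionist contributes timestamped word tokens, and merging orders them by time while dropping near-simultaneous duplicates. The published system relies on a more sophisticated alignment; this only illustrates the concept, and all names and thresholds are invented for the example.

```python
# Hedged sketch: merge several partial caption streams into one transcript.
# The 1-second duplicate window is an arbitrary illustrative choice.
def merge_streams(streams):
    """streams: list of lists of (timestamp_seconds, word) tuples from different typists."""
    tokens = sorted(t for stream in streams for t in stream)   # order all tokens by time
    merged = []
    for ts, word in tokens:
        # Skip a word another stream already supplied at nearly the same time.
        if merged and merged[-1][1] == word and ts - merged[-1][0] < 1.0:
            continue
        merged.append((ts, word))
    return " ".join(word for _, word in merged)

stream_a = [(0.0, "today"), (0.4, "we"), (0.8, "discuss")]
stream_b = [(0.5, "we"), (0.9, "discuss"), (1.3, "recursion")]
print(merge_streams([stream_a, stream_b]))   # "today we discuss recursion"
```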

25 citations

Journal ArticleDOI
TL;DR: Captioning is commonly used to scaffold video viewing for second language learners, with the captioning affording the learners access to authentic videos that would ordinarily be out of their reach.
Abstract: Captioning is commonly used to scaffold video viewing for second language learners, with the captioning affording the learners access to authentic videos that would ordinarily be out of their reach...

25 citations


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations, 83% related
Object detection: 46.1K papers, 1.3M citations, 82% related
Convolutional neural network: 74.7K papers, 2M citations, 82% related
Deep learning: 79.8K papers, 2.1M citations, 82% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    536
2022    1,030
2021    504
2020    530
2019    448
2018    334