Topic

Closed captioning

About: Closed captioning is a research topic. Over its lifetime, 3,011 publications have been published within this topic, receiving 64,494 citations. The topic is also known as: CC.


Papers
Posted Content
06 Dec 2017
TL;DR: This paper designs three approaches for crafting adversarial examples in image captioning: a targeted caption method, a targeted keyword method, and an untargeted method. It formulates the process of finding adversarial perturbations as optimization problems and designs novel loss functions for efficient search.
Abstract: Modern neural image captioning systems typically adopt the encoder-decoder framework consisting of two principal components: a convolutional neural network (CNN) for image feature extraction and a recurrent neural network (RNN) for caption generation. Inspired by the robustness analysis of CNN-based image classifiers to adversarial perturbations, we propose Show-and-Fool, a novel algorithm for crafting adversarial examples in neural image captioning. Unlike image classification tasks with a finite set of class labels, finding visually-similar adversarial examples in an image captioning system is much more challenging since the space of possible captions in a captioning system is almost infinite. In this paper, we design three approaches for crafting adversarial examples in image captioning: (i) targeted caption method; (ii) targeted keyword method; and (iii) untargeted method. We formulate the process of finding adversarial perturbations as optimization problems and design novel loss functions for efficient search. Experimental results on the Show-and-Tell model and MSCOCO data set show that Show-and-Fool can successfully craft visually-similar adversarial examples with randomly targeted captions, and the adversarial examples can be made highly transferable to the Show-Attend-and-Tell model. Consequently, the presence of adversarial examples leads to new robustness implications of neural image captioning. To the best of our knowledge, this is the first work on crafting effective adversarial examples for image captioning tasks.
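A minimal sketch of the targeted caption attack idea, assuming a hypothetical PyTorch captioner that returns teacher-forced per-step logits; the paper's actual formulation optimizes a C&W-style loss on the Show-and-Tell model rather than the plain cross-entropy used here:

```python
# Minimal sketch of a targeted caption attack in the spirit of Show-and-Fool.
# `captioner(img, target_ids)` returning teacher-forced logits of shape (T, V)
# is a hypothetical interface, not the paper's actual API.
import torch
import torch.nn.functional as F

def targeted_caption_attack(captioner, image, target_ids,
                            steps=1000, lr=0.01, c=1.0):
    """Search for a small perturbation that makes the captioner emit
    the target caption while keeping the image visually similar."""
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        adv = (image + delta).clamp(0, 1)        # stay a valid image
        logits = captioner(adv, target_ids)      # teacher-forced logits (T, V)
        caption_loss = F.cross_entropy(logits, target_ids)
        distortion = delta.pow(2).sum()          # visual-similarity penalty
        loss = c * caption_loss + distortion     # joint objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (image + delta).clamp(0, 1).detach()
```

The constant c trades off making the target caption likely against keeping the perturbation small, mirroring the optimization-problem framing described in the abstract.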

48 citations

Posted Content
TL;DR: XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning that is designed to pre-train text-to-image caption generators through three novel generation tasks: Image-conditioned Masked Language Modeling (IMLM), Image-conditioned Denoising Autoencoding (IDA), and Text-conditioned Image Feature Generation (TIFG).
Abstract: While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly. In this paper, we propose XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning that is designed to pre-train text-to-image caption generators through three novel generation tasks, including Image-conditioned Masked Language Modeling (IMLM), Image-conditioned Denoising Autoencoding (IDA), and Text-conditioned Image Feature Generation (TIFG). As a result, the pre-trained XGPT can be fine-tuned without any task-specific architecture modifications to create state-of-the-art models for image captioning. Experiments show that XGPT obtains new state-of-the-art results on the benchmark datasets, including COCO Captions and Flickr30k Captions. We also use XGPT to generate new image captions as data augmentation for the image retrieval task and achieve significant improvement on all recall metrics.
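A hedged sketch of what the Image-conditioned Masked Language Modeling (IMLM) objective could look like; the decoder signature (`decoder(ids, memory=...)`), the mask token id, and the 15% masking rate are illustrative assumptions, not XGPT's actual implementation:

```python
# Sketch of an image-conditioned masked language modeling loss:
# mask some caption tokens, then reconstruct them while the decoder
# cross-attends to the image features.
import torch
import torch.nn.functional as F

def imlm_loss(decoder, image_feats, caption_ids, mask_id, p_mask=0.15):
    """Mask a fraction of caption tokens and train the decoder to
    recover them conditioned on the image."""
    corrupted = caption_ids.clone()
    mask = torch.rand_like(caption_ids, dtype=torch.float) < p_mask
    corrupted[mask] = mask_id                        # corrupt the input side
    logits = decoder(corrupted, memory=image_feats)  # (B, T, vocab)
    # Compute the loss only on the masked positions, BERT-style.
    return F.cross_entropy(logits[mask], caption_ids[mask])
```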

48 citations

Journal ArticleDOI
TL;DR: The use of closed captioned television to teach reading to adults was investigated in this study, using a pre-experimental design. Results indicated that students improved significantly on a pre-post word recognition measure; however, student performance did not differ across treatments.
Abstract: The use of closed captioned television to teach reading to adults was investigated in this study, using a pre-experimental design. Of most interest were the effects of the use of closed captioned television as a medium for sight vocabulary development. Also of interest were students' reactions to using closed captioned television as a means of reading instruction. Results indicated that, overall, students improved significantly on a pre-post word recognition measure; however, student performance did not differ across treatments. Also, there were no significant differences among groups on measures administered after each lesson. Moreover, the group using closed captioned television, without instruction, did evidence a degree of success in reaching a specific criterion level on weekly sight vocabulary tests. Finally, student attitudes toward closed captioned television were extremely positive, not only toward its use as a means of learning to read, but as a means of increasing general knowledge.

48 citations

Proceedings ArticleDOI
01 Aug 2017
TL;DR: It is found that, in general, late merging outperforms injection, suggesting that RNNs are better viewed as encoders, rather than generators.
Abstract: Image captioning has evolved into a core task for Natural Language Generation and has also proved to be an important testbed for deep learning approaches to handling multimodal representations. Most contemporary approaches rely on a combination of a convolutional network to handle image features, and a recurrent network to encode linguistic information. The latter is typically viewed as the primary “generation” component. Beyond this high-level characterisation, a CNN+RNN model supports a variety of architectural designs. The dominant model in the literature is one in which visual features encoded by a CNN are “injected” as part of the linguistic encoding process, driving the RNN’s linguistic choices. By contrast, it is possible to envisage an architecture in which visual and linguistic features are encoded separately, and merged at a subsequent stage. In this paper, we address two related questions: (1) Is direct injection the best way of combining multimodal information, or is a late merging alternative better for the image captioning task? (2) To what extent should a recurrent network be viewed as actually generating, rather than simply encoding, linguistic information?
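An illustrative sketch of the two architectures the paper contrasts, under assumed dimensions and layer choices: in the inject design the image conditions the RNN itself, while in the merge design the RNN encodes language only and the two modalities are combined afterwards.

```python
# Illustrative inject-vs-merge captioners; layer sizes and the choice of
# init-inject (image initializes the RNN state) are assumptions for
# illustration, not the paper's exact configurations.
import torch
import torch.nn as nn

class InjectCaptioner(nn.Module):
    def __init__(self, vocab, embed=256, img_dim=2048, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.init_h = nn.Linear(img_dim, hidden)   # image sets the RNN state
        self.rnn = nn.GRU(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, img_feats, tokens):
        h0 = torch.tanh(self.init_h(img_feats)).unsqueeze(0)
        h, _ = self.rnn(self.embed(tokens), h0)    # RNN sees the image
        return self.out(h)

class MergeCaptioner(nn.Module):
    def __init__(self, vocab, embed=256, img_dim=2048, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.rnn = nn.GRU(embed, hidden, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, img_feats, tokens):
        h, _ = self.rnn(self.embed(tokens))        # language-only encoding
        merged = h + self.img_proj(img_feats).unsqueeze(1)
        return self.out(merged)                    # modalities merged late
```

In the merge variant the RNN never sees the image, which is what lets the paper ask whether the RNN is really "generating" or merely encoding linguistic information.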

48 citations

Patent
06 Sep 2001
TL;DR: In this article, a method and an apparatus for use in home television video recording, playback, and viewing are presented, in which the audio information of an electronic signal, including digital representations thereof, is analyzed and its words and phrases are compared against words and phrases stored in electronic memory, so that undesirable words or phrases can be eliminated from audible or visible representations of the audio, with options for replacing undesirable words with acceptable words.
Abstract: A method and an apparatus for use in connection with home television video recording, playback, and viewing involve processing an electronic signal that includes audio and video information. The audio information, including digital representations thereof, is analyzed and modified: words and phrases represented in the audio information are compared with words and phrases stored in electronic memory so that undesirable words or phrases can be eliminated from audible or visible representations of the audio, with options for replacing undesirable words with acceptable words. The options include varying degrees of selectivity in specifying words as undesirable and control over the substitute words used to replace them. The options for controlling the language-filtering method and apparatus are selectable from an on-screen menu through operation of a control panel on the language filter apparatus or by use of a conventional television remote transmitter. Full capability of the method and apparatus depends only on the presence of closed caption or similar digitally-encoded language information being received with a television signal, but special instructions transmitted with a television signal may also be responded to, for example to activate particular language libraries or to customize a library for the program material.
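A minimal sketch of the caption-based word-filtering step the patent describes, with a hypothetical replacement library; a real decoder would also mute or bleep the matching span of audio:

```python
# Sketch of closed-caption language filtering: match caption words against
# a stored list of undesirable terms and substitute acceptable replacements.
# The replacement library and the replace-only policy are assumptions.
import re

REPLACEMENTS = {"darn": "gosh"}   # hypothetical library of substitute words

def filter_caption(caption: str, replacements=REPLACEMENTS) -> str:
    """Return the caption with undesirable words swapped for acceptable
    ones, preserving leading capitalization."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        sub = replacements.get(word.lower(), word)
        return sub.capitalize() if word[0].isupper() else sub
    pattern = r"\b(" + "|".join(map(re.escape, replacements)) + r")\b"
    return re.sub(pattern, swap, caption, flags=re.IGNORECASE)

print(filter_caption("Darn it!"))  # -> "Gosh it!"
```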

48 citations


Network Information
Related Topics (5)
- Feature vector: 48.8K papers, 954.4K citations (83% related)
- Object detection: 46.1K papers, 1.3M citations (82% related)
- Convolutional neural network: 74.7K papers, 2M citations (82% related)
- Deep learning: 79.8K papers, 2.1M citations (82% related)
- Unsupervised learning: 22.7K papers, 1M citations (81% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    536
2022    1,030
2021    504
2020    530
2019    448
2018    334