
Closed captioning

About: Closed captioning is a research topic. Over its lifetime, 3,011 publications have been published on this topic, receiving 64,494 citations. The topic is also known as: CC.


Papers
Posted Content
TL;DR: This work proposes to fool an image captioning system into generating targeted partial captions for an image polluted by adversarial noise, even when the targeted captions are totally irrelevant to the image content.
Abstract: In this work, we study the robustness of a CNN+RNN based image captioning system subjected to adversarial noise. We propose to fool an image captioning system into generating targeted partial captions for an image polluted by adversarial noise, even when the targeted captions are totally irrelevant to the image content. A partial caption indicates that the words at some locations in the caption are observed, while words at other locations are not. This is the first work to study exact adversarial attacks with targeted partial captions. Due to the sequential dependencies among words in a caption, we formulate the generation of adversarial noise for targeted partial captions as a structured output learning problem with latent variables. Both the generalized expectation maximization algorithm and structural SVMs with latent variables are then adopted to optimize the problem. The proposed methods generate very successful attacks against three popular CNN+RNN based image captioning models. Furthermore, the proposed attack methods are used to understand the inner mechanism of image captioning systems, providing guidance to further improve automatic image captioning systems towards human captioning.
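The attack described above alternates between estimating the latent (unobserved) caption words and updating the adversarial noise so the observed positions produce the target words. The sketch below is a rough, self-contained illustration of that alternation against a toy CNN+RNN captioner built only for this example; the model, vocabulary size, observed-position mask, step count, and noise budget are all assumptions, and this is not the paper's generalized-EM or latent structural-SVM formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, LEN = 50, 64, 6

class ToyCaptioner(nn.Module):
    """Minimal CNN encoder + LSTM decoder, decoded with teacher forcing."""
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(8, HIDDEN))
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.cell = nn.LSTMCell(HIDDEN, HIDDEN)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, image, caption):
        h = self.cnn(image)                      # image feature initializes the state
        c = torch.zeros_like(h)
        prev = torch.zeros(image.size(0), dtype=torch.long)  # index 0 acts as <bos>
        logits = []
        for t in range(LEN):
            h, c = self.cell(self.embed(prev), (h, c))
            logits.append(self.out(h))
            prev = caption[:, t]                 # teacher forcing with caption words
        return torch.stack(logits, dim=1)        # (batch, LEN, VOCAB)

model = ToyCaptioner().eval()
image = torch.rand(1, 3, 32, 32)

# Targeted partial caption: only positions 1 and 3 carry required target words;
# the remaining positions are latent and re-estimated from the model each round.
target = torch.randint(0, VOCAB, (1, LEN))
observed = torch.tensor([[False, True, False, True, False, False]])

noise = torch.zeros_like(image, requires_grad=True)
opt = torch.optim.Adam([noise], lr=0.01)

for step in range(200):
    logits = model(image + noise, target)
    with torch.no_grad():                        # latent fill-in ("E-like" step)
        target = torch.where(observed, target, logits.argmax(-1))
    # noise update ("M-like" step): push observed positions toward the target words
    loss = F.cross_entropy(logits[observed], target[observed])
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        noise.clamp_(-0.05, 0.05)                # keep the perturbation small
```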

32 citations

Journal ArticleDOI
TL;DR: Experiments on the Flickr30k and COCO datasets indicate that the proposed adaptive attention model with a visual sentinel exhibits significant improvement in terms of the BLEU and METEOR evaluation criteria.
Abstract: In the image captioning problem, it is difficult to correctly extract the global features of the images. At the same time, most attention methods force each word to correspond to an image region, ignoring the fact that words such as “the” in the description text cannot correspond to any image region. To address these problems, an adaptive attention model with a visual sentinel is proposed in this paper. In the encoding phase, the model introduces DenseNet to extract the global features of the image. At the same time, at each time step, a sentinel gate is set by the adaptive attention mechanism to decide whether to use the image feature information for word generation. In the decoding phase, a long short-term memory (LSTM) network is applied as the language generation model for the image captioning task to improve the quality of the generated captions. Experiments on the Flickr30k and COCO datasets indicate that the proposed model exhibits significant improvement in terms of the BLEU and METEOR evaluation criteria.
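The central mechanism in this abstract is the visual sentinel: at each decoding step a gate decides how much the next word should be grounded in image regions versus the language model's own memory. Below is a minimal sketch of that idea, assuming a single shared feature size, simple linear scoring, and illustrative tensor shapes; it is not the cited model's exact architecture.

```python
import torch
import torch.nn as nn

class AdaptiveAttention(nn.Module):
    """Attention over image regions plus a visual sentinel."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)   # sentinel gate from [x_t; h_{t-1}]
        self.score_v = nn.Linear(dim, 1)      # score per image region
        self.score_s = nn.Linear(dim, 1)      # score for the sentinel

    def forward(self, regions, x_t, h_prev, mem):
        # regions: (batch, R, dim) image features, e.g. pooled CNN maps
        g = torch.sigmoid(self.gate(torch.cat([x_t, h_prev], dim=-1)))
        sentinel = g * torch.tanh(mem)        # non-visual fallback vector
        scores = torch.cat([self.score_v(regions).squeeze(-1),   # (batch, R)
                            self.score_s(sentinel)], dim=1)      # (batch, 1)
        alpha = torch.softmax(scores, dim=1)  # weights over regions + sentinel
        beta = alpha[:, -1:]                  # how much to ignore the image
        visual_ctx = (alpha[:, :-1].unsqueeze(-1) * regions).sum(dim=1)
        return visual_ctx + beta * sentinel, beta   # context fed to the decoder

att = AdaptiveAttention(64)
ctx, beta = att(torch.rand(2, 49, 64), torch.rand(2, 64),
                torch.rand(2, 64), torch.rand(2, 64))
```

A high beta for a step suggests the word (e.g. “the”) is being generated from language context rather than from any image region.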

32 citations

Posted Content
TL;DR: Wang et al., as discussed by the authors, proposed a generative approach, referred to as the multi-modal stochastic RNN network (MS-RNN), which models the uncertainty observed in the data using latent stochastic variables.
Abstract: Video captioning is in essence a complex natural process, affected by various uncertainties stemming from video content, subjective judgment, etc. In this paper we build on the recent progress in using the encoder-decoder framework for video captioning and address what we find to be a critical deficiency of existing methods: most decoders propagate deterministic hidden states, and such complex uncertainty cannot be modeled efficiently by deterministic models. We therefore propose a generative approach, referred to as the multi-modal stochastic RNN network (MS-RNN), which models the uncertainty observed in the data using latent stochastic variables. MS-RNN can thus improve the performance of video captioning and generate multiple sentences to describe a video, accounting for different random factors. Specifically, a multi-modal LSTM (M-LSTM) is first proposed to interact with both visual and textual features to capture a high-level representation. Then, a backward stochastic LSTM (S-LSTM) is proposed to support uncertainty propagation by introducing latent variables. Experimental results on the challenging MSVD and MSR-VTT datasets show that the proposed MS-RNN approach outperforms state-of-the-art video captioning methods.
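The core idea here is injecting a sampled latent variable into the recurrent state so that decoding the same video can yield different captions. As a minimal sketch of that idea, the step below augments an LSTM cell with a reparameterized Gaussian latent variable and a KL regularizer; the names, wiring, and sizes are assumptions for illustration, not the exact M-LSTM/S-LSTM architecture.

```python
import torch
import torch.nn as nn

class StochasticLSTMStep(nn.Module):
    """One LSTM step with a sampled Gaussian latent variable."""
    def __init__(self, in_dim, hid_dim, z_dim):
        super().__init__()
        self.cell = nn.LSTMCell(in_dim + z_dim, hid_dim)
        self.to_mu = nn.Linear(hid_dim, z_dim)        # latent mean from h_{t-1}
        self.to_logvar = nn.Linear(hid_dim, z_dim)    # latent log-variance

    def forward(self, x_t, h, c):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterized sample
        h, c = self.cell(torch.cat([x_t, z], dim=-1), (h, c))
        # KL divergence to a standard normal prior, added to the training loss
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
        return h, c, kl

step = StochasticLSTMStep(in_dim=32, hid_dim=64, z_dim=16)
h = c = torch.zeros(2, 64)
h, c, kl = step(torch.rand(2, 32), h, c)   # rerunning samples a new z, hence a new h
```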

32 citations

Proceedings Article
01 Jan 2018
TL;DR: In this article, a novel attention mechanism is developed, which adaptively and sequentially focuses on different layers of CNN features (levels of feature abstraction), as well as local spatiotemporal regions of the feature maps at each layer.
Abstract: Previous models for video captioning often use the output from a specific layer of a Convolutional Neural Network (CNN) as video features. However, the variable, context-dependent semantics in a video may make it more appropriate to adaptively select features from multiple CNN layers. We propose a new approach for generating adaptive spatiotemporal representations of videos for the captioning task. A novel attention mechanism is developed that adaptively and sequentially focuses on different layers of CNN features (levels of feature "abstraction"), as well as local spatiotemporal regions of the feature maps at each layer. The proposed approach is evaluated on three benchmark datasets: YouTube2Text, M-VAD and MSR-VTT. Along with visualizations of the results and of how the model works, these experiments quantitatively demonstrate the effectiveness of the proposed adaptive spatiotemporal feature abstraction for translating videos to sentences with rich semantics.
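To make the layer-adaptive idea concrete, the sketch below attends spatially within each of several CNN layers and then weights the layers themselves, conditioned on the decoder state. The projection sizes, scoring functions, and two-layer example are assumptions for illustration, not the paper's actual model.

```python
import torch
import torch.nn as nn

class MultiLayerAttention(nn.Module):
    """Spatial attention within each CNN layer, then attention across layers."""
    def __init__(self, layer_dims, dim):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, dim) for d in layer_dims])
        self.spatial = nn.Linear(dim, 1)          # score per spatial location
        self.layer_score = nn.Linear(dim, 1)      # score per layer

    def forward(self, feats, h):
        # feats: list of (batch, locations_l, dim_l) maps from different layers
        pooled = []
        for f, p in zip(feats, self.proj):
            f = p(f)                               # project to a shared size
            a = torch.softmax(self.spatial(f + h.unsqueeze(1)), dim=1)
            pooled.append((a * f).sum(dim=1))      # spatially attended vector
        pooled = torch.stack(pooled, dim=1)        # (batch, layers, dim)
        w = torch.softmax(self.layer_score(pooled + h.unsqueeze(1)), dim=1)
        return (w * pooled).sum(dim=1)             # layer-adaptive context

att = MultiLayerAttention([256, 512], 64)
ctx = att([torch.rand(2, 196, 256), torch.rand(2, 49, 512)], torch.rand(2, 64))
```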

32 citations

Proceedings ArticleDOI
29 Feb 2012
TL;DR: This paper reports on the development and evaluation of the ICS Videos framework and an assessment of its value as an academic learning resource.
Abstract: Videos of classroom lectures have proven to be a popular and versatile learning resource. This paper reports on videos featuring Indexing, Captioning, and Search capability (ICS Videos). The goal is to allow a user to rapidly search for and access a topic of interest, addressing a key shortcoming of the standard video format. A lecture is automatically divided into logical, indexed video segments by analyzing video frames. Text is automatically identified with OCR technology, enhanced with image transformations, to drive keyword search. Captions can be added to videos. The ICS video player integrates indexing, search, and captioning in video playback and has been used by dozens of courses and thousands of students. This paper reports on the development and evaluation of the ICS Videos framework and an assessment of its value as an academic learning resource.
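As a rough illustration of the indexing-plus-OCR workflow described above, the sketch below samples frames, starts a new segment when consecutive sampled frames differ sharply (e.g. a slide change), and OCRs each segment's first frame to build a keyword-to-timestamp index. The libraries (OpenCV, pytesseract), sampling rate, and change threshold are assumptions for illustration, not the ICS system's actual implementation.

```python
import cv2
import pytesseract

def build_index(video_path, sample_every=30, change_thresh=40.0):
    """Return a dict mapping OCR'd keywords to segment start times (seconds)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    index, prev_gray, frame_no = {}, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_no % sample_every == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            changed = prev_gray is None or cv2.absdiff(gray, prev_gray).mean() > change_thresh
            if changed:                      # treat as the start of a new segment
                text = pytesseract.image_to_string(gray)
                for word in set(text.lower().split()):
                    index.setdefault(word, []).append(frame_no / fps)
            prev_gray = gray
        frame_no += 1
    cap.release()
    return index   # e.g. index.get("recursion") -> list of timestamps to jump to
```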

32 citations


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations (83% related)
Object detection: 46.1K papers, 1.3M citations (82% related)
Convolutional neural network: 74.7K papers, 2M citations (82% related)
Deep learning: 79.8K papers, 2.1M citations (82% related)
Unsupervised learning: 22.7K papers, 1M citations (81% related)
Performance
Metrics: No. of papers in the topic in previous years
Year    Papers
2023    536
2022    1,030
2021    504
2020    530
2019    448
2018    334