Topic

Closed captioning

About: Closed captioning is a research topic. Over the lifetime, 3,011 publications have been published within this topic, receiving 64,494 citations. The topic is also known as: CC.


Papers
Posted Content
TL;DR: This work removes the need for negative sampling by taking a generative modeling approach that poses the objective as a translation problem between modalities, which allows a wide variety of downstream video understanding tasks to be tackled by means of a single unified framework.
Abstract: With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video representations. Most prior work formulates the objective as a contrastive metric learning problem between the modalities. To enable effective learning, however, these strategies require a careful selection of positive and negative samples often combined with hand-designed curriculum policies. In this work we remove the need for negative sampling by taking a generative modeling approach that poses the objective as a translation problem between modalities. Such a formulation allows us to tackle a wide variety of downstream video understanding tasks by means of a single unified framework, without the need for large batches of negative samples common in contrastive metric learning. We experiment with the large-scale HowTo100M dataset for training, and report performance gains over the state-of-the-art on several downstream tasks including video classification (EPIC-Kitchens), question answering (TVQA), captioning (TVC, YouCook2, and MSR-VTT), and text-based clip retrieval (YouCook2 and MSR-VTT).
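As a rough illustration of the translation-style objective described above, here is a minimal sketch (not the authors' implementation) of a video-to-text model trained with an ordinary teacher-forced cross-entropy loss, so no negative samples are required. All module and variable names, such as VideoToTextTranslator, are illustrative assumptions.

```python
# Minimal sketch of a generative cross-modal objective: video features are
# "translated" into transcribed text with a standard sequence loss, so no
# negative sampling or contrastive batches are needed (names are illustrative).
import torch
import torch.nn as nn

class VideoToTextTranslator(nn.Module):
    def __init__(self, feat_dim=1024, d_model=512, vocab_size=30000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)           # project video features
        self.embed = nn.Embedding(vocab_size, d_model)     # text token embeddings
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)      # per-token vocabulary logits

    def forward(self, video_feats, text_in):
        # video_feats: (B, T_v, feat_dim); text_in: (B, T_t) token ids
        src = self.proj(video_feats)
        tgt = self.embed(text_in)
        causal = self.transformer.generate_square_subsequent_mask(text_in.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=causal)
        return self.lm_head(hidden)

# Training step: plain teacher-forced cross-entropy on the next token.
model = VideoToTextTranslator()
video = torch.randn(2, 32, 1024)               # dummy clip features
tokens = torch.randint(0, 30000, (2, 12))      # dummy transcript tokens
logits = model(video, tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
loss.backward()
```

The design point is simply that the objective is a standard conditional generation loss, which is why the large negative-sample batches of contrastive metric learning are unnecessary.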

29 citations

Journal ArticleDOI
TL;DR: A novel model named CaptionNet is proposed in this work as an improved LSTM specially designed for image captioning, where only attended image features are allowed to be fed into the memory of CaptionNet through input gates, reducing the dependency on the previously predicted words.
Abstract: Image captioning is a challenging visual understanding task that has drawn increasing attention from researchers. In general, the Long Short-Term Memory (LSTM) network used in popular attention-based image captioning frameworks requires two inputs at each time step: image features and previously generated words. However, errors accumulate when the previous words are inaccurate or their associated semantics are insufficient. Facing these challenges, a novel model named CaptionNet is proposed in this work as an improved LSTM specially designed for image captioning. Concretely, only attended image features are allowed to be fed into the memory of CaptionNet through input gates. In this way, the dependency on the previously predicted words is reduced, forcing the model to focus on more visual clues of the image at the current time step. Moreover, a memory initialization method called image feature encoding is designed to capture richer semantics of the target image. Evaluation on the benchmark MSCOCO and Flickr30K datasets demonstrates the effectiveness of the proposed CaptionNet model, and extensive ablation studies are performed to verify each of the proposed methods. The project page can be found at https://mic.tongji.edu.cn/3f/9c/c9778a147356/page.htm .
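The description suggests a recurrent cell in which only the attended visual feature is written into the memory, while the previous word mainly drives the gates. The following is a hedged sketch of one plausible reading of that design, not the authors' exact CaptionNet equations; VisualInputGateCell and its layer names are assumptions.

```python
# Sketch of an LSTM-style cell where only the attended image feature enters
# the memory through the input gate; the previous word embedding only
# modulates the gates. One plausible reading of the description, not the
# authors' exact formulation.
import torch
import torch.nn as nn

class VisualInputGateCell(nn.Module):
    def __init__(self, word_dim, vis_dim, hidden_dim):
        super().__init__()
        self.gates = nn.Linear(word_dim + hidden_dim, 3 * hidden_dim)  # i, f, o gates
        self.vis_candidate = nn.Linear(vis_dim, hidden_dim)            # memory content from vision

    def forward(self, word_emb, attended_vis, h_prev, c_prev):
        i, f, o = torch.chunk(self.gates(torch.cat([word_emb, h_prev], dim=-1)), 3, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(self.vis_candidate(attended_vis))  # only visual information is written
        c = f * c_prev + i * g                            # cell memory update
        h = o * torch.tanh(c)
        return h, c
```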

29 citations

Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this article, a weakly supervised Dense Event Captioning (WS-DEC) method is proposed, where the event captioner generates a sentence from a video segment and feeds it to the sentence localizer to reconstruct the segment, and the localizer produces word importance weights as a guidance for the captioner to improve event description.
Abstract: Dense Event Captioning (DEC) aims to jointly localize and describe multiple events of interest in untrimmed videos, which is an advancement of the conventional video captioning task (generating a single sentence description for a trimmed video). Weakly Supervised Dense Event Captioning (WS-DEC) goes one step further by not relying on human-annotated temporal event boundaries. However, there are few methods trying to tackle this task, and how to connect localization and description remains an open problem. In this paper, we demonstrate that under weak supervision, the event captioning module and localization module should be more closely bridged in order to improve description performance. Different from previous approaches, in our method, the event captioner generates a sentence from a video segment and feeds it to the sentence localizer to reconstruct the segment, and the localizer produces word importance weights as a guidance for the captioner to improve event description. To further bridge the sentence localizer and event captioner, a concept learner is adopted as the basis of the sentence localizer, which can be utilized to construct an induced set of concept features to enhance video features and improve the event captioner. Finally, our proposed method outperforms state-of-the-art WS-DEC methods on the ActivityNet Captions dataset.
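The captioner/localizer cycle described above can be summarized as a single training step: the captioner describes a candidate segment, the localizer maps the sentence back to a segment and scores word importance, and a reconstruction loss plus a word-weighted captioning loss are applied. The sketch below is a schematic assumption with hypothetical captioner and localizer modules and placeholder losses, not the paper's actual objective.

```python
# Schematic WS-DEC-style training step. Assumed interfaces (hypothetical):
#   captioner(video, segment) -> per-word logits (T_words, vocab)
#   localizer(video, sentence) -> (predicted segment (2,), word weights (T_words,))
import torch
import torch.nn.functional as F

def ws_dec_step(captioner, localizer, video, init_segment, gt_sentence):
    # 1) Caption the (weakly inferred) segment.
    word_logits = captioner(video, init_segment)

    # 2) The localizer reconstructs the segment from the sentence and scores
    #    how much each word mattered for localization.
    pred_segment, word_weights = localizer(video, gt_sentence)

    # 3) Reconstruction loss: the localized segment should match the segment
    #    the caption was generated from.
    loss_rec = F.l1_loss(pred_segment, init_segment)

    # 4) Word-importance-weighted captioning loss: words the localizer found
    #    important receive a larger weight in the caption objective.
    per_word_nll = F.cross_entropy(word_logits, gt_sentence, reduction='none')
    loss_cap = (word_weights.detach() * per_word_nll).mean()

    return loss_rec + loss_cap
```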

29 citations

Journal ArticleDOI
TL;DR: A novel multi-scale feature fusion network (M-FFN) is proposed for the image captioning task, incorporating discriminative features and scene contextual information of an image to enrich spatial and global semantic information.
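Only the TL;DR is available here, but the stated idea of fusing multi-scale discriminative features with scene context can be sketched roughly as follows; the projection/pooling scheme and all dimensions are assumptions, not details of M-FFN.

```python
# Rough sketch of multi-scale feature fusion: region features from several
# CNN stages are projected to a common width, pooled, and concatenated with
# a global scene-context vector. Layer names and sizes are illustrative only.
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    def __init__(self, scale_dims=(512, 1024, 2048), scene_dim=365, d_out=512):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_out) for d in scale_dims])
        self.scene_proj = nn.Linear(scene_dim, d_out)     # e.g. scene-classifier logits
        self.fuse = nn.Linear(d_out * (len(scale_dims) + 1), d_out)

    def forward(self, scale_feats, scene_feat):
        # scale_feats: list of (B, N_i, dim_i) region features at each scale
        pooled = [p(f).mean(dim=1) for p, f in zip(self.proj, scale_feats)]
        pooled.append(self.scene_proj(scene_feat))        # (B, d_out) global context
        return self.fuse(torch.cat(pooled, dim=-1))       # fused feature for the captioner
```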

29 citations

Posted Content
TL;DR: This work presents an automatic video captioning model that combines spatio-temporal attention and image classification by means of deep neural network structures based on long short-term memory; the resulting system is demonstrated to produce state-of-the-art results on the standard YouTube captioning benchmark.
Abstract: Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description that relies on temporal localization in order to ground the visual concepts. However, most existing automatic video captioning systems map from raw video data to high level textual description, bypassing localization and recognition, thus discarding potentially valuable information for content localization and generalization. In this work we present an automatic video captioning model that combines spatio-temporal attention and image classification by means of deep neural network structures based on long short-term memory. The resulting system is demonstrated to produce state-of-the-art results in the standard YouTube captioning benchmark while also offering the advantage of localizing the visual concepts (subjects, verbs, objects), with no grounding supervision, over space and time.
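A rough sketch of the spatio-temporal attention idea follows: the decoder state scores every (frame, region) feature, and the resulting weights both build the context vector and localize concepts over space and time. Module names and dimensions are illustrative assumptions, not the paper's architecture.

```python
# Sketch of spatio-temporal attention: attention weights over all
# (frame, region) features serve as soft localization without grounding
# supervision. Names and dimensions are illustrative.
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)
        self.state_proj = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, feats, h_dec):
        # feats: (B, T, R, feat_dim) region features per frame; h_dec: (B, hidden_dim)
        B, T, R, D = feats.shape
        flat = self.feat_proj(feats.view(B, T * R, D))            # (B, T*R, hidden_dim)
        energy = self.score(torch.tanh(flat + self.state_proj(h_dec).unsqueeze(1)))
        alpha = torch.softmax(energy, dim=1)                      # weights over space and time
        context = (alpha * flat).sum(dim=1)                       # (B, hidden_dim) attended feature
        return context, alpha.view(B, T, R)                       # weights localize concepts
```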

29 citations


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations, 83% related
Object detection: 46.1K papers, 1.3M citations, 82% related
Convolutional neural network: 74.7K papers, 2M citations, 82% related
Deep learning: 79.8K papers, 2.1M citations, 82% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Performance Metrics
No. of papers in the topic in previous years:
Year    Papers
2023    536
2022    1,030
2021    504
2020    530
2019    448
2018    334