Topic

Closed captioning

About: Closed captioning is a research topic. Over its lifetime, 3,011 publications have been published within this topic, receiving 64,494 citations. The topic is also known as: CC.


Papers
Journal ArticleDOI
TL;DR: The results validate that the image captions generated by the proposed method contain more accurate visual information and better comply with language habits and grammatical rules.
Abstract: To generate an image caption, the content of the image must first be fully understood; the semantic information contained in the image is then described using a phrase or sentence that conforms to certain grammatical rules. This requires techniques from both computer vision and natural language processing to connect the two different media forms, which is highly challenging. To adaptively adjust the influence of visual information and language information on the captioning process, this paper proposes integrating part-of-speech information with image captioning models based on the encoder-decoder framework. First, a part-of-speech prediction network is proposed to analyze and model the part-of-speech sequences of the words in natural language sentences; then, different mechanisms are proposed to integrate the part-of-speech guidance information with merge-based and inject-based image captioning models, respectively; finally, based on the integrated frameworks, a multi-task learning paradigm is proposed to facilitate model training. Experiments are conducted on two widely used image captioning datasets, Flickr30k and COCO, and the results validate that the image captions generated by the proposed method contain more accurate visual information and better comply with language habits and grammatical rules.
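A minimal sketch of the multi-task idea described above, assuming a PyTorch LSTM decoder: the decoder predicts both the next word and its part-of-speech tag, and the two cross-entropy losses are combined with a weight. The class name, dimensions, and loss weight `alpha` are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch (not the authors' code): a decoder that jointly predicts the
# next word and its part-of-speech tag, trained with a multi-task loss.
import torch
import torch.nn as nn

class PosGuidedDecoder(nn.Module):
    def __init__(self, vocab_size, pos_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)        # image feature initialises the LSTM state
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.word_head = nn.Linear(hidden_dim, vocab_size)   # next-word prediction
        self.pos_head = nn.Linear(hidden_dim, pos_size)      # part-of-speech prediction

    def forward(self, image_feat, captions):
        h = torch.tanh(self.init_h(image_feat))
        c = torch.zeros_like(h)
        word_logits, pos_logits = [], []
        for t in range(captions.size(1) - 1):
            h, c = self.lstm(self.embed(captions[:, t]), (h, c))
            word_logits.append(self.word_head(h))
            pos_logits.append(self.pos_head(h))
        return torch.stack(word_logits, 1), torch.stack(pos_logits, 1)

def multitask_loss(word_logits, pos_logits, words, pos_tags, alpha=0.3):
    # words / pos_tags: caption tokens and their PoS tags shifted by one step
    ce = nn.CrossEntropyLoss(ignore_index=0)  # 0 = padding index (assumption)
    caption_loss = ce(word_logits.reshape(-1, word_logits.size(-1)), words.reshape(-1))
    pos_loss = ce(pos_logits.reshape(-1, pos_logits.size(-1)), pos_tags.reshape(-1))
    return caption_loss + alpha * pos_loss
```

At inference time the word head alone generates the caption; the PoS head only serves as auxiliary guidance during training in this sketch.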

26 citations

Journal ArticleDOI
TL;DR: This paper proposes to apply dual attention on pyramid image feature maps to fully explore the visual-semantic correlations and improve the quality of generated sentences with the full consideration of the contextual information provided by the hidden state of the RNN controller.
Abstract: Generating natural sentences from images is a fundamental learning task for visual-semantic understanding in multimedia. In this paper, we propose to apply dual attention on pyramid image feature maps to fully explore the visual-semantic correlations and improve the quality of generated sentences. Specifically, by fully considering the contextual information provided by the hidden state of the RNN controller, the pyramid attention can better localize the visually indicative and semantically consistent regions in images. On the other hand, the contextual information can help re-calibrate the importance of feature components by learning the channel-wise dependencies, improving the discriminative power of visual features for better content description. We conducted comprehensive experiments on three well-known datasets, Flickr8K, Flickr30K, and MS COCO, achieving impressive results in generating descriptive and smooth natural sentences from images. Using either convolutional visual features or more informative bottom-up attention features, the composite model can boost the performance of image-to-sentence translation with a limited computational-resource overhead. The proposed pyramid attention and dual attention methods are highly modular and can be inserted into various image captioning modules to further improve performance.
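A hedged sketch of the dual-attention idea, assuming flattened pyramid feature maps and a PyTorch decoder hidden state: spatial attention selects regions given the RNN context, and a channel gate re-calibrates the attended feature. The dimensions and the sigmoid gating are assumptions, not the paper's exact design.

```python
# Illustrative dual attention: spatial attention over region features plus
# context-conditioned channel re-calibration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.spatial = nn.Linear(feat_dim + hidden_dim, attn_dim)
        self.spatial_out = nn.Linear(attn_dim, 1)
        self.channel = nn.Linear(feat_dim + hidden_dim, feat_dim)

    def forward(self, feats, hidden):
        # feats: (B, N, C) flattened pyramid regions; hidden: (B, H) RNN state
        h = hidden.unsqueeze(1).expand(-1, feats.size(1), -1)
        # spatial attention: which regions matter given the current context
        e = self.spatial_out(torch.tanh(self.spatial(torch.cat([feats, h], -1))))
        alpha = F.softmax(e, dim=1)                      # (B, N, 1)
        context = (alpha * feats).sum(dim=1)             # (B, C)
        # channel attention: re-calibrate feature channels given the context
        gate = torch.sigmoid(self.channel(torch.cat([context, hidden], -1)))
        return gate * context                            # attended visual vector
```

The returned vector would feed the next decoding step of the captioning RNN, which is how such attention modules are typically wired in.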

26 citations

Book ChapterDOI
23 Aug 2020
TL;DR: Wen et al. as mentioned in this paper proposed a distinctiveness metric, between-set CIDEr (CIDErBtw), to evaluate the distinctiveness of a caption with respect to those of similar images.
Abstract: A wide range of image captioning models has been developed, achieving significant improvement based on popular metrics, such as BLEU, CIDEr, and SPICE. However, although the generated captions can accurately describe the image, they are generic for similar images and lack distinctiveness, i.e., cannot properly describe the uniqueness of each image. In this paper, we aim to improve the distinctiveness of image captions through training with sets of similar images. First, we propose a distinctiveness metric—between-set CIDEr (CIDErBtw) to evaluate the distinctiveness of a caption with respect to those of similar images. Our metric shows that the human annotations of each image are not equivalent based on distinctiveness. Thus we propose several new training strategies to encourage the distinctiveness of the generated caption for each image, which are based on using CIDErBtw in a weighted loss function or as a reinforcement learning reward. Finally, extensive experiments are conducted, showing that our proposed approach significantly improves both distinctiveness (as measured by CIDErBtw and retrieval metrics) and accuracy (e.g., as measured by CIDEr) for a wide variety of image captioning baselines. These results are further confirmed through a user study. Project page: https://wenjiaxu.github.io/ciderbtw/.
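A hedged sketch of how CIDErBtw could drive a weighted loss, as described above: each ground-truth caption is scored against the references of similar images, and more distinctive captions (lower CIDErBtw) receive a larger training weight. The `cider` callable, the softmax weighting, and the temperature are placeholders and assumptions, not the paper's exact formulation.

```python
# Illustrative CIDErBtw-weighted training loss; `cider(cand, refs)` stands in
# for any off-the-shelf CIDEr scorer.
import torch

def ciderbtw(caption, similar_image_refs, cider):
    """Average CIDEr of `caption` against the reference captions of similar images."""
    scores = [cider(caption, refs) for refs in similar_image_refs]
    return sum(scores) / max(len(scores), 1)

def reference_weights(gt_captions, similar_image_refs, cider, temperature=1.0):
    """More distinctive references (lower CIDErBtw) receive larger weight."""
    btw = torch.tensor([ciderbtw(c, similar_image_refs, cider) for c in gt_captions])
    return torch.softmax(-btw / temperature, dim=0)  # weights sum to 1

def weighted_caption_loss(per_reference_nll, weights):
    """per_reference_nll: (num_refs,) negative log-likelihood of each ground truth."""
    return (weights * per_reference_nll).sum()
```

The same CIDErBtw score could instead be used as a reinforcement-learning reward term, which the abstract also mentions as one of the proposed training strategies.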

26 citations

Proceedings ArticleDOI
01 Jul 2019
TL;DR: A training mechanism of multi-scale cropping for remote sensing image captioning is proposed, which can extract more fine-grained information from remote sensing images and enhance the generalization performance of the base model.
Abstract: With the rapid development of artificial satellites, a large number of high-resolution remote sensing images can now be easily obtained. Recently, remote sensing image captioning, which aims to generate accurate and concise descriptive sentences for remote sensing images, has been advanced by template-based and encoder-decoder models, with several related datasets released. Based on an encoder-decoder model, we propose a multi-scale cropping training mechanism for remote sensing image captioning in this paper, which can extract more fine-grained information from remote sensing images and enhance the generalization performance of the base model. The experimental results on two datasets, UCM-captions and Sydney-captions, demonstrate that the proposed approach effectively improves performance in describing high-resolution remote sensing images.
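A minimal sketch of a multi-scale cropping augmentation in the spirit of the training mechanism described above, using PIL; the scale set and output size are assumptions, not the paper's settings.

```python
# Illustrative multi-scale crop: randomly crop the image at one of several
# scales, then resize back to the network input size.
import random
from PIL import Image

def multi_scale_crop(img: Image.Image, out_size=224, scales=(1.0, 0.85, 0.7, 0.55)):
    w, h = img.size
    s = random.choice(scales)                      # pick a crop scale
    cw, ch = int(w * s), int(h * s)
    left = random.randint(0, w - cw)
    top = random.randint(0, h - ch)
    crop = img.crop((left, top, left + cw, top + ch))
    return crop.resize((out_size, out_size), Image.BILINEAR)
```

Applied per training sample, this exposes the captioning encoder to both global scenes and finer-grained sub-regions of the same remote sensing image.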

26 citations

Proceedings ArticleDOI
TL;DR: Wang et al. as mentioned in this paper integrated comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework.
Abstract: In e-commerce, consumer-generated videos, which generally convey consumers' individual preferences for different aspects of certain products, are massive in volume. To recommend these videos to potential consumers more effectively, diverse and catchy video titles are critical. However, consumer-generated videos seldom come with appropriate titles. To bridge this gap, we integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework. Although automatic video titling is very useful and demanding, it is much less addressed than video captioning. The latter focuses on generating sentences that describe videos as a whole, while our task requires product-aware multi-grained video analysis. To tackle this issue, the proposed method consists of two processes, i.e., granular-level interaction modeling and abstraction-level story-line summarization. Specifically, granular-level interaction modeling first utilizes temporal-spatial landmark cues, descriptive words, and abstractive attributes to build three individual graphs, and recognizes the intra-actions within each graph through Graph Neural Networks (GNN). Then a global-local aggregation module is proposed to model inter-actions across graphs and aggregate the heterogeneous graphs into a holistic graph representation. Abstraction-level story-line summarization further considers both frame-level video features and the holistic graph to exploit the interactions between products and backgrounds and generate the story-line topic of the video. We collect a large-scale dataset from real-world data in Taobao, a world-leading e-commerce platform, and will make a desensitized version publicly available to nourish further development of the research community.
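A hedged sketch of the graph encoding and global-local aggregation step, reduced to a dense-adjacency message-passing layer per modality graph followed by fusion of the pooled graph embeddings. This is an illustrative reduction in PyTorch, not the paper's GNN.

```python
# Illustrative per-graph message passing plus cross-graph aggregation into one
# holistic representation (landmarks, words, attributes as three graphs).
import torch
import torch.nn as nn

class SimpleGraphLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (N, D) node features; adj: (N, N) normalised adjacency matrix
        return torch.relu(self.lin(adj @ x))  # aggregate neighbours, then transform

class GlobalLocalAggregator(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.layers = nn.ModuleList([SimpleGraphLayer(dim) for _ in range(3)])
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, graphs):
        # graphs: list of three (node_features, adjacency) pairs, one per modality
        pooled = [layer(x, adj).mean(dim=0) for layer, (x, adj) in zip(self.layers, graphs)]
        return self.fuse(torch.cat(pooled, dim=-1))  # holistic graph representation
```

In the described pipeline, this holistic representation would then be combined with frame-level video features by the story-line summarization stage that generates the title.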

26 citations


Network Information
Related Topics (5)
Feature vector
48.8K papers, 954.4K citations
83% related
Object detection
46.1K papers, 1.3M citations
82% related
Convolutional neural network
74.7K papers, 2M citations
82% related
Deep learning
79.8K papers, 2.1M citations
82% related
Unsupervised learning
22.7K papers, 1M citations
81% related
Performance
Metrics
No. of papers in the topic in previous years
Year | Papers
2023 | 536
2022 | 1,030
2021 | 504
2020 | 530
2019 | 448
2018 | 334