Topic

Closed captioning

About: Closed captioning is a research topic. Over its lifetime, 3,011 publications have been published within this topic, receiving 64,494 citations. The topic is also known as: CC.


Papers
Proceedings ArticleDOI
01 Jun 2019
TL;DR: This article is the first survey of biomedical image captioning, discussing datasets, evaluation measures, and state-of-the-art methods, and suggesting two baselines, a weak and a stronger one; the latter outperforms all current state-of-the-art systems on one of the datasets.
Abstract: Image captioning applied to biomedical images can assist and accelerate the diagnosis process followed by clinicians. This article is the first survey of biomedical image captioning, discussing datasets, evaluation measures, and state-of-the-art methods. Additionally, we suggest two baselines, a weak and a stronger one; the latter outperforms all current state-of-the-art systems on one of the datasets.

41 citations

Journal ArticleDOI
TL;DR: Quantitative evaluation and user studies via Amazon Mechanical Turk show that the three novel features of the CSMN help enhance the performance of personalized image captioning over state-of-the-art captioning models.
Abstract: We address personalized image captioning, which generates a descriptive sentence for a user's image, accounting for prior knowledge such as her active vocabulary or writing style in her previous documents. As applications of personalized image captioning, we solve two post-automation tasks in social networks: hashtag prediction and post generation. Hashtag prediction predicts a list of hashtags for an image, while post generation creates a natural text consisting of normal words, emojis, and even hashtags. We propose a novel personalized captioning model named Context Sequence Memory Network (CSMN). Its unique updates over existing memory networks include (i) exploiting memory as a repository for multiple types of context information, (ii) appending previously generated words into memory to capture long-term information, and (iii) adopting a CNN memory structure to jointly represent nearby ordered memory slots for better context understanding. For evaluation, we collect a new dataset, InstaPIC-1.1M, comprising 1.1M Instagram posts from 6.3K users. We further use the benchmark YFCC100M dataset [1] to validate the generality of our approach. With quantitative evaluation and user studies via Amazon Mechanical Turk, we show that the three novel features of the CSMN help enhance the performance of personalized image captioning over state-of-the-art captioning models.

41 citations
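To make the three CSMN ideas summarized above more concrete, here is a minimal, hedged PyTorch sketch (not the authors' code; the class name, sizes, and the usage at the end are illustrative assumptions): a memory holding several kinds of context slots, previously generated words appended back into that memory, and a 1-D CNN that mixes neighboring slots before an attention read.

```python
# Minimal sketch (not the authors' code) of the three CSMN ideas described above:
# (i) a memory holding several kinds of context, (ii) appending generated words
# back into memory, (iii) a 1-D CNN over neighboring memory slots.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextMemoryDecoder(nn.Module):
    def __init__(self, vocab_size, dim=256, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # 1-D convolution that mixes each memory slot with its neighbors (idea iii)
        self.slot_cnn = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.query = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, context_memory, prev_words):
        # context_memory: (B, M, dim)  image/user-vocabulary context slots (idea i)
        # prev_words:     (B, T)       previously generated word ids (idea ii)
        word_mem = self.embed(prev_words)                        # (B, T, dim)
        memory = torch.cat([context_memory, word_mem], dim=1)    # (B, M+T, dim)
        memory = self.slot_cnn(memory.transpose(1, 2)).transpose(1, 2)
        q = self.query(word_mem[:, -1])                          # query with last word
        attn = F.softmax(torch.bmm(memory, q.unsqueeze(2)).squeeze(2), dim=1)
        read = torch.bmm(attn.unsqueeze(1), memory).squeeze(1)   # attention read
        return self.out(read)                                    # next-word logits

# Usage: the word memory would be extended one generated word at a time.
dec = ContextMemoryDecoder(vocab_size=1000)
ctx = torch.randn(2, 10, 256)               # 10 context slots per sample
words = torch.randint(0, 1000, (2, 4))      # 4 words generated so far
next_logits = dec(ctx, words)               # (2, 1000)
```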

Journal ArticleDOI
TL;DR: In this article, a CLIP4Clip model is proposed to transfer the knowledge of the image-text pretrained CLIP model to video-text tasks in an end-to-end manner.

41 citations
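The TL;DR above gives no implementation detail, but one simple, commonly used way to transfer an image-text CLIP model to video-text retrieval is to encode sampled frames with CLIP's image encoder, mean-pool the frame embeddings into a video embedding, and score captions by cosine similarity. The sketch below assumes the Hugging Face transformers CLIP API and the openai/clip-vit-base-patch32 checkpoint; it illustrates that general recipe, not the paper's exact model.

```python
# Hedged sketch: reuse an image-text CLIP model for video-text retrieval by
# mean-pooling frame embeddings. Model name and API are assumptions, not from the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def video_text_similarity(frames: list[Image.Image], texts: list[str]) -> torch.Tensor:
    """Cosine similarity between a video (sampled frames) and candidate captions."""
    with torch.no_grad():
        img_in = processor(images=frames, return_tensors="pt")
        txt_in = processor(text=texts, return_tensors="pt", padding=True)
        frame_emb = model.get_image_features(**img_in)        # (num_frames, D)
        video_emb = frame_emb.mean(dim=0, keepdim=True)       # mean-pool frames -> (1, D)
        text_emb = model.get_text_features(**txt_in)          # (num_texts, D)
        video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        return video_emb @ text_emb.T                          # (1, num_texts) similarities
```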

Posted Content
TL;DR: This work proposes Scan2Cap, an end-to-end trained method to detect objects in the input scene and describe them in natural language, which can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin.
Abstract: We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in natural language. We use an attention mechanism that generates descriptive tokens while referring to the related components in the local context. To reflect object relations (i.e. relative spatial relations) in the generated captions, we use a message passing graph module to facilitate learning object relation features. Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin (a 27.61% CIDEr@0.5IoU improvement).

41 citations
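As a rough illustration of the "message passing graph module" idea described above, the hedged sketch below runs one message-passing step over detected object features, using relative box-center offsets as the spatial relation signal. It is not the authors' implementation; the layer names, sizes, and aggregation choice are assumptions.

```python
# Hedged sketch of one message-passing step over detected 3D object features,
# producing relation-aware features a caption decoder could attend to.
import torch
import torch.nn as nn

class RelationMessagePassing(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        # message from object j to object i uses both features and their center offset
        self.message = nn.Sequential(nn.Linear(2 * dim + 3, dim), nn.ReLU())
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, feats, centers):
        # feats:   (N, dim)  per-detected-object features
        # centers: (N, 3)    box centers, used for relative spatial relations
        n = feats.size(0)
        fi = feats.unsqueeze(1).expand(n, n, -1)               # receiver i
        fj = feats.unsqueeze(0).expand(n, n, -1)               # sender j
        offset = centers.unsqueeze(0) - centers.unsqueeze(1)   # (N, N, 3)
        msg = self.message(torch.cat([fi, fj, offset], dim=-1)).mean(dim=1)
        return torch.relu(self.update(torch.cat([feats, msg], dim=-1)))

# Usage on dummy detections:
mp = RelationMessagePassing(dim=128)
obj_feats = torch.randn(8, 128)          # 8 detected objects
obj_centers = torch.randn(8, 3)
rel_feats = mp(obj_feats, obj_centers)   # (8, 128)
```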

Posted Content
TL;DR: A visual question answering model is designed that combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions.
Abstract: Much recent progress in Vision-to-Language problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. In this paper we first propose a method of incorporating high-level concepts into the successful CNN-RNN approach, and show that it achieves a significant improvement on the state of the art in both image captioning and visual question answering. We further show that the same mechanism can be used to incorporate external knowledge, which is critically important for answering high-level visual questions. Specifically, we design a visual question answering model that combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. It particularly allows questions to be asked about the contents of an image, even when the image itself does not contain a complete answer. Our final model achieves the best reported results on both image captioning and visual question answering on several benchmark datasets.

41 citations
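The abstract above describes feeding high-level semantic concepts, rather than raw CNN features alone, into the CNN-RNN pipeline. The hedged sketch below shows one common way that idea is realized: predict a vector of attribute probabilities from pooled image features and use it to initialize an LSTM caption decoder. Layer names and sizes are illustrative assumptions, and the external-knowledge-base component is not sketched.

```python
# Hedged sketch: an image is first mapped to high-level attribute probabilities,
# which then condition an RNN language model. Names and sizes are assumptions.
import torch
import torch.nn as nn

class AttributeConditionedCaptioner(nn.Module):
    def __init__(self, num_attributes=256, vocab_size=1000, dim=512):
        super().__init__()
        # attribute detector: pooled image features -> per-attribute probabilities
        self.attr_head = nn.Sequential(nn.Linear(2048, num_attributes), nn.Sigmoid())
        # the attribute vector initializes the LSTM state instead of raw CNN features
        self.init_h = nn.Linear(num_attributes, dim)
        self.init_c = nn.Linear(num_attributes, dim)
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, cnn_feats, captions):
        # cnn_feats: (B, 2048) pooled image features; captions: (B, T) word ids
        attrs = self.attr_head(cnn_feats)                 # high-level concepts
        h0 = torch.tanh(self.init_h(attrs)).unsqueeze(0)  # (1, B, dim)
        c0 = torch.tanh(self.init_c(attrs)).unsqueeze(0)
        hidden, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.out(hidden)                           # (B, T, vocab) logits

model = AttributeConditionedCaptioner()
logits = model(torch.randn(2, 2048), torch.randint(0, 1000, (2, 7)))
```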


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations (83% related)
Object detection: 46.1K papers, 1.3M citations (82% related)
Convolutional neural network: 74.7K papers, 2M citations (82% related)
Deep learning: 79.8K papers, 2.1M citations (82% related)
Unsupervised learning: 22.7K papers, 1M citations (81% related)
Performance
Metrics
No. of papers in the topic in previous years:
Year    Papers
2023    536
2022    1,030
2021    504
2020    530
2019    448
2018    334