Topic

Closed captioning

About: Closed captioning is a research topic. Over its lifetime, 3,011 publications have been published within this topic, receiving 64,494 citations. The topic is also known as: CC.


Papers
Patent
22 Sep 2010
TL;DR: In this patent, a synchronization process between captioning data and/or corresponding metatags and the associated media file parses the media file, correlates the caption information and/or metatags with segments of the media file, and provides a capability for textual search and selection of particular segments.
Abstract: A synchronization process between captioning data and/or corresponding metatags and the associated media file parses the media file, correlates the caption information and/or metatags with segments of the media file, and provides a capability for textual search and selection of particular segments. A time-synchronized version of the captions is created that is synchronized to the moment that the speech is uttered in the recorded media. The caption data is leveraged to enable search engines to index not merely the title of a video, but the entirety of what was said during the video as well as any associated metatags relating to contents of the video. Further, because the entire media file is indexed, a search can request a particular scene or occurrence within the event recorded by the media file, and the exact moment within the media relevant to the search can be accessed and played for the requester.
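The indexing idea is straightforward to illustrate. Below is a minimal sketch (not the patented system itself): it parses SRT-style caption cues into time-stamped segments and builds an inverted index, so a text query resolves to the exact playback time. The SRT format and all function names here are illustrative assumptions.

```python
# Sketch only: parse SRT-style captions into time-stamped segments and build
# an inverted index so a text search can jump to the exact moment of speech.
import re
from collections import defaultdict

TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

def to_seconds(ts: str) -> float:
    h, m, s, ms = map(int, TIME.match(ts).groups())
    return h * 3600 + m * 60 + s + ms / 1000.0

def parse_srt(text: str):
    """Yield (start_seconds, caption_text) for each cue in an SRT string."""
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) < 3:
            continue
        start = lines[1].split("-->")[0].strip()
        yield to_seconds(start), " ".join(lines[2:])

def build_index(cues):
    """Map each lowercase word to the list of start times where it is spoken."""
    index = defaultdict(list)
    for start, caption in cues:
        for word in re.findall(r"[a-z']+", caption.lower()):
            index[word].append(start)
    return index

srt = """1
00:00:01,000 --> 00:00:04,000
Welcome to the lecture on closed captioning.

2
00:00:04,500 --> 00:00:07,000
Captions make video content searchable."""

index = build_index(parse_srt(srt))
print(index["searchable"])  # -> [4.5]: playback can seek to 4.5 s for this query
```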

55 citations

Posted Content
TL;DR: In this paper, the authors introduce a novel framework for image captioning that can generate diverse descriptions by allowing both grounding and controllability: given a control signal in the form of a sequence or set of image regions, the corresponding caption is generated through a recurrent architecture that predicts textual chunks explicitly grounded on regions, following the constraints of the given control.
Abstract: Current captioning approaches can describe images using black-box architectures whose behavior is hardly controllable and explainable from the exterior. As an image can be described in infinite ways depending on the goal and the context at hand, a higher degree of controllability is needed to apply captioning algorithms in complex scenarios. In this paper, we introduce a novel framework for image captioning which can generate diverse descriptions by allowing both grounding and controllability. Given a control signal in the form of a sequence or set of image regions, we generate the corresponding caption through a recurrent architecture which predicts textual chunks explicitly grounded on regions, following the constraints of the given control. Experiments are conducted on Flickr30k Entities and on COCO Entities, an extended version of COCO in which we add grounding annotations collected in a semi-automatic manner. Results demonstrate that our method achieves state of the art performances on controllable image captioning, in terms of caption quality and diversity. Code and annotations are publicly available at: this https URL.
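As a rough illustration of the decoding scheme the abstract describes, here is a heavily simplified PyTorch sketch (an assumption-laden toy, not the authors' released model): an LSTM decodes one textual chunk per controlled region, so the words generated at each step are conditioned on the region currently in control. With untrained weights it emits arbitrary tokens; it only shows the control flow.

```python
import torch
import torch.nn as nn

class ControlledDecoder(nn.Module):
    def __init__(self, vocab_size, region_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + region_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, regions, max_chunk_len=5, bos_id=1, shift_id=2):
        # regions: (num_regions, region_dim) control signal, in the desired order
        h = torch.zeros(1, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        word = torch.tensor([bos_id])
        caption = []
        for region in regions:                      # one textual chunk per controlled region
            for _ in range(max_chunk_len):
                inp = torch.cat([self.embed(word), region.unsqueeze(0)], dim=-1)
                h, c = self.lstm(inp, (h, c))
                word = self.out(h).argmax(dim=-1)   # greedy decoding for brevity
                if word.item() == shift_id:         # model signals "move on to the next region"
                    break
                caption.append(word.item())
        return caption

decoder = ControlledDecoder(vocab_size=1000)
chunked = decoder(torch.randn(3, 2048))  # 3 control regions -> up to 3 grounded chunks
```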

55 citations

Journal ArticleDOI
TL;DR: Experiments on the image captioning task on the MS-COCO and Flickr30K datasets validate the usefulness of this framework by showing that different given topics lead to different captions describing specific aspects of the given image, and that the quality of the generated captions is higher than that of a control model without a topic as input.
Abstract: We present an image captioning framework that generates captions under a given topic. The topic candidates are extracted from the caption corpus. A given image’s topics are then selected from these candidates by a CNN-based multi-label classifier. The input to the caption generation model is an image-topic pair, and the output is a caption of the image. For this purpose, a cross-modal embedding method is learned for the images, topics, and captions. In the proposed framework, the topic, caption, and image are organized in a hierarchical structure, which is preserved in the embedding space by using the order-embedding method. The caption embedding is upper bounded by the corresponding image embedding and lower bounded by the topic embedding. The lower bound pushes the images and captions about the same topic closer together in the embedding space. A bidirectional caption-image retrieval task is conducted on the learned embedding space and achieves the state-of-the-art performance on the MS-COCO and Flickr30K datasets, demonstrating the effectiveness of the embedding method. To generate a caption for an image, an embedding vector is sampled from the region bounded by the embeddings of the image and the topic, then a language model decodes it to a sentence as the output. The lower bound set by the topic shrinks the output space of the language model, which may help the model to learn to match images and captions better. Experiments on the image captioning task on the MS-COCO and Flickr30K datasets validate the usefulness of this framework by showing that the different given topics can lead to different captions describing specific aspects of the given image and that the quality of generated captions is higher than the control model without a topic as input. In addition, the proposed method is competitive with many state-of-the-art methods in terms of standard evaluation metrics.
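The hierarchical constraint can be written down compactly. The snippet below is a hedged sketch of one coordinate-wise reading of "lower/upper bounded," using an order-embedding violation penalty in the spirit of Vendrov et al.'s order embeddings; the paper's exact loss, margins, and weighting may differ.

```python
# Sketch: penalize violations of topic <= caption <= image (coordinate-wise)
# in a non-negative embedding space, so the caption lies "between" its topic
# and its image.
import torch

def order_violation(lower: torch.Tensor, upper: torch.Tensor) -> torch.Tensor:
    """Penalty that is zero only when lower <= upper in every dimension."""
    return torch.clamp(lower - upper, min=0).pow(2).sum(dim=-1)

def hierarchy_loss(topic_emb, caption_emb, image_emb):
    # topic is the lower bound of the caption; image is its upper bound
    return order_violation(topic_emb, caption_emb) + order_violation(caption_emb, image_emb)

topic, caption, image = (torch.rand(4, 128) for _ in range(3))  # toy non-negative embeddings
loss = hierarchy_loss(topic, caption, image).mean()
```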

55 citations

Book ChapterDOI
30 Nov 2020
TL;DR: This work introduces a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationship between image regions, which achieves new state-of-the-art performance on both MSCOCO offline and online testing benchmarks.
Abstract: Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect of captioning is the notion of attention: how to decide what to describe and in which order. Inspired by the successes in text analysis and translation, previous works have proposed the transformer architecture for image captioning. However, the structure of the semantic units in images (usually the regions detected by an object detection model) differs from that of sentences (individual words). Limited work has been done to adapt the transformer's internal architecture to images. In this work, we introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationship between image regions. Our design widens the original transformer layer's inner architecture to adapt to the structure of images. With only region features as inputs, our model achieves new state-of-the-art performance on both the MSCOCO offline and online testing benchmarks. The code is available at https://github.com/wtliao/ImageTransformer.
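The following is an assumption-level sketch of how relative region geometry can bias region-to-region attention, in the spirit of what the abstract describes. It is not the authors' exact layer (see their repository for that); the geometry featurization and the scalar bias here are illustrative simplifications.

```python
# Sketch: turn pairwise box geometry into an additive attention bias so the
# encoder can exploit relative spatial structure between detected regions.
import torch
import torch.nn.functional as F

def relative_geometry(boxes: torch.Tensor) -> torch.Tensor:
    """boxes: (N, 4) as (x1, y1, x2, y2). Returns (N, N, 4) log-scale relative geometry."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = (boxes[:, 2] - boxes[:, 0]).clamp(min=1e-3)
    h = (boxes[:, 3] - boxes[:, 1]).clamp(min=1e-3)
    dx = torch.log((cx[:, None] - cx[None, :]).abs().clamp(min=1e-3) / w[:, None])
    dy = torch.log((cy[:, None] - cy[None, :]).abs().clamp(min=1e-3) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)

def spatial_attention(q, k, v, boxes, w_g):
    """Single-head attention over N region features with a geometric bias."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5    # (N, N) content-based scores
    geo_bias = relative_geometry(boxes) @ w_g      # (N, N) spatial bias from box geometry
    attn = F.softmax(scores + geo_bias, dim=-1)
    return attn @ v

N, d = 6, 64
feats = torch.randn(N, d)
boxes = torch.rand(N, 4).sort(dim=-1).values       # toy boxes; sorting guarantees x1<=x2, y1<=y2
w_g = torch.randn(4)                               # maps 4-d geometry to a scalar bias
out = spatial_attention(feats, feats, feats, boxes, w_g)  # (N, d) spatially aware region features
```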

55 citations

Journal ArticleDOI
TL;DR: It is shown that both hearing and DHH participants preferred and followed collaborative captions better than those generated by automatic speech recognition (ASR) or professionals due to the more consistent flow of the resulting captions.
Abstract: Real-time captioning enables deaf and hard of hearing (DHH) people to follow classroom lectures and other aural speech by converting it into visual text with less than a five-second delay. Keeping the delay short allows end-users to follow and participate in conversations. This article focuses on the fundamental problem that makes real-time captioning difficult: sequential keyboard typing is much slower than speaking. We first surveyed the audio characteristics of 240 one-hour-long captioned lectures on YouTube, such as speed and duration of speaking bursts. We then analyzed how these characteristics impact caption generation and readability, considering specifically our human-powered collaborative captioning approach. We note that most of these characteristics are also present in more general domains. For our caption comparison evaluation, we transcribed a classroom lecture in real-time using all three captioning approaches. We recruited 48 participants (24 DHH) to watch these classroom transcripts in an eye-tracking laboratory. We presented these captions in a randomized, balanced order. We show that both hearing and DHH participants preferred and followed collaborative captions better than those generated by automatic speech recognition (ASR) or professionals due to the more consistent flow of the resulting captions. These results show the potential to reliably capture speech even during sudden bursts of speed, as well as for generating "enhanced" captions, unlike other human-powered captioning approaches.
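For intuition only: the core idea behind human-powered collaborative captioning is that no single typist can keep up with speech, but several typists each covering part of the audio can, provided their time-stamped fragments are merged back into one stream. The paper's actual merging and quality-control pipeline is far more sophisticated; this naive timestamp merge, with invented example fragments, just shows the basic shape of the problem.

```python
# Naive merge of time-stamped partial captions from three hypothetical typists.
from heapq import merge

typist_a = [(0.0, "Real-time captioning enables"), (6.0, "to follow lectures")]
typist_b = [(3.0, "deaf and hard of hearing people")]
typist_c = [(9.0, "with only a short delay.")]

# Each list is sorted by start time, so heapq.merge interleaves them correctly.
combined = " ".join(text for _, text in merge(typist_a, typist_b, typist_c))
print(combined)
# Real-time captioning enables deaf and hard of hearing people to follow lectures with only a short delay.
```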

55 citations


Network Information
Related Topics (5)

Topic                          Papers    Citations    Relatedness
Feature vector                 48.8K     954.4K       83%
Object detection               46.1K     1.3M         82%
Convolutional neural network   74.7K     2M           82%
Deep learning                  79.8K     2.1M         82%
Unsupervised learning          22.7K     1M           81%
Performance
Metrics
No. of papers in the topic in previous years

Year    Papers
2023    536
2022    1,030
2021    504
2020    530
2019    448
2018    334