Topic

Closed captioning

About: Closed captioning is a research topic. Over the lifetime, 3011 publications have been published within this topic receiving 64494 citations. The topic is also known as: CC.


Papers
Posted Content
TL;DR: Building on the Transformer, the authors approach image captioning from a cross-modal perspective and propose the Global-and-Local Information Exploring-and-Distilling approach, which explores and distills the source information in both vision and language.
Abstract: Recently, attention-based encoder-decoder models have been used extensively in image captioning. Yet current methods still struggle to achieve deep image understanding. In this work, we argue that such understanding requires visual attention to correlated image regions and semantic attention to coherent attributes of interest. Building on the Transformer, to perform effective attention, we explore image captioning from a cross-modal perspective and propose the Global-and-Local Information Exploring-and-Distilling approach, which explores and distills the source information in vision and language. It globally provides the aspect vector, a spatial and relational representation of images based on caption contexts, through the extraction of salient region groupings and attribute collocations, and locally extracts the fine-grained regions and attributes in reference to the aspect vector for word selection. Our Transformer-based model achieves a CIDEr score of 129.3 in offline evaluation on the COCO test set, with a strong balance of accuracy, speed, and parameter budget.

16 citations
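The cross-modal attention at the heart of the approach above can be illustrated with a short sketch. The following is a minimal, hypothetical example of Transformer-style cross-attention from caption tokens (queries) to image-region features (keys/values); it is not the authors' actual architecture, and all dimensions, names, and toy inputs are assumptions for illustration.

```python
# Hypothetical sketch of cross-modal attention between image regions and
# caption tokens, in the spirit of the Transformer-based captioner above.
# NOT the authors' implementation; shapes and names are assumptions.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Queries come from the language side; keys/values from vision."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, word_states, region_feats):
        # word_states:  (batch, n_words, d_model)   decoder hidden states
        # region_feats: (batch, n_regions, d_model) encoded image regions
        attended, weights = self.attn(query=word_states,
                                      key=region_feats,
                                      value=region_feats)
        return attended, weights  # weights: which regions each word attends to

# Toy usage with random tensors standing in for real encoder/decoder states.
regions = torch.randn(2, 36, 512)  # e.g. 36 detected regions per image
words = torch.randn(2, 10, 512)    # 10 caption token states
out, attn_w = CrossModalAttention()(words, regions)
print(out.shape, attn_w.shape)     # (2, 10, 512) and (2, 10, 36)
```

The attention weights make the "visual attention to correlated image regions" claim inspectable: each row shows which regions inform a given word.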

Proceedings ArticleDOI
30 Apr 2020
TL;DR: The paper compares two popular convolutional network architectures, VGG and ResNet, as encoders for the same image captioning model to determine which produces better image representations for caption generation; the results show that the encoder plays a large role and can significantly improve a model without any change to the decoder architecture.
Abstract: Recent models for image captioning are usually based on an encoder-decoder framework. Large pre-trained convolutional neural networks are often used as encoders. However, different authors use different encoder architectures for their image captioning models, which makes it difficult to determine the effect the encoder has on overall model performance. In this paper we compare two popular convolutional network architectures, VGG and ResNet, as encoders for the same image captioning model in order to find out which is better at the image representation used for caption generation. The results show that ResNet outperforms VGG, allowing the image captioning model to achieve a higher BLEU-4 score. Furthermore, ResNet allows the model to reach a score comparable to the VGG-based model in fewer training epochs. Based on these data we can state that the encoder plays a big role and can significantly improve a model without changing the decoder architecture.

16 citations
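Because the comparison above keeps the decoder fixed and varies only the CNN encoder, the experiment boils down to exposing both backbones through a common feature interface. Below is a minimal sketch of such an encoder swap using torchvision's real VGG-16 and ResNet-50 models; the 512-dimensional projection and pooling choices are assumptions, not the paper's exact configuration.

```python
# Sketch of swapping VGG vs. ResNet encoders in front of a fixed decoder.
# The torchvision models are real; the shared 512-d interface is an assumption.
import torch
import torch.nn as nn
from torchvision import models

def build_encoder(name="resnet50", out_dim=512):
    # weights=None keeps the sketch runnable offline; in practice the
    # encoders would be ImageNet-pretrained, as in the paper.
    if name == "resnet50":
        backbone = models.resnet50(weights=None)
        modules = list(backbone.children())[:-1]  # drop the final fc layer
        feat_dim = 2048
    elif name == "vgg16":
        backbone = models.vgg16(weights=None)
        modules = [backbone.features, nn.AdaptiveAvgPool2d(1)]
        feat_dim = 512
    else:
        raise ValueError(f"unknown encoder: {name}")
    # Project both backbones to a shared out_dim so the decoder is unchanged.
    return nn.Sequential(*modules, nn.Flatten(), nn.Linear(feat_dim, out_dim))

# Either encoder now emits identically shaped features for the same decoder.
images = torch.randn(4, 3, 224, 224)
for name in ("vgg16", "resnet50"):
    print(name, build_encoder(name)(images).shape)  # both: (4, 512)
```

Holding the decoder and feature dimensionality constant is what lets BLEU-4 differences be attributed to the encoder alone.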

Book ChapterDOI
23 Aug 2020
TL;DR: This article proposes a new metric, SPICE-U, which introduces a notion of uniqueness over the concepts generated in a caption and correlates better with human judgements than SPICE.
Abstract: Despite considerable progress, state-of-the-art image captioning models produce generic captions, leaving out important image details. Furthermore, these systems may even misrepresent the image in order to produce a simpler caption consisting of common concepts. In this paper, we first analyze both modern captioning systems and evaluation metrics through empirical experiments to quantify these phenomena. We find that modern captioning systems return higher likelihoods for incorrect distractor sentences compared to ground truth captions, and that evaluation metrics like SPICE can be 'topped' using simple captioning systems relying on object detectors. Inspired by these observations, we design a new metric (SPICE-U) by introducing a notion of uniqueness over the concepts generated in a caption. We show that SPICE-U is better correlated with human judgements compared to SPICE, and effectively captures notions of diversity and descriptiveness. Finally, we also demonstrate a general technique to improve any existing captioning model: using mutual information as a re-ranking objective during decoding. Empirically, this results in more unique and informative captions, and improves three different state-of-the-art models on SPICE-U as well as on the average score over existing metrics (Code is available at https://github.com/princetonvisualai/SPICE-U).

16 citations
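The re-ranking technique mentioned at the end of the abstract can be sketched concretely. The standard mutual-information (PMI) formulation scores a candidate caption c as log p(c | image) − λ · log p(c), so generic captions with a high unconditional prior are penalized. The helper names, λ value, and toy probabilities below are assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of mutual-information re-ranking over beam-search
# candidates. The scoring rule score(c) = log p(c|image) - lam * log p(c)
# is the standard PMI form; lam and the stand-in probability functions
# are illustrative assumptions.
def rerank_by_mutual_information(candidates, log_p_cond, log_p_prior, lam=0.5):
    """Re-rank candidate captions by approximate mutual information.

    candidates:     list of caption strings from the base decoder
    log_p_cond(c):  log p(caption | image) from the captioning model
    log_p_prior(c): log p(caption) from an unconditional language model

    Subtracting the prior penalizes generic captions that fit any image,
    favoring captions specific to this one.
    """
    scored = [(log_p_cond(c) - lam * log_p_prior(c), c) for c in candidates]
    scored.sort(reverse=True)
    return [c for _, c in scored]

# Toy example: the generic caption has a high prior, so it drops in rank.
cond = {"a man on a horse": -4.0,
        "a man riding a brown horse on a beach": -6.0}
prior = {"a man on a horse": -3.0,
         "a man riding a brown horse on a beach": -9.0}
print(rerank_by_mutual_information(list(cond), cond.get, prior.get)[0])
# -> "a man riding a brown horse on a beach"
```

Here the generic caption scores −4.0 − 0.5·(−3.0) = −2.5 while the specific one scores −6.0 − 0.5·(−9.0) = −1.5, so the more descriptive caption wins despite its lower raw likelihood.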

18 Apr 1996
TL;DR: This paper examines 22 empirical computer-assisted language learning (CALL) studies published between 1989 and 1994, and 13 reviews and syntheses published between 1987 and 1992, pertaining to CALL in higher education in the United States, and provides three general conclusions.
Abstract: This paper examines 22 empirical computer-assisted language learning (CALL) studies published between 1989 and 1994, and 13 reviews and syntheses published between 1987 and 1992, pertaining to CALL in higher education in the United States. A "three streams" framework helps to place CALL in a larger context and illustrate its several dimensions. Any specific CALL program involves decisions in relation to developments in at least three fields: educational psychology; linguistics; and computer technology. These three fields may be conceptualized as streams, where each stream flows more or less independently of the others, but where the practice of CALL at any given time requires making a passage across all three. An interpretive summary of five major findings from the review of the empirical CALL studies is offered: (1) captioning video segments can dramatically boost student comprehension; (2) CALL can connect students with other people inside and outside of the classroom, promoting natural and spontaneous communication in the target language; (3) the type of CALL feedback provided to students can play a central role in learning; (4) student attitudes toward CALL are not consistently linked to student achievement using CALL; and (5) CALL can substantially improve achievement as compared with traditional instruction. This paper also provides three general conclusions, each accompanied by recommendations for future CALL practice and research. Appendices include the material search procedure; captioning information; supplementary findings from the empirical studies; individual summaries of empirical studies; and individual summaries of CALL and Computer-Assisted Instruction (CAI) reviews. (Contains 43 references.)

16 citations


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations, 83% related
Object detection: 46.1K papers, 1.3M citations, 82% related
Convolutional neural network: 74.7K papers, 2M citations, 82% related
Deep learning: 79.8K papers, 2.1M citations, 82% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Performance
Metrics
No. of papers in the topic in previous years
Year: Papers
2023: 536
2022: 1,030
2021: 504
2020: 530
2019: 448
2018: 334