Topic

Closed captioning

About: Closed captioning is a research topic. Over its lifetime, 3,011 publications have been published within this topic, receiving 64,494 citations. The topic is also known as: CC.


Papers
Journal ArticleDOI
TL;DR: A new boosted and parallel architecture using Long Short-Term Memory (LSTM) networks is proposed for video captioning that considerably improves the accuracy of the generated sentences.

29 citations
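
The TL;DR above only names the approach, so as a generic illustration (not the paper's boosted, parallel architecture) the following PyTorch sketch shows the common pattern LSTM-based video captioners share: per-frame features are summarized by an encoder LSTM and a decoder LSTM emits word logits. The class name, dimensions, and single-stream design are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Generic encoder-decoder video captioner (illustrative sketch only)."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)   # summarizes frame features
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # generates the caption
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim); captions: (batch, seq_len) token ids
        _, state = self.encoder(frame_feats)      # (h, c) summarizing the whole clip
        emb = self.embed(captions)
        dec_out, _ = self.decoder(emb, state)     # decode conditioned on the video state
        return self.out(dec_out)                  # (batch, seq_len, vocab_size) word logits

# Usage with random tensors
model = VideoCaptioner()
feats = torch.randn(2, 16, 2048)          # 2 clips, 16 frame features each
caps = torch.randint(0, 10000, (2, 12))   # 2 captions, 12 tokens each
logits = model(feats, caps)               # torch.Size([2, 12, 10000])
```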

Proceedings ArticleDOI
18 Jul 2021
TL;DR: In this paper, an attentive contextual network (ACN) is proposed to learn the spatially transformed image features and dense multi-scale contextual information of an image to generate semantically meaningful captions.
Abstract: Existing image captioning approaches fail to generate fine-grained captions due to the lack of a rich encoding representation of the image. In this paper, we present an attentive contextual network (ACN) that learns spatially transformed image features and dense multi-scale contextual information of an image to generate semantically meaningful captions. First, we construct a deformable network on intermediate layers of a convolutional neural network (CNN) to cultivate spatially invariant features, and multi-scale contextual features are produced by employing a contextual network on top of the last layers of the CNN. Then, we apply an attention mechanism to the contextual network to extract dense contextual features. The extracted spatial and contextual features are combined to encode a holistic representation of the image. Finally, a multi-stage caption decoder with a visual attention module is incorporated to generate fine-grained captions. The performance of the proposed approach is demonstrated on the COCO dataset, the largest dataset for image captioning.

29 citations
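
As a loose interpretation of the encoding step described in the entry above, and not the authors' implementation, the sketch below uses parallel dilated convolutions as a stand-in for the multi-scale contextual network and a simple soft-attention map as a stand-in for the attention mechanism; `ContextualEncoder` and all channel sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualEncoder(nn.Module):
    """Illustrative fusion of multi-scale context and spatial features (not the paper's ACN)."""

    def __init__(self, in_ch=2048, ctx_ch=512):
        super().__init__()
        # Parallel dilated convolutions as an assumed stand-in for the multi-scale contextual network.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, ctx_ch, kernel_size=3, padding=d, dilation=d) for d in (1, 2, 4)]
        )
        self.attn = nn.Conv2d(ctx_ch, 1, kernel_size=1)          # spatial attention scores
        self.spatial_proj = nn.Conv2d(in_ch, ctx_ch, kernel_size=1)

    def forward(self, feat_map):
        # feat_map: (batch, in_ch, H, W) from a CNN backbone
        ctx = sum(F.relu(b(feat_map)) for b in self.branches)            # dense multi-scale context
        scores = self.attn(ctx)                                          # (batch, 1, H, W)
        weights = torch.softmax(scores.flatten(2), dim=-1).view_as(scores)
        attended = ctx * weights                                         # attention-weighted context
        spatial = F.relu(self.spatial_proj(feat_map))                    # projected spatial features
        return torch.cat([spatial, attended], dim=1)                     # holistic encoding for a decoder

# Usage
enc = ContextualEncoder()
holistic = enc(torch.randn(2, 2048, 14, 14))   # torch.Size([2, 1024, 14, 14])
```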

Proceedings ArticleDOI
01 Nov 2020
TL;DR: The authors propose a technique named vokenization that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images; the vokenizer is trained on relatively small image captioning datasets and then applied to generate vokens for large language corpora.
Abstract: Humans learn language by listening, speaking, writing, reading, and also, via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision while we explore the idea of a visually-supervised language model in this paper. We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between the visually-grounded language datasets and pure-language corpora. Therefore, we develop a technique named “vokenization” that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which we call “vokens”). The “vokenizer” is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora. Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks such as GLUE, SQuAD, and SWAG.

29 citations
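
A minimal sketch of the retrieval step at the heart of vokenization, under the assumption that trained text and image encoders already produce embeddings in a shared space (which is what the vokenizer training would provide); `assign_vokens` is a hypothetical helper, not the paper's code.

```python
import torch
import torch.nn.functional as F

def assign_vokens(token_embs, image_embs):
    """Assign each contextual token embedding to its nearest image ("voken").

    token_embs: (seq_len, dim) contextual embeddings of one sentence
    image_embs: (n_images, dim) embeddings of a candidate image set
    Returns the index of the retrieved image for every token.
    """
    tok = F.normalize(token_embs, dim=-1)
    img = F.normalize(image_embs, dim=-1)
    sim = tok @ img.t()            # cosine similarity, (seq_len, n_images)
    return sim.argmax(dim=-1)      # voken id per token

# Usage with random embeddings (a real vokenizer would use trained text/image encoders)
tokens = torch.randn(12, 256)
images = torch.randn(1000, 256)
vokens = assign_vokens(tokens, images)   # (12,) one image index per token
```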

Journal ArticleDOI
TL;DR: A scene graph auto-encoder (SGAE) is proposed that incorporates the language inductive bias into the encoder-decoder image captioning framework for more human-like captions, using a learned dictionary to reconstruct sentences in the language domain.
Abstract: We propose Scene Graph Auto-Encoder (SGAE) that incorporates the language inductive bias into the encoder-decoder image captioning framework for more human-like captions. Intuitively, we humans use the inductive bias to compose collocations and contextual inferences in discourse. For example, when we see the relation "a person on a bike", it is natural to replace "on" with "ride" and infer "a person riding a bike on a road" even when the "road" is not evident. Therefore, exploiting such bias as a language prior is expected to help the conventional encoder-decoder models reason as we humans and generate more descriptive captions. Specifically, we use the scene graph, a directed graph (G) where an object node is connected by adjective nodes and relationship nodes, to represent the complex structural layout of both image (I) and sentence (S). In the language domain, we use SGAE to learn a dictionary set (D) that helps reconstruct sentences in the S → G → D → S auto-encoding pipeline, where D encodes the desired language prior and the decoder learns to caption from such a prior; in the vision-language domain, we share D in the I → G → D → S pipeline and distill the knowledge of the language decoder of the auto-encoder to that of the encoder-decoder based image captioner to transfer the language inductive bias. In this way, the shared D provides hidden embeddings about descriptive collocations to the encoder-decoder and the distillation strategy teaches the encoder-decoder to transform these embeddings to human-like captions as the auto-encoder. Thanks to the scene graph representation, the shared dictionary set, and the Knowledge Distillation strategy, the inductive bias is transferred across domains in principle. We validate the effectiveness of SGAE on the challenging MS-COCO image captioning benchmark, where our SGAE-based single model achieves a new state-of-the-art 129.6 CIDEr-D on the Karpathy split, and a competitive 126.6 CIDEr-D (c40) on the official server, which is even comparable to other ensemble models. Furthermore, we validate the transferability of SGAE on two more challenging settings: transferring inductive bias from other language corpora and unpaired image captioning. Once again, the results of both settings confirm the superiority of SGAE.

29 citations
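
To make the dictionary re-encoding step in the entry above concrete, here is a hedged sketch in which scene-graph node features attend over a learned memory `D` and are re-expressed as combinations of its entries. The module name, dictionary size, and dimensions are assumptions, and the full S → G → D → S / I → G → D → S pipelines and the knowledge-distillation step are omitted.

```python
import torch
import torch.nn as nn

class DictionaryReencoder(nn.Module):
    """Stand-in for a shared dictionary D: re-encode node features via attention over a learned memory."""

    def __init__(self, dim=512, n_entries=1000):
        super().__init__()
        self.D = nn.Parameter(torch.randn(n_entries, dim) * 0.02)   # learned dictionary entries

    def forward(self, node_feats):
        # node_feats: (n_nodes, dim) embeddings of scene-graph nodes (objects, attributes, relations)
        attn = torch.softmax(node_feats @ self.D.t(), dim=-1)   # attention over dictionary entries
        return attn @ self.D                                    # features re-expressed through the learned prior

# Usage: re-encode five graph-node features through the dictionary
reenc = DictionaryReencoder()
recoded = reenc(torch.randn(5, 512))   # torch.Size([5, 512])
```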

Posted Content
TL;DR: This work investigates gender bias in the COCO captioning dataset and shows that it arises not only from the statistical distribution of genders with contexts but also from flawed annotation by the human annotators.
Abstract: The task of image captioning implicitly involves gender identification. However, due to the gender bias in data, gender identification by an image captioning model suffers. Also, the gender-activity bias, owing to the word-by-word prediction, influences other words in the caption prediction, resulting in the well-known problem of label bias. In this work, we investigate gender bias in the COCO captioning dataset and show that it arises not only from the statistical distribution of genders with contexts but also from the flawed annotation by the human annotators. We look at the issues created by this bias in the trained models. We propose a technique to get rid of the bias by splitting the task into two subtasks: gender-neutral image captioning and gender classification. By this decoupling, the gender-context influence can be eradicated. We train the gender-neutral image captioning model, which gives comparable results to a gendered model even when evaluating against a dataset that possesses a similar bias as the training data. Interestingly, the predictions by this model on images with no humans are also visibly different from the one trained on gendered captions. We train gender classifiers using the available bounding box and mask-based annotations for the person in the image. This allows us to get rid of the context and focus on the person to predict the gender. By substituting the genders into the gender-neutral captions, we get the final gendered predictions. Our predictions achieve similar performance to a model trained with gender, and at the same time are devoid of gender bias. Finally, our main result is that on an anti-stereotypical dataset, our model outperforms a popular image captioning model which is trained with gender.

28 citations
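
The final substitution step described in the abstract above can be illustrated with a small, hypothetical post-processing function that swaps the classifier's predicted gender word into a gender-neutral caption. The placeholder word "person", the label set, and the function name are assumptions for illustration, not the paper's interface.

```python
import re

# Hypothetical label-to-word map; the paper's actual vocabulary may differ.
GENDER_WORDS = {"male": "man", "female": "woman"}

def engender_caption(neutral_caption: str, predicted_gender: str) -> str:
    """Substitute the classifier's prediction for the neutral word 'person'."""
    word = GENDER_WORDS.get(predicted_gender, "person")
    return re.sub(r"\bperson\b", word, neutral_caption)

# Usage: the neutral caption and the gender label would come from the two decoupled sub-models.
print(engender_caption("a person riding a bike on a road", "female"))
# -> "a woman riding a bike on a road"
```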


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations (83% related)
Object detection: 46.1K papers, 1.3M citations (82% related)
Convolutional neural network: 74.7K papers, 2M citations (82% related)
Deep learning: 79.8K papers, 2.1M citations (82% related)
Unsupervised learning: 22.7K papers, 1M citations (81% related)
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    536
2022    1,030
2021    504
2020    530
2019    448
2018    334