Topic

Closed captioning

About: Closed captioning is a research topic. Over its lifetime, 3,011 publications have been published within this topic, receiving 64,494 citations. The topic is also known as: CC.


Papers
Journal ArticleDOI
TL;DR: A new boosted and parallel architecture using Long Short-Term Memory (LSTM) networks is proposed for video captioning that considerably improves the accuracy of the generated sentences.

29 citations
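
The TL;DR above only names the approach, so as a generic illustration (not the paper's boosted, parallel architecture) the following PyTorch sketch shows the common pattern LSTM-based video captioners share: per-frame features are summarized by an encoder LSTM and a decoder LSTM emits word logits. The class name, dimensions, and single-stream design are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Generic encoder-decoder video captioner (illustrative sketch only)."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)   # summarizes frame features
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # generates the caption
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim); captions: (batch, seq_len) token ids
        _, state = self.encoder(frame_feats)      # (h, c) summarizing the whole clip
        emb = self.embed(captions)
        dec_out, _ = self.decoder(emb, state)     # decode conditioned on the video state
        return self.out(dec_out)                  # (batch, seq_len, vocab_size) word logits

# Usage with random tensors
model = VideoCaptioner()
feats = torch.randn(2, 16, 2048)          # 2 clips, 16 frame features each
caps = torch.randint(0, 10000, (2, 12))   # 2 captions, 12 tokens each
logits = model(feats, caps)               # torch.Size([2, 12, 10000])
```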

Proceedings ArticleDOI
18 Jul 2021
TL;DR: In this paper, an attentive contextual network (ACN) is proposed to learn the spatially transformed image features and dense multi-scale contextual information of an image to generate semantically meaningful captions.
Abstract: Existing image captioning approaches fail to generate fine-grained captions due to the lack of a rich encoding representation of the image. In this paper, we present an attentive contextual network (ACN) that learns spatially transformed image features and dense multi-scale contextual information of an image to generate semantically meaningful captions. First, we construct a deformable network on intermediate layers of a convolutional neural network (CNN) to cultivate spatially invariant features, and multi-scale contextual features are produced by employing a contextual network on top of the last layers of the CNN. Then, we apply an attention mechanism to the contextual network to extract dense contextual features. The extracted spatial and contextual features are combined to encode a holistic representation of the image. Finally, a multi-stage caption decoder with a visual attention module is incorporated to generate fine-grained captions. The performance of the proposed approach is demonstrated on the COCO dataset, the largest dataset for image captioning.

29 citations
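
As a loose interpretation of the encoding step described in the entry above, and not the authors' implementation, the sketch below uses parallel dilated convolutions as a stand-in for the multi-scale contextual network and a simple soft-attention map as a stand-in for the attention mechanism; `ContextualEncoder` and all channel sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualEncoder(nn.Module):
    """Illustrative fusion of multi-scale context and spatial features (not the paper's ACN)."""

    def __init__(self, in_ch=2048, ctx_ch=512):
        super().__init__()
        # Parallel dilated convolutions as an assumed stand-in for the multi-scale contextual network.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, ctx_ch, kernel_size=3, padding=d, dilation=d) for d in (1, 2, 4)]
        )
        self.attn = nn.Conv2d(ctx_ch, 1, kernel_size=1)          # spatial attention scores
        self.spatial_proj = nn.Conv2d(in_ch, ctx_ch, kernel_size=1)

    def forward(self, feat_map):
        # feat_map: (batch, in_ch, H, W) from a CNN backbone
        ctx = sum(F.relu(b(feat_map)) for b in self.branches)            # dense multi-scale context
        scores = self.attn(ctx)                                          # (batch, 1, H, W)
        weights = torch.softmax(scores.flatten(2), dim=-1).view_as(scores)
        attended = ctx * weights                                         # attention-weighted context
        spatial = F.relu(self.spatial_proj(feat_map))                    # projected spatial features
        return torch.cat([spatial, attended], dim=1)                     # holistic encoding for a decoder

# Usage
enc = ContextualEncoder()
holistic = enc(torch.randn(2, 2048, 14, 14))   # torch.Size([2, 1024, 14, 14])
```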

Proceedings ArticleDOI
01 Nov 2020
TL;DR: The authors propose a technique named vokenization that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images; the vokenizer is trained on relatively small image captioning datasets and then applied to generate vokens for large language corpora.
Abstract: Humans learn language by listening, speaking, writing, reading, and also, via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision while we explore the idea of a visually-supervised language model in this paper. We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between the visually-grounded language datasets and pure-language corpora. Therefore, we develop a technique named “vokenization” that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which we call “vokens”). The “vokenizer” is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora. Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks such as GLUE, SQuAD, and SWAG.

29 citations
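
A minimal sketch of the retrieval step at the heart of vokenization, under the assumption that trained text and image encoders already produce embeddings in a shared space (which is what the vokenizer training would provide); `assign_vokens` is a hypothetical helper, not the paper's code.

```python
import torch
import torch.nn.functional as F

def assign_vokens(token_embs, image_embs):
    """Assign each contextual token embedding to its nearest image ("voken").

    token_embs: (seq_len, dim) contextual embeddings of one sentence
    image_embs: (n_images, dim) embeddings of a candidate image set
    Returns the index of the retrieved image for every token.
    """
    tok = F.normalize(token_embs, dim=-1)
    img = F.normalize(image_embs, dim=-1)
    sim = tok @ img.t()            # cosine similarity, (seq_len, n_images)
    return sim.argmax(dim=-1)      # voken id per token

# Usage with random embeddings (a real vokenizer would use trained text/image encoders)
tokens = torch.randn(12, 256)
images = torch.randn(1000, 256)
vokens = assign_vokens(tokens, images)   # (12,) one image index per token
```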

Journal ArticleDOI
TL;DR: A scene graph auto-encoder (SGAE) is proposed that incorporates the language inductive bias into the encoder-decoder image captioning framework for more human-like captions, using a learned dictionary to reconstruct sentences in the language domain.
Abstract: We propose Scene Graph Auto-Encoder (SGAE) that incorporates the language inductive bias into the encoder-decoder image captioning framework for more human-like captions. Intuitively, we humans use the inductive bias to compose collocations and contextual inferences in discourse. For example, when we see the relation "a person on a bike", it is natural to replace "on" with "ride" and infer "a person riding a bike on a road" even when the "road" is not evident. Therefore, exploiting such bias as a language prior is expected to help the conventional encoder-decoder models reason as we humans and generate more descriptive captions. Specifically, we use the scene graph, a directed graph (G) where an object node is connected by adjective nodes and relationship nodes, to represent the complex structural layout of both image (I) and sentence (S). In the language domain, we use SGAE to learn a dictionary set (D) that helps reconstruct sentences in the S → G → D → S auto-encoding pipeline, where D encodes the desired language prior and the decoder learns to caption from such a prior; in the vision-language domain, we share D in the I → G → D → S pipeline and distill the knowledge of the language decoder of the auto-encoder to that of the encoder-decoder based image captioner to transfer the language inductive bias. In this way, the shared D provides hidden embeddings about descriptive collocations to the encoder-decoder and the distillation strategy teaches the encoder-decoder to transform these embeddings to human-like captions as the auto-encoder. Thanks to the scene graph representation, the shared dictionary set, and the Knowledge Distillation strategy, the inductive bias is transferred across domains in principle. We validate the effectiveness of SGAE on the challenging MS-COCO image captioning benchmark, where our SGAE-based single model achieves a new state-of-the-art 129.6 CIDEr-D on the Karpathy split, and a competitive 126.6 CIDEr-D (c40) on the official server, which is even comparable to other ensemble models. Furthermore, we validate the transferability of SGAE on two more challenging settings: transferring inductive bias from other language corpora and unpaired image captioning. Once again, the results of both settings confirm the superiority of SGAE.

29 citations
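
To make the dictionary re-encoding step in the entry above concrete, here is a hedged sketch in which scene-graph node features attend over a learned memory `D` and are re-expressed as combinations of its entries. The module name, dictionary size, and dimensions are assumptions, and the full S → G → D → S / I → G → D → S pipelines and the knowledge-distillation step are omitted.

```python
import torch
import torch.nn as nn

class DictionaryReencoder(nn.Module):
    """Stand-in for a shared dictionary D: re-encode node features via attention over a learned memory."""

    def __init__(self, dim=512, n_entries=1000):
        super().__init__()
        self.D = nn.Parameter(torch.randn(n_entries, dim) * 0.02)   # learned dictionary entries

    def forward(self, node_feats):
        # node_feats: (n_nodes, dim) embeddings of scene-graph nodes (objects, attributes, relations)
        attn = torch.softmax(node_feats @ self.D.t(), dim=-1)   # attention over dictionary entries
        return attn @ self.D                                    # features re-expressed through the learned prior

# Usage: re-encode five graph-node features through the dictionary
reenc = DictionaryReencoder()
recoded = reenc(torch.randn(5, 512))   # torch.Size([5, 512])
```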

Posted Content
TL;DR: This work investigates gender bias in the COCO captioning dataset and shows that it arises not only from the statistical distribution of genders with contexts but also from flawed annotation by the human annotators.
Abstract: The task of image captioning implicitly involves gender identification. However, due to the gender bias in data, gender identification by an image captioning model suffers. Also, the gender-activity bias, owing to the word-by-word prediction, influences other words in the caption prediction, resulting in the well-known problem of label bias. In this work, we investigate gender bias in the COCO captioning dataset and show that it arises not only from the statistical distribution of genders with contexts but also from the flawed annotation by the human annotators. We look at the issues created by this bias in the trained models. We propose a technique to get rid of the bias by splitting the task into two subtasks: gender-neutral image captioning and gender classification. By this decoupling, the gender-context influence can be eradicated. We train the gender-neutral image captioning model, which gives comparable results to a gendered model even when evaluating against a dataset that possesses a similar bias as the training data. Interestingly, the predictions by this model on images with no humans are also visibly different from the one trained on gendered captions. We train gender classifiers using the available bounding box and mask-based annotations for the person in the image. This allows us to get rid of the context and focus on the person to predict the gender. By substituting the genders into the gender-neutral captions, we get the final gendered predictions. Our predictions achieve similar performance to a model trained with gender, and at the same time are devoid of gender bias. Finally, our main result is that on an anti-stereotypical dataset, our model outperforms a popular image captioning model which is trained with gender.

28 citations
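
The final substitution step described in the abstract above can be illustrated with a small, hypothetical post-processing function that swaps the classifier's predicted gender word into a gender-neutral caption. The placeholder word "person", the label set, and the function name are assumptions for illustration, not the paper's interface.

```python
import re

# Hypothetical label-to-word map; the paper's actual vocabulary may differ.
GENDER_WORDS = {"male": "man", "female": "woman"}

def engender_caption(neutral_caption: str, predicted_gender: str) -> str:
    """Substitute the classifier's prediction for the neutral word 'person'."""
    word = GENDER_WORDS.get(predicted_gender, "person")
    return re.sub(r"\bperson\b", word, neutral_caption)

# Usage: the neutral caption and the gender label would come from the two decoupled sub-models.
print(engender_caption("a person riding a bike on a road", "female"))
# -> "a woman riding a bike on a road"
```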


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations (83% related)
Object detection: 46.1K papers, 1.3M citations (82% related)
Convolutional neural network: 74.7K papers, 2M citations (82% related)
Deep learning: 79.8K papers, 2.1M citations (82% related)
Unsupervised learning: 22.7K papers, 1M citations (81% related)
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    536
2022    1,030
2021    504
2020    530
2019    448
2018    334