Open Access Proceedings ArticleDOI

Dense-Captioning Events in Videos

TLDR
In this article, the authors introduce the task of dense-captioning events and a model that identifies all events in a single pass of the video while simultaneously describing the detected events with natural language, together with ActivityNet Captions, a large-scale benchmark for the task.
Abstract
Most natural videos contain numerous events. For example, in a video of a “man playing a piano”, the video might also contain “another man dancing” or “a crowd clapping”. We introduce the task of dense-captioning events, which involves both detecting and describing events in a video. We propose a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language. Our model introduces a variant of an existing proposal module that is designed to capture both short as well as long events that span minutes. To capture the dependencies between the events in a video, our model introduces a new captioning module that uses contextual information from past and future events to jointly describe all events. We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events. ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time. Finally, we report performances of our model for dense-captioning events, video retrieval and localization.
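
The pipeline described above couples an event proposal module with a context-aware captioning module. Below is a minimal, hedged sketch of that idea in PyTorch; the feature dimension, strides, vocabulary size, and the GRU/LSTM choices are illustrative assumptions rather than the authors' implementation. Proposals are scored at several temporal strides so both short and long events can be covered, and each event caption is conditioned on mean-pooled features of past and future segments as context.

    import torch
    import torch.nn as nn

    # Hedged sketch (not the authors' code): a multi-scale temporal proposal scorer
    # plus a captioning LSTM conditioned on past/future context. All sizes, the
    # strides, and the GRU/LSTM choices are illustrative assumptions.

    class ProposalScorer(nn.Module):
        def __init__(self, feat_dim=500, hidden=256, strides=(1, 2, 4, 8)):
            super().__init__()
            self.strides = strides                           # longer strides cover longer events
            self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
            self.score = nn.Linear(hidden, 1)

        def forward(self, feats):                            # feats: (B, T, feat_dim) clip features
            proposals = []
            for s in self.strides:
                h, _ = self.rnn(feats[:, ::s, :])            # subsample to change temporal scale
                proposals.append(torch.sigmoid(self.score(h)))   # (B, T//s, 1) proposal scores
            return proposals

    class ContextCaptioner(nn.Module):
        def __init__(self, feat_dim=500, hidden=512, vocab=10000):
            super().__init__()
            self.embed = nn.Embedding(vocab, hidden)
            self.lstm = nn.LSTMCell(hidden + 3 * feat_dim, hidden)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, event, past_ctx, future_ctx, tokens):
            # event, past_ctx, future_ctx: (B, feat_dim); tokens: (B, L) ground-truth word ids
            ctx = torch.cat([event, past_ctx, future_ctx], dim=-1)
            h = c = event.new_zeros(event.size(0), self.lstm.hidden_size)
            logits = []
            for t in range(tokens.size(1)):
                inp = torch.cat([self.embed(tokens[:, t]), ctx], dim=-1)
                h, c = self.lstm(inp, (h, c))
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)                # (B, L, vocab) word logits

    # Toy forward pass on random features: score proposals, then caption one event
    feats = torch.randn(2, 64, 500)
    scores = ProposalScorer()(feats)
    logits = ContextCaptioner()(feats.mean(1), feats[:, :32].mean(1), feats[:, 32:].mean(1),
                                torch.randint(0, 10000, (2, 12)))

In the full model the context would come from the representations of the other detected proposals rather than simple mean pooling; the sketch only shows where that context enters the decoder.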


Citations
Proceedings ArticleDOI

The “Something Something” Video Database for Learning and Evaluating Visual Common Sense

TL;DR: This work describes the ongoing collection of the “something-something” database of video prediction tasks whose solutions require a common sense understanding of the depicted situation, and describes the challenges in crowd-sourcing this data at scale.
Posted Content

VideoBERT: A Joint Model for Video and Language Representation Learning.

TL;DR: In this article, a joint visual-linguistic model is proposed to learn high-level features without any explicit supervision, inspired by the recent success of BERT in language modeling; it outperforms the state of the art on video captioning, and quantitative results verify that the model learns high-level semantic features.
Proceedings ArticleDOI

VideoBERT: A Joint Model for Video and Language Representation Learning

TL;DR: This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
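
As a rough illustration of the mechanism summarized above (discrete visual tokens from vector quantization concatenated with text tokens inside one BERT-style encoder), here is a hedged sketch; the vocabulary sizes, feature dimension, centroid source, and the masked-word head are assumptions for illustration, not the released VideoBERT code.

    import torch
    import torch.nn as nn

    # Hedged sketch, not the released VideoBERT code: clip features are vector-quantized
    # against k-means centroids to get discrete visual tokens, which are concatenated
    # with text tokens and fed to a small Transformer encoder with a masked-word
    # prediction head. All sizes and the source of the centroids are assumptions.

    def quantize(clip_feats, centroids):
        # clip_feats: (N, D) clip features; centroids: (K, D) k-means centers
        dists = torch.cdist(clip_feats, centroids)           # (N, K) pairwise distances
        return dists.argmin(dim=1)                           # nearest centroid index = visual token

    class JointEncoder(nn.Module):
        def __init__(self, text_vocab=30000, visual_vocab=20000, dim=256):
            super().__init__()
            self.text_emb = nn.Embedding(text_vocab, dim)    # separate tables, shared encoder
            self.vis_emb = nn.Embedding(visual_vocab, dim)
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.mlm_head = nn.Linear(dim, text_vocab)       # predicts masked text tokens

        def forward(self, text_ids, visual_ids):
            x = torch.cat([self.text_emb(text_ids), self.vis_emb(visual_ids)], dim=1)
            return self.mlm_head(self.encoder(x))

    # Toy usage with random features and centroids
    centroids = torch.randn(20000, 1024)
    visual_ids = quantize(torch.randn(8, 1024), centroids).unsqueeze(0)   # (1, 8) visual tokens
    text_ids = torch.randint(0, 30000, (1, 16))
    logits = JointEncoder()(text_ids, visual_ids)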
Proceedings ArticleDOI

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

TL;DR: In this paper, the authors introduce the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and present SWAG, a new dataset with 113k multiple choice questions about a rich spectrum of grounded situations.
Proceedings ArticleDOI

TVQA: Localized, Compositional Video Question Answering

TL;DR: This paper presents TVQA, a large-scale video QA dataset based on 6 popular TV shows, and provides analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task.
References
Proceedings Article

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

TL;DR: An attention-based model that automatically learns to describe the content of images is introduced; it can be trained deterministically using standard backpropagation techniques or stochastically by maximizing a variational lower bound.
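
The deterministic ("soft") variant mentioned above is straightforward to sketch: the attention weights over spatial feature locations come from a softmax, so the attended context vector is differentiable and the model trains with ordinary backpropagation. The following is an illustrative sketch only; the dimensions and the single-layer attention scorer are assumptions.

    import torch
    import torch.nn as nn

    # Illustrative sketch only (dimensions and the one-layer scorer are assumptions):
    # at each step the decoder scores every spatial location of the image feature map,
    # takes the softmax-weighted average as a context vector, and feeds it to an LSTM
    # together with the previous word.

    class SoftAttentionDecoder(nn.Module):
        def __init__(self, feat_dim=512, hidden=512, vocab=10000):
            super().__init__()
            self.embed = nn.Embedding(vocab, hidden)
            self.att = nn.Linear(feat_dim + hidden, 1)       # scores one spatial location
            self.lstm = nn.LSTMCell(feat_dim + hidden, hidden)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, feats, tokens):
            # feats: (B, L, feat_dim) spatial features; tokens: (B, T) word ids
            B, L, _ = feats.shape
            h = c = feats.new_zeros(B, self.lstm.hidden_size)
            logits = []
            for t in range(tokens.size(1)):
                scores = self.att(torch.cat([feats, h.unsqueeze(1).expand(B, L, -1)], dim=-1))
                alpha = torch.softmax(scores, dim=1)         # (B, L, 1) attention weights
                context = (alpha * feats).sum(dim=1)         # (B, feat_dim) attended feature
                h, c = self.lstm(torch.cat([self.embed(tokens[:, t]), context], dim=-1), (h, c))
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)                # (B, T, vocab) word logits

    # Toy forward pass: 14x14 = 196 spatial locations, a 12-word caption
    logits = SoftAttentionDecoder()(torch.randn(2, 196, 512), torch.randint(0, 10000, (2, 12)))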
Proceedings Article

Recurrent neural network based language model

TL;DR: Results indicate that it is possible to obtain around a 50% reduction in perplexity by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model.
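
A toy example of why a mixture of language models can reduce perplexity: interpolating the per-word probabilities of several models smooths out each model's worst predictions. The numbers below are made up for illustration and are unrelated to the paper's experiments.

    import torch

    # Toy numbers, unrelated to the paper's experiments: interpolating the per-word
    # probabilities of several language models smooths out each model's worst
    # predictions, which is why the mixture's perplexity can beat every member.

    def perplexity(word_probs):
        # word_probs: probabilities a model assigns to the words of a held-out text
        return torch.exp(-torch.log(word_probs).mean()).item()

    # pretend per-word probabilities from three RNN LMs on the same 5-word text
    lm_probs = torch.tensor([[0.02, 0.10, 0.05, 0.01, 0.20],
                             [0.08, 0.03, 0.12, 0.02, 0.15],
                             [0.05, 0.07, 0.04, 0.06, 0.10]])
    mixture = lm_probs.mean(dim=0)                   # equal-weight linear interpolation
    for i, p in enumerate(lm_probs):
        print(f"LM {i}: perplexity = {perplexity(p):.1f}")
    print(f"Mixture: perplexity = {perplexity(mixture):.1f}")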
Proceedings ArticleDOI

Large-Scale Video Classification with Convolutional Neural Networks

TL;DR: This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggests a multiresolution, foveated architecture as a promising way of speeding up training.
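
A hedged sketch of the two ideas in this summary: early temporal fusion (stacking several frames along the channel axis so the first convolution sees local motion) and a foveated two-stream input (a downsampled whole frame plus a full-resolution center crop, fused late). The layer sizes and the Sports-1M-style class count are illustrative assumptions.

    import torch
    import torch.nn as nn

    # Hedged sketch; layer sizes and the class count are assumptions, not the paper's network.

    class FoveatedEarlyFusion(nn.Module):
        def __init__(self, num_frames=4, num_classes=487):
            super().__init__()
            in_ch = 3 * num_frames                          # frames stacked along channels
            def stream():
                return nn.Sequential(
                    nn.Conv2d(in_ch, 64, kernel_size=7, stride=2), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.context, self.fovea = stream(), stream()
            self.fc = nn.Linear(128, num_classes)

        def forward(self, clip):                            # clip: (B, T, 3, 178, 178)
            B, T, C, H, W = clip.shape
            x = clip.reshape(B, T * C, H, W)
            ctx = self.context(nn.functional.interpolate(x, size=(H // 2, W // 2)))   # low-res whole frame
            fov = self.fovea(x[:, :, H // 4: 3 * H // 4, W // 4: 3 * W // 4])          # high-res center crop
            return self.fc(torch.cat([ctx, fov], dim=1))

    logits = FoveatedEarlyFusion()(torch.randn(2, 4, 3, 178, 178))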
Journal ArticleDOI

3D Convolutional Neural Networks for Human Action Recognition

TL;DR: This paper develops a novel 3D CNN model for action recognition that extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.
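
A minimal illustration of the core operation described above: a 3D convolution slides over height, width, and time, so each response aggregates motion information from several adjacent frames. The layer sizes and class count below are assumptions for illustration, not the original architecture.

    import torch
    import torch.nn as nn

    # Tiny illustration; sizes and class count are assumptions, not the original model.
    model = nn.Sequential(
        nn.Conv3d(3, 16, kernel_size=(3, 5, 5)),     # (time, height, width) kernel
        nn.ReLU(),
        nn.AdaptiveAvgPool3d(1),
        nn.Flatten(),
        nn.Linear(16, 6),                            # e.g. six action classes
    )

    clip = torch.randn(2, 3, 9, 60, 40)              # (batch, channels, frames, height, width)
    print(model(clip).shape)                         # torch.Size([2, 6])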
Proceedings ArticleDOI

Long-term recurrent convolutional networks for visual recognition and description

TL;DR: A novel recurrent convolutional architecture suitable for large-scale visual learning is proposed; it is end-to-end trainable and shown to have distinct advantages over state-of-the-art models for recognition or generation that are defined and/or optimized separately.
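
A hedged sketch of the recurrent convolutional pattern summarized above: a shared CNN encodes each frame, an LSTM models the frame sequence, and the whole stack is differentiable end to end, so recognition or per-step description heads can be trained jointly with the CNN. The tiny backbone and class count are placeholders, not the paper's architecture.

    import torch
    import torch.nn as nn

    # Hedged sketch; the tiny backbone and class count are placeholders.

    class LRCNSketch(nn.Module):
        def __init__(self, hidden=256, num_classes=101):
            super().__init__()
            self.cnn = nn.Sequential(                       # stand-in for a deep CNN backbone
                nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.lstm = nn.LSTM(32, hidden, batch_first=True)
            self.head = nn.Linear(hidden, num_classes)

        def forward(self, clip):                            # clip: (B, T, 3, H, W)
            B, T = clip.shape[:2]
            frame_feats = self.cnn(clip.flatten(0, 1)).reshape(B, T, -1)
            out, _ = self.lstm(frame_feats)
            return self.head(out[:, -1])                    # predict from the last time step

    logits = LRCNSketch()(torch.randn(2, 8, 3, 112, 112))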