Open Access Proceedings ArticleDOI

Dense-Captioning Events in Videos

TLDR
In this article, the authors introduce the task of dense-captioning events and a model that identifies all events in a single pass of the video while simultaneously describing the detected events with natural language, together with ActivityNet Captions, a large-scale benchmark for the task.
Abstract
Most natural videos contain numerous events. For example, in a video of a “man playing a piano”, the video might also contain “another man dancing” or “a crowd clapping”. We introduce the task of dense-captioning events, which involves both detecting and describing events in a video. We propose a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language. Our model introduces a variant of an existing proposal module that is designed to capture both short as well as long events that span minutes. To capture the dependencies between the events in a video, our model introduces a new captioning module that uses contextual information from past and future events to jointly describe all events. We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events. ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time. Finally, we report performances of our model for dense-captioning events, video retrieval and localization.
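
The pipeline described above couples an event proposal module with a context-aware captioning module. Below is a minimal, hedged sketch of that idea in PyTorch; the feature dimension, strides, vocabulary size, and the GRU/LSTM choices are illustrative assumptions rather than the authors' implementation. Proposals are scored at several temporal strides so both short and long events can be covered, and each event caption is conditioned on mean-pooled features of past and future segments as context.

    import torch
    import torch.nn as nn

    # Hedged sketch (not the authors' code): a multi-scale temporal proposal scorer
    # plus a captioning LSTM conditioned on past/future context. All sizes, the
    # strides, and the GRU/LSTM choices are illustrative assumptions.

    class ProposalScorer(nn.Module):
        def __init__(self, feat_dim=500, hidden=256, strides=(1, 2, 4, 8)):
            super().__init__()
            self.strides = strides                           # longer strides cover longer events
            self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
            self.score = nn.Linear(hidden, 1)

        def forward(self, feats):                            # feats: (B, T, feat_dim) clip features
            proposals = []
            for s in self.strides:
                h, _ = self.rnn(feats[:, ::s, :])            # subsample to change temporal scale
                proposals.append(torch.sigmoid(self.score(h)))   # (B, T//s, 1) proposal scores
            return proposals

    class ContextCaptioner(nn.Module):
        def __init__(self, feat_dim=500, hidden=512, vocab=10000):
            super().__init__()
            self.embed = nn.Embedding(vocab, hidden)
            self.lstm = nn.LSTMCell(hidden + 3 * feat_dim, hidden)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, event, past_ctx, future_ctx, tokens):
            # event, past_ctx, future_ctx: (B, feat_dim); tokens: (B, L) ground-truth word ids
            ctx = torch.cat([event, past_ctx, future_ctx], dim=-1)
            h = c = event.new_zeros(event.size(0), self.lstm.hidden_size)
            logits = []
            for t in range(tokens.size(1)):
                inp = torch.cat([self.embed(tokens[:, t]), ctx], dim=-1)
                h, c = self.lstm(inp, (h, c))
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)                # (B, L, vocab) word logits

    # Toy forward pass on random features: score proposals, then caption one event
    feats = torch.randn(2, 64, 500)
    scores = ProposalScorer()(feats)
    logits = ContextCaptioner()(feats.mean(1), feats[:, :32].mean(1), feats[:, 32:].mean(1),
                                torch.randint(0, 10000, (2, 12)))

In the full model the context would come from the representations of the other detected proposals rather than simple mean pooling; the sketch only shows where that context enters the decoder.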


Citations
Proceedings ArticleDOI

The “Something Something” Video Database for Learning and Evaluating Visual Common Sense

TL;DR: This work describes the ongoing collection of the “something-something” database of video prediction tasks whose solutions require a common sense understanding of the depicted situation, and describes the challenges in crowd-sourcing this data at scale.
Posted Content

VideoBERT: A Joint Model for Video and Language Representation Learning.

TL;DR: In this article, a joint visual-linguistic model is proposed to learn high-level features without any explicit supervision, inspired by the recent success of BERT in language modeling; it outperforms the state of the art on video captioning, and quantitative results verify that the model learns high-level semantic features.
Proceedings ArticleDOI

VideoBERT: A Joint Model for Video and Language Representation Learning

TL;DR: This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
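
As a rough illustration of the mechanism summarized above (discrete visual tokens from vector quantization concatenated with text tokens inside one BERT-style encoder), here is a hedged sketch; the vocabulary sizes, feature dimension, centroid source, and the masked-word head are assumptions for illustration, not the released VideoBERT code.

    import torch
    import torch.nn as nn

    # Hedged sketch, not the released VideoBERT code: clip features are vector-quantized
    # against k-means centroids to get discrete visual tokens, which are concatenated
    # with text tokens and fed to a small Transformer encoder with a masked-word
    # prediction head. All sizes and the source of the centroids are assumptions.

    def quantize(clip_feats, centroids):
        # clip_feats: (N, D) clip features; centroids: (K, D) k-means centers
        dists = torch.cdist(clip_feats, centroids)           # (N, K) pairwise distances
        return dists.argmin(dim=1)                           # nearest centroid index = visual token

    class JointEncoder(nn.Module):
        def __init__(self, text_vocab=30000, visual_vocab=20000, dim=256):
            super().__init__()
            self.text_emb = nn.Embedding(text_vocab, dim)    # separate tables, shared encoder
            self.vis_emb = nn.Embedding(visual_vocab, dim)
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.mlm_head = nn.Linear(dim, text_vocab)       # predicts masked text tokens

        def forward(self, text_ids, visual_ids):
            x = torch.cat([self.text_emb(text_ids), self.vis_emb(visual_ids)], dim=1)
            return self.mlm_head(self.encoder(x))

    # Toy usage with random features and centroids
    centroids = torch.randn(20000, 1024)
    visual_ids = quantize(torch.randn(8, 1024), centroids).unsqueeze(0)   # (1, 8) visual tokens
    text_ids = torch.randint(0, 30000, (1, 16))
    logits = JointEncoder()(text_ids, visual_ids)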
Proceedings ArticleDOI

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

TL;DR: In this paper, the authors introduce the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and present SWAG, a new dataset with 113k multiple choice questions about a rich spectrum of grounded situations.
Proceedings ArticleDOI

TVQA: Localized, Compositional Video Question Answering

TL;DR: This paper presents TVQA, a large-scale video QA dataset based on 6 popular TV shows, and provides analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task.
References
Proceedings Article

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

TL;DR: An attention-based model that automatically learns to describe the content of images is introduced; it can be trained deterministically using standard backpropagation techniques or stochastically by maximizing a variational lower bound.
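
The deterministic ("soft") variant mentioned above is straightforward to sketch: the attention weights over spatial feature locations come from a softmax, so the attended context vector is differentiable and the model trains with ordinary backpropagation. The following is an illustrative sketch only; the dimensions and the single-layer attention scorer are assumptions.

    import torch
    import torch.nn as nn

    # Illustrative sketch only (dimensions and the one-layer scorer are assumptions):
    # at each step the decoder scores every spatial location of the image feature map,
    # takes the softmax-weighted average as a context vector, and feeds it to an LSTM
    # together with the previous word.

    class SoftAttentionDecoder(nn.Module):
        def __init__(self, feat_dim=512, hidden=512, vocab=10000):
            super().__init__()
            self.embed = nn.Embedding(vocab, hidden)
            self.att = nn.Linear(feat_dim + hidden, 1)       # scores one spatial location
            self.lstm = nn.LSTMCell(feat_dim + hidden, hidden)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, feats, tokens):
            # feats: (B, L, feat_dim) spatial features; tokens: (B, T) word ids
            B, L, _ = feats.shape
            h = c = feats.new_zeros(B, self.lstm.hidden_size)
            logits = []
            for t in range(tokens.size(1)):
                scores = self.att(torch.cat([feats, h.unsqueeze(1).expand(B, L, -1)], dim=-1))
                alpha = torch.softmax(scores, dim=1)         # (B, L, 1) attention weights
                context = (alpha * feats).sum(dim=1)         # (B, feat_dim) attended feature
                h, c = self.lstm(torch.cat([self.embed(tokens[:, t]), context], dim=-1), (h, c))
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)                # (B, T, vocab) word logits

    # Toy forward pass: 14x14 = 196 spatial locations, a 12-word caption
    logits = SoftAttentionDecoder()(torch.randn(2, 196, 512), torch.randint(0, 10000, (2, 12)))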
Proceedings Article

Recurrent neural network based language model

TL;DR: Results indicate that it is possible to obtain around a 50% reduction in perplexity by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model.
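
A toy example of why a mixture of language models can reduce perplexity: interpolating the per-word probabilities of several models smooths out each model's worst predictions. The numbers below are made up for illustration and are unrelated to the paper's experiments.

    import torch

    # Toy numbers, unrelated to the paper's experiments: interpolating the per-word
    # probabilities of several language models smooths out each model's worst
    # predictions, which is why the mixture's perplexity can beat every member.

    def perplexity(word_probs):
        # word_probs: probabilities a model assigns to the words of a held-out text
        return torch.exp(-torch.log(word_probs).mean()).item()

    # pretend per-word probabilities from three RNN LMs on the same 5-word text
    lm_probs = torch.tensor([[0.02, 0.10, 0.05, 0.01, 0.20],
                             [0.08, 0.03, 0.12, 0.02, 0.15],
                             [0.05, 0.07, 0.04, 0.06, 0.10]])
    mixture = lm_probs.mean(dim=0)                   # equal-weight linear interpolation
    for i, p in enumerate(lm_probs):
        print(f"LM {i}: perplexity = {perplexity(p):.1f}")
    print(f"Mixture: perplexity = {perplexity(mixture):.1f}")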
Proceedings ArticleDOI

Large-Scale Video Classification with Convolutional Neural Networks

TL;DR: This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggests a multiresolution, foveated architecture as a promising way of speeding up training.
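
A hedged sketch of the two ideas in this summary: early temporal fusion (stacking several frames along the channel axis so the first convolution sees local motion) and a foveated two-stream input (a downsampled whole frame plus a full-resolution center crop, fused late). The layer sizes and the Sports-1M-style class count are illustrative assumptions.

    import torch
    import torch.nn as nn

    # Hedged sketch; layer sizes and the class count are assumptions, not the paper's network.

    class FoveatedEarlyFusion(nn.Module):
        def __init__(self, num_frames=4, num_classes=487):
            super().__init__()
            in_ch = 3 * num_frames                          # frames stacked along channels
            def stream():
                return nn.Sequential(
                    nn.Conv2d(in_ch, 64, kernel_size=7, stride=2), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.context, self.fovea = stream(), stream()
            self.fc = nn.Linear(128, num_classes)

        def forward(self, clip):                            # clip: (B, T, 3, 178, 178)
            B, T, C, H, W = clip.shape
            x = clip.reshape(B, T * C, H, W)
            ctx = self.context(nn.functional.interpolate(x, size=(H // 2, W // 2)))   # low-res whole frame
            fov = self.fovea(x[:, :, H // 4: 3 * H // 4, W // 4: 3 * W // 4])          # high-res center crop
            return self.fc(torch.cat([ctx, fov], dim=1))

    logits = FoveatedEarlyFusion()(torch.randn(2, 4, 3, 178, 178))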
Journal ArticleDOI

3D Convolutional Neural Networks for Human Action Recognition

TL;DR: This paper develops a novel 3D CNN model for action recognition that extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.
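
A minimal illustration of the core operation described above: a 3D convolution slides over height, width, and time, so each response aggregates motion information from several adjacent frames. The layer sizes and class count below are assumptions for illustration, not the original architecture.

    import torch
    import torch.nn as nn

    # Tiny illustration; sizes and class count are assumptions, not the original model.
    model = nn.Sequential(
        nn.Conv3d(3, 16, kernel_size=(3, 5, 5)),     # (time, height, width) kernel
        nn.ReLU(),
        nn.AdaptiveAvgPool3d(1),
        nn.Flatten(),
        nn.Linear(16, 6),                            # e.g. six action classes
    )

    clip = torch.randn(2, 3, 9, 60, 40)              # (batch, channels, frames, height, width)
    print(model(clip).shape)                         # torch.Size([2, 6])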
Proceedings ArticleDOI

Long-term recurrent convolutional networks for visual recognition and description

TL;DR: A novel recurrent convolutional architecture suitable for large-scale visual learning is proposed; it is end-to-end trainable and shown to have distinct advantages over state-of-the-art models for recognition or generation that are defined and/or optimized separately.
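
A hedged sketch of the recurrent convolutional pattern summarized above: a shared CNN encodes each frame, an LSTM models the frame sequence, and the whole stack is differentiable end to end, so recognition or per-step description heads can be trained jointly with the CNN. The tiny backbone and class count are placeholders, not the paper's architecture.

    import torch
    import torch.nn as nn

    # Hedged sketch; the tiny backbone and class count are placeholders.

    class LRCNSketch(nn.Module):
        def __init__(self, hidden=256, num_classes=101):
            super().__init__()
            self.cnn = nn.Sequential(                       # stand-in for a deep CNN backbone
                nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.lstm = nn.LSTM(32, hidden, batch_first=True)
            self.head = nn.Linear(hidden, num_classes)

        def forward(self, clip):                            # clip: (B, T, 3, H, W)
            B, T = clip.shape[:2]
            frame_feats = self.cnn(clip.flatten(0, 1)).reshape(B, T, -1)
            out, _ = self.lstm(frame_feats)
            return self.head(out[:, -1])                    # predict from the last time step

    logits = LRCNSketch()(torch.randn(2, 8, 3, 112, 112))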