Open Access Proceedings ArticleDOI

Localizing Moments in Video with Natural Language

TL;DR
In this paper, the Moment Context Network (MCN) is proposed to identify a specific temporal segment, or moment, from a video given a natural language text description, localizing natural language queries by integrating local and global video features over time.
Abstract
We consider retrieving a specific temporal segment, or moment, from a video given a natural language text description. Methods designed to retrieve whole video clips with natural language determine what occurs in a video but not when. To address this issue, we propose the Moment Context Network (MCN) which effectively localizes natural language queries in videos by integrating local and global video features over time. A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding moment. Therefore, we collect the Distinct Describable Moments (DiDeMo) dataset which consists of over 10,000 unedited, personal videos in diverse visual settings with pairs of localized video segments and referring expressions. We demonstrate that MCN outperforms several baseline methods and believe that our initial results together with the release of DiDeMo will inspire further research on localizing video moments with natural language.
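To make the setup concrete, here is a minimal sketch (not the authors' released implementation) of the retrieval idea the abstract describes: embed the query sentence and each candidate moment, fuse the moment's local features with global video context and temporal endpoint features, and rank candidates by distance in a shared embedding space. The projection matrices W_v and W_s, the feature names, and the shapes are all illustrative assumptions:

    import numpy as np

    def moment_score(query_emb, local_feat, global_feat, temporal_feat, W_v, W_s):
        # Fuse local moment features with global video context and temporal
        # endpoint features, then compare to the sentence embedding.
        video_vec = np.concatenate([local_feat, global_feat, temporal_feat])
        v = W_v @ video_vec            # video side -> shared space
        s = W_s @ query_emb            # sentence side -> shared space
        return -np.linalg.norm(v - s)  # higher score = closer match

    def localize(query_emb, candidates, global_feat, W_v, W_s):
        # candidates: list of (local_feat, temporal_feat), one per segment
        scores = [moment_score(query_emb, lf, global_feat, tf, W_v, W_s)
                  for lf, tf in candidates]
        return int(np.argmax(scores))  # index of the best-matching moment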


Citations
Proceedings ArticleDOI

From Recognition to Cognition: Visual Commonsense Reasoning

TL;DR: To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.
Proceedings ArticleDOI

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

TL;DR: In this paper, the authors introduce the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and present SWAG, a new dataset with 113k multiple choice questions about a rich spectrum of grounded situations.
Proceedings ArticleDOI

TVQA: Localized, Compositional Video Question Answering

TL;DR: This paper presents TVQA, a large-scale video QA dataset based on 6 popular TV shows, and provides analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task.
Posted Content

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

TL;DR: It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
Proceedings ArticleDOI

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

TL;DR: This article proposes learning text-to-video embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations, leading to state-of-the-art results on instructional video datasets such as YouCook2 or CrossTask.
References
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
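The "constant error carousel" refers to the additive cell-state update, c_t = f_t * c_{t-1} + i_t * g_t, which lets error signals flow back across long lags without vanishing. A minimal sketch of one LSTM step in the standard modern formulation (the forget gate postdates the original paper); names and shapes are illustrative:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h, c, W, U, b):
        # W: (4d, n_in), U: (4d, d), b: (4d,) hold stacked parameters for
        # the input gate, forget gate, output gate, and candidate update.
        z = W @ x + U @ h + b
        i, f, o, g = np.split(z, 4)
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # additive carousel
        h_new = sigmoid(o) * np.tanh(c_new)               # gated output
        return h_new, c_new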
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
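Frame-level CNN features of exactly this kind are what moment-retrieval models such as MCN consume. A hedged sketch of extracting 4096-d fc7-style features with torchvision (an assumed modern setup, not this paper's original configuration):

    import torch
    from torchvision import models

    vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
    vgg.eval()
    # Drop the final 1000-way classifier layer to expose 4096-d features.
    extractor = torch.nn.Sequential(
        vgg.features, vgg.avgpool, torch.nn.Flatten(),
        *list(vgg.classifier.children())[:-1])
    with torch.no_grad():
        frames = torch.randn(8, 3, 224, 224)  # stand-in for 8 video frames
        feats = extractor(frames)             # shape: (8, 4096)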
Journal ArticleDOI

ImageNet Large Scale Visual Recognition Challenge

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Proceedings ArticleDOI

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model is introduced that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
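For reference, the weighted least-squares objective this summary alludes to can be written, in the notation of the GloVe paper, as:

    J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where X_{ij} is the word-word co-occurrence count and the weighting function f(x) = (x / x_max)^{3/4} for x < x_max (and 1 otherwise) keeps rare and very frequent co-occurrences from dominating.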
Posted Content

Caffe: Convolutional Architecture for Fast Feature Embedding

TL;DR: Caffe as discussed by the authors is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
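A minimal pycaffe sketch of loading a trained network for feature extraction; the file paths and the 'data' blob name are placeholders that depend on the actual model definition:

    import numpy as np
    import caffe

    caffe.set_mode_cpu()
    # 'deploy.prototxt' / 'weights.caffemodel' are placeholder paths.
    net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)
    net.blobs['data'].data[...] = np.random.rand(1, 3, 224, 224)  # stand-in input
    out = net.forward()  # dict mapping output blob names to numpy arrays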