Open Access · Posted Content

Anticipating Visual Representations from Unlabeled Video

TL;DR
In this article, a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate human actions and objects is presented. The task is challenging partly because it requires leveraging extensive knowledge of the world that is difficult to write down.
Abstract
Anticipating actions and objects before they start or appear is a difficult problem in computer vision with several real-world applications. This task is challenging partly because it requires leveraging extensive knowledge of the world that is difficult to write down. We believe that a promising resource for efficiently learning this knowledge is through readily available unlabeled video. We present a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate human actions and objects. The key idea behind our approach is that we can train deep networks to predict the visual representation of images in the future. Visual representations are a promising prediction target because they encode images at a higher semantic level than pixels yet are automatic to compute. We then apply recognition algorithms on our predicted representation to anticipate objects and actions. We experimentally validate this idea on two datasets, anticipating actions one second in the future and objects five seconds in the future.
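To make the approach concrete, here is a minimal PyTorch sketch of the core training idea (an illustrative reconstruction, not the authors' released code): a frozen pretrained network defines the target representation, and a second network is trained to regress the representation of a future frame. The AlexNet backbone, the 4096-d feature size, and the plain L2 loss are assumptions for illustration; the paper's handling of multiple possible futures is omitted.

```python
# Minimal sketch: train a predictor to regress the deep representation of a
# future frame from the current frame. Assumes PyTorch/torchvision; the
# one-second offset between frames is handled by the data loader.
import torch
import torch.nn as nn
from torchvision import models

# Frozen encoder phi(.) that defines the prediction target.
encoder = models.alexnet(weights="IMAGENET1K_V1")
encoder.classifier = encoder.classifier[:-1]  # drop 1000-way layer -> 4096-d features
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

# Predictor g(.) maps the current frame to the future representation.
predictor = models.alexnet(num_classes=4096)  # same backbone, regression head

def anticipation_loss(frame_now, frame_future):
    """L2 regression from the current frame to phi(future frame)."""
    with torch.no_grad():
        target = encoder(frame_future)   # representation one second ahead
    pred = predictor(frame_now)          # predicted future representation
    return nn.functional.mse_loss(pred, target)

# At test time, a recognition model trained on phi features is applied to
# predictor(frame_now) to anticipate the upcoming action or object.
```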


Citations
Proceedings Article

Generating Videos with Scene Dynamics

TL;DR: A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from its background is proposed; it can generate tiny videos up to a second long at full frame rate, outperforming simple baselines.
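As a hedged sketch of the two-stream idea described above (an illustrative reconstruction, not the authors' code): one stream uses spatio-temporal 3-D deconvolutions to produce a moving foreground plus a mask, the other produces a static background, and the output video is their per-pixel composite. All layer sizes and the 16-frame, 16x16 output are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class TwoStreamGenerator(nn.Module):
    def __init__(self, z_dim=100):
        super().__init__()
        # Foreground stream: z -> (rgb video, mask) via 3-D deconvolutions.
        self.fg = nn.Sequential(
            nn.ConvTranspose3d(z_dim, 64, kernel_size=4), nn.ReLU(),
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, 4, kernel_size=4, stride=2, padding=1),
        )
        # Background stream: z -> one static rgb frame via 2-D deconvolutions.
        self.bg = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 64, kernel_size=4), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, z):
        f = self.fg(z.view(z.size(0), -1, 1, 1, 1))
        video, mask = torch.tanh(f[:, :3]), torch.sigmoid(f[:, 3:4])
        b = torch.tanh(self.bg(z.view(z.size(0), -1, 1, 1)))
        b = b.unsqueeze(2).expand_as(video)   # repeat static frame over time
        return mask * video + (1 - mask) * b  # composite foreground/background
```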
Posted Content

SoundNet: Learning Sound Representations from Unlabeled Video

TL;DR: In this article, the authors leverage the natural synchronization between vision and sound to learn an acoustic representation from two million unlabeled videos, proposing a student-teacher training procedure that transfers discriminative visual knowledge from well-established visual recognition models into the sound modality, using unlabeled video as a bridge.
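A minimal sketch of the student-teacher transfer (assuming PyTorch; both networks here are illustrative stand-ins, not the SoundNet architecture): the frozen visual teacher produces soft class posteriors for video frames, and a raw-waveform student is trained to match them with a KL-divergence loss.

```python
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1000))  # stand-in for a pretrained CNN
student = nn.Sequential(                                             # 1-D convs over raw audio
    nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 1000),
)

def transfer_loss(frames, waveform):
    """KL(teacher posterior || student posterior) on paired, unlabeled video."""
    with torch.no_grad():
        target = teacher(frames).softmax(dim=1)     # soft visual "labels"
    log_pred = student(waveform).log_softmax(dim=1)
    return nn.functional.kl_div(log_pred, target, reduction="batchmean")

# Example with dummy paired data: 4 frames and 4 one-second waveforms.
loss = transfer_loss(torch.randn(4, 3, 64, 64), torch.randn(4, 1, 22050))
```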
Proceedings Article

From Recognition to Cognition: Visual Commonsense Reasoning

TL;DR: To move towards cognition-level understanding, a new reasoning engine, Recognition to Cognition Networks (R2C), is presented that models the necessary layered inferences for grounding, contextualization, and reasoning.
Posted Content

VideoBERT: A Joint Model for Video and Language Representation Learning

TL;DR: In this article, a joint visual-linguistic model, inspired by the recent success of BERT in language modeling, is proposed to learn high-level features without any explicit supervision. It outperforms the state of the art on video captioning, and quantitative results verify that the model learns high-level semantic features.
Proceedings Article

Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty

TL;DR: This work finds that self-supervision can benefit robustness in a variety of ways, including robustness to adversarial examples, label corruption, and common input corruptions, and that it greatly benefits out-of-distribution detection on difficult, near-distribution outliers.
References
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: A deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax achieves state-of-the-art image classification performance.
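Since the TL;DR spells out the layer stack, a quick way to inspect it is torchvision's stock AlexNet (assuming torchvision is installed; the random input is just for shape-checking):

```python
import torch
from torchvision import models

net = models.alexnet(num_classes=1000)
print(net)  # five Conv2d layers (some followed by MaxPool2d), then three Linear layers

logits = net(torch.randn(1, 3, 224, 224))  # 1000-way class scores
probs = logits.softmax(dim=1)              # final softmax over 1000 classes
```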
Journal Article

Dropout: a simple way to prevent neural networks from overfitting

TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
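A minimal illustration of the mechanism (a generic PyTorch sketch, not the paper's experiments): during training each hidden unit is zeroed with some probability, and at test time dropout becomes the identity.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Dropout(p=0.5),  # each unit dropped with probability 0.5 while training
    nn.Linear(256, 10),
)

net.train()
y_train = net(torch.randn(8, 784))  # stochastic: units randomly masked (and rescaled)
net.eval()
y_test = net(torch.randn(8, 784))   # deterministic: dropout is the identity
```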
Posted Content

Distilling the Knowledge in a Neural Network

TL;DR: This work shows that the acoustic model of a heavily used commercial system can be significantly improved by distilling the knowledge of an ensemble of models into a single model, and introduces a new type of ensemble composed of one or more full models and many specialist models that learn to distinguish fine-grained classes the full models confuse.
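A minimal sketch of the distillation loss (the temperature value is an illustrative assumption): the student is trained to match the teacher's temperature-softened class distribution, with the usual T^2 scaling to keep gradient magnitudes comparable across temperatures.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL between softened teacher and student predictions, scaled by T^2."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * T * T
```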
Posted Content

Caffe: Convolutional Architecture for Fast Feature Embedding

TL;DR: Caffe is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Posted Content

CNN Features off-the-shelf: an Astounding Baseline for Recognition

TL;DR: A series of experiments conducted for different recognition tasks, using the publicly available code and model of the OverFeat network trained for object classification on ILSVRC13, suggests that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.
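A sketch of the off-the-shelf recipe (with a torchvision ResNet standing in for OverFeat, which is an assumption): freeze the pretrained network, extract penultimate-layer features, and train only a simple classifier on top.

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()      # expose 512-d penultimate features
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

classifier = nn.Linear(512, 20)  # e.g. a 20-class target task

with torch.no_grad():
    feats = backbone(torch.randn(4, 3, 224, 224))
logits = classifier(feats)       # only this linear layer is trained
```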