Open Access · Posted Content

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

TLDR
A new Two-Stream Inflated 3D ConvNet (I3D), based on 2D ConvNet inflation, is introduced; after pre-training on Kinetics, I3D models considerably improve upon the state of the art in action classification, reaching 80.9% on HMDB-51 and 98.0% on UCF-101.
Abstract
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.9% on HMDB-51 and 98.0% on UCF-101.
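The inflation idea is simple enough to sketch in code. Below is a minimal PyTorch illustration (the function name and `time_dim` are ours, not the paper's): a 2D kernel is repeated N times along a new temporal axis and divided by N, so the inflated filter gives the same response on a temporally constant video as the original filter did on a single frame.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Inflate a 2D convolution into a 3D one by repeating its kernel
    along a new temporal axis and rescaling by 1/time_dim."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    # Repeat ImageNet-pretrained 2D weights T times along time and
    # divide by T so activation magnitudes are preserved.
    w2d = conv2d.weight.data  # (out_ch, in_ch, kH, kW)
    conv3d.weight.data = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    if conv2d.bias is not None:
        conv3d.bias.data = conv2d.bias.data.clone()
    return conv3d

# Example: inflate a 7x7 stem convolution and run it on a 16-frame clip.
stem2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
stem3d = inflate_conv2d(stem2d, time_dim=7)
clip = torch.randn(1, 3, 16, 224, 224)  # (N, C, T, H, W)
print(stem3d(clip).shape)  # torch.Size([1, 64, 16, 112, 112])
```

Applied layer by layer to an ImageNet architecture, this both transfers the 2D design and bootstraps the 3D filters from the pretrained 2D parameters.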


Citations
Proceedings Article · DOI

Video Diffusion Models

TL;DR: The authors propose a diffusion model for video generation as a natural extension of the standard image diffusion architecture; it enables joint training from image and video data, which they find reduces the variance of minibatch gradients and speeds up optimization.
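As a rough sketch of what joint image-and-video training can look like (this is our illustration of the general idea, not the paper's exact mechanism), still images can be routed through the same video model as single-frame clips so both data sources contribute to one objective:

```python
import torch

def images_as_clips(images: torch.Tensor) -> torch.Tensor:
    """View a batch of images (N, C, H, W) as single-frame videos
    (N, C, 1, H, W) so a video model can consume both data sources."""
    return images.unsqueeze(2)

images = torch.randn(8, 3, 64, 64)      # image minibatch
videos = torch.randn(8, 3, 16, 64, 64)  # video minibatch (N, C, T, H, W)
image_clips = images_as_clips(images)   # (8, 3, 1, 64, 64)
# A joint step could then combine losses from both sources, e.g.
#   loss = diffusion_loss(videos) + diffusion_loss(image_clips)
```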
Posted Content

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

TL;DR: It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
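The alignment pretext task lends itself to a small sketch. Below is an illustrative PyTorch construction of training pairs (function name and shapes are our assumptions): positives keep audio and frames aligned, negatives shift the audio in time, and a network is trained to classify which is which.

```python
import torch

def make_alignment_batch(frames: torch.Tensor, audio: torch.Tensor,
                         max_shift: int = 16000):
    """frames: (N, T, C, H, W) video clips; audio: (N, S) waveforms.
    Returns (frames, audio, labels) where label 1 = aligned and
    label 0 = audio circularly shifted in time."""
    n = frames.shape[0]
    labels = torch.zeros(n, dtype=torch.long)
    labels[: n // 2] = 1                  # first half stays aligned
    audio = audio.clone()
    for i in range(n // 2, n):            # misalign the second half
        shift = int(torch.randint(max_shift // 2, max_shift, (1,)))
        audio[i] = torch.roll(audio[i], shifts=shift)
    return frames, audio, labels
```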
Proceedings Article · DOI

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

TL;DR: This paper shows that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP), and that data quality is more important than data quantity for SSVP.
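A central design in the VideoMAE paper is tube masking with a very high masking ratio; a minimal NumPy sketch of that scheme (names and shapes are illustrative) is:

```python
import numpy as np

def tube_mask(num_frames: int, num_patches: int, ratio: float = 0.9,
              rng=np.random.default_rng()):
    """Tube masking: sample one spatial mask and repeat it across all
    frames, so a masked patch is hidden in every frame and cannot be
    copied from a temporal neighbor."""
    masked = rng.choice(num_patches, int(num_patches * ratio), replace=False)
    spatial = np.zeros(num_patches, dtype=bool)
    spatial[masked] = True
    return np.tile(spatial, (num_frames, 1))  # (T, num_patches)

mask = tube_mask(num_frames=8, num_patches=196, ratio=0.9)
print(mask.shape, mask.mean())  # (8, 196), ~0.9 of patches masked
```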
Proceedings Article · DOI

Masked Autoencoders As Spatiotemporal Learners

TL;DR: It is shown that the MAE method can learn strong representations with almost no inductive bias on spacetime, that spacetime-agnostic random masking performs best, and that the general framework of masked autoencoding can be a unified methodology for representation learning with minimal domain knowledge.
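In contrast with the tube masking sketch above, the spacetime-agnostic masking described here samples masked positions uniformly over all T x P spacetime patches with no structure along either axis; a small NumPy sketch (illustrative names):

```python
import numpy as np

def random_spacetime_mask(num_frames: int, num_patches: int,
                          ratio: float = 0.9,
                          rng=np.random.default_rng()):
    """Spacetime-agnostic masking: choose masked positions uniformly
    at random over all num_frames * num_patches spacetime patches."""
    total = num_frames * num_patches
    mask = np.zeros(total, dtype=bool)
    mask[rng.choice(total, int(total * ratio), replace=False)] = True
    return mask.reshape(num_frames, num_patches)
```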
Proceedings Article · DOI

SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos

TL;DR: SoccerNet is a dataset of 500 complete soccer games from six main European leagues, covering three seasons from 2014 to 2017 for a total duration of 764 hours. With an average of one event every 6.9 minutes, it focuses on the problem of localizing very sparse events within long videos.
References
Proceedings Article · DOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously; the resulting networks won 1st place in the ILSVRC 2015 classification task.
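The core mechanism is easy to show in code. A minimal PyTorch sketch of a basic residual block (identity-shortcut case; an illustration, not the paper's full architecture):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """The stacked layers learn a residual F(x); the block outputs
    F(x) + x, which makes very deep networks easier to optimize."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut
```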
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: In this paper, the authors investigate the effect of convolutional network depth on accuracy in the large-scale image recognition setting, showing that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
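The depth is achieved by stacking small 3x3 convolutions: two stacked 3x3 layers cover a 5x5 receptive field with fewer parameters and an extra nonlinearity. A short PyTorch sketch of one such stack (our illustration):

```python
import torch.nn as nn

# Two 3x3 convolutions see a 5x5 receptive field using
# 2 * 9 * C^2 weights instead of 25 * C^2, plus one extra ReLU.
def conv3x3_stack(channels: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
    )
```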
Posted Content

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

TL;DR: Faster R-CNN introduces a Region Proposal Network (RPN) that generates high-quality region proposals, which are then used by Fast R-CNN for detection.
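The RPN scores a fixed set of reference boxes ("anchors") at every feature-map position. A rough NumPy sketch of anchor enumeration (conventions simplified; scales and ratios are example values):

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate k = len(scales) * len(ratios) anchors per position.
    Returns (feat_h * feat_w * k, 4) boxes as (x1, y1, x2, y2) in
    input-image pixels."""
    boxes = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:  # r = width / height here
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    boxes.append([cx - w / 2, cy - h / 2,
                                  cx + w / 2, cy + h / 2])
    return np.array(boxes)

print(generate_anchors(2, 2).shape)  # (36, 4): 2*2 positions * 9 anchors
```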
Proceedings Article · DOI

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

TL;DR: R-CNN combines CNNs with bottom-up region proposals to localize and segment objects; when labeled training data is scarce, supervised pre-training on an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
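The pre-train-then-fine-tune recipe the TL;DR describes looks like this in modern PyTorch (a sketch with torchvision's ResNet standing in for the paper's original backbone; the class count is an example):

```python
import torch.nn as nn
from torchvision.models import resnet18

# Supervised pre-training followed by domain-specific fine-tuning:
# load ImageNet weights, replace the classifier head for the target
# task, and optionally freeze early layers when labeled data is scarce.
model = resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 21)  # e.g. 20 classes + background
for p in model.layer1.parameters():
    p.requires_grad = False  # keep generic low-level features fixed
```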
Posted Content

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

TL;DR: Batch Normalization normalizes layer inputs for each training mini-batch to reduce internal covariate shift in deep neural networks, achieving state-of-the-art performance on ImageNet.
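The transform itself fits in a few lines; a minimal NumPy sketch for the training-time, fully-connected case (inference uses running statistics instead):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (N, D) activations. Normalize each feature over the
    mini-batch, then apply learned scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 8) * 5.0 + 3.0
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1
```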