Open Access
Posted Content
Convolutional Two-Stream Network Fusion for Video Action Recognition
TLDR
A spatial and a temporal network can be fused at the last convolutional layer without loss of performance and with a substantial saving in parameters; pooling abstract convolutional features over spatiotemporal neighbourhoods boosts performance further.
Abstract:
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.
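Findings (i) and (iii) can be illustrated with a minimal NumPy sketch. It is one plausible reading of the abstract, not the authors' implementation: fusion is assumed to mean stacking the two towers' feature maps along the channel axis and mixing them with a 1x1 convolution, and spatiotemporal pooling is assumed to be max pooling over non-overlapping blocks. The function names, shapes, and pooling sizes are hypothetical.

```python
import numpy as np


def conv_fuse(spatial, temporal, weights, bias):
    """Fuse two feature maps with matching shape (..., H, W, D).

    Assumption: fusion = channel-wise stacking followed by a 1x1
    convolution, which is a per-position matrix multiply.
    weights has shape (2*D, D_out), bias has shape (D_out,).
    """
    stacked = np.concatenate([spatial, temporal], axis=-1)  # (..., H, W, 2*D)
    return stacked @ weights + bias                         # (..., H, W, D_out)


def spatiotemporal_max_pool(features, t=2, s=2):
    """Max-pool a (T, H, W, D) volume over non-overlapping t x s x s blocks.

    Assumption: block sizes are illustrative; trailing frames or pixels
    that do not fill a whole block are dropped.
    """
    T, H, W, D = features.shape
    T2, H2, W2 = T // t, H // s, W // s
    cropped = features[:T2 * t, :H2 * s, :W2 * s]
    blocks = cropped.reshape(T2, t, H2, s, W2, s, D)
    return blocks.max(axis=(1, 3, 5))                       # (T2, H2, W2, D)


# Hypothetical feature volumes from the last convolutional layer of each tower.
rng = np.random.default_rng(0)
T, H, W, D = 4, 14, 14, 8
spatial = rng.standard_normal((T, H, W, D))
temporal = rng.standard_normal((T, H, W, D))
weights = rng.standard_normal((2 * D, D)) * 0.1
bias = np.zeros(D)

fused = conv_fuse(spatial, temporal, weights, bias)
pooled = spatiotemporal_max_pool(fused)
print(fused.shape, pooled.shape)  # (4, 14, 14, 8) (2, 7, 7, 8)
```

After fusion a single stream of layers processes the fused map, where late softmax fusion would keep two full towers; this is consistent with the parameter saving the abstract reports.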
Citations
Posted Content
The Kinetics Human Action Video Dataset
Andrew Zisserman, Joao Carreira, Karen Simonyan, Will Kay, Brian Hu Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman
TL;DR: Describes the dataset, its statistics, and how it was collected, and gives baseline performance figures for neural network architectures trained and tested for human action classification on it.
Proceedings Article
SlowFast Networks for Video Recognition
TL;DR: This work presents SlowFast networks for video recognition, which achieve strong performance for both action classification and detection in video; the large improvements are attributed to the SlowFast concept.
Journal Article
Multimodal Machine Learning: A Survey and Taxonomy
TL;DR: This paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.
Posted Content
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
Joao Carreira, Andrew Zisserman
TL;DR: Introduces a new Two-Stream Inflated 3D ConvNet (I3D) based on 2D ConvNet inflation; after pre-training on Kinetics, I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101.
Proceedings Article
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?
TL;DR: Examines whether current video datasets have sufficient data for training very deep convolutional neural networks with spatio-temporal three-dimensional (3D) kernels; the authors believe that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet and stimulate advances in computer vision for videos.