scispace - formally typeset
Open AccessPosted Content

Convolutional Two-Stream Network Fusion for Video Action Recognition

TLDR
In this paper, a spatial and temporal network can be fused at the last convolution layer without loss of performance, but with a substantial saving in parameters, and furthermore, pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance.
Abstract
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.

read more

Citations
More filters
Posted Content

The Kinetics Human Action Video Dataset

TL;DR: The dataset is described, the statistics are described, how it was collected, and some baseline performance figures for neural network architectures trained and tested for human action classification on this dataset are given.
Proceedings ArticleDOI

SlowFast Networks for Video Recognition

TL;DR: This work presents SlowFast networks for video recognition, which achieves strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by the SlowFast concept.
Journal ArticleDOI

Multimodal Machine Learning: A Survey and Taxonomy

TL;DR: This paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.
Posted Content

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

TL;DR: I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics, and a new Two-Stream Inflated 3D Conv net that is based on 2D ConvNet inflation is introduced.
Proceedings ArticleDOI

Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet

TL;DR: Whether current video datasets have sufficient data for training very deep convolutional neural networks with spatio-temporal three-dimensional (3D) kernels is determined and it is believed that using deep 3D CNNs together with Kinetics will retrace the successful history of 2DCNNs and ImageNet, and stimulate advances in computer vision for videos.
References
More filters
Proceedings ArticleDOI

Action Recognition with Improved Trajectories

TL;DR: Dense trajectories were shown to be an efficient video representation for action recognition and achieved state-of-the-art results on a variety of datasets are improved by taking into account camera motion to correct them.
Proceedings ArticleDOI

MatConvNet: Convolutional Neural Networks for MATLAB

TL;DR: MatConvNet exposes the building blocks of CNNs as easy-to-use MATLAB functions, providing routines for computing convolutions with filter banks, feature pooling, normalisation, and much more.
Book ChapterDOI

Improving the fisher kernel for large-scale image classification

TL;DR: In an evaluation involving hundreds of thousands of training images, it is shown that classifiers learned on Flickr groups perform surprisingly well and that they can complement classifier learned on more carefully annotated datasets.
Book ChapterDOI

High Accuracy Optical Flow Estimation Based on a Theory for Warping

TL;DR: By proving that this scheme implements a coarse-to-fine warping strategy, this work gives a theoretical foundation for warping which has been used on a mainly experimental basis so far and demonstrates its excellent robustness under noise.
Proceedings Article

Unsupervised Learning of Video Representations using LSTMs

TL;DR: In this paper, an encoder LSTM is used to map an input video sequence into a fixed length representation, which is then decoded using single or multiple decoder Long Short Term Memory (LSTM) networks to perform different tasks.
Related Papers (5)