Convolutional Two-Stream Network Fusion for Video Action Recognition

- 22 Apr 2016 -

arXiv: Computer Vision and Pattern Recog...

TLDR

This paper shows that a spatial and a temporal network can be fused at the last convolutional layer without loss of performance but with a substantial saving in parameters, and that pooling abstract convolutional features over spatiotemporal neighbourhoods further boosts performance.

Abstract:

Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.
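To make findings (i) and (iii) concrete, the following is a minimal sketch, not the authors' implementation. It assumes that fusion at a convolutional layer means stacking the two streams' feature maps along the channel axis and mixing them with a 1x1 convolution, and it uses non-overlapping max pooling as a simplified stand-in for pooling over spatiotemporal neighbourhoods. The function names (`conv_fusion`, `max_pool_3d`), the tensor layout `(T, H, W, D)`, and all shapes are illustrative assumptions; NumPy is used only for readability.

```python
import numpy as np


def conv_fusion(spatial, temporal, weights, bias):
    """Fuse two feature maps of shape (T, H, W, D) into one (T, H, W, D_out).

    The channels of both streams are stacked, then mixed by a 1x1
    convolution, which is a linear map applied at every location.
    """
    stacked = np.concatenate([spatial, temporal], axis=-1)  # (T, H, W, 2D)
    return stacked @ weights + bias  # weights: (2D, D_out), bias: (D_out,)


def max_pool_3d(x, t, h, w):
    """Non-overlapping max pooling over t x h x w spatiotemporal blocks."""
    T, H, W, D = x.shape
    if T % t or H % h or W % w:
        raise ValueError("pool size must divide the input size")
    blocks = x.reshape(T // t, t, H // h, h, W // w, w, D)
    return blocks.max(axis=(1, 3, 5))


# Hypothetical shapes: 4 frames, 14x14 feature maps, 8 channels per stream.
rng = np.random.default_rng(0)
spatial = rng.standard_normal((4, 14, 14, 8))
temporal = rng.standard_normal((4, 14, 14, 8))
weights = rng.standard_normal((16, 8)) * 0.1  # learned in a real network
bias = np.zeros(8)

fused = conv_fusion(spatial, temporal, weights, bias)
pooled = max_pool_3d(fused, t=2, h=2, w=2)
print(fused.shape, pooled.shape)
```

After fusion a single tower processes the fused features, which is where a parameter saving relative to keeping two full towers up to the softmax layer would come from. Setting `weights` to two stacked identity matrices reduces this sketch to plain element-wise summation of the two streams.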

Citations

Posted Content

The Kinetics Human Action Video Dataset

Andrew Zisserman, +11 more

- 19 May 2017 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper describes the dataset, its statistics, and how it was collected, and gives baseline performance figures for neural network architectures trained and tested for human action classification on it.

Proceedings ArticleDOI

SlowFast Networks for Video Recognition

Christoph Feichtenhofer, +3 more

TL;DR: This work presents SlowFast networks for video recognition, which achieve strong performance for both action classification and detection in video; the large improvements are pinpointed as contributions of the SlowFast concept.

Journal ArticleDOI

Multimodal Machine Learning: A Survey and Taxonomy

Tadas Baltrusaitis, +2 more

- 01 Feb 2019 -

IEEE Transactions on Pattern Analysis an...

TL;DR: This paper surveys recent advances in multimodal machine learning and presents them in a common taxonomy, enabling researchers to better understand the state of the field and identify directions for future research.

Posted Content

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Joao Carreira, +1 more

- 22 May 2017 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper introduces a Two-Stream Inflated 3D ConvNet (I3D) based on 2D ConvNet inflation; after pre-training on Kinetics, I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101.

Proceedings ArticleDOI

Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

Kensho Hara, +2 more

TL;DR: This work examines whether current video datasets have sufficient data for training very deep convolutional neural networks with spatio-temporal three-dimensional (3D) kernels. The authors believe that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet and stimulate advances in computer vision for videos.

References

Proceedings ArticleDOI

Action Recognition with Improved Trajectories

Heng Wang, +1 more

TL;DR: Dense trajectories were shown to be an efficient video representation for action recognition and achieved state-of-the-art results on a variety of datasets. This paper improves their performance by taking camera motion into account to correct them.

Proceedings ArticleDOI

MatConvNet: Convolutional Neural Networks for MATLAB

Andrea Vedaldi, +1 more

TL;DR: MatConvNet exposes the building blocks of CNNs as easy-to-use MATLAB functions, providing routines for computing convolutions with filter banks, feature pooling, normalisation, and much more.

Book ChapterDOI

Improving the Fisher Kernel for Large-Scale Image Classification

Florent Perronnin, +2 more

TL;DR: In an evaluation involving hundreds of thousands of training images, it is shown that classifiers learned on Flickr groups perform surprisingly well and that they can complement classifiers learned on more carefully annotated datasets.

Book ChapterDOI

High Accuracy Optical Flow Estimation Based on a Theory for Warping

Thomas Brox, +3 more

TL;DR: By proving that this scheme implements a coarse-to-fine warping strategy, this work gives a theoretical foundation for warping which has been used on a mainly experimental basis so far and demonstrates its excellent robustness under noise.

Proceedings Article

Unsupervised Learning of Video Representations using LSTMs

Nitish Srivastava, +2 more

TL;DR: In this paper, an encoder LSTM is used to map an input video sequence into a fixed length representation, which is then decoded using single or multiple decoder Long Short Term Memory (LSTM) networks to perform different tasks.

Related Papers (5)

Learning Spatiotemporal Features with 3D Convolutional Networks

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Deep Residual Learning for Image Recognition

Large-scale Video Classification with Convolutional Neural Networks