Open Access · Posted Content

Convolutional Two-Stream Network Fusion for Video Action Recognition

TLDR
This paper shows that a spatial and a temporal network can be fused at the last convolution layer without loss of performance but with a substantial saving in parameters, and that pooling abstract convolutional features over spatiotemporal neighbourhoods further boosts performance.
Abstract
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.
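To make finding (i) concrete, the following is a minimal PyTorch-style sketch of conv-layer fusion: the two streams' feature maps are stacked channel-wise, passed through a learned 3D convolution, and pooled over spatiotemporal neighbourhoods. The ConvFusion module name, channel counts, and tensor shapes are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of conv-layer fusion of a two-stream network (assumed
# shapes/names; not the authors' reference implementation).
import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        # a 3D conv learns correspondences between the stacked streams
        self.fuse = nn.Conv3d(2 * channels, channels, kernel_size=3, padding=1)
        # 3D max pooling aggregates over spatiotemporal neighbourhoods
        self.pool = nn.MaxPool3d(kernel_size=3, stride=2, padding=1)

    def forward(self, spatial_feats, temporal_feats):
        # both inputs: (batch, channels, time, height, width) feature maps
        # taken from the last conv layer of each stream
        x = torch.cat([spatial_feats, temporal_feats], dim=1)  # stack channel-wise
        x = self.fuse(x)
        return self.pool(x)

# example: fuse hypothetical conv5 feature maps from a 5-frame snippet
s = torch.randn(1, 512, 5, 14, 14)  # spatial (appearance) stream
t = torch.randn(1, 512, 5, 14, 14)  # temporal (motion) stream
out = ConvFusion()(s, t)
print(out.shape)  # torch.Size([1, 512, 3, 7, 7])
```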


Citations
Posted Content

The Kinetics Human Action Video Dataset

TL;DR: This paper describes the Kinetics dataset and its statistics, explains how it was collected, and gives baseline performance figures for neural network architectures trained and tested for human action classification on it.
Proceedings ArticleDOI

SlowFast Networks for Video Recognition

TL;DR: This work presents SlowFast networks for video recognition, which achieve strong performance for both action classification and detection in video; ablations pinpoint the large improvements as contributions of the SlowFast concept.
Journal ArticleDOI

Multimodal Machine Learning: A Survey and Taxonomy

TL;DR: This paper surveys recent advances in multimodal machine learning and presents them in a common taxonomy, enabling researchers to better understand the state of the field and identify directions for future research.
Posted Content

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

TL;DR: A new Two-Stream Inflated 3D ConvNet (I3D) based on 2D ConvNet inflation is introduced; after pre-training on Kinetics, I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101.
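The "2D ConvNet inflation" mentioned above can be sketched briefly: each 2D kernel is repeated along the time axis and rescaled so the inflated network initially reproduces the 2D network's response on a static video. The helper below is an illustrative assumption (the name inflate_conv2d and its defaults are hypothetical), not the I3D reference code.

```python
# Sketch of bootstrapping a 3D conv from a pretrained 2D conv (assumed API).
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_depth: int = 3) -> nn.Conv3d:
    """Repeat the 2D kernel along time and divide by the temporal depth,
    so a static (frame-repeated) video yields the original activations."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_depth, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_depth // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, time_depth, 1, 1)
        conv3d.weight.copy_(w / time_depth)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# example: inflate a first conv layer like that of a 2D ImageNet model
inflated = inflate_conv2d(nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3))
```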
Proceedings ArticleDOI

Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

TL;DR: This paper examines whether current video datasets have sufficient data for training very deep convolutional neural networks with spatio-temporal three-dimensional (3D) kernels, and argues that deep 3D CNNs together with Kinetics can retrace the successful history of 2D CNNs and ImageNet and stimulate advances in computer vision for videos.
References
Posted Content

Sequence to Sequence -- Video to Text

TL;DR: A novel end-to-end sequence-to-sequence model for generating video captions, which naturally learns both the temporal structure of the sequence of frames and a language model for the generated sentences.
Posted Content

Learning to track for spatio-temporal action localization

TL;DR: The approach first detects proposals at the frame level and scores them with a combination of static and motion CNN features, then tracks high-scoring proposals throughout the video with a tracking-by-detection approach; it outperforms the state of the art by margins of 15%, 7% and 12% in mAP on three benchmarks.
Proceedings ArticleDOI

Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors

TL;DR: Wang et al. propose a trajectory-pooled deep-convolutional descriptor (TDD) that combines hand-crafted trajectory features with deep-learned convolutional features.
Posted Content

Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice

TL;DR: This paper presents a comprehensive study of all steps in the bag-of-visual-words (BoVW) pipeline and of different fusion methods, and uncovers good practices for producing a state-of-the-art action recognition system.