Convolutional Two-Stream Network Fusion for Video Action Recognition

Posted Content (Open Access)

- 22 Apr 2016 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: A spatial and a temporal network can be fused at the last convolutional layer without loss of performance but with a substantial saving in parameters, and pooling abstract convolutional features over spatiotemporal neighbourhoods further boosts performance.

Abstract:

Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.
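
As a rough illustration of findings (i) and (iii), the sketch below fuses a spatial and a temporal feature map at a convolutional layer by stacking them along the channel axis and applying a 1x1 convolution, then max-pools the fused maps over a spatiotemporal neighbourhood. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names (`conv_fusion`, `pool3d_max`), the tensor shapes, the random weights, and the pooling window are all hypothetical and chosen only for illustration.

```python
import numpy as np

def conv_fusion(spatial, temporal, weights, bias):
    """Fuse two feature maps of shape (T, H, W, C) into one map (T, H, W, C_out).

    The two streams are stacked along the channel axis, then a 1x1
    convolution (a per-location linear map) is applied with `weights`
    of shape (2C, C_out). Names and shapes are assumptions.
    """
    stacked = np.concatenate([spatial, temporal], axis=-1)  # (T, H, W, 2C)
    return stacked @ weights + bias

def pool3d_max(x, window):
    """Non-overlapping max pooling over (T, H, W) with window (t, h, w)."""
    t, h, w = window
    T, H, W, C = x.shape
    assert T % t == 0 and H % h == 0 and W % w == 0
    # Split each pooled axis into (blocks, within-block) and reduce within blocks.
    blocks = x.reshape(T // t, t, H // h, h, W // w, w, C)
    return blocks.max(axis=(1, 3, 5))

# Hypothetical sizes: 4 time steps of 14x14 feature maps with 8 channels.
rng = np.random.default_rng(0)
T, H, W, C = 4, 14, 14, 8
spatial = rng.standard_normal((T, H, W, C))
temporal = rng.standard_normal((T, H, W, C))
weights = rng.standard_normal((2 * C, C)) * 0.1
bias = np.zeros(C)

fused = conv_fusion(spatial, temporal, weights, bias)
pooled = pool3d_max(fused, window=(2, 2, 2))
print(fused.shape, pooled.shape)
```

After fusion a single tower processes the fused maps, which is one way to read the parameter saving the abstract mentions; the exact fusion function and pooling configuration used in the paper may differ from this sketch.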

Citations

Journal Article

3D-CNN-Based Fused Feature Maps with LSTM Applied to Action Recognition

Sheeraz Arif, +3 more

- 13 Feb 2019 -

Future Internet

TL;DR: A new framework that combines 3D-CNN and LSTM networks and integrates discriminative information from a video into a map called a ‘motion map’ using a deep 3-dimensional convolutional network (C3D).

Posted Content

Non-Linear Temporal Subspace Representations for Activity Recognition

Anoop Cherian, +3 more

- 27 Mar 2018 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: A novel pooling method, kernelized rank pooling, is proposed that represents a given sequence compactly as the pre-image of the parameters of a hyperplane in a reproducing kernel Hilbert space, such that projections of the data onto this hyperplane capture their temporal order.

Journal Article

TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Devices.

Ji Lin, +3 more

- 09 Oct 2020 -

IEEE Transactions on Pattern Analysis an...

TL;DR: A generic and effective Temporal Shift Module that offers both high efficiency and high performance and enables learning of action concepts that 2D networks cannot model; visualizing the category attention map shows that a spatial-temporal action detector emerges during the training of classification tasks.

Journal Article

Action Knowledge Transfer for Action Prediction with Partial Videos

Yijun Cai, +3 more

TL;DR: The proposed action knowledge transfer method can significantly improve the performance of action prediction, especially for the actions with small observation ratios (e.g., 10%).

Journal Article

Listen and Look: Audio–Visual Matching Assisted Speech Source Separation

Rui Lu, +2 more

- 05 Jul 2018 -

IEEE Signal Processing Letters

TL;DR: An audio–visual matching network is proposed to learn the correspondence between voice fluctuations and lip movements, together with a framework that applies this network to address the source permutation problem and improve over audio-only speech separation methods.

References

Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

TL;DR: The authors train a deep convolutional neural network that consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieving state-of-the-art ImageNet classification results.

Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small convolution filters, and shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

Proceedings Article

Going deeper with convolutions

Christian Szegedy, +8 more

TL;DR: Inception is a deep convolutional neural network architecture that achieved a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

Proceedings Article

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, +1 more

TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.

Related Papers (5)

Learning Spatiotemporal Features with 3D Convolutional Networks

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Deep Residual Learning for Image Recognition

Large-scale Video Classification with Convolutional Neural Networks