Spatiotemporal Multiplier Networks for Video Action Recognition

doi:10.1109/CVPR.2017.787

Proceedings Article

Christoph Feichtenhofer, +2 more

pp. 7445-7454

TLDR

A general ConvNet architecture for video action recognition based on multiplicative interactions of spacetime features that combines the appearance and motion pathways of a two-stream architecture by motion gating and is trained end-to-end.

Abstract:

This paper presents a general ConvNet architecture for video action recognition based on multiplicative interactions of spacetime features. Our model combines the appearance and motion pathways of a two-stream architecture by motion gating and is trained end-to-end. We theoretically motivate multiplicative gating functions for residual networks and empirically study their effect on classification accuracy. To capture long-term dependencies, we inject identity mapping kernels for learning temporal relationships. Our architecture is fully convolutional in spacetime and able to evaluate a video in a single forward pass. Empirical investigation reveals that our model produces state-of-the-art results on two standard action recognition datasets.
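
The two ideas named in the abstract, motion gating of a residual unit and identity-initialized temporal kernels, can be illustrated with a small sketch. This is not the authors' implementation. It assumes the gating takes the form x_a + F(x_a * x_m), where the motion features multiply the appearance features element-wise at the input of the residual branch and the skip path is left untouched. It also assumes that an "identity mapping kernel" is a temporal filter whose centre tap starts at 1 and whose other taps start at 0. All function names (`motion_gate`, `gated_residual_unit`, `identity_temporal_kernel`, `temporal_conv`) are hypothetical, and features are plain Python lists rather than tensors.

```python
from typing import Callable, List

Vector = List[float]


def motion_gate(appearance: Vector, motion: Vector) -> Vector:
    """Element-wise product of appearance and motion features (the gate)."""
    if len(appearance) != len(motion):
        raise ValueError("feature vectors must have the same length")
    return [a * m for a, m in zip(appearance, motion)]


def gated_residual_unit(
    appearance: Vector,
    motion: Vector,
    residual_fn: Callable[[Vector], Vector],
) -> Vector:
    """Assumed form x_a + F(x_a * x_m).

    The skip connection carries the appearance features unchanged; only
    the input to the residual branch F is modulated by motion.
    """
    gated = motion_gate(appearance, motion)
    residual = residual_fn(gated)
    return [a + r for a, r in zip(appearance, residual)]


def identity_temporal_kernel(size: int = 3) -> Vector:
    """Temporal kernel with centre tap 1 and zeros elsewhere (assumed init)."""
    if size < 1 or size % 2 == 0:
        raise ValueError("kernel size must be a positive odd number")
    kernel = [0.0] * size
    kernel[size // 2] = 1.0
    return kernel


def temporal_conv(sequence: Vector, kernel: Vector) -> Vector:
    """Zero-padded 'same' 1D filtering of one feature channel over time."""
    half = len(kernel) // 2
    output = []
    for t in range(len(sequence)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(sequence):
                acc += weight * sequence[idx]
        output.append(acc)
    return output


if __name__ == "__main__":
    # Toy residual branch standing in for a learned block: doubles its input.
    double = lambda v: [2.0 * x for x in v]
    print(gated_residual_unit([1.0, 2.0], [0.0, 3.0], double))
    print(temporal_conv([1.0, 2.0, 3.0], identity_temporal_kernel(3)))
```

In this sketch, a motion value of zero switches off the residual branch at that position while the skip path still passes the appearance feature through. The identity-initialized kernel leaves a sequence unchanged until training alters its taps, which is one way such a temporal filter could be added to a pretrained network without disturbing its initial behaviour.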

Citations

Proceedings Article

Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet

Kensho Hara, +2 more

TL;DR: Examines whether current video datasets have sufficient data for training very deep convolutional neural networks with spatio-temporal three-dimensional (3D) kernels. The authors believe that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet, and stimulate advances in computer vision for videos.

Journal Article

Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey

Longlong Jing, +1 more

- 01 Nov 2021 -

IEEE Transactions on Pattern Analysis an...

TL;DR: Provides an extensive review of deep learning-based self-supervised methods for learning general visual features from images or videos. As a subset of unsupervised learning, these methods learn general image and video features from large-scale unlabeled data without using any human-annotated labels.

Book Chapter

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

Saining Xie, +4 more

TL;DR: Shows that it is possible to replace many of the expensive 3D convolutions with low-cost 2D convolutions. The best result was achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level semantic features is more useful.

Book Chapter

Videos as Space-Time Region Graphs

Xiaolong Wang, +1 more

TL;DR: The proposed graph representation achieves state-of-the-art results on the Charades and Something-Something datasets and obtains a huge gain when the model is applied in complex environments.

Posted Content

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

Saining Xie, +4 more

- 13 Dec 2017 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: Shows that it is possible to replace many of the 3D convolutions with low-cost 2D convolutions, suggesting that temporal representation learning on high-level "semantic" features is more useful.

References

Proceedings Article

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

TL;DR: Proposes a residual learning framework to ease the training of networks that are substantially deeper than those used previously. The resulting residual networks won 1st place in the ILSVRC 2015 classification task.

Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

TL;DR: A deep convolutional neural network that achieved state-of-the-art performance on ImageNet classification. It consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

Proceedings Article

Going deeper with convolutions

Christian Szegedy, +8 more

TL;DR: Inception is a deep convolutional neural network architecture that achieved a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

Proceedings Article

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, +1 more

TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.

Proceedings Article

Rethinking the Inception Architecture for Computer Vision

Christian Szegedy, +4 more

TL;DR: Explores ways to scale up networks that aim to use the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.

Related Papers (5)

Learning Spatiotemporal Features with 3D Convolutional Networks

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Deep Residual Learning for Image Recognition

Two-Stream Convolutional Networks for Action Recognition in Videos

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild