Open Access · Proceedings Article · DOI

Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection

TLDR
This paper proposes a network architecture that computes and integrates the most important visual cues for action recognition (pose, motion, and the raw images) and introduces a Markov chain model that adds these cues successively.
Abstract
General human action recognition requires understanding of various visual cues. In this paper, we propose a network architecture that computes and integrates the most important visual cues for action recognition: pose, motion, and the raw images. For the integration, we introduce a Markov chain model which adds cues successively. The resulting approach is efficient and applicable to action classification as well as to spatial and temporal action localization. The two contributions clearly improve the performance over respective baselines. The overall approach achieves state-of-the-art action classification performance on HMDB51, J-HMDB and NTU RGB+D datasets. Moreover, it yields state-of-the-art spatio-temporal action localization results on UCF101 and J-HMDB.
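The core idea of the chained integration can be illustrated with a toy sketch: each stage refines a running prediction using only its own cue and the previous stage's output, so a class estimate exists after every cue is added. The features, layer sizes, and weights below are random placeholders, not the paper's trained streams.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, FEAT_DIM = 5, 8

# Toy per-cue feature vectors (in the paper these come from pose,
# optical-flow, and RGB network streams; here they are random stand-ins).
cues = {name: rng.standard_normal(FEAT_DIM) for name in ("pose", "motion", "rgb")}

# One hypothetical linear classifier head per cue.
heads = {name: rng.standard_normal((NUM_CLASSES, FEAT_DIM)) * 0.1 for name in cues}

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def chained_prediction(cues, heads, order=("pose", "motion", "rgb")):
    """Markov-chain-style fusion: each stage updates the running logits
    from its own cue, conditioned only on the previous stage's state."""
    logits = np.zeros(NUM_CLASSES)
    stage_probs = []
    for name in order:
        logits = logits + heads[name] @ cues[name]  # successive refinement
        stage_probs.append(softmax(logits))         # prediction after each cue
    return stage_probs

probs = chained_prediction(cues, heads)
```

Because each stage emits a valid class distribution, the chain can be truncated early for speed, which matches the paper's emphasis on efficiency.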


Citations
Journal ArticleDOI

NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding

TL;DR: This work introduces a large-scale dataset for RGB+D human action recognition, collected from 106 distinct subjects and containing more than 114,000 video samples and 8 million frames, and investigates a novel one-shot 3D activity recognition problem on this dataset.
Book ChapterDOI

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

TL;DR: This article shows that many of the expensive 3D convolutions can be replaced by low-cost 2D convolutions, with the best results achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level semantic features is more useful.
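The speed-accuracy trade-off described above rests on simple parameter arithmetic: a k×k×k 3D convolution costs k times the parameters of its 1×k×k 2D counterpart. A minimal sketch with hypothetical layer sizes:

```python
def conv_params(c_in, c_out, k, temporal=True):
    """Parameter count of a single bias-free convolution layer."""
    kernel = k * k * (k if temporal else 1)
    return c_in * c_out * kernel

# Hypothetical 64 -> 64 channel layer with 3x3 spatial kernels.
p3d = conv_params(64, 64, 3, temporal=True)   # 3x3x3 kernel: 110592 params
p2d = conv_params(64, 64, 3, temporal=False)  # 1x3x3 kernel: 36864 params
assert p3d == 3 * p2d  # swapping 3D for 2D cuts this layer's parameters threefold
```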
Book ChapterDOI

ECO: Efficient Convolutional Network for Online Video Understanding

TL;DR: A network architecture that takes long-term content into account while enabling fast per-video processing, achieving competitive performance across all datasets while being 10 to 80 times faster than state-of-the-art methods.
Proceedings ArticleDOI

PoTion: Pose MoTion Representation for Action Recognition

TL;DR: A novel representation is introduced that gracefully encodes the movement of semantic keypoints; it outperforms other state-of-the-art pose representations and is complementary to standard appearance and motion streams.
Journal ArticleDOI

RGB-D-based human motion recognition with deep learning: A survey

TL;DR: A detailed overview of recent advances in RGB-D-based motion recognition is presented, with the reviewed methods broadly categorized into four groups depending on the modality adopted for recognition: RGB-based, depth-based, skeleton-based, and RGB+D-based.
References
Journal ArticleDOI

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

TL;DR: Quantitative assessments show that SegNet provides good performance with competitive inference time and the most memory-efficient inference compared to other architectures, including FCN and DeconvNet.
Proceedings ArticleDOI

Caffe: Convolutional Architecture for Fast Feature Embedding

TL;DR: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Proceedings ArticleDOI

Learning Spatiotemporal Features with 3D Convolutional Networks

TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.
Proceedings ArticleDOI

Video Google: a text retrieval approach to object matching in videos

TL;DR: An approach to object and scene retrieval which searches for and localizes all occurrences of a user-outlined object in a video; the object is represented by a set of viewpoint-invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination, and partial occlusion.
Proceedings Article

Two-Stream Convolutional Networks for Action Recognition in Videos

TL;DR: This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.