Open Access · Proceedings Article · DOI

Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection

TLDR
This paper proposes a network architecture that computes and integrates the most important visual cues for action recognition (pose, motion, and the raw images) and introduces a Markov chain model that adds these cues successively.
Abstract
General human action recognition requires understanding of various visual cues. In this paper, we propose a network architecture that computes and integrates the most important visual cues for action recognition: pose, motion, and the raw images. For the integration, we introduce a Markov chain model which adds cues successively. The resulting approach is efficient and applicable to action classification as well as to spatial and temporal action localization. The two contributions clearly improve the performance over respective baselines. The overall approach achieves state-of-the-art action classification performance on HMDB51, J-HMDB and NTU RGB+D datasets. Moreover, it yields state-of-the-art spatio-temporal action localization results on UCF101 and J-HMDB.
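The core idea of the chained integration can be illustrated with a toy sketch: each stage refines a running prediction using only its own cue and the previous stage's output, so a class estimate exists after every cue is added. The features, layer sizes, and weights below are random placeholders, not the paper's trained streams.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, FEAT_DIM = 5, 8

# Toy per-cue feature vectors (in the paper these come from pose,
# optical-flow, and RGB network streams; here they are random stand-ins).
cues = {name: rng.standard_normal(FEAT_DIM) for name in ("pose", "motion", "rgb")}

# One hypothetical linear classifier head per cue.
heads = {name: rng.standard_normal((NUM_CLASSES, FEAT_DIM)) * 0.1 for name in cues}

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def chained_prediction(cues, heads, order=("pose", "motion", "rgb")):
    """Markov-chain-style fusion: each stage updates the running logits
    from its own cue, conditioned only on the previous stage's state."""
    logits = np.zeros(NUM_CLASSES)
    stage_probs = []
    for name in order:
        logits = logits + heads[name] @ cues[name]  # successive refinement
        stage_probs.append(softmax(logits))         # prediction after each cue
    return stage_probs

probs = chained_prediction(cues, heads)
```

Because each stage emits a valid class distribution, the chain can be truncated early for speed, which matches the paper's emphasis on efficiency.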


Citations
Journal ArticleDOI

NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding

TL;DR: This work introduces a large-scale dataset for RGB+D human action recognition, collected from 106 distinct subjects and containing more than 114,000 video samples and 8 million frames, and investigates a novel one-shot 3D activity recognition problem on this dataset.
Book ChapterDOI

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

TL;DR: This article shows that many of the expensive 3D convolutions can be replaced by low-cost 2D convolutions, with the best results achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level semantic features is more useful.
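The speed-accuracy trade-off described above rests on simple parameter arithmetic: a k×k×k 3D convolution costs k times the parameters of its 1×k×k 2D counterpart. A minimal sketch with hypothetical layer sizes:

```python
def conv_params(c_in, c_out, k, temporal=True):
    """Parameter count of a single bias-free convolution layer."""
    kernel = k * k * (k if temporal else 1)
    return c_in * c_out * kernel

# Hypothetical 64 -> 64 channel layer with 3x3 spatial kernels.
p3d = conv_params(64, 64, 3, temporal=True)   # 3x3x3 kernel: 110592 params
p2d = conv_params(64, 64, 3, temporal=False)  # 1x3x3 kernel: 36864 params
assert p3d == 3 * p2d  # swapping 3D for 2D cuts this layer's parameters threefold
```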
Book ChapterDOI

ECO: Efficient Convolutional Network for Online Video Understanding

TL;DR: A network architecture that takes long-term content into account while enabling fast per-video processing, achieving competitive performance across all datasets while being 10 to 80 times faster than state-of-the-art methods.
Proceedings ArticleDOI

PoTion: Pose MoTion Representation for Action Recognition

TL;DR: A novel representation is introduced that gracefully encodes the movement of semantic keypoints; it outperforms other state-of-the-art pose representations and is complementary to standard appearance and motion streams.
Journal ArticleDOI

RGB-D-based human motion recognition with deep learning: A survey

TL;DR: A detailed overview of recent advances in RGB-D-based motion recognition is presented, with the reviewed methods broadly categorized into four groups depending on the modality adopted for recognition: RGB-based, depth-based, skeleton-based, and RGB+D-based.
References
Journal ArticleDOI

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

TL;DR: Quantitative assessments show that SegNet provides good performance with competitive inference time and the most memory-efficient inference compared to other architectures, including FCN and DeconvNet.
Proceedings ArticleDOI

Caffe: Convolutional Architecture for Fast Feature Embedding

TL;DR: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Proceedings ArticleDOI

Learning Spatiotemporal Features with 3D Convolutional Networks

TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.
Proceedings ArticleDOI

Video Google: a text retrieval approach to object matching in videos

TL;DR: An approach to object and scene retrieval which searches for and localizes all occurrences of a user-outlined object in a video; the object is represented by a set of viewpoint-invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination, and partial occlusion.
Proceedings Article

Two-Stream Convolutional Networks for Action Recognition in Videos

TL;DR: This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.