Proceedings ArticleDOI

RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos

TLDR
The proposed recurrent pose-attention network (RPAN) is an end-to-end recurrent network which can exploit important spatial-temporal evolutions of human pose to assist action recognition in a unified framework and outperforms the recent state-of-the-art methods on these challenging datasets.
Abstract
Recent studies demonstrate the effectiveness of Recurrent Neural Networks (RNNs) for action recognition in videos. However, previous works mainly utilize video-level category as supervision to train RNNs, which may prevent RNNs from learning complex motion structures along time. In this paper, we propose a recurrent pose-attention network (RPAN) to address this challenge, where we introduce a novel pose-attention mechanism to adaptively learn pose-related features at every time-step action prediction of RNNs. More specifically, we make three main contributions in this paper. Firstly, unlike previous works on pose-related action recognition, our RPAN is an end-to-end recurrent network which can exploit important spatial-temporal evolutions of human pose to assist action recognition in a unified framework. Secondly, instead of learning individual human-joint features separately, our pose-attention mechanism learns robust human-part features by sharing attention parameters partially on the semantically-related human joints. These human-part features are then fed into the human-part pooling layer to construct a highly-discriminative pose-related representation for temporal action modeling. Thirdly, one important byproduct of our RPAN is pose estimation in videos, which can be used for coarse pose annotation in action videos. We evaluate the proposed RPAN quantitatively and qualitatively on two popular benchmarks, i.e., Sub-JHMDB and PennAction. Experimental results show that RPAN outperforms the recent state-of-the-art methods on these challenging datasets.
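To make the attention mechanism concrete, below is a minimal sketch of one pose-attention step based only on what the abstract states: spatial convolutional features and the previous RNN hidden state score each spatial location, attention parameters are shared within each semantically-related body part, and the attended part features are pooled into a single pose-related representation fed to the RNN. The class name, the joint-to-part grouping, and all tensor shapes are hypothetical illustrations, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of one pose-attention step, assuming:
# a conv feature map of shape (C, H, W), an RNN hidden state h_{t-1}, and a
# hypothetical grouping of human joints into semantically-related body parts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PoseAttentionStep(nn.Module):
    def __init__(self, feat_dim, hidden_dim, parts):
        super().__init__()
        self.parts = parts  # e.g. ["torso", "left_arm", "right_arm", ...]
        # One attention projection per body part, shared by the joints inside it.
        self.part_proj = nn.ModuleDict({
            name: nn.Linear(feat_dim + hidden_dim, 1) for name in parts
        })

    def forward(self, feat_map, h_prev):
        # feat_map: (C, H, W) conv features; h_prev: (hidden_dim,) RNN state.
        C, H, W = feat_map.shape
        cells = feat_map.reshape(C, H * W).t()           # (H*W, C) spatial cells
        h_tiled = h_prev.unsqueeze(0).expand(H * W, -1)  # condition on RNN state
        part_feats = []
        for name in self.parts:
            scores = self.part_proj[name](torch.cat([cells, h_tiled], dim=1))
            alpha = F.softmax(scores, dim=0)               # spatial attention weights
            part_feats.append((alpha * cells).sum(dim=0))  # attended part feature
        # Human-part pooling: concatenate part features into one pose representation.
        return torch.cat(part_feats, dim=0)              # fed to the RNN at step t
```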


Citations
Proceedings ArticleDOI

Learning Attentions: Residual Attentional Siamese Network for High Performance Online Visual Tracking

TL;DR: A Residual Attentional Siamese Network (RASNet) for high-performance object tracking that not only mitigates the over-fitting problem in deep network training but also enhances its discriminative capacity and adaptability, thanks to the separation of representation learning and discriminator learning.
Proceedings ArticleDOI

Recognizing Human Actions as the Evolution of Pose Estimation Maps

TL;DR: This work presents a novel method to recognize human action as the evolution of pose estimation maps, which outperforms most state-of-the-art methods.
Proceedings ArticleDOI

PoTion: Pose MoTion Representation for Action Recognition

TL;DR: This paper introduces a novel representation that gracefully encodes the movement of semantic keypoints, outperforms other state-of-the-art pose representations, and is complementary to standard appearance and motion streams.
Journal ArticleDOI

RGB-D-based human motion recognition with deep learning: A survey

TL;DR: A detailed overview of recent advances in RGB-D-based motion recognition is presented in this paper, where the reviewed methods are broadly categorized into four groups, depending on the modality adopted for recognition: RGB-based, depth-based, skeleton-based, and RGB+D-based.
Proceedings ArticleDOI

Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison

TL;DR: This paper introduces a large-scale Word-Level American Sign Language (WLASL) video dataset, containing more than 2,000 words performed by over 100 signers.
References
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: This paper proposes a residual learning framework that eases the training of networks substantially deeper than those used previously; the resulting model won 1st place on the ILSVRC 2015 classification task.
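As a reminder of the core idea, here is a minimal residual-block sketch in the spirit of that paper; the kernel sizes, normalization choice, and block structure are illustrative assumptions rather than the paper's exact configuration.

```python
# A minimal sketch of a residual block: the layers learn a residual F(x) that is
# added back to the identity shortcut, which eases optimization of very deep
# networks. Illustrative only, not the paper's exact architecture.
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut: y = F(x) + x
```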
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: A deep convolutional neural network consisting of five convolutional layers, some followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax achieved state-of-the-art performance on the ImageNet classification benchmark.
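The layer layout in that summary can be sketched directly; the channel counts, kernel sizes, and input resolution below are illustrative assumptions, not the paper's exact hyperparameters.

```python
# Layout matching the TL;DR: five conv layers (some followed by max-pooling) and
# three fully-connected layers ending in a 1000-way softmax. Channel counts and
# kernel sizes are illustrative assumptions; input is assumed to be 3x224x224.
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
)
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),  # logits for a final 1000-way softmax
)
```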
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Proceedings ArticleDOI

Going deeper with convolutions

TL;DR: This paper proposes Inception, a deep convolutional neural network architecture that achieved the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Proceedings ArticleDOI

Learning Spatiotemporal Features with 3D Convolutional Networks

TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.
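For intuition, the following toy example shows the spatiotemporal 3D convolution underlying such clip-level features; the shapes and channel counts are illustrative assumptions, not the C3D network itself.

```python
# A toy sketch of 3D convolution for video clips: the kernel slides jointly over
# (time, height, width), unlike 2D convolutions applied to single frames.
# Shapes below are illustrative assumptions only.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)        # (batch, RGB, frames, H, W)
conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)
pool3d = nn.MaxPool3d(kernel_size=(1, 2, 2))  # pool space, keep temporal extent early
features = pool3d(torch.relu(conv3d(clip)))   # (1, 64, 16, 56, 56)
print(features.shape)
```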