scispace - formally typeset
Open AccessPosted Content

Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition.

TLDR
Wang et al. as discussed by the authors designed a compound scaling strategy to expand the model's width and depth synchronously, and eventually obtained a family of efficient GCN baselines with high accuracies and small amounts of trainable parameters, where ''x'' denotes the scaling coefficient.
Abstract
One essential problem in skeleton-based action recognition is how to extract discriminative features over all skeleton joints. However, the complexity of the recent State-Of-The-Art (SOTA) models for this task tends to be exceedingly sophisticated and over-parameterized. The low efficiency in model training and inference has increased the validation costs of model architectures in large-scale datasets. To address the above issue, recent advanced separable convolutional layers are embedded into an early fused Multiple Input Branches (MIB) network, constructing an efficient Graph Convolutional Network (GCN) baseline for skeleton-based action recognition. In addition, based on such the baseline, we design a compound scaling strategy to expand the model's width and depth synchronously, and eventually obtain a family of efficient GCN baselines with high accuracies and small amounts of trainable parameters, termed EfficientGCN-Bx, where ''x'' denotes the scaling coefficient. On two large-scale datasets, i.e., NTU RGB+D 60 and 120, the proposed EfficientGCN-B4 baseline outperforms other SOTA methods, e.g., achieving 91.7% accuracy on the cross-subject benchmark of NTU 60 dataset, while being 3.15x smaller and 3.21x faster than MS-G3D, which is one of the best SOTA methods. The source code in PyTorch version and the pretrained models are available at this https URL.

read more

Citations
More filters
Posted ContentDOI

Human Action Recognition from Various Data Modalities: A Review

TL;DR: This paper reviews both the hand-crafted feature-based and deep learning-based methods for single data modalities and also the methods based on multiple modalities, including the fusion-based frameworks and the co-learning-based approaches for HAR.
Journal ArticleDOI

A union of deep learning and swarm-based optimization for 3D human action recognition

TL;DR: DSwarm-Net as mentioned in this paper employs deep learning and swarm intelligence-based metaheuristic for HAR that uses 3D skeleton data for action classification and extracts four different types of features from the skeletal data namely: Distance, Distance Velocity, Angle, and Angle Velocity, which capture complementary information from the skeleton joints for encoding them into images.
Journal ArticleDOI

Convolutional Neural Networks or Vision Transformers: Who Will Win the Race for Action Recognitions in Visual Data?

TL;DR: Zhang et al. as discussed by the authors conducted a comparative study of the accuracy-complexity trade-off between CNN and Transformer for action recognition in video clips, and based on the performance analysis's outcome, the question of whether CNN or Vision Transformers will win the race was discussed.
Journal ArticleDOI

Human Action Recognition: A Taxonomy-Based Survey, Updates, and Opportunities

TL;DR: In this paper , a taxonomy-based, rigorous study of human activity recognition techniques is presented, discussing the best ways to acquire human action features, derived using RGB and depth data, as well as the latest research on deep learning and hand-crafted techniques.
References
More filters
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Posted Content

Semi-Supervised Classification with Graph Convolutional Networks

TL;DR: A scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs which outperforms related methods by a significant margin.
Journal ArticleDOI

Squeeze-and-Excitation Networks

TL;DR: This work proposes a novel architectural unit, which is term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels and finds that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost.
Posted Content

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

TL;DR: This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases including object detection, finegrain classification, face attributes and large scale geo-localization.
Proceedings ArticleDOI

MobileNetV2: Inverted Residuals and Linear Bottlenecks

TL;DR: MobileNetV2 as mentioned in this paper is based on an inverted residual structure where the shortcut connections are between the thin bottleneck layers and intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity.
Related Papers (5)