Open Access Posted Content
Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition
TL;DR: The authors design a compound scaling strategy to expand the model's width and depth synchronously, eventually obtaining a family of efficient GCN baselines, termed EfficientGCN-Bx, with high accuracies and small numbers of trainable parameters, where "x" denotes the scaling coefficient.
Abstract
One essential problem in skeleton-based action recognition is how to extract discriminative features over all skeleton joints. However, recent State-Of-The-Art (SOTA) models for this task tend to be exceedingly sophisticated and over-parameterized. Their low efficiency in model training and inference has increased the cost of validating model architectures on large-scale datasets. To address this issue, recent advanced separable convolutional layers are embedded into an early-fused Multiple Input Branches (MIB) network, constructing an efficient Graph Convolutional Network (GCN) baseline for skeleton-based action recognition. In addition, based on this baseline, we design a compound scaling strategy to expand the model's width and depth synchronously, eventually obtaining a family of efficient GCN baselines with high accuracies and small numbers of trainable parameters, termed EfficientGCN-Bx, where "x" denotes the scaling coefficient. On two large-scale datasets, i.e., NTU RGB+D 60 and 120, the proposed EfficientGCN-B4 baseline outperforms other SOTA methods, e.g., achieving 91.7% accuracy on the cross-subject benchmark of the NTU 60 dataset, while being 3.15x smaller and 3.21x faster than MS-G3D, one of the best SOTA methods. The source code in PyTorch and the pretrained models are available at this https URL.
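The compound scaling idea described in the abstract can be sketched in a few lines. This is a hypothetical, EfficientNet-style illustration of deriving width and depth from a single coefficient; the function name and the constants `alpha` and `beta` are illustrative placeholders, not the values used in the paper:

```python
def compound_scale(base_width, base_depth, x, alpha=1.2, beta=1.35):
    # Hypothetical compound scaling: a single coefficient x expands
    # channel width and network depth synchronously.
    # alpha and beta are illustrative constants, not the paper's values.
    width = int(round(base_width * alpha ** x))
    depth = int(round(base_depth * beta ** x))
    return width, depth
```

With x = 0 the baseline is unchanged; larger x grows both dimensions together, which is what yields the EfficientGCN-Bx family rather than scaling width or depth in isolation.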
Citations
Posted Content
Human Action Recognition from Various Data Modalities: A Review
TL;DR: This paper reviews both hand-crafted feature-based and deep learning-based methods for single data modalities, as well as methods based on multiple modalities, including fusion-based frameworks and co-learning-based approaches for HAR.
Journal Article
A union of deep learning and swarm-based optimization for 3D human action recognition
TL;DR: DSwarm-Net, as mentioned in this paper, combines deep learning with a swarm-intelligence-based metaheuristic for HAR using 3D skeleton data. It extracts four types of features from the skeletal data, namely Distance, Distance Velocity, Angle, and Angle Velocity, which capture complementary information from the skeleton joints, and encodes them into images for action classification.
Journal Article
Development and Validation of a Deep Learning Method to Predict Cerebral Palsy From Spontaneous Movements in Infants at High Risk
Daniel Groos, Lars Adde, Sindre Aubert, Lynn Boswell, R de Regnier, Toril Fjørtoft, Deborah Gaebler-Spira, A. Haukeland, Marianne Loennecken, Michael E. Msall, Unn Inger Møinichen, Aurelie Pascal, Colleen Peyton, Heri Ramampiaro, Michael D. Schreiber, Inger Elisabeth Silberg, Nils Thomas Songstad, Niranjan Thomas, Christine Van den Broeck, Gunn Kristin Øberg, Espen A. F. Ihlen, Ragnhild Støen +21 more
TL;DR: It is suggested that deep learning–based assessments could support early detection of CP in infants at high risk of perinatal brain injury.
Journal Article
Convolutional Neural Networks or Vision Transformers: Who Will Win the Race for Action Recognitions in Visual Data?
Oumaima Moutik, Hiba Sekkat, Smail Tigani, Abdellah Chehri, Rachid Saadane, Taha Ait Tchakoucht, Anand Paul +6 more
TL;DR: The authors conduct a comparative study of the accuracy-complexity trade-off between CNNs and Transformers for action recognition in video clips and, based on the outcome of the performance analysis, discuss whether CNNs or Vision Transformers will win the race.
Journal Article
Human Action Recognition: A Taxonomy-Based Survey, Updates, and Opportunities
TL;DR: In this paper, a taxonomy-based, rigorous study of human activity recognition techniques is presented, discussing the best ways to acquire human action features derived from RGB and depth data, as well as the latest research on deep learning and hand-crafted techniques.
References
Proceedings Article
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously; the framework won 1st place in the ILSVRC 2015 classification task.
Posted Content
Semi-Supervised Classification with Graph Convolutional Networks
Thomas Kipf, Max Welling +1 more
TL;DR: A scalable approach for semi-supervised learning on graph-structured data, based on an efficient variant of convolutional neural networks that operate directly on graphs, which outperforms related methods by a significant margin.
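The propagation rule behind this graph-convolution variant, which GCN-based skeleton models build on, can be sketched in a few lines. This is a minimal NumPy illustration of symmetric normalization with self-loops, not the authors' implementation:

```python
import numpy as np

def gcn_layer(adj, feats, weights):
    # One graph-convolution layer in the spirit of Kipf & Welling:
    # H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)
    a_hat = adj + np.eye(adj.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ feats @ weights, 0.0)  # propagate + ReLU
```

In the skeleton setting, `adj` would encode joint connectivity and `feats` the per-joint features; stacking such layers mixes information along the skeleton graph.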
Journal Article
Squeeze-and-Excitation Networks
TL;DR: This work proposes a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels; SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost.
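The squeeze-excite-recalibrate pattern can be sketched at inference time. This is a minimal NumPy illustration assuming a feature map of shape (C, H, W) and two hypothetical fully-connected weight matrices `w1` of shape (C // r, C) and `w2` of shape (C, C // r):

```python
import numpy as np

def se_block(x, w1, w2):
    # Squeeze: global average pooling collapses each channel to a scalar.
    z = x.mean(axis=(1, 2))                    # shape (C,)
    # Excitation: bottleneck FC + ReLU, then FC + sigmoid.
    s = np.maximum(w1 @ z, 0.0)                # shape (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))        # shape (C,), weights in (0, 1)
    # Recalibrate: rescale each channel of the input feature map.
    return x * s[:, None, None]
```

The bottleneck ratio `r` keeps the extra parameter count small, which is why SE blocks add so little computational cost.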
Posted Content
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Andrew Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, M. Andreetto, Hartwig Adam +7 more
TL;DR: This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy, and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases, including object detection, fine-grained classification, face attributes, and large-scale geo-localization.
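The efficiency gain of the depthwise separable convolutions underlying MobileNets (the same family of separable layers the EfficientGCN baseline borrows) can be checked with a quick parameter count; this is a standard back-of-the-envelope comparison, not code from either paper:

```python
def conv_params(c_in, c_out, k=3):
    # Standard k x k convolution (bias omitted): one k x k filter
    # per (input channel, output channel) pair.
    return c_in * c_out * k * k

def separable_params(c_in, c_out, k=3):
    # Depthwise k x k (one filter per input channel) followed by
    # a pointwise 1x1 convolution mixing channels.
    return c_in * k * k + c_in * c_out
```

For c_in = 32, c_out = 64, k = 3, this gives 18,432 versus 2,336 parameters, roughly an 8x reduction, which is where the speed and size advantages of separable baselines come from.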
Proceedings Article
MobileNetV2: Inverted Residuals and Linear Bottlenecks
TL;DR: MobileNetV2, as mentioned in this paper, is based on an inverted residual structure where the shortcut connections are between the thin bottleneck layers, while the intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity.