Author

Anirudh Thatipelli

Bio: Anirudh Thatipelli is an academic researcher from the International Institute of Information Technology, Hyderabad. The author has contributed to research in the topic of Bottleneck, has an h-index of 2, and has co-authored 4 publications receiving 5 citations.
Topics: Bottleneck

Papers
Posted Content
TL;DR: The results from benchmarking the top performers of NTU-120 on Skeletics-152 reveal the challenges and domain gap induced by actions 'in the wild', and proposes new frontiers for human action recognition.
Abstract: In this paper, we study current and upcoming frontiers across the landscape of skeleton-based human action recognition. To begin with, we benchmark state-of-the-art models on the NTU-120 dataset and provide a multi-layered assessment of the results. To examine skeleton action recognition 'in the wild', we introduce Skeletics-152, a curated and 3-D pose-annotated subset of RGB videos sourced from Kinetics-700, a large-scale action dataset. The results from benchmarking the top performers of NTU-120 on Skeletics-152 reveal the challenges and domain gap induced by actions 'in the wild'. We extend our study to include out-of-context actions by introducing Skeleton-Mimetics, a dataset derived from the recently introduced Mimetics dataset. Finally, as a new frontier for action recognition, we introduce Metaphorics, a dataset with caption-style annotated YouTube videos of the popular social game Dumb Charades and interpretative dance performances. Overall, our work characterizes the strengths and limitations of existing approaches and datasets. It also provides an assessment of top-performing approaches across a spectrum of activity settings and, via the introduced datasets, proposes new frontiers for human action recognition.

24 citations

Journal ArticleDOI
TL;DR: Skeletics-152, as discussed by the authors, is a curated, 3-D pose-annotated subset of RGB videos sourced from Kinetics-700, a large-scale action dataset; the authors also introduce Skeleton-Mimetics, derived from the Mimetics dataset, and Metaphorics, a dataset with caption-style annotated YouTube videos of the popular social game Dumb Charades and interpretative dance performances.
Abstract: In this paper, we study current and upcoming frontiers across the landscape of skeleton-based human action recognition. To study skeleton action recognition in the wild, we introduce Skeletics-152, a curated and 3-D pose-annotated subset of RGB videos sourced from Kinetics-700, a large-scale action dataset. We extend our study to include out-of-context actions by introducing Skeleton-Mimetics, a dataset derived from the recently introduced Mimetics dataset. We also introduce Metaphorics, a dataset with caption-style annotated YouTube videos of the popular social game Dumb Charades and interpretative dance performances. We benchmark state-of-the-art models on the NTU-120 dataset and provide a multi-layered assessment of the results. The results from benchmarking the top performers of NTU-120 on the newly introduced datasets reveal the challenges and domain gap induced by actions in the wild. Overall, our work characterizes the strengths and limitations of existing approaches and datasets. Via the introduced datasets, our work enables new frontiers for human action recognition.

16 citations

Posted Content
TL;DR: In this article, the authors introduce two new pose-based human action datasets, NTU60-X and NTU120-X, which include finger and facial joints, enabling a richer skeleton representation.
Abstract: The lack of fine-grained joints (facial joints, hand fingers) is a fundamental performance bottleneck for state-of-the-art skeleton action recognition models. Despite this bottleneck, the community's efforts seem to be invested only in coming up with novel architectures. To specifically address this bottleneck, we introduce two new pose-based human action datasets - NTU60-X and NTU120-X. Our datasets extend the largest existing action recognition dataset, NTU-RGBD. In addition to the 25 body joints for each skeleton as in NTU-RGBD, the NTU60-X and NTU120-X datasets include finger and facial joints, enabling a richer skeleton representation. We appropriately modify the state-of-the-art approaches to enable training using the introduced datasets. Our results demonstrate the effectiveness of these NTU-X datasets in overcoming the aforementioned bottleneck and improving state-of-the-art performance, overall and on previously worst-performing action categories.

Posted Content
27 Jan 2021
TL;DR: In this article, the authors introduce a new skeleton-based human action dataset, NTU60-X, which includes finger and facial joints, enabling a richer skeleton representation and improving state-of-the-art performance.
Abstract: The lack of fine-grained joints such as hand fingers is a fundamental performance bottleneck for state-of-the-art skeleton action recognition models trained on the largest action recognition dataset, NTU-RGBD. To address this bottleneck, we introduce a new skeleton-based human action dataset - NTU60-X. In addition to the 25 body joints for each skeleton as in NTU-RGBD, the NTU60-X dataset includes finger and facial joints, enabling a richer skeleton representation. We appropriately modify the state-of-the-art approaches to enable training using the introduced dataset. Our results demonstrate the effectiveness of NTU60-X in overcoming the aforementioned bottleneck and improving state-of-the-art performance, overall and on hitherto worst-performing action categories.
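
To make the "richer skeleton representation" concrete, here is a minimal sketch (not the authors' code) of how an extended NTU-X style skeleton could be assembled per frame. Only the 25 NTU-RGBD body joints are stated in the abstract; the finger and facial joint counts below are illustrative assumptions.

```python
# Minimal sketch: assembling an extended skeleton in the spirit of NTU60-X/NTU120-X.
# The finger/face joint counts are assumptions for illustration only.
import numpy as np

NUM_BODY_JOINTS = 25      # as in NTU-RGBD
NUM_FINGER_JOINTS = 42    # assumed: 21 per hand, two hands
NUM_FACE_JOINTS = 51      # assumed facial landmark count

def build_extended_skeleton(body, fingers, face):
    """Concatenate body, finger and facial joints into one (J, 3) array."""
    assert body.shape == (NUM_BODY_JOINTS, 3)
    assert fingers.shape == (NUM_FINGER_JOINTS, 3)
    assert face.shape == (NUM_FACE_JOINTS, 3)
    return np.concatenate([body, fingers, face], axis=0)

# A clip then becomes a (T, J, 3) tensor, here J = 25 + 42 + 51 = 118 joints.
frames = [build_extended_skeleton(np.zeros((25, 3)),
                                  np.zeros((42, 3)),
                                  np.zeros((51, 3))) for _ in range(64)]
clip = np.stack(frames)   # shape (64, 118, 3)
```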

Cited by
Proceedings ArticleDOI
01 Jun 2022
TL;DR: PoseConv3D as mentioned in this paper uses a 3D heatmap volume instead of a graph sequence as the base representation of human skeletons, which is more effective in learning spatio-temporal features, more robust against pose estimation noises, and generalizes better in cross-dataset settings.
Abstract: Human skeleton, as a compact representation of human action, has received increasing attention in recent years. Many skeleton-based action recognition methods adopt GCNs to extract features on top of human skeletons. Despite the positive results shown in these attempts, GCN-based methods are subject to limitations in robustness, interoperability, and scalability. In this work, we propose PoseConv3D, a new approach to skeleton-based action recognition. PoseConv3D relies on a 3D heatmap volume instead of a graph sequence as the base representation of human skeletons. Compared to GCN-based methods, PoseConv3D is more effective in learning spatiotemporal features, more robust against pose estimation noises, and generalizes better in cross-dataset settings. Also, PoseConv3D can handle multiple-person scenarios without additional computation costs. The hierarchical features can be easily integrated with other modalities at early fusion stages, providing a great design space to boost the performance. PoseConv3D achieves the state-of-the-art on five of six standard skeleton-based action recognition benchmarks. Once fused with other modalities, it achieves the state-of-the-art on all eight multi-modality action recognition benchmarks. Code has been made available at: https://github.com/kennymckormick/pyskl.
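
The key design choice described above is replacing the skeleton graph sequence with a 3D heatmap volume. Below is a minimal sketch of how 2D keypoints could be rasterized into such a (K, T, H, W) volume using per-joint Gaussian heatmaps; it is an illustration under assumed resolution and sigma, not the released pyskl implementation.

```python
# Minimal sketch: 2D keypoints -> 3D heatmap volume (K, T, H, W), the kind of
# base representation PoseConv3D feeds to a 3D-CNN. Parameters are illustrative.
import numpy as np

def keypoints_to_heatmap_volume(keypoints, H=64, W=64, sigma=1.0):
    """keypoints: (T, K, 2) array of (x, y) coordinates in pixel space."""
    T, K, _ = keypoints.shape
    ys, xs = np.mgrid[0:H, 0:W]
    volume = np.zeros((K, T, H, W), dtype=np.float32)
    for t in range(T):
        for k in range(K):
            x, y = keypoints[t, k]
            volume[k, t] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return volume  # consumed by a 3D-CNN instead of a skeleton graph sequence

# Example: 32 frames of 17 COCO-style keypoints on a 64x64 grid.
kpts = np.random.rand(32, 17, 2) * 64
vol = keypoints_to_heatmap_volume(kpts)   # shape (17, 32, 64, 64)
```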

44 citations

Journal ArticleDOI
18 Dec 2020-Sensors
TL;DR: The AMIRO social robotics framework as discussed by the authors is designed in a modular and robust way for assistive care scenarios, including robotic services for navigation, person detection and recognition, multi-lingual natural language interaction and dialogue management, as well as activity recognition and general behavior composition.
Abstract: Recent studies in social robotics show that it can provide economic efficiency and growth in domains such as retail, entertainment, and active and assisted living (AAL). Recent work also highlights that users expect affordable social robotics platforms providing focused and specific assistance in a robust manner. In this paper, we present the AMIRO social robotics framework, designed in a modular and robust way for assistive care scenarios. The framework includes robotic services for navigation, person detection and recognition, multi-lingual natural language interaction and dialogue management, as well as activity recognition and general behavior composition. We present a platform-independent implementation of AMIRO based on the Robot Operating System (ROS). We focus on quantitative evaluations of each functionality module, discussing their performance in different settings and possible improvements. We showcase the deployment of the AMIRO framework on a popular social robotics platform, the Pepper robot, and present the experience of developing a complex user interaction scenario employing all available functionality modules within AMIRO.
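
As an illustration of the modular, ROS-based design described above, here is a minimal sketch of how one functionality module (person detection is used purely as an example) might publish its results for other modules to compose. The topic name and message type are assumptions, not AMIRO's actual interfaces.

```python
#!/usr/bin/env python
# Minimal sketch (not part of AMIRO): a ROS node publishing module results
# on a topic so other modules can subscribe and compose behaviors.
import rospy
from std_msgs.msg import String

def person_detection_node():
    rospy.init_node('person_detection', anonymous=True)
    pub = rospy.Publisher('/amiro/person_detection', String, queue_size=10)
    rate = rospy.Rate(10)  # publish at 10 Hz
    while not rospy.is_shutdown():
        # In a real module this message would come from a perception pipeline.
        pub.publish(String(data='person_detected'))
        rate.sleep()

if __name__ == '__main__':
    try:
        person_detection_node()
    except rospy.ROSInterruptException:
        pass
```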

9 citations

Journal ArticleDOI
01 Dec 2022
TL;DR: Zoom Transformer, as discussed by the authors, exploits both the low-level single-person motion information and the high-level multi-person interaction information in a uniform model structure with carefully designed relation-aware maps.
Abstract: Skeleton-based human action recognition has attracted increasing attention and many methods have been proposed to boost the performance. However, these methods still confront three main limitations: 1) Focusing on single-person action recognition while neglecting the group activity of multiple people (more than 5 people). In practice, multi-person group activity recognition via skeleton data is also a meaningful problem. 2) Being unable to mine high-level semantic information from the skeleton data, such as interactions among multiple people and their positional relationships. 3) Existing datasets used for multi-person group activity recognition all involve RGB videos and cannot be directly applied to skeleton-based group activity analysis. To address these issues, we propose a novel Zoom Transformer to exploit both the low-level single-person motion information and the high-level multi-person interaction information in a uniform model structure with carefully designed Relation-aware Maps. Besides, we estimate the multi-person skeletons from existing real-world video datasets, i.e., Kinetics and Volleyball-Activity, and release two new benchmarks to verify the effectiveness of our Zoom Transformer. Extensive experiments demonstrate that our model can effectively cope with skeleton-based multi-person group activity. Additionally, experiments on the large-scale NTU-RGB+D dataset validate that our model also achieves remarkable performance for single-person action recognition. The code and the skeleton data are publicly available at https://github.com/Kebii/Zoom-Transformer
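
To illustrate the kind of high-level multi-person information that relation-aware maps are meant to capture, here is a minimal sketch that encodes pairwise positional relationships between people in a frame. The centroid-distance formulation is an assumption for illustration, not the Zoom Transformer implementation.

```python
# Minimal sketch: a simple pairwise relation map over multi-person skeletons,
# illustrating positional relationships between people in one frame.
import numpy as np

def pairwise_relation_map(skeletons):
    """skeletons: (P, J, 3) joints of P people in one frame.

    Returns a (P, P) matrix of distances between the people's joint centroids.
    """
    centers = skeletons.mean(axis=1)                  # (P, 3) per-person centroid
    diff = centers[:, None, :] - centers[None, :, :]  # (P, P, 3)
    return np.linalg.norm(diff, axis=-1)              # (P, P)

# Example: 6 people, 17 joints each.
frame = np.random.rand(6, 17, 3)
rel = pairwise_relation_map(frame)   # symmetric (6, 6) relation map
```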

5 citations

Journal ArticleDOI
01 Dec 2022
TL;DR: Li et al. as mentioned in this paper proposed a Motion Guided Attention Learning (MG-AL) framework, which formulates the action representation learning as a self-supervised motion attention prediction problem.
Abstract: 3D human action recognition has received increasing attention due to its potential application in video surveillance equipment. To guarantee satisfactory performance, previous studies are mainly based on supervised methods, which incur a large amount of manual annotation cost. In addition, general deep networks for video sequences suffer from heavy computational costs and thus cannot satisfy the basic requirements of embedded systems. In this paper, a novel Motion Guided Attention Learning (MG-AL) framework is proposed, which formulates action representation learning as a self-supervised motion attention prediction problem. Specifically, MG-AL is a lightweight network. A set of simple motion priors (e.g., intra-joint variance, inter-frame deviation, and cross-joint covariance), which minimizes additional parameters and computational overhead, is regarded as a supervisory signal to guide the attention generation. The encoder is trained via predicting multiple self-attention tasks to capture action-specific feature representations. Extensive evaluations are performed on three challenging benchmark datasets (NTU-RGB+D 60, NTU-RGB+D 120 and NW-UCLA). The proposed method achieves superior performance compared to state-of-the-art methods while having a very low computational cost.
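
To make the motion priors concrete, here is a minimal sketch computing statistics of the kind named above (intra-joint variance, inter-frame deviation, cross-joint covariance) from a skeleton clip. The exact formulas are assumptions for illustration, not the MG-AL definitions.

```python
# Minimal sketch: simple motion statistics from a skeleton clip of shape (T, J, 3),
# of the kind that could serve as self-supervised targets for attention learning.
import numpy as np

def motion_priors(clip):
    """clip: (T, J, 3) joint positions over T frames."""
    # Intra-joint variance: how much each joint moves over the clip.
    intra_joint_var = clip.var(axis=0).sum(axis=-1)                               # (J,)
    # Inter-frame deviation: mean displacement between consecutive frames.
    inter_frame_dev = np.linalg.norm(np.diff(clip, axis=0), axis=-1).mean(axis=0) # (J,)
    # Cross-joint covariance: covariance across per-joint trajectories (x-coordinate).
    cross_joint_cov = np.cov(clip[..., 0].T)                                      # (J, J)
    return intra_joint_var, inter_frame_dev, cross_joint_cov

# Example: 64 frames, 25 joints.
clip = np.random.rand(64, 25, 3)
iv, ifd, cjc = motion_priors(clip)
```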

4 citations