Author

Aditya Aggarwal

Bio: Aditya Aggarwal is an academic researcher from the International Institute of Information Technology, Hyderabad. The author has contributed to research on the topics of 3D reconstruction and rendering (computer graphics), has an h-index of 2, and has co-authored 5 publications receiving 6 citations.

Papers
Posted Content
TL;DR: Benchmarking the top performers of NTU-120 on Skeletics-152 reveals the challenges and domain gap induced by actions 'in the wild', and the work proposes new frontiers for human action recognition.
Abstract: In this paper, we study current and upcoming frontiers across the landscape of skeleton-based human action recognition. To begin with, we benchmark state-of-the-art models on the NTU-120 dataset and provide a multi-layered assessment of the results. To examine skeleton action recognition 'in the wild', we introduce Skeletics-152, a curated and 3-D pose-annotated subset of RGB videos sourced from Kinetics-700, a large-scale action dataset. The results from benchmarking the top performers of NTU-120 on Skeletics-152 reveal the challenges and domain gap induced by actions 'in the wild'. We extend our study to include out-of-context actions by introducing Skeleton-Mimetics, a dataset derived from the recently introduced Mimetics dataset. Finally, as a new frontier for action recognition, we introduce Metaphorics, a dataset with caption-style annotated YouTube videos of the popular social game Dumb Charades and interpretative dance performances. Overall, our work characterizes the strengths and limitations of existing approaches and datasets. It also provides an assessment of top-performing approaches across a spectrum of activity settings and, via the introduced datasets, proposes new frontiers for human action recognition.

24 citations
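
To make the benchmarking described in the abstract concrete, here is a minimal sketch of a cross-dataset evaluation loop that scores a skeleton-action model by top-1 accuracy on two test splits; the `model` object, its input convention, and the dataset objects are placeholders, not code from the paper.

```python
# Hypothetical cross-dataset evaluation; the real Skeletics-152 pipeline is not shown on this page.
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def top1_accuracy(model, loader, device="cpu"):
    """Fraction of clips whose highest-scoring class matches the ground-truth label."""
    model.eval()
    correct, total = 0, 0
    for skeletons, labels in loader:          # skeletons: (B, T, J, 3) pose sequences
        logits = model(skeletons.to(device))  # (B, num_classes)
        correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)

# Usage sketch: score an NTU-120-trained model on an "in the wild" split to expose the domain gap.
# ntu_acc  = top1_accuracy(model, DataLoader(ntu120_test, batch_size=32))
# wild_acc = top1_accuracy(model, DataLoader(skeletics152_test, batch_size=32))
```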

Journal ArticleDOI
TL;DR: Skeletics-152, as discussed by the authors, is a curated, 3-D pose-annotated subset of RGB videos sourced from Kinetics-700, a large-scale action dataset; the work also introduces Skeleton-Mimetics, derived from the Mimetics dataset, and Metaphorics, a dataset with caption-style annotated YouTube videos of the popular social game Dumb Charades and interpretative dance performances.
Abstract: In this paper, we study current and upcoming frontiers across the landscape of skeleton-based human action recognition. To study skeleton action recognition in the wild, we introduce Skeletics-152, a curated and 3-D pose-annotated subset of RGB videos sourced from Kinetics-700, a large-scale action dataset. We extend our study to include out-of-context actions by introducing Skeleton-Mimetics, a dataset derived from the recently introduced Mimetics dataset. We also introduce Metaphorics, a dataset with caption-style annotated YouTube videos of the popular social game Dumb Charades and interpretative dance performances. We benchmark state-of-the-art models on the NTU-120 dataset and provide a multi-layered assessment of the results. The results from benchmarking the top performers of NTU-120 on the newly introduced datasets reveal the challenges and domain gap induced by actions in the wild. Overall, our work characterizes the strengths and limitations of existing approaches and datasets. Via the introduced datasets, our work enables new frontiers for human action recognition.

16 citations

Posted Content
TL;DR: This paper presents a new system to obtain dense object reconstructions along with 6-DoF poses from a single image, and demonstrates that the approach—dubbed reconstruct, rasterize and backprop (RRB)—achieves significantly lower pose estimation errors compared to prior art, and is able to recover dense object shapes and poses from imagery.
Abstract: This paper presents a new system to obtain dense object reconstructions along with 6-DoF poses from a single image. Geared towards high fidelity reconstruction, several recent approaches leverage implicit surface representations and deep neural networks to estimate a 3D mesh of an object, given a single image. However, all such approaches recover only the shape of an object; the reconstruction is often in a canonical frame, unsuitable for downstream robotics tasks. To this end, we leverage recent advances in differentiable rendering (in particular, rasterization) to close the loop with 3D reconstruction in the camera frame. We demonstrate that our approach, dubbed reconstruct, rasterize and backprop (RRB), achieves significantly lower pose estimation errors compared to prior art, and is able to recover dense object shapes and poses from imagery. We further extend our results to an (offline) setup, where we demonstrate a dense monocular object-centric egomotion estimation system.

1 citation
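
As a rough illustration of "closing the loop" with differentiable rasterization, the sketch below refines a 6-DoF pose by rendering the predicted mesh as a silhouette and backpropagating an image-space loss to the pose parameters. The `render_silhouette` callable, the axis-angle parameterization, and the loss are illustrative assumptions, not the authors' RRB implementation.

```python
# Illustrative pose refinement through a differentiable renderer (not the paper's code).
import torch

def refine_pose(mesh, observed_mask, render_silhouette, iters=200, lr=1e-2):
    """render_silhouette(mesh, rot, trans) -> 2-D mask, differentiable w.r.t. rot/trans."""
    rot = torch.zeros(3, requires_grad=True)    # axis-angle rotation
    trans = torch.zeros(3, requires_grad=True)  # translation in the camera frame
    optim = torch.optim.Adam([rot, trans], lr=lr)
    for _ in range(iters):
        optim.zero_grad()
        rendered = render_silhouette(mesh, rot, trans)
        loss = torch.nn.functional.mse_loss(rendered, observed_mask)
        loss.backward()   # the "backprop" step: gradients flow through the rasterizer to the pose
        optim.step()
    return rot.detach(), trans.detach()
```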

Proceedings ArticleDOI
02 Jul 2019
TL;DR: This work presents a new SLAM framework that uses monocular edge-based SLAM, along with category-level models, to localize objects in the scene and improve the camera trajectory, jointly optimizing the camera trajectory and the objects' poses, shapes, and 3D structure.
Abstract: Monocular SLAM is a well-studied problem and has shown significant progress in recent years, but challenges remain in creating a rich semantic description of the scene. Feature-based visual SLAM systems are vulnerable to erroneous pose estimates due to insufficient tracking of mapped points or motion-induced errors such as large or in-place rotations. We present a new SLAM framework in which we use monocular edge-based SLAM [1], along with category-level models, to localize objects in the scene as well as improve the camera trajectory. In monocular SLAM systems, the camera track tends to break under abrupt motion, which reduces the number of 2D point correspondences. To tackle this problem, we propose the first principled formulation of its kind, which integrates object category models into the core SLAM back-end to jointly optimize the camera trajectory together with object poses, shapes, and 3D structure. We show that our joint optimization is able to recover a better camera trajectory in such cases compared to Edge SLAM. Moreover, this method gives a better visualization by incorporating object representations in the scene along with the 3D structure of the base SLAM system, which makes it useful for augmented reality (AR) applications.

1 citation
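
To illustrate what a joint optimization over camera trajectory, object pose, and shape can look like, here is a hedged sketch of a residual function that stacks per-frame reprojection errors with a regularizer on category-level shape coefficients; the pinhole projection, the linear shape basis, and the weighting are illustrative assumptions, not the paper's back-end.

```python
# Sketch of a joint objective over camera poses and object shape (illustrative only).
import numpy as np

def project(points_3d, pose, K):
    """Pinhole projection of Nx3 points given a 3x4 pose [R|t] and 3x3 intrinsics K."""
    cam = (pose[:, :3] @ points_3d.T + pose[:, 3:4]).T   # world -> camera frame
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3]

def joint_residuals(cam_poses, mean_shape, basis, coeffs, observations, K, w=0.1):
    """Reprojection residuals over all frames plus a prior keeping the shape near the category mean."""
    shape = (mean_shape + basis @ coeffs).reshape(-1, 3)  # category-level linear shape model
    res = [(project(shape, pose, K) - obs).ravel()        # obs: Nx2 keypoints for that frame
           for pose, obs in zip(cam_poses, observations)]
    res.append(w * coeffs)                                # shape regularizer
    return np.concatenate(res)                            # e.g. minimized with scipy.optimize.least_squares
```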

Proceedings ArticleDOI
25 Apr 2020
TL;DR: In this article, the authors present a new system to obtain dense object reconstructions along with 6-DoF poses from a single image by leveraging recent advances in differentiable rendering (in particular, rasterization) to close the loop with 3D reconstruction in the camera frame.
Abstract: This paper presents a new system to obtain dense object reconstructions along with 6-DoF poses from a single image. Geared towards high fidelity reconstruction, several recent approaches leverage implicit surface representations and deep neural networks to estimate a 3D mesh of an object, given a single image. However, all such approaches recover only the shape of an object; the reconstruction is often in a canonical frame, unsuitable for downstream robotics tasks. To this end, we leverage recent advances in differentiable rendering (in particular, rasterization) to close the loop with 3D reconstruction in the camera frame. We demonstrate that our approach, dubbed reconstruct, rasterize and backprop (RRB), achieves significantly lower pose estimation errors compared to prior art, and is able to recover dense object shapes and poses from imagery. We further extend our results to an (offline) setup, where we demonstrate a dense monocular object-centric egomotion estimation system.

Cited by
Proceedings ArticleDOI
01 Jun 2022
TL;DR: PoseConv3D as mentioned in this paper uses a 3D heatmap volume instead of a graph sequence as the base representation of human skeletons, which is more effective in learning spatio-temporal features, more robust against pose estimation noises, and generalizes better in cross-dataset settings.
Abstract: Human skeleton, as a compact representation of human action, has received increasing attention in recent years. Many skeleton-based action recognition methods adopt GCNs to extract features on top of human skeletons. Despite the positive results shown in these attempts, GCN-based methods are subject to limitations in robustness, interoperability, and scalability. In this work, we propose PoseConv3D, a new approach to skeleton-based action recognition. PoseConv3D relies on a 3D heatmap volume instead of a graph sequence as the base representation of human skeletons. Compared to GCN-based methods, PoseConv3D is more effective in learning spatiotemporal features, more robust against pose estimation noises, and generalizes better in cross-dataset settings. Also, PoseConv3D can handle multiple-person scenarios without additional computation costs. The hierarchical features can be easily integrated with other modalities at early fusion stages, providing a great design space to boost the performance. PoseConv3D achieves the state-of-the-art on five of six standard skeleton-based action recognition benchmarks. Once fused with other modalities, it achieves the state-of-the-art on all eight multi-modality action recognition benchmarks. Code has been made available at: https://github.com/kennymckormick/pyskl.

44 citations
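
A minimal sketch of what a per-joint heatmap volume can look like, assuming 2-D joint estimates are expanded into stacked Gaussians over time; the shapes and the Gaussian width are illustrative choices, not PoseConv3D's exact preprocessing.

```python
# Build a (J, T, H, W) heatmap volume from 2-D joints; a 3-D CNN can then convolve over (T, H, W).
import numpy as np

def heatmap_volume(joints, H=64, W=64, sigma=2.0):
    """joints: (T, J, 2) pixel coordinates of J joints over T frames."""
    T, J, _ = joints.shape
    ys, xs = np.mgrid[0:H, 0:W]
    vol = np.zeros((J, T, H, W), dtype=np.float32)
    for t in range(T):
        for j in range(J):
            x, y = joints[t, j]
            vol[j, t] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return vol
```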

Journal ArticleDOI
18 Dec 2020 - Sensors
TL;DR: The AMIRO social robotics framework as discussed by the authors is designed in a modular and robust way for assistive care scenarios, including robotic services for navigation, person detection and recognition, multi-lingual natural language interaction and dialogue management, as well as activity recognition and general behavior composition.
Abstract: Recent studies in social robotics show that it can provide economic efficiency and growth in domains such as retail, entertainment, and active and assisted living (AAL). Recent work also highlights that users expect affordable social robotics platforms that provide focused, specific assistance in a robust manner. In this paper, we present the AMIRO social robotics framework, designed in a modular and robust way for assistive care scenarios. The framework includes robotic services for navigation, person detection and recognition, multi-lingual natural language interaction and dialogue management, as well as activity recognition and general behavior composition. We present AMIRO's platform-independent implementation based on the Robot Operating System (ROS). We focus on quantitative evaluations of each functionality module, discussing their performance in different settings and possible improvements. We showcase the deployment of the AMIRO framework on a popular social robotics platform, the Pepper robot, and present the experience of developing a complex user interaction scenario employing all available functionality modules within AMIRO.

9 citations

Journal ArticleDOI
01 Dec 2022
TL;DR: Zoom Transformer, as discussed by the authors, exploits both the low-level single-person motion information and the high-level multi-person interaction information in a uniform model structure with carefully designed relation-aware maps.
Abstract: Skeleton-based human action recognition has attracted increasing attention, and many methods have been proposed to boost performance. However, these methods still confront three main limitations: 1) They focus on single-person action recognition while neglecting the group activity of multiple people (more than 5 people); in practice, multi-person group activity recognition via skeleton data is also a meaningful problem. 2) They are unable to mine high-level semantic information from the skeleton data, such as interactions among multiple people and their positional relationships. 3) Existing datasets used for multi-person group activity recognition all involve RGB videos and cannot be directly applied to skeleton-based group activity analysis. To address these issues, we propose a novel Zoom Transformer to exploit both the low-level single-person motion information and the high-level multi-person interaction information in a uniform model structure with carefully designed Relation-aware Maps. Besides, we estimate multi-person skeletons from the existing real-world video datasets, i.e., Kinetics and Volleyball-Activity, and release two new benchmarks to verify the effectiveness of our Zoom Transformer. Extensive experiments demonstrate that our model can effectively cope with skeleton-based multi-person group activity. Additionally, experiments on the large-scale NTU-RGB+D dataset validate that our model also achieves remarkable performance for single-person action recognition. The code and the skeleton data are publicly available at https://github.com/Kebii/Zoom-Transformer.

5 citations
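
As one plausible reading of the "positional relationships" mentioned in the abstract, the sketch below computes pairwise displacement and distance between people's skeleton centroids; it is an illustrative stand-in, not the paper's Relation-aware Map definition.

```python
# Pairwise person-to-person relation features from multi-person skeletons (illustrative).
import numpy as np

def pairwise_relations(skeletons):
    """skeletons: (P, J, 3) joints for P people -> (P, P, 4) features [dx, dy, dz, distance]."""
    centroids = skeletons.mean(axis=1)                       # (P, 3) one centroid per person
    diff = centroids[:, None, :] - centroids[None, :, :]     # (P, P, 3) pairwise displacements
    dist = np.linalg.norm(diff, axis=-1, keepdims=True)      # (P, P, 1) pairwise distances
    return np.concatenate([diff, dist], axis=-1)
```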

Journal ArticleDOI
TL;DR: Zoom Transformer as mentioned in this paper exploits both low-level single-person motion information and the high-level multi-person interaction information in a uniform model structure with carefully designed Relation-aware Maps.
Abstract: Skeleton-based human action recognition has attracted increasing attention, and many methods have been proposed to boost performance. However, these methods still confront three main limitations: 1) They focus on single-person action recognition while neglecting the group activity of multiple people (more than 5 people); in practice, multi-person group activity recognition via skeleton data is also a meaningful problem. 2) They are unable to mine high-level semantic information from the skeleton data, such as interactions among multiple people and their positional relationships. 3) Existing datasets used for multi-person group activity recognition all involve RGB videos and cannot be directly applied to skeleton-based group activity analysis. To address these issues, we propose a novel Zoom Transformer to exploit both the low-level single-person motion information and the high-level multi-person interaction information in a uniform model structure with carefully designed Relation-aware Maps. Besides, we estimate multi-person skeletons from the existing real-world video datasets, i.e., Kinetics and Volleyball-Activity, and release two new benchmarks to verify the effectiveness of our Zoom Transformer. Extensive experiments demonstrate that our model can effectively cope with skeleton-based multi-person group activity. Additionally, experiments on the large-scale NTU-RGB+D dataset validate that our model also achieves remarkable performance for single-person action recognition. The code and the skeleton data are publicly available at https://github.com/Kebii/Zoom-Transformer.

5 citations

Journal ArticleDOI
01 Dec 2022
TL;DR: Li et al., as mentioned in this paper, proposed a Motion Guided Attention Learning (MG-AL) framework, which formulates action representation learning as a self-supervised motion attention prediction problem.
Abstract: 3D human action recognition has received increasing attention due to its potential application in video surveillance equipment. To guarantee satisfactory performance, previous studies are mainly based on supervised methods, which incur a large amount of manual annotation cost. In addition, general deep networks for video sequences suffer from heavy computational costs and thus cannot satisfy the basic requirements of embedded systems. In this paper, a novel Motion Guided Attention Learning (MG-AL) framework is proposed, which formulates action representation learning as a self-supervised motion attention prediction problem. Specifically, MG-AL is a lightweight network. A set of simple motion priors (e.g., intra-joint variance, inter-frame deviation, and cross-joint covariance), which minimizes additional parameters and computational overhead, is regarded as a supervisory signal to guide the attention generation. The encoder is trained by predicting multiple self-attention tasks to capture action-specific feature representations. Extensive evaluations are performed on three challenging benchmark datasets (NTU-RGB+D 60, NTU-RGB+D 120 and NW-UCLA). The proposed method achieves superior performance compared to state-of-the-art methods, while having a very low computational cost.

4 citations
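
The motion priors named in the abstract can be read off a skeleton sequence directly; the sketch below computes straightforward versions of intra-joint variance, inter-frame deviation, and cross-joint covariance from a (T, J, 3) sequence. These formulas are assumed interpretations of the names, not the authors' exact definitions.

```python
# Simple motion priors from a skeleton sequence, usable as self-supervised regression targets.
import numpy as np

def motion_priors(seq):
    """seq: (T, J, 3) joint positions over T frames."""
    intra_joint_var = seq.var(axis=0).mean(axis=-1)                               # (J,) temporal variance per joint
    inter_frame_dev = np.linalg.norm(np.diff(seq, axis=0), axis=-1).mean(axis=0)  # (J,) mean frame-to-frame motion
    flat = seq.reshape(seq.shape[0], -1)                                          # (T, J*3)
    cross_joint_cov = np.cov(flat, rowvar=False)                                  # (3J, 3J) covariance across joint coordinates
    return intra_joint_var, inter_frame_dev, cross_joint_cov
```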