Proceedings Article•DOI•

The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

01 Dec 2013-pp 2752-2759
TL;DR: A fast, simple, yet powerful non-parametric Moving Pose (MP) framework that enables low-latency recognition, one-shot learning, and action detection in difficult unsegmented sequences and is real-time, scalable, and outperforms more sophisticated approaches on challenging benchmarks.
Abstract: Human action recognition under low observational latency is receiving growing interest in computer vision due to rapidly developing technologies in human-robot interaction, computer gaming and surveillance. In this paper we propose a fast, simple, yet powerful non-parametric Moving Pose (MP) framework for low-latency human action and activity recognition. Central to our methodology is a moving pose descriptor that considers both pose information and differential quantities (speed and acceleration) of the human body joints within a short time window around the current frame. The proposed descriptor is used in conjunction with a modified kNN classifier that considers both the temporal location of a particular frame within the action sequence and the discrimination power of its moving pose descriptor compared to other frames in the training set. The resulting method is non-parametric and enables low-latency recognition, one-shot learning, and action detection in difficult unsegmented sequences. Moreover, the framework is real-time, scalable, and outperforms more sophisticated approaches on challenging benchmarks like MSR-Action3D or MSR-DailyActivities3D.
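The descriptor described in the abstract combines the raw pose with its first and second temporal derivatives over a short window. A minimal sketch of that idea (the central-difference window `w` and the weights `alpha`/`beta` are illustrative assumptions, not the paper's tuned values):

```python
import numpy as np

def moving_pose_descriptor(joints, t, w=2, alpha=0.75, beta=0.6):
    """Sketch of a moving-pose-style frame descriptor.

    joints : (T, J, 3) array of 3D joint positions over T frames.
    t      : index of the current frame.
    Returns the pose at frame t concatenated with finite-difference
    estimates of joint speed and acceleration; alpha and beta weight
    the derivative terms (illustrative values, not the paper's).
    """
    T = joints.shape[0]
    t0, t1 = max(t - w, 0), min(t + w, T - 1)
    pose = joints[t].ravel()
    # Central differences over the window approximate speed and acceleration.
    speed = (joints[t1] - joints[t0]).ravel() / max(t1 - t0, 1)
    accel = (joints[t1] - 2 * joints[t] + joints[t0]).ravel()
    return np.concatenate([pose, alpha * speed, beta * accel])
```

In the paper the resulting per-frame vectors feed a modified kNN classifier; this function only builds the descriptor itself.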


Citations
Proceedings Article•DOI•
07 Jun 2015
TL;DR: This paper proposes an end-to-end hierarchical RNN for skeleton based action recognition, and demonstrates that the model achieves state-of-the-art performance with high computational efficiency.
Abstract: Human actions can be represented by the trajectories of skeleton joints. Traditional methods generally model the spatial structure and temporal dynamics of human skeleton with hand-crafted features and recognize human actions by well-designed classifiers. In this paper, considering that recurrent neural network (RNN) can model the long-term contextual information of temporal sequences well, we propose an end-to-end hierarchical RNN for skeleton based action recognition. Instead of taking the whole skeleton as the input, we divide the human skeleton into five parts according to human physical structure, and then separately feed them to five subnets. As the number of layers increases, the representations extracted by the subnets are hierarchically fused to be the inputs of higher layers. The final representations of the skeleton sequences are fed into a single-layer perceptron, and the temporally accumulated output of the perceptron is the final decision. We compare with five other deep RNN architectures derived from our model to verify the effectiveness of the proposed network, and also compare with several other methods on three publicly available datasets. Experimental results demonstrate that our model achieves the state-of-the-art performance with high computational efficiency.
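The hierarchical idea in the abstract (five body-part subnets whose representations are fused at higher layers) can be illustrated structurally. The joint indices below are a hypothetical 25-joint layout, and simple concatenation stands in for the learned RNN subnets and fusion layers:

```python
import numpy as np

# Hypothetical part assignment for a 25-joint skeleton; the actual
# grouping depends on the sensor's skeleton layout.
PARTS = {
    "trunk":     [0, 1, 2, 3, 20],
    "left_arm":  [4, 5, 6, 7, 21, 22],
    "right_arm": [8, 9, 10, 11, 23, 24],
    "left_leg":  [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}

def hierarchical_features(frame):
    """frame: (25, 3) joint positions for one frame.

    Builds part-level vectors, then fuses limbs with the trunk
    bottom-up, mirroring the hierarchical-fusion structure
    (concatenation stands in for the paper's RNN subnets).
    """
    part = {name: frame[idx].ravel() for name, idx in PARTS.items()}
    upper = np.concatenate([part["left_arm"], part["trunk"], part["right_arm"]])
    lower = np.concatenate([part["left_leg"], part["trunk"], part["right_leg"]])
    return np.concatenate([upper, lower])
```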

1,642 citations


Cites background from "The Moving Pose: An Efficient 3D Ki..."

  • ...[38] propose a moving pose descriptor for capturing postures and skeleton joints....


Proceedings Article•DOI•
21 Jul 2017
TL;DR: This work proposes a new class of LSTM network, Global Context-Aware Attention LSTM (GCA-LSTM), for 3D action recognition, which is able to selectively focus on the informative joints in the action sequence with the assistance of global contextual information.
Abstract: Long Short-Term Memory (LSTM) networks have shown superior performance in 3D human action recognition due to their power in modeling the dynamics and dependencies in sequential data. Since not all joints are informative for action analysis and the irrelevant joints often bring a lot of noise, we need to pay more attention to the informative ones. However, original LSTM does not have strong attention capability. Hence we propose a new class of LSTM network, Global Context-Aware Attention LSTM (GCA-LSTM), for 3D action recognition, which is able to selectively focus on the informative joints in the action sequence with the assistance of global contextual information. In order to achieve a reliable attention representation for the action sequence, we further propose a recurrent attention mechanism for our GCA-LSTM network, in which the attention performance is improved iteratively. Experiments show that our end-to-end network can reliably focus on the most informative joints in each frame of the skeleton sequence. Moreover, our network yields state-of-the-art performance on three challenging datasets for 3D action recognition.

573 citations


Cites methods from "The Moving Pose: An Efficient 3D Ki..."

  • ...[71] proposed a Moving Pose framework in conjunction with a modified kNN classifier for low-latency activity recognition....


Journal Article•DOI•
TL;DR: Proposes a global context-aware attention LSTM for skeleton-based action recognition that is capable of selectively focusing on the informative joints in each frame by using a global context memory cell.
Abstract: Human action recognition in 3D skeleton sequences has attracted a lot of research attention. Recently, long short-term memory (LSTM) networks have shown promising performance in this task due to their strengths in modeling the dependencies and dynamics in sequential data. As not all skeletal joints are informative for action recognition, and the irrelevant joints often bring noise which can degrade the performance, we need to pay more attention to the informative ones. However, the original LSTM network does not have explicit attention ability. In this paper, we propose a new class of LSTM network, global context-aware attention LSTM, for skeleton-based action recognition, which is capable of selectively focusing on the informative joints in each frame by using a global context memory cell. To further improve the attention capability, we also introduce a recurrent attention mechanism, with which the attention performance of our network can be enhanced progressively. Besides, a two-stream framework, which leverages coarse-grained attention and fine-grained attention, is also introduced. The proposed method achieves state-of-the-art performance on five challenging datasets for skeleton-based action recognition.
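The attention mechanism described in this abstract scores each joint against a global context vector and reweights the joints accordingly. A minimal numpy sketch under that reading (the bilinear scoring matrix `W` and the use of a mean feature as the global context are assumptions made for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_joints(joint_feats, global_ctx, W):
    """Sketch of global-context-aware attention over joints.

    joint_feats : (J, d) per-joint features for one frame.
    global_ctx  : (d,) global context vector (e.g., the average of
                  all joint features over the sequence).
    W           : (d, d) scoring matrix (learned in the paper; random here).
    Scores each joint against the global context, softmax-normalises
    the scores, and returns the attention-weighted frame representation.
    """
    scores = joint_feats @ W @ global_ctx      # (J,)
    weights = softmax(scores)                  # informative joints get more mass
    return weights @ joint_feats, weights      # (d,), (J,)
```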

419 citations

Proceedings Article•DOI•
01 Jun 2018
TL;DR: This work collects RGB-D video sequences comprising more than 100K frames of 45 daily hand action categories, involving 26 different objects in several hand configurations, and sees clear benefits of using hand pose as a cue for action recognition compared to other data modalities.
Abstract: In this work we study the use of 3D hand poses to recognize first-person dynamic hand actions interacting with 3D objects. Towards this goal, we collected RGB-D video sequences comprising more than 100K frames of 45 daily hand action categories, involving 26 different objects in several hand configurations. To obtain hand pose annotations, we used our own mo-cap system that automatically infers the 3D location of each of the 21 joints of a hand model via 6 magnetic sensors and inverse kinematics. Additionally, we recorded the 6D object poses and provide 3D object models for a subset of hand-object interaction sequences. To the best of our knowledge, this is the first benchmark that enables the study of first-person hand actions with the use of 3D hand poses. We present an extensive experimental evaluation of RGB-D and pose-based action recognition by 18 baselines/state-of-the-art approaches. The impact of using appearance features, poses, and their combinations is measured, and the different training/testing protocols are evaluated. Finally, we assess how ready the 3D hand pose estimation field is when hands are severely occluded by objects in egocentric views and its influence on action recognition. From the results, we see clear benefits of using hand pose as a cue for action recognition compared to other data modalities. Our dataset and experiments can be of interest to communities of 3D hand pose estimation, 6D object pose, and robotics as well as action recognition.

391 citations


Cites background or methods from "The Moving Pose: An Efficient 3D Ki..."

  • ...Popular approaches include the use of temporal state-space models [17, 70, 74, 75, 86], key-poses [66, 85], hand-crafted pose features [64, 65], and temporal recurrent models [12, 63, 87]....


  • ...We start with descriptor-based methods such as Moving Pose [71] that encodes atomic motion information and [55] who represents poses as points on a Lie group....


  • ...Following common practice in full body-pose action recognition [64, 85], we compensate for anthropomorphic and viewpoint differences by normalizing poses to have the same distance between pairs of joints and defining the wrist as the center of coordinates....


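The normalization step quoted above (equal inter-joint distances across subjects, wrist as the coordinate origin) can be sketched as re-rooting the skeleton and rescaling each bone. The tree encoding and the unit bone length below are illustrative choices, not the paper's exact procedure:

```python
import numpy as np

def normalize_pose(joints, bones, root=0):
    """Sketch of the normalisation described in the quote above.

    joints : (J, 3) joint positions.
    bones  : list of (parent, child) index pairs forming a tree rooted
             at `root` (the wrist in the quote), assumed to be listed
             in topological order from the root outward.
    Re-centres the root at the origin and rescales every bone to unit
    length while preserving bone directions, removing anthropomorphic
    (body-size) differences between subjects.
    """
    out = np.zeros_like(joints, dtype=float)
    centred = joints - joints[root]
    for parent, child in bones:
        v = centred[child] - centred[parent]
        n = np.linalg.norm(v)
        # Keep the bone direction, force unit length (zero-length bones kept).
        out[child] = out[parent] + (v / n if n > 0 else v)
    return out
```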

Journal Article•DOI•
07 Jun 2015
TL;DR: This paper finds that features from different channels could share some similar hidden structures, and proposes a joint learning model to simultaneously explore the shared and feature-specific components as an instance of heterogeneous multi-task learning for RGB-D activity recognition.
Abstract: In this paper, we focus on heterogeneous features learning for RGB-D activity recognition. We find that features from different channels (RGB, depth) could share some similar hidden structures, and then propose a joint learning model to simultaneously explore the shared and feature-specific components as an instance of heterogeneous multi-task learning. The proposed model formed in a unified framework is capable of: 1) jointly mining a set of subspaces with the same dimensionality to exploit latent shared features across different feature channels, 2) meanwhile, quantifying the shared and feature-specific components of features in the subspaces, and 3) transferring feature-specific intermediate transforms (i-transforms) for learning fusion of heterogeneous features across datasets. To efficiently train the joint model, a three-step iterative optimization algorithm is proposed, followed by a simple inference model. Extensive experimental results on four activity datasets have demonstrated the efficacy of the proposed method. A new RGB-D activity dataset focusing on human-object interaction is further contributed, which presents more challenges for RGB-D activity benchmarking.

387 citations


Cites background from "The Moving Pose: An Efficient 3D Ki..."

  • ...human skeleton (3D posture) tracker from single depth image [31], human motions can be effectively captured by the positional dynamics of each individual skeletal joint [7], [13], [23], [44] or the relationship of joint pairs [22], [27], [47] or even their combination [19], [52], [58]....


References
Journal Article•DOI•
TL;DR: Reviews the state-of-the-art in evaluated methods for both classification and detection, analysing whether the methods are statistically different, what they learn from the images, and what they find easy or confusing.
Abstract: The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.

15,935 citations


"The Moving Pose: An Efficient 3D Ki..." refers methods in this paper

  • ...We performed action detection, following precision-recall experimental protocols widely used in object detection from images, such as Pascal VOC Challenge [8] (our overlapping threshold is 0.2 as in [6])....

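The evaluation protocol quoted above matches detected action intervals to ground truth by temporal overlap, in the style of Pascal VOC detection scoring. A rough sketch of such greedy matching with the 0.2 threshold mentioned in the quote:

```python
def temporal_overlap(a, b):
    """Intersection-over-union of two [start, end] frame intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def match_detections(detections, ground_truth, thresh=0.2):
    """Greedy Pascal-VOC-style matching: a detection counts as a true
    positive if it overlaps an as-yet-unmatched ground-truth interval
    by at least `thresh` (0.2 as in the quote above)."""
    matched = set()
    tp = 0
    for det in detections:
        for i, gt in enumerate(ground_truth):
            if i not in matched and temporal_overlap(det, gt) >= thresh:
                matched.add(i)
                tp += 1
                break
    precision = tp / max(len(detections), 1)
    recall = tp / max(len(ground_truth), 1)
    return precision, recall
```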

Proceedings Article•DOI•
20 Jun 2011
TL;DR: This work takes an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem, and generates confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes.
Abstract: We propose a new method to quickly and accurately predict 3D positions of body joints from a single depth image, using no temporal information. We take an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem. Our large and highly varied training dataset allows the classifier to estimate body parts invariant to pose, body shape, clothing, etc. Finally we generate confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes. The system runs at 200 frames per second on consumer hardware. Our evaluation shows high accuracy on both synthetic and real test sets, and investigates the effect of several training parameters. We achieve state of the art accuracy in our comparison with related work and demonstrate improved generalization over exact whole-skeleton nearest neighbor matching.

3,579 citations


"The Moving Pose: An Efficient 3D Ki..." refers methods in this paper

  • ...The 3D skeleton, represented as a set of 3D body joint positions, is available for each frame, being tracked with the method of [21]....


  • ...The accurate real-time tracking of 3D skeletons [21], made possible with the introduction of low-cost RGB-D cameras, led to the development of efficient methods for classification of dance moves [17] and other arbitrary actions [7, 24]....


Proceedings Article•DOI•
16 Jun 2012
TL;DR: An actionlet ensemble model is learnt to represent each action and to capture the intra-class variance, and novel features that are suitable for depth data are proposed.
Abstract: Human action recognition is an important yet challenging task. The recently developed commodity depth sensors open up new possibilities of dealing with this problem but also present some unique challenges. The depth maps captured by the depth cameras are very noisy and the 3D positions of the tracked joints may be completely wrong if serious occlusions occur, which increases the intra-class variations in the actions. In this paper, an actionlet ensemble model is learnt to represent each action and to capture the intra-class variance. In addition, novel features that are suitable for depth data are proposed. They are robust to noise, invariant to translational and temporal misalignments, and capable of characterizing both the human motion and the human-object interactions. The proposed approach is evaluated on two challenging action recognition datasets captured by commodity depth cameras, and another dataset captured by a MoCap system. The experimental evaluations show that the proposed approach achieves superior performance to the state of the art algorithms.

1,578 citations


"The Moving Pose: An Efficient 3D Ki..." refers background or methods in this paper

  • ...[24] uses a Multiclass-MKL with actionlet mining....


  • ...The only classes causing confusion are, in our case, the ones involving interactions with other objects (pick up and throw, hand-catch and hammer), which (as also observed by [24]) may not be easily captured by the skeleton information alone....


  • ...The accurate real-time tracking of 3D skeletons [21], made possible with the introduction of low-cost RGB-D cameras, led to the development of efficient methods for classification of dance moves [17] and other arbitrary actions [7, 24]....


  • ...About 10 skeleton sequences were not used in [24] because of missing data or highly erroneous joint positions....


  • ...The subjects used for training are (1, 3, 5, 7, 9), whereas the rest are used for testing, as in [24]....


Proceedings Article•DOI•
13 Jun 2010
TL;DR: An action graph is employed to model explicitly the dynamics of the actions and a bag of 3D points to characterize a set of salient postures that correspond to the nodes in the action graph to recognize human actions from sequences of depth maps.
Abstract: This paper presents a method to recognize human actions from sequences of depth maps. Specifically, we employ an action graph to model explicitly the dynamics of the actions and a bag of 3D points to characterize a set of salient postures that correspond to the nodes in the action graph. In addition, we propose a simple, but effective projection based sampling scheme to sample the bag of 3D points from the depth maps. Experimental results have shown that over 90% recognition accuracy was achieved by sampling only about 1% of the 3D points from the depth maps. Compared to the 2D silhouette based recognition, the recognition errors were halved. In addition, we demonstrate the potential of the bag of points posture model to deal with occlusions through simulation.
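The sampling step in the abstract reduces each depth map to a small bag of 3D points (about 1% of the points). As a rough illustration only: the paper samples along projected contours, while the sketch below simply subsamples foreground pixels uniformly at random:

```python
import numpy as np

def sample_bag_of_points(depth, frac=0.01, seed=0):
    """Illustrative subsampling of a depth map into a bag of 3D points.

    depth : (H, W) array; 0 = background, >0 = depth value.
    Returns roughly frac * (number of foreground pixels) points as an
    (N, 3) array of [u, v, z] coordinates.
    """
    ys, xs = np.nonzero(depth)                         # foreground pixels
    pts = np.stack([xs, ys, depth[ys, xs]], axis=1).astype(float)
    rng = np.random.default_rng(seed)
    n = max(1, int(frac * len(pts)))
    idx = rng.choice(len(pts), size=n, replace=False)  # uniform subsample
    return pts[idx]
```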

1,437 citations


"The Moving Pose: An Efficient 3D Ki..." refers background or methods in this paper

  • ...Action Graph on Bag of 3D Points [10]: 74.7....


  • ...We use the cross-subject test setting as in [10], where the sequences for half of the subjects are used for training (i....


  • ...Related Work: Traditional research on general action recognition focuses mainly on recognition accuracy using hidden Markov models, and more recently conditional random fields [14, 22], and less on reducing observational latency [1, 10, 20]....


  • ...Neural networks [13], motion templates [15], HMMs [12] and action graphs [10] need sufficient data to learn models....


  • ...Recurrent Neural Network [13]: 42.5; Dynamic Temporal Warping [15]: 54; Hidden Markov Model [12]: 63; Latent-Dynamic CRF [14]: 64.8; Canonical Poses [7]: 65.7; Action Graph on Bag of 3D Points [10]: 74.7; Latent-Dynamic CRF [14] + MP: 74.9; EigenJoints [25]: 81.4; Actionlet Ensemble [24]: 88.2; MP (Ours): 91.7. Our system (see Table 1) improves over the current state-of-the-art by 3.5%....


Proceedings Article•
28 Jun 2011
TL;DR: This work solves the long-outstanding problem of how to effectively train recurrent neural networks on complex and difficult sequence modeling problems which may contain long-term data dependencies and offers a new interpretation of the generalized Gauss-Newton matrix of Schraudolph which is used within the HF approach of Martens.
Abstract: In this work we resolve the long-outstanding problem of how to effectively train recurrent neural networks (RNNs) on complex and difficult sequence modeling problems which may contain long-term data dependencies. Utilizing recent advances in the Hessian-free optimization approach (Martens, 2010), together with a novel damping scheme, we successfully train RNNs on two sets of challenging problems: first, a collection of pathological synthetic datasets which are known to be impossible for standard optimization approaches (due to their extremely long-term dependencies), and second, three natural and highly complex real-world sequence datasets, where we find that our method significantly outperforms the previous state-of-the-art method for training neural sequence models: the Long Short-Term Memory approach of Hochreiter and Schmidhuber (1997). Additionally, we offer a new interpretation of the generalized Gauss-Newton matrix of Schraudolph (2002) which is used within the HF approach of Martens.

635 citations


"The Moving Pose: An Efficient 3D Ki..." refers background in this paper

  • ...Neural networks [13], motion templates [15], HMMs [12] and action graphs [10] need sufficient data to learn models....
