Recognizing Human Action at a Distance in Video by Key Poses
05 Apr 2011 · IEEE Transactions on Circuits and Systems for Video Technology (IEEE) · Vol. 21, Iss. 9, pp. 1228-1241
TL;DR: A graph-theoretic technique recognizes human actions at a distance in video by modeling the visual senses associated with poses, introducing a "meaningful" threshold on a graph-centrality measure that selects key poses for each action type.
Abstract: In this paper, we propose a graph-theoretic technique for recognizing human actions at a distance in video by modeling the visual senses associated with poses. The proposed methodology follows a bag-of-words approach that starts with a large vocabulary of poses (visual words) and derives a refined and compact codebook of key poses using a centrality measure of graph connectivity. We introduce a "meaningful" threshold on the centrality measure that selects key poses for each action type. Our contribution includes a novel pose descriptor based on the histogram of oriented optical flow (HOOF), evaluated in a hierarchical fashion on each video frame. This pose descriptor combines the pose information and the motion pattern of the human performer into a single multidimensional feature vector. We evaluate our methodology on four standard activity-recognition datasets, demonstrating the superiority of our method over the state of the art.
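The two ingredients the abstract names can be pictured concretely: a hierarchical HOOF pose descriptor (here, the whole frame plus a 2x2 grid of sub-windows) and key-pose selection by thresholding a centrality measure on a pose-similarity graph. The sketch below is a minimal illustration, not the authors' code; OpenCV's Farneback optical flow, cosine similarity, degree centrality, and every threshold value are assumptions made for the example.

```python
import cv2
import numpy as np
import networkx as nx

def hoof(flow, bins=8):
    """Histogram of oriented optical flow, magnitude-weighted and normalized."""
    fx, fy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(fx, fy)
    ang = np.arctan2(fy, fx)                       # orientation in [-pi, pi]
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)

def pose_descriptor(prev_gray, cur_gray, bins=8):
    """Hierarchical HOOF: whole frame plus a 2x2 grid of sub-windows."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = cur_gray.shape
    parts = [hoof(flow, bins)]
    for i in (0, 1):
        for j in (0, 1):
            parts.append(hoof(flow[i * h // 2:(i + 1) * h // 2,
                                   j * w // 2:(j + 1) * w // 2], bins))
    return np.concatenate(parts)                   # 5 * bins dimensions

def key_poses(descriptors, sim_threshold=0.8):
    """Key poses = high-centrality nodes of a pose-similarity graph."""
    X = np.asarray(descriptors)
    norms = np.linalg.norm(X, axis=1)
    sim = (X @ X.T) / (norms[:, None] * norms[None, :] + 1e-8)
    G = nx.Graph((i, j) for i in range(len(X)) for j in range(i + 1, len(X))
                 if sim[i, j] > sim_threshold)
    cent = nx.degree_centrality(G)
    if not cent:                                   # no sufficiently similar pair
        return []
    cut = np.mean(list(cent.values()))             # stand-in "meaningful" cut
    return [i for i, c in cent.items() if c >= cut]
```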
Citations
TL;DR: The thrust of this survey is the utilization of depth cameras and inertial sensors together, as these two sensor types are cost-effective, commercially available, and, more significantly, both provide 3D human action data.
Abstract: A number of review or survey articles have previously appeared on human action recognition in which either vision sensors or inertial sensors are used individually. Since each sensor modality has its own limitations, a number of previously published papers have shown that fusing vision and inertial sensor data improves recognition accuracy. This survey article provides an overview of recent investigations in which vision and inertial sensors are used together and simultaneously to perform human action recognition more effectively. The thrust of the survey is the utilization of depth cameras and inertial sensors, as these two types of sensors are cost-effective, commercially available, and, more significantly, both provide 3D human action data. An overview of the components necessary to fuse data from depth and inertial sensors is provided, along with a review of the publicly available datasets in which depth and inertial data are captured simultaneously.
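As a toy illustration of the fusion this survey covers, the sketch below concatenates a depth-derived skeleton feature with simple inertial statistics before classification. Feature-level fusion is only one of the schemes surveyed, and all names, dimensions, and statistics here are assumed for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def inertial_features(accel):
    """Summary statistics of a (T, 3) accelerometer stream."""
    return np.concatenate([accel.mean(0), accel.std(0),
                           np.abs(np.diff(accel, axis=0)).mean(0)])

def fuse(skeleton_feat, accel):
    """Feature-level fusion: concatenate both modalities into one vector."""
    return np.concatenate([skeleton_feat, inertial_features(accel)])

# Toy data standing in for per-clip depth-skeleton and inertial recordings.
rng = np.random.default_rng(0)
skeletons = rng.random((20, 60))           # 20 clips x 60-dim skeleton feature
accels = rng.random((20, 100, 3))          # 20 clips x 100 samples x 3 axes
y = rng.integers(0, 2, 20)                 # toy binary action labels
X = np.stack([fuse(s, a) for s, a in zip(skeletons, accels)])
clf = SVC().fit(X, y)
```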
244 citations
Cites methods from "Recognizing Human Action at a Distance in Video by Key Poses"
TL;DR: STIP-based detectors are robust at detecting interest points in the spatio-temporal domain of a video; related public datasets useful for comparing the performance of various techniques are also summarized.
Abstract: Over the past two decades, human action recognition from video has been an important area of research in computer vision. Its applications include surveillance systems, human-computer interaction, and various real-world settings in which one of the actors is a human being. A number of reviews have been written in the context of human action recognition; however, there is a gap in the literature when it comes to the methodologies of STIP-based detectors for human action recognition. This paper presents a comprehensive review of STIP-based methods for human action recognition. STIP-based detectors are robust at detecting interest points in the spatio-temporal domain of a video. The paper also summarizes related public datasets useful for comparing the performance of various techniques.
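To make the STIP idea concrete, here is a compact, assumption-laden sketch of a Laptev-style space-time interest point detector: a 3D Harris response computed on a grayscale video volume, with local maxima taken as interest points. The smoothing scales and the constant k are illustrative values, not taken from any particular paper in the review.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def stip_response(video, sigma=2.0, tau=1.5, k=0.005):
    """video: (T, H, W) float array. Returns the 3D Harris response."""
    v = gaussian_filter(video, (tau, sigma, sigma))
    gt, gy, gx = np.gradient(v)                    # temporal, vertical, horizontal
    # Second-moment matrix entries, smoothed over an integration scale.
    smooth = lambda a: gaussian_filter(a, (2 * tau, 2 * sigma, 2 * sigma))
    xx, yy, tt = smooth(gx * gx), smooth(gy * gy), smooth(gt * gt)
    xy, xt, yt = smooth(gx * gy), smooth(gx * gt), smooth(gy * gt)
    det = (xx * (yy * tt - yt ** 2) - xy * (xy * tt - xt * yt)
           + xt * (xy * yt - xt * yy))
    trace = xx + yy + tt
    return det - k * trace ** 3

def stip_points(video, thresh=1e-4):
    """Non-maximum suppression over the response volume."""
    r = stip_response(video)
    peaks = (r == maximum_filter(r, size=5)) & (r > thresh)
    return np.argwhere(peaks)                      # (t, y, x) coordinates
```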
120 citations
Cites methods from "Recognizing Human Action at a Distance in Video by Key Poses"
TL;DR: A new performance metric unifying the qualitative and quantitative aspects of performance measures is proposed and tested on several activity-recognition algorithms that participated in the ICPR 2012 HARL competition.
Abstract: Evaluating the performance of computer vision algorithms is classically done by reporting classification error or accuracy, when the problem at hand is the classification of an object in an image, the recognition of an activity in a video, or the categorization and labeling of the image or video. If, in addition, the detection of an item in an image or a video and/or its localization are required, frequently used metrics are Recall and Precision, as well as ROC curves. These metrics give quantitative performance values that are easy to understand and interpret, even by non-experts. An inherent problem, however, is that quantitative performance measures depend on the quality constraints we need to impose on the detection algorithm. In particular, an important quality parameter of these measures is the spatial or spatio-temporal overlap between a ground-truth item and a detected item, which needs to be taken into account when interpreting the results. We propose a new performance metric addressing and unifying the qualitative and quantitative aspects of these performance measures. The performance of a detection and recognition algorithm is illustrated intuitively by performance graphs that present quantitative performance values, such as Recall, Precision and F-Score, as functions of the quality constraints of the detection. To compare the performance of different computer vision algorithms, a single representative performance measure is computed from the graphs by integrating out all quality parameters. The evaluation method can be applied to different types of activity detection and recognition algorithms, and the metric has been tested on several activity recognition algorithms that participated in the ICPR 2012 HARL competition.
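The core idea of integrating out the quality parameter can be sketched in a few lines: compute an F-Score at a sweep of overlap thresholds, then average the curve into a single number. This is a hedged reconstruction of the idea, not the HARL metric itself; the greedy IoU matching and the threshold grid are assumptions.

```python
import numpy as np

def f_score_at(iou_matrix, thresh):
    """iou_matrix[g, d]: overlap of ground-truth item g with detection d.
    Greedy one-to-one matching at a given overlap quality threshold."""
    iou = iou_matrix.copy()
    tp = 0
    while iou.size and iou.max() >= thresh:
        g, d = np.unravel_index(iou.argmax(), iou.shape)
        tp += 1
        iou[g, :] = 0.0                 # each ground truth matched at most once
        iou[:, d] = 0.0                 # each detection matched at most once
    n_gt, n_det = iou_matrix.shape
    prec = tp / n_det if n_det else 0.0
    rec = tp / n_gt if n_gt else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def integrated_f(iou_matrix, thresholds=np.linspace(0.1, 0.9, 9)):
    """Integrate out the quality parameter: mean F-Score over thresholds."""
    return float(np.mean([f_score_at(iou_matrix, t) for t in thresholds]))
```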
63 citations
Cites methods from "Recognizing Human Action at a Distance in Video by Key Poses"
TL;DR: A Bag of Expression (BoE) framework, based on the bag-of-words method, for recognizing human action in simple and realistic scenarios; it outperforms existing Bag-of-Words-based approaches when evaluated under the same protocols.
Abstract: The Bag of Words (BoW) approach has been widely used for human action recognition in recent state-of-the-art methods. In this paper, we introduce what we call a Bag of Expression (BoE) framework, based on the bag-of-words method, for recognizing human action in simple and realistic scenarios. The proposed approach includes space-time neighborhood information in addition to visual words, with the aim of enhancing the existing strengths of the BoW approach, such as view independence, scale invariance, and occlusion handling. Because BoE builds expressions from independent pairs of neighbors, it is tolerant to occlusion and capable of handling view independence to some extent in realistic scenarios. Our main contribution is a class-specific visual-word extraction approach that establishes a relationship between the extracted visual words in both the space and time dimensions. Finally, we carry out a set of experiments to optimize different parameters and compare performance with recent state-of-the-art methods. Our approach outperforms existing Bag-of-Words-based approaches when evaluated with the same performance evaluation methods. We tested it on four publicly available human action recognition datasets, i.e., UCF-Sports, KTH, UCF11, and UCF50, achieving average accuracies of 97.3%, 99.5%, 96.7%, and 93.42%, respectively.
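A loose sketch of the "expression" idea as described: pair each visual-word occurrence with its spatio-temporal neighbors and histogram the resulting word pairs. The neighborhood radius, the number of neighbors, and all names below are assumptions, not the authors' exact construction.

```python
import numpy as np
from collections import Counter
from scipy.spatial import cKDTree

def bag_of_expressions(points, words, vocab_size, radius=20.0, k=2):
    """points: (N, 3) array of (x, y, t); words: (N,) visual-word ids."""
    tree = cKDTree(points)
    pairs = Counter()
    for i, p in enumerate(points):
        # k nearest neighbors within the radius form expressions with point i.
        _, idx = tree.query(p, k=k + 1, distance_upper_bound=radius)
        for j in idx[1:]:                  # idx[0] is the point itself
            if j < len(points):            # query pads missing neighbors with N
                pairs[(words[i], words[j])] += 1
    hist = np.zeros(vocab_size * vocab_size)
    for (a, b), c in pairs.items():
        hist[a * vocab_size + b] = c
    return hist / (hist.sum() + 1e-8)
```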
34 citations
Cites background from "Recognizing Human Action at a Distance in Video by Key Poses"
TL;DR: A sparse-representation-based dictionary learning technique addresses dance classification as a new problem in computer vision, together with a new action descriptor for representing a dance video that overcomes a limitation of the "Bag-of-Words" model.
Abstract: In this paper, we address an interesting application of computer vision techniques, namely the classification of Indian Classical Dance (ICD). To the best of our knowledge, the problem has not previously been addressed in the computer vision domain. To deal with it, we use a sparse-representation-based dictionary learning technique. First, we represent each frame of a dance video by a pose descriptor based on the histogram of oriented optical flow (HOOF), computed in a hierarchical manner. The pose basis is learned using an online dictionary learning technique. Finally, each video is represented sparsely as a dance descriptor by pooling the pose descriptors of all its frames. Dance videos are then classified using a support vector machine (SVM) with an intersection kernel. Our contributions here are twofold: first, to address dance classification as a new problem in computer vision, and second, to present a new action descriptor for representing a dance video that overcomes a limitation of the "Bag-of-Words" model. We tested our algorithm on our own ICD dataset, created from videos collected from YouTube, achieving an accuracy of 86.67%. Since we also propose a new action descriptor, we tested the algorithm on the well-known KTH dataset, where its performance is comparable to the state of the art.
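The classification stage described here, a histogram-intersection-kernel SVM over pooled per-frame descriptors, can be sketched with scikit-learn's precomputed-kernel interface. The mean pooling and the function names are assumptions; the paper's own pooling of sparse codes may differ.

```python
import numpy as np
from sklearn.svm import SVC

def intersection_kernel(A, B):
    """K[i, j] = sum_k min(A[i, k], B[j, k]) for histogram-like features."""
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

def video_descriptor(frame_descriptors):
    """Pool per-frame pose descriptors (e.g. sparse codes) into one vector."""
    return np.asarray(frame_descriptors).mean(axis=0)

# X_train, X_test: rows are pooled video descriptors; y_train: action labels.
# clf = SVC(kernel="precomputed")
# clf.fit(intersection_kernel(X_train, X_train), y_train)
# pred = clf.predict(intersection_kernel(X_test, X_train))
```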
31 citations
Cites background or methods from "Recognizing Human Action at a Distance in Video by Key Poses"
References
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
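The HOG descriptor itself is standard, so a short sketch suffices, using scikit-image's implementation with Dalal-Triggs-style parameters (fine orientation bins, coarse cells, overlapping normalized blocks); skimage's built-in astronaut image stands in for a detection window.

```python
from skimage import data
from skimage.color import rgb2gray
from skimage.feature import hog

image = rgb2gray(data.astronaut())         # stand-in for a detection window
features = hog(image,
               orientations=9,             # fine orientation binning
               pixels_per_cell=(8, 8),     # relatively coarse spatial binning
               cells_per_block=(2, 2),     # overlapping block normalization
               block_norm="L2-Hys")        # high-quality local contrast norm
print(features.shape)
```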
28,803 citations
"Recognizing Human Action at a Dista..." refers background or result in this paper
Book
TL;DR: Probability distributions, linear models for regression and classification, neural networks, kernel methods, graphical models, mixture models and EM, sampling methods, continuous latent variables, and sequential data are studied.
Abstract: Probability Distributions · Linear Models for Regression · Linear Models for Classification · Neural Networks · Kernel Methods · Sparse Kernel Machines · Graphical Models · Mixture Models and EM · Approximate Inference · Sampling Methods · Continuous Latent Variables · Sequential Data · Combining Models
22,762 citations
Book
TL;DR: This book presents mathematical representations of social networks in the social and behavioral sciences, covering centrality and related actor and group measures, structural properties of networks, roles and positions, and dyadic and triadic interaction models.
Abstract: Part I. Introduction: Networks, Relations, and Structure: 1. Relations and networks in the social and behavioral sciences. 2. Social network data: collection and application. Part II. Mathematical Representations of Social Networks: 3. Notation. 4. Graphs and matrices. Part III. Structural and Locational Properties: 5. Centrality, prestige, and related actor and group measures. 6. Structural balance, clusterability, and transitivity. 7. Cohesive subgroups. 8. Affiliations, co-memberships, and overlapping subgroups. Part IV. Roles and Positions: 9. Structural equivalence. 10. Blockmodels. 11. Relational algebras. 12. Network positions and roles. Part V. Dyadic and Triadic Methods: 13. Dyads. 14. Triads. Part VI. Statistical Dyadic Interaction Models: 15. Statistical analysis of single relational networks. 16. Stochastic blockmodels and goodness-of-fit indices. Part VII. Epilogue: 17. Future directions.
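The centrality measures catalogued in Part III are the kind the key-pose paper borrows for its codebook refinement. A small networkx example on a classic social network shows the measures in action; applying them to a pose-similarity graph rather than a social one is the paper's twist.

```python
import networkx as nx

G = nx.karate_club_graph()                 # a classic small social network
degree = nx.degree_centrality(G)           # normalized degree
eigen = nx.eigenvector_centrality(G)       # influence via important neighbors
between = nx.betweenness_centrality(G)     # brokerage between groups
top = max(eigen, key=eigen.get)
print(f"most central actor: {top}, eigenvector centrality {eigen[top]:.3f}")
```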
17,092 citations
TL;DR: A Technometrics review record of Bishop's Pattern Recognition and Machine Learning.
Abstract: (2007). Pattern Recognition and Machine Learning. Technometrics: Vol. 49, No. 3, pp. 366-366.
16,640 citations
"Recognizing Human Action at a Dista..." refers methods in this paper