Book ChapterDOI

Human Action Recognition Using Dominant Pose Duplet

06 Jul 2015-pp 488-497
TL;DR: A Bag-of-Words (BoW) based technique for human action recognition in videos containing challenges such as illumination changes, background changes, and camera shake is proposed, based on the gradient-weighted optical flow (GWOF) measure.
Abstract: We propose a Bag-of-Words (BoW) based technique for human action recognition in videos containing challenges such as illumination changes, background changes, and camera shake. We build pose descriptors corresponding to the actions, based on the gradient-weighted optical flow (GWOF) measure, to minimize the noise related to camera shake. The pose descriptors are clustered and stored in a dictionary of poses. We further generate a reduced dictionary, whose words are termed pose duplets. The pose duplets are constructed by a graphical approach, considering the probability of two poses occurring sequentially during an action. Here, the poses of the initial dictionary are considered as the nodes of a weighted directed graph called the duplet graph. The weight of each edge of the duplet graph is calculated from the probability of the destination node of the edge appearing after the source node of the edge. The concatenation of the source and destination pose vectors is called a pose duplet. We rank the pose duplets according to the weight of the edge between them. We form the reduced dictionary from the pose duplets with high edge weights, called dominant pose duplets. We construct the action descriptors for each action using the dominant pose duplets and recognize the actions. The efficacy of the proposed approach is tested on standard datasets.
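
As a rough illustration of the duplet-graph construction described above, the sketch below builds the weighted directed graph from per-frame pose words and keeps the highest-weight edges as dominant pose duplets. The row-normalised edge weight and the `top_k` cut-off are illustrative assumptions; the paper's exact weighting and ranking rule may differ.

```python
# A minimal sketch of the duplet graph, assuming pose words are integer
# cluster IDs (one per frame). The edge weight, normalisation, and top_k
# are illustrative choices, not the authors' exact ones.
import numpy as np

def dominant_pose_duplets(pose_sequences, n_poses, top_k=50):
    """Build the directed duplet graph and keep the highest-weight edges."""
    counts = np.zeros((n_poses, n_poses))
    for seq in pose_sequences:
        for a, b in zip(seq[:-1], seq[1:]):   # sequential pose pairs
            counts[a, b] += 1
    # Edge weight (i -> j): probability that pose j follows pose i.
    row_sums = counts.sum(axis=1, keepdims=True)
    weights = np.divide(counts, row_sums,
                        out=np.zeros_like(counts), where=row_sums > 0)
    # Rank edges by weight; each kept edge (i, j) is one dominant duplet.
    edges = sorted(((weights[i, j], i, j)
                    for i in range(n_poses) for j in range(n_poses)
                    if weights[i, j] > 0), reverse=True)
    return [(i, j) for _, i, j in edges[:top_k]]

# The duplet descriptor itself would be the concatenation of the two
# cluster-centre pose vectors, e.g. np.concatenate([poses[i], poses[j]]).
```
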
Citations
Proceedings ArticleDOI
18 Dec 2016
TL;DR: It is shown that the dense trajectory features based on the proposed GF-STIP descriptors enhance the efficacy of the event recognition system in egocentric videos.
Abstract: This paper proposes an approach for event recognition in egocentric videos using dense trajectories over the Gradient Flow-Space Time Interest Point (GF-STIP) feature. We focus on recognizing events of diverse categories (including indoor and outdoor activities, sports, social activities, and adventures) in egocentric videos. We introduce a dataset with diverse egocentric events, as all the existing egocentric activity recognition datasets consist of indoor videos only. The dataset introduced in this paper contains 102 videos with 9 different events (containing indoor and outdoor videos with varying lighting conditions). We extract Space Time Interest Points (STIP) from each frame of the video. The interest points are taken as the lead pixels, and Gradient-Weighted Optical Flow (GWOF) features are calculated at the lead pixels by multiplying the optical flow measure and the magnitude of the gradient at the pixel, to obtain the GF-STIP feature. We construct pose descriptors with the GF-STIP feature. We use the GF-STIP descriptors for recognizing events in egocentric videos with three different approaches: following a Bag of Words (BoW) model, implementing Fisher Vectors, and obtaining dense trajectories for the videos. We show that the dense trajectory features based on the proposed GF-STIP descriptors enhance the efficacy of the event recognition system in egocentric videos.
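
The GWOF computation described above multiplies the optical flow magnitude by the gradient magnitude at each lead pixel. A minimal OpenCV sketch follows; Farneback flow and Harris-style corners are stand-ins (assumptions) for the paper's exact flow estimator and STIP detector.

```python
# A minimal sketch of GWOF at interest points. Farneback flow and corner
# detection are assumptions standing in for the paper's STIP pipeline.
import cv2
import numpy as np

def gwof_at_interest_points(prev_gray, curr_gray, max_points=200):
    """|optical flow| * |intensity gradient| at the lead pixels."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    flow_mag = cv2.magnitude(flow[..., 0], flow[..., 1])
    gx = cv2.Sobel(curr_gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(curr_gray, cv2.CV_32F, 0, 1)
    grad_mag = cv2.magnitude(gx, gy)
    # Lead pixels: spatial interest points on the current frame.
    pts = cv2.goodFeaturesToTrack(curr_gray, max_points, 0.01, 5)
    if pts is None:
        return np.array([])
    pts = pts.reshape(-1, 2).astype(int)
    return flow_mag[pts[:, 1], pts[:, 0]] * grad_mag[pts[:, 1], pts[:, 0]]
```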

9 citations


Cites background from "Human Action Recognition Using Domi..."

  • ...The unavailability of pose information and large camera shake make the event recognition task challenging....


Posted Content
TL;DR: A 3-dimensional deep CNN is proposed to extract spatio-temporal features, followed by Long Short-Term Memory (LSTM) to recognize human actions; the approach is shown to outperform state-of-the-art deep learning based techniques.
Abstract: We propose a novel scheme for human action recognition in videos, using a 3-dimensional Convolutional Neural Network (3D CNN) based classifier. Traditionally, in deep learning based human activity recognition approaches, either a few random frames or every $k^{th}$ frame of the video is considered for training the 3D CNN, where $k$ is a small positive integer, like 4, 5, or 6. This kind of sampling reduces the volume of the input data, which speeds up training of the network and also avoids over-fitting to some extent, thus enhancing the performance of the 3D CNN model. In the proposed video sampling technique, consecutive $k$ frames of a video are aggregated into a single frame by computing a Gaussian-weighted summation of the $k$ frames. The resulting (aggregated) frame preserves the information better than the conventional approaches and is experimentally shown to perform better. In this paper, a 3D CNN architecture is proposed to extract the spatio-temporal features, followed by a Long Short-Term Memory (LSTM) network to recognize human actions. The proposed 3D CNN architecture is capable of handling videos where the camera is placed at a distance from the performer. Experiments are performed on the KTH and WEIZMANN human action datasets, where the approach is shown to produce results comparable with the state-of-the-art techniques.
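
The Gaussian-weighted frame aggregation can be sketched as below; the centring of the window and the value of sigma are assumptions, since the abstract does not state the exact Gaussian parameters.

```python
# A minimal sketch of Gaussian-weighted frame aggregation. The window
# centring and sigma are assumptions; the paper does not state them here.
import numpy as np

def aggregate_frames(frames, k=5, sigma=1.0):
    """Collapse every k consecutive frames into one weighted sum.

    frames: array of shape (T, H, W) or (T, H, W, C).
    """
    centre = (k - 1) / 2.0
    w = np.exp(-0.5 * ((np.arange(k) - centre) / sigma) ** 2)
    w /= w.sum()                      # keep intensities in range
    out = []
    for start in range(0, len(frames) - k + 1, k):
        chunk = np.asarray(frames[start:start + k], dtype=np.float32)
        out.append(np.tensordot(w, chunk, axes=1))  # weighted sum over time
    return np.stack(out)
```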

7 citations


Cites background from "Human Action Recognition Using Domi..."

  • ...In [11], the effect of background clutter is reduced by multiplying...


  • ...recognition, is optical flow [10], [11], [12], [13]....


Journal ArticleDOI
TL;DR: The proposed unified descriptor is a 168-dimensional vector obtained from each video sequence by statistically analyzing the motion patterns of the 3D joint locations of the human body; the approach has shown its efficacy compared with state-of-the-art techniques.
Abstract: We propose a unified method for recognizing human actions and human-related events in realistic videos. We use an efficient pipeline of (a) a 3D representation of the Improved Dense Trajectory Feature (DTF) and (b) the Fisher Vector (FV). Further, a novel descriptor is proposed, capable of representing human actions and human-related events based on the FV representation of the input video. The proposed unified descriptor is a 168-dimensional vector obtained from each video sequence by statistically analyzing the motion patterns of the 3D joint locations of the human body. The proposed descriptor is trained using a binary Support Vector Machine (SVM) for recognizing human actions or human-related events. We evaluate the proposed approach on two challenging action recognition datasets: the UCF Sports and CMU MoCap datasets. In addition to the two action recognition datasets, the proposed approach is tested on the Hollywood2 event recognition dataset. On all the benchmark datasets, for both action and event recognition, the proposed approach has shown its efficacy compared with state-of-the-art techniques.
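
The abstract does not spell out which statistics make up the 168 dimensions, so the sketch below is only one plausible reading: per-joint motion statistics flattened into a fixed-length vector. With 21 joints the illustrative statistics happen to give 21 × 8 = 168 values, but that correspondence is an assumption, not the paper's stated construction.

```python
# A heavily hedged sketch of a statistical joint-motion descriptor. The
# choice of statistics is illustrative; the paper's exact 168-dimensional
# construction is not given in the abstract.
import numpy as np

def joint_motion_descriptor(joints):
    """joints: array (T, J, 3) of 3D joint locations over T frames."""
    vel = np.diff(joints, axis=0)                 # frame-to-frame motion, (T-1, J, 3)
    speed = np.linalg.norm(vel, axis=2)           # (T-1, J)
    feats = [vel.mean(axis=0), vel.std(axis=0),   # (J, 3) each
             speed.mean(axis=0)[:, None],         # (J, 1)
             speed.max(axis=0)[:, None]]          # (J, 1)
    return np.concatenate(feats, axis=1).ravel()  # J * 8 values
```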

3 citations


Cites background or methods from "Human Action Recognition Using Domi..."

  • ...Row 5 shows the results of applying the proposed motion features on a different fusion scheme, fusing the pose and appearance features as described in [19]....


  • ...In [19], pose and appearance based features are fused to get a hybrid motion feature for action recognition....


  • ...The idea of the pose graph is extended in [19], to emphasize the sequential occurrence of the poses during an action, rather than relying only on the frequency of occurrence of the poses....


Book ChapterDOI
16 Dec 2017
TL;DR: This work proposes a deep learning based technique to classify actions using Long Short-Term Memory (LSTM) networks, and extends the proposed framework with an efficient motion feature to handle significant camera motion.
Abstract: We propose a deep learning based technique to classify actions using Long Short-Term Memory (LSTM) networks. The proposed scheme first learns spatio-temporal features from the video, using an extension of Convolutional Neural Networks (CNN) to 3D. A Recurrent Neural Network (RNN) is then trained to classify each sequence, considering the temporal evolution of the learned features at each time step. Experimental results on the CMU MoCap, UCF 101, and Hollywood 2 datasets show the efficacy of the proposed approach. We extend the proposed framework with an efficient motion feature to enable handling of significant camera motion. The proposed approach outperforms the existing deep models on each dataset.
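
A compact PyTorch sketch of such a 3D-CNN-plus-LSTM pipeline is shown below. The layer sizes and the small conv stack are illustrative assumptions; the paper's exact architecture is not reproduced here.

```python
# A minimal PyTorch sketch of a 3D CNN feeding an LSTM. All layer sizes
# are assumptions for illustration, not the paper's architecture.
import torch
import torch.nn as nn

class Conv3dLSTM(nn.Module):
    def __init__(self, n_classes, hidden=256):
        super().__init__()
        # 3D convolutions learn spatio-temporal features from clips.
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),   # keep the time axis
        )
        self.lstm = nn.LSTM(64 * 4 * 4, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                          # x: (B, 3, T, H, W)
        f = self.conv(x)                           # (B, 64, T, 4, 4)
        f = f.permute(0, 2, 1, 3, 4).flatten(2)    # (B, T, 1024)
        out, _ = self.lstm(f)                      # temporal evolution
        return self.fc(out[:, -1])                 # classify from last step
```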

3 citations

Proceedings ArticleDOI
05 Mar 2021
TL;DR: In this paper, an integrated feature approach using the Histogram of Gradient (HOG) local feature descriptor and Principal Component Analysis (PCA) as a global feature is proposed in order to automatically recognize human activity from a video sequence.
Abstract: Human activity recognition has lately attracted an increasing amount of attention from the research and industry communities. It is a steadily growing research area, finding excellent applications in surveillance, healthcare, and many real-life problems. This paper presents a technique to automatically recognize human activity from a video sequence. An integrated feature approach using the Histogram of Gradient (HOG) local feature descriptor and Principal Component Analysis (PCA) as a global feature is proposed. An optimized Support Vector Machine (SVM) and an Artificial Neural Network (ANN) are used as classifiers. The proposed model is trained and tested on the benchmark KTH dataset, and the results obtained are comparable with existing methods. The proposed technique achieves an activity recognition accuracy of 99.21 percent. The experimental results confirm that the embedded feature approach and the optimization techniques for the classifier improve the performance of human activity recognition.
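
A minimal sketch of the integrated HOG-plus-PCA feature with an SVM classifier follows; the HOG and PCA parameters are assumptions, and the ANN branch and the exact fusion rule are omitted.

```python
# A minimal sketch of HOG (local) + PCA (global) features with an SVM.
# Parameter values are assumptions; the ANN classifier is omitted.
import numpy as np
from skimage.feature import hog
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def frame_features(frames, n_components=64):
    """frames: list of equally sized grayscale frames (2D arrays)."""
    hogs = np.array([hog(f, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for f in frames])
    flat = np.array([f.ravel() for f in frames], dtype=np.float32)
    pca_feats = PCA(n_components=n_components).fit_transform(flat)
    return np.hstack([hogs, pca_feats])    # local + global features

# Hypothetical usage: X = frame_features(gray_frames); y = action labels
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10))
# clf.fit(X_train, y_train); clf.score(X_test, y_test)
```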

2 citations

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
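
The matching pipeline this paper describes can be sketched with OpenCV as below; RANSAC homography estimation stands in for the Hough-clustering and least-squares verification stage, which is an assumption, since OpenCV does not expose that stage directly.

```python
# A minimal OpenCV sketch: SIFT features, ratio-test matching, and a
# geometric verification. RANSAC stands in (an assumption) for the
# paper's Hough clustering + least-squares pose verification.
import cv2
import numpy as np

def match_object(img_obj, img_scene, ratio=0.75):
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(img_obj, None)
    k2, d2 = sift.detectAndCompute(img_scene, None)
    # Two nearest neighbours per feature; keep distinctive matches only.
    matches = cv2.BFMatcher().knnMatch(d1, d2, k=2)
    good = [m for m, n in matches if m.distance < ratio * n.distance]
    if len(good) < 4:
        return None                      # not enough consistent matches
    src = np.float32([k1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H                             # pose of the object in the scene
```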

46,906 citations

Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
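
The design conclusions stated above map directly onto the parameters of a standard HOG implementation; the snippet below shows that mapping using scikit-image, assumed here as a stand-in for the paper's own implementation, with the paper's 64 x 128 pedestrian window.

```python
# The stated design conclusions, expressed as HOG parameters. scikit-image
# is assumed as a stand-in for the paper's own implementation.
import numpy as np
from skimage.feature import hog

window = np.zeros((128, 64))          # one 64x128 detection window
descriptor = hog(
    window,
    orientations=9,                   # fine orientation binning
    pixels_per_cell=(8, 8),           # relatively coarse spatial binning
    cells_per_block=(2, 2),           # overlapping descriptor blocks
    block_norm="L2-Hys",              # local contrast normalisation
)
# The descriptor feeds a linear SVM scored over a sliding window.
```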

31,952 citations

Proceedings ArticleDOI
23 Jun 2008
TL;DR: A new method for video classification that builds upon and extends several recent ideas including local space-time features,space-time pyramids and multi-channel non-linear SVMs is presented and shown to improve state-of-the-art results on the standard KTH action dataset.
Abstract: The aim of this paper is to address recognition of natural human actions in diverse and realistic video settings. This challenging but important subject has mostly been ignored in the past due to several problems one of which is the lack of realistic and annotated video datasets. Our first contribution is to address this limitation and to investigate the use of movie scripts for automatic annotation of human actions in videos. We evaluate alternative methods for action retrieval from scripts and show benefits of a text-based classifier. Using the retrieved action samples for visual learning, we next turn to the problem of action classification in video. We present a new method for video classification that builds upon and extends several recent ideas including local space-time features, space-time pyramids and multi-channel non-linear SVMs. The method is shown to improve state-of-the-art results on the standard KTH action dataset by achieving 91.8% accuracy. Given the inherent problem of noisy labels in automatic annotation, we particularly investigate and show high tolerance of our method to annotation errors in the training set. We finally apply the method to learning and classifying challenging action classes in movies and show promising results.
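
The multi-channel non-linear SVM used here is commonly formulated as a product of per-channel chi-square kernels (equivalently, an exponentiated sum of normalised chi-square distances). The sketch below follows that common formulation as an assumption; strictly, the normaliser should come from training-set distances.

```python
# A minimal sketch of a multi-channel chi-square kernel SVM, following the
# common formulation (an assumption). Channel histograms (e.g. HOG/HOF per
# space-time pyramid cell) are assumed precomputed.
import numpy as np
from sklearn.svm import SVC

def chi2_kernel_matrix(A, B, eps=1e-10):
    """K(x, y) = exp(-D(x, y) / mean_D), with chi-square distance D."""
    d = np.array([[np.sum((a - b) ** 2 / (a + b + eps)) for b in B] for a in A])
    return np.exp(-d / (d.mean() + eps))  # mean_D should come from training data

def multichannel_kernel(channels_a, channels_b):
    # Per-channel kernels are multiplied, i.e. distances add in the exponent.
    K = np.ones((len(channels_a[0]), len(channels_b[0])))
    for A, B in zip(channels_a, channels_b):
        K *= chi2_kernel_matrix(np.asarray(A), np.asarray(B))
    return K

# Hypothetical usage with a precomputed kernel:
# svm = SVC(kernel="precomputed").fit(multichannel_kernel(train_ch, train_ch), y)
```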

3,833 citations

Proceedings ArticleDOI
01 Dec 2013
TL;DR: Dense trajectories were shown to be an efficient video representation for action recognition, achieving state-of-the-art results on a variety of datasets; this paper improves them by taking camera motion into account to correct the trajectories.
Abstract: Recently dense trajectories were shown to be an efficient video representation for action recognition and achieved state-of-the-art results on a variety of datasets. This paper improves their performance by taking into account camera motion to correct them. To estimate camera motion, we match feature points between frames using SURF descriptors and dense optical flow, which are shown to be complementary. These matches are, then, used to robustly estimate a homography with RANSAC. Human motion is in general different from camera motion and generates inconsistent matches. To improve the estimation, a human detector is employed to remove these matches. Given the estimated camera motion, we remove trajectories consistent with it. We also use this estimation to cancel out camera motion from the optical flow. This significantly improves motion-based descriptors, such as HOF and MBH. Experimental results on four challenging action datasets (i.e., Hollywood2, HMDB51, Olympic Sports and UCF50) significantly outperform the current state of the art.
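
The camera-motion compensation step can be sketched as follows; ORB features stand in for SURF (which requires OpenCV's non-free contrib build), and the human-detector masking and trajectory pruning are omitted, so this is an assumption-laden outline rather than the authors' pipeline.

```python
# A minimal OpenCV sketch of homography-based camera-motion compensation.
# ORB stands in (an assumption) for SURF; detector masking is omitted.
import cv2
import numpy as np

def compensate_flow(prev_gray, curr_gray):
    """Estimate camera motion as a homography; subtract it from the flow."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    orb = cv2.ORB_create()
    k1, d1 = orb.detectAndCompute(prev_gray, None)
    k2, d2 = orb.detectAndCompute(curr_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)  # robust to human motion
    # Camera-induced displacement at every pixel, from the homography.
    h, w = prev_gray.shape
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    grid = np.stack([xs, ys], axis=-1)
    warped = cv2.perspectiveTransform(grid.reshape(-1, 1, 2), H).reshape(h, w, 2)
    return flow - (warped - grid)        # residual flow, mostly human motion
```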

3,487 citations

Book ChapterDOI
07 May 2006
TL;DR: A detector for standing and moving people in videos with possibly moving cameras and backgrounds is developed, testing several different motion coding schemes and showing empirically that orientated histograms of differential optical flow give the best overall performance.
Abstract: Detecting humans in films and videos is a challenging problem owing to the motion of the subjects, the camera and the background and to variations in pose, appearance, clothing, illumination and background clutter. We develop a detector for standing and moving people in videos with possibly moving cameras and backgrounds, testing several different motion coding schemes and showing empirically that orientated histograms of differential optical flow give the best overall performance. These motion-based descriptors are combined with our Histogram of Oriented Gradient appearance descriptors. The resulting detector is tested on several databases including a challenging test set taken from feature films and containing wide ranges of pose, motion and background variations, including moving cameras and backgrounds. We validate our results on two challenging test sets containing more than 4400 human examples. The combined detector reduces the false alarm rate by a factor of 10 relative to the best appearance-based detector, for example giving false alarm rates of 1 per 20,000 windows tested at 8% miss rate on our Test Set 1.
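
An oriented histogram over differential optical flow can be sketched as below; this shows the generic idea only, and the paper's specific motion coding variants (and their block structure) are not reproduced.

```python
# A minimal sketch of oriented histograms of differential optical flow.
# This is the generic idea; the paper's exact coding schemes differ.
import cv2
import numpy as np

def flow_orientation_histogram(flow, n_bins=9):
    """Histogram the orientations of the flow field's spatial derivatives."""
    hists = []
    for c in range(2):                   # x and y flow components
        dx = cv2.Sobel(flow[..., c], cv2.CV_32F, 1, 0)
        dy = cv2.Sobel(flow[..., c], cv2.CV_32F, 0, 1)
        mag, ang = cv2.cartToPolar(dx, dy)
        hist, _ = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi),
                               weights=mag)
        hists.append(hist)
    return np.concatenate(hists)  # combined with HOG appearance downstream
```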

1,812 citations