A Survey on Deep Learning Based Approaches for Action and Gesture Recognition in Image Sequences
Summary
Introduction
- A survey on current deep learning methodologies for action and gesture recognition in image sequences.
- The authors introduce a taxonomy that summarizes important aspects of deep learning for approaching both tasks.
- In 1997, these efforts led to the development of the long short-term memory (LSTM) [40] cell for RNNs.
- The amount of research that has been generated in these topics within the last few years is astounding.
- The remainder of this paper is organized as follows.
II. TAXONOMY
- Fig. 1 illustrates a taxonomy of the main works performing action and gesture recognition using deep learning approaches.
- Note that with recognition the authors refer to either classification of pre-segmented video segments or localization of actions in long untrimmed videos.
A. Architectures
- The most crucial challenge in deep-based human action and gesture recognition is how to deal with the temporal dimension.
- Based on that, the authors categorize approaches into three groups: 3D convolutions, pre-computed motion features, and temporal sequence modeling.
- The third group combines a 2D (or 3D) CNN applied at individual (or stacks of) frames with a temporal sequence modeling.
- The Recurrent Neural Network (RNN) [26] is one of the most used networks for this task; it can take temporal data into account through recurrent connections in its hidden layers.
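The recurrence described above can be sketched with a toy NumPy implementation. All dimensions, weights, and inputs below are illustrative stand-ins, not taken from any surveyed model:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: the hidden state carries temporal context forward."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 8, 4                     # 5 frames, 8-dim frame features, 4 hidden units
W_xh = rng.normal(size=(d_in, d_h)) * 0.1  # input-to-hidden weights
W_hh = rng.normal(size=(d_h, d_h)) * 0.1   # recurrent hidden-to-hidden weights
b_h = np.zeros(d_h)

h = np.zeros(d_h)
for x_t in rng.normal(size=(T, d_in)):     # per-frame features (e.g. CNN outputs)
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)  # (4,)
```

The final hidden state `h` summarizes the whole sequence; a classifier on top of it (or on all per-step states) yields the action/gesture prediction.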
B. Fusion strategies
- Information fusion is common in deep learning methods for action and gesture recognition.
- At times, fusion is used to combine the information from parts of a segmented video sequence [51, 115], although it is more common to fuse information from multiple cues (e.g. RGB and motion, depth, and/or audio) [32], as well as to combine models trained with different data samples and learning parameters [68].
- There are three main variants for information fusion in deep learning models: early (before the data is fed into the model, or the model fuses information directly from multiple sources), late (the outputs of deep learning models are combined), and middle (intermediate layers fuse information) fusion [68, 69].
- An example of the latter is shown in Fig. 2. Modifications and variants of these schemes have been proposed as well, for instance, see the variants introduced in [51] for fusing information in the temporal dimension.
- Moreover, ensembles or stacked networks are also considered as fusion strategies [115, 105, 68].
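The early/late distinction can be illustrated with a minimal NumPy sketch; the features and class scores below are invented for illustration, not drawn from any cited work:

```python
import numpy as np

def early_fusion(rgb_feat, flow_feat):
    """Early fusion: concatenate per-modality features before the classifier."""
    return np.concatenate([rgb_feat, flow_feat])

def late_fusion(rgb_scores, flow_scores, w=0.5):
    """Late fusion: combine per-modality model outputs (here, a weighted average)."""
    return w * rgb_scores + (1 - w) * flow_scores

rgb_feat, flow_feat = np.ones(3), np.zeros(3)
fused = early_fusion(rgb_feat, flow_feat)           # 6-dim joint feature vector

rgb_scores  = np.array([0.7, 0.2, 0.1])             # hypothetical class scores
flow_scores = np.array([0.2, 0.6, 0.2])
avg = late_fusion(rgb_scores, flow_scores)
print(fused.shape, int(np.argmax(avg)))  # (6,) 0
```

Middle fusion would instead merge intermediate layer activations, which requires access to the networks' internals rather than only their inputs or outputs.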
D. Challenges
- Every year computer vision organizations arrange competitions providing useful annotated datasets.
- Table V shows five main challenge series in computer vision.
- For each, the authors report the year in which it took place, the name of the dataset along with the task to be faced (either action- or gesture-related), the associated event’s name, the winning participant, and the most recent results on the challenge’s associated dataset.
III. ACTION/ACTIVITY RECOGNITION
- This section reviews deep methods to address action recognition, divided by how they treat the temporal dimension: 3D convolutions, pre-computed motion features, or temporal modeling.
- The larger number of parameters of 3D CNNs w.r.t. 2D models makes them harder to train.
- Other authors focused on further improving accuracy of 3D CNNs. [32] performs 3D convolutions over stacks of optical flow maps. [95] uses multiple 3D CNNs in a multi-stage (proposal generation, classification, and fine-grained localization) framework for temporal action localization in long untrimmed videos.
- The authors also find 3D CNN models being combined with sequence modeling methods [7] or hand-crafted feature descriptors (VLAD [30] or iDTs [129]).
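To make the temporal dimension concrete, here is a naive single-channel 3D convolution in NumPy. Real 3D CNNs add input/output channels, learned kernels, strides, and optimized GPU kernels, so this is only a sketch of the sliding spatiotemporal window:

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D convolution over a (time, height, width) volume."""
    T, H, W = clip.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # The kernel spans space *and* time, unlike a 2D convolution.
                out[i, j, k] = np.sum(clip[i:i+t, j:j+h, k:k+w] * kernel)
    return out

clip = np.ones((8, 16, 16))        # toy 8-frame grayscale clip
kernel = np.full((3, 3, 3), 1/27)  # averaging kernel across space and time
out = conv3d_valid(clip, kernel)
print(out.shape)  # (6, 14, 14)
```

The output loses `kernel_size - 1` steps per dimension, including time, which is why stacking many 3D layers shrinks the temporal extent unless padding is used.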
B. Motion-based features
- Neural networks and CNNs based on hand and body pose estimation as well as motion features have been widely applied for gesture recognition.
- For gesture style recognition in biometrics, [126] proposes a two-stream (spatio-temporal) CNN.
- The authors use raw depth data as the input of the spatial network and optical flow as the input of the temporal one.
- For articulated human pose estimation in videos, the authors of [43] propose a convolutional network architecture for estimating the 2D locations of human joints in video, with an RGB image and a set of motion features as the network’s input.
- The authors of [117] use three representations for gesture recognition: dynamic depth image (DDI), dynamic depth normal image (DDNI), and dynamic depth motion normal image.
D. Deep learning with fusion strategies
- Some methods have used diverse fusion schemes to improve the performance of action recognition. [37] proposes a novel Subdivision-Fusion Model (SFM), where features extracted with a CNN are clustered and grouped into subcategories. [22] learns an end-to-end hierarchical RNN using skeleton data divided into five parts, each of which is fed into a different network.
- The final decision is taken by a single-layer network. [99] faces the problem of first person action recognition using a multi-stream CNN (ego-CNN, temporal, and spatial).
- [119] focuses on the changes that an action brings into the environment and proposes a siamese CNN architecture to fuse precondition and effect information from the environment. [20] proposes a CNN which uses mid-level discriminative visual elements.
- The method, called DeepPattern, is able to learn discriminative patches by exploring human body parts as well as scene context. [76] proposes DeepConvLSTM, based on convolutional and LSTM recurrent units, which is suitable for multimodal wearable sensors.
IV. GESTURE RECOGNITION
- Gesture recognition is mainly driven by the areas of human-computer, human-machine, and human-robot interaction.
A. 3D Convolutional Neural Networks
- Several 3D CNNs have been proposed for gesture recognition, most notably [64, 41, 63]. [41] proposes a 3D CNN for sign language recognition.
- The CNN automatically learns a representation from raw video, and processes multimodal information (RGB-D+Skeleton data).
- Similar in spirit, [63] introduces a 3D CNN for driver hand gesture recognition from depth and intensity data. [64] extends a 3D CNN with a recurrent mechanism for detection and classification of dynamic hand gestures.
- It consists of a 3D-CNN for spatiotemporal feature extraction, followed by a recurrent layer for global temporal modeling and a softmax layer for predicting class-conditional gesture probabilities.
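The final softmax stage of such a pipeline can be sketched as follows; the recurrent-layer outputs are replaced here with random stand-ins, and all dimensions are illustrative:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over class scores."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
T, d_h, n_classes = 10, 16, 5              # 10 time steps, 16 hidden units, 5 gesture classes
h_seq = rng.normal(size=(T, d_h))          # stand-in for 3D-CNN + recurrent-layer outputs
W_out = rng.normal(size=(d_h, n_classes)) * 0.1

# Classify from the last hidden state, which summarizes the whole gesture.
probs = softmax(h_seq[-1] @ W_out)
print(probs.shape)  # (5,)
```

The output is a proper probability distribution over gesture classes, matching the "class-conditional gesture probabilities" role of the softmax layer described above.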
D. Deep Learning with fusion strategies
- Multimodality has been widely exploited for gesture recognition.
- [124] proposes a semi-supervised hierarchical dynamic framework based on a HMM for simultaneous gesture segmentation and recognition using skeleton joint information, depth and RGB images.
- Separate CNNs are considered for each modality at the beginning of the model structure with increasingly shared layers and a final prediction layer.
- The authors exploited early and middle fusion methods to integrate the models. [54] proposes a CNN that learns to score pairs of input images and human poses.
- The authors then compute the score function as the dot product between the two embeddings; i.e. late fusion.
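That dot-product scoring can be sketched in a few lines; the embeddings below are invented for illustration, not taken from [54]:

```python
import numpy as np

def score(image_emb, pose_emb):
    """Compatibility score as the dot product of two learned embeddings."""
    return float(image_emb @ pose_emb)

img = np.array([0.2, 0.9, 0.1])             # hypothetical image embedding
matching_pose   = np.array([0.1, 1.0, 0.0]) # pose that agrees with the image
mismatched_pose = np.array([1.0, 0.0, 0.5]) # pose that does not

print(score(img, matching_pose) > score(img, mismatched_pose))  # True
```

Since each branch is encoded independently and only the final scores interact, this is a late-fusion design: the two networks can even be trained or swapped separately.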
V. DISCUSSION
- The authors presented a comprehensive overview of deep-based models for action and gesture recognition.
- It has also been shown that training networks on precomputed motion features is an effective way to save them from implicitly learning motion features.
- Taking into account the full temporal scale results in a huge number of weights to learn.
- Yet another trick to improve the result of deep-based models is data fusion.
- One valuable cue is the spatial structure of actions/gestures. [112] takes advantage of iDTs to pool relevant CNN features along trajectories in video frames. [12] takes advantage of human body spatial constraints by aggregating convolutional activations of a 3D CNN into descriptors based on joint positions.
Frequently Asked Questions (18)
Q2. What is the valuable cue in deep learning?
One valuable cue is the spatial structure of actions/gestures. [112] takes advantage of iDTs to pool relevant CNN features along trajectories in video frames. [12] takes advantage of human body spatial constraints by aggregating convolutional activations of a 3D CNN into descriptors based on joint positions.
Q3. What is the used network for this task?
The Recurrent Neural Network (RNN) [26] is one of the most used networks for this task; it can take temporal data into account through recurrent connections in its hidden layers.
Q4. What are the main reasons why the authors are focusing on deep learning?
The authors anticipate deep learning will prevail in emerging applications/areas like social signal processing, affective computing, and personality analysis, among others.
Q5. What are some other successful extensions of RNN in recognizing human actions?
Bidirectional RNN (BRNN) [83], Hierarchical RNN (H-RNN) [22], and Differential RNN (D-RNN) [106] are some other successful extensions of RNN in recognizing human actions.
Q6. What is the common way to combine a recurrent model with a localized?
Contextual cues have also been considered for action/gesture recognition. [4] proposes a novel multi-stage recurrent architecture consisting of two stages: in the first stage, the model focuses on global context-aware features, and then combines the resulting representation with the localized, action-aware ones. [46] enriches its motion representation by encoding a set of 15,000 objects from ImageNet and computing their likelihood in frames.
Q7. What is the main problem of the authors?
Other authors focused on further improving accuracy of 3D CNNs. [32] performs 3D convolutions over stacks of optical flow maps. [95] uses multiple 3D CNNs in a multi-stage (proposal generation, classification, and fine-grained localization) framework for temporal action localization in long untrimmed videos.
Q8. What are the main variants for information fusion in deep learning?
There are three main variants for information fusion in deep learning models: early (before the data is fed into the model, or the model fuses information directly from multiple sources), late (the outputs of deep learning models are combined), and middle (intermediate layers fuse information) fusion [68, 69].
Q9. What is the main problem with the NKTM?
R-NKTM is learned using bag-of-features from dense trajectories of synthetic 3D human models and generalizes to real videos of human actions.
Q10. What is the main problem with the R-NKTM?
[112] pools and normalizes CNN feature maps along improved dense trajectories. [78] concatenates iDTs (HOG, HOF, MBHx, MBHy descriptors with fisher vector encoding) and CNN feature (VGG19) descriptors. [86] presents a Robust Nonlinear Knowledge Transfer Model (R-NKTM) based on a deep fully-connected network that transfers human actions from any view to a canonical one.
Q11. What are the main applications of deep learning?
Regarding applications, deep learning techniques have been successfully used in traditional ones (e.g. surveillance, health care, robotics), improving performance in action and gesture recognition for human computer-robot or -machine interaction.
Q12. How does the winner method get the accuracy on UCF101?
[135] achieved the best accuracy on UCF101 by using trajectory pooling to pool the convolutional features extracted from the optical flow nets of Two-Stream ConvNets and the frame-diff layers of the spatial network into local descriptors.
Q13. How does the winner method achieve good performance?
[105] achieves good performance by using a two-stream network (RGB and motion) with extended temporal resolution with respect to previous works (from 16 to 60 frames).
Q14. What are the main problems that will receive attention in the next years?
As such, the authors envision newer problems like early recognition [28], multi-task learning [127], captioning, recognition from low resolution sequences [66] and lifelog devices [87] will receive attention in the next years.
Q15. What is the problem with the weights of a 3D CNN?
To alleviate this problem, [61] initializes the weights of a 3D CNN by using 2D weights learned from ImageNet, while [102] proposes a 3D CNN (FstCN) that factorizes the 3D convolutional kernel learning as a sequential process of learning 2D spatial and 1D temporal kernels in different layers.
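The parameter saving from factorizing a full 3D kernel into a 2D spatial plus a 1D temporal kernel, as in FstCN, can be checked with simple arithmetic. The channel counts below are illustrative, and the exact layer shapes in [102] may differ:

```python
def conv3d_params(c_in, c_out, k):
    """Parameters of a full k x k x k 3D convolution layer (bias omitted)."""
    return c_in * c_out * k * k * k

def factorized_params(c_in, c_out, k):
    """A k x k spatial 2D convolution followed by a length-k temporal 1D convolution."""
    return c_in * c_out * k * k + c_out * c_out * k

full = conv3d_params(64, 64, 3)        # 64*64*27 = 110,592
fact = factorized_params(64, 64, 3)    # 36,864 + 12,288 = 49,152
print(full, fact, fact < full)
```

For typical kernel sizes the factorized form needs well under half the parameters, which is one reason it trains more easily than the full 3D kernel.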
Q16. What is the trick to improve the performance of deep learning?
To address this problem and decrease the number of weights, a good trick is to decrease the spatial resolution while increasing the temporal length.
Q17. What is the final decision taken by a single-layer network?
The final decision is taken by a single-layer network. [99] faces the problem of first person action recognition using a multi-stream CNN (ego-CNN, temporal, and spatial).
Q18. What are the main reasons for the use of 3D CNNs?
The authors also find 3D CNN models being combined with sequence modeling methods [7] or hand-crafted feature descriptors (VLAD [30] or iDTs [129]).