Proceedings ArticleDOI

A Survey on Deep Learning Based Approaches for Action and Gesture Recognition in Image Sequences

TL;DR: A taxonomy that summarizes important aspects of deep learning for approaching both action and gesture recognition in image sequences is introduced, and the main works proposed so far are summarized.
Abstract: The interest in action and gesture recognition has grown considerably in the last years. In this paper, we present a survey on current deep learning methodologies for action and gesture recognition in image sequences. We introduce a taxonomy that summarizes important aspects of deep learning for approaching both tasks. We review the details of the proposed architectures, fusion strategies, main datasets, and competitions. We summarize and discuss the main works proposed so far with particular interest in how they treat the temporal dimension of the data, discussing their main features and identifying opportunities and challenges for future research.

Summary (3 min read)

Introduction

  • A survey on current deep learning methodologies for action and gesture recognition in image sequences.
  • The authors introduce a taxonomy that summarizes important aspects of deep learning for approaching both tasks.
  • In 1997, these efforts led to the development of long short-term memory (LSTM) [40] cells for RNNs.
  • The amount of research that has been generated in this topic within the last few years is astounding.
  • The remainder of this paper is organized as follows.

II. TAXONOMY

  • Fig. 1 illustrates a taxonomy of the main works performing action and gesture recognition using deep learning approaches.
  • Note that by recognition the authors refer to either the classification of pre-segmented video segments or the localization of actions in long untrimmed videos.

A. Architectures

  • The most crucial challenge in deep-based human action and gesture recognition is how to deal with the temporal dimension.
  • Based on that, the authors categorize approaches into three different groups.
  • The third group combines a 2D (or 3D) CNN applied at individual (or stacks of) frames with a temporal sequence modeling.
  • Recurrent Neural Network (RNN) [26] is one of the most used networks for this task, which can take into account the temporal data using recurrent connections in hidden layers.

B. Fusion strategies

  • Information fusion is common in deep learning methods for action and gesture recognition.
  • At times, fusion is used to combine the information from parts of a segmented video sequence [51, 115], although it is more common to fuse information from multiple cues (e.g. RGB and motion, depth, and/or audio) [32], as well as to combine models trained with different data samples and learning parameters [68].
  • There are three main variants of information fusion in deep learning models: early (the data is fused before being fed into the model, or the model ingests information directly from multiple sources), late (the outputs of deep learning models are combined) and middle (intermediate layers fuse information) fusion [68, 69].
  • An example of the latter is shown in Fig. 2. Modifications and variants of these schemes have been proposed as well, for instance, see the variants introduced in [51] for fusing information in the temporal dimension.
  • Moreover, ensembles or stacked networks are also considered as fusion strategies [115, 105, 68].

D. Challenges

  • Every year computer vision organizations arrange competitions providing useful annotated datasets.
  • Table V shows 5 main challenge series in computer vision.
  • For each, the authors report the year in which it took place, the name of the dataset along with the task to be faced (either action- or gesture-related), the associated event’s name, the winning participant, and the most recent results on the challenge’s associated dataset.

III. ACTION/ACTIVITY RECOGNITION

  • This section reviews deep methods to address action recognition, divided by how they treat the temporal dimension: 3D convolutions, pre-computed motion features, or temporal modeling.
  • Their larger number of parameters w.r.t. 2D models makes them harder to train.
  • Other authors focused on further improving accuracy of 3D CNNs. [32] performs 3D convolutions over stacks of optical flow maps. [95] uses multiple 3D CNNs in a multi-stage (proposal generation, classification, and fine-grained localization) framework for temporal action localization in long untrimmed videos.
  • The authors also find 3D CNN models being combined with sequence modeling methods [7] or hand-crafted feature descriptors (VLAD [30] or iDTs [129]).

B. Motion-based features

  • Neural networks and CNNs based on hand and body pose estimation as well as motion features have been widely applied for gesture recognition.
  • For gesture style recognition in biometrics, [126] proposes a two-stream (spatio-temporal) CNN.
  • The authors use raw depth data as the input of the spatial network and optical flow as the input of the temporal one.
  • For articulated human pose estimation in videos the authors of [43] propose a Convolutional Network architecture for estimating the 2D location of human joints in video, with an RGB image and a set of motion features as the input data of this network.
  • The authors of [117] use three representations, the dynamic depth image (DDI), the dynamic depth normal image (DDNI) and the dynamic depth motion normal image (DDMNI), for gesture recognition.

D. Deep learning with fusion strategies

  • Some methods have used diverse fusion schemes to improve the performance of action recognition. [37] proposes a novel Subdivision-Fusion Model (SFM), where features extracted with a CNN are clustered and grouped into subcategories. [22] learns an end-to-end hierarchical RNN using skeleton data divided into five parts, each of which is fed into a different network.
  • The final decision is taken by a single-layer network. [99] faces the problem of first-person action recognition using a multi-stream CNN (ego-CNN, temporal, and spatial).
  • [119] focuses on the changes that an action brings into the environment and proposes a siamese CNN architecture to fuse precondition and effect information from the environment. [20] proposes a CNN which uses mid-level discriminative visual elements.
  • The method, called DeepPattern, is able to learn discriminative patches by exploring human body parts as well as scene context. [76] proposes DeepConvLSTM, based on convolutional and LSTM recurrent units, which is suitable for multimodal wearable sensors.

IV. GESTURE RECOGNITION

  • Gesture recognition in videos is mainly driven by the areas of human-computer, human-machine, and human-robot interaction.

A. 3D Convolutional Neural Networks

  • Several 3D CNNs have been proposed for gesture recognition, most notably [64, 41, 63]. [41] proposes a 3D CNN for sign language recognition.
  • The CNN automatically learns a representation from raw video, and processes multimodal information (RGB-D+Skeleton data).
  • Similar in spirit, [63] introduces a 3D CNN for driver hand gesture recognition from depth and intensity data. [64] extends a 3D CNN with a recurrent mechanism for detection and classification of dynamic hand gestures.
  • It consists of a 3D-CNN for spatiotemporal feature extraction, followed by a recurrent layer for global temporal modeling and a softmax layer for predicting class-conditional gesture probabilities.

D. Deep Learning with fusion strategies

  • Multimodality has been widely exploited for gesture recognition.
  • [124] proposes a semi-supervised hierarchical dynamic framework based on an HMM for simultaneous gesture segmentation and recognition using skeleton joint information, depth and RGB images.
  • Separate CNNs are considered for each modality at the beginning of the model structure with increasingly shared layers and a final prediction layer.
  • The authors exploited early and middle fusion methods to integrate the models. [54] proposes a CNN that learns to score pairs of input images and human poses.
  • The authors then calculate the score function as the dot product between the two embeddings, i.e., late fusion.

V. DISCUSSION

  • The authors presented a comprehensive overview of deep-based models for action and gesture recognition.
  • It has also been shown that training networks on precomputed motion features is an effective way to spare them from having to learn motion features implicitly.
  • Taking into account the full temporal scale results in a huge number of weights to learn.
  • Yet another trick to improve the result of deep-based models is data fusion.
  • One valuable cue is the spatial structure of actions/gestures. [112] takes advantage of iDTs to pool relevant CNN features along trajectories in video frames. [12] takes advantage of human body spatial constraints by aggregating convolutional activations of a 3D CNN into descriptors based on joint positions.


A survey on deep learning based approaches for
action and gesture recognition in image sequences

Maryam Asadi-Aghbolaghi (1,2,3), Albert Clapés (2,3), Marco Bellantonio (4), Hugo Jair Escalante (5),
Víctor Ponce-López (2,3,6), Xavier Baró (6), Isabelle Guyon (7), Shohreh Kasaei (1), Sergio Escalera (2,3)

(1) Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
(2) Department of Mathematics and Informatics, University of Barcelona, Barcelona, Spain
(3) Computer Vision Center, Autonomous University of Barcelona, Barcelona, Spain
(4) Facultat d’Informatica, Polytechnic University of Barcelona, Barcelona, Spain
(5) Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, Mexico
(6) EIMT, Open University of Catalonia, Barcelona, Spain
(7) Université Paris-Saclay, Paris, France
masadia@ce.sharif.edu
Abstract—The interest in action and gesture recognition has
grown considerably in the last years. In this paper, we present
a survey on current deep learning methodologies for action
and gesture recognition in image sequences. We introduce a
taxonomy that summarizes important aspects of deep learning
for approaching both tasks. We review the details of the proposed
architectures, fusion strategies, main datasets, and competitions.
We summarize and discuss the main works proposed so far with
particular interest in how they treat the temporal dimension of
the data, discussing their main features and identifying opportunities
and challenges for future research.
I. INTRODUCTION
Action and gesture recognition have been studied for a while
within the fields of computer vision and pattern recognition
and substantial progress has been reported for both tasks in
the last two decades. Recently, deep learning has burst into
these fields, achieving outstanding results and outperforming
“non-deep” state-of-the-art methods [97, 112, 31].
The temporal dimension in sequences typically makes action/gesture
recognition a challenging problem in terms of both the amount of data
to be processed and the model complexity, which are crucial aspects when
training large parametric deep learning networks. In this context, authors
have proposed several strategies, such as frame sub-sampling, aggregation
of local frame-level features into mid-level video representations, or
temporal sequence modeling, to name a few. For the latter, researchers
tried to exploit recurrent neural networks (RNN) in the past [108].
However, these models typically faced some major mathematical difficulties,
identified by Hochreiter [39] and Bengio et al. [9]. In 1997, these
efforts led to the development of long short-term memory
(LSTM) [40] cells for RNNs. Today, LSTMs are an important
part of deep models for image sequence modeling for human
action/gesture recognition [98, 92]. These, along with implicit
modeling of spatiotemporal features using 3D convolutional
nets [47], pre-computed motion-based features [97], and the
combination of multiple visual cues [98], resulted in fast and
reliable state-of-the-art methods for action/gesture recognition.
(An extended version of this paper will be made available as: Asadi-Aghbolaghi et al., "Deep learning for action and gesture recognition in image sequences: a survey". Book chapter in Springer Series on Challenges in Machine Learning, forthcoming 2018.)
Although the application of deep learning to action and
gesture recognition is relatively new, the amount of research
that has been generated in this topic within the last few years
is astounding. Even so, to the best of our knowledge, there is
no previous survey that collects and reviews all of the existing
work on deep learning for action and gesture recognition.
This paper aims at capturing a snapshot of current trends in
this direction, including an in-depth analysis of different deep
models, with special interest in how they treat the temporal
dimension of the data.
The remainder of this paper is organized as follows. Section II
presents a taxonomy of this field of research. Next,
Section III reviews the literature on human action/activity
recognition with deep learning models. Section IV summarizes
the state of the art on deep learning for gesture recognition.
Finally, Section V discusses the main features of the reviewed
deep learning methods for both studied problems.
II. TAXONOMY
Fig. 1 illustrates a taxonomy of the main works performing
action and gesture recognition using deep learning approaches.
Note that by recognition we refer to either the classification of
pre-segmented video segments or localization of actions in
long untrimmed videos.
A. Architectures
The most crucial challenge in deep-based human action
and gesture recognition is how to deal with the temporal
dimension. Based on that, we categorize approaches into
three different groups. The first group uses 3D filters in the
convolutional layer [7, 47, 58, 105]. The 3D convolution and
3D pooling in CNN layers allow capturing discriminative
features along both the spatial and temporal dimensions while
maintaining a certain temporal structure.

[Figure 1: Taxonomy of deep learning approaches for gesture and action recognition. Branches: 3D models (3D convolution and pooling) [48, 58, 102, 30, 61, 7, 129, 32, 95, 105]; motion-based input features [105, 97, 112, 34]; temporal methods: 2D models + RNN [130, 26] + LSTM [33, 7, 83, 21, 76, 71, 60], 2D models + B-RNN [98] + LSTM [83, 98], 2D models + H-RNN [22] + LSTM [22], 2D models + D-RNN + LSTM [106], 2D models + HMM [124]; 2D/3D models + auxiliary outputs [48]; 2D/3D models + hand-crafted features [112].]

In the second group,
motion features like 2D dense optical flow maps are pre-
computed and input to the networks [105, 97, 112, 34, 102,
122]. Extracted motion features can be fed to the network as
additional channels to the appearance ones [105] or input to
a secondary network (later combined with the former one) [97].
Fig. 2 illustrates these first two groups. The third group
combines a 2D (or 3D) CNN applied at individual (or stacks
of) frames with a temporal sequence modeling. Recurrent
Neural Network (RNN) [26] is one of the most used networks
for this task, which can take into account the temporal data
using recurrent connections in hidden layers. The drawback
of this network is its short-term memory, which is insufficient
for real-world actions. To solve this problem, Long Short-Term
Memory (LSTM) [33] was proposed, and it is usually used as a
hidden layer of an RNN, as seen in Fig. 2. Bidirectional RNN (B-
RNN) [83], Hierarchical RNN (H-RNN) [22], and Differential
RNN (D-RNN) [106] are some other successful extensions of
RNN in recognizing human actions. Other temporal modeling
tools like HMM are also applied [124].
For all methods in the three groups, the performance of a
deep model can be boosted by combination with hand-crafted
features, e.g. improved dense trajectories (iDT) [112].
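To make the first group concrete, the sketch below (PyTorch, with made-up layer sizes, not a specific architecture from the surveyed works) shows a minimal 3D CNN: convolution and pooling operate jointly over time, height, and width of a short clip.

```python
import torch
import torch.nn as nn

# Minimal 3D-CNN block (hypothetical sizes): convolution and pooling act on
# (time, height, width), so spatio-temporal features are learned jointly.
class Tiny3DCNN(nn.Module):
    def __init__(self, num_classes=10, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),      # pool only spatially first
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(2, 2, 2)),      # now pool over time as well
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):                          # clip: (B, C, T, H, W)
        x = self.features(clip)
        x = x.mean(dim=(2, 3, 4))                     # global average over T, H, W
        return self.classifier(x)

# A batch of two 16-frame RGB clips at 112x112 resolution
logits = Tiny3DCNN()(torch.randn(2, 3, 16, 112, 112))  # -> (2, 10)
```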
B. Fusion strategies
Information fusion is common in deep learning methods
for action and gesture recognition. At times, fusion is used
to combine the information from parts of a segmented video
sequence [51, 115], although it is more common to fuse
information from multiple cues (e.g. RGB and motion, depth,
and/or audio) [32], as well as to combine models trained with
different data samples and learning parameters [68].
There are three main variants of information fusion in deep
learning models: early (the data is fused before being fed into the model,
or the model ingests information directly from multiple sources),
late (the outputs of deep learning models are combined) and
middle (intermediate layers fuse information) fusion [68, 69].
An example of the latter is shown in Fig. 2. Modifications
and variants of these schemes have been proposed as well,
for instance, see the variants introduced in [51] for fusing
information in the temporal dimension. Moreover, ensembles
or stacked networks are also considered as fusion
strategies [115, 105, 68].
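The three fusion variants can be sketched roughly as follows (an illustrative two-stream setup with hypothetical layer sizes, not taken from any particular surveyed method):

```python
import torch
import torch.nn as nn

# Two illustrative streams (e.g. appearance and motion), hypothetical sizes.
def make_stream(in_ch):
    return nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

rgb_stream, flow_stream = make_stream(3), make_stream(2)
rgb, flow = torch.randn(4, 3, 112, 112), torch.randn(4, 2, 112, 112)

# Late fusion: each stream has its own classifier; their scores are averaged.
cls_rgb, cls_flow = nn.Linear(32, 20), nn.Linear(32, 20)
late = (cls_rgb(rgb_stream(rgb)).softmax(-1) + cls_flow(flow_stream(flow)).softmax(-1)) / 2

# Middle fusion: intermediate features are concatenated and fed to a shared head.
shared_head = nn.Linear(64, 20)
middle = shared_head(torch.cat([rgb_stream(rgb), flow_stream(flow)], dim=1))

# Early fusion would instead stack the raw inputs channel-wise before the first layer.
early_input = torch.cat([rgb, flow], dim=1)            # (4, 5, 112, 112)
```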
[Figure 2: The different architectures and fusion strategies. Top-left: 3D convolution. Top-right: motion pre-computation. Bottom-left: sequential modeling via LSTM. Bottom-right: fusion into a spatio-temporal stream.]
C. Datasets
We list the most relevant datasets for action (or
activity) and gesture recognition in Tables I and II, respectively.
For each dataset we specify the year of creation, the problems for
which the dataset was defined (either classification or
temporal/spatiotemporal localization), the data modalities available for
the task, the involved body parts, the number of classes, and
state-of-the-art performances to date (which provide a hint of
how difficult the datasets are).
Table III and Table IV summarize recent approaches
which obtained remarkable results on two of the most
well-known and challenging datasets in action recognition,
respectively UCF-101 and THUMOS-14. Reviewing the top-ranked
methods on the UCF-101 dataset, we find that the most significant
difference among them is the strategy for splitting video data
and combining sub-sequence results. [119] encodes the changes
in the environment by dividing the input sequence into two
parts, pre-condition and effect, and models the actions as a linear
transformation from one to the other.

Table I: Action datasets

Year | Dataset | Problem | Body Parts | Modality | No. classes | Performance
2004 | KTH | AC | F | I | 6 | 98.67% Acc [136]
2006 | IXMAS | AC | F | RGB, A | 13 | 98.79% Acc [104]
2007 | HDM05 | AC | F | S | 100 | 98.17% Acc [14]
2008 | HOHA (Hollywood 1) | AC, TL | F, U, L | RGB | 8 | 71.90% Acc [91], 0.787@0.5 mAP [62]
2008 | UCF Sports | AC, STL | F | RGB | 10 | 95.80% Acc [94], 0.789@0.5 mAP [62]
2009 | Hollywood 2 | AC | F, U, L | RGB | 12 | 78.50 mAP [56]
2009 | UCF11 (YouTube Action) | AC, STL | F | RGB | 11 | 93.77% Acc [82], -
2010 | Highfive | AC, STL | F, U | RGB | 4 | 69.40 mAP [109], 0.466 IoU [6]
2010 | MSRAction3D | AC | F | D, S | 20 | 97.30% Acc [59]
2010 | MSRAction II | STL | F | RGB | 3 | 85.00@0.125% mAP [17]
2010 | Olympic Sports | AC | F | RGB | 16 | 96.60% Acc [55]
2011 | Collective Activity (Extended) | AC | F | RGB | 6 | 90.23% Acc [5]
2011 | HMDB51 | AC | F, U, L | RGB | 51 | 73.60% Acc [110]
2012 | MPII Cooking | AC, TL | F, U | RGB | 65 | 72.40 mAP [137], -
2012 | MSRDailyActivity3D | AC | F, U | RGB, D, S | 16 | 97.50% Acc [93]
2012 | UCF101 | AC, TL | F, U, L | RGB | 101 | 94.20% Acc [115], 46.77@0.2 mAP (split 1) [122]
2012 | UCF50 | AC | F, U, L | RGB | 50 | 97.90% Acc [24]
2012 | UTKinect-Action3D | AC | F | RGB, D, S | 10 | 98.80% Acc [52]
2013 | J-HMDB | AC, STL | F, U, L | RGB, S | 21 | 71.08 Acc [79], 73.1@0.5 mAP [91]
2013 | Berkeley MHAD | AC | F | RGB, D, S, A | 11 | 100.00% Acc [14]
2014 | N-UCLA Multiview Action3D | AC | F | RGB, D, S | 10 | 90.80% Acc [52]
2014 | Sports 1-Million | AC | F, U, L | RGB | 487 | 73.10% Acc [133]
2014 | THUMOS-14 | AC, TL | F, U, L | RGB | 101, 20 * | 71.60 mAP [46], 0.190@0.5 mAP [95]
2015 | THUMOS-15 | AC, TL | F, U, L | RGB | 101, 20 * | 80.80 mAP [55], 0.183@0.5 mAP (a)
2015 | ActivityNet | AC, TL | F, U, L | RGB | 200 | 93.23 mAP (b), 0.594@0.5 mAP [65]
2016 | NTU RGB+D | AC | F | RGB, D, S, IR | 60 | {69.20, 77.70}^1 Acc [57]

Problems: action classification (AC), temporal localization (TL), and spatiotemporal localization (STL). Body parts: full
body (F), upper body (U), and lower body (L). Modalities: audio (A), depth (D), grayscale intensity (I), infrared (IR),
skeleton (S), and color (RGB).
Performance: Acc (accuracy), mAP (mean average precision), IoU (intersection-over-union).
* A different no. of classes is used for different problems. For TL/STL, “@” indicates the amount of overlap with the ground truth
considered for a positive localization.
(a) Winner method from (http://activity-net.org/challenges/2016/program.html#leaderboard).
(b) Winner method from http://www.thumos.info/results.html.
^1 {cross-subject accuracy, cross-view accuracy}.
Table II: Gesture datasets

Year | Dataset | Problem | Body Parts | Modality | No. classes | Performance
2011 | ChaLearn Gesture | GC | F, U | RGB, D | 15 | -
2012 | MSR-Gesture3D | GC | F, H | RGB, D | 12 | 98.50% Acc [16]
2014 | ChaLearn (Track 3) | GC, TL | U | RGB, D, S | 20 | 98.20 Acc [64], 0.870 IoU [69]
2015 | VIVA Hand Gesture | GC | H | RGB | 19 | 77.50% Acc [63]
2016 | ChaLearn conGD | TL | U | RGB, D | 249 | 0.315 IoU [11]
2016 | ChaLearn isoGD | GC | U | RGB, D | 249 | 67.19% Acc [23]

Problems: gesture classification (GC). Body parts: hands (H).
See also the table notes of Table I for additional notation.
[55] processes the input video as a hierarchical structure over time at 3 levels,
i.e. short-term, medium-range, and long-range. [105] achieves
good performance by using a two-stream network (RGB and
motion) with an extended temporal resolution with respect to previous
works (from 16 to 60 frames). [135] obtains the best
accuracy on UCF101 by using trajectory pooling to pool the
convolutional features extracted from the optical flow
nets of two-stream ConvNets and the frame-diff layers of the
spatial network to obtain local descriptors.
Looking at the top-ranked deep models on the THUMOS 2014
challenge, almost all winning methods combined appearance
and motion features. For appearance, most of the methods
extract frame-level CNN descriptors, and a video representation
is generated using a pooling method over the sequence. On
the other hand, the motion-based approaches in the top-ranked
methods can be divided into three groups: FlowNet, 3D CNN,
and iDTs. In [84], we provide a comparison of those, showing
that the 3D CNN achieves the best result.
D. Challenges
Every year, computer vision organizations arrange competitions
providing useful annotated datasets. Table V shows 5
main challenge series in computer vision. For each, we report
the year in which it took place, the name of the dataset along
with the task to be faced (either action- or gesture-related), the
associated event’s name, the winning participant, and the most
recent results on the challenge’s associated dataset.
Table III: UCF-101 dataset results

Ref. | Year | Features | Architecture | Score
[135] | 2015 | CNN, IDT | 2 CNN + iDT pooling | 93.78%
[105] | 2016 | Opt. Flow, 3D CNN, IDT | LTC-CNN | 92.7%
[32] | 2016 | conv5, 3D pool | VGG-16, VGG-M, 3D CNN | 92.5%
[119] | 2016 | CNN | Siamese VGG-16 | 92.4%
[55] | 2016 | CNN fc7 | 2 CNNs (spatial + temporal) | 92.2%
[112] | 2015 | CNN, Hog/Hof/Mbh | 2-stream CNN | 91.5%
[61] | 2015 | CNN feat | 3D CNN | 89.7%
[10] | 2016 | Dynamic feat maps | BVLC CaffeNet | 89.1%
[46] | 2015 | H/H/M, IDT, FV+PCA+GMM | 8-layer CNN | 88.5%
[102] | 2015 | CNN | FstCN: 2 CNNs (spatial + temporal) | 88.1%
[97] | 2014 | CNN | Two-stream CNN (CNN-M-2048) | 88.0%
[60] | 2016 | eLSTM, DCNN fc7 | eLSTM, DCNN+LSTM | 86.9%
[134] | 2016 | CNN | 2 CNNs (spatial + temporal) | 86.4%
[129] | 2016 | dense trajectory, C3D | RNN, LSTM, 3D CNN | 85.4%
[78] | 2015 | CNN fc6, HOG/HOF/MBH | VGG19 Conv5 | 79.52% ± 1.1% (tr2)
[78] | 2015 | CNN fc6, HOG/HOF/MBH | VGG19 Conv5 | 66.64% (tr1)
[51] | 2014 | CNN features | 2 CNNs converging to 2 fc layers | 65.4%, 68% mAP
[45] | 2015 | ImageNet CNN, word2vec GMM | CNN | 63.9%
[122] | 2015 | CNN | Spatial + motion CNN | 54.28% mAP
Table IV: THUMOS-14 dataset results

Ref. | Year | Features | Architecture | Score
[46] | 2015 | H/H/M, IDT, FV+PCA+GMM | 8-layer CNN | 71.6%
[134] | 2016 | CNN | 2 CNNs (spatial + temporal) | 61.5%
[45] | 2015 | ImageNet CNN, word2vec GMM | CNN | 56.3%
[95] | 2016 | CNN fc6, fc7, fc8 | 3D CNN, Segment-CNN | 19% mAP
[130] | 2015 | CNN fc7 | VGG-16, 3-layer LSTM | 17.1% mAP
[30] | 2016 | fc7 3D CNN | C3D CNN net | 0.084% mAP@50, 0.121% mAP@100, 0.139% mAP@200, 0.125% mAP@500
III. ACTION/ACTIVITY RECOGNITION
This section reviews deep methods to address action recognition,
divided by how they treat the temporal dimension:
3D convolutions, pre-computed motion features, or temporal
modeling.
A. 3D Convolutional Neural Networks
In order to capture temporal information, one approach
consists in extending the convolution along the temporal axis,
in what is known as 3D CNN [47, 7, 103, 61, 58, 102, 30, 129,
32, 95]. Their larger number of parameters w.r.t. 2D models
makes them harder to train. To alleviate this problem, [61]
initializes the weights of a 3D CNN by using 2D weights
learned from ImageNet, while [102] proposes a 3D CNN
(FstCN) that factorizes the 3D convolutional kernel learning
as a sequential process of learning 2D spatial and 1D temporal
kernels in different layers. Other authors focused on further
improving the accuracy of 3D CNNs. [32] performs 3D convolutions
over stacks of optical flow maps. [95] uses multiple
3D CNNs in a multi-stage (proposal generation, classification,
and fine-grained localization) framework for temporal
action localization in long untrimmed videos. We also find
3D CNN models being combined with sequence modeling
methods [7] or hand-crafted feature descriptors (VLAD [30]
or iDTs [129]).
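The factorization idea attributed to [102] can be illustrated as below: a 2D spatial convolution applied frame by frame, followed by a 1D convolution across time. Layer sizes are made up and this is only a sketch of the decomposition, not the exact FstCN architecture.

```python
import torch
import torch.nn as nn

class Factorized3DConv(nn.Module):
    """Rough sketch: 3D convolution learning split into a 2D spatial kernel
    followed by a 1D temporal kernel (made-up sizes, not the exact FstCN layers)."""
    def __init__(self, in_ch=3, mid_ch=32, out_ch=32):
        super().__init__()
        self.spatial = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)
        self.temporal = nn.Conv1d(mid_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, clip):                          # clip: (B, C, T, H, W)
        B, C, T, H, W = clip.shape
        x = self.spatial(clip.transpose(1, 2).reshape(B * T, C, H, W))  # per-frame 2D conv
        x = x.reshape(B, T, -1, H, W).permute(0, 3, 4, 2, 1)            # (B, H, W, C', T)
        x = self.temporal(x.reshape(-1, x.shape[3], T))                 # 1D conv over time
        return x.reshape(B, H, W, -1, T).permute(0, 3, 4, 1, 2)         # (B, C'', T, H, W)

out = Factorized3DConv()(torch.randn(2, 3, 8, 56, 56))   # -> (2, 32, 8, 56, 56)
```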
B. Motion-based features
In recent years, many approaches have focused on incorporating
pre-computed temporal features within the deep model,

Table V: Challenges

Challenge | Year | Dataset | Task | Event | Winner | Results
ChaLearn | 2012 | CGD | G | - | Alfnie [53]* | [27]
ChaLearn | 2013 | Montalbano | G | - | [125] | [8]
ChaLearn | 2014 | HuPBA 8K+ | A | ECCV | [80] | -
ChaLearn | 2014 | Montalbano | G | ECCV | [68] | [83][69][96]
ChaLearn | 2015 | HuPBA 8K+ | A | CVPR | [121] | -
ChaLearn | 2016 | isoGD, conGD | G | ICPR | [13] | [51], [117]
HAL | 2012 | LIRIS | A | ICPR | [70] | [123][35]*
Opportunity | 2011 | Opportunity | A | - | CSTA | [90][15][89]
ROSE | 2016 | NTU RGB+D | A | ACCV | SEARCH | [92]
THUMOS | 2013 | UCF101 | A | ICCV | [75] | [101][100][81][50]
THUMOS | 2014 | THUMOS-14 | A | ECCV | [44] | [46][95][111][88]
THUMOS | 2015 | THUMOS-15 | A | CVPR | [128] | [114][132]
VIVA | 2015 | VIVA | G | CVPR | [63] | [63][74]
VIRAT | 2012 | VIRAT DB | A | CVPR | - | [107][73]
* Non-deep learning method
e.g. dense optical flow maps or iDTs. [122] detects frame
proposals and scores them with a combination of static and
motion CNN features for action localization. [97] presents a
two-stream CNN which incorporates both spatial (still image)
and temporal networks (multi-frame dense optical flow). [134]
exploits motion vectors from video compression (instead
of optical flow). [34] localizes actions in space and time
using a (spatial-temporal) two-stream CNN whose predictions
are late-fused with an SVM. [105] extends the convolutions in
time, aiming at capturing long-term temporal convolutions, at
the expense of spatial resolution. [116] uses view-invariant multi-scale
depth maps as an input motion descriptor for a CNN. [51]
proposes a multi-scale foveated CNN for large-scale video
classification. Differently, [85] uses CNNs to obtain canonical
human poses for action recognition. [113] simply estimates
actionness maps from appearance and motion cues. In the
same line, [138] introduces a deep-learning method to identify
key volumes and classify them simultaneously.
In the literature there exist several methods which extend
the CNN capabilities using trajectory features. [112] pools
and normalizes CNN feature maps along improved dense
trajectories. [78] concatenates iDTs (HOG, HOF, MBHx,
MBHy descriptors with fisher vector encoding) and CNN
feature (VGG19) descriptors. [86] presents a Robust Nonlinear
Knowledge Transfer Model (R-NKTM) based on a deep
fully-connected network that transfers human actions from any
view to a canonical one. R-NKTM is learned using bag-of-features
from dense trajectories of synthetic 3D human models
and generalizes to real videos of human actions. [120] builds on
iDT descriptors and two-stream CNN features, using a non-action
classifier to down-weight irrelevant video segments.
[18] presents a new pose-based CNN descriptor which aggregates
motion and appearance information along tracks of
human body parts. [55] proposes VLAD^3 to model long-range
dynamic information. It captures short-term dynamics with
deep CNN features, relying on linear dynamic systems (LDS)
to model medium-range dynamics.
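As a concrete illustration of feeding pre-computed motion to a 2D network, the snippet below stacks L dense optical flow maps (x and y components) into a single multi-channel input, in the spirit of the temporal stream of [97]; the exact preprocessing in the surveyed works may differ.

```python
import numpy as np

def stack_flow(flow_maps):
    """flow_maps: list of L dense optical flow fields, each of shape (H, W, 2)
    (x and y displacement). Returns a (2L, H, W) array usable as the input of a
    2D CNN temporal stream (illustrative layout only)."""
    channels = []
    for flow in flow_maps:
        channels.append(flow[..., 0])   # horizontal displacement
        channels.append(flow[..., 1])   # vertical displacement
    return np.stack(channels, axis=0)

# Example: 10 flow maps of a 224x224 video -> a 20-channel "image"
flows = [np.random.randn(224, 224, 2).astype(np.float32) for _ in range(10)]
print(stack_flow(flows).shape)          # (20, 224, 224)
```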
C. Temporal deep learning models: RNN and LSTM
We also find approaches which combine CNN with temporal
sequence modeling techniques, such as RNNs or LSTMs.
[106] introduces a differential gating scheme for LSTMs to
emphasize the change in information gain caused by the
salient motions between successive frames. [71] proposes an
RNN to perform interactional parsing of objects. The object
parsings are used to form object-specific action representations
for fine-grained action detection. [98] presents a multi-stream
bi-directional RNN. A tracking algorithm locates a bounding
box around a person and two streams (motion and appearance)
cropped to the tracked bounding box are trained along with
full-frame streams. The CNN is followed by a bidirectional
LSTM layer. [130] introduces a fully end-to-end approach
based on an RNN agent. The agent observes video frames and
decides both where to look next and when to emit a prediction.
[60] proposes a deep architecture which uses 3D skeleton
sequences to regularize an LSTM network (LSTM+CNN) on
the video frames.
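The third family (per-frame 2D CNN features combined with a recurrent model) can be sketched as below; the tiny CNN is a stand-in for the deep backbones actually used, and all sizes are hypothetical.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, num_classes=10, feat_dim=64):
        super().__init__()
        # Toy per-frame feature extractor (stand-in for a deep 2D CNN backbone).
        self.cnn = nn.Sequential(nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(feat_dim, 128, batch_first=True)
        self.head = nn.Linear(128, num_classes)

    def forward(self, video):                      # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1)).reshape(B, T, -1)  # (B, T, feat_dim)
        hidden, _ = self.lstm(feats)                             # (B, T, 128)
        return self.head(hidden)                                 # per-frame class scores

scores = CNNLSTM()(torch.randn(2, 16, 3, 112, 112))              # -> (2, 16, 10)
```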
D. Deep learning with fusion strategies
Some methods have used diverse fusion schemes to improve
the performance of action recognition. [37] proposes
a novel Subdivision-Fusion Model (SFM), where features
extracted with a CNN are clustered and grouped into subcategories.
[22] learns an end-to-end hierarchical RNN using
skeleton data divided into five parts, each of which is fed into
a different network. The final decision is taken by a single-layer
network. [99] faces the problem of first-person action recognition
using a multi-stream CNN (ego-CNN, temporal, and
spatial). [119] focuses on the changes that an action brings into
the environment and proposes a siamese CNN architecture to
fuse precondition and effect information from the environment.
[20] proposes a CNN which uses mid-level discriminative
visual elements. The method, called DeepPattern, is able to
learn discriminative patches by exploring human body parts as
well as scene context. [76] proposes DeepConvLSTM, based
on convolutional and LSTM recurrent units, which is suitable
for multimodal wearable sensors.
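A rough sketch of the hierarchical scheme described for [22]: each of the five skeleton parts is processed by its own recurrent network, the part-level features are fused, and a final single-layer classifier makes the decision. All dimensions and the exact fusion point are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class HierarchicalSkeletonRNN(nn.Module):
    """Illustrative sketch of a [22]-style hierarchy: one RNN per body part,
    fused part features, single-layer classifier (made-up sizes)."""
    def __init__(self, num_classes=60, joints_per_part=5, hidden=32):
        super().__init__()
        # Five body parts: two arms, two legs, trunk.
        self.part_rnns = nn.ModuleList(
            [nn.LSTM(joints_per_part * 3, hidden, batch_first=True) for _ in range(5)])
        self.fusion_rnn = nn.LSTM(5 * hidden, 2 * hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, parts):                  # parts: list of 5 tensors (B, T, joints*3)
        part_feats = [rnn(p)[0] for rnn, p in zip(self.part_rnns, parts)]
        fused, _ = self.fusion_rnn(torch.cat(part_feats, dim=-1))
        return self.classifier(fused[:, -1])   # decision from the last time step

parts = [torch.randn(4, 30, 15) for _ in range(5)]   # 30 frames, 5 joints x 3D coords
print(HierarchicalSkeletonRNN()(parts).shape)        # torch.Size([4, 60])
```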
IV. GESTURE RECOGNITION
In this section we review recent deep-learning approaches
for gesture recognition in videos, mainly driven by the areas
of human computer, machine, and robot interaction.
A. 3D Convolutional Neural Networks
Several 3D CNNs have been proposed for gesture recognition,
most notably [64, 41, 63]. [41] proposes a 3D CNN
for sign language recognition. The CNN automatically learns
a representation from raw video, and processes multimodal
information (RGB-D+Skeleton data). Similar in spirit, [63]
introduces a 3D CNN for driver hand gesture recognition
from depth and intensity data. [64] extends a 3D CNN with
a recurrent mechanism for detection and classification of
dynamic hand gestures. It consists of a 3D CNN for spatiotemporal
feature extraction, followed by a recurrent layer for
global temporal modeling and a softmax layer for predicting
class-conditional gesture probabilities.

B. Motion-based features
Neural networks and CNNs based on hand and body pose
estimation as well as motion features have been widely applied
for gesture recognition. For gesture style recognition
in biometrics, [126] proposes a two-stream (spatio-temporal)
CNN. The authors use raw depth data as the input of the spatial
network and optical flow as the input of the temporal one. For
articulated human pose estimation in videos the authors of
[43] propose a Convolutional Network (ConvNet) architecture
for estimating the 2D location of human joints in video, with
an RGB image and a set of motion features as the input data
of this network. The authors of [117] use three representations
of dynamic depth image (DDI), dynamic depth normal image
(DDNI) and dynamic depth motion normal image (DDMNI)
for gesture recognition.
[118] first identifies the start and end frames of each
gesture based on the quantity of movement (QOM), and then
constructs an Improved Depth Motion Map (IDMM) by calculating
the absolute depth difference between the current frame and the
start frame of each gesture segment, which is used as the input to the deep
network.
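A minimal sketch of the depth-difference image described for [118]: accumulating the absolute depth difference between every frame of a segmented gesture and its start frame, then rescaling the result to an image. The exact IDMM construction and normalization in [118] may differ.

```python
import numpy as np

def depth_motion_map(depth_frames):
    """Illustrative IDMM-like map for one gesture segment: accumulate the absolute
    depth difference between every frame and the start frame. The exact construction
    and normalization used in [118] may differ."""
    start = depth_frames[0].astype(np.float32)
    acc = np.zeros_like(start)
    for frame in depth_frames[1:]:
        acc += np.abs(frame.astype(np.float32) - start)
    # Scale to [0, 255] so it can be fed to an image-based CNN.
    return (255 * acc / max(acc.max(), 1e-6)).astype(np.uint8)

segment = [np.random.randint(0, 4000, (240, 320), dtype=np.uint16) for _ in range(40)]
idmm_like = depth_motion_map(segment)          # (240, 320) single-channel image
```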
C. Temporal deep learning models: RNN and LSTM
This kind of model has not been widely used for gesture
recognition, despite being a promising avenue for research. We
are aware of [67], where the authors propose a multimodal
(depth video, skeleton, and speech) human gesture recognition
system based on RNN. [25] presents a Convolutional Long
Short-Term Memory Recurrent Neural Network (CNNLSTM)
able to successfully learn gestures varying in duration and complexity.
[72] proposes a multi-stream model, called MRNN,
which extends RNN capabilities with LSTM cells in order to
facilitate the handling of variable-length gestures.
D. Deep Learning with fusion strategies
Multimodality has been widely exploited for gesture recognition.
[124] proposes a semi-supervised hierarchical dynamic
framework based on an HMM for simultaneous gesture segmentation
and recognition using skeleton joint information,
depth and RGB images. The authors applied intermediate
(middle) and late fusion to get the final result. [69] proposes a
multimodal multi-stream CNN for gesture spotting. Separate
CNNs are considered for each modality at the beginning of
the model structure with increasingly shared layers and a final
prediction layer. The authors fuse the result of each network
by a meta-classifier independently at each scale; i.e., late
fusion. [77] presents a deep learning model to fuse multiple
information sources for human pose estimation. The deep
model takes as input the output of a state-of-the-art human
pose estimator. The authors exploited early and middle fusion
methods to integrate the models.
[54] proposes a CNN that learns to score pairs of input
images and human poses (joints). The model is formed by
two subnetworks: a CNN learns a feature embedding for
the input images, and a two layer subnetwork learns an
embedding for the human pose. The authors then calculate the
score function as the dot product between the two embeddings,
i.e., late fusion. Similarly, [43] proposes a CNN for estimating
2D joint locations. The CNN incorporates RGB images and
motion features. The authors utilize early fusion to integrate
these two kinds of features.
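The pair-scoring scheme attributed to [54] can be sketched as two sub-networks, one embedding the image and one embedding the pose, fused late through a dot product; the layer sizes and number of joints below are made-up placeholders.

```python
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    """Sketch of a [54]-style scorer: CNN embedding for the image, small MLP
    embedding for the pose, fused late via a dot product (made-up sizes)."""
    def __init__(self, num_joints=14, embed_dim=64):
        super().__init__()
        self.image_net = nn.Sequential(nn.Conv2d(3, embed_dim, 3, padding=1), nn.ReLU(),
                                       nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.pose_net = nn.Sequential(nn.Linear(num_joints * 2, 128), nn.ReLU(),
                                      nn.Linear(128, embed_dim))

    def forward(self, image, pose):            # image: (B, 3, H, W), pose: (B, J*2)
        return (self.image_net(image) * self.pose_net(pose)).sum(dim=1)  # (B,) scores

scores = PairScorer()(torch.randn(8, 3, 112, 112), torch.randn(8, 28))
```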
V. DISCUSSION
We presented a comprehensive overview of deep-based
models for action and gesture recognition. We defined a
taxonomy covering most of the basic and crucial information
about human action and gesture analysis, and then we reviewed
recent methods. Key topics identified include architectures,
fusion strategies, datasets, and challenges.
Generally, there are two main issues when comparing the
methods: how does a method deal with temporal information,
and how can such a large network be trained with small
datasets? As discussed, methods can learn motion features
with the 3D filters in their 3D convolutional and pooling layers. It
has been shown that 3D networks over a long sequence are
able to learn complex temporal patterns [105]. Because of the
required amount of data, the problem of weight initialization
has been investigated. Transforming 2D convolutional
weights into 3D ones yields models that achieve better accuracy
than training from scratch [61].
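A common way to realize the 2D-to-3D weight transformation mentioned above is to replicate each pre-trained 2D kernel along the temporal axis and rescale it so activations keep roughly the same magnitude. The sketch below shows this generic inflation recipe, which is not necessarily the exact procedure used in [61].

```python
import torch

def inflate_2d_to_3d(weight_2d, temporal_size=3):
    """weight_2d: (out_ch, in_ch, kH, kW) kernel from a pre-trained 2D CNN.
    Returns a (out_ch, in_ch, T, kH, kW) kernel obtained by repeating the 2D kernel
    T times along the temporal axis and dividing by T so the response magnitude is
    roughly preserved. Generic inflation recipe; [61] may use a different scheme."""
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, temporal_size, 1, 1)
    return weight_3d / temporal_size

w2d = torch.randn(64, 3, 7, 7)        # e.g. the first conv layer of an ImageNet model
w3d = inflate_2d_to_3d(w2d)           # (64, 3, 3, 7, 7), ready for an nn.Conv3d
```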
It has also been shown that training networks on pre-computed
motion features is an effective way to spare them
from having to learn motion features implicitly. Moreover, fine-tuning
motion-based networks with spatial data (ImageNet)
proved to be effective, allowing networks which are fine-tuned
on stacked optical flow frames to achieve good performance
in spite of having limited training data.
Still, both groups can only exploit limited (local) temporal
information. Hence, the most crucial advantage of approaches
in the third group (i.e. temporal models like RNN, LSTM) is
that they are able to cope with longer-range temporal relations.
These models are mostly used with, and preferred for, skeletal data. Since
skeleton features are low-dimensional, these networks have
fewer weights and, thus, can be trained with less data.
Regardless of the model, performance is dependent on the
amount of data. The community is nowadays putting efforts
on building larger data sets that can cope with huge parametric
deep models (e.g. [2, 38]) and on challenge organization (with
novel data sets and well-defined evaluation protocols) that can
advance the state of the art of the field and make the
comparison among deep learning architectures easier (e.g. [92, 29]).
Strategies for data augmentation and pre-training are common.
Likewise, training mechanisms to avoid overfitting (e.g.
dropout) and to control the learning rate (e.g. extensions to
SGD and Nesterov momentum) have been proposed. Taking
into account the full temporal scale results in a huge number
of weights to learn. To address this problem and decrease
the number of weights, a good trick is to decrease the spatial
resolution while increasing the temporal length.
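A back-of-the-envelope illustration of this trade-off, with made-up layer sizes: if a fully connected layer sits on top of the last feature map, its weight count grows with the product T x H x W, so halving the spatial resolution roughly compensates for a four-fold longer temporal window.

```python
# Hypothetical sizes: weights of a fully connected layer placed after the last
# conv feature map grow with channels * T * H * W * fc_units.
def fc_weights(channels, t, h, w, fc_units=2048):
    return channels * t * h * w * fc_units

base     = fc_weights(64, t=16, h=28, w=28)   # short clip, full spatial resolution
extended = fc_weights(64, t=64, h=14, w=14)   # 4x longer clip, half the resolution
print(base, extended)                          # both equal 1,644,167,168 weights
```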
Yet another trick to improve the result of deep-based models
is data fusion. Individual networks can be trained on different

Citations
Journal ArticleDOI
TL;DR: A detailed overview of recent advances in RGB-D-based motion recognition is presented in this paper, where the reviewed methods are broadly categorized into four groups, depending on the modality adopted for recognition: RGB-based, depth based, skeleton-based and RGB+D based.

270 citations

Book ChapterDOI
08 Sep 2018
TL;DR: A novel weakly-supervised TAL framework called AutoLoc is developed to directly predict the temporal boundary of each action instance and a novel Outer-Inner-Contrastive (OIC) loss is proposed to automatically discover the needed segment-level supervision for training such a boundary predictor.
Abstract: Temporal Action Localization (TAL) in untrimmed video is important for many applications. But it is very expensive to annotate the segment-level ground truth (action class and temporal boundary). This raises the interest of addressing TAL with weak supervision, namely only video-level annotations are available during training). However, the state-of-the-art weakly-supervised TAL methods only focus on generating good Class Activation Sequence (CAS) over time but conduct simple thresholding on CAS to localize actions. In this paper, we first develop a novel weakly-supervised TAL framework called AutoLoc to directly predict the temporal boundary of each action instance. We propose a novel Outer-Inner-Contrastive (OIC) loss to automatically discover the needed segment-level supervision for training such a boundary predictor. Our method achieves dramatically improved performance: under the IoU threshold 0.5, our method improves mAP on THUMOS’14 from 13.7% to 21.2% and mAP on ActivityNet from 7.4% to 27.3%. It is also very encouraging to see that our weakly-supervised method achieves comparable results with some fully-supervised methods.

261 citations


Cites background from "A Survey on Deep Learning Based App..."

  • ...Video Action Analysis Detailed reviews can be found in recent surveys [65,42,2,9,3,31]....


Proceedings ArticleDOI
15 Jun 2019
TL;DR: This work identifies two underexplored problems posed by the weak supervision for temporal action localization, namely action completeness modeling and action-context separation, and proposes a multi-branch neural network in which branches are enforced to discover distinctive action parts.
Abstract: Temporal action localization is crucial for understanding untrimmed videos. In this work, we first identify two underexplored problems posed by the weak supervision for temporal action localization, namely action completeness modeling and action-context separation. Then by presenting a novel network architecture and its training strategy, the two problems are explicitly looked into. Specifically, to model the completeness of actions, we propose a multi-branch neural network in which branches are enforced to discover distinctive action parts. Complete actions can be therefore localized by fusing activations from different branches. And to separate action instances from their surrounding context, we generate hard negative data for training using the prior that motionless video clips are unlikely to be actions. Experiments performed on datasets THUMOS'14 and ActivityNet show that our framework outperforms state-of-the-art methods. In particular, the average mAP on ActivityNet v1.2 is significantly improved from 18.0% to 22.4%. Our code will be released soon.

219 citations


Cites background from "A Survey on Deep Learning Based App..."

  • ...Please refer to recent surveys [1, 3, 22, 19] for a detailed review....


Journal ArticleDOI
TL;DR: A comprehensive survey of deep learning based human pose estimation methods and analyzes the methodologies employed and summarizes and discusses recent works with a methodology-based taxonomy.

216 citations


Cites background from "A Survey on Deep Learning Based App..."

  • ...Asadi-Aghbolaghi et al.[23] surveyed deep learning based approaches for action and gesture recognition in image sequences, and discussed deep learning techniques applied to action and gesture recognition....


Journal ArticleDOI
TL;DR: Most computer vision applications such as human computer interaction, virtual reality, security, video surveillance and home monitoring are highly correlated to HAR tasks, which establishes new trend and milestone in the development cycle of HAR systems.
Abstract: Human activity recognition (HAR) systems attempt to automatically identify and analyze human activities using acquired information from various types of sensors. Although several extensive review papers have already been published in the general HAR topics, the growing technologies in the field as well as the multi-disciplinary nature of HAR prompt the need for constant updates in the field. In this respect, this paper attempts to review and summarize the progress of HAR systems from the computer vision perspective. Indeed, most computer vision applications such as human computer interaction, virtual reality, security, video surveillance and home monitoring are highly correlated to HAR tasks. This establishes new trend and milestone in the development cycle of HAR systems. Therefore, the current survey aims to provide the reader with an up to date analysis of vision-based HAR related literature and recent progress in the field. At the same time, it will highlight the main challenges and future directions.

184 citations


Cites background from "A Survey on Deep Learning Based App..."

  • ...Reviews on deep-learning based methods of human activities recognition were provided in [16, 132, 210, 225]....


References
Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O. 1. Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

72,897 citations


"A Survey on Deep Learning Based App..." refers background in this paper

  • ...[72] proposes a multi-stream model, called MRNN, which extends RNN capabilities with LSTM cells in order to facilitate the handling of variable-length gestures....


  • ...Hence, the most crucial advantage of approaches in the third group (i.e. temporal models like RNN, LSTM) is that they are able to cope with longer-range temporal relations....


  • ...[42] uses an LSTM to model each individual’s actions and a second-level LSTM aggregates the outputs of individual LSTMs....


  • ...To solve this problem Long Short-Term Memory (LSTM) [33] was proposed, and it is usually used as a hidden layer of RNN, as seen in Fig....


  • ...We are aware of [67], where the authors propose a multimodal (depth video, skeleton, and speech) human gesture recognition system based on RNN. [25] presents a Convolutional Long Short-Term Memory Recurrent Neural Network (CNNLSTM) able to successfully learn gesture varying in duration and complexity....


Posted Content
TL;DR: Caffe as discussed by the authors is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Abstract: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU ($\approx$ 2.5 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.

12,531 citations

Journal ArticleDOI
TL;DR: A proposal along these lines first described by Jordan (1986) which involves the use of recurrent links in order to provide networks with a dynamic memory and suggests a method for representing lexical categories and the type/token distinction is developed.

10,264 citations


"A Survey on Deep Learning Based App..." refers background or methods in this paper

  • ...[130] introduces a fully end-to-end approach based on a RNN agent....


  • ...Temporal methods 2D Models + RNN [130, 26] + LSTM [33, 7, 83, 21, 76, 71, 60]...


  • ...To solve this problem Long Short-Term Memory (LSTM) [33] was proposed, and it is usually used as a hidden layer of RNN, as seen in Fig....


  • ...For the latter, researchers tried to exploit recurrent neural networks (RNN) in the past [108]....


  • ...We are aware of [67], where the authors propose a multimodal (depth video, skeleton, and speech) human gesture recognition system based on RNN. [25] presents a Convolutional Long Short-Term Memory Recurrent Neural Network (CNNLSTM) able to successfully learn gesture varying in duration and complexity....


Proceedings ArticleDOI
03 Nov 2014
TL;DR: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Abstract: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments.Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.

10,161 citations


"A Survey on Deep Learning Based App..." refers background in this paper

  • ...Among the popular ones are Caffe [49], CNTK [131], TensorFlow [1], and Theano [3]....


Journal ArticleDOI
TL;DR: This work shows why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on information for long periods.
Abstract: Recurrent neural networks can be used to map input sequences to output sequences, such as for recognition, production or prediction problems. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. We show why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. These results expose a trade-off between efficient learning by gradient descent and latching on information for long periods. Based on an understanding of this problem, alternatives to standard gradient descent are considered. >

7,309 citations


"A Survey on Deep Learning Based App..." refers background in this paper

  • ...However, these models typically faced some major mathematical difficulties identified by Hochreiter [39] and Bengio et al [9]....


Frequently Asked Questions (18)
Q1. What have the authors contributed in "A survey on deep learning based approaches for action and gesture recognition in image sequences" ?

In this paper, the authors present a survey on current deep learning methodologies for action and gesture recognition in image sequences. The authors introduce a taxonomy that summarizes important aspects of deep learning for approaching both tasks. The authors review the details of the proposed architectures, fusion strategies, main datasets, and competitions. The authors summarize and discuss the main works proposed so far with particular interest in how they treat the temporal dimension of the data, discussing their main features and identifying opportunities and challenges for future research. 

One valuable cue is spatial structure of actions/gestures. [112] takes advantage of iDTs to pool relevant CNN features along trajectories in video frames. [12], takes advantage of human body spatial constraints, by aggregating convolutional activations of a 3D CNN into descriptors based on joint positions. 

Recurrent Neural Network (RNN) [26] is one of the most used networks for this task, which can take into account the temporal data using recurrent connections in hidden layers. 

The authors anticipate deep learning will prevail in emerging applications/areas like social signal processing, affective computing, and personality analysis, among others. 

Bidirectional RNN (B-RNN) [83], Hierarchical RNN (H-RNN) [22], and Differential RNN (D-RNN) [106] are some other successful extensions of RNN for recognizing human actions. 

Contextual cues have also been considered for action/gesture recognition. [4] proposes a novel multi-stage recurrent architecture consisting of two stages: in a first stage, the model focuses on global context-aware features, and then combines the resulting representation with the localized, action-aware. [46] enriches their motion representation by encoding a set of 15,000 objects from ImageNet and computing their likelihood in frames. 

Other authors focused on further improving accuracy of 3D CNNs. [32] performs 3D convolutions over stacks of optical flow maps. [95] uses multiple 3D CNNs in a multi-stage (proposal generation, classification, and fine-grained localization) framework for temporal action localization in long untrimmed videos. 

There are three main variants of information fusion in deep learning models: early (the data is fused before being fed into the model, or the model ingests information directly from multiple sources), late (the outputs of deep learning models are combined) and middle (intermediate layers fuse information) fusion [68, 69]. 

R-NKTM is learned using bag-of-features from dense trajectories of synthetic 3D human models and generalizes to real videos of human actions. 

[112] pools and normalizes CNN feature maps along improved dense trajectories. [78] concatenates iDTs (HOG, HOF, MBHx, MBHy descriptors with fisher vector encoding) and CNN feature (VGG19) descriptors. [86] presents a Robust Nonlinear Knowledge Transfer Model (R-NKTM) based on a deep fully-connected network that transfers human actions from any view to a canonical one. 

Regarding applications, deep learning techniques have been successfully used in traditional ones (e.g. surveillance, health care, robotics), improving performance in action and gesture recognition for human computer-robot or -machine interaction. 

[135] obtains the best accuracy on UCF101 by using trajectory pooling to pool the convolutional features extracted from the optical flow nets of two-stream ConvNets and the frame-diff layers of the spatial network to get local descriptors. 

[105] achieves good performance by using a two-stream network (RGB and motion) with an extended temporal resolution with respect to previous works (from 16 to 60 frames). 

As such, the authors envision newer problems like early recognition [28], multi-task learning [127], captioning, recognition from low resolution sequences [66] and lifelog devices [87] will receive attention in the next years. 

To alleviate this problem, [61] initializes the weights of a 3D CNN by using 2D weights learned from ImageNet, while [102] proposes a 3D CNN (FstCN) that factorizes the 3D convolutional kernel learning as a sequential process of learning 2D spatial and 1D temporal kernels in different layers. 

To address this problem and decrease the number of weights, a good trick is to decrease the spatial resolution while increasing the temporal length. 

The final decision is taken by a single-layer network. [99] faces the problem of first-person action recognition using a multi-stream CNN (ego-CNN, temporal, and spatial). 

The authors also find 3D CNN models being combined with sequence modeling methods [7] or hand-crafted feature descriptors (VLAD [30] or iDTs [129]).