Proceedings ArticleDOI

A Survey on Deep Learning Based Approaches for Action and Gesture Recognition in Image Sequences

TL;DR: A taxonomy that summarizes important aspects of deep learning for approaching both action and gesture recognition in image sequences is introduced, and the main works proposed so far are summarized.
Abstract: The interest in action and gesture recognition has grown considerably in the last years. In this paper, we present a survey on current deep learning methodologies for action and gesture recognition in image sequences. We introduce a taxonomy that summarizes important aspects of deep learning for approaching both tasks. We review the details of the proposed architectures, fusion strategies, main datasets, and competitions. We summarize and discuss the main works proposed so far with particular interest in how they treat the temporal dimension of the data, discussing their main features and identifying opportunities and challenges for future research.

Summary (3 min read)

Introduction

  • A survey on current deep learning methodologies for action and gesture recognition in image sequences.
  • The authors introduce a taxonomy that summarizes important aspects of deep learning for approaching both tasks.
  • In 1997, these efforts led to the development of long short-term memory (LSTM) [40] cells for RNNs.
  • The amount of research that has been generated in this topic within the last few years is astounding.
  • The remainder of this paper is organized as follows.

II. TAXONOMY

  • Fig. 1 illustrates a taxonomy of the main works performing action and gesture recognition using deep learning approaches.
  • Note that by recognition the authors refer to either the classification of pre-segmented video segments or the localization of actions in long untrimmed videos.

A. Architectures

  • The most crucial challenge in deep-based human action and gesture recognition is how to deal with the temporal dimension.
  • Based on that, the authors categorize approaches into three different groups.
  • The third group combines a 2D (or 3D) CNN applied at individual (or stacks of) frames with a temporal sequence modeling.
  • Recurrent Neural Network (RNN) [26] is one of the most used networks for this task, which can take into account the temporal data using recurrent connections in hidden layers.

B. Fusion strategies

  • Information fusion is common in deep learning methods for action and gesture recognition.
  • At times, fusion is used to combine the information from parts of a segmented video sequence [51, 115], although it is more common to fuse information from multiple cues (e.g. RGB and motion, depth, and/or audio) [32], as well as to combine models trained with different data samples and learning parameters [68].
  • There are three main variants of information fusion in deep learning models: early (the data is fused before being fed into the model, or the model ingests information directly from multiple sources), late (the outputs of deep learning models are combined) and middle (intermediate layers fuse information) fusion [68, 69].
  • An example of the latter is shown in Fig. 2. Modifications and variants of these schemes have been proposed as well, for instance, see the variants introduced in [51] for fusing information in the temporal dimension.
  • Moreover, ensembles or stacked networks are also considered as fusion strategies [115, 105, 68].

D. Challenges

  • Every year computer vision organizations arrange competitions providing useful annotated datasets.
  • Table V shows 5 main challenge series in computer vision.
  • For each, the authors report the year in which it took place, the name of the dataset along with the task to be faced (either action- or gesture-related), the associated event’s name, the winning participant, and the most recent results on the challenge’s associated dataset.

III. ACTION/ACTIVITY RECOGNITION

  • This section reviews deep methods to address action recognition, divided by how they treat the temporal dimension: 3D convolutions, pre-computed motion features, or temporal modeling.
  • Their larger number of parameters w.r.t. 2D models makes them harder to train.
  • Other authors focused on further improving accuracy of 3D CNNs. [32] performs 3D convolutions over stacks of optical flow maps. [95] uses multiple 3D CNNs in a multi-stage (proposal generation, classification, and fine-grained localization) framework for temporal action localization in long untrimmed videos.
  • The authors also find 3D CNN models being combined with sequence modeling methods [7] or hand-crafted feature descriptors (VLAD [30] or iDTs [129]).

B. Motion-based features

  • Neural networks and CNNs based on hand and body pose estimation as well as motion features have been widely applied for gesture recognition.
  • For gesture style recognition in biometrics, [126] proposes a two-stream (spatio-temporal) CNN.
  • The authors use raw depth data as the input of the spatial network and optical flow as the input of the temporal one.
  • For articulated human pose estimation in videos the authors of [43] propose a Convolutional Network architecture for estimating the 2D location of human joints in video, with an RGB image and a set of motion features as the input data of this network.
  • The authors of [117] use three representations, the dynamic depth image (DDI), the dynamic depth normal image (DDNI) and the dynamic depth motion normal image (DDMNI), for gesture recognition.

D. Deep learning with fusion strategies

  • Some methods have used diverse fusion schemes to improve the performance of action recognition. [37] proposes a novel Subdivision-Fusion Model (SFM), where features extracted with a CNN are clustered and grouped into subcategories. [22] learns an end-to-end hierarchical RNN using skeleton data divided into five parts, each of which is fed into a different network.
  • The final decision is taken by a single-layer network. [99] faces the problem of first-person action recognition using a multi-stream CNN (ego-CNN, temporal, and spatial).
  • [119] focuses on the changes that an action brings into the environment and proposes a siamese CNN architecture to fuse precondition and effect information from the environment. [20] proposes a CNN which uses mid-level discriminative visual elements.
  • The method, called DeepPattern, is able to learn discriminative patches by exploring human body parts as well as scene context. [76] proposes DeepConvLSTM, based on convolutional and LSTM recurrent units, which is suitable for multimodal wearable sensors.

IV. GESTURE RECOGNITION

  • Gesture recognition in videos is mainly driven by the areas of human-computer, human-machine, and human-robot interaction.

A. 3D Convolutional Neural Networks

  • Several 3D CNNs have been proposed for gesture recognition, most notably [64, 41, 63]. [41] proposes a 3D CNN for sign language recognition.
  • The CNN automatically learns a representation from raw video, and processes multimodal information (RGB-D+Skeleton data).
  • Similar in spirit, [63] introduces a 3D CNN for driver hand gesture recognition from depth and intensity data. [64] extends a 3D CNN with a recurrent mechanism for detection and classification of dynamic hand gestures.
  • It consists of a 3D-CNN for spatiotemporal feature extraction, followed by a recurrent layer for global temporal modeling and a softmax layer for predicting class-conditional gesture probabilities.

D. Deep Learning with fusion strategies

  • Multimodality has been widely exploited for gesture recognition.
  • [124] proposes a semi-supervised hierarchical dynamic framework based on an HMM for simultaneous gesture segmentation and recognition using skeleton joint information, depth and RGB images.
  • Separate CNNs are considered for each modality at the beginning of the model structure with increasingly shared layers and a final prediction layer.
  • The authors exploited early and middle fusion methods to integrate the models. [54] proposes a CNN that learns to score pairs of input images and human poses.
  • The authors then calculate the score function as the dot product between the two embeddings, i.e., late fusion.

V. DISCUSSION

  • The authors presented a comprehensive overview of deep-based models for action and gesture recognition.
  • It has also been shown that training networks on precomputed motion features is an effective way to spare them from having to learn motion features implicitly.
  • Taking into account the full temporal scale results in a huge number of weights to learn.
  • Yet another trick to improve the result of deep-based models is data fusion.
  • One valuable cue is the spatial structure of actions/gestures. [112] takes advantage of iDTs to pool relevant CNN features along trajectories in video frames. [12] takes advantage of human body spatial constraints by aggregating convolutional activations of a 3D CNN into descriptors based on joint positions.


A survey on deep learning based approaches for
action and gesture recognition in image sequences

Maryam Asadi-Aghbolaghi (1,2,3), Albert Clapés (2,3), Marco Bellantonio (4), Hugo Jair Escalante (5),
Víctor Ponce-López (2,3,6), Xavier Baró (6), Isabelle Guyon (7), Shohreh Kasaei (1), Sergio Escalera (2,3)

(1) Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
(2) Department of Mathematics and Informatics, University of Barcelona, Barcelona, Spain
(3) Computer Vision Center, Autonomous University of Barcelona, Barcelona, Spain
(4) Facultat d’Informatica, Polytechnic University of Barcelona, Barcelona, Spain
(5) Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, Mexico
(6) EIMT, Open University of Catalonia, Barcelona, Spain
(7) Université Paris-Saclay, Paris, France
masadia@ce.sharif.edu
Abstract—The interest in action and gesture recognition has
grown considerably in the last years. In this paper, we present
a survey on current deep learning methodologies for action
and gesture recognition in image sequences. We introduce a
taxonomy that summarizes important aspects of deep learning
for approaching both tasks. We review the details of the proposed
architectures, fusion strategies, main datasets, and competitions.
We summarize and discuss the main works proposed so far with
particular interest in how they treat the temporal dimension of
the data, discussing their main features and identifying opportunities
and challenges for future research.
I. INTRODUCTION
Action and gesture recognition have been studied for a while
within the fields of computer vision and pattern recognition
and substantial progress has been reported for both tasks in
the last two decades. Recently, deep learning has burst into
these fields, achieving outstanding results and outperforming
“non-deep” state-of-the-art methods [97, 112, 31].
The temporal dimension in sequences typically makes action/gesture
recognition a challenging problem in terms of both the amount of data
to be processed and the model complexity, which are crucial aspects when
training large parametric deep learning networks. In this context, authors
have proposed several strategies, such as frame sub-sampling, aggregation
of local frame-level features into mid-level video representations, or
temporal sequence modeling, to name a few. For the latter, researchers
tried to exploit recurrent neural networks (RNN) in the past [108].
However, these models typically faced some major mathematical difficulties,
identified by Hochreiter [39] and Bengio et al. [9]. In 1997, these
efforts led to the development of long short-term memory
(LSTM) [40] cells for RNNs. Today, LSTMs are an important
part of deep models for image sequence modeling for human
action/gesture recognition [98, 92]. These, along with implicit
modeling of spatiotemporal features using 3D convolutional
nets [47], pre-computed motion-based features [97], and the
combination of multiple visual cues [98], resulted in fast and
reliable state-of-the-art methods for action/gesture recognition.
(An extended version of this paper will be made available as: Asadi-Aghbolaghi et al., "Deep learning for action and gesture recognition in image sequences: a survey". Book chapter in Springer Series on Challenges in Machine Learning, forthcoming 2018.)
Although the application of deep learning to action and
gesture recognition is relatively new, the amount of research
that has been generated in this topic within the last few years
is astounding. Even so, to the best of our knowledge, there is
no previous survey that collects and reviews all of the existing
work on deep learning for action and gesture recognition.
This paper aims at capturing a snapshot of current trends in
this direction, including an in-depth analysis of different deep
models, with special interest in how they treat the temporal
dimension of the data.
The remainder of this paper is organized as follows. Section II
presents a taxonomy of this field of research. Next,
Section III reviews the literature on human action/activity
recognition with deep learning models. Section IV summarizes
the state of the art on deep learning for gesture recognition.
Finally, Section V discusses the main features of the reviewed
deep learning methods for both studied problems.
II. TAXONOMY
Fig. 1 illustrates a taxonomy of the main works performing
action and gesture recognition using deep learning approaches.
Note that by recognition we refer to either the classification of
pre-segmented video segments or localization of actions in
long untrimmed videos.
A. Architectures
The most crucial challenge in deep-based human action
and gesture recognition is how to deal with the temporal
dimension. Based on that, we categorize approaches into
three different groups. The first group uses 3D filters in the
convolutional layer [7, 47, 58, 105]. The 3D convolution and
3D pooling in CNN layers allow capturing discriminative
features along both the spatial and temporal dimensions while
maintaining a certain temporal structure.

[Figure 1: Taxonomy of deep learning approaches for gesture and action recognition. Branches: 3D models (3D convolution and pooling) [48, 58, 102, 30, 61, 7, 129, 32, 95, 105]; motion-based input features [105, 97, 112, 34]; temporal methods: 2D models + RNN [130, 26] + LSTM [33, 7, 83, 21, 76, 71, 60], 2D models + B-RNN [98] + LSTM [83, 98], 2D models + H-RNN [22] + LSTM [22], 2D models + D-RNN + LSTM [106], 2D models + HMM [124]; 2D/3D models + auxiliary outputs [48]; 2D/3D models + hand-crafted features [112].]

In the second group,
motion features like 2D dense optical flow maps are pre-
computed and input to the networks [105, 97, 112, 34, 102,
122]. Extracted motion features can be fed to the network as
additional channels to the appearance ones [105] or input to
a secondary network (later combined with the former one) [97].
Fig. 2 illustrates these first two groups. The third group
combines a 2D (or 3D) CNN applied at individual (or stacks
of) frames with a temporal sequence modeling. Recurrent
Neural Network (RNN) [26] is one of the most used networks
for this task, which can take into account the temporal data
using recurrent connections in hidden layers. The drawback
of this network is its short-term memory, which is insufficient
for real-world actions. To solve this problem, Long Short-Term
Memory (LSTM) [33] was proposed, and it is usually used as a
hidden layer of an RNN, as seen in Fig. 2. Bidirectional RNN (B-
RNN) [83], Hierarchical RNN (H-RNN) [22], and Differential
RNN (D-RNN) [106] are some other successful extensions of
RNN in recognizing human actions. Other temporal modeling
tools like HMM are also applied [124].
For all methods in the three groups, the performance of a
deep model can be boosted by combination with hand-crafted
features, e.g. improved dense trajectories (iDT) [112].
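To make the first group concrete, the sketch below (PyTorch, with made-up layer sizes, not a specific architecture from the surveyed works) shows a minimal 3D CNN: convolution and pooling operate jointly over time, height, and width of a short clip.

```python
import torch
import torch.nn as nn

# Minimal 3D-CNN block (hypothetical sizes): convolution and pooling act on
# (time, height, width), so spatio-temporal features are learned jointly.
class Tiny3DCNN(nn.Module):
    def __init__(self, num_classes=10, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),      # pool only spatially first
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(2, 2, 2)),      # now pool over time as well
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):                          # clip: (B, C, T, H, W)
        x = self.features(clip)
        x = x.mean(dim=(2, 3, 4))                     # global average over T, H, W
        return self.classifier(x)

# A batch of two 16-frame RGB clips at 112x112 resolution
logits = Tiny3DCNN()(torch.randn(2, 3, 16, 112, 112))  # -> (2, 10)
```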
B. Fusion strategies
Information fusion is common in deep learning methods
for action and gesture recognition. At times, fusion is used
to combine the information from parts of a segmented video
sequence [51, 115], although it is more common to fuse
information from multiple cues (e.g. RGB and motion, depth,
and/or audio) [32], as well as to combine models trained with
different data samples and learning parameters [68].
There are three main variants of information fusion in deep
learning models: early (the data is fused before being fed into the model,
or the model ingests information directly from multiple sources),
late (the outputs of deep learning models are combined) and
middle (intermediate layers fuse information) fusion [68, 69].
An example of the latter is shown in Fig. 2. Modifications
and variants of these schemes have been proposed as well,
for instance, see the variants introduced in [51] for fusing
information in the temporal dimension. Moreover, ensembles
or stacked networks are also considered as fusion
strategies [115, 105, 68].
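The three fusion variants can be sketched roughly as follows (an illustrative two-stream setup with hypothetical layer sizes, not taken from any particular surveyed method):

```python
import torch
import torch.nn as nn

# Two illustrative streams (e.g. appearance and motion), hypothetical sizes.
def make_stream(in_ch):
    return nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

rgb_stream, flow_stream = make_stream(3), make_stream(2)
rgb, flow = torch.randn(4, 3, 112, 112), torch.randn(4, 2, 112, 112)

# Late fusion: each stream has its own classifier; their scores are averaged.
cls_rgb, cls_flow = nn.Linear(32, 20), nn.Linear(32, 20)
late = (cls_rgb(rgb_stream(rgb)).softmax(-1) + cls_flow(flow_stream(flow)).softmax(-1)) / 2

# Middle fusion: intermediate features are concatenated and fed to a shared head.
shared_head = nn.Linear(64, 20)
middle = shared_head(torch.cat([rgb_stream(rgb), flow_stream(flow)], dim=1))

# Early fusion would instead stack the raw inputs channel-wise before the first layer.
early_input = torch.cat([rgb, flow], dim=1)            # (4, 5, 112, 112)
```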
[Figure 2: The different architectures and fusion strategies. Top-left: 3D convolution. Top-right: motion pre-computation. Bottom-left: sequential modeling via LSTM. Bottom-right: fusion into a spatio-temporal stream.]
C. Datasets
We list the most relevant datasets for action (or
activity) and gesture recognition in Tables I and II, respectively.
For each dataset we specify the year of creation, the problems for
which the dataset was defined (either classification or
temporal/spatiotemporal localization), the data modalities available for
the task, the involved body parts, the number of classes, and
state-of-the-art performances to date (which provide a hint of
how difficult the datasets are).
Table III and Table IV summarize recent approaches
which obtained remarkable results on two of the most
well-known and challenging datasets in action recognition,
respectively UCF-101 and THUMOS-14. Reviewing the top-ranked
methods on the UCF-101 dataset, we find that the most significant
difference among them is the strategy for splitting video data
and combining sub-sequence results. [119] encodes the changes
in the environment by dividing the input sequence into two
parts, pre-condition and effect, and models the actions as a linear
transformation from one to the other.

Table I: Action datasets

Year | Dataset | Problem | Body Parts | Modality | No. classes | Performance
2004 | KTH | AC | F | I | 6 | 98.67% Acc [136]
2006 | IXMAS | AC | F | RGB, A | 13 | 98.79% Acc [104]
2007 | HDM05 | AC | F | S | 100 | 98.17% Acc [14]
2008 | HOHA (Hollywood 1) | AC, TL | F, U, L | RGB | 8 | 71.90% Acc [91], 0.787@0.5 mAP [62]
2008 | UCF Sports | AC, STL | F | RGB | 10 | 95.80% Acc [94], 0.789@0.5 mAP [62]
2009 | Hollywood 2 | AC | F, U, L | RGB | 12 | 78.50 mAP [56]
2009 | UCF11 (YouTube Action) | AC, STL | F | RGB | 11 | 93.77% Acc [82], -
2010 | Highfive | AC, STL | F, U | RGB | 4 | 69.40 mAP [109], 0.466 IoU [6]
2010 | MSRAction3D | AC | F | D, S | 20 | 97.30% Acc [59]
2010 | MSRAction II | STL | F | RGB | 3 | 85.00@0.125% mAP [17]
2010 | Olympic Sports | AC | F | RGB | 16 | 96.60% Acc [55]
2011 | Collective Activity (Extended) | AC | F | RGB | 6 | 90.23% Acc [5]
2011 | HMDB51 | AC | F, U, L | RGB | 51 | 73.60% Acc [110]
2012 | MPII Cooking | AC, TL | F, U | RGB | 65 | 72.40 mAP [137], -
2012 | MSRDailyActivity3D | AC | F, U | RGB, D, S | 16 | 97.50% Acc [93]
2012 | UCF101 | AC, TL | F, U, L | RGB | 101 | 94.20% Acc [115], 46.77@0.2 mAP (split 1) [122]
2012 | UCF50 | AC | F, U, L | RGB | 50 | 97.90% Acc [24]
2012 | UTKinect-Action3D | AC | F | RGB, D, S | 10 | 98.80% Acc [52]
2013 | J-HMDB | AC, STL | F, U, L | RGB, S | 21 | 71.08 Acc [79], 73.1@0.5 mAP [91]
2013 | Berkeley MHAD | AC | F | RGB, D, S, A | 11 | 100.00% Acc [14]
2014 | N-UCLA Multiview Action3D | AC | F | RGB, D, S | 10 | 90.80% Acc [52]
2014 | Sports 1-Million | AC | F, U, L | RGB | 487 | 73.10% Acc [133]
2014 | THUMOS-14 | AC, TL | F, U, L | RGB | 101, 20 * | 71.60 mAP [46], 0.190@0.5 mAP [95]
2015 | THUMOS-15 | AC, TL | F, U, L | RGB | 101, 20 * | 80.80 mAP [55], 0.183@0.5 mAP (a)
2015 | ActivityNet | AC, TL | F, U, L | RGB | 200 | 93.23 mAP (b), 0.594@0.5 mAP [65]
2016 | NTU RGB+D | AC | F | RGB, D, S, IR | 60 | {69.20, 77.70}^1 Acc [57]

Problems: action classification (AC), temporal localization (TL), and spatiotemporal localization (STL). Body parts: full
body (F), upper body (U), and lower body (L). Modalities: audio (A), depth (D), grayscale intensity (I), infrared (IR),
skeleton (S), and color (RGB).
Performance: Acc (accuracy), mAP (mean average precision), IoU (intersection-over-union).
* A different no. of classes is used for different problems. For TL/STL, “@” indicates the amount of overlap with the ground truth
considered for a positive localization.
(a) Winner method from (http://activity-net.org/challenges/2016/program.html#leaderboard).
(b) Winner method from http://www.thumos.info/results.html.
^1 {cross-subject accuracy, cross-view accuracy}.
Table II: Gesture datasets

Year | Dataset | Problem | Body Parts | Modality | No. classes | Performance
2011 | ChaLearn Gesture | GC | F, U | RGB, D | 15 | -
2012 | MSR-Gesture3D | GC | F, H | RGB, D | 12 | 98.50% Acc [16]
2014 | ChaLearn (Track 3) | GC, TL | U | RGB, D, S | 20 | 98.20 Acc [64], 0.870 IoU [69]
2015 | VIVA Hand Gesture | GC | H | RGB | 19 | 77.50% Acc [63]
2016 | ChaLearn conGD | TL | U | RGB, D | 249 | 0.315 IoU [11]
2016 | ChaLearn isoGD | GC | U | RGB, D | 249 | 67.19% Acc [23]

Problems: gesture classification (GC). Body parts: hands (H).
See also the table notes of Table I for additional notation.
[55] processes the input video as a hierarchical structure over time at 3 levels,
i.e. short-term, medium-range, and long-range. [105] achieves
good performance by using a two-stream network (RGB and
motion) with an extended temporal resolution with respect to previous
works (from 16 to 60 frames). [135] obtains the best
accuracy on UCF101 by using trajectory pooling to pool the
convolutional features extracted from the optical flow
nets of two-stream ConvNets and the frame-diff layers of the
spatial network to obtain local descriptors.
Looking at the top-ranked deep models on the THUMOS 2014
challenge, almost all winning methods combined appearance
and motion features. For appearance, most of the methods
extract frame-level CNN descriptors, and a video representation
is generated using a pooling method over the sequence. On
the other hand, the motion-based approaches in the top-ranked
methods can be divided into three groups: FlowNet, 3D CNN,
and iDTs. In [84], we provide a comparison of those, showing
that the 3D CNN achieves the best result.
D. Challenges
Every year, computer vision organizations arrange competitions
providing useful annotated datasets. Table V shows 5
main challenge series in computer vision. For each, we report
the year in which it took place, the name of the dataset along
with the task to be faced (either action- or gesture-related), the
associated event’s name, the winning participant, and the most
recent results on the challenge’s associated dataset.
Table III: UCF-101 dataset results

Ref. | Year | Features | Architecture | Score
[135] | 2015 | CNN, IDT | 2 CNN + iDT pooling | 93.78%
[105] | 2016 | Opt. Flow, 3D CNN, IDT | LTC-CNN | 92.7%
[32] | 2016 | conv5, 3D pool | VGG-16, VGG-M, 3D CNN | 92.5%
[119] | 2016 | CNN | Siamese VGG-16 | 92.4%
[55] | 2016 | CNN fc7 | 2 CNNs (spatial + temporal) | 92.2%
[112] | 2015 | CNN, Hog/Hof/Mbh | 2-stream CNN | 91.5%
[61] | 2015 | CNN feat | 3D CNN | 89.7%
[10] | 2016 | Dynamic feat maps | BVLC CaffeNet | 89.1%
[46] | 2015 | H/H/M, IDT, FV+PCA+GMM | 8-layer CNN | 88.5%
[102] | 2015 | CNN | FstCN: 2 CNNs (spatial + temporal) | 88.1%
[97] | 2014 | CNN | Two-stream CNN (CNN-M-2048) | 88.0%
[60] | 2016 | eLSTM, DCNN fc7 | eLSTM, DCNN+LSTM | 86.9%
[134] | 2016 | CNN | 2 CNNs (spatial + temporal) | 86.4%
[129] | 2016 | dense trajectory, C3D | RNN, LSTM, 3D CNN | 85.4%
[78] | 2015 | CNN fc6, HOG/HOF/MBH | VGG19 Conv5 | 79.52% ± 1.1% (tr2)
[78] | 2015 | CNN fc6, HOG/HOF/MBH | VGG19 Conv5 | 66.64% (tr1)
[51] | 2014 | CNN features | 2 CNNs converging to 2 fc layers | 65.4%, 68% mAP
[45] | 2015 | ImageNet CNN, word2vec GMM | CNN | 63.9%
[122] | 2015 | CNN | Spatial + motion CNN | 54.28% mAP
Table IV: THUMOS-14 dataset results

Ref. | Year | Features | Architecture | Score
[46] | 2015 | H/H/M, IDT, FV+PCA+GMM | 8-layer CNN | 71.6%
[134] | 2016 | CNN | 2 CNNs (spatial + temporal) | 61.5%
[45] | 2015 | ImageNet CNN, word2vec GMM | CNN | 56.3%
[95] | 2016 | CNN fc6, fc7, fc8 | 3D CNN, Segment-CNN | 19% mAP
[130] | 2015 | CNN fc7 | VGG-16, 3-layer LSTM | 17.1% mAP
[30] | 2016 | fc7 3D CNN | C3D CNN net | 0.084% mAP@50, 0.121% mAP@100, 0.139% mAP@200, 0.125% mAP@500
III. ACTION/ACTIVITY RECOGNITION
This section reviews deep methods to address action recognition,
divided by how they treat the temporal dimension:
3D convolutions, pre-computed motion features, or temporal
modeling.
A. 3D Convolutional Neural Networks
In order to capture temporal information, one approach
consists in extending the convolution along the temporal axis,
in what is known as 3D CNN [47, 7, 103, 61, 58, 102, 30, 129,
32, 95]. Their larger number of parameters w.r.t. 2D models
makes them harder to train. To alleviate this problem, [61]
initializes the weights of a 3D CNN by using 2D weights
learned from ImageNet, while [102] proposes a 3D CNN
(FstCN) that factorizes the 3D convolutional kernel learning
as a sequential process of learning 2D spatial and 1D temporal
kernels in different layers. Other authors focused on further
improving the accuracy of 3D CNNs. [32] performs 3D convolutions
over stacks of optical flow maps. [95] uses multiple
3D CNNs in a multi-stage (proposal generation, classification,
and fine-grained localization) framework for temporal
action localization in long untrimmed videos. We also find
3D CNN models being combined with sequence modeling
methods [7] or hand-crafted feature descriptors (VLAD [30]
or iDTs [129]).
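The factorization idea attributed to [102] can be illustrated as below: a 2D spatial convolution applied frame by frame, followed by a 1D convolution across time. Layer sizes are made up and this is only a sketch of the decomposition, not the exact FstCN architecture.

```python
import torch
import torch.nn as nn

class Factorized3DConv(nn.Module):
    """Rough sketch: 3D convolution learning split into a 2D spatial kernel
    followed by a 1D temporal kernel (made-up sizes, not the exact FstCN layers)."""
    def __init__(self, in_ch=3, mid_ch=32, out_ch=32):
        super().__init__()
        self.spatial = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)
        self.temporal = nn.Conv1d(mid_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, clip):                          # clip: (B, C, T, H, W)
        B, C, T, H, W = clip.shape
        x = self.spatial(clip.transpose(1, 2).reshape(B * T, C, H, W))  # per-frame 2D conv
        x = x.reshape(B, T, -1, H, W).permute(0, 3, 4, 2, 1)            # (B, H, W, C', T)
        x = self.temporal(x.reshape(-1, x.shape[3], T))                 # 1D conv over time
        return x.reshape(B, H, W, -1, T).permute(0, 3, 4, 1, 2)         # (B, C'', T, H, W)

out = Factorized3DConv()(torch.randn(2, 3, 8, 56, 56))   # -> (2, 32, 8, 56, 56)
```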
B. Motion-based features
In recent years, many approaches have focused on incorporating
pre-computed temporal features within the deep model,

Table V: Challenges

Challenge | Year | Dataset | Task | Event | Winner | Results
ChaLearn | 2012 | CGD | G | - | Alfnie [53]* | [27]
ChaLearn | 2013 | Montalbano | G | - | [125] | [8]
ChaLearn | 2014 | HuPBA 8K+ | A | ECCV | [80] | -
ChaLearn | 2014 | Montalbano | G | ECCV | [68] | [83][69][96]
ChaLearn | 2015 | HuPBA 8K+ | A | CVPR | [121] | -
ChaLearn | 2016 | isoGD, conGD | G | ICPR | [13] | [51], [117]
HAL | 2012 | LIRIS | A | ICPR | [70] | [123][35]*
Opportunity | 2011 | Opportunity | A | - | CSTA | [90][15][89]
ROSE | 2016 | NTU RGB+D | A | ACCV | SEARCH | [92]
THUMOS | 2013 | UCF101 | A | ICCV | [75] | [101][100][81][50]
THUMOS | 2014 | THUMOS-14 | A | ECCV | [44] | [46][95][111][88]
THUMOS | 2015 | THUMOS-15 | A | CVPR | [128] | [114][132]
VIVA | 2015 | VIVA | G | CVPR | [63] | [63][74]
VIRAT | 2012 | VIRAT DB | A | CVPR | - | [107][73]
* Non-deep learning method
e.g. dense optical flow maps or iDTs. [122] detects frame
proposals and scores them with a combination of static and
motion CNN features for action localization. [97] presents a
two-stream CNN which incorporates both spatial (still image)
and temporal networks (multi-frame dense optical flow). [134]
exploits motion vectors from video compression (instead
of optical flow). [34] localizes actions in space and time
using a (spatial-temporal) two-stream CNN whose predictions
are late-fused with an SVM. [105] extends the convolutions in
time, aiming at capturing long-term temporal convolutions, at
the expense of spatial resolution. [116] uses view-invariant multi-scale
depth maps as an input motion descriptor for a CNN. [51]
proposes a multi-scale foveated CNN for large-scale video
classification. Differently, [85] uses CNNs to obtain canonical
human poses for action recognition. [113] simply estimates
actionness maps from appearance and motion cues. In the
same line, [138] introduces a deep-learning method to identify
key volumes and classify them simultaneously.
In the literature there exist several methods which extend
the CNN capabilities using trajectory features. [112] pools
and normalizes CNN feature maps along improved dense
trajectories. [78] concatenates iDTs (HOG, HOF, MBHx,
MBHy descriptors with fisher vector encoding) and CNN
feature (VGG19) descriptors. [86] presents a Robust Nonlinear
Knowledge Transfer Model (R-NKTM) based on a deep
fully-connected network that transfers human actions from any
view to a canonical one. R-NKTM is learned using bag-of-features
from dense trajectories of synthetic 3D human models
and generalizes to real videos of human actions. [120] builds on
iDT descriptors and two-stream CNN features, using a non-action
classifier to down-weight irrelevant video segments.
[18] presents a new pose-based CNN descriptor which aggregates
motion and appearance information along tracks of
human body parts. [55] proposes VLAD^3 to model long-range
dynamic information. It captures short-term dynamics with
deep CNN features, relying on linear dynamic systems (LDS)
to model medium-range dynamics.
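As a concrete illustration of feeding pre-computed motion to a 2D network, the snippet below stacks L dense optical flow maps (x and y components) into a single multi-channel input, in the spirit of the temporal stream of [97]; the exact preprocessing in the surveyed works may differ.

```python
import numpy as np

def stack_flow(flow_maps):
    """flow_maps: list of L dense optical flow fields, each of shape (H, W, 2)
    (x and y displacement). Returns a (2L, H, W) array usable as the input of a
    2D CNN temporal stream (illustrative layout only)."""
    channels = []
    for flow in flow_maps:
        channels.append(flow[..., 0])   # horizontal displacement
        channels.append(flow[..., 1])   # vertical displacement
    return np.stack(channels, axis=0)

# Example: 10 flow maps of a 224x224 video -> a 20-channel "image"
flows = [np.random.randn(224, 224, 2).astype(np.float32) for _ in range(10)]
print(stack_flow(flows).shape)          # (20, 224, 224)
```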
C. Temporal deep learning models: RNN and LSTM
We also find approaches which combine CNN with temporal
sequence modeling techniques, such as RNNs or LSTMs.
[106] introduces a differential gating scheme for LSTMs to
emphasize the change in information gain caused by the
salient motions between successive frames. [71] proposes an
RNN to perform interactional parsing of objects. The object
parsings are used to form object-specific action representations
for fine-grained action detection. [98] presents a multi-stream
bi-directional RNN. A tracking algorithm locates a bounding
box around a person and two streams (motion and appearance)
cropped to the tracked bounding box are trained along with
full-frame streams. The CNN is followed by a bidirectional
LSTM layer. [130] introduces a fully end-to-end approach
based on an RNN agent. The agent observes video frames and
decides both where to look next and when to emit a prediction.
[60] proposes a deep architecture which uses 3D skeleton
sequences to regularize an LSTM network (LSTM+CNN) on
the video frames.
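The third family (per-frame 2D CNN features combined with a recurrent model) can be sketched as below; the tiny CNN is a stand-in for the deep backbones actually used, and all sizes are hypothetical.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, num_classes=10, feat_dim=64):
        super().__init__()
        # Toy per-frame feature extractor (stand-in for a deep 2D CNN backbone).
        self.cnn = nn.Sequential(nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(feat_dim, 128, batch_first=True)
        self.head = nn.Linear(128, num_classes)

    def forward(self, video):                      # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1)).reshape(B, T, -1)  # (B, T, feat_dim)
        hidden, _ = self.lstm(feats)                             # (B, T, 128)
        return self.head(hidden)                                 # per-frame class scores

scores = CNNLSTM()(torch.randn(2, 16, 3, 112, 112))              # -> (2, 16, 10)
```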
D. Deep learning with fusion strategies
Some methods have used diverse fusion schemes to improve
the performance of action recognition. [37] proposes
a novel Subdivision-Fusion Model (SFM), where features
extracted with a CNN are clustered and grouped into subcategories.
[22] learns an end-to-end hierarchical RNN using
skeleton data divided into five parts, each of which is fed into
a different network. The final decision is taken by a single-layer
network. [99] faces the problem of first-person action recognition
using a multi-stream CNN (ego-CNN, temporal, and
spatial). [119] focuses on the changes that an action brings into
the environment and proposes a siamese CNN architecture to
fuse precondition and effect information from the environment.
[20] proposes a CNN which uses mid-level discriminative
visual elements. The method, called DeepPattern, is able to
learn discriminative patches by exploring human body parts as
well as scene context. [76] proposes DeepConvLSTM, based
on convolutional and LSTM recurrent units, which is suitable
for multimodal wearable sensors.
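A rough sketch of the hierarchical scheme described for [22]: each of the five skeleton parts is processed by its own recurrent network, the part-level features are fused, and a final single-layer classifier makes the decision. All dimensions and the exact fusion point are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class HierarchicalSkeletonRNN(nn.Module):
    """Illustrative sketch of a [22]-style hierarchy: one RNN per body part,
    fused part features, single-layer classifier (made-up sizes)."""
    def __init__(self, num_classes=60, joints_per_part=5, hidden=32):
        super().__init__()
        # Five body parts: two arms, two legs, trunk.
        self.part_rnns = nn.ModuleList(
            [nn.LSTM(joints_per_part * 3, hidden, batch_first=True) for _ in range(5)])
        self.fusion_rnn = nn.LSTM(5 * hidden, 2 * hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, parts):                  # parts: list of 5 tensors (B, T, joints*3)
        part_feats = [rnn(p)[0] for rnn, p in zip(self.part_rnns, parts)]
        fused, _ = self.fusion_rnn(torch.cat(part_feats, dim=-1))
        return self.classifier(fused[:, -1])   # decision from the last time step

parts = [torch.randn(4, 30, 15) for _ in range(5)]   # 30 frames, 5 joints x 3D coords
print(HierarchicalSkeletonRNN()(parts).shape)        # torch.Size([4, 60])
```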
IV. GESTURE RECOGNITION
In this section we review recent deep-learning approaches
for gesture recognition in videos, mainly driven by the areas
of human computer, machine, and robot interaction.
A. 3D Convolutional Neural Networks
Several 3D CNNs have been proposed for gesture recognition,
most notably [64, 41, 63]. [41] proposes a 3D CNN
for sign language recognition. The CNN automatically learns
a representation from raw video, and processes multimodal
information (RGB-D+Skeleton data). Similar in spirit, [63]
introduces a 3D CNN for driver hand gesture recognition
from depth and intensity data. [64] extends a 3D CNN with
a recurrent mechanism for detection and classification of
dynamic hand gestures. It consists of a 3D CNN for spatiotemporal
feature extraction, followed by a recurrent layer for
global temporal modeling and a softmax layer for predicting
class-conditional gesture probabilities.

B. Motion-based features
Neural networks and CNNs based on hand and body pose
estimation as well as motion features have been widely applied
for gesture recognition. For gesture style recognition
in biometrics, [126] proposes a two-stream (spatio-temporal)
CNN. The authors use raw depth data as the input of the spatial
network and optical flow as the input of the temporal one. For
articulated human pose estimation in videos the authors of
[43] propose a Convolutional Network (ConvNet) architecture
for estimating the 2D location of human joints in video, with
an RGB image and a set of motion features as the input data
of this network. The authors of [117] use three representations
of dynamic depth image (DDI), dynamic depth normal image
(DDNI) and dynamic depth motion normal image (DDMNI)
for gesture recognition.
[118] first identifies the start and end frames of each
gesture based on the quantity of movement (QOM), and then
constructs an Improved Depth Motion Map (IDMM) by calculating
the absolute depth difference between the current frame and the
start frame of each gesture segment, which is used as the input to the deep
network.
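A minimal sketch of the depth-difference image described for [118]: accumulating the absolute depth difference between every frame of a segmented gesture and its start frame, then rescaling the result to an image. The exact IDMM construction and normalization in [118] may differ.

```python
import numpy as np

def depth_motion_map(depth_frames):
    """Illustrative IDMM-like map for one gesture segment: accumulate the absolute
    depth difference between every frame and the start frame. The exact construction
    and normalization used in [118] may differ."""
    start = depth_frames[0].astype(np.float32)
    acc = np.zeros_like(start)
    for frame in depth_frames[1:]:
        acc += np.abs(frame.astype(np.float32) - start)
    # Scale to [0, 255] so it can be fed to an image-based CNN.
    return (255 * acc / max(acc.max(), 1e-6)).astype(np.uint8)

segment = [np.random.randint(0, 4000, (240, 320), dtype=np.uint16) for _ in range(40)]
idmm_like = depth_motion_map(segment)          # (240, 320) single-channel image
```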
C. Temporal deep learning models: RNN and LSTM
This kind of model has not been widely used for gesture
recognition, despite being a promising avenue for research. We
are aware of [67], where the authors propose a multimodal
(depth video, skeleton, and speech) human gesture recognition
system based on RNN. [25] presents a Convolutional Long
Short-Term Memory Recurrent Neural Network (CNNLSTM)
able to successfully learn gestures varying in duration and complexity.
[72] proposes a multi-stream model, called MRNN,
which extends RNN capabilities with LSTM cells in order to
facilitate the handling of variable-length gestures.
D. Deep Learning with fusion strategies
Multimodality has been widely exploited for gesture recognition.
[124] proposes a semi-supervised hierarchical dynamic
framework based on an HMM for simultaneous gesture segmentation
and recognition using skeleton joint information,
depth and RGB images. The authors applied intermediate
(middle) and late fusion to get the final result. [69] proposes a
multimodal multi-stream CNN for gesture spotting. Separate
CNNs are considered for each modality at the beginning of
the model structure with increasingly shared layers and a final
prediction layer. The authors fuse the result of each network
by a meta-classifier independently at each scale; i.e., late
fusion. [77] presents a deep learning model to fuse multiple
information sources for human pose estimation. The deep
model takes as input the output of a state-of-the-art human
pose estimator. The authors exploited early and middle fusion
methods to integrate the models.
[54] proposes a CNN that learns to score pairs of input
images and human poses (joints). The model is formed by
two subnetworks: a CNN learns a feature embedding for
the input images, and a two layer subnetwork learns an
embedding for the human pose. The authors then calculate the
score function as the dot product between the two embeddings,
i.e., late fusion. Similarly, [43] proposes a CNN for estimating
2D joint locations. The CNN incorporates RGB images and
motion features. The authors utilize early fusion to integrate
these two kinds of features.
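The pair-scoring scheme attributed to [54] can be sketched as two sub-networks, one embedding the image and one embedding the pose, fused late through a dot product; the layer sizes and number of joints below are made-up placeholders.

```python
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    """Sketch of a [54]-style scorer: CNN embedding for the image, small MLP
    embedding for the pose, fused late via a dot product (made-up sizes)."""
    def __init__(self, num_joints=14, embed_dim=64):
        super().__init__()
        self.image_net = nn.Sequential(nn.Conv2d(3, embed_dim, 3, padding=1), nn.ReLU(),
                                       nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.pose_net = nn.Sequential(nn.Linear(num_joints * 2, 128), nn.ReLU(),
                                      nn.Linear(128, embed_dim))

    def forward(self, image, pose):            # image: (B, 3, H, W), pose: (B, J*2)
        return (self.image_net(image) * self.pose_net(pose)).sum(dim=1)  # (B,) scores

scores = PairScorer()(torch.randn(8, 3, 112, 112), torch.randn(8, 28))
```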
V. DISCUSSION
We presented a comprehensive overview of deep-based
models for action and gesture recognition. We defined a
taxonomy covering most of the basic and crucial information
about human action and gesture analysis, and then we reviewed
recent methods. Key topics identified include architectures,
fusion strategies, datasets, and challenges.
Generally, there are two main issues when comparing the
methods: how does a method deal with temporal information,
and how can such a large network be trained with small
datasets? As discussed, methods can learn motion features
with the 3D filters in their 3D convolutional and pooling layers. It
has been shown that 3D networks over a long sequence are
able to learn complex temporal patterns [105]. Because of the
required amount of data, the problem of weight initialization
has been investigated. Transforming 2D convolutional
weights into 3D ones yields models that achieve better accuracy
than training from scratch [61].
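A common way to realize the 2D-to-3D weight transformation mentioned above is to replicate each pre-trained 2D kernel along the temporal axis and rescale it so activations keep roughly the same magnitude. The sketch below shows this generic inflation recipe, which is not necessarily the exact procedure used in [61].

```python
import torch

def inflate_2d_to_3d(weight_2d, temporal_size=3):
    """weight_2d: (out_ch, in_ch, kH, kW) kernel from a pre-trained 2D CNN.
    Returns a (out_ch, in_ch, T, kH, kW) kernel obtained by repeating the 2D kernel
    T times along the temporal axis and dividing by T so the response magnitude is
    roughly preserved. Generic inflation recipe; [61] may use a different scheme."""
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, temporal_size, 1, 1)
    return weight_3d / temporal_size

w2d = torch.randn(64, 3, 7, 7)        # e.g. the first conv layer of an ImageNet model
w3d = inflate_2d_to_3d(w2d)           # (64, 3, 3, 7, 7), ready for an nn.Conv3d
```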
It has also been shown that training networks on pre-computed
motion features is an effective way to spare them
from having to learn motion features implicitly. Moreover, fine-tuning
motion-based networks with spatial data (ImageNet)
proved to be effective, allowing networks which are fine-tuned
on stacked optical flow frames to achieve good performance
in spite of having limited training data.
Still, both groups can only exploit limited (local) temporal
information. Hence, the most crucial advantage of approaches
in the third group (i.e. temporal models like RNN, LSTM) is
that they are able to cope with longer-range temporal relations.
These models are mostly used with, and preferred for, skeletal data. Since
skeleton features are low-dimensional, these networks have
fewer weights and, thus, can be trained with less data.
Regardless of the model, performance is dependent on the
amount of data. The community is nowadays putting efforts
on building larger data sets that can cope with huge parametric
deep models (e.g. [2, 38]) and on challenge organization (with
novel data sets and well-defined evaluation protocols) that can
advance the state of the art of the field and make the
comparison among deep learning architectures easier (e.g. [92, 29]).
Strategies for data augmentation and pre-training are common.
Likewise, training mechanisms to avoid overfitting (e.g.
dropout) and to control the learning rate (e.g. extensions to
SGD and Nesterov momentum) have been proposed. Taking
into account the full temporal scale results in a huge number
of weights to learn. To address this problem and decrease
the number of weights, a good trick is to decrease the spatial
resolution while increasing the temporal length.
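A back-of-the-envelope illustration of this trade-off, with made-up layer sizes: if a fully connected layer sits on top of the last feature map, its weight count grows with the product T x H x W, so halving the spatial resolution roughly compensates for a four-fold longer temporal window.

```python
# Hypothetical sizes: weights of a fully connected layer placed after the last
# conv feature map grow with channels * T * H * W * fc_units.
def fc_weights(channels, t, h, w, fc_units=2048):
    return channels * t * h * w * fc_units

base     = fc_weights(64, t=16, h=28, w=28)   # short clip, full spatial resolution
extended = fc_weights(64, t=64, h=14, w=14)   # 4x longer clip, half the resolution
print(base, extended)                          # both equal 1,644,167,168 weights
```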
Yet another trick to improve the result of deep-based models
is data fusion. Individual networks can be trained on different

Citations
Journal ArticleDOI
TL;DR: A detailed overview of recent advances in RGB-D-based motion recognition is presented in this paper, where the reviewed methods are broadly categorized into four groups, depending on the modality adopted for recognition: RGB-based, depth based, skeleton-based and RGB+D based.

270 citations

Book ChapterDOI
08 Sep 2018
TL;DR: A novel weakly-supervised TAL framework called AutoLoc is developed to directly predict the temporal boundary of each action instance and a novel Outer-Inner-Contrastive (OIC) loss is proposed to automatically discover the needed segment-level supervision for training such a boundary predictor.
Abstract: Temporal Action Localization (TAL) in untrimmed video is important for many applications. But it is very expensive to annotate the segment-level ground truth (action class and temporal boundary). This raises the interest of addressing TAL with weak supervision, namely only video-level annotations are available during training). However, the state-of-the-art weakly-supervised TAL methods only focus on generating good Class Activation Sequence (CAS) over time but conduct simple thresholding on CAS to localize actions. In this paper, we first develop a novel weakly-supervised TAL framework called AutoLoc to directly predict the temporal boundary of each action instance. We propose a novel Outer-Inner-Contrastive (OIC) loss to automatically discover the needed segment-level supervision for training such a boundary predictor. Our method achieves dramatically improved performance: under the IoU threshold 0.5, our method improves mAP on THUMOS’14 from 13.7% to 21.2% and mAP on ActivityNet from 7.4% to 27.3%. It is also very encouraging to see that our weakly-supervised method achieves comparable results with some fully-supervised methods.

261 citations


Cites background from "A Survey on Deep Learning Based App..."

  • ...Video Action Analysis Detailed reviews can be found in recent surveys [65,42,2,9,3,31]....


Proceedings ArticleDOI
15 Jun 2019
TL;DR: This work identifies two underexplored problems posed by the weak supervision for temporal action localization, namely action completeness modeling and action-context separation, and proposes a multi-branch neural network in which branches are enforced to discover distinctive action parts.
Abstract: Temporal action localization is crucial for understanding untrimmed videos. In this work, we first identify two underexplored problems posed by the weak supervision for temporal action localization, namely action completeness modeling and action-context separation. Then by presenting a novel network architecture and its training strategy, the two problems are explicitly looked into. Specifically, to model the completeness of actions, we propose a multi-branch neural network in which branches are enforced to discover distinctive action parts. Complete actions can be therefore localized by fusing activations from different branches. And to separate action instances from their surrounding context, we generate hard negative data for training using the prior that motionless video clips are unlikely to be actions. Experiments performed on datasets THUMOS'14 and ActivityNet show that our framework outperforms state-of-the-art methods. In particular, the average mAP on ActivityNet v1.2 is significantly improved from 18.0% to 22.4%. Our code will be released soon.

219 citations


Cites background from "A Survey on Deep Learning Based App..."

  • ...Please refer to recent surveys [1, 3, 22, 19] for a detailed review....


Journal ArticleDOI
TL;DR: A comprehensive survey of deep learning based human pose estimation methods and analyzes the methodologies employed and summarizes and discusses recent works with a methodology-based taxonomy.

216 citations


Cites background from "A Survey on Deep Learning Based App..."

  • ...Asadi-Aghbolaghi et al.[23] surveyed deep learning based approaches for action and gesture recognition in image sequences, and discussed deep learning techniques applied to action and gesture recognition....


Journal ArticleDOI
TL;DR: Most computer vision applications such as human computer interaction, virtual reality, security, video surveillance and home monitoring are highly correlated to HAR tasks, which establishes new trend and milestone in the development cycle of HAR systems.
Abstract: Human activity recognition (HAR) systems attempt to automatically identify and analyze human activities using acquired information from various types of sensors. Although several extensive review papers have already been published in the general HAR topics, the growing technologies in the field as well as the multi-disciplinary nature of HAR prompt the need for constant updates in the field. In this respect, this paper attempts to review and summarize the progress of HAR systems from the computer vision perspective. Indeed, most computer vision applications such as human computer interaction, virtual reality, security, video surveillance and home monitoring are highly correlated to HAR tasks. This establishes new trend and milestone in the development cycle of HAR systems. Therefore, the current survey aims to provide the reader with an up to date analysis of vision-based HAR related literature and recent progress in the field. At the same time, it will highlight the main challenges and future directions.

184 citations


Cites background from "A Survey on Deep Learning Based App..."

  • ...Reviews on deep-learning based methods of human activities recognition were provided in [16, 132, 210, 225]....


References
Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O. 1. Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

72,897 citations


"A Survey on Deep Learning Based App..." refers background in this paper

  • ...[72] proposes a multi-stream model, called MRNN, which extends RNN capabilities with LSTM cells in order to facilitate the handling of variable-length gestures....


  • ...Hence, the most crucial advantage of approaches in the third group (i.e. temporal models like RNN, LSTM) is that they are able to cope with longer-range temporal relations....


  • ...[42] uses an LSTM to model each individual’s actions and a second-level LSTM aggregates the outputs of individual LSTMs....


  • ...To solve this problem Long Short-Term Memory (LSTM) [33] was proposed, and it is usually used as a hidden layer of RNN, as seen in Fig....


  • ...We are aware of [67], where the authors propose a multimodal (depth video, skeleton, and speech) human gesture recognition system based on RNN. [25] presents a Convolutional Long Short-Term Memory Recurrent Neural Network (CNNLSTM) able to successfully learn gesture varying in duration and complexity....


Posted Content
TL;DR: Caffe as discussed by the authors is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Abstract: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU ($\approx$ 2.5 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.

12,531 citations

Journal ArticleDOI
TL;DR: A proposal along these lines first described by Jordan (1986) which involves the use of recurrent links in order to provide networks with a dynamic memory and suggests a method for representing lexical categories and the type/token distinction is developed.

10,264 citations


"A Survey on Deep Learning Based App..." refers background or methods in this paper

  • ...[130] introduces a fully end-to-end approach based on a RNN agent....


  • ...Temporal methods 2D Models + RNN [130, 26] + LSTM [33, 7, 83, 21, 76, 71, 60]...


  • ...To solve this problem Long Short-Term Memory (LSTM) [33] was proposed, and it is usually used as a hidden layer of RNN, as seen in Fig....


  • ...For the latter, researchers tried to exploit recurrent neural networks (RNN) in the past [108]....


  • ...We are aware of [67], where the authors propose a multimodal (depth video, skeleton, and speech) human gesture recognition system based on RNN. [25] presents a Convolutional Long Short-Term Memory Recurrent Neural Network (CNNLSTM) able to successfully learn gesture varying in duration and complexity....


Proceedings ArticleDOI
03 Nov 2014
TL;DR: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Abstract: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments.Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.

10,161 citations


"A Survey on Deep Learning Based App..." refers background in this paper

  • ...Among the popular ones are Caffe [49], CNTK [131], TensorFlow [1], and Theano [3]....


Journal ArticleDOI
TL;DR: This work shows why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on information for long periods.
Abstract: Recurrent neural networks can be used to map input sequences to output sequences, such as for recognition, production or prediction problems. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. We show why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. These results expose a trade-off between efficient learning by gradient descent and latching on information for long periods. Based on an understanding of this problem, alternatives to standard gradient descent are considered. >

7,309 citations


"A Survey on Deep Learning Based App..." refers background in this paper

  • ...However, these models typically faced some major mathematical difficulties identified by Hochreiter [39] and Bengio et al [9]....


Frequently Asked Questions (18)
Q1. What have the authors contributed in "A survey on deep learning based approaches for action and gesture recognition in image sequences" ?

In this paper, the authors present a survey on current deep learning methodologies for action and gesture recognition in image sequences. The authors introduce a taxonomy that summarizes important aspects of deep learning for approaching both tasks. The authors review the details of the proposed architectures, fusion strategies, main datasets, and competitions. The authors summarize and discuss the main works proposed so far with particular interest in how they treat the temporal dimension of the data, discussing their main features and identifying opportunities and challenges for future research. 

One valuable cue is spatial structure of actions/gestures. [112] takes advantage of iDTs to pool relevant CNN features along trajectories in video frames. [12], takes advantage of human body spatial constraints, by aggregating convolutional activations of a 3D CNN into descriptors based on joint positions. 

Recurrent Neural Network (RNN) [26] is one of the most used networks for this task, which can take into account the temporal data using recurrent connections in hidden layers. 

The authors anticipate deep learning will prevail in emerging applications/areas like social signal processing, affective computing, and personality analysis, among others. 

Bidirectional RNN (B-RNN) [83], Hierarchical RNN (H-RNN) [22], and Differential RNN (D-RNN) [106] are some other successful extensions of RNN for recognizing human actions. 

Contextual cues have also been considered for action/gesture recognition. [4] proposes a novel multi-stage recurrent architecture consisting of two stages: in a first stage, the model focuses on global context-aware features, and then combines the resulting representation with the localized, action-aware. [46] enriches their motion representation by encoding a set of 15,000 objects from ImageNet and computing their likelihood in frames. 

Other authors focused on further improving accuracy of 3D CNNs. [32] performs 3D convolutions over stacks of optical flow maps. [95] uses multiple 3D CNNs in a multi-stage (proposal generation, classification, and fine-grained localization) framework for temporal action localization in long untrimmed videos. 

There are three main variants of information fusion in deep learning models: early (the data is fused before being fed into the model, or the model ingests information directly from multiple sources), late (the outputs of deep learning models are combined) and middle (intermediate layers fuse information) fusion [68, 69]. 

R-NKTM is learned using bag-of-features from dense trajectories of synthetic 3D human models and generalizes to real videos of human actions. 

[112] pools and normalizes CNN feature maps along improved dense trajectories. [78] concatenates iDTs (HOG, HOF, MBHx, MBHy descriptors with fisher vector encoding) and CNN feature (VGG19) descriptors. [86] presents a Robust Nonlinear Knowledge Transfer Model (R-NKTM) based on a deep fully-connected network that transfers human actions from any view to a canonical one. 

Regarding applications, deep learning techniques have been successfully used in traditional ones (e.g. surveillance, health care, robotics), improving performance in action and gesture recognition for human computer-robot or -machine interaction. 

[135] obtains the best accuracy on UCF101 by using trajectory pooling to pool the convolutional features extracted from the optical flow nets of two-stream ConvNets and the frame-diff layers of the spatial network to get local descriptors. 

[105] achieves good performance by using a two-stream network (RGB and motion) with an extended temporal resolution with respect to previous works (from 16 to 60 frames). 

As such, the authors envision newer problems like early recognition [28], multi-task learning [127], captioning, recognition from low resolution sequences [66] and lifelog devices [87] will receive attention in the next years. 

To alleviate this problem, [61] initializes the weights of a 3D CNN by using 2D weights learned from ImageNet, while [102] proposes a 3D CNN (FstCN) that factorizes the 3D convolutional kernel learning as a sequential process of learning 2D spatial and 1D temporal kernels in different layers. 

To address this problem and decrease the number of weights, a good trick is to decrease the spatial resolution while increasing the temporal length. 

The final decision is taken by a single-layer network. [99] faces the problem of first-person action recognition using a multi-stream CNN (ego-CNN, temporal, and spatial). 

The authors also find 3D CNN models being combined with sequence modeling methods [7] or hand-crafted feature descriptors (VLAD [30] or iDTs [129]).