
Learning Spatio-Temporal Representations for Action Recognition: A Genetic Programming Approach

Li Liu, Ling Shao, Senior Member, IEEE, Xuelong Li, Fellow, IEEE, and Ke Lu
Abstract: Extracting discriminative and robust features from video sequences is the first and most critical step in human action recognition. In this paper, instead of using handcrafted features, we automatically learn spatio-temporal motion features for action recognition. This is achieved via an evolutionary method, i.e., genetic programming (GP), which evolves the motion feature descriptor on a population of primitive 3D operators (e.g., 3D-Gabor and wavelet). In this way, scale- and shift-invariant features can be effectively extracted from both color and optical flow sequences. We intend to learn data-adaptive descriptors for different datasets with multiple layers, which makes full use of prior knowledge to mimic the physical structure of the human visual cortex for action recognition and simultaneously reduces the GP search space to effectively accelerate the convergence toward optimal solutions. In our evolutionary architecture, the average cross-validation classification error, which is calculated by a support-vector-machine classifier on the training set, is adopted as the evaluation criterion for the GP fitness function. After the entire evolution procedure finishes, the best-so-far solution selected by GP is regarded as the (near-)optimal action descriptor obtained. The GP-evolving feature extraction method is evaluated on four popular action datasets, namely KTH, HMDB51, UCF YouTube, and Hollywood2. Experimental results show that our method significantly outperforms other types of features, either hand-designed or machine-learned.

Index Terms: Action recognition, feature extraction, feature learning, genetic programming (GP), spatio-temporal descriptors.
Manuscript received August 28, 2014; revised December 5, 2014; accepted January 30, 2015. Date of publication February 13, 2015; date of current version December 14, 2015. This work was supported in part by the Key Research Program of the Chinese Academy of Sciences under Grant KGZD-EW-T03, and in part by the National Natural Science Foundation of China under Grant 61125106. This paper was recommended by Associate Editor S. X. Yang. (Corresponding author: Ling Shao.)

L. Liu and L. Shao are with the College of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China, and also with the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne NE1 8ST, U.K. (e-mail: ling.shao@ieee.org).

X. Li is with the Center for Optical Imagery Analysis and Learning, State Key Laboratory of Transient Optics and Photonics, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, China.

K. Lu is with the University of Chinese Academy of Sciences, Beijing 100049, China, and also with Beijing Center for Mathematics and Information Interdisciplinary Sciences, Beijing, China.

I. INTRODUCTION

Human action recognition [1]-[3], as a hot research area in computer vision, has many potential applications
such as video search and retrieval, intelligent surveillance sys-
tems, and human-computer interaction. Despite its popularity,
how to precisely distinguish different actions still remains
challenging, since variations in lighting conditions, intraclass
differences and complex backgrounds all pose as obstacles for
robust feature extraction and action classification.
Generally, the basic approach to action recognition contains
the following two stages: 1) feature extraction and represen-
tation and 2) action classification. For the first stage, there
are mainly two groups of methods: 1) local feature-based and
2) holistic feature-based.
Within local feature-based methods, unsupervised tech-
niques (e.g., cuboid detector [4] and 3D Harris corner detec-
tor [5]) are first applied to detect interest points around which
the most salient features, such as: histogram of 3D oriented
gradients (3DHOG) [6], 3D scale invariant feature trans-
forms [7], and histograms of optical flow (HOF) [8], are
extracted. Then the bag-of-features (BOF) scheme is utilized
to form a codebook and map obtained features in histogram
representations which are finally fed to a classifier for action
classification. The local feature-based methods tend to be
more robust to complex backgrounds and occlusion in realistic
actions [9]; however, this kind of sparse representation is
often not precise and informative because of the quan-
tization error during codebook construction and the loss of
structural configuration among local features. Another weak-
ness of local approaches is that the detected spatio-temporal
features are usually not distinctive and invariant enough,
because the 3D local feature detectors are extended from their
2-D counterparts without fully exploiting the intrinsic differ-
ences between static images and dynamic video sequences.
Because of these reasons, holistic feature-based methods have
recently attracted significant attention in action recognition
research.
On the other hand, the holistic approaches represent actions
using visual information from the whole sequence and have also
been utilized in a variety of applications. Commonly, shape,
intensity, and color features are used for the holistic
representation of an action. The structure and orientation
information of texture and shape can be successfully
extracted by mimicking the biological mechanism of visual
cortex for human perception. Color features have the advan-
tage of being invariant with respect to scaling, rotation,
perspective, and partial occlusion. The classical approaches
to compute the holistic features for action recognition were

developed in [10]-[12], etc., which are able to encode more
visual information by preserving spatial and temporal struc-
tures of the human action occurring in a video sequence.
However, holistic representations are sensitive to geometric
and photometric distortions and shifting. Moreover, prepro-
cessing steps, such as background subtraction, spatial and
temporal alignments, segmentation and tracking, are often
required.
The methods introduced above are all based on handcrafted
techniques [13], [14] designed and tuned by human experts,
which, however, may achieve “good” performance in a par-
ticular given domain and often result in poor performance on
other applications. How to design an adaptive methodology to
extract spatio-temporal features with discriminative recogni-
tion capabilities for any user-defined application still remains
an open research question.
As an alternative to handcrafted solutions based on deep
domain knowledge, genetic programming (GP), a power-
ful evolutionary method inspired by natural evolution, can
be employed to automatically solve problems without prior
knowledge of the solutions. In the present setting, we wish
to identify the feature descriptor (i.e., the sequence of prim-
itive operations, the composition and order of which are
unknown) to maximize recognition performance on a human
action recognition task. This is an NP-hard search prob-
lem [15] that evolutionary methods may solve in a tractable
amount of computer time compared to the exhaustive enu-
merative search. GP has been used to address a wide range
of practical problems, producing human-competitive results and
even patentable inventions. As a search framework, GP can
typically escape the local minima in the optimization landscape
which may trap deterministic search methods.
In this paper, we adopt GP for designing holistic descriptors
that are adaptive to action domains and robust to shift, scal-
ing and background cluttering for action recognition. Given a
group of primitive 3D processing operators and a set of labeled
training examples, GP evolves better-performing individuals
in the next generation. Eventually, a best-so-far individual
can be selected as the final solution (i.e., the near-optimal
descriptor). The GP-evolved spatio-temporal descriptors can
extract and fuse the meaningful information from the original
sequences and the corresponding optical flow motion data. We
systematically evaluate the method on the KTH, HMDB51,
YouTube, and Hollywood2 datasets to demonstrate its perfor-
mance and generalizability. For comparison, we also show that
the proposed method is superior to some previously-published
hand-crafted solutions.
The main contributions of this paper can be summarized as
follows.
1) GP is used to automatically “evolve” spatio-temporal
feature descriptors that are adaptive and discriminative
for action recognition without profound knowledge of
the action datasets.
2) The GP-learned descriptors provide an effective way
to simultaneously extract and fuse the color and
motion (i.e., optical flow) information into one feature
representation.
This paper is organized as follows. In Section II, some
related work is summarized. The detailed architecture of our
method is presented in Section III, and relevant experiments and
results are described in Section IV. In Section V, we conclude
this paper and outline possible future work.
II. RELATED WORK
As this paper falls in the category of holistic representa-
tions, we mainly review methods of holistic spatio-temporal
representations for action recognition.
Bobick and Davis [10] presented motion templates through
projecting frames onto a single image, namely motion his-
tory images (MHI) and motion energy images. This kind of
motion template can capture the motion patterns occurring
in a video sequence. However, this simple representation only
gives satisfactory performance where the background is rel-
atively static. Efros et al. [16] proposed a motion descriptor
based on smoothed and aggregated optical flows over a spatio-
temporal volume, which is centered on a moving figure. This
descriptor has been proven to be suitable for distant objects,
but the moving figure needs to be localized quite accurately.
Schindler and Van Gool [17] found that very short snippets (1-7
frames) are sufficient for basic action recognition. They
applied log-Gabor filters to raw frames to extract form fea-
tures and optical flow filters to extract motion features. In
addition, Gorelick et al. [18] extracted spatio-temporal fea-
tures, such as local space-time saliency, action dynamics,
shape structure, and orientation, based on the properties of
Poisson equation solutions. Moreover, some recent discrimi-
nant analysis methods have also shown superior performance
for action recognition, such as slow feature analysis (SFA) [19]
which extracts the slowly varying and relevant stable fea-
tures from the quickly changed action videos. SFA has been
proved to be effectively used in constructing the visual recep-
tive fields of the cortical neurons. General tensor discriminant
analysis and Gabor features originally proposed for gait recog-
nition [20] can be also applied to action recognition. These
handcrafted features usually involve a lot of engineering
work to design and tune and are not adaptive to different
datasets.
Besides handcrafted features, there have also been a few
works on learning feature representations for action recog-
nition. Le et al. [21] have proposed using unsupervised
feature learning as a direct way to learn invariant spatio-
temporal features from unlabeled video data. Furthermore,
Taylor et al. [12] have introduced a model that learns latent
representations of image sequences from pairs of successive
images. Similarly, Ji et al. [11] developed a 3D convolutional
neural network (CNN), which is directly extended from its
2-D counterpart, for feature extraction. In a 3D CNN, motion
information in multiple adjacent frames is captured through
performing convolutions over spatial and temporal dimensions.
The convolutional architecture of their model allows it to scale
to realistic video sizes whilst using a compact parametrization.
Recently, the deep belief network (DBN) [22] has also shown its
capacity to automatically learn multiple layers of nonlinear
features from images and videos. However, the number of

parameters to be learned in those deep learning models [23]
is very large, sometimes too large relative to the available
number of training samples, which unfortunately restricts their
applicability.
Within the area of evolutionary computation, evolution-
based methods simulate biological evolution to automatically
generate solutions for user-defined tasks, such as: genetic algo-
rithms (GA), memetic algorithms (MA), particle swarm opti-
mization (PSO), ant-colony systems (ACS), and GP. Generally,
these are heuristic and population-based searching methods.
They all attempt to move from one population to another
population in a single iteration with probabilistic rules. In
particular, GA seeks the solution of a problem in the form
of a string of numbers (traditionally binary, although the best
representations are usually those that reflect something about
the problem being solved), by applying operators such as
recombination and mutation (sometimes one, sometimes both).
Bhanu et al. [24] have proposed an adaptive image segmen-
tation system based on a GA. In their method, the GA is
an effective way of searching the hyperspace of segmentation
parameter combinations to determine the set which maximizes
a segmentation quality criterion. Besides GA, PSO, which
is inspired by the social behavior of migrating birds trying
to reach an unknown destination, has been used for feature
selection and classification in computer vision tasks. In [25],
PSO is incorporated within an AdaBoost framework for face
detection. Dynamic clustering using PSO has been proposed for
unsupervised image classification in [26]. Additionally, a
multiobjective PSO for discriminative feature selection was
proposed in [27] for robust classification problems. Beyond
the above methods, MA [28] and ACS [29] have been adopted
in vision applications too.
However, since GA and MA are based on a fixed form
of gene expression during the whole optimization procedure,
the representations of the solution are relatively fixed and
limited, which heavily influences their effectiveness on com-
plex optimization problems. In contrast to GA/MA,
PSO considers the birds’ social behavior and accordingly their
movements toward an optimal destination rather than cre-
ating new solutions within each generation. Compared with
other evolution-based methods, PSO achieves a final solution
in a linear search space and tends to be relatively efficient.
However, this kind of simple linear search cannot tackle
complex optimization problems.
To enable more flexible representations, another evolution-
ary approach, i.e., GP, has been proposed [15], [30]. GP
has been widely utilized in the computer vision domain and
proved to be more powerful than GA. It is more intuitive
for implementation and can effectively solve highly non-
linear optimization problems. This kind of flexible, nonlinear
searching mechanism thus helps GP achieve better solutions. Poli [31] applied
GP to automatically select optimal filters for segmentation
of the brain in medical images. Following the same line,
Torres et al. [32] used GP for finding a combination of sim-
ilarity functions for image retrieval. Davis et al. [33] have
also employed GP for feature selection in multivariate data
analysis, where GP can automatically select a subset of the
most discriminative features without any prior information. In
addition, other researchers [34]-[36] have also successfully
applied GP to recognition tasks with improvements compared
with previous methods.
Recently, GP has been exploited to assemble low-level fea-
ture detectors for high-level analysis, such as: object detection,
3D reconstruction, image tracking, and matching. The first
work in this area employed GP to evolve an operator for
detecting interest points [37]. Trujillo and Olague [38] have also
used GP to generate feature extractors for computer vision
applications. In addition, a GP-based detector was proposed
by Howard et al. [39] for detecting ship wakes in synthetic
aperture radar images.
The most closely related work, which uses GP to automatically
generate low-level features for action recognition, was introduced
in [40]. In that work, some basic filters are successfully
evolved to construct spatio-temporal descriptors for represent-
ing action sequences. Although this framework is regarded as
the first attempt in using GP to learn holistic representations
for action recognition, some aspects in this framework can
still be improved. Specifically, the evolved structure is totally
random rather than mimicking the structure of the human
brain cortex with multiple tiers; this kind of random evolu-
tion may fail to find the best solutions in a limited number of
generations. Furthermore, the previous work attempts to learn
general-purpose representations, which tend to be less specific
and discriminative for different action data domains. Lastly,
the method was only evaluated on “staged” action datasets
rather than realistic action datasets. We expect to solve all
these issues in this paper.
Inspired by the effectiveness of GP on flexible optimiza-
tion tasks and successful applications mentioned above, in
this paper, we use GP to automatically evolve more task-
specific spatio-temporal descriptors from a set of 3D filters
and operators for realistic human action recognition.
III. EVOLUTIONARY MOTION FEATURE EXTRACTION
Much of what is done in action recognition aims to achieve
what the human vision system is capable of. This has caused
many researchers to model systems and algorithms after vari-
ous aspects of the human vision system. In this paper, we also
attempt to simulate the human visual cortex system which is
made up of hierarchical layers of neurons with feedforward
and feedback connections that grow and evolve from birth as
the vision system develops. The early stages of processing in
the visual cortex are sensitive to visual stimuli such as intensity
and orientations, spatial motion, and even colors. In a simi-
lar way that feedforward neural connections between these
visual cortex layers are created and evolved in humans, we
propose a domain-independent machine learning methodology
to automatically generate low-level spatio-temporal descriptors
for high-level action recognition using GP. In our architecture,
the original color and optical-flow sequences are regarded as
the inputs and a group of 3D operators are assembled to con-
struct an effective problem-specific descriptor which is capable
of selectively extracting features from input data. The final
evolved descriptor, combining the nice properties of those

primitive 3D operators, can both extract meaningful features and form a compact action representation. We learn our proposed system over a training set, in which descriptors are evolved by maximizing the recognition accuracy through a fitness function, and we further evaluate the GP-selected descriptor over a testing set to demonstrate the performance of our method. The architecture of our proposed model is illustrated in Fig. 1.

Fig. 1. Outline of our feature learning-based approach.

TABLE I: FUNCTION SET IN GP.

Generally, GP programs can be represented as a tree structure and evolved (by selection, crossover, and mutation) through sexual reproduction, with pairs of parents chosen stochastically but biased by their fitness on the task at hand; the best-performing individual is finally selected as the terminal solution. In our method, each individual in GP represents a candidate spatio-temporal descriptor and is evolved continuously through generations. To establish the architecture of our model, three key components must first be defined: the function set, the terminal set, and the fitness function.
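To make this evolutionary architecture easier to follow, the sketch below (in Python) shows the loop it describes: individuals are candidate descriptors assembled from primitive 3D operators, fitness is the average cross-validation error of an SVM on the training set, and the best-so-far individual is kept as the (near-)optimal descriptor. This is a toy illustration rather than the authors' implementation; the reduced function set, the chain-shaped (rather than tree-shaped) individuals, and all helper names are simplifying assumptions.

# A minimal, self-contained sketch (not the authors' code) of the evolutionary
# loop described above: fitness = average cross-validation error of an SVM,
# and the best-so-far individual is returned as the final descriptor.
import random
import numpy as np
from scipy import ndimage
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Every primitive maps a 3D sequence to a 3D sequence of the same size,
# mirroring the closure property required of GP function nodes.
FUNCTION_SET = {
    "gauss3d":   lambda v: ndimage.gaussian_filter(v, sigma=1.0),
    "laplace3d": ndimage.laplace,
    "maxpool3d": lambda v: ndimage.maximum_filter(v, size=5),
    "absolute":  np.abs,
}

def random_descriptor(length=3):
    return [random.choice(list(FUNCTION_SET)) for _ in range(length)]

def apply_descriptor(descriptor, video):
    """Run one candidate descriptor on a clip and pool the response into a
    short feature vector (a crude stand-in for the real representation)."""
    out = video.astype(np.float64)
    for name in descriptor:
        out = FUNCTION_SET[name](out)
    return np.array([out.mean(), out.std(), out.max(), out.min()])

def fitness(descriptor, videos, labels, folds=3):
    """Average cross-validation classification error (lower is better)."""
    feats = np.vstack([apply_descriptor(descriptor, v) for v in videos])
    return 1.0 - cross_val_score(SVC(kernel="linear"), feats, labels, cv=folds).mean()

def crossover(a, b):
    cut = random.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]

def mutate(descriptor):
    child = list(descriptor)
    child[random.randrange(len(child))] = random.choice(list(FUNCTION_SET))
    return child

def evolve(videos, labels, pop_size=20, generations=10, mutation_rate=0.2):
    population = [random_descriptor() for _ in range(pop_size)]
    best, best_err = None, float("inf")
    for _ in range(generations):
        errors = [fitness(d, videos, labels) for d in population]
        for d, e in zip(population, errors):       # keep the best-so-far solution
            if e < best_err:
                best, best_err = d, e
        def tournament():                          # selection biased toward low error
            i, j = random.sample(range(pop_size), 2)
            return population[i] if errors[i] < errors[j] else population[j]
        offspring = []
        while len(offspring) < pop_size:           # sexual reproduction + mutation
            child = crossover(tournament(), tournament())
            if random.random() < mutation_rate:
                child = mutate(child)
            offspring.append(child)
        population = offspring
    return best, best_err

Given a list of training clips stored as (frames, height, width) numpy arrays with integer labels, evolve(videos, labels) returns the lowest-error chain found; the real system additionally enforces the two-tier ordering described below and draws on the full function set of Table I.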
A. Function Set and Terminal Set
A key component of GP is the function set which consti-
tutes the internal nodes of the tree and is typically driven by
the nature of the problem. To make the GP evolution process
fast, more efficient operators that can extract meaningful infor-
mation from action sequences are preferred. Our function set
consists of 19 unary operators and 4 binary ones, including
processing filters and basic arithmetic functions, as illustrated
in Table I.
In our GP structure, we divide our function set into two
tiers: 1) filtering tier (bottom tier) and 2) max-pooling tier
(top tier). The order of these tiers in our structure is always
fixed. Specifically, we do not allow the filter operators in the
function set to be above the max-pooling functions. In our
implementation, when a descriptor is evolved, we will check
whether it is a wrongly ordered descriptor or not. A wrongly
ordered descriptor will be automatically discarded by our pro-
gram and a new correctly-ordered descriptor will be evolved
to replace the discarded one. In this way, in any GP-evolved
program, the operators in the filtering tier must be located
below the operators in the max-pooling tier. In addition, not
all the operators listed in the function set have to be used in a
given tree and the same operator can be used more than once.
Therefore, the topology of the tree in each tier is essentially
unrestricted. This kind of tree structure makes full use of prior
knowledge to mimic the physical structure of the human visual
cortex [41], [42] by encoding orientation, intensity, and color
information of the targets and can effectively tolerate shifting,
translation, and scaling for action recognition and simultane-
ously reduce the GP searching space to effectively accelerate
the convergence toward optimal solutions.
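As a concrete illustration of this ordering constraint, the check below assumes a candidate program is stored as a nested tuple of operator names (a hypothetical representation; the operator names are placeholders rather than the exact entries of Table I). It rejects any tree in which a filtering operator sits above a max-pooling operator; during evolution, such a tree would simply be discarded and regenerated, as described above.

# Hypothetical representation: a GP tree is either a terminal string (an input
# sequence) or a tuple (operator_name, child, ...).  Operator names below are
# illustrative stand-ins for the filtering-tier and max-pooling-tier entries.
FILTERING_OPS = {"gauss3d", "laplace3d", "wavelet3d", "gabor3d", "add", "sub"}
MAXPOOL_OPS = {"maxpool5", "maxpool10", "maxpool15", "maxpool20"}

def contains_maxpool(tree):
    if not isinstance(tree, tuple):
        return False
    return tree[0] in MAXPOOL_OPS or any(contains_maxpool(c) for c in tree[1:])

def is_correctly_ordered(tree):
    """True if no filtering operator appears above a max-pooling operator,
    i.e., max-pooling nodes only ever sit closer to the root than filters."""
    if not isinstance(tree, tuple):              # terminal (input sequence)
        return True
    op, children = tree[0], tree[1:]
    if op in FILTERING_OPS:
        # Once inside the filtering tier, no max-pooling node may appear below.
        return all(not contains_maxpool(c) and is_correctly_ordered(c)
                   for c in children)
    return all(is_correctly_ordered(c) for c in children)

# is_correctly_ordered(("gauss3d", ("maxpool5", "color_sequence")))  -> False
# is_correctly_ordered(("maxpool5", ("gauss3d", "color_sequence")))  -> True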
1) Filtering Tier: In the filtering tier, aiming to extract
meaningful features from dynamic actions, we adopt
3D Gaussian filters, 3D Laplacian filters, 3D wavelet filters,
3D Gabor filters, and some other sequence processing oper-
ators and basic arithmetic functions.
3D Gaussian filters are adopted due to their ability for
denoising and 3D Laplacian filters are used for separating sig-
nals into different spectral sub-bands. Laplacian of Gaussian
operators have been successfully applied to capture intensity
features for action recognition in [2] and [43]. Wavelet trans-
forms can perform multiresolution analysis and obtain the
contour information of human actions by using the 3D CDF
“9/7” [44] wavelet filters.
In this paper, these 3D filters are used for constructing the
sequence pyramid (i.e., GauPy, LapPy, Wavelet), which is a data
structure designed to support efficient scaled convolution
through reducing the resolution. It consists of a sequence of
copies of an original sequence in which both sampling den-
sity and resolution are decreased in regular steps. A pyramid
is thus a multiscale representation built recursively. Beyond
those, 3D Gabor filters are regarded as the most effective
method to obtain the orientation information in a sequence.
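To make the pyramid construction just described concrete, here is a minimal sketch assuming scipy and a clip stored as a (frames, height, width) numpy array; it illustrates the idea behind the GauPy structure rather than the authors' exact filters.

import numpy as np
from scipy import ndimage

def gaussian_pyramid_3d(video, levels=3, sigma=1.0):
    """Build a 3D Gaussian pyramid: at each level the sequence is smoothed
    with a 3D Gaussian and then subsampled by 2 along every axis, so both
    sampling density and resolution drop in regular steps."""
    pyramid = [video.astype(np.float32)]
    for _ in range(levels - 1):
        smoothed = ndimage.gaussian_filter(pyramid[-1], sigma=sigma)
        pyramid.append(smoothed[::2, ::2, ::2])   # halve resolution recursively
    return pyramid

# Example: a 16-frame, 64x64 clip yields levels of shape
# (16, 64, 64), (8, 32, 32), and (4, 16, 16).
clip = np.random.rand(16, 64, 64).astype(np.float32)
levels = gaussian_pyramid_3d(clip)

A Laplacian (LapPy) or wavelet pyramid can be built in the same spirit by changing the filtering step at each level.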
Following Riesenhuber and Poggio [41], we simulate the biological mechanism of the visual cortex to define our Gabor filter-based operators. Firstly, we convolve an input action sequence with Gabor filters at six different scales (7 × 7 × 7, 9 × 9 × 9, 11 × 11 × 11, 13 × 13 × 13, 15 × 15 × 15, and 17 × 17 × 17) under a certain orientation (i.e., 0°, 45°, 90°, or 135°); we further apply the max operation to pick the maximum value across all six convolved sequences for that particular orientation. Fig. 3 illustrates the procedure of our multiscale-max Gabor filters for a certain orientation. The max operation among different scales is defined as follows:

I_MAX(x, y, z) = max[ I_{7×7×7}(x, y, z, θ_s), I_{9×9×9}(x, y, z, θ_s), ...,
                      I_{15×15×15}(x, y, z, θ_s), I_{17×17×17}(x, y, z, θ_s) ]    (1)

where I_MAX is the output of the multiscale-max Gabor filter and I_{i×i×i}(x, y, z, θ_s) denotes the sequence convolved with the filter of scale i × i × i under orientation θ_s.

Fig. 3. Outline of the multiscale-max Gabor filter. This figure illustrates an example of the multiscale-max Gabor filter with a fixed orientation of 45°.
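The multiscale-max operation in (1) can be sketched as follows. The kernel used here, a Gaussian envelope modulated by a planar cosine carrier at orientation θ, is a simplified stand-in for the Gabor filters of [41]; the six scales and four orientations follow the text above, and the function names are illustrative.

import numpy as np
from scipy import ndimage

def gabor_kernel_3d(size, theta, wavelength=4.0, sigma_ratio=0.35):
    """A simplified 3D Gabor-like kernel: Gaussian envelope times a cosine
    carrier oriented at angle theta (degrees) in the spatial (x, y) plane."""
    half = size // 2
    z, y, x = np.mgrid[-half:half + 1, -half:half + 1, -half:half + 1]
    sigma = sigma_ratio * size
    envelope = np.exp(-(x**2 + y**2 + z**2) / (2.0 * sigma**2))
    t = np.deg2rad(theta)
    carrier = np.cos(2.0 * np.pi * (x * np.cos(t) + y * np.sin(t)) / wavelength)
    kernel = envelope * carrier
    return kernel - kernel.mean()             # zero-mean, as usual for Gabor filters

def multiscale_max_gabor(video, theta, sizes=(7, 9, 11, 13, 15, 17)):
    """Implements (1): convolve the sequence with Gabor filters at six scales
    under a fixed orientation and keep the per-voxel maximum response."""
    responses = [ndimage.convolve(video.astype(np.float32),
                                  gabor_kernel_3d(s, theta), mode="nearest")
                 for s in sizes]
    return np.maximum.reduce(responses)       # I_MAX(x, y, z)

# Example: per-voxel multiscale-max Gabor response at a fixed 45-degree orientation.
clip = np.random.rand(20, 32, 32).astype(np.float32)
response_45 = multiscale_max_gabor(clip, theta=45)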
Moreover, several other 3D operators that are common for
feature extraction are added to the function set to increase the
variety of the selection for composing individuals during the
GP evolution. Basic arithmetic functions are chosen to real-
ize operations such as addition and subtraction of the internal
nodes of the tree to make the whole evolution procedure more
natural.
To ensure the closure property [15], we have only used func-
tions which map one or two 3D sequences to a single 3D
sequence with identical size (i.e., the input and the output of
each function node have the same size). In this way, a GP tree
can be an unrestricted composition of function nodes but still
always produce a semantically legal structure.
2) Max-Pooling Tier: In the max-pooling tier, we include
four functions listed in Table I, which are performed over
local neighborhoods with windows varying from 5 × 5 × 5 to 20 × 20 × 20 with a shifting step of 5 pixels. This max-pooling operation (see Fig. 2) is a key mechanism for object

Fig. 2. Illustration of the mechanism of max-pooling filter.
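A rough sketch of such a pooling operator is given below, assuming cubic windows of side 5, 10, 15, and 20 voxels shifted with a stride of 5 pixels; this reading of the window sizes and the helper itself are illustrative rather than the authors' exact definition.

import numpy as np

def max_pool_3d(video, window, stride=5):
    """Max-pool a (frames, height, width) sequence over local cubic
    neighborhoods of side `window`, shifted by `stride` voxels."""
    f, h, w = video.shape
    out = []
    for z in range(0, f - window + 1, stride):
        for y in range(0, h - window + 1, stride):
            for x in range(0, w - window + 1, stride):
                out.append(video[z:z + window, y:y + window, x:x + window].max())
    return np.array(out)

# Pooled responses for the four window sizes mentioned above.
clip = np.random.rand(20, 40, 40).astype(np.float32)
pooled = {w: max_pool_3d(clip, w) for w in (5, 10, 15, 20)}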
