
Learning Spatio-Temporal Representations for Action Recognition: A Genetic Programming Approach

Li Liu, Ling Shao, Senior Member, IEEE, Xuelong Li, Fellow, IEEE, and Ke Lu
Abstract: Extracting discriminative and robust features from video sequences is the first and most critical step in human action recognition. In this paper, instead of using handcrafted features, we automatically learn spatio-temporal motion features for action recognition. This is achieved via an evolutionary method, i.e., genetic programming (GP), which evolves the motion feature descriptor on a population of primitive 3D operators (e.g., 3D-Gabor and wavelet). In this way, scale- and shift-invariant features can be effectively extracted from both color and optical flow sequences. We intend to learn data-adaptive descriptors for different datasets with multiple layers, which makes full use of prior knowledge to mimic the physical structure of the human visual cortex for action recognition and simultaneously reduces the GP search space to effectively accelerate the convergence toward optimal solutions. In our evolutionary architecture, the average cross-validation classification error, which is calculated by a support-vector-machine classifier on the training set, is adopted as the evaluation criterion for the GP fitness function. After the entire evolution procedure finishes, the best-so-far solution selected by GP is regarded as the (near-)optimal action descriptor obtained. The GP-evolving feature extraction method is evaluated on four popular action datasets, namely KTH, HMDB51, UCF YouTube, and Hollywood2. Experimental results show that our method significantly outperforms other types of features, either hand-designed or machine-learned.

Index Terms: Action recognition, feature extraction, feature learning, genetic programming (GP), spatio-temporal descriptors.
Manuscript received August 28, 2014; revised December 5, 2014; accepted January 30, 2015. Date of publication February 13, 2015; date of current version December 14, 2015. This work was supported in part by the Key Research Program of the Chinese Academy of Sciences under Grant KGZD-EW-T03, and in part by the National Natural Science Foundation of China under Grant 61125106. This paper was recommended by Associate Editor S. X. Yang. (Corresponding author: Ling Shao.)

L. Liu and L. Shao are with the College of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China, and also with the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne NE1 8ST, U.K. (e-mail: ling.shao@ieee.org).

X. Li is with the Center for Optical Imagery Analysis and Learning, State Key Laboratory of Transient Optics and Photonics, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, China.

K. Lu is with the University of Chinese Academy of Sciences, Beijing 100049, China, and also with Beijing Center for Mathematics and Information Interdisciplinary Sciences, Beijing, China.

I. INTRODUCTION

Human action recognition [1]-[3], as a hot research area in computer vision, has many potential applications
such as video search and retrieval, intelligent surveillance sys-
tems, and human-computer interaction. Despite its popularity,
how to precisely distinguish different actions still remains
challenging, since variations in lighting conditions, intraclass
differences and complex backgrounds all pose as obstacles for
robust feature extraction and action classification.
Generally, the basic approach to action recognition contains
the following two stages: 1) feature extraction and represen-
tation and 2) action classification. For the first stage, there
are mainly two groups of methods: 1) local feature-based and
2) holistic feature-based.
Within local feature-based methods, unsupervised tech-
niques (e.g., cuboid detector [4] and 3D Harris corner detec-
tor [5]) are first applied to detect interest points around which
the most salient features, such as: histogram of 3D oriented
gradients (3DHOG) [6], 3D scale invariant feature trans-
forms [7], and histograms of optical flow (HOF) [8], are
extracted. Then the bag-of-features (BOF) scheme is utilized
to form a codebook and map obtained features in histogram
representations which are finally fed to a classifier for action
classification. The local feature-based methods tend to be
more robust to complex backgrounds and occlusion in realistic
actions [9]; however, this kind of sparse representation is
often not precise and informative because of the quan-
tization error during codebook construction and the loss of
structural configuration among local features. Another weak-
ness of local approaches is that the detected spatio-temporal
features are usually not distinctive and invariant enough,
because the 3D local feature detectors are extended from their
2-D counterparts without fully exploiting the intrinsic differ-
ences between static images and dynamic video sequences.
Because of these reasons, holistic feature-based methods have
recently attracted significant attention in action recognition
research.
On the other hand, the holistic approaches represent actions
using visual information from the whole sequence and have also
been utilized in a variety of applications. Commonly, shape,
intensity, and color features are used for the holistic
representation of an action. The structure and orientation
information of texture and shape can be successfully
extracted by mimicking the biological mechanism of visual
cortex for human perception. Color features have the advan-
tage of being invariant with respect to scaling, rotation,
perspective, and partial occlusion. The classical approaches
to compute the holistic features for action recognition were

developed in [10]-[12], etc., which are able to encode more
visual information by preserving spatial and temporal struc-
tures of the human action occurring in a video sequence.
However, holistic representations are sensitive to geometric
and photometric distortions and shifting. Moreover, prepro-
cessing steps, such as background subtraction, spatial and
temporal alignments, segmentation and tracking, are often
required.
The methods introduced above are all based on handcrafted
techniques [13], [14] designed and tuned by human experts,
which, however, may achieve “good” performance in a par-
ticular given domain and often result in poor performance on
other applications. How to design an adaptive methodology to
extract spatio-temporal features with discriminative recogni-
tion capabilities for any user-defined application still remains
an open research question.
As an alternative to handcrafted solutions based on deep
domain knowledge, genetic programming (GP), a power-
ful evolutionary method inspired by natural evolution, can
be employed to automatically solve problems without prior
knowledge of the solutions. In the present setting, we wish
to identify the feature descriptor (i.e., the sequence of prim-
itive operations, the composition and order of which are
unknown) to maximize recognition performance on a human
action recognition task. This is an NP-hard search prob-
lem [15] that evolutionary methods may solve in a tractable
amount of computer time compared to the exhaustive enu-
merative search. GP has been used to address a wide range
of practical problems, producing human-competitive results and
even patentable inventions. As a search framework, GP can
typically escape the local minima in the optimization landscape
which may trap deterministic search methods.
In this paper, we adopt GP for designing holistic descriptors
that are adaptive to action domains and robust to shift, scal-
ing and background cluttering for action recognition. Given a
group of primitive 3D processing operators and a set of labeled
training examples, GP evolves better-performing individuals
in the next generation. Eventually, a best-so-far individual
can be selected as the final solution (i.e., the near-optimal
descriptor). The GP-evolved spatio-temporal descriptors can
extract and fuse the meaningful information from the original
sequences and the corresponding optical flow motion data. We
systematically evaluate the method on the KTH, HMDB51,
YouTube, and Hollywood2 datasets to demonstrate its perfor-
mance and generalizability. For comparison, we also show that
the proposed method is superior to some previously-published
hand-crafted solutions.
The main contributions of this paper can be summarized as
follows.
1) GP is used to automatically “evolve” spatio-temporal
feature descriptors that are adaptive and discriminative
for action recognition without profound knowledge of
the action datasets.
2) The GP-learned descriptors provide an effective way
to simultaneously extract and fuse the color and
motion (i.e., optical flow) information into one feature
representation.
This paper is organized as follows. In Section II, some
related work is summarized. The detailed architecture of our
method is presented in Section III, and relevant experiments and
results are described in Section IV. In Section V, we conclude
this paper and outline possible future work.
II. RELATED WORK
As this paper falls in the category of holistic representa-
tions, we mainly review methods of holistic spatio-temporal
representations for action recognition.
Bobick and Davis [10] presented motion templates through
projecting frames onto a single image, namely motion his-
tory images (MHI) and motion energy images. This kind of
motion template can capture the motion patterns occurring
in a video sequence. However, this simple representation only
gives satisfactory performance where the background is rel-
atively static. Efros et al. [16] proposed a motion descriptor
based on smoothed and aggregated optical flows over a spatio-
temporal volume, which is centered on a moving figure. This
descriptor has been proven to be suitable for distant objects,
but the moving figure needs to be localized quite accurately.
Schindler and Van Gool [17] found that very short snippets (1-7
frames) are sufficient for basic action recognition. They
applied log-Gabor filters to raw frames to extract form fea-
tures and optical flow filters to extract motion features. In
addition, Gorelick et al. [18] extracted spatio-temporal fea-
tures, such as local space-time saliency, action dynamics,
shape structure, and orientation, based on the properties of
Poisson equation solutions. Moreover, some recent discrimi-
nant analysis methods have also shown superior performance
for action recognition, such as slow feature analysis (SFA) [19]
which extracts the slowly varying and relevant stable fea-
tures from the quickly changed action videos. SFA has been
proved to be effectively used in constructing the visual recep-
tive fields of the cortical neurons. General tensor discriminant
analysis and Gabor features originally proposed for gait recog-
nition [20] can be also applied to action recognition. These
handcrafted features usually involve a lot of engineering
work to design and tune and are not adaptive to different
datasets.
Besides handcrafted features, there have also been a few
works on learning feature representations for action recog-
nition. Le et al. [21] have proposed using unsupervised
feature learning as a direct way to learn invariant spatio-
temporal features from unlabeled video data. Furthermore,
Taylor et al. [12] have introduced a model that learns latent
representations of image sequences from pairs of successive
images. Similarly, Ji et al. [11] developed a 3D convolutional
neural network (CNN), which is directly extended from its
2-D counterpart, for feature extraction. In a 3D CNN, motion
information in multiple adjacent frames is captured through
performing convolutions over spatial and temporal dimensions.
The convolutional architecture of their model allows it to scale
to realistic video sizes whilst using a compact parametrization.
Recently, the deep belief network (DBN) [22] has also shown its
capacity to automatically learn multiple layers of nonlinear
features from images and videos. However, the number of

parameters to be learned in those deep learning models [23]
is very large, sometimes too large relative to the available
number of training samples, which unfortunately restricts their
applicability.
Within the area of evolutionary computation, evolution-
based methods simulate biological evolution to automatically
generate solutions for user-defined tasks, such as: genetic algo-
rithms (GA), memetic algorithms (MA), particle swarm opti-
mization (PSO), ant-colony systems (ACS), and GP. Generally,
these are heuristic and population-based searching methods.
They all attempt to move from one population to another
population in a single iteration with probabilistic rules. In
particular, GA seeks the solution of a problem in the form
of a string of numbers (traditionally binary, although the best
representations are usually those that reflect something about
the problem being solved), by applying operators such as
recombination and mutation (sometimes one, sometimes both).
Bhanu et al. [24] have proposed an adaptive image segmen-
tation system based on a GA. In their method, the GA is
an effective way of searching the hyperspace of segmentation
parameter combinations to determine the set which maximizes
a segmentation quality criterion. Besides GA, PSO, which
is inspired by the social behavior of migrating birds trying
to reach an unknown destination, has been used for feature
selection and classification in computer vision tasks. In [25],
PSO is incorporated within an AdaBoost framework for face
detection. Dynamic clustering using PSO has been proposed for
unsupervised image classification in [26]. Additionally, a
multiobjective PSO for discriminative feature selection was
proposed in [27] for robust classification problems. Beyond
the above methods, MA [28] and ACS [29] have been adopted
in vision applications too.
However, since GA and MA are based on a fixed form
of gene expression during the whole optimization procedure,
the representations of the solution are relatively fixed and
limited, which heavily influences their effectiveness on com-
plex optimization problems. In contrast to GA/MA,
PSO considers the birds’ social behavior and accordingly their
movements toward an optimal destination rather than cre-
ating new solutions within each generation. Compared with
other evolution-based methods, PSO achieves a final solution
in a linear search space and tends to be relatively efficient.
However, this kind of simple linear search cannot tackle
complex optimization problems.
To enable more flexible representations, another evolution-
ary approach, i.e., GP, has been proposed [15], [30]. GP
has been widely utilized in the computer vision domain and
proved to be more powerful than GA. It is more intuitive
for implementation and can effectively solve highly non-
linear optimization problems. This kind of flexible, nonlinear
searching mechanism thus helps GP achieve better solutions. Poli [31] applied
GP to automatically select optimal filters for segmentation
of the brain in medical images. Following the same line,
Torres et al. [32] used GP for finding a combination of sim-
ilarity functions for image retrieval. Davis et al. [33] have
also employed GP for feature selection in multivariate data
analysis, where GP can automatically select a subset of the
most discriminative features without any prior information. In
addition, other researchers [34]-[36] have also successfully
applied GP to recognition tasks with improvements compared
with previous methods.
Recently, GP has been exploited to assemble low-level fea-
ture detectors for high-level analysis, such as: object detection,
3D reconstruction, image tracking, and matching. The first
work in this area employed GP to evolve an operator for
detecting interest points [37]. Trujillo and Olague [38] have also
used GP to generate feature extractors for computer vision
applications. In addition, a GP-based detector was proposed
by Howard et al. [39] for detecting ship wakes in synthetic
aperture radar images.
The most closely related work, which uses GP to automatically
generate low-level features for action recognition, was introduced
in [40]. In that work, some basic filters are successfully
evolved to construct spatio-temporal descriptors for represent-
ing action sequences. Although this framework is regarded as
the first attempt in using GP to learn holistic representations
for action recognition, some aspects in this framework can
still be improved. Specifically, the evolved structure is totally
random rather than mimicking the structure of the human
brain cortex with multiple tiers; this kind of random evolu-
tion may fail to find the best solutions in a limited number of
generations. Furthermore, the previous work attempts to learn
general-purpose representations, which tend to be less specific
and discriminative for different action data domains. Lastly,
the method was only evaluated on “staged” action datasets
rather than realistic action datasets. We expect to solve all
these issues in this paper.
Inspired by the effectiveness of GP on flexible optimiza-
tion tasks and successful applications mentioned above, in
this paper, we use GP to automatically evolve more task-
specific spatio-temporal descriptors from a set of 3D filters
and operators for realistic human action recognition.
III. EVOLUTIONARY MOTION FEATURE EXTRACTION
Much of what is done in action recognition aims to achieve
what the human vision system is capable of. This has caused
many researchers to model systems and algorithms after vari-
ous aspects of the human vision system. In this paper, we also
attempt to simulate the human visual cortex system which is
made up of hierarchical layers of neurons with feedforward
and feedback connections that grow and evolve from birth as
the vision system develops. The early stages of processing in
the visual cortex are sensitive to visual stimuli such as intensity
and orientations, spatial motion, and even colors. In a simi-
lar way that feedforward neural connections between these
visual cortex layers are created and evolved in humans, we
propose a domain-independent machine learning methodology
to automatically generate low-level spatio-temporal descriptors
for high-level action recognition using GP. In our architecture,
the original color and optical-flow sequences are regarded as
the inputs and a group of 3D operators are assembled to con-
struct an effective problem-specific descriptor which is capable
of selectively extracting features from input data. The final
evolved descriptor, combining the nice properties of those

primitive 3D operators, can both extract meaningful features and form a compact action representation. We learn our proposed system over a training set, in which descriptors are evolved by maximizing the recognition accuracy through a fitness function, and we further evaluate the GP-selected descriptor over a testing set to demonstrate the performance of our method. The architecture of our proposed model is illustrated in Fig. 1.

Fig. 1. Outline of our feature learning-based approach.

TABLE I: FUNCTION SET IN GP.

Generally, GP programs can be represented as a tree structure and evolved (by selection, crossover, and mutation) through sexual reproduction, with pairs of parents chosen stochastically but biased by their fitness on the task at hand; the best-performing individual is finally selected as the terminal solution. In our method, each individual in GP represents a candidate spatio-temporal descriptor and is evolved continuously through generations. To establish the architecture of our model, three key components must first be defined: the function set, the terminal set, and the fitness function.
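To make this evolutionary architecture easier to follow, the sketch below (in Python) shows the loop it describes: individuals are candidate descriptors assembled from primitive 3D operators, fitness is the average cross-validation error of an SVM on the training set, and the best-so-far individual is kept as the (near-)optimal descriptor. This is a toy illustration rather than the authors' implementation; the reduced function set, the chain-shaped (rather than tree-shaped) individuals, and all helper names are simplifying assumptions.

# A minimal, self-contained sketch (not the authors' code) of the evolutionary
# loop described above: fitness = average cross-validation error of an SVM,
# and the best-so-far individual is returned as the final descriptor.
import random
import numpy as np
from scipy import ndimage
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Every primitive maps a 3D sequence to a 3D sequence of the same size,
# mirroring the closure property required of GP function nodes.
FUNCTION_SET = {
    "gauss3d":   lambda v: ndimage.gaussian_filter(v, sigma=1.0),
    "laplace3d": ndimage.laplace,
    "maxpool3d": lambda v: ndimage.maximum_filter(v, size=5),
    "absolute":  np.abs,
}

def random_descriptor(length=3):
    return [random.choice(list(FUNCTION_SET)) for _ in range(length)]

def apply_descriptor(descriptor, video):
    """Run one candidate descriptor on a clip and pool the response into a
    short feature vector (a crude stand-in for the real representation)."""
    out = video.astype(np.float64)
    for name in descriptor:
        out = FUNCTION_SET[name](out)
    return np.array([out.mean(), out.std(), out.max(), out.min()])

def fitness(descriptor, videos, labels, folds=3):
    """Average cross-validation classification error (lower is better)."""
    feats = np.vstack([apply_descriptor(descriptor, v) for v in videos])
    return 1.0 - cross_val_score(SVC(kernel="linear"), feats, labels, cv=folds).mean()

def crossover(a, b):
    cut = random.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]

def mutate(descriptor):
    child = list(descriptor)
    child[random.randrange(len(child))] = random.choice(list(FUNCTION_SET))
    return child

def evolve(videos, labels, pop_size=20, generations=10, mutation_rate=0.2):
    population = [random_descriptor() for _ in range(pop_size)]
    best, best_err = None, float("inf")
    for _ in range(generations):
        errors = [fitness(d, videos, labels) for d in population]
        for d, e in zip(population, errors):       # keep the best-so-far solution
            if e < best_err:
                best, best_err = d, e
        def tournament():                          # selection biased toward low error
            i, j = random.sample(range(pop_size), 2)
            return population[i] if errors[i] < errors[j] else population[j]
        offspring = []
        while len(offspring) < pop_size:           # sexual reproduction + mutation
            child = crossover(tournament(), tournament())
            if random.random() < mutation_rate:
                child = mutate(child)
            offspring.append(child)
        population = offspring
    return best, best_err

Given a list of training clips stored as (frames, height, width) numpy arrays with integer labels, evolve(videos, labels) returns the lowest-error chain found; the real system additionally enforces the two-tier ordering described below and draws on the full function set of Table I.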
A. Function Set and Terminal Set
A key component of GP is the function set which consti-
tutes the internal nodes of the tree and is typically driven by
the nature of the problem. To make the GP evolution process
fast, more efficient operators that can extract meaningful infor-
mation from action sequences are preferred. Our function set
consists of 19 unary operators and 4 binary ones, including
processing filters and basic arithmetic functions, as illustrated
in Table I.
In our GP structure, we divide our function set into two
tiers: 1) filtering tier (bottom tier) and 2) max-pooling tier
(top tier). The order of these tiers in our structure is always
fixed. Specifically, we do not allow the filter operators in the
function set to be above the max-pooling functions. In our
implementation, when a descriptor is evolved, we will check
whether it is a wrongly ordered descriptor or not. A wrongly
ordered descriptor will be automatically discarded by our pro-
gram and a new correctly-ordered descriptor will be evolved
to replace the discarded one. In this way, in any GP-evolved
program, the operators in the filtering tier must be located
below the operators in the max-pooling tier. In addition, not
all the operators listed in the function set have to be used in a
given tree and the same operator can be used more than once.
Therefore, the topology of the tree in each tier is essentially
unrestricted. This kind of tree structure makes full use of prior
knowledge to mimic the physical structure of the human visual
cortex [41], [42] by encoding orientation, intensity, and color
information of the targets and can effectively tolerate shifting,
translation, and scaling for action recognition and simultane-
ously reduce the GP searching space to effectively accelerate
the convergence toward optimal solutions.
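As a concrete illustration of this ordering constraint, the check below assumes a candidate program is stored as a nested tuple of operator names (a hypothetical representation; the operator names are placeholders rather than the exact entries of Table I). It rejects any tree in which a filtering operator sits above a max-pooling operator; during evolution, such a tree would simply be discarded and regenerated, as described above.

# Hypothetical representation: a GP tree is either a terminal string (an input
# sequence) or a tuple (operator_name, child, ...).  Operator names below are
# illustrative stand-ins for the filtering-tier and max-pooling-tier entries.
FILTERING_OPS = {"gauss3d", "laplace3d", "wavelet3d", "gabor3d", "add", "sub"}
MAXPOOL_OPS = {"maxpool5", "maxpool10", "maxpool15", "maxpool20"}

def contains_maxpool(tree):
    if not isinstance(tree, tuple):
        return False
    return tree[0] in MAXPOOL_OPS or any(contains_maxpool(c) for c in tree[1:])

def is_correctly_ordered(tree):
    """True if no filtering operator appears above a max-pooling operator,
    i.e., max-pooling nodes only ever sit closer to the root than filters."""
    if not isinstance(tree, tuple):              # terminal (input sequence)
        return True
    op, children = tree[0], tree[1:]
    if op in FILTERING_OPS:
        # Once inside the filtering tier, no max-pooling node may appear below.
        return all(not contains_maxpool(c) and is_correctly_ordered(c)
                   for c in children)
    return all(is_correctly_ordered(c) for c in children)

# is_correctly_ordered(("gauss3d", ("maxpool5", "color_sequence")))  -> False
# is_correctly_ordered(("maxpool5", ("gauss3d", "color_sequence")))  -> True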
1) Filtering Tier: In the filtering tier, aiming to extract
meaningful features from dynamic actions, we adopt
3D Gaussian filters, 3D Laplacian filters, 3D wavelet filters,
3D Gabor filters, and some other sequence processing oper-
ators and basic arithmetic functions.
3D Gaussian filters are adopted due to their ability for
denoising and 3D Laplacian filters are used for separating sig-
nals into different spectral sub-bands. Laplacian of Gaussian
operators have been successfully applied to capture intensity
features for action recognition in [2] and [43]. Wavelet trans-
forms can perform multiresolution analysis and obtain the
contour information of human actions by using the 3D CDF
“9/7” [44] wavelet filters.
In this paper, these 3D filters are used for constructing the
sequence pyramid (i.e., GauPy, LapPy, Wavelet), which is a data
structure designed to support efficient scaled convolution
through reducing the resolution. It consists of a sequence of
copies of an original sequence in which both sampling den-
sity and resolution are decreased in regular steps. A pyramid
is thus a multiscale representation built recursively. Beyond
those, 3D Gabor filters are regarded as the most effective
method to obtain the orientation information in a sequence.
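To make the pyramid construction just described concrete, here is a minimal sketch assuming scipy and a clip stored as a (frames, height, width) numpy array; it illustrates the idea behind the GauPy structure rather than the authors' exact filters.

import numpy as np
from scipy import ndimage

def gaussian_pyramid_3d(video, levels=3, sigma=1.0):
    """Build a 3D Gaussian pyramid: at each level the sequence is smoothed
    with a 3D Gaussian and then subsampled by 2 along every axis, so both
    sampling density and resolution drop in regular steps."""
    pyramid = [video.astype(np.float32)]
    for _ in range(levels - 1):
        smoothed = ndimage.gaussian_filter(pyramid[-1], sigma=sigma)
        pyramid.append(smoothed[::2, ::2, ::2])   # halve resolution recursively
    return pyramid

# Example: a 16-frame, 64x64 clip yields levels of shape
# (16, 64, 64), (8, 32, 32), and (4, 16, 16).
clip = np.random.rand(16, 64, 64).astype(np.float32)
levels = gaussian_pyramid_3d(clip)

A Laplacian (LapPy) or wavelet pyramid can be built in the same spirit by changing the filtering step at each level.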
Following Riesenhuber and Poggio [41], we simulate the biological mechanism of the visual cortex to define our Gabor filter-based operators. Firstly, we convolve an input action sequence with Gabor filters at six different scales (7 × 7 × 7, 9 × 9 × 9, 11 × 11 × 11, 13 × 13 × 13, 15 × 15 × 15, and 17 × 17 × 17) under a certain orientation (i.e., 0°, 45°, 90°, or 135°); we further apply the max operation to pick the maximum value across all six convolved sequences for that particular orientation. Fig. 3 illustrates the procedure of our multiscale-max Gabor filters for a certain orientation. The max operation among different scales is defined as follows:

I_MAX(x, y, z) = max[ I_{7×7×7}(x, y, z, θ_s), I_{9×9×9}(x, y, z, θ_s), ...,
                      I_{15×15×15}(x, y, z, θ_s), I_{17×17×17}(x, y, z, θ_s) ]    (1)

where I_MAX is the output of the multiscale-max Gabor filter and I_{i×i×i}(x, y, z, θ_s) denotes the sequence convolved with the filter of scale i × i × i under orientation θ_s.

Fig. 3. Outline of the multiscale-max Gabor filter. This figure illustrates an example of the multiscale-max Gabor filter with a fixed orientation of 45°.
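The multiscale-max operation in (1) can be sketched as follows. The kernel used here, a Gaussian envelope modulated by a planar cosine carrier at orientation θ, is a simplified stand-in for the Gabor filters of [41]; the six scales and four orientations follow the text above, and the function names are illustrative.

import numpy as np
from scipy import ndimage

def gabor_kernel_3d(size, theta, wavelength=4.0, sigma_ratio=0.35):
    """A simplified 3D Gabor-like kernel: Gaussian envelope times a cosine
    carrier oriented at angle theta (degrees) in the spatial (x, y) plane."""
    half = size // 2
    z, y, x = np.mgrid[-half:half + 1, -half:half + 1, -half:half + 1]
    sigma = sigma_ratio * size
    envelope = np.exp(-(x**2 + y**2 + z**2) / (2.0 * sigma**2))
    t = np.deg2rad(theta)
    carrier = np.cos(2.0 * np.pi * (x * np.cos(t) + y * np.sin(t)) / wavelength)
    kernel = envelope * carrier
    return kernel - kernel.mean()             # zero-mean, as usual for Gabor filters

def multiscale_max_gabor(video, theta, sizes=(7, 9, 11, 13, 15, 17)):
    """Implements (1): convolve the sequence with Gabor filters at six scales
    under a fixed orientation and keep the per-voxel maximum response."""
    responses = [ndimage.convolve(video.astype(np.float32),
                                  gabor_kernel_3d(s, theta), mode="nearest")
                 for s in sizes]
    return np.maximum.reduce(responses)       # I_MAX(x, y, z)

# Example: per-voxel multiscale-max Gabor response at a fixed 45-degree orientation.
clip = np.random.rand(20, 32, 32).astype(np.float32)
response_45 = multiscale_max_gabor(clip, theta=45)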
Moreover, several other 3D operators that are common for
feature extraction are added to the function set to increase the
variety of the selection for composing individuals during the
GP evolution. Basic arithmetic functions are chosen to real-
ize operations such as addition and subtraction of the internal
nodes of the tree to make the whole evolution procedure more
natural.
To ensure the closure property [15], we have only used func-
tions which map one or two 3D sequences to a single 3D
sequence with identical size (i.e., the input and the output of
each function node have the same size). In this way, a GP tree
can be an unrestricted composition of function nodes but still
always produce a semantically legal structure.
2) Max-Pooling Tier: In the max-pooling tier, we include
four functions listed in Table I, which are performed over
local neighborhoods with windows varying from 5 × 5 × 5 to 20 × 20 × 20 with a shifting step of 5 pixels. This max-pooling operation (see Fig. 2) is a key mechanism for object

Fig. 2. Illustration of the mechanism of max-pooling filter.
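A rough sketch of such a pooling operator is given below, assuming cubic windows of side 5, 10, 15, and 20 voxels shifted with a stride of 5 pixels; this reading of the window sizes and the helper itself are illustrative rather than the authors' exact definition.

import numpy as np

def max_pool_3d(video, window, stride=5):
    """Max-pool a (frames, height, width) sequence over local cubic
    neighborhoods of side `window`, shifted by `stride` voxels."""
    f, h, w = video.shape
    out = []
    for z in range(0, f - window + 1, stride):
        for y in range(0, h - window + 1, stride):
            for x in range(0, w - window + 1, stride):
                out.append(video[z:z + window, y:y + window, x:x + window].max())
    return np.array(out)

# Pooled responses for the four window sizes mentioned above.
clip = np.random.rand(20, 40, 40).astype(np.float32)
pooled = {w: max_pool_3d(clip, w) for w in (5, 10, 15, 20)}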
