TL;DR: This paper automatically learns spatio-temporal motion features for action recognition via an evolutionary method, i.e., genetic programming (GP), which evolves the motion feature descriptor on a population of primitive 3D operators (e.g., 3D-Gabor and wavelet).
Abstract:
Extracting discriminative and robust features from video sequences is the first and most critical step in human action recognition. In this paper, instead of using handcrafted features, we automatically learn spatio-temporal motion features for action recognition. This is achieved via an evolutionary method, i.e., genetic programming (GP), which evolves the motion feature descriptor on a population of primitive 3D operators (e.g., 3D-Gabor and wavelet). In this way, scale- and shift-invariant features can be effectively extracted from both color and optical flow sequences. We intend to learn data-adaptive descriptors for different datasets with multiple layers, which makes full use of prior knowledge to mimic the physical structure of the human visual cortex for action recognition and simultaneously reduces the GP search space, effectively accelerating the convergence toward optimal solutions. In our evolutionary architecture, the average cross-validation classification error, computed by a support vector machine (SVM) classifier on the training set, is adopted as the evaluation criterion for the GP fitness function. After the entire evolution procedure finishes, the best-so-far solution selected by GP is regarded as the (near-)optimal action descriptor. The GP-evolving feature extraction method is evaluated on four popular action datasets, namely KTH, HMDB51, UCF YouTube, and Hollywood2. Experimental results show that our method significantly outperforms other types of features, both hand-designed and machine-learned.
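To make the evolutionary search concrete, the following is a minimal, generic GP sketch in Python. It is an illustration only, not the authors' GPLAB/MATLAB implementation: individuals are toy programs over assumed primitive-operator names, evolved by selection, crossover, and mutation, and the deterministic fitness stub stands in for the real criterion (the average SVM cross-validation error).

```python
import random

# Assumed primitive-operator names, for illustration only
PRIMITIVES = ["gauss3d", "laplace3d", "wavelet3d", "add", "absdiff"]

def random_program(length=6):
    """A candidate descriptor, encoded here as a flat list of primitives."""
    return [random.choice(PRIMITIVES) for _ in range(length)]

def crossover(a, b):
    """One-point crossover of two parent programs."""
    cut = random.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

def mutate(p):
    """Point mutation: replace one primitive at random."""
    q = list(p)
    q[random.randrange(len(q))] = random.choice(PRIMITIVES)
    return q

def fitness(program):
    """Toy deterministic stand-in for the real fitness, i.e., the average
    cross-validation classification error of an SVM on the training set
    (lower is better)."""
    return -len(set(program))

def evolve(pop_size=30, generations=20, cx_rate=0.8, mut_rate=0.2):
    population = [random_program() for _ in range(pop_size)]
    best = min(population, key=fitness)
    for _ in range(generations):
        population.sort(key=fitness)                  # rank by fitness
        best = min(best, population[0], key=fitness)  # keep best-so-far
        parents = population[: pop_size // 2]         # truncation selection
        offspring = []
        while len(offspring) < pop_size:
            a, b = random.sample(parents, 2)
            child = crossover(a, b) if random.random() < cx_rate else list(a)
            if random.random() < mut_rate:
                child = mutate(child)
            offspring.append(child)
        population = offspring
    return best   # the best-so-far program is the (near-)optimal descriptor
```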
TL;DR: The experimental results show that RNN-IDS is well suited for building a high-accuracy classification model and that its performance is superior to that of traditional machine learning classification methods in both binary and multiclass classification.
TL;DR: A compact, effective yet simple method to encode spatio-temporal information carried in 3D skeleton sequences into multiple 2D images, referred to as Joint Trajectory Maps (JTM), and ConvNets are adopted to exploit the discriminative features for real-time human action recognition.
TL;DR: Most computer vision applications, such as human-computer interaction, virtual reality, security, video surveillance, and home monitoring, are highly correlated with HAR tasks, which establishes new trends and milestones in the development cycle of HAR systems.
TL;DR: A novel two-stage deep learning model based on a stacked auto-encoder with a soft-max classifier for efficient network intrusion detection that has the potential to serve as a future benchmark for deep learning and network security research communities.
TL;DR: This paper investigates how state-of-the-art change detection algorithms can be combined and used to create a more robust algorithm leveraging their individual peculiarities and exploits genetic programming (GP) to automatically select the best algorithms, combine them in different ways, and perform the most suitable post-processing operations on the outputs of the algorithms.
TL;DR: A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.
TL;DR: In this paper, the spatial intensity gradient of the images is used to find a good match using a type of Newton-Raphson iteration, which can be generalized to handle rotation, scaling and shearing.
TL;DR: A novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.
TL;DR: A novel 3D CNN model for action recognition that extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.
Q1. What do the authors propose in this paper?
In this paper, instead of using handcrafted features, the authors automatically learn spatio-temporal motion features for action recognition.
Q2. What are the future works in this paper?
In future work, the authors will mainly focus on parallel and GPU computation to speed up their methods. In addition, other more recent evolutionary methods (e.g., PSO) will be taken into consideration for learning discriminative features.
Q3. What is the main purpose of evolution-based methods?
Within the area of evolutionary computation, evolution-based methods simulate biological evolution to automatically generate solutions for user-defined tasks; examples include genetic algorithms (GA), memetic algorithms (MA), particle swarm optimization (PSO), ant colony systems (ACS), and GP.
Q4. What is the advantage of the convolutional architecture of the CNN?
The convolutional architecture of their model allows it to scale to realistic video sizes whilst using a compact parametrization.
Q5. How do the authors learn the features of the combined learning and evaluation sets?
In addition, the authors have also utilized two popular deep learning methods, i.e., DBN [22] and CNN [56], to learn hierarchical architectures for feature extraction on the combined learning and evaluation sets.
Q6. What is the function set that is used to ensure the closure property?
To ensure the closure property [15], the authors have only used functions which map one or two 3D sequences to a single 3D sequence with identical size (i.e., the input and the output of each function node have the same size).
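As a hedged illustration of this closure constraint (with NumPy/SciPy operators standing in for the paper's actual primitive set), every node below maps one or two 3D volumes to a single volume of identical size, so any randomly composed GP tree stays shape-valid:

```python
import numpy as np
from scipy import ndimage

# Illustrative closure-preserving nodes: each maps one or two 3D video
# volumes (frames x height x width) to one volume of the same size.
def op_add(a, b):                 # binary node: elementwise sum
    return a + b

def op_absdiff(a, b):             # binary node: elementwise absolute difference
    return np.abs(a - b)

def op_gauss3d(a, sigma=1.0):     # unary node: 3D Gaussian smoothing
    return ndimage.gaussian_filter(a, sigma=sigma)

def op_laplace3d(a):              # unary node: 3D Laplacian filtering
    return ndimage.laplace(a)

# Closure check: every operator preserves the input shape.
video = np.random.rand(16, 64, 64)            # toy 3D sequence (T x H x W)
assert op_add(video, video).shape == video.shape
assert op_absdiff(video, video).shape == video.shape
assert op_gauss3d(video).shape == video.shape
assert op_laplace3d(video).shape == video.shape
```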
Q7. How does the feature descriptor perform on the test set?
As expected, the best GP-evolved feature descriptor achieves a recognition accuracy of 82.3% on the test set using the SVM classifier, since this collection represents a natural pool of actions featured in a wide range of scenes and viewpoints with large intra-class variability.
Q8. What is the structure of the descriptor?
Once a descriptor is learned and selected, its structure is fixed, and it can be applied to new data in the same way as a hand-crafted descriptor.
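A minimal sketch of that reuse, assuming the illustrative operators from the closure example and a hypothetical mean/std pooling step (the paper's actual encoding may differ): the frozen expression tree is just a deterministic function applied to any new video.

```python
import numpy as np
from scipy import ndimage

# Illustrative operators (same closure-preserving style as above)
op_absdiff = lambda a, b: np.abs(a - b)
op_laplace = lambda a: ndimage.laplace(a)
op_gauss   = lambda a: ndimage.gaussian_filter(a, sigma=1.0)

def eval_tree(node, video):
    """Recursively evaluate a frozen GP expression tree on one video."""
    if node == "x":                      # terminal: the raw input volume
        return video
    op, children = node                  # internal node: (function, children)
    return op(*[eval_tree(c, video) for c in children])

# Example frozen descriptor: gauss(absdiff(x, laplace(x)))
tree = (op_gauss, [(op_absdiff, ["x", (op_laplace, ["x"])])])

def describe(video):
    response = eval_tree(tree, video)
    # Hypothetical pooling into a fixed-length feature vector
    return np.array([response.mean(), response.std()])

features = describe(np.random.rand(16, 64, 64))  # usable on any new video
```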
Q9. How do the authors implement the proposed method?
The authors implement their proposed method using MATLAB 2011a (with the GP toolbox GPLAB) on a server with a 12-core processor and 54 GB of RAM running the Linux operating system.
Q10. What is the fitness function for each training sample?
In their method, the fitness function is the average cross-validation classification error computed by an SVM classifier on the training set. Because each training sample is a video sequence containing a large number of pixels, the fitness function has to be evaluated over the training set many times for the whole population in each GP generation, which makes fitness evaluation the dominant computational cost.
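A minimal sketch of this evaluation with scikit-learn, assuming a hypothetical `describe` function that extracts one feature vector per video for a given GP individual (the authors' fold count, SVM kernel, and features are not specified here):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def gp_fitness(describe, videos, labels, folds=5):
    """Fitness of one GP individual: average cross-validation
    classification error of an SVM over the training set."""
    # The expensive step: run the evolved descriptor on every video.
    X = np.vstack([describe(v) for v in videos])
    acc = cross_val_score(SVC(kernel="rbf"), X, np.asarray(labels),
                          cv=folds).mean()
    return 1.0 - acc   # GP minimizes the classification error
```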