Journal ArticleDOI

Going deeper into action recognition

01 Apr 2017-Image and Vision Computing (Elsevier)-Vol. 60, pp 4-21
TL;DR: This survey provides a comprehensive review of the notable steps taken towards recognizing human actions, starting with the pioneering methods that use handcrafted representations and then navigating into the realm of deep-learning-based approaches.
About: This article is published in Image and Vision Computing. The article was published on 2017-04-01 and is currently open access. It has received 452 citations to date. The article focuses on the topic: Action (philosophy).
Citations
Journal ArticleDOI
31 Jan 2018
TL;DR: These 10 grand challenges may see major breakthroughs, research advances, and/or socioeconomic impact in the next 5 to 10 years; the first seven represent underpinning technologies that have a wider impact on all application areas of robotics.
Abstract: One of the ambitions of Science Robotics is to deeply root robotics research in science while developing novel robotic platforms that will enable new scientific discoveries. Of our 10 grand challenges, the first 7 represent underpinning technologies that have a wider impact on all application areas of robotics. For the next two challenges, we have included social robotics and medical robotics as application-specific areas of development to highlight the substantial societal and health impacts that they will bring. Finally, the last challenge is related to responsible innovation and how ethics and security should be carefully considered as we develop the technology further.

791 citations

Journal ArticleDOI
TL;DR: A novel action recognition method that processes video data with a convolutional neural network (CNN) and a deep bidirectional LSTM (DB-LSTM) network, capable of learning long-term sequences and processing lengthy videos by analyzing features over a certain time interval.
Abstract: Recurrent neural networks (RNN) and long short-term memory (LSTM) have achieved great success in processing sequential multimedia data and yielded state-of-the-art results in speech recognition, digital signal processing, video processing, and text data analysis. In this paper, we propose a novel action recognition method that processes video data using a convolutional neural network (CNN) and a deep bidirectional LSTM (DB-LSTM) network. First, deep features are extracted from every sixth frame of the videos, which helps reduce redundancy and complexity. Next, the sequential information among frame features is learnt using a DB-LSTM network, where multiple layers are stacked together in both the forward and backward passes of the DB-LSTM to increase its depth. The proposed method is capable of learning long-term sequences and can process lengthy videos by analyzing features over a certain time interval. Experimental results show significant improvements in action recognition using the proposed method on three benchmark data sets, including UCF-101, YouTube 11 Actions, and HMDB51, compared with state-of-the-art action recognition methods.
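
The pipeline described in this abstract lends itself to a compact sketch. Below is a minimal, hedged illustration of the CNN-features-plus-deep-bidirectional-LSTM idea in PyTorch; the feature dimension, hidden size, depth, temporal pooling, and class count are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch of the CNN + deep bidirectional LSTM idea (hypothetical
# sizes; not the authors' exact architecture).
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, feat_dim=2048, hidden=256, num_layers=2, num_classes=101):
        super().__init__()
        # Stacked bidirectional LSTM over per-frame CNN features; the features
        # would come from a pretrained CNN applied to every sixth frame.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, feats):            # feats: (batch, time, feat_dim)
        out, _ = self.lstm(feats)        # (batch, time, 2 * hidden)
        return self.fc(out.mean(dim=1))  # pool over time, then classify

clip_feats = torch.randn(4, 30, 2048)    # e.g. 30 sampled frames per clip
logits = CNNBiLSTM()(clip_feats)         # (4, 101) class scores
```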

529 citations


Cites background from "Going deeper into action recognition"

  • ...For example, the legs motion for kicking a football is a simple action, while jumping for a head-shoot is a collective motion of legs, arms, head, and whole body [3]....

Proceedings ArticleDOI
21 Jul 2017
TL;DR: A new self-supervised CNN pre-training technique based on a novel auxiliary task called odd-one-out learning, which learns temporal representations for videos that generalize to other related tasks such as action recognition.
Abstract: We propose a new self-supervised CNN pre-training technique based on a novel auxiliary task called odd-one-out learning. In this task, the machine is asked to identify the unrelated or odd element from a set of otherwise related elements. We apply this technique to self-supervised video representation learning, where we sample subsequences from videos and ask the network to learn to predict the odd video subsequence. The odd video subsequence is sampled such that it has the wrong temporal order of frames, while the even ones have the correct temporal order. Therefore, generating an odd-one-out question requires no manual annotation. Our learning machine is implemented as a multi-stream convolutional neural network, which is learned end-to-end. Using odd-one-out networks, we learn temporal representations for videos that generalize to other related tasks such as action recognition. On action classification, our method obtains 60.3% on the UCF101 dataset using only UCF101 data for training, which is approximately 10% better than current state-of-the-art self-supervised learning methods. Similarly, on the HMDB51 dataset we outperform self-supervised state-of-the-art methods by 12.7% on the action classification task.
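
To make the annotation-free question generation concrete, here is a hedged sketch of one way to assemble an odd-one-out question from a frame sequence; the sampling scheme and sizes are simplifications of the paper's description, not its exact procedure.

```python
# Hypothetical sketch: build an odd-one-out question from a video with no
# manual labels (simplified from the paper's description).
import random

def odd_one_out_question(frames, n_choices=3, sub_len=6):
    """Return n_choices subsequences, one with shuffled frame order, and its index."""
    choices = []
    for _ in range(n_choices):
        start = random.randrange(len(frames) - sub_len + 1)
        choices.append(frames[start:start + sub_len])
    odd = random.randrange(n_choices)            # which choice to corrupt
    shuffled = choices[odd][:]
    while shuffled == choices[odd]:              # ensure the order actually changes
        random.shuffle(shuffled)
    choices[odd] = shuffled
    return choices, odd                          # the network must predict `odd`

video = list(range(100))                         # stand-in for 100 frame indices
question, answer = odd_one_out_question(video)
```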

489 citations


Cites background from "Going deeper into action recognition"

  • ...Most of the prior work in action recognition is dedicated to hand-crafted features [18] such as dense trajectory features [15, 21, 41, 42]....

Proceedings ArticleDOI
21 Jul 2017
TL;DR: This work proposes to use a new class of models known as Temporal Convolutional Neural Networks (TCN) for 3D human action recognition, and aims to take a step towards a spatio-temporal model that is easier to understand, explain and interpret.
Abstract: The discriminative power of modern deep learning models for 3D human action recognition is growing ever more potent. In conjunction with the recent resurgence of 3D human action representation with 3D skeletons, the quality and the pace of recent progress have been significant. However, the inner workings of state-of-the-art learning-based methods in 3D human action recognition still remain mostly black-box. In this work, we propose to use a new class of models known as Temporal Convolutional Neural Networks (TCN) for 3D human action recognition. TCN provides us with a way to explicitly learn readily interpretable spatio-temporal representations for 3D human action recognition. Through this work, we wish to take a step towards a spatio-temporal model that is easier to understand, explain, and interpret. The resulting model, Res-TCN, achieves state-of-the-art results on the largest 3D human action recognition dataset, NTU-RGBD.
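
A residual temporal convolution unit of the kind this paper builds on can be sketched in a few lines; the channel count and kernel size below are illustrative assumptions, not the published Res-TCN configuration.

```python
# Minimal sketch of a residual temporal convolution block in the spirit of
# Res-TCN (illustrative sizes; pre-activation BN-ReLU-Conv over time).
import torch
import torch.nn as nn

class ResTCNBlock(nn.Module):
    def __init__(self, channels=64, kernel_size=9):
        super().__init__()
        pad = kernel_size // 2           # keep the temporal length unchanged
        self.net = nn.Sequential(
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
        )

    def forward(self, x):                # x: (batch, channels, time)
        return x + self.net(x)           # residual connection over time

x = torch.randn(8, 64, 100)              # e.g. 100 time steps of skeleton features
y = ResTCNBlock()(x)                     # same shape as x
```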

471 citations


Cites background from "Going deeper into action recognition"

  • ...Traditionally, the community has focused on activity recognition in the domain of RGB videos [34, 13]....

Proceedings ArticleDOI
01 Jul 2017
TL;DR: It is demonstrated empirically that the new Single-Stream Temporal Action Proposals model outperforms the state-of-the-art on the task of temporal action proposal generation, while achieving some of the fastest processing speeds in the literature.
Abstract: Our paper presents a new approach for temporal detection of human actions in long, untrimmed video sequences. We introduce Single-Stream Temporal Action Proposals (SST), a new effective and efficient deep architecture for the generation of temporal action proposals. Our network can run continuously in a single stream over very long input video sequences, without the need to divide input into short overlapping clips or temporal windows for batch processing. We demonstrate empirically that our model outperforms the state-of-the-art on the task of temporal action proposal generation, while achieving some of the fastest processing speeds in the literature. Finally, we demonstrate that using SST proposals in conjunction with existing action classifiers results in improved state-of-the-art temporal action detection performance.
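
The single-stream idea can be sketched as a recurrent encoder that runs once over the entire feature sequence and, at every time step, scores K candidate proposal lengths ending there. The encoder choice, dimensions, and scoring head below are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of single-stream temporal proposals: one continuous recurrent
# pass over a long video, emitting per-step confidences for K proposal scales
# (all dimensions are illustrative assumptions).
import torch
import torch.nn as nn

class SSTLike(nn.Module):
    def __init__(self, feat_dim=500, hidden=256, k_scales=32):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, k_scales)

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        h, _ = self.rnn(feats)                 # no overlapping clips or windows
        return torch.sigmoid(self.score(h))    # (batch, time, k_scales)

feats = torch.randn(1, 1000, 500)              # a long, untrimmed video
proposal_scores = SSTLike()(feats)
```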

391 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the resulting residual nets won 1st place in the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
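
The residual reformulation is easy to state in code: a block computes a residual function F(x) and its output is F(x) + x. The sketch below is a basic two-layer block with illustrative sizes, not the full 152-layer network.

```python
# Minimal sketch of a residual block: the output is F(x) + x, so the layers
# only have to learn a residual relative to the identity (illustrative sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(x + out)           # identity shortcut plus learned residual

x = torch.randn(2, 64, 32, 32)
y = BasicBlock()(x)                      # same shape; deep nets stack such blocks
```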

123,388 citations

Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
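
The constant-error-carousel idea is visible in a single cell update: the cell state changes additively, gated multiplicatively. The sketch below uses the now-standard formulation with a forget gate (a later addition by Gers et al.; the original 1997 cell lacked it), with sizes chosen purely for illustration.

```python
# One LSTM step, showing the gated, additive cell-state update that keeps
# error flow constant (modern form with a forget gate; the original 1997
# formulation had no forget gate).
import torch

def lstm_step(x, h, c, W, U, b):
    """Single step. W: (4H, X), U: (4H, H), b: (4H,). Returns new (h, c)."""
    gates = W @ x + U @ h + b
    i, f, o, g = gates.chunk(4)                  # input, forget, output, candidate
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    c = f * c + i * torch.tanh(g)                # additive "carousel" update
    h = o * torch.tanh(c)                        # gated exposure of the cell state
    return h, c

X, H = 8, 16                                     # illustrative sizes
x, h, c = torch.randn(X), torch.zeros(H), torch.zeros(H)
W, U, b = torch.randn(4 * H, X), torch.randn(4 * H, H), torch.zeros(4 * H)
h, c = lstm_step(x, h, c, W, U, b)
```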

72,897 citations


"Going deeper into action recognitio..." refers background in this paper

  • ...Another architecture based on LSTM is proposed by Donahue et al. (2015) to exploit end-to-end training over the composite network as shown in Fig....

  • ...Generative models for action recognition are expected to discover long-term cues and deep models with LSTM cells are natural choices....

  • ...The LSTM autoencoder consists of two RNNs, namely the encoder LSTM and the decoder LSTM....

  • ...The LSTM autoencoder can be used to predict the future of a sequence as well....

  • ...Deep-generative architectures Vincent et al. (2008); Goodfellow et al. (2014); Hochreiter and Schmidhuber (1997) aim at this goal, i.e., learning from temporal data in an unsupervised manner....

Journal ArticleDOI
01 Jan 1998
TL;DR: In this article, convolutional neural networks trained with gradient-based learning are shown to outperform other techniques on handwritten character recognition, and a graph transformer network (GTN) is proposed that allows multi-module recognition systems to be trained globally with gradient-based methods.
Abstract: Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules, including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multi-module systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.
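
A LeNet-style network of the kind this paper popularized fits in a few lines; the sketch below is an illustrative approximation (layer sizes, activations, and 28×28 inputs are assumptions), not the exact LeNet-5 or the GTN machinery.

```python
# Illustrative LeNet-style digit classifier: alternating convolution and
# pooling, then fully connected layers (an approximation, not exact LeNet-5).
import torch
import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 6, 5), nn.Tanh(), nn.AvgPool2d(2),   # 28x28 -> 24x24 -> 12x12
    nn.Conv2d(6, 16, 5), nn.Tanh(), nn.AvgPool2d(2),  # 12x12 -> 8x8 -> 4x4
    nn.Flatten(),
    nn.Linear(16 * 4 * 4, 120), nn.Tanh(),
    nn.Linear(120, 10),                               # ten digit classes
)
logits = lenet(torch.randn(1, 1, 28, 28))             # (1, 10) class scores
```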

42,067 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
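
The "increase width without blowing the budget" trick is the 1×1 reduction placed in front of the larger filters. Below is a hedged sketch of one Inception module; the branch widths happen to mirror the first GoogLeNet module but should be read as illustrative.

```python
# Sketch of one Inception module: parallel 1x1, 3x3, 5x5, and pooling branches
# concatenated along channels; 1x1 reductions keep the compute budget in check
# (branch widths are illustrative).
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch=192):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 96, 1), nn.ReLU(),
                                nn.Conv2d(96, 128, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
                                nn.Conv2d(16, 32, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

y = InceptionModule()(torch.randn(1, 192, 28, 28))    # -> (1, 256, 28, 28)
```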

40,257 citations

Journal ArticleDOI
08 Dec 2014
TL;DR: A new framework for estimating generative models via an adversarial process, in which two models are simultaneously trained: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
Abstract: We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to ½ everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
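
The minimax game reduces to alternating gradient steps: D is trained to separate real from generated samples, G to make D err. A minimal sketch on toy 2-D data follows (architectures, optimizer settings, and the non-saturating generator loss are illustrative choices, not the paper's exact setup).

```python
# Minimal adversarial training loop on toy 2-D data: D learns real-vs-fake,
# G learns to fool D (illustrative sizes; non-saturating generator loss).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 2) + 3.0                  # toy "data" distribution
    fake = G(torch.randn(64, 8))                     # samples from G
    # Discriminator step: push D(real) -> 1 and D(fake) -> 0
    d_loss = (bce(D(real), torch.ones(64, 1)) +
              bce(D(fake.detach()), torch.zeros(64, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: make the updated D label fakes as real
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```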

38,211 citations


"Going deeper into action recognitio..." refers background or methods in this paper

  • ...To sidestep various difficulties in training deep generative models, Goodfellow et al. (2014) introduced the adversarial networks where a generative model competes with a discriminative model known as an adversary. The discriminative model learns to determine whether a sample is coming from the generative model or the data itself. During training, the generative model learns to generate samples that share more similarities to the data to pass the adversary model’s test while adversary model improves its judgments on whether a given sample is authentic or not. To this end, Mathieu et al. (2015) adopted the adversarial methodology to train a multi-scale convolutional network for video prediction....

  • ...Deep-generative architectures Vincent et al. (2008); Goodfellow et al. (2014); Hochreiter and Schmidhuber (1997) aim at this goal, i.e., learning from temporal data in an unsupervised manner....
