Journal ArticleDOI

Action Recognition Based on Efficient Deep Feature Learning in the Spatio-Temporal Domain

TL;DR: A simple, yet robust, 2-D convolutional neural network extended to a concatenated 3-D network that learns to extract features from the spatio-temporal domain of raw video data and is used for content-based recognition of videos.
Abstract: Hand-crafted feature functions are usually designed based on the domain knowledge of a presumably controlled environment and often fail to generalize, as the statistics of real-world data cannot always be modeled correctly. Data-driven feature learning methods, on the other hand, have emerged as an alternative that often generalize better in uncontrolled environments. We present a simple, yet robust, 2-D convolutional neural network extended to a concatenated 3-D network that learns to extract features from the spatio-temporal domain of raw video data. The resulting network model is used for content-based recognition of videos. Relying on a 2-D convolutional neural network allows us to exploit a pretrained network as a descriptor that yielded the best results on the largest and challenging ILSVRC-2014 dataset. Experimental results on commonly used benchmarking video datasets demonstrate that our results are state-of-the-art in terms of accuracy and computational time without requiring any preprocessing (e.g., optic flow) or a priori knowledge on data capture (e.g., camera motion estimation), which makes it more general and flexible than other approaches. Our implementation is made available.

Summary (3 min read)

Introduction

  • Hand-crafted feature functions are usually designed based on the domain knowledge of a presumably controlled environment and often fail to generalize, as the statistics of real-world data cannot always be modeled correctly.
  • Several attempts have been made to address the different perception aspects of such dynamic environments where the robot is meant to assist, such as tracking a hand-held object for grasping [1], capturing human motion [2], activity recognition [3] and sensing the human behaviors [4].
  • Recently, it has been shown that a CNN model trained from a large dataset can be transferred to other visual recognition tasks with limited training data, thereby leading to higher accuracy and a shorter training period [24], [28].
  • Visual recognition methods have to interpret video data displaying a large degree of variability and complexity in order to arrive at a semantic description, i.e., the action class of the recorded scene.

III. PROBLEM FORMULATION

  • Given a set of videos, obtain a label for each video characterizing its content.
  • The video can be of arbitrary spatial and temporal dimensions.

IV. APPROACH

  • The authors use a single-image convolution model for individual frames of video data and perform volumetric convolution at a higher level of abstraction by temporally concatenating the output.
  • In this way the authors are able to initialize their network with the parameters learned from the ImageNet dataset [44].
  • Additionally, the authors freeze the learned network parameters up to the second-last fully-connected layer, combine the output with another pretrained network and build a new softmax model.

A. Feature map concatenation

  • These feature maps are the outputs from layer-16 of the 19-layer network defined in [45] and trained on the ImageNet dataset (see the temporal-concatenation sketch after this list).
  • Afterwards the authors add one 3D convolutional layer followed by three fully-connected layers.
  • The K-way softmax function is applied to the output of the last fully-connected layer, where K is again the number of action categories.
  • Each image is assigned the same label as its corresponding video.
  • The learning rate is adjusted to get the maximum accuracy in a minimum number of iterations on a held-out validation set from the training set.
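The concatenation step can be pictured with a few lines of Python/PyTorch (not the authors' Torch7 code); the 512 channels and 30 frames follow the text, while the 14 x 14 spatial size is an assumption for a 224 x 224 input.

import torch

# One 512 x 14 x 14 feature map per sampled frame, e.g. taken from layer-16 of VGG-19.
frame_features = [torch.randn(512, 14, 14) for _ in range(30)]

# Stack along a new temporal axis to obtain 512 spatio-temporal (3D) feature maps.
volume = torch.stack(frame_features, dim=1)   # shape: (512, 30, 14, 14) = C x T x H x W
print(volume.shape)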

B. Combining multiple networks

  • A deep network learns different features at each level of the layer hierarchy.
  • The authors network learns changes that occur in the temporal domain at a more abstract level because the authors temporally concatenate the output of the convolutional layer-16 from the pretrained network of [45].
  • In order to palliate this deficiency, the authors concatenate the fully connected layer-9 feature vectors with a length of 4096, extracted from another deep network that was trained in 3D from the beginning [32].
  • The model is trained on the Sports-1M dataset [31] which contains about 1 million videos of different sports action categories.
  • After concatenation, the authors perform max-pooling to reduce the feature dimension and afterwards build a new softmax model.

V. EXPERIMENTS

  • The authors evaluate their approach on two publicly available benchmarking datasets, UCF-101 [47] and HMDB [48].
  • These datasets are challenging because many video samples include camera motion as well as a dynamic background.
  • The authors use the same evaluation protocol as proposed by the respective authors and provide an in-depth analysis of their approach using the UCF-101 dataset as a test case.
  • Additional qualitative results for both datasets are available at http://www.iri.upc.edu/people/shusain/actionrecognition.html.
  • Furthermore, the authors separate 10% of the samples from the training data and use them as validation data.

A. UCF-101 dataset

  • Compared with the baseline [31] the authors observe a considerable improvement.
  • It should be noted that the authors use the output from the layer-9 activation (C3D 1 net) and concatenate it with their trained model as described in Fig. 2, i.e., concatenating two networks, as opposed to [32], where the output from 3 networks that have been trained differently is combined.
  • The calculation of optical flow leads to a significant computational overhead.
  • Figure 3 shows the confusion matrix accumulated for all the three splits.

B. Evaluating different scenarios

  • The authors measure the performance of their spatial and spatio-temporal learning framework in different scenarios using split-1 of the UCF-101 dataset.
  • Table II presents the evaluation under different settings along with the comparison to other approaches.
  • The authors observed results similar to the spatial AlexNet-stream [30].
  • The authors found better results for spatial VGG-stream and VGG-3D when training only the adaptation, i.e., the newly added layers.
  • Similar behavior was observed in [31], i.e., a drop in the accuracy when fine tuning all the layers.

C. Learning from temporal information

  • Due to the concatenation of the feature maps in the temporal domain, the 3D kernels should also be able to exploit the temporal information in the video.
  • Hence, if the authors randomly shuffle the video frames while training, they should see a drop in the accuracy due to temporal inconsistency.
  • Figure 4 shows the drop in the accuracy averaged for two training sessions of the method described in Sec. IV-A for split-1 of UCF-101 dataset.

D. HMDB dataset

  • The HMDB dataset [48] contains 6,849 labeled video samples with 51 action categories.
  • Table III shows a comparison with other approaches.
  • The methods from [30] and [34] perform better than theirs; however, both require computation of dense per-frame optical flow for each video.
  • Figure 5 shows the confusion matrix accumulated for all three splits.
  • It can be seen that similar actions such as “throw” and “swing baseball” are the most confused.

E. Qualitative analysis

  • Since the authors do not preprocess the data using techniques such as background subtraction or tracking a bounding box, their feature-learning approach is agnostic to such domain-specific information.
  • Beyond a similar background, the actions themselves may also be visually confusing, which can affect feature learning.
  • Both activities “cartwheel” and “handstand” entail performing a similar motion.
  • These are inherent problems of feature learning when using only raw data and the resulting mislabelings have been named reasonable mistakes in [31].

VI. CONCLUSIONS

  • The authors tackled the problem of action recognition by using a spatio-temporal feature learning scheme.
  • The authors results are competitive with the state-of-the-art convolutional and strong feature-based baselines.
  • The authors use the publicly available Torch7 library, which is optimized for fast processing on both CPUs and GPUs, for their implementation.
  • So far, the authors concatenated the feature maps in the last convolutional layer.

Action Recognition based on Efficient Deep Feature
Learning in the Spatio-Temporal Domain
Farzad Husain, Babette Dellen and Carme Torras
Abstract—Hand-crafted feature functions are usually designed
based on the domain knowledge of a presumably controlled
environment and often fail to generalize, as the statistics of real-
world data cannot always be modeled correctly. Data-driven
feature learning methods, on the other hand, have emerged
as an alternative that often generalize better in uncontrolled
environments. We present a simple, yet robust, 2D convolutional
neural network extended to a concatenated 3D network that
learns to extract features from the spatio-temporal domain of
raw video data. The resulting network model is used for content-
based recognition of videos. Relying on a 2D convolutional neural
network allows us to exploit a pretrained network as a descriptor
that yielded the best results on the largest and challenging
ILSVRC-2014 dataset. Experimental results on commonly used
benchmarking video datasets demonstrate that our results are
state-of-the-art in terms of accuracy and computational time
without requiring any preprocessing (e.g., optic flow) or a priori
knowledge on data capture (e.g., camera motion estimation),
which makes it more general and flexible than other approaches.
Our implementation is made available.
Index Terms—Computer vision for automation, recognition,
visual learning.
I. INTRODUCTION
Building personal robots for tasks involving assistance
and interaction with humans carries several challenges.
One key challenge is to perceive and interpret dynamic human
environments. This is necessary for the active engagement of
the robot. Several attempts have been made to address the
different perception aspects of such dynamic environments
where the robot is meant to assist, such as tracking a hand-held
object for grasping [1], capturing human motion [2], activity
recognition [3] and sensing the human behaviors [4].
One important objective is the detection and recognition of
daily human activities. Actions such as brushing hair, eating,
drinking, chewing, sitting, walking, standing, etc., implicitly
encompass the structure of a particular human environment.
Successful recognition of these actions simplifies several tasks
that are aimed for such robotic assistants. For example, assist-
ing the elderly in timely caregiving [5], [6], in the situation of
accidents [7] or in the daily life activities [8].
Manuscript received: August 31, 2015; Revised December 18, 2015; Accepted January 28, 2016. This paper was recommended for publication by Editor Jana Kosecka upon evaluation of the reviewers’ comments. This research is partially funded by the CSIC project TextilRob (201550E028), and the project RobInstruct (TIN2014-58178-R).
F. Husain and C. Torras are with the Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Llorens i Artigas 4-6, 08028, Barcelona, Spain (e-mail: {shusain, torras}@iri.upc.edu).
B. Dellen is with the RheinAhrCampus der Hochschule Koblenz, Joseph-Rovan-Allee 2, 53424 Remagen, Germany (e-mail: dellen@hs-koblenz.de).
Digital Object Identifier (DOI): see top of this page.
Recognizing human activities for robots is conventionally
tackled using a pipeline approach, by first (i) modeling the dy-
namics of changing environments using a graphical model [9],
[10], [11] or identifying descriptive features [12], [13], [14],
[15], and then (ii) performing classification [16], [17]. The first
part requires extraction of motion information through some
mechanism. Possible approaches include the computation of
optic flow or motion modeling. However, recent benchmarks
have revealed that there is no universally accepted model that
could outperform others for all datasets [18]. The reason is that
the statistics of datasets can be considerably different, and a
particular model might perform better for one dataset than for
another. Many spatio-temporal descriptors are extensions from
single image descriptors such as SIFT3D [17], HOG3D [12]
and SURF3D [19]. However, such extensions also inherit the
limitations in performance generalization as shown in [20],
making clear the advantage of learned features over hand-
crafted ones.
Deep Convolutional Neural Networks (CNNs) [21] have
emerged as a state-of-the-art solution for a wide range
of computer vision problems, such as image segmenta-
tion/labeling [22], [23], object detection/localization [24] and
pose recovery [25], [26]. The main advantage over the con-
ventional pipeline approaches is that CNNs can be trained
end-to-end (from raw pixels to labels) in a fully supervised
way. One drawback of fully supervised deep learning is that
it requires a huge number of labeled training examples [27].
Recently, it has been shown that a CNN model trained from
a large dataset can be transferred to other visual recognition
tasks with limited training data, thereby leading to higher accuracy and a shorter training period [24], [28]. Since single-
image input-based models that have been trained over a million
labeled images are now readily available [29], we see attempts
to exploit these networks in the video domain [30], [31].
However, we observe limited success when learning directly in the temporal domain.
We also notice that weakly annotated video data is becoming prevalent as time goes by. For example, 300 hours of video are uploaded to YouTube every minute (https://www.youtube.com/yt/press/statistics.html). Such abundance of video data opens up the opportunity to exploit the infinite space of possible actions in the context of human action recognition [31]. Visual recognition methods have to interpret video data displaying a large degree of variability and complexity in order to arrive at a semantic description, i.e., the action class of the recorded scene.
We propose to recognize human actions using the transfer learning technique. A pretrained single-image recognition
model is adapted for videos by temporally concatenating the
output of its deepest spatial convolution layer. The input to
the model are the individual frames of the video. Afterwards,
the concatenated output is used as an input to a network
comprising 3D convolutions that we train.
The feature representation becomes more abstract as we go
deeper in a network thereby obscuring the locally occurring
temporal changes in a video. This poses a limitation to the tem-
poral features that the network learns from the concatenated
output. We overcome this limitation by combining the output
of our learned network with another pretrained model [32]
which employs 3D convolutions from the beginning. The
complementary nature of the two features becomes evident
from the improved recognition accuracy in our experiments.
The combined output contains fewer trainable parameters
thereby allowing us to use a more efficient optimization
method (L-BFGS) [33]. Our model does not require any
pre-computation of features such as optic flow or any other
domain-specific processing, thereby making it generic and
computationally efficient.
Our main contributions are:
  • the introduction of a concatenation scheme in the temporal domain to extend the usage of pretrained models learned from a single image to the video domain,
  • combining our learned network with another action recognition model, which yields improved results as compared to the individual networks, and
  • evaluation and comparison with commonly used benchmarking video datasets.
II. RELATED WORK
Several action recognition methods have been proposed in
the past. We roughly group them into two categories. First is
the conventional pipeline approach (descriptor followed by a
classifier) [34], [35], [17], [12], [19], [36] and second is the
convolutional model [20], [37], [38], [39], [31] which is the
basis of our approach.
In [34], improved dense trajectories are produced by re-
ducing the camera motion effect, which is estimated using
the SURF descriptor [40]. However, for recognizing human
actions, inconsistent matches from SURF are removed by ex-
ploiting domain knowledge, i.e., by adding a human detector.
A higher-level representation of activities, named “action bank”, combined with a linear SVM classifier, is proposed in [35].
Another way of representing actions is through spatio-
temporal segmentation of dynamic scenes. The segmented
surfaces and how they change over time gives cues about
the kind of manipulation actions which are being carried out.
The manipulation actions can be encoded in the form of “Semantic Event Chains” [36]. These event chains represent
the spatial relations between objects in the scene. Any change
in the spatial relation serves as a decisive key point through
which a manipulation could be defined. Similarly, temporal
segmentation of a video into multiple events is proposed
in [41].
An unsupervised learning method based on convolution and
stacking has also been proposed [20]. The convolved output of arbitrarily sized videos is reduced to a fixed size by dimensionality reduction using principal component analysis. The time-efficient
dimensionality reduction for long video clips has a relatively
larger memory requirement (up to 32 GB). In [37], a spatio-
temporal sparse auto-encoder is trained in an unsupervised
way for classifying video sequences. The convolutional gated
Restricted Boltzmann Machine architecture [42] has been
extended to 3D to learn relations between pairs of adjacent
images in videos and is used to extract features for activity
recognition [38].
A 3D CNN model has also been previously proposed [39].
In this model, features are learned simultaneously in the spatial
and temporal dimensions by performing 3D convolutions. The
model is applied to real-world environments to recognize
human actions. However, other than the raw images, a set of
hardwired kernels is created to generate the gradients and optic
flow which should be learned by the proposed convolutional
network. In addition, a human detector is introduced and
foreground extraction is performed. On the contrary, we feed
the raw image data directly to our network and do not compute
any handcrafted feature.
In [30], a two-stream network is proposed, where each
frame of the video is used as an individual image during
training. One stream is trained on raw images and the other is
trained with optical flow fields computed from the consecutive
video frames. Recognition is attained using a score aggregation
strategy across all the video frames of both streams.
Using pretrained models is also proposed in [31], [43],
[32]. Our model could be categorized as the late fusion
model from [31], in which multiple networks are fused in
the final fully connected layers. However, we use a different
network architecture which is pretrained on a single-image
database instead of a video database, thereby reducing the
computational cost. We get significantly better results without
requiring fine tuning of the pretrained models. Along the same lines, different aggregation strategies for per-frame image-based features are investigated in [43].
Our approach is closely related to [32]. A 3D convnet is
defined with convolutional kernels up to the first 8 layers.
However, we use 3D convolutional kernels after extracting the
output from a very deep pretrained model and thereby learn
features in the temporal domain at a higher level of abstraction.
III. PROBLEM FORMULATION
Given a set of videos, obtain a label for each video charac-
terizing its content. The video can be of arbitrary spatial and
temporal dimensions.
IV. APPROACH
We use a single-image convolution model for individual
frames of video data and perform volumetric convolution
at a higher level of abstraction by temporally concatenating
the output. The method is illustrated in Fig. 1. Here, K is
the number of action categories. In this way we are able to
initialize our network with the parameters learned from the
ImageNet dataset [44]. Additionally, we freeze the learned
network parameters up to the second-last fully-connected
layer, combine the output with another pretrained network and
build a new softmax model. We refer to the CNN architecture
(19-layer network in [45]) as VGG-Net.
A. Feature map concatenation
We train a network which takes as input 3D feature maps.
These feature maps are the outputs from layer-16 of the 19-
layer network defined in [45] and trained on the ImageNet
dataset. Layer-16 is the last spatial convolution layer in [45].
This gives us a high-level feature descriptor in the spatial
domain. Afterwards we add one 3D convolutional layer fol-
lowed by three fully-connected layers. The K-way softmax
function is applied to the output of the last fully-connected
layer, where K is again the number of action categories. In
total, our network contains 20 layers and we train only the
last 4 layers. We use a dropout regularization ratio of 0.5 for
the fully-connected layers.
We take N = 30 uniformly spaced frames from each video
as input to the network. Each image is assigned the same
label as its corresponding video. The network is trained using stochastic gradient descent with the same momentum as in [44]. The learning rate is adjusted to get the maximum
accuracy in a minimum number of iterations on a held-out
validation set from the training set.
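An illustrative PyTorch sketch of this architecture is given below; it is not our Torch7 implementation. The frozen VGG-19 descriptor, the N = 30 input frames, the single volumetric convolution, the fully-connected head and the 0.5 dropout follow the text, whereas the 3D kernel shape, the pooling, and the hidden sizes are assumptions.

import torch
import torch.nn as nn
from torchvision import models

class VGG3D(nn.Module):
    def __init__(self, num_classes=101, n_frames=30):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        # Frozen spatial descriptor: VGG-19 conv layers up to the last spatial
        # convolution (512 maps of 14 x 14 for a 224 x 224 input); final pool dropped.
        self.descriptor = nn.Sequential(*list(vgg.features.children())[:-1])
        for p in self.descriptor.parameters():
            p.requires_grad = False
        # Trainable head: one volumetric convolution, max-pooling, fully-connected layers.
        self.conv3d = nn.Conv3d(512, 128, kernel_size=3, padding=1)
        self.pool = nn.MaxPool3d(2)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(2048), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(2048, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, num_classes),   # K-way scores; softmax is applied in the loss
        )

    def forward(self, clip):                # clip: (batch, n_frames, 3, 224, 224)
        b, t = clip.shape[:2]
        feats = self.descriptor(clip.flatten(0, 1))                    # (b*t, 512, 14, 14)
        feats = feats.view(b, t, 512, 14, 14).permute(0, 2, 1, 3, 4)   # (b, 512, t, 14, 14)
        return self.classifier(self.pool(torch.relu(self.conv3d(feats))))

Since the descriptor parameters are frozen, training with stochastic gradient descent only updates the 3D convolution and the fully-connected layers.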
B. Combining multiple networks
A deep network learns different features at each level of the
layer hierarchy. The activations in the initial layers tend to be
more sensitive to edge-like patterns and corners within their
receptive field, whereas activations at deeper levels have larger
receptive fields and capture more complex invariances [46].
Our network learns changes that occur in the temporal
domain at a more abstract level because we temporally con-
catenate the output of the convolutional layer-16 from the
pretrained network of [45]. Hence, our model lacks learning
in the temporal domain from locally occurring changes. In
order to palliate this deficiency, we concatenate the fully con-
nected layer-9 feature vectors with a length of 4096, extracted
from another deep network that was trained in 3D from the
beginning [32]. The model contains 8 3D-convolution, 5 max-
pooling, and 2 fully-connected layers. Deeper 3D convolution
layers are not possible, due to GPU memory restrictions. The
model is trained on the Sports-1M dataset [31] which contains
about 1 million videos of different sports action categories.
Figure 2 shows the combination scheme of the two feature
maps. We concatenate the output of the fully-connected layers
for the same video, hence having the same action category. We
also augment the data by cropping M = 10 patches from each
frame of the video for the VGG-3D network, hence the output
feature dimension is 4096 + (1024 × M). After concatenation,
we perform max-pooling to reduce the feature dimension and
afterwards build a new softmax model. Since we are learning
the parameters for the softmax layer only, we can use a more
efficient optimization approach instead of stochastic gradient
descent. We use an off-the-shelf implementation of L-BFGS (minFunc, http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html), which has been shown to yield better results when the number of trainable parameters is small [33].
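A rough sketch of this fusion stage under assumed feature shapes follows; torch.optim.LBFGS stands in here for the minFunc implementation, and the pooling width of 4 is illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse(vgg3d_fc2, c3d_fc6):
    # Concatenate the two per-video descriptors, e.g. (N, 1024*M) and (N, 4096),
    # then max-pool along the feature dimension to reduce its length.
    fused = torch.cat([vgg3d_fc2, c3d_fc6], dim=1)
    return F.max_pool1d(fused.unsqueeze(1), kernel_size=4).squeeze(1)

def train_softmax(features, labels, num_classes, max_iter=100):
    model = nn.Linear(features.shape[1], num_classes)    # softmax layer (with bias)
    criterion = nn.CrossEntropyLoss()                    # log-softmax + negative log-likelihood
    optimizer = torch.optim.LBFGS(model.parameters(), max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        return loss

    optimizer.step(closure)                              # full-batch L-BFGS fit
    return model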
V. EXPERIMENTS
We evaluate our approach on two publicly available bench-
marking datasets, UCF-101 [47] and HMDB [48]. These
datasets are challenging because many video samples include
camera motion as well as a dynamic background. We use the
same evaluation protocol as proposed by the respective authors
and provide an in-depth analysis of our approach using the
UCF-101 dataset as a test case. Additional qualitative results for both datasets are available at http://www.iri.upc.edu/people/shusain/actionrecognition.html.
The network is able to take only fixed-size input frames,
hence we resize all the videos so that the maximum dimension
is 256 pixels and crop 10 patches of size 224 × 224 pixels ac-
cording to the data augmentation scheme as proposed in [44].
Furthermore, we separate 10% of the samples from the training data and use them as validation data. Such data
are needed to determine the number of iterations needed for
stochastic gradient descent.
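The preprocessing can be sketched as below, with assumptions: frames are resized to 256 x 256 before cropping (one reading of the resizing step), and the ten-crop of [44] (four corners, centre, and their horizontal flips) is taken with torchvision.

import torch
from torchvision.transforms import functional as TF

def sample_frames(video, n_frames=30):
    # video: (T, 3, H, W) tensor; take N uniformly spaced frames.
    idx = torch.linspace(0, video.shape[0] - 1, n_frames).long()
    return video[idx]

def ten_crop_frame(frame, resize=256, crop=224):
    # Resize, then return the ten 224 x 224 crops used for data augmentation.
    frame = TF.resize(frame, [resize, resize])
    return torch.stack(TF.ten_crop(frame, crop))   # (10, 3, 224, 224)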
A. UCF-101 dataset
The UCF-101 [47] dataset contains 13,320 labeled video
samples with 101 action categories. We use the 3-way train/test
split as provided by the authors. Table I shows a comparison
with other approaches. Compared with the baseline [31] we
observe a considerable improvement. Not surprisingly, we see
improved results as also compared to [32]. This shows the
complementary nature of the high-level (layer-19) and low-
level (layer-6) features. It should be noted that we use the
output from the layer-9 activation (C3D 1 net) and concate-
nate it with our trained model as described in Fig. 2, i.e.,
concatenating two networks, as opposed to [32], where the
output from 3 networks that have been trained differently is
combined. Our results are closer to [30], where optical flow
needs to be computed. However, the calculation of optical
flow leads to a significant computational overhead. As shown
in [32], Brox optical flow used in [30] takes 0.85-0.95s per
image pair which is 274x slower than C3D. Additionally,
storing the raw flow fields for this dataset requires a disk
space of 1.5 TB which needs data compression [30]. Figure 3
shows the confusion matrix accumulated for all the three
splits. Comparing the confusion matrix with that resulting
from the approach in [30] (Fig. 5 in [30]), it can be seen that
the actions “CricketBowling” and “CricketShot” have similar
levels of confusion, whereas our approach shows better results
for the action “YoYo”. Figure 6 shows the top-5 predictions for
selected test sequences from the UCF-101 dataset [47] with
101 action categories.
B. Evaluating different scenarios
We measure the performance of our spatial and spatio-
temporal learning framework in different scenarios using the
split-1 of UCF-101 dataset. Table II presents the evaluation
under different settings along with the comparison to other
approaches.

Fig. 1. Illustration of the network. We use the output from layer 16 of the VGG-Net (Table 1 in [45]) as a descriptor. The output is concatenated to form 512 3D feature maps. The 3D feature maps are used as input for the network consisting of a volumetric convolutional layer followed by two fully-connected layers.
Fig. 2. Illustration of how the different network outputs are combined, where VGG-3D-fc2 refers to the fc2 layer in Fig. 1.
Fig. 3. Confusion matrix for the UCF-101 dataset accumulated for all three splits.

TABLE I
AVERAGE ACCURACY ON THE UCF-101 DATASET (3-FOLD).
Algorithm Accuracy
CNN with transfer learning [31] 65.4%
LRCN (RGB) [49] 71.1%
Spatial stream ConvNet [30] 72.6%
LSTM composite model [50] 75.8%
Our approach (VGG-3D) 79.1%
C3D (1 net) [32] 82.3%
Temporal stream ConvNet [30] 83.7%
C3D (3 nets) [32] 85.2%
Combined ordered and improved trajectories [51] 85.4%
Stacking classifiers and CRF smoothing [52] 85.7%
Improved dense trajectories [34] 85.9%
Improved dense trajectories with human detection [53] 86.0%
Our approach (VGG-3D + C3D-fc6-1 net) 86.7%
Spatial and temporal stream fusion [30] 88.0%
TABLE II
CONVNET ACCURACY UNDER DIFFERENT SETTINGS FOR UCF-101
DATASET.
Scenario Accuracy
Fine tune top 3 layers (Sports 1M - pretrained) [31] 65.4% (3 fold)
Fine tune all layers (Sports 1M - pretrained) [31] 62.2% (3 fold)
Spatial AlexNet-stream (pretrained and last layer) [30] 72.7% (1 fold)
Spatial AlexNet-stream (pretrained and fine tuned) [30] 72.8% (1 fold)
Spatial VGG-stream (pretrained and fine tuned) 71.4% (1 fold)
VGG-3D (pretrained and fine tuned) 75.5% (1 fold)
Spatial VGG-stream (pretrained and adaptation layers) 76.3% (1 fold)
VGG-3D (pretrained and adaptation layers) 80.0% (1 fold)
VGG-3D (pretrained and fine tuned) + C3D-fc6-1 net 83.5% (1 fold)
VGG-3D (pretrained and adaptation layers) + C3D-fc7-1 net 84.8% (1 fold)
VGG-3D (pretrained and adaptation layers) + C3D-fc6-1 net 86.7% (1 fold)
In our spatial VGG-stream we obtained the label for a video
after averaging the scores for all the frames belonging to that
video. All the layers were pretrained on the ImageNet dataset
and fine tuned on the UCF-101 dataset, except the last layer, which was initialized randomly because of the different number of classes. We observed results similar to the spatial AlexNet-
stream [30]. We found better results for spatial VGG-stream
and VGG-3D when training only the adaptation, i.e., the newly
added layers. Similar behavior was observed in [31], i.e., a
drop in the accuracy when fine tuning all the layers. This
is because training such a huge network with a small dataset
results in overfitting. We observed the best result when training
the adaptation layers only combined with the fc6 layer from
C3D.
C. Learning from temporal information
Due to the concatenation of the feature maps in the temporal
domain, the 3D kernels should also be able to exploit the
temporal information in the video. Hence, if we randomly
shuffle the video frames while training, we should see a drop
in the accuracy due to temporal inconsistency. Figure 4 shows
the drop in the accuracy averaged for two training sessions
of the method described in Sec. IV-A for split-1 of UCF-101
dataset.
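The ablation amounts to permuting the temporal order of each training clip before the feature maps are concatenated; a hypothetical helper is sketched below.

import torch

def shuffle_frames(clip):
    # clip: (n_frames, 3, H, W); randomly permute the temporal order.
    return clip[torch.randperm(clip.shape[0])]

If the 3D kernels exploit temporal structure, training on such shuffled clips should lower accuracy, which is what Fig. 4 reports.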
D. HMDB dataset
The HMDB dataset [48] contains 6,849 labeled video samples with 51 action categories. We use the 3-way train/test split as provided by the authors.
Fig. 4. Comparing accuracy for shuffling video sample frames.
TABLE III
AVERAGE ACCURACY ON THE HMDB DATASET (3-FOLD).
Algorithm Accuracy
Spatio-temporal HMAX network [54] 22.8%
Spatial stream ConvNet [30] 40.5%
Trajectory-Based Modeling [55] 40.7%
Our approach (VGG-3D) 46.9%
Decomposing visual motion [56] 52.1%
Our approach (VGG-3D + C3D-fc6-1 net) 53.9%
Temporal stream ConvNet [30] 54.6%
Improved dense trajectories [34] 57.2%
Spatial and temporal stream fusion [30] 59.4%
Table III shows a comparison with
other approaches. The methods from [30] and [34] perform better than ours; however, both require computation of dense per-frame optical flow for each video. In addition, the method
in [34] also requires camera motion estimation. Figure 5 shows
the confusion matrix accumulated for all three splits. It can be
seen that similar actions such as “throw” and “swing baseball”
are the most confused. Figure 7 shows the top-5 predictions
for selected test sequences.
E. Qualitative analysis
Since we do not preprocess the data using techniques
such as background subtraction or tracking a bounding box,
our feature-learning approach is agnostic to such domain-
specific information. For this reason, wrong labels can be seen in Figs. 6 and 7 when different activities are performed in visually similar environments. For example, Fig. 6(c6) vs. Fig. 6(b3) and Fig. 6(b6) vs. Fig. 6(c2) share similar environments, and we see a high confidence for “HairCut” in the “ShavingBeard” action, while “PlayingFlute” is confused with “PlayingViolin”. Similar observations can be made in
Fig. 7(b3) vs. Fig. 7(c6). However, sometimes background
plays an important role in correctly recognizing certain ac-
tions, for instance, “SkyDiving” (Fig. 6(e3)) and “Surfing”
(Fig. 6(e4)).
Beyond a similar background, the actions themselves may also be visually confusing, which can affect feature learning. For example, in Fig. 7(a2) vs. Fig. 7(c1), both activities “cartwheel” and “handstand” entail performing a similar motion.

Citations
Proceedings ArticleDOI
01 Aug 2016
TL;DR: This work proposes a vision-based solution to recognize the driver's behavior based on convolutional neural networks, namely R*CNN, which is able to provide abundant semantic information with sufficient discriminative capability.
Abstract: Traffic safety is a severe problem around the world. Many road accidents are normally related with the driver's unsafe driving behavior, e.g. eating while driving. In this work, we propose a vision-based solution to recognize the driver's behavior based on convolutional neural networks. Specifically, given an image, skin-like regions are extracted by Gaussian Mixture Model, which are passed to a deep convolutional neural networks model, namely R∗CNN, to generate action labels. The skin-like regions are able to provide abundant semantic information with sufficient discriminative capability. Also, R∗CNN is able to select the most informative regions from candidates to facilitate the final action recognition. We tested the proposed methods on Southeast University Driving-posture Dataset and achieve mean Average Precision(mAP) of 97.76% on the dataset which prove the proposed method is effective in drivers's action recognition.

59 citations


Cites background from "Action Recognition Based on Efficie..."

  • ...Most of the previously published works on action recognition were on video-based approaches [1] [2] [3]....

Posted Content
TL;DR: A comprehensive overview of deep learning and its usage in computer vision is given, that includes a description of the most frequently used neural models and their main application areas, and a review of the principal work using deep learning in robot vision.
Abstract: Deep learning has allowed a paradigm shift in pattern recognition, from using hand-crafted features together with statistical classifiers to using general-purpose learning procedures for learning data-driven representations, features, and classifiers together. The application of this new paradigm has been particularly successful in computer vision, in which the development of deep learning methods for vision applications has become a hot research topic. Given that deep learning has already attracted the attention of the robot vision community, the main purpose of this survey is to address the use of deep learning in robot vision. To achieve this, a comprehensive overview of deep learning and its usage in computer vision is given, that includes a description of the most frequently used neural models and their main application areas. Then, the standard methodology and tools used for designing deep-learning based vision systems are presented. Afterwards, a review of the principal work using deep learning in robot vision is presented, as well as current and future trends related to the use of deep learning in robotics. This survey is intended to be a guide for the developers of robot vision systems.

46 citations


Cites background or methods from "Action Recognition Based on Efficie..."

  • ...Reports of research dealing with recognition of human actions include [229], [230], [231], [232], [233], [234], [105], [116], and [110], in which driver activity anticipation is performed....

  • ..., 2016 [105] Temporal concatenation of the output of pre-trained VGG-16 into a 3D convolutional layer....

  • ...Scene Representation and Classification (Camera re-localization) Husain et al., 2016 [124] CNN using layers from OverFeat Network with multiple pooling sizes....

  • ...Spatiotemporal Vision (Object Understanding) Husain et al., 2016 [105] Temporal concatenation of the output of pre-trained VGG-16 into a 3D convolutional layer....

Journal ArticleDOI
TL;DR: A review of the recent developments in deep learning and video scene analysis problems is presented, together with a detailed overview of the particular challenges in real-time video scene analysis related to activity recognition, scene interpretation, and video description/captioning.
Abstract: Video scene analysis is a recent research topic due to its vital importance in many applications such as real-time vehicle activity tracking, pedestrian detection, surveillance, and robotics. Despite its popularity, the video scene analysis is still an open challenging task and require more accurate algorithms. However, the advances in deep learning algorithms for video scene analysis have been emerged in last few years for solving the problem of real-time processing. In this paper, a review of the recent developments in deep learning and video scene analysis problems is presented. In addition, this paper also briefly describes the most recent used datasets along with their limitations. Moreover, this review provides a detailed overview of the particular challenges existed in real-time video scene analysis that has been contributed towards activity recognition, scene interpretation, and video description/captioning. Finally, the paper summarizes the future trends and challenges in video scene analysis tasks and our insights are provided to inspire further research efforts.

45 citations


Cites methods from "Action Recognition Based on Efficie..."

  • ...[35] presented a model for content/activity recognition that extended a 2D CNN to a 3D CNN....

Journal ArticleDOI
TL;DR: A data-driven convolutional neural networks (CNNs)-based engagement recognition method that uses only facial images from input videos and indicates that the engagement level of children can be gauged automatically via deep learning even when the available database is deficient.
Abstract: Automatic engagement recognition is a technique that is used to measure the engagement level of people in a specific task. Although previous research has utilized expensive and intrusive devices such as physiological sensors and pressure-sensing chairs, methods using RGB video cameras have become the most common because of the cost efficiency and noninvasiveness of video cameras. Automatic engagement recognition methods using video cameras are usually based on hand-crafted features and a statistical temporal dynamics modeling algorithm. This paper proposes a data-driven convolutional neural networks (CNNs)-based engagement recognition method that uses only facial images from input videos. As the amount of data in a dataset of children's engagement is insufficient for deep learning, pre-trained CNNs are utilized for low-level feature extraction from each video frame. In particular, a new layer combination for temporal dynamics modeling is employed to extract high-level features from low-level features. Experimental results on a database created using images of children from kindergarten demonstrate that the performance of the proposed method is superior to that of previous methods. The results indicate that the engagement level of children can be gauged automatically via deep learning even when the available database is deficient.

37 citations

Proceedings ArticleDOI
19 Oct 2017
TL;DR: A 3D Convolutional Neural Network with body representations based on Euclidean Distance Matrices is combined with a novel architecture that simultaneously, and in an end-to-end manner, learns an optimal transformation of the joints, while optimizing the rest of parameters of the convolutional network.
Abstract: In this paper we are interested in recognizing human actions from sequences of 3D skeleton data. For this purpose we combine a 3D Convolutional Neural Network with body representations based on Euclidean Distance Matrices (EDMs), which have been recently shown to be very effective to capture the geometric structure of the human pose. One inherent limitation of the EDMs, however, is that they are defined up to a permutation of the skeleton joints, i.e., randomly shuffling the ordering of the joints yields many different representations. In order to address this issue we introduce a novel architecture that simultaneously, and in an end-to-end manner, learns an optimal transformation of the joints, while optimizing the rest of parameters of the convolutional network. The proposed approach achieves state-of-the-art results on 3 benchmarks, including the recent NTU RGB-D dataset, for which we improve on previous LSTM-based methods by more than 10 percentage points, also surpassing other CNN-based methods while using almost 1000 times fewer parameters.

36 citations


Cites methods from "Action Recognition Based on Efficie..."

  • ...For instance, in [10] 2D CNNs are used to extract features, and 3D CNNs over the computed features fuse the spatial and temporal information....

References
Proceedings Article
03 Dec 2012
TL;DR: A deep convolutional neural network with five convolutional layers, some followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax achieves state-of-the-art performance on ImageNet classification.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations

Journal ArticleDOI
01 Jan 1998
TL;DR: Gradient-based learning with convolutional neural networks is reviewed for handwritten character recognition, and a graph transformer network (GTN) is proposed that allows multi-module recognition systems to be trained globally with gradient-based methods.
Abstract: Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.

42,067 citations

Proceedings ArticleDOI
23 Jun 2014
TL;DR: R-CNN combines CNNs with bottom-up region proposals to localize and segment objects; when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
Abstract: Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.

21,729 citations

Posted Content
TL;DR: This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%.
Abstract: Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012---achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at this http URL.

13,081 citations


"Action Recognition Based on Efficie..." refers background in this paper

  • ...Deep Convolutional Neural Networks (CNNs) [21] have emerged as a state-of-the-art solution for a wide range of computer vision problems, such as image segmentation/labeling [22], [23], object detection/localization [24] and pose recovery [25], [26]....

Frequently Asked Questions (2)
Q1. What are the contributions in "Action recognition based on efficient deep feature learning in the spatio-temporal domain"?

The authors present a simple, yet robust, 2D convolutional neural network extended to a concatenated 3D network that learns to extract features from the spatio-temporal domain of raw video data. Experimental results on commonly used benchmarking video datasets demonstrate that their results are state-of-the-art in terms of accuracy and computational time without requiring any preprocessing (e.g., optic flow) or a priori knowledge on data capture (e.g., camera motion estimation), which makes it more general and flexible than other approaches.

In the future, the authors plan to explore possible modifications in the network design to further exploit learning in the temporal domain. One possibility would be to gradually increase the number of temporal connections along the sequence of layers. The authors also plan to investigate the effect on performance of gradually clipping the top layers of the network, and to evaluate on the recently introduced Sports-1M dataset [31], which contains over 1 million labeled sample videos.