
Towards Understanding Action Recognition

TL;DR: It is found that high-level pose features greatly outperform low/mid-level features; in particular, pose over time is critical, but current pose estimation algorithms are not yet reliable enough to provide this information.
Abstract: Although action recognition in videos is widely studied, current methods often fail on real-world datasets. Many recent approaches improve accuracy and robustness to cope with challenging video sequences, but it is often unclear what affects the results most. This paper attempts to provide insights based on a systematic performance evaluation using thoroughly-annotated data of human actions. We annotate human joints for the HMDB dataset (J-HMDB). This annotation can be used to derive ground truth optical flow and segmentation. We evaluate current methods using this dataset and systematically replace the output of various algorithms with ground truth. This enables us to discover what is important - for example, should we work on improving flow algorithms, estimating human bounding boxes, or enabling pose estimation? In summary, we find that high-level pose features greatly outperform low/mid-level features; in particular, pose over time is critical, but current pose estimation algorithms are not yet reliable enough to provide this information. We also find that the accuracy of a top-performing action recognition framework can be greatly increased by refining the underlying low/mid-level features; this suggests it is important to improve optical flow and human detection algorithms. Our analysis and J-HMDB dataset should facilitate a deeper understanding of action recognition algorithms.

Summary (4 min read)

1. Introduction

  • Current computer vision algorithms fall far below human performance on activity recognition tasks.
  • Many things might be limiting current methods: weak visual cues or a lack of high-level cues, for example.
  • Higher-level pose features require the knowledge of joints (h) but can be semantically interpreted.
  • While their main focus is to analyze the potential impact of different cues, the dataset is also valuable for evaluating human pose estimation and human detection in videos.
  • The authors' preliminary results show that pose features estimated from [33] perform much worse than the ground truth pose features, but they outperform low/mid-level features for action recognition on clips where the full body is visible.

3.1. Selection

  • The HMDB51 database [14] contains more than 5,100 clips of 51 different human actions collected from movies or the Internet.
  • Annotating this entire dataset is impractical so J-HMDB is a subset with fewer categories.
  • The authors excluded categories that contain mainly facial expressions like smiling, interactions with others such as shaking hands, and actions that can only be done in a specific way such as a cartwheel.
  • The authors further crop the remaining clips in time so that the first and last frames roughly correspond to the beginning and end of an action.
  • In summary, there are 31,838 annotated frames in total.

3.2. Annotation

  • For annotation, the authors use a 2D puppet model [36] in which the human body is represented as a set of 10 body parts connected by 13 joints (shoulder, elbow, wrist, hip, knee, ankle, neck) and two landmarks (face and belly).
  • The authors built a graphical user interface for controlling the viewpoint and scale, in which the joints can be selected and moved in the image plane.
  • The annotation involves adjusting the joint positions so that the contours of the puppet align with image information [36].
  • The puppet mask (i.e. the region contained within the puppet) is also used to initialize GrabCut [23] to obtain a segmentation mask (a minimal sketch of this step follows this list).
  • Details about the annotation interface and the distribution of joint locations, viewpoints, and scales of the annotations are provided on the website.
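The mask-initialized GrabCut step lends itself to a short sketch. The snippet below shows one plausible way to seed GrabCut with a puppet mask using OpenCV; the probable-foreground/background labelling and the iteration count are illustrative assumptions, not the authors' exact settings.

```python
import cv2
import numpy as np

def segment_from_puppet_mask(frame_bgr, puppet_mask, iters=5):
    """Refine a coarse puppet mask into a segmentation with GrabCut.

    frame_bgr:   H x W x 3 uint8 image.
    puppet_mask: H x W boolean array, True inside the puppet.
    """
    # GrabCut mask labels: treat the puppet region as probable foreground
    # and everything else as probable background so GrabCut can revise both.
    gc_mask = np.full(puppet_mask.shape, cv2.GC_PR_BGD, dtype=np.uint8)
    gc_mask[puppet_mask] = cv2.GC_PR_FGD

    bgd_model = np.zeros((1, 65), np.float64)  # internal GMM state
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(frame_bgr, gc_mask, None, bgd_model, fgd_model,
                iters, cv2.GC_INIT_WITH_MASK)

    # Pixels labelled (probable) foreground form the segmentation mask.
    return np.isin(gc_mask, (cv2.GC_FGD, cv2.GC_PR_FGD))
```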

3.3. Training and testing set generation

  • For each action category, clips are randomly grouped into two sets with the constraint that the clips from the same video belong to the same set.
  • The authors iterate the grouping until the ratio of the number of clips in the two sets and the ratio of the number of distinct video sources in the two sets are both close to 7:3 (a sketch of this grouping procedure follows this list).
  • Three splits are randomly generated and the performance reported here is the average of the three splits.
  • Note that the number of training/testing clips is similar across categories and the authors report the per-video accuracy, which does not differ much from the per-class accuracy.
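A rough sketch of how such a video-respecting 70/30 split could be generated is given below; the tolerance, iteration cap, and data layout are assumptions for illustration rather than the authors' exact procedure.

```python
import random
from collections import defaultdict

def generate_split(clips, target=0.7, tol=0.02, max_iters=10000, seed=0):
    """Randomly split clips into train/test per action category.

    clips: list of (clip_id, video_id, category) tuples.
    Constraint: clips from the same source video stay in the same set.
    The grouping is re-drawn until both the clip ratio and the
    video-source ratio are close to 70/30 (falls back to the last
    draw if the tolerance cannot be met).
    """
    rng = random.Random(seed)
    by_cat = defaultdict(lambda: defaultdict(list))   # category -> video -> clips
    for clip_id, video_id, category in clips:
        by_cat[category][video_id].append(clip_id)

    train, test = [], []
    for category, videos in by_cat.items():
        vids = list(videos)
        for _ in range(max_iters):
            rng.shuffle(vids)
            cut = max(1, round(target * len(vids)))
            tr_vids, te_vids = vids[:cut], vids[cut:]
            tr_clips = [c for v in tr_vids for c in videos[v]]
            te_clips = [c for v in te_vids for c in videos[v]]
            clip_ratio = len(tr_clips) / (len(tr_clips) + len(te_clips))
            vid_ratio = len(tr_vids) / len(vids)
            if abs(clip_ratio - target) < tol and abs(vid_ratio - target) < tol:
                break
        train += tr_clips
        test += te_clips
    return train, test
```

Because clips sharing a video_id always move together, no source video leaks between the training and testing sets.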

4. Study of low-level features

  • The authors focus their evaluation on the Dense Trajectories (DT) algorithm [30] since it is currently the best performing method on the HMDB51 database [14] and because it relies on video feature descriptors that are also used by other methods.
  • The authors first review DT in Sec. 4.1, and then they replace pieces of the algorithm with the ground truth data to provide low, mid, and high level information in Sec. 4.2, Sec. 5 and Sec. 6.2 respectively.

4.1. DT features

  • The DT algorithm [30] represents video data by dense trajectories along with motion and shape features around the trajectories.
  • Feature points are further pruned to keep the ones whose eigenvalues of the auto-correlation matrix are larger than some threshold.
  • Motion boundary histograms [6] are computed separately for the horizontal and vertical gradients of the optical flow (giving two descriptors), also known as MBH.
  • While this modification (computing optical flow once at full resolution and building a spatial pyramid of the flow, rather than computing flow per pyramid scale) decreases the performance on their dataset by less than 1%, it is necessary to fairly evaluate the impact of flow accuracy using the puppet flow, which is generated at the original video scale.
  • The multi-class classification is done by LIBSVM [4] using a one-vs-all approach.

4.2. DT given puppet flow

  • The authors cannot evaluate the gain of having perfect dense optical flow, and therefore perfect trajectories.
  • Instead, the authors use the puppet flow as the ground truth motion in the foreground, i.e. within the puppet mask (pmask).
  • The authors also try to compute (5) with features from the whole frame.
  • It is now clear that the flow-related descriptors, Traj, HOF and MBH have a large gain (6.2-16 pp) over the baseline.

5. Study of mid-level features

  • Estimating the location and size of the human in action might be an easier task than estimating accurate pixel-wise flow.
  • In the section below, the authors only use Farnebäck's flow (of).

5.1. DT given foreground mask

  • The authors consider two types of regions of interest: the dilated puppet mask Dmask and bbox described above.
  • The authors consider two ways of masking; one is in the feature space (F), i.e. compute flow/descriptors on the whole frame and then use only those from within the mask.
  • In 50% of the images, the overlap between the predicted box and the ground truth box exceeds 50%.
  • This suggests that the human detector in [2] is not accurate enough to help action recognition.
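The text here does not spell out the overlap measure, so the sketch below assumes the standard intersection-over-union criterion for matching a detected box to the annotated person box.

```python
def box_overlap(pred, gt):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2).

    One common definition of the 'overlap' used to judge whether a
    detection matches the annotated person box (an assumption here).
    """
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    pa = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    ga = max(0.0, gt[2] - gt[0]) * max(0.0, gt[3] - gt[1])
    union = pa + ga - inter
    return inter / union if union > 0 else 0.0

# e.g. fraction of frames whose detection overlaps the annotation by > 0.5:
# hit_rate = sum(box_overlap(p, g) > 0.5 for p, g in boxes) / len(boxes)
```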

5.2. DT given scale

  • The authors resize all the frames as well as the corresponding Dmask such that all persons are around 200 pixels in height, and repeat the analysis in (10) .
  • Finally, combining kernels of features relying on different low/mid-level features results in a 12.4 pp gain over the baseline (Tab. 2 (13)).
  • It is interesting to see that for many paired comparisons, such as (5) vs. (6), (1) vs. (7), and (10) vs. (11), the amount of performance change for an individual descriptor does not always result in a similar amount of overall performance change, indicating that the features are not very complementary, but have different error characteristics.

6. Study of high-level features

6.1. Pose features

  • For action recognition with pose features, the authors use various types of descriptors derived from joint annotations.
  • The joints are in the neutral puppet positions.
  • Note that unlike Traj in Sec. 4.1, the authors consider features along the x- and y-coordinates as separate descriptors, and this results in better performance than treating them as one descriptor.
  • With noise added to the ground-truth joint positions, the performance drop is less than 2 pp.
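The relational pose features described in Fig. 1 (d), i.e. distances and orientations of joint pairs and inner angles of joint triples, can be sketched as follows; which pairs/triples are kept and how the values are quantized into descriptors is an illustrative assumption here.

```python
import numpy as np
from itertools import combinations

def relational_pose_features(joints):
    """Joint-relation features of the kind described in Fig. 1 (d).

    joints: (J, 2) array of 2D joint positions for one frame.
    Returns distances and orientations for all joint pairs and inner
    angles for all joint triples.
    """
    dists, orients, angles = [], [], []
    for i, j in combinations(range(len(joints)), 2):
        u = joints[j] - joints[i]
        dists.append(np.linalg.norm(u))           # 1) distance of the pair
        orients.append(np.arctan2(u[1], u[0]))    # 2) orientation of the pair
    for i, j, k in combinations(range(len(joints)), 3):
        u, v = joints[i] - joints[j], joints[k] - joints[j]
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))  # 3) inner angle
    return np.array(dists), np.array(orients), np.array(angles)
```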

6.2. DT given joints

  • The authors use a smaller codebook size (N = 100) because here there are only 15 trajectories per frame.
  • The subset contains 316 clips distributed over 12 categories.
  • A closer look at the performance of individual descriptors reveals that the texture-based HOG benefits more given low/mid-level than high-level information, while the position-based Traj shows the opposite.
  • Dense Trajectories given estimated joints results in a 3.8 pp gain over the baseline, and NTraj+ computed from the 15 estimated joint positions results in an 8.1 pp gain over the baseline (Tab. 3 (5)).
  • This suggests that while the estimated joint positions are not accurate compared to the ground truth, the derived pose features already outperform low/mid level features for action recognition.

6.3. Summary

  • Table 4 summarizes the improvements to Dense Trajectories realized by providing low/mid-level and high-level features on the full dataset J-HMDB and the subset sub-J-HMDB.
  • Overall, the two sets show a 12-17 pp improvement over the baseline with ground truth low/mid features and a 19-29 pp improvement with high-level features.

7. Discussion

  • The authors have presented a complex, annotated, video dataset in order to analyze action recognition algorithms.
  • Starting with a state-of-the-art method [30] , the authors supply the algorithm with a range of low-to-high-level ground truth information.
  • It is also surprising that, with a good bounding box, which is probably easier to achieve than estimating accurate flow, one can obtain a large improvement over the baseline.
  • While this might not be surprising, their contribution here is threefold.
  • Third, for sub-J-HMDB, where the full body is visible, a recent pose estimation algorithm computes poses that are more reliable than low/mid level features for action recognition of complex actions in realistic videos.


To cite this version: Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, Michael J. Black. Towards understanding action recognition. ICCV - IEEE International Conference on Computer Vision, Dec 2013, Sydney, Australia. pp. 3192-3199. doi:10.1109/ICCV.2013.396. HAL Id: hal-00906902, https://hal.inria.fr/hal-00906902, submitted on 10 Dec 2013.

Towards understanding action recognition

Hueihan Jhuang (1), Juergen Gall (2), Silvia Zuffi (3), Cordelia Schmid (4), Michael J. Black (1)

(1) MPI for Intelligent Systems, Germany; (2) University of Bonn, Germany; (3) Brown University, USA; (4) LEAR, INRIA, France
Abstract

Although action recognition in videos is widely studied, current methods often fail on real-world datasets. Many recent approaches improve accuracy and robustness to cope with challenging video sequences, but it is often unclear what affects the results most. This paper attempts to provide insights based on a systematic performance evaluation using thoroughly-annotated data of human actions. We annotate human joints for the HMDB dataset (J-HMDB). This annotation can be used to derive ground truth optical flow and segmentation. We evaluate current methods using this dataset and systematically replace the output of various algorithms with ground truth. This enables us to discover what is important - for example, should we work on improving flow algorithms, estimating human bounding boxes, or enabling pose estimation? In summary, we find that high-level pose features greatly outperform low/mid level features; in particular, pose over time is critical. While current pose estimation algorithms are far from perfect, features extracted from estimated pose on a subset of J-HMDB, in which the full body is visible, outperform low/mid-level features. We also find that the accuracy of the action recognition framework can be greatly increased by refining the underlying low/mid level features; this suggests it is important to improve optical flow and human detection algorithms. Our analysis and J-HMDB dataset should facilitate a deeper understanding of action recognition algorithms.
1. Introduction

Current computer vision algorithms fall far below human performance on activity recognition tasks. While most computer vision algorithms perform very well on simple lab-recorded datasets [31], state-of-the-art approaches still struggle to recognize actions in more complex videos taken from public sources like movies [14, 17]. According to [30], the HMDB51 dataset [14] is the most challenging dataset for vision algorithms, with the best method achieving only 48% accuracy. Many things might be limiting current methods: weak visual cues or lack of high-level cues for example. Without a clear understanding of what makes a method perform well, it is difficult for the field to make progress.

Our goal is twofold. First, towards understanding algorithms for human action recognition, we systematically analyze a recognition algorithm to better understand the limitations and to identify components where an algorithmic improvement would most likely increase the overall accuracy. Second, towards understanding intermediate data that would support recognition, we present insights on how much low- to high-level reasoning about the human is needed to recognize actions.

Such an analysis requires ground truth for a challenging dataset. We focus on one of the most challenging datasets for action recognition (HMDB51 [14]) and on the approach that achieves the best performance on this dataset (Dense Trajectories [30]). From HMDB51, we extract 928 clips comprising 21 action categories and annotate each frame using a 2D articulated human puppet model [36] that provides scale, pose, segmentation, coarse viewpoint, and dense optical flow for the humans in action. An example annotation is shown in Fig. 1 (a-d). We refer to this dataset as J-HMDB for "joint-annotated HMDB".

J-HMDB is valuable in terms of linking low-to-mid-level features with high-level poses; see Fig. 1 (e-h) for an illustration. Holistic approaches like [30] rely on low-level cues that are sampled from the entire video (e). Dense optical flow within the mask of the person (f) provides more detailed low-level information. Also, by identifying the person in action and their size, the sampling of the features can be concentrated on the region of interest (g). Higher-level pose features require the knowledge of joints (h) but can be semantically interpreted. Relations between joints (h) provide richer information and enable more complex models.

Pose has been used in early work on action recognition [3, 32]. For a complex dataset such as ours, however, typically low- to mid-level features are used instead of pose because pose estimation is hard. Recently, human pose as a feature for action recognition has been revisited [10, 22, 26, 29, 34].

Figure 1. Overview of our annotation and evaluation. (a-d) A video frame annotated by a puppet model [36]: (a) image frame, (b) puppet flow [35], (c) puppet mask, (d) joint positions and relations. Three types of joint relations are used: 1) distance and 2) orientation of the vector connecting pairs of joints, i.e. the magnitude and the direction of the vector u; 3) inner angle spanned by two vectors connecting triples of joints, i.e. the angle between the two vectors u and v. (e-h) From left to right, we gradually provide the baseline algorithm (e) with different levels of ground truth from (b) to (d): (f) given puppet flow, (g) given puppet mask, (h) given joint positions. The trajectories are displayed in green.
In [34], it is shown that current approaches for human pose estimation from multiple camera views are accurate enough for reliable action recognition. For monocular videos, several works show that current pose estimation algorithms are reliable enough to recognize actions on relatively simple datasets [10, 26, 29]; however, [22] shows that they are not good enough to classify fine-grained activities. Using J-HMDB, we show that ground truth pose information enables action recognition performance beyond current state-of-the-art methods.

While our main focus is to analyze the potential impact of different cues, the dataset is also valuable for evaluating human pose estimation and human detection in videos. Our preliminary results show that pose features estimated from [33] perform much worse than the ground truth pose features, but they outperform low/mid level features for action recognition on clips where the full body is visible. We also show that human bounding boxes estimated by [2] and optical flow estimated by [27] do not improve the performance of current action recognition algorithms.
2. Related Studies and Datasets

Previous work has analyzed data in detail to understand algorithm performance in the context of object detection and image classification. In [20], a human study of visual recognition tasks is performed to identify the role of algorithms, data, and features. In [11], issues like occlusion, object size, or aspect ratio are examined for two classes of object detectors. Our work shares with these studies the idea that analyzing and understanding data is important to advance the state-of-the-art.

Previous datasets used to benchmark pose estimation or action recognition algorithms are summarized in Tab. 1. Existing datasets that contain action labels and pose annotations are typically recorded in a laboratory or static environment with actors performing specific actions. These are often unrealistic, resulting in lower intra-class variation than in real-world videos. While marker-based motion capture systems provide accurate 3D ground-truth pose data [12, 15, 19, 25], they are impractical for recording realistic video data. Other datasets focus on narrow scenarios [22, 28]. More realistic datasets for pose estimation and action recognition have been collected from TV or movie footage. Commonly considered sources for action recognition are sport activities [18], YouTube videos [21], or movie scenes [14, 16]. In comparison to sport videos, actions annotated from movies are much more challenging as they present real-world background variation, exhibit more intra-class variation, and have more appearance variation due to viewpoint, scale, and occlusion. Since HMDB51 [14] is the most challenging dataset among the current movie datasets [30], we build on it to create J-HMDB.

J-HMDB is, however, more than a dataset of human actions; it could also serve as a benchmark for pose estimation and human detection.

                              videos   actions   wild   pose
Pose estimation
  Buffy stickman [10]           -         -       y      y
  ETHZ PASCAL [8]               -         -       y      y
  H3D [2]                       -         -       y      y
  Leeds Sports [13]             -         -       y      y
  VideoPose [24]                y         -       y      y
Action recognition
  UCF50 [21]                    y         y       y      -
  HMDB51 [14]                   y         y       y      -
  Hollywood2 [17]               y         y       y      -
  Olympics [18]                 y         y       y      -
Pose and action
  HumanEvaII [25]               y         y       -      y
  CMU-MMAC [15]                 y         y       -      y
  Human 3.6M [12]               y         y       -      y
  Berkeley MHAD [19]            y         y       -      y
  MPII Cooking [22]             y         y       -      y
  TUM kitchen [28]              y         y       -      y
J-HMDB                          y         y       y      y

Table 1. Related datasets (benchmark examples).
Most pose datasets contain images of a single non-occluded person in the center of the image, and the approximate scale of the person is known [8, 10, 13]. These image-based datasets constitute a very small subset of all the possible variations of human poses and sizes because the subjects are not performing actions, with the exception of the Leeds Sports Pose Dataset [13]. The VideoPose2 dataset [24] contains a number of annotated video clips taken from two TV series in order to evaluate pose estimation approaches on realistic data. The dataset is, however, limited to upper body pose estimation and contains very few clips. Our dataset presents a new challenge to the field of human pose estimation and tracking since it contains more variation in poses, human sizes, camera motions, motion blur, and partial- or full-body visibility.
3. The Dataset

3.1. Selection

The HMDB51 database [14] contains more than 5,100 clips of 51 different human actions collected from movies or the Internet. Annotating this entire dataset is impractical so J-HMDB is a subset with fewer categories. We excluded categories that contain mainly facial expressions like smiling, interactions with others such as shaking hands, and actions that can only be done in a specific way such as a cartwheel. The result contains 21 categories involving a single person in action: brush hair, catch, clap, climb stairs, golf, jump, kick ball, pick, pour, pull-up, push, run, shoot ball, shoot bow, shoot gun, sit, stand, swing baseball, throw, walk, wave. Since we focus on and annotate the person in action in each clip, we remove clips in which the actor is not obvious. For the remaining clips, we further crop them in time such that the first and last frame roughly correspond to the beginning and end of an action. This selection-and-cleaning process results in 36-55 clips per action class with each clip containing 15-40 frames. In summary, there are 31,838 annotated frames in total. J-HMDB is available at http://jhmdb.is.tue.mpg.de.
3.2. Annotation

For annotation, we use a 2D puppet model [36] in which the human body is represented as a set of 10 body parts connected by 13 joints (shoulder, elbow, wrist, hip, knee, ankle, neck) and two landmarks (face and belly). We construct puppets in 16 viewpoints across the 360 degree radial space in the transverse plane. We built a graphical user interface to control the viewpoint and scale and in which the joints can be selected and moved in the image plane. The annotation involves adjusting the joint position so that the contours of the puppet align with image information [36]. In contrast to simple joint or limb annotations, the puppet model guarantees realistic limb size proportions, in particular in the context of occlusions, and also provides an approximate 2D shape of the human body. The annotated shapes are then used to compute the 2D optical flow corresponding to the human motion, which we call "puppet flow" [35]. The puppet mask (i.e. the region contained within the puppet) is also used to initialize GrabCut [23] to obtain a segmentation mask. Fig. 1 (b-d) shows a sample annotation.

The annotation is done using Amazon Mechanical Turk. To aid annotators, we provide the posed puppet on the first frame of each video clip. For each subsequent frame the interface initializes the joint positions and the scale with those of the previous frame. We manually correct annotation errors during a post-annotation screening process.

In summary, the person performing the action in each frame is annotated with his/her 2D joint positions, scale, viewpoint, segmentation, puppet mask and puppet flow. Details about the annotation interface and the distribution of joint locations, viewpoints, and scales of the annotations are provided on the website.
3.3. Training and testing set generation

Training and testing splits are generated as in [14]. For each action category, clips are randomly grouped into two sets with the constraint that the clips from the same video belong to the same set. We iterate the grouping until the ratio of the number of clips in the two sets and the ratio of the number of distinct video sources in the two sets are both close to 7:3. The 70% set is used for training and the 30% set for testing. Three splits are randomly generated and the performance reported here is the average of the three splits. Note that the number of training/testing clips is similar across categories and we report the per-video accuracy, which does not differ much from the per-class accuracy.
4. Study of low-level features

We focus our evaluation on the Dense Trajectories (DT) algorithm [30] since it is currently the best performing method on the HMDB51 database [14] and because it relies on video feature descriptors that are also used by other methods. We first review DT in Sec. 4.1, and then we replace pieces of the algorithm with the ground truth data to provide low, mid, and high level information in Sec. 4.2, Sec. 5 and Sec. 6.2, respectively.

Figure 2. Comparison of various flow settings: (1) baseline, (2) of pmask, (3) pf pmask, (4) pf Dmask, (5) pf pmask with of outside pmask, (7) Classic+NL flow, (10) Dmask Im. The flow is numbered according to Tab. 2. See Sec. 4.2 and Sec. 5 for details.
4.1. DT features

The DT algorithm [30] represents video data by dense trajectories along with motion and shape features around the trajectories. The feature points are densely sampled on each frame using a grid with a spacing of 5 pixels and at each of the 8 spatial scales, which increase by a factor of $1/\sqrt{2}$. Feature points are further pruned to keep the ones whose eigenvalues of the auto-correlation matrix are larger than some threshold. For each frame, a dense optical flow field is computed w.r.t. the next frame using the OpenCV implementation of Gunnar Farnebäck's algorithm [9]. A 3 × 3 median filter is applied to the flow field and this denoised flow is used to compute the trajectories of selected points through the 15 frames of the clip.
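The tracking step can be pictured with a short sketch: Farnebäck flow between consecutive frames, a 3 × 3 median filter on the flow, and points advected frame to frame. The OpenCV parameters below are generic defaults rather than the paper's exact settings, and the point update uses a nearest-pixel flow lookup, which is a simplification.

```python
import cv2
import numpy as np

def track_points(frames_gray, points, track_len=15):
    """Propagate sampled points through a clip using dense optical flow,
    roughly in the spirit of Dense Trajectories (illustrative parameters).

    frames_gray: list of H x W uint8 frames.
    points:      (N, 2) float array of (x, y) positions in the first frame.
    """
    h, w = frames_gray[0].shape
    tracks = [points.astype(np.float32).copy()]
    for t in range(min(track_len, len(frames_gray) - 1)):
        flow = cv2.calcOpticalFlowFarneback(
            frames_gray[t], frames_gray[t + 1], None,
            0.5, 3, 15, 3, 5, 1.2, 0)  # pyr_scale, levels, winsize, iters, ...
        # Denoise each flow component with a 3x3 median filter, as in the paper.
        u = cv2.medianBlur(np.ascontiguousarray(flow[..., 0]), 3)
        v = cv2.medianBlur(np.ascontiguousarray(flow[..., 1]), 3)
        prev = tracks[-1]
        xi = np.clip(prev[:, 0].round().astype(int), 0, w - 1)
        yi = np.clip(prev[:, 1].round().astype(int), 0, h - 1)
        nxt = prev + np.stack([u[yi, xi], v[yi, xi]], axis=1)  # move by the flow
        tracks.append(nxt)
    return np.stack(tracks, axis=1)  # (N, T+1, 2) trajectories
```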
For each trajectory, L = 5 types of descriptors are computed, where each descriptor is normalized to have unit L2 norm.

Traj: Given a trajectory of length T = 15, the shape of the trajectory is described by a sequence of displacement vectors, corresponding to the translation along the x- and y-coordinate across the trajectory. It is further normalized by the sum of displacement vector magnitudes, i.e.

$\frac{(\Delta P_t, \ldots, \Delta P_{t+T-1})}{\sum_{j=t}^{t+T-1} \lVert \Delta P_j \rVert}$, where $\Delta P_t = (x_{t+1} - x_t,\ y_{t+1} - y_t)$.

HOG: Histograms of oriented gradients [5] of 8 bins are computed in a 32-pixels × 32-pixels × 15-frames spatio-temporal volume surrounding the trajectory. The volume is further subdivided into a spatio-temporal grid of 2 × 2 × 3 cells.

HOF: Histograms of optical flow [16] are computed similarly to HOG except that there are 9 bins, with the additional one corresponding to pixels with optical flow magnitude lower than a threshold.

MBH: Motion boundary histograms [6] are computed separately for the horizontal and vertical gradients of the optical flow (giving two descriptors).
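The Traj descriptor above is simple enough to write out directly; a minimal sketch, assuming a trajectory is stored as a (T+1) × 2 array of point positions:

```python
import numpy as np

def traj_descriptor(track):
    """Normalized trajectory-shape descriptor for one trajectory.

    track: (T+1, 2) array of (x, y) positions over T+1 frames (T = 15 in DT).
    Implements the formula above: concatenated displacement vectors divided
    by the sum of their magnitudes.
    """
    disp = np.diff(track, axis=0)                     # ΔP_t = P_{t+1} - P_t
    norm = np.sum(np.linalg.norm(disp, axis=1)) + 1e-8
    return (disp / norm).ravel()                      # length-2T vector
```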
For each descriptor type, a codebook of size N = 4,000 is formed by running k-means 8 times on a random selection of M = 100,000 descriptors and taking the codebook with the lowest error. The features are computed using the publicly available source code of Dense Trajectories [30] with one modification. While in the original implementation, optical flow is computed for each scale of the spatial pyramid, we compute the flow at the full resolution and build a spatial pyramid of the flow. While this decreases the performance on our dataset by less than 1%, it is necessary to fairly evaluate the impact of the flow accuracy using the puppet flow, which is generated at the original video scale.
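A sketch of the codebook construction, and of the bag-of-features encoding it implies, is given below. It uses scikit-learn's k-means purely for illustration (the paper does not specify an implementation), and hard assignment with L1 normalization of the histogram is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, n_words=4000, n_tries=8, sample=100_000, seed=0):
    """Build a visual codebook for one descriptor type, as described above:
    k-means is run several times on a random subset of descriptors and the
    codebook with the lowest quantization error is kept.

    descriptors: (M, D) array of descriptors pooled over the training clips.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(descriptors), size=min(sample, len(descriptors)),
                     replace=False)
    subset = descriptors[idx]
    best = None
    for t in range(n_tries):
        km = KMeans(n_clusters=n_words, n_init=1, random_state=seed + t).fit(subset)
        if best is None or km.inertia_ < best.inertia_:
            best = km                      # keep the run with lowest error
    return best

def bow_histogram(codebook, descriptors):
    """Quantize one clip's descriptors against the codebook (hard assignment)."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-8)
```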
For classification, a non-linear SVM with an RBF-$\chi^2$ kernel $k(x, y)$ is used, and the L types of descriptors are combined in a multi-channel setup as

$K(i, j) = \exp\left(-\frac{1}{L} \sum_{c=1}^{L} \frac{k(x_i^c, x_j^c)}{A_c}\right)$.

Here, $x_i^c$ is the c-th descriptor for the i-th video, and $A_c$ is the mean of the $\chi^2$ distance between the training examples for the c-th channel. The multi-class classification is done by LIBSVM [4] using a one-vs-all approach. The performance is denoted as "baseline" in Tab. 2 (1), and the flow is shown in Fig. 2 (1).
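The multi-channel kernel can be computed directly from per-channel χ² distances; a minimal sketch follows. The data layout (one histogram per channel per video) is an assumption, and the precomputed-kernel SVM from scikit-learn is mentioned only as one way to consume K, whereas the paper uses LIBSVM.

```python
import numpy as np

def chi2_distance(x, y, eps=1e-10):
    """Chi-squared distance between two non-negative histograms."""
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

def multichannel_kernel(videos_a, videos_b, A):
    """Combined kernel K(i, j) = exp(-(1/L) * sum_c chi2(x_i^c, x_j^c) / A_c).

    videos_a, videos_b: lists of per-video dicts, one histogram per channel c.
    A: dict mapping channel c to the mean chi-squared distance of that channel
    over the training examples. Channel names and layout are illustrative.
    """
    L = len(A)
    K = np.zeros((len(videos_a), len(videos_b)))
    for i, xi in enumerate(videos_a):
        for j, xj in enumerate(videos_b):
            s = sum(chi2_distance(xi[c], xj[c]) / A[c] for c in A)
            K[i, j] = np.exp(-s / L)
    return K

# The precomputed kernel can then be passed to an SVM, e.g.
# sklearn.svm.SVC(kernel="precomputed"), trained one-vs-all per action class.
```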
4.2. DT given puppet flow

We cannot evaluate the gain of having perfect dense optical flow, and therefore perfect trajectories. Instead, we use the puppet flow as the ground truth motion in the foreground, i.e. within the puppet mask (pmask). When the body parts move only slightly from one frame to the next, the puppets do not always move correspondingly because small translations are not easily observed and annotated. To address this, we replace the puppet flow for each body part that does not move with the flow from the baseline.

To evaluate the quality of the foreground flow, we set the flow outside pmask to zero to disable tracks outside the foreground. We compare optical flow (of) computed by Farnebäck's method and puppet flow (pf), as shown in Fig. 2 (2-3). Masking optical flow results in a 4 percentage points (pp) gain over the baseline, and masking puppet flow gives a 6 pp gain (Tab. 2 (2-3)). The gain mainly comes from HOF and MBH.

We dilate the puppet mask to include the narrow strip surrounding the person's contour, called Dmask. The width is scale dependent, ranging from 1 to 10 pixels with an average width of 6 pixels. Since the puppet flow is not defined outside the puppet mask, of is used on the narrow strip, as shown in Fig. 2 (4). Using Dmask increases the performance of (3) by 2.3 pp (Tab. 2 (4) vs. (3)). Comparing Fig. 2 (3) and (4), the latter has clear flow discontinuities caused …
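A sketch of how the masked flow fields in this section could be assembled from the annotations: puppet flow inside pmask, zero flow outside (to disable background tracks), and, for the Dmask variant, Farnebäck flow on the narrow strip. The handling of un-annotated, non-moving body parts is omitted here for simplicity.

```python
import numpy as np

def compose_foreground_flow(puppet_flow, baseline_flow, pmask, dmask=None):
    """Build the flow fields used in the masking experiments above.

    puppet_flow, baseline_flow: H x W x 2 arrays of (u, v).
    pmask: boolean puppet mask; dmask: optional dilated mask (pmask plus a
    narrow strip around the contour). This is an illustrative simplification.
    """
    flow = np.zeros_like(baseline_flow)       # zero flow disables tracks outside
    flow[pmask] = puppet_flow[pmask]          # ground-truth motion inside the mask
    if dmask is not None:
        strip = dmask & ~pmask                # narrow strip around the contour
        flow[strip] = baseline_flow[strip]    # puppet flow undefined there: use of
    return flow
```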

Citations
Proceedings ArticleDOI
23 Jun 2014
TL;DR: A novel benchmark "MPII Human Pose" is introduced that makes a significant advance in terms of diversity and difficulty, a contribution that is required for future developments in human body models.
Abstract: Human pose estimation has made significant progress during the last years. However current datasets are limited in their coverage of the overall pose estimation challenges. Still these serve as the common sources to evaluate, train and compare different models on. In this paper we introduce a novel benchmark "MPII Human Pose" that makes a significant advance in terms of diversity and difficulty, a contribution that we feel is required for future developments in human body models. This comprehensive dataset was collected using an established taxonomy of over 800 human activities [1]. The collected images cover a wider variety of human activities than previous datasets including various recreational, occupational and householding activities, and capture people from a wider range of viewpoints. We provide a rich set of labels including positions of body joints, full 3D torso and head orientation, occlusion labels for joints and body parts, and activity labels. For each image we provide adjacent video frames to facilitate the use of motion information. Given these rich annotations we perform a detailed analysis of leading human pose estimation approaches and gaining insights for the success and failures of these methods.

2,372 citations


Cites background from "Towards Understanding Action Recogn..."

  • ...6M” [10] that includes images and 3D poses of people but are captured in the controlled indoor environments, whereas our dataset includes real-world images but provides 2D poses only....


Proceedings ArticleDOI
18 Jun 2018
TL;DR: The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.
Abstract: This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.8% mAP, underscoring the need for developing new approaches for video understanding.

850 citations


Cites background or methods from "Towards Understanding Action Recogn..."

  • ...We keep the softmax loss on JHMDB as it is the default loss used by previous methods on this dataset....

  • ...To demonstrate the competitiveness of our baseline methods, we also apply them to the JHMDB dataset [18] and compare the results against the previous state-of-the-art....

  • ...A few datasets, such as CMU [20], MSR Actions [37], UCF Sports [29] and JHMDB [18] provide spatio-temporal annotations in each frame for short trimmed videos....

  • ...One key difference between AVA and JHMDB (as well as many other action datasets) is that action labels in AVA are not mutually exclusive, i.e., multiple labels can be assigned to one bounding box....

  • ...We can also see that using Deep Flow extracted flows and stacking multiple flows are both helpful for JHMDB as well as AVA....

Book ChapterDOI
Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, Kevin Murphy
08 Sep 2018
TL;DR: In this article, it was shown that it is possible to replace many of the expensive 3D convolutions by low-cost 2D convolution, and the best result was achieved when replacing the 3D CNNs at the bottom of the network, suggesting that temporal representation learning on high-level semantic features is more useful.
Abstract: Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic as that in 2D static image classification. Three main challenges exist including spatial (image) feature representation, temporal information representation, and model/computation complexity. It was recently shown by Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained on ImageNet, could be a promising way for spatial and temporal representation learning. However, as for model/computation complexity, 3D CNNs are much more expensive than 2D CNNs and prone to overfit. We seek a balance between speed and accuracy by building an effective and efficient video classification system through systematic exploration of critical network design choices. In particular, we show that it is possible to replace many of the 3D convolutions by low-cost 2D convolutions. Rather surprisingly, best result (in both speed and accuracy) is achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level "semantic" features is more useful. Our conclusion generalizes to datasets with very different properties. When combined with several other cost-effective designs including separable spatial/temporal convolution and feature gating, our system results in an effective video classification system that produces very competitive results on several action classification benchmarks (Kinetics, Something-something, UCF101 and HMDB), as well as two action detection (localization) benchmarks (JHMDB and UCF101-24).

809 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: In this article, the authors proposed a method to extract spatio-temporal feature representations to build strong classifiers using Convolutional Neural Networks and link their predictions to produce detections consistent in time.
Abstract: We address the problem of action detection in videos. Driven by the latest progress in object detection from 2D images, we build action models using rich feature hierarchies derived from shape and kinematic cues. We incorporate appearance and motion in two ways. First, starting from image region proposals we select those that are motion salient and thus are more likely to contain the action. This leads to a significant reduction in the number of regions being processed and allows for faster computations. Second, we extract spatio-temporal feature representations to build strong classifiers using Convolutional Neural Networks. We link our predictions to produce detections consistent in time, which we call action tubes. We show that our approach outperforms other techniques in the task of action detection.

694 citations

Posted Content
Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, Kevin Murphy
TL;DR: It is shown that it is possible to replace many of the 3D convolutions by low-cost 2D convolution, suggesting that temporal representation learning on high-level “semantic” features is more useful.
Abstract: Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic as that in 2D static image classification. Three main challenges exist including spatial (image) feature representation, temporal information representation, and model/computation complexity. It was recently shown by Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained on ImageNet, could be a promising way for spatial and temporal representation learning. However, as for model/computation complexity, 3D CNNs are much more expensive than 2D CNNs and prone to overfit. We seek a balance between speed and accuracy by building an effective and efficient video classification system through systematic exploration of critical network design choices. In particular, we show that it is possible to replace many of the 3D convolutions by low-cost 2D convolutions. Rather surprisingly, best result (in both speed and accuracy) is achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level semantic features is more useful. Our conclusion generalizes to datasets with very different properties. When combined with several other cost-effective designs including separable spatial/temporal convolution and feature gating, our system results in an effective video classification system that that produces very competitive results on several action classification benchmarks (Kinetics, Something-something, UCF101 and HMDB), as well as two action detection (localization) benchmarks (JHMDB and UCF101-24).

684 citations


Cites methods from "Towards Understanding Action Recogn..."

  • ...We report performance on two widely adopted video action detection datasets: JHDMB [52] and UCF-101-24 [45]....


References
Journal ArticleDOI
TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.

40,826 citations


"Towards Understanding Action Recogn..." refers methods in this paper

  • ...The multi-class classification is done by LIBSVM [4] using a one-vs-all approach....


Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

31,952 citations


"Towards Understanding Action Recogn..." refers methods in this paper

  • ...HOG: Histograms of oriented gradients [5] of 8 bins are computed in a 32-pixels × 32-pixels × 15-frames spatiotemporal volume surrounding the trajectory....


Journal ArticleDOI
01 Aug 2004
TL;DR: A more powerful, iterative version of the optimisation of the graph-cut approach is developed and the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result.
Abstract: The problem of efficient, interactive foreground/background segmentation in still images is of great practical importance in image editing. Classical image segmentation tools use either texture (colour) information, e.g. Magic Wand, or edge (contrast) information, e.g. Intelligent Scissors. Recently, an approach based on optimization by graph-cut has been developed which successfully combines both types of information. In this paper we extend the graph-cut approach in three respects. First, we have developed a more powerful, iterative version of the optimisation. Secondly, the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result. Thirdly, a robust algorithm for "border matting" has been developed to estimate simultaneously the alpha-matte around an object boundary and the colours of foreground pixels. We show that for moderately difficult examples the proposed method outperforms competitive tools.

5,670 citations

Proceedings ArticleDOI
23 Jun 2008
TL;DR: A new method for video classification that builds upon and extends several recent ideas including local space-time features,space-time pyramids and multi-channel non-linear SVMs is presented and shown to improve state-of-the-art results on the standard KTH action dataset.
Abstract: The aim of this paper is to address recognition of natural human actions in diverse and realistic video settings. This challenging but important subject has mostly been ignored in the past due to several problems one of which is the lack of realistic and annotated video datasets. Our first contribution is to address this limitation and to investigate the use of movie scripts for automatic annotation of human actions in videos. We evaluate alternative methods for action retrieval from scripts and show benefits of a text-based classifier. Using the retrieved action samples for visual learning, we next turn to the problem of action classification in video. We present a new method for video classification that builds upon and extends several recent ideas including local space-time features, space-time pyramids and multi-channel non-linear SVMs. The method is shown to improve state-of-the-art results on the standard KTH action dataset by achieving 91.8% accuracy. Given the inherent problem of noisy labels in automatic annotation, we particularly investigate and show high tolerance of our method to annotation errors in the training set. We finally apply the method to learning and classifying challenging action classes in movies and show promising results.

3,833 citations


"Towards Understanding Action Recogn..." refers background or methods in this paper

  • ...Commonly considered sources for action recognition are sport activities [18], YouTube videos [21], or movie scenes [14, 16]....

    [...]

  • ...HOF: Histograms of optical flow [16] are computed similarly as HOG except that there are 9 bins with the additional one corresponding to pixels with optical flow magnitude lower than a threshold....

    [...]

Proceedings ArticleDOI
06 Nov 2011
TL;DR: This paper uses the largest action video database to-date with 51 action categories, which in total contain around 7,000 manually annotated clips extracted from a variety of sources ranging from digitized movies to YouTube, to evaluate the performance of two representative computer vision systems for action recognition and explore the robustness of these methods under various conditions.
Abstract: With nearly one billion online videos viewed everyday, an emerging new frontier in computer vision research is recognition and search in video. While much effort has been devoted to the collection and annotation of large scalable static image datasets containing thousands of image categories, human action datasets lag far behind. Current action recognition databases contain on the order of ten different action categories collected under fairly controlled conditions. State-of-the-art performance on these datasets is now near ceiling and thus there is a need for the design and creation of new benchmarks. To address this issue we collected the largest action video database to-date with 51 action categories, which in total contain around 7,000 manually annotated clips extracted from a variety of sources ranging from digitized movies to YouTube. We use this database to evaluate the performance of two representative computer vision systems for action recognition and explore the robustness of these methods under various conditions such as camera motion, viewpoint, video quality and occlusion.

3,533 citations


"Towards Understanding Action Recogn..." refers background or methods in this paper

  • ...Since HMDB51 [14] is the most challenging dataset among the current movie datasets [30], we build on it to create J-HMDB....

    [...]

  • ...Training and testing splits are generated as in [14]....

    [...]

  • ...According to [30], the HMDB51 dataset [14] is the most challenging dataset for vision algorithms, with the best method achieving only 48% accuracy....

    [...]

  • ...The HMDB51 database [14] contains more than 5,100 clips of 51 different human actions collected from movies or the Internet....

    [...]

  • ...We focus on one of the most challenging datasets for action recognition (HMDB51 [14]) and on the approach that achieves the best performance on this dataset (Dense Trajectories [30])....

    [...]