
HAL Id: hal-00873267
https://hal.inria.fr/hal-00873267v2
Submitted on 16 Oct 2013
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Action Recognition with Improved Trajectories
Heng Wang, Cordelia Schmid
To cite this version:
Heng Wang, Cordelia Schmid. Action Recognition with Improved Trajectories. ICCV - IEEE International Conference on Computer Vision, Dec 2013, Sydney, Australia. pp. 3551-3558, 10.1109/ICCV.2013.441. hal-00873267v2

Action Recognition with Improved Trajectories
Heng Wang and Cordelia Schmid
LEAR, INRIA, France
firstname.lastname@inria.fr
Abstract
Recently dense trajectories were shown to be an efficient video representation for action recognition and achieved state-of-the-art results on a variety of datasets. This paper improves their performance by taking into account camera motion to correct them. To estimate camera motion, we match feature points between frames using SURF descriptors and dense optical flow, which are shown to be complementary. These matches are, then, used to robustly estimate a homography with RANSAC. Human motion is in general different from camera motion and generates inconsistent matches. To improve the estimation, a human detector is employed to remove these matches. Given the estimated camera motion, we remove trajectories consistent with it. We also use this estimation to cancel out camera motion from the optical flow. This significantly improves motion-based descriptors, such as HOF and MBH. Experimental results on four challenging action datasets (i.e., Hollywood2, HMDB51, Olympic Sports and UCF50) significantly outperform the current state of the art.
1. Introduction
Action recognition has been an active research area for
over three decades. Recent research focuses on realistic
datasets collected from movies [20, 22], web videos [21,
31], TV shows [28], etc. These datasets impose significant
challenges on action recognition, e.g., background clutter,
fast irregular motion, occlusion, viewpoint changes. Local
space-time features [7, 19] were shown to be successful on
these datasets, since they avoid non-trivial pre-processing
steps, such as tracking or segmentation. A bag-of-features
representation of these local features can be directly used
for action classification and achieves state-of-the-art performance (see [1] for a recent survey).
Many classical image features have been generalized to videos, e.g., 3D-SIFT [33], extended SURF [41], HOG3D [16], and local trinary patterns [43]. Among the local space-time features, dense trajectories [40] have been shown to perform best on a variety of datasets. The main idea is to densely sample feature points in each frame, and track them in the video based on optical flow. Multiple descriptors are computed along the trajectories of feature points to capture shape, appearance and motion information. Interestingly, motion boundary histograms (MBH) [6] give the best results due to their robustness to camera motion.

Figure 1. First row: images of two consecutive frames overlaid; second row: optical flow [8] between the two frames; third row: optical flow after removing camera motion; last row: trajectories removed due to camera motion in white.

MBH is based on derivatives of optical flow, which is a simple and efficient way to suppress camera motion. However, we argue that we can still benefit from explicit camera motion estimation. Camera motion generates many irrelevant trajectories in the background in realistic videos. We can prune them and only keep trajectories from humans or objects of interest, if we know the camera motion (see Figure 1). Furthermore, given the camera motion, we can correct the optical flow, so that the motion vectors of human actors are independent of camera motion. This improves the performance of motion descriptors based on optical flow, i.e., HOF (histograms of optical flow) and MBH. We illustrate the difference between the original and corrected optical flow in the middle two rows of Figure 1.

Figure 2. Visualization of inlier matches of the robustly estimated homography. Green arrows correspond to SURF descriptor matches, and red ones to dense optical flow.
Very few approaches consider camera motion when extracting feature trajectories for action recognition. Uemura et al. [38] combine feature matching with image segmentation to estimate the dominant camera motion, and then separate feature tracks from the background. Wu et al. [42] apply a low-rank assumption to decompose feature trajectories into camera-induced and object-induced components. Recently, Park et al. [27] perform weak stabilization to remove both camera and object-centric motion using coarse-scale optical flow for pedestrian detection and pose estimation in video. Jain et al. [14] decompose visual motion into dominant and residual motions both for extracting trajectories and computing descriptors.

Among the approaches improving dense trajectories, Vig et al. [39] propose to use saliency-mapping algorithms to prune background features. This results in a more compact video representation, and improves action recognition accuracy. Jiang et al. [15] cluster dense trajectories, and use the cluster centers as reference points so that the relationship between them can be modeled.
The rest of the paper is organized as follows. In section 2, we detail our approach for camera motion estimation and discuss how to remove inconsistent matches due to humans. Experimental setup and evaluation protocols are explained in section 3 and experimental results in section 4. The code to compute improved trajectories and descriptors is available online.¹

¹ http://lear.inrialpes.fr/~wang/improved_trajectories
Figure 3. Examples of removed trajectories under various camera
motions, e.g., pan, zoom, tilt. White trajectories are considered to be due to camera motion. The red dots are the trajectory positions in
the current frame. The last row shows two failure cases. The left
one is due to severe motion blur. The right one fits the homography
to the moving humans as they dominate the frame.
2. Improving dense trajectories
In this section, we first describe the major steps of our camera motion estimation method, and how to use it to improve dense trajectories. We, then, discuss how to remove potentially inconsistent matches due to humans to obtain a robust homography estimation.
2.1. Camera motion estimation
To estimate the global background motion, we assume that two consecutive frames are related by a homography [37]. This assumption holds in most cases as the global motion between two frames is usually small. It excludes independently moving objects, such as humans and vehicles.

To estimate the homography, the first step is to find the correspondences between two frames. We combine two approaches in order to generate sufficient and complementary candidate matches. We extract SURF [3] features and match them based on the nearest neighbor rule. The reason for choosing SURF features is their robustness to motion blur, as shown in a recent evaluation [13].

We also sample motion vectors from the optical flow, which provides us with dense matches between frames. Here, we use an efficient optical flow algorithm based on polynomial expansion [8]. We select motion vectors for salient feature points using the good-features-to-track criterion [35], i.e., thresholding the smallest eigenvalue of the autocorrelation matrix.

Figure 4. Homography estimation without human detector (left) and with human detector (right). We show inlier matches in the first and
third columns. The optical flow (second and fourth columns) is warped with the corresponding homography. The first and second rows
show a clear improvement of the estimated homography, when using a human detector. The last row presents a failure case. See the text
for details.
The two approaches are complementary. SURF focuses on blob-type structures, whereas [35] fires on corners and edges. Figure 2 visualizes the two types of matches in different colors. Combining them results in a more balanced distribution of matched points, which is critical for a good homography estimation.
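As a rough illustration of the two complementary match sources, the following Python sketch uses OpenCV (the contrib build for SURF, Farneback flow for the polynomial-expansion algorithm); the function name and parameter values are illustrative assumptions, not the released implementation.

import cv2
import numpy as np

def candidate_matches(prev_gray, curr_gray):
    # Returns (pts_prev, pts_curr), two N x 2 arrays of candidate correspondences.
    # 1) SURF keypoints matched with the nearest-neighbor rule.
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    kp1, des1 = surf.detectAndCompute(prev_gray, None)
    kp2, des2 = surf.detectAndCompute(curr_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_L2).match(des1, des2)
    surf_prev = np.float32([kp1[m.queryIdx].pt for m in matches])
    surf_curr = np.float32([kp2[m.trainIdx].pt for m in matches])

    # 2) Dense optical flow sampled at good-features-to-track corners, i.e. points
    #    with a large smallest eigenvalue of the autocorrelation matrix.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    corners = cv2.goodFeaturesToTrack(prev_gray, maxCorners=1000,
                                      qualityLevel=0.01, minDistance=5).reshape(-1, 2)
    flow_prev = corners
    flow_curr = corners + flow[corners[:, 1].astype(int), corners[:, 0].astype(int)]

    # Combining both sets gives a more balanced spatial distribution of matches.
    return (np.vstack([surf_prev, flow_prev]),
            np.vstack([surf_curr, flow_curr]))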
We, then, robustly estimate the homography using RANSAC [11]. This allows us to rectify the image to remove the camera motion. Figure 1 (two rows in the middle) demonstrates the difference of optical flow before and after rectification. Compared to the original flow (the second row of Figure 1), the rectified version (the third row) suppresses the background camera motion and enhances the foreground moving objects.
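Under the same assumptions, the RANSAC estimation and the flow correction can be sketched as follows; the key detail is the warp direction, mapping the second frame back into the first frame's coordinates.

import cv2
import numpy as np

def cancel_camera_motion(prev_gray, curr_gray, pts_prev, pts_curr):
    # Robust homography from the combined candidate matches.
    H, inlier_mask = cv2.findHomography(pts_prev, pts_curr, cv2.RANSAC,
                                        ransacReprojThreshold=1.0)
    h, w = curr_gray.shape
    # Rectify the second frame into the first frame's coordinates.
    curr_warped = cv2.warpPerspective(curr_gray, np.linalg.inv(H), (w, h))
    # Flow recomputed against the warped frame is (approximately) free of camera motion.
    warped_flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_warped, None,
                                               0.5, 3, 15, 3, 5, 1.2, 0)
    return H, warped_flow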
For dense trajectories, there are two major advantages of canceling out camera motion from optical flow. First, the motion descriptors can directly benefit from this. As shown in [40], the performance of the HOF descriptor degrades significantly in the presence of camera motion. Our experimental results (in section 4.1) show that HOF can achieve similar performance as MBH when we have correct foreground optical flow. The combination of HOF and MBH can further improve the results as they represent zero-order (HOF) and first-order (MBH) motion information.

Second, we can remove trajectories generated by camera motion. This can be achieved by thresholding the displacement vectors of the trajectories in the warped flow field. If the displacement is too small, the trajectory is considered to be too similar to camera motion, and thus removed. Figure 3 shows examples of removed background trajectories. Our method works well under various camera motions (e.g., pan, tilt and zoom) and only trajectories related to human actions are kept (shown in green in Figure 3). This gives us similar effects as sampling features based on visual saliency maps [23, 39].
The last row of Figure 3 shows two failure cases. The left one is due to severe motion blur, which makes both SURF descriptor matching and optical flow estimation unreliable. Improving motion estimation in the presence of motion blur is worth further attention, since blur often occurs in realistic datasets. In the example shown on the right, humans dominate the frame, which causes homography estimation to fail. We discuss a solution for such cases in the following section.
2.2. Removing inconsistent matches due to humans
In action datasets, videos often focus on the humans performing the action. As a result, it is very common that humans dominate the frame, which can be a problem for camera motion estimation as human motion is in general not consistent with it. We propose to use a human detector to remove matches from human regions. In general, human detection in action datasets is rather difficult, as there are dramatic pose changes when the person is performing the action. Furthermore, the person may be only partially visible due to occlusion or being partially out of view.

Here, we apply a state-of-the-art human detector [30], which adapts the general part-based human detector [9] to action datasets. The detector combines several part detectors dedicated to different regions of the human body (including full person, upper-body and face). It is trained using the PASCAL VOC07 training data for humans as well as near-frontal upper-bodies from [10]. Figure 4, third column, shows some examples of human detection results.
We use the human detector as a mask to remove feature matches inside the bounding boxes when estimating the homography. Without human detection (the left two columns of Figure 4), many features from the moving humans become inlier matches and the homography is, thus, incorrect. As a result, the corresponding optical flow is not correctly warped. In contrast, camera motion is successfully compensated (the right two columns of Figure 4), when the human bounding boxes are used to remove matches not corresponding to camera motion. The last row of Figure 4 shows a failure case. The homography does not fit the background very well despite detecting the humans correctly, as the background is represented by two planes, one of which is very close to the camera. In section 4.3, we compare the performance of action recognition with or without human detection.
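A minimal sketch of this masking step, assuming boxes are given as (x1, y1, x2, y2) tuples in the first frame (the helper name is hypothetical):

import numpy as np

def remove_human_matches(pts_prev, pts_curr, human_boxes):
    keep = np.ones(len(pts_prev), dtype=bool)
    for (x1, y1, x2, y2) in human_boxes:
        inside = ((pts_prev[:, 0] >= x1) & (pts_prev[:, 0] <= x2) &
                  (pts_prev[:, 1] >= y1) & (pts_prev[:, 1] <= y2))
        keep &= ~inside          # drop matches that originate on a detected human
    return pts_prev[keep], pts_curr[keep]

The remaining matches are then passed to the RANSAC homography estimation described in section 2.1.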
The human detector does not always work perfectly. It can miss humans due to pose or viewpoint changes. In order to compensate for missing detections, we track all the bounding boxes obtained by the human detector. Tracking is performed forward and backward for each frame of the video. Our approach is simple, i.e., we take the average flow vector [8] and propagate the detections to the next frame. We track each bounding box for at most 15 frames and stop if there is a 50% overlap with another bounding box. All the human bounding boxes are available online.¹ In the following, we always use the human detector to remove potentially inconsistent matches before computing the homography, unless stated otherwise.
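The forward propagation of a detection can be sketched as below, assuming flows[t] is the dense flow from frame t to t+1 and boxes are (x1, y1, x2, y2) tuples; the overlap measure and helper names are assumptions.

import numpy as np

def overlap_ratio(a, b):
    # Intersection area divided by the area of box a.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    return ix * iy / area_a if area_a > 0 else 0.0

def propagate_box(box, flows, other_boxes_per_frame, max_frames=15):
    boxes = [box]
    for t in range(min(max_frames, len(flows))):
        x1, y1, x2, y2 = map(int, boxes[-1])
        region = flows[t][y1:y2, x1:x2]
        if region.size == 0:
            break
        dx, dy = region.reshape(-1, 2).mean(axis=0)    # average flow inside the box
        new_box = (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
        if any(overlap_ratio(new_box, b) > 0.5 for b in other_boxes_per_frame[t + 1]):
            break                                      # merged with another detection
        boxes.append(new_box)
    return boxes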
3. Experimental setup
In this section, we first present implementation details
for our trajectory features. We, then, introduce the feature
encoding used in our evaluation. Finally, the datasets and
experimental setup are presented.
3.1. Trajectory features
We, first, briefly describe the dense trajectory features [40], which are used as the baseline in our experiments. The approach densely samples points for several spatial scales. Points in homogeneous areas are suppressed, as it is impossible to track them reliably. Tracking points is achieved by median filtering in a dense optical flow field [8]. In order to avoid drifting, we only track the feature points for 15 frames and sample new points to replace them. We remove static feature trajectories as they do not contain motion information, and also prune trajectories with sudden large displacements.
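The per-frame tracking step of this baseline can be sketched as follows; the 3 x 3 median kernel size is an assumption, and the point is simply displaced by the median-filtered flow at its (rounded) position.

import cv2
import numpy as np

def track_point(point, flow):
    # flow: H x W x 2 dense optical flow towards the next frame.
    fx = cv2.medianBlur(np.ascontiguousarray(flow[..., 0]), 3)
    fy = cv2.medianBlur(np.ascontiguousarray(flow[..., 1]), 3)
    h, w = fx.shape
    x = int(np.clip(round(point[0]), 0, w - 1))
    y = int(np.clip(round(point[1]), 0, h - 1))
    return (point[0] + fx[y, x], point[1] + fy[y, x])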
For each trajectory, we compute several descriptors (i.e., Trajectory, HOG, HOF and MBH) with exactly the same parameters as [40]. The Trajectory descriptor is a concatenation of normalized displacement vectors. The other descriptors are computed in the space-time volume aligned with the trajectory. HOG is based on the orientation of image gradients and captures the static appearance information. Both HOF and MBH measure motion information, and are based on optical flow. HOF directly quantizes the orientation of flow vectors. MBH splits the optical flow into horizontal and vertical components, and quantizes the derivatives of each component. The final dimensions of the descriptors are 30 for Trajectory, 96 for HOG, 108 for HOF and 192 for MBH.
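To make the zero-order/first-order distinction concrete, the sketch below computes, for a single cell of (warped) flow, an orientation histogram of the flow itself (HOF) and of the spatial derivatives of each flow component (MBH). It deliberately omits the zero bin of HOF, the space-time cell grid and the interpolation used by the full descriptors.

import cv2
import numpy as np

def orientation_histogram(gx, gy, bins=8):
    mag = np.sqrt(gx ** 2 + gy ** 2)
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    idx = np.minimum((ang / (2 * np.pi) * bins).astype(int), bins - 1)
    hist = np.zeros(bins, dtype=np.float32)
    np.add.at(hist, idx.ravel(), mag.ravel())    # magnitude-weighted voting
    return hist

def hof_mbh_cell(flow_cell):
    fx = np.ascontiguousarray(flow_cell[..., 0])
    fy = np.ascontiguousarray(flow_cell[..., 1])
    hof = orientation_histogram(fx, fy)          # zero-order: flow orientation
    mbh_x = orientation_histogram(cv2.Sobel(fx, cv2.CV_32F, 1, 0),
                                  cv2.Sobel(fx, cv2.CV_32F, 0, 1))
    mbh_y = orientation_histogram(cv2.Sobel(fy, cv2.CV_32F, 1, 0),
                                  cv2.Sobel(fy, cv2.CV_32F, 0, 1))
    return hof, mbh_x, mbh_y                     # first-order: flow derivatives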
To normalize the histogram-based descriptors, i.e., HOG, HOF and MBH, we apply the recent RootSIFT [2] approach, i.e., square root each dimension after L1 normalization. We do not perform L2 normalization as in [40]. This brings about 0.5% improvement for the histogram-based descriptors. We use this normalization in all the experiments.
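This normalization amounts to the following (a sketch; descriptor entries are non-negative histogram counts):

import numpy as np

def rootsift_normalize(desc, eps=1e-7):
    desc = np.asarray(desc, dtype=np.float32)
    desc = desc / (desc.sum() + eps)   # L1 normalization
    return np.sqrt(desc)               # square root each dimension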
To extract our improved trajectories, we sample and track feature points exactly the same way as [40], see above. To compute the descriptors, we first estimate the homography with RANSAC using the feature matches extracted between two consecutive frames; matches on detected humans are removed. We, then, warp the second frame with the estimated homography. The optical flow [8] is re-computed between the first and the warped second frame. Motion descriptors (HOF and MBH) are computed on the warped optical flow. The HOG descriptor remains unchanged. We estimate the homography and warped optical flow for every two frames independently to avoid error propagation. We use the same parameters and the RootSIFT normalization as in the baseline.
The Trajectory descriptor is also computed based on the motion vectors of the warped flow. We further utilize these stabilized motion vectors to remove background trajectories. For each trajectory, we compute the maximal magnitude of its motion vectors. If the maximal magnitude is lower than a threshold (i.e., 1 pixel), the trajectory is considered to be consistent with camera motion, and thus removed.
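The pruning criterion reduces to a one-line test per trajectory (a sketch, using the 1-pixel threshold from above):

import numpy as np

def keep_trajectory(warped_displacements, threshold=1.0):
    # warped_displacements: (L, 2) per-frame displacement vectors in the warped flow.
    magnitudes = np.linalg.norm(np.asarray(warped_displacements), axis=1)
    return magnitudes.max() > threshold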
3.2. Feature encoding
To encode features, we use bag of features and Fisher vector. For bag of features, we use identical settings to [40]. We train a codebook for each descriptor type using 100,000 randomly sampled features with k-means. The size of the codebook is set to 4000. An SVM with RBF-χ² kernel is used for classification, and different descriptor types are
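A compact sketch of this bag-of-features pipeline, assuming scikit-learn (the codebook here is fit on all training descriptors rather than the 100,000-feature subset, and the Fisher vector variant is omitted):

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_bof_classifier(train_descs, train_labels, vocab_size=4000):
    # train_descs: list of per-video descriptor arrays (one descriptor type).
    codebook = MiniBatchKMeans(n_clusters=vocab_size, n_init=3).fit(np.vstack(train_descs))

    def encode(video_descs):
        words = codebook.predict(video_descs)
        hist = np.bincount(words, minlength=vocab_size).astype(np.float32)
        return hist / max(hist.sum(), 1.0)

    X = np.array([encode(d) for d in train_descs])
    K = chi2_kernel(X, X)                          # RBF-chi2 kernel matrix
    clf = SVC(kernel="precomputed").fit(K, train_labels)
    return codebook, X, clf

At test time a video is encoded the same way and classified by passing chi2_kernel(x_test, X) to the trained SVM.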

Citations
Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: In this article, the non-local operation computes the response at a position as a weighted sum of the features at all positions, which can be used to capture long-range dependencies.
Abstract: Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method [4] in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our nonlocal models can compete or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code will be made available.

8,059 citations

Proceedings ArticleDOI
07 Dec 2015
TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.
Abstract: We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets, 2) A homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets, and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset with only 10 dimensions and also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.

7,091 citations


Cites background or methods from "Action Recognition with Improved Trajectories"

  • ...We apply the same process with iDT [44] as well as Imagenet features [7] and compare the results in Figure 5....

  • ...C3D is 91x faster than improved dense trajectories [44] and 274x faster than Brox’s GPU implementation in OpenCV....

  • ...proposed improved Dense Trajectories (iDT) [44] which is currently the state-of-the-art hand-crafted feature....

  • ...For iDT, we use the code kindly provided by the authors [44]....

  • ...Baselines: We compare C3D feature with a few baselines: the current best hand-crafted features, namely improved dense trajectories (iDT) [44] and the popular-used deep image features, namely Imagenet [16], using Caffe’s Imagenet pre-train model....

Proceedings Article
08 Dec 2014
TL;DR: This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
Abstract: We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multitask learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.

6,397 citations


Cites background or methods or results from "Action Recognition with Improved Trajectories"

  • ...There still remain some essential ingredients of the state-of-the-art shallow representation [26], which are missed in our current architecture....

  • ...Recent improvements of trajectory-based hand-crafted representations include compensation of global (camera) motion [10, 16, 26], and the use of the Fisher vector encoding [22] (in [26]) or its deeper variant [23] (in [21])....

  • ...deep architecture significantly outperforms that of [14] and is competitive with the state of the art shallow representations [20, 21, 26] in spite of being trained on relatively small datasets....

  • ...The combination of the two nets further improves the results (in line with the single-split experiments above), and is comparable to the very recent state-of-the-art hand-crafted models [20, 21, 26]....

  • ...The importance of camera motion compensation has been previously highlighted in [10, 26], where a global motion component was estimated and subtracted from the dense flow....

Proceedings ArticleDOI
21 Jul 2017
TL;DR: In this article, a Two-Stream Inflated 3D ConvNet (I3D) is proposed to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and their parameters.
Abstract: The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101.

5,073 citations

References
Journal ArticleDOI
TL;DR: New results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form that provide the basis for an automatic system that can solve the Location Determination Problem under difficult viewing.
Abstract: A new paradigm, Random Sample Consensus (RANSAC), for fitting a model to experimental data is introduced. RANSAC is capable of interpreting/smoothing data containing a significant percentage of gross errors, and is thus ideally suited for applications in automated image analysis where interpretation is based on the data provided by error-prone feature detectors. A major portion of this paper describes the application of RANSAC to the Location Determination Problem (LDP): Given an image depicting a set of landmarks with known locations, determine that point in space from which the image was obtained. In response to a RANSAC requirement, new results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form. These results provide the basis for an automatic system that can solve the LDP under difficult viewing

23,396 citations



Book ChapterDOI
07 May 2006
TL;DR: A novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.
Abstract: In this paper, we present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). It approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (in casu, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. Both show SURF's strong performance.

13,011 citations

Journal ArticleDOI
TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.
Abstract: We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI--SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.

10,501 citations


"Action Recognition with Improved Tr..." refers methods in this paper

  • ...Finally, the datasets and experimental setup are presented....

    [...]

Proceedings ArticleDOI
21 Jun 1994
TL;DR: A feature selection criterion that is optimal by construction because it is based on how the tracker works, and a feature monitoring method that can detect occlusions, disocclusions, and features that do not correspond to points in the world are proposed.
Abstract: No feature-based vision system can work unless good features can be identified and tracked from frame to frame. Although tracking itself is by and large a solved problem, selecting features that can be tracked well and correspond to physical points in the world is still hard. We propose a feature selection criterion that is optimal by construction because it is based on how the tracker works, and a feature monitoring method that can detect occlusions, disocclusions, and features that do not correspond to points in the world. These methods are based on a new tracking algorithm that extends previous Newton-Raphson style search methods to work under affine image transformations. We test performance with several simulations and experiments. >

8,432 citations


"Action Recognition with Improved Tr..." refers methods in this paper

  • ...It is trained using the PASCAL VOC07 training data for humans as well as near-frontal upper-bodies from [10]....

    [...]

  • ...In contrast, camera motion is successfully compensated (the right two columns of Figure 4), when the human bounding boxes are used to remove matches not corresponding to camera motion....

    [...]

Journal ArticleDOI
TL;DR: The performance of the spatial envelope model shows that specific information about object shape or identity is not a requirement for scene categorization and that modeling a holistic representation of the scene informs about its probable semantic category.
Abstract: In this paper, we propose a computational model of the recognition of real world scenes that bypasses the segmentation and the processing of individual objects or regions. The procedure is based on a very low dimensional representation of the scene, that we term the Spatial Envelope. We propose a set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness) that represent the dominant spatial structure of a scene. Then, we show that these dimensions may be reliably estimated using spectral and coarsely localized information. The model generates a multidimensional space in which scenes sharing membership in semantic categories (e.g., streets, highways, coasts) are projected closed together. The performance of the spatial envelope model shows that specific information about object shape or identity is not a requirement for scene categorization and that modeling a holistic representation of the scene informs about its probable semantic category.

6,882 citations


"Action Recognition with Improved Tr..." refers methods in this paper

  • ...This work was supported by Quaero (funded by OSEO, French State agency for innovation), the European integrated project AXES, the MSR/INRIA joint project and the ERC advanced grant ALLEGRO....

    [...]
