
HAL Id: hal-00873267
https://hal.inria.fr/hal-00873267v2
Submitted on 16 Oct 2013
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Action Recognition with Improved Trajectories
Heng Wang, Cordelia Schmid
To cite this version:
Heng Wang, Cordelia Schmid. Action Recognition with Improved Trajectories. ICCV - IEEE International Conference on Computer Vision, Dec 2013, Sydney, Australia. pp. 3551-3558, 10.1109/ICCV.2013.441. hal-00873267v2

Action Recognition with Improved Trajectories
Heng Wang and Cordelia Schmid
LEAR, INRIA, France
firstname.lastname@inria.fr
Abstract
Recently dense trajectories were shown to be an efficient video representation for action recognition and achieved state-of-the-art results on a variety of datasets. This paper improves their performance by taking into account camera motion to correct them. To estimate camera motion, we match feature points between frames using SURF descriptors and dense optical flow, which are shown to be complementary. These matches are, then, used to robustly estimate a homography with RANSAC. Human motion is in general different from camera motion and generates inconsistent matches. To improve the estimation, a human detector is employed to remove these matches. Given the estimated camera motion, we remove trajectories consistent with it. We also use this estimation to cancel out camera motion from the optical flow. This significantly improves motion-based descriptors, such as HOF and MBH. Experimental results on four challenging action datasets (i.e., Hollywood2, HMDB51, Olympic Sports and UCF50) significantly outperform the current state of the art.
1. Introduction
Action recognition has been an active research area for
over three decades. Recent research focuses on realistic
datasets collected from movies [20, 22], web videos [21,
31], TV shows [28], etc. These datasets impose significant
challenges on action recognition, e.g., background clutter,
fast irregular motion, occlusion, viewpoint changes. Local
space-time features [7, 19] were shown to be successful on
these datasets, since they avoid non-trivial pre-processing
steps, such as tracking or segmentation. A bag-of-features
representation of these local features can be directly used
for action classification and achieves state-of-the-art performance (see [1] for a recent survey).
Many classical image features have been generalized to videos, e.g., 3D-SIFT [33], extended SURF [41], HOG3D [16], and local trinary patterns [43]. Among the local space-time features, dense trajectories [40] have been shown to perform best on a variety of datasets. The main idea is to densely sample feature points in each frame, and track them in the video based on optical flow. Multiple descriptors are computed along the trajectories of feature points to capture shape, appearance and motion information. Interestingly, motion boundary histograms (MBH) [6] give the best results due to their robustness to camera motion.

Figure 1. First row: images of two consecutive frames overlaid; second row: optical flow [8] between the two frames; third row: optical flow after removing camera motion; last row: trajectories removed due to camera motion in white.

MBH is based on derivatives of optical flow, which is a simple and efficient way to suppress camera motion. However, we argue that we can still benefit from explicit camera motion estimation. Camera motion generates many irrelevant trajectories in the background in realistic videos. We can prune them and only keep trajectories from humans or objects of interest, if we know the camera motion (see Figure 1). Furthermore, given the camera motion, we can correct the optical flow, so that the motion vectors of human actors are independent of camera motion. This improves the performance of motion descriptors based on optical flow, i.e., HOF (histograms of optical flow) and MBH. We illustrate the difference between the original and corrected optical flow in the middle two rows of Figure 1.

Figure 2. Visualization of inlier matches of the robustly estimated homography. Green arrows correspond to SURF descriptor matches, and red ones to dense optical flow.
Very few approaches consider camera motion when extracting feature trajectories for action recognition. Uemura et al. [38] combine feature matching with image segmentation to estimate the dominant camera motion, and then separate feature tracks from the background. Wu et al. [42] apply a low-rank assumption to decompose feature trajectories into camera-induced and object-induced components. Recently, Park et al. [27] perform weak stabilization to remove both camera and object-centric motion using coarse-scale optical flow for pedestrian detection and pose estimation in video. Jain et al. [14] decompose visual motion into dominant and residual motions both for extracting trajectories and computing descriptors.

Among the approaches improving dense trajectories, Vig et al. [39] propose to use saliency-mapping algorithms to prune background features. This results in a more compact video representation, and improves action recognition accuracy. Jiang et al. [15] cluster dense trajectories, and use the cluster centers as reference points so that the relationship between them can be modeled.
The rest of the paper is organized as follows. In section 2, we detail our approach for camera motion estimation and discuss how to remove inconsistent matches due to humans. Experimental setup and evaluation protocols are explained in section 3 and experimental results in section 4. The code to compute improved trajectories and descriptors is available online.¹

¹ http://lear.inrialpes.fr/~wang/improved_trajectories
Figure 3. Examples of removed trajectories under various camera
motions, e.g., pan, zoom, tilt. White trajectories are considered to be due to camera motion. The red dots are the trajectory positions in
the current frame. The last row shows two failure cases. The left
one is due to severe motion blur. The right one fits the homography
to the moving humans as they dominate the frame.
2. Improving dense trajectories
In this section, we first describe the major steps of our camera motion estimation method, and how to use it to improve dense trajectories. We, then, discuss how to remove potentially inconsistent matches due to humans to obtain a robust homography estimation.
2.1. Camera motion estimation
To estimate the global background motion, we assume that two consecutive frames are related by a homography [37]. This assumption holds in most cases as the global motion between two frames is usually small. It excludes independently moving objects, such as humans and vehicles.

To estimate the homography, the first step is to find the correspondences between two frames. We combine two approaches in order to generate sufficient and complementary candidate matches. We extract SURF [3] features and match them based on the nearest neighbor rule. The reason for choosing SURF features is their robustness to motion blur, as shown in a recent evaluation [13].

We also sample motion vectors from the optical flow, which provides us with dense matches between frames. Here, we use an efficient optical flow algorithm based on polynomial expansion [8]. We select motion vectors for salient feature points using the good-features-to-track criterion [35], i.e., thresholding the smallest eigenvalue of the autocorrelation matrix.

Figure 4. Homography estimation without human detector (left) and with human detector (right). We show inlier matches in the first and
third columns. The optical flow (second and fourth columns) is warped with the corresponding homography. The first and second rows
show a clear improvement of the estimated homography, when using a human detector. The last row presents a failure case. See the text
for details.
The two approaches are complementary. SURF focuses on blob-type structures, whereas [35] fires on corners and edges. Figure 2 visualizes the two types of matches in different colors. Combining them results in a more balanced distribution of matched points, which is critical for a good homography estimation.
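As a rough illustration of the two complementary match sources, the following Python sketch uses OpenCV (the contrib build for SURF, Farneback flow for the polynomial-expansion algorithm); the function name and parameter values are illustrative assumptions, not the released implementation.

import cv2
import numpy as np

def candidate_matches(prev_gray, curr_gray):
    # Returns (pts_prev, pts_curr), two N x 2 arrays of candidate correspondences.
    # 1) SURF keypoints matched with the nearest-neighbor rule.
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    kp1, des1 = surf.detectAndCompute(prev_gray, None)
    kp2, des2 = surf.detectAndCompute(curr_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_L2).match(des1, des2)
    surf_prev = np.float32([kp1[m.queryIdx].pt for m in matches])
    surf_curr = np.float32([kp2[m.trainIdx].pt for m in matches])

    # 2) Dense optical flow sampled at good-features-to-track corners, i.e. points
    #    with a large smallest eigenvalue of the autocorrelation matrix.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    corners = cv2.goodFeaturesToTrack(prev_gray, maxCorners=1000,
                                      qualityLevel=0.01, minDistance=5).reshape(-1, 2)
    flow_prev = corners
    flow_curr = corners + flow[corners[:, 1].astype(int), corners[:, 0].astype(int)]

    # Combining both sets gives a more balanced spatial distribution of matches.
    return (np.vstack([surf_prev, flow_prev]),
            np.vstack([surf_curr, flow_curr]))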
We, then, robustly estimate the homography using RANSAC [11]. This allows us to rectify the image to remove the camera motion. Figure 1 (two rows in the middle) demonstrates the difference of optical flow before and after rectification. Compared to the original flow (the second row of Figure 1), the rectified version (the third row) suppresses the background camera motion and enhances the foreground moving objects.
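Under the same assumptions, the RANSAC estimation and the flow correction can be sketched as follows; the key detail is the warp direction, mapping the second frame back into the first frame's coordinates.

import cv2
import numpy as np

def cancel_camera_motion(prev_gray, curr_gray, pts_prev, pts_curr):
    # Robust homography from the combined candidate matches.
    H, inlier_mask = cv2.findHomography(pts_prev, pts_curr, cv2.RANSAC,
                                        ransacReprojThreshold=1.0)
    h, w = curr_gray.shape
    # Rectify the second frame into the first frame's coordinates.
    curr_warped = cv2.warpPerspective(curr_gray, np.linalg.inv(H), (w, h))
    # Flow recomputed against the warped frame is (approximately) free of camera motion.
    warped_flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_warped, None,
                                               0.5, 3, 15, 3, 5, 1.2, 0)
    return H, warped_flow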
For dense trajectories, there are two major advantages of canceling out camera motion from optical flow. First, the motion descriptors can directly benefit from this. As shown in [40], the performance of the HOF descriptor degrades significantly in the presence of camera motion. Our experimental results (in section 4.1) show that HOF can achieve similar performance as MBH when we have correct foreground optical flow. The combination of HOF and MBH can further improve the results as they represent zero-order (HOF) and first-order (MBH) motion information.

Second, we can remove trajectories generated by camera motion. This can be achieved by thresholding the displacement vectors of the trajectories in the warped flow field. If the displacement is too small, the trajectory is considered to be too similar to camera motion, and thus removed. Figure 3 shows examples of removed background trajectories. Our method works well under various camera motions (e.g., pan, tilt and zoom) and only trajectories related to human actions are kept (shown in green in Figure 3). This gives us similar effects as sampling features based on visual saliency maps [23, 39].
The last row of Figure 3 shows two failure cases. The left one is due to severe motion blur, which makes both SURF descriptor matching and optical flow estimation unreliable. Improving motion estimation in the presence of motion blur is worth further attention, since blur often occurs in realistic datasets. In the example shown on the right, humans dominate the frame, which causes homography estimation to fail. We discuss a solution for such cases in the following section.
2.2. Removing inconsistent matches due to humans
In action datasets, videos often focus on the humans performing the action. As a result, it is very common that humans dominate the frame, which can be a problem for camera motion estimation as human motion is in general not consistent with it. We propose to use a human detector to remove matches from human regions. In general, human detection in action datasets is rather difficult, as there are dramatic pose changes when the person is performing the action. Furthermore, the person may be only partially visible due to occlusion or being partially out of view.

Here, we apply a state-of-the-art human detector [30], which adapts the general part-based human detector [9] to action datasets. The detector combines several part detectors dedicated to different regions of the human body (including full person, upper-body and face). It is trained using the PASCAL VOC07 training data for humans as well as near-frontal upper-bodies from [10]. Figure 4, third column, shows some examples of human detection results.
We use the human detector as a mask to remove feature matches inside the bounding boxes when estimating the homography. Without human detection (the left two columns of Figure 4), many features from the moving humans become inlier matches and the homography is, thus, incorrect. As a result, the corresponding optical flow is not correctly warped. In contrast, camera motion is successfully compensated (the right two columns of Figure 4), when the human bounding boxes are used to remove matches not corresponding to camera motion. The last row of Figure 4 shows a failure case. The homography does not fit the background very well despite detecting the humans correctly, as the background is represented by two planes, one of which is very close to the camera. In section 4.3, we compare the performance of action recognition with or without human detection.
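A minimal sketch of this masking step, assuming boxes are given as (x1, y1, x2, y2) tuples in the first frame (the helper name is hypothetical):

import numpy as np

def remove_human_matches(pts_prev, pts_curr, human_boxes):
    keep = np.ones(len(pts_prev), dtype=bool)
    for (x1, y1, x2, y2) in human_boxes:
        inside = ((pts_prev[:, 0] >= x1) & (pts_prev[:, 0] <= x2) &
                  (pts_prev[:, 1] >= y1) & (pts_prev[:, 1] <= y2))
        keep &= ~inside          # drop matches that originate on a detected human
    return pts_prev[keep], pts_curr[keep]

The remaining matches are then passed to the RANSAC homography estimation described in section 2.1.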
The human detector does not always work perfectly. It can miss humans due to pose or viewpoint changes. In order to compensate for missing detections, we track all the bounding boxes obtained by the human detector. Tracking is performed forward and backward for each frame of the video. Our approach is simple, i.e., we take the average flow vector [8] and propagate the detections to the next frame. We track each bounding box for at most 15 frames and stop if there is a 50% overlap with another bounding box. All the human bounding boxes are available online.¹ In the following, we always use the human detector to remove potentially inconsistent matches before computing the homography, unless stated otherwise.
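The forward propagation of a detection can be sketched as below, assuming flows[t] is the dense flow from frame t to t+1 and boxes are (x1, y1, x2, y2) tuples; the overlap measure and helper names are assumptions.

import numpy as np

def overlap_ratio(a, b):
    # Intersection area divided by the area of box a.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    return ix * iy / area_a if area_a > 0 else 0.0

def propagate_box(box, flows, other_boxes_per_frame, max_frames=15):
    boxes = [box]
    for t in range(min(max_frames, len(flows))):
        x1, y1, x2, y2 = map(int, boxes[-1])
        region = flows[t][y1:y2, x1:x2]
        if region.size == 0:
            break
        dx, dy = region.reshape(-1, 2).mean(axis=0)    # average flow inside the box
        new_box = (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
        if any(overlap_ratio(new_box, b) > 0.5 for b in other_boxes_per_frame[t + 1]):
            break                                      # merged with another detection
        boxes.append(new_box)
    return boxes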
3. Experimental setup
In this section, we first present implementation details
for our trajectory features. We, then, introduce the feature
encoding used in our evaluation. Finally, the datasets and
experimental setup are presented.
3.1. Trajectory features
We, first, briefly describe the dense trajectory features [40], which are used as the baseline in our experiments. The approach densely samples points for several spatial scales. Points in homogeneous areas are suppressed, as it is impossible to track them reliably. Tracking points is achieved by median filtering in a dense optical flow field [8]. In order to avoid drifting, we only track the feature points for 15 frames and sample new points to replace them. We remove static feature trajectories as they do not contain motion information, and also prune trajectories with sudden large displacements.
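The per-frame tracking step of this baseline can be sketched as follows; the 3 x 3 median kernel size is an assumption, and the point is simply displaced by the median-filtered flow at its (rounded) position.

import cv2
import numpy as np

def track_point(point, flow):
    # flow: H x W x 2 dense optical flow towards the next frame.
    fx = cv2.medianBlur(np.ascontiguousarray(flow[..., 0]), 3)
    fy = cv2.medianBlur(np.ascontiguousarray(flow[..., 1]), 3)
    h, w = fx.shape
    x = int(np.clip(round(point[0]), 0, w - 1))
    y = int(np.clip(round(point[1]), 0, h - 1))
    return (point[0] + fx[y, x], point[1] + fy[y, x])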
For each trajectory, we compute several descriptors (i.e., Trajectory, HOG, HOF and MBH) with exactly the same parameters as [40]. The Trajectory descriptor is a concatenation of normalized displacement vectors. The other descriptors are computed in the space-time volume aligned with the trajectory. HOG is based on the orientation of image gradients and captures the static appearance information. Both HOF and MBH measure motion information, and are based on optical flow. HOF directly quantizes the orientation of flow vectors. MBH splits the optical flow into horizontal and vertical components, and quantizes the derivatives of each component. The final dimensions of the descriptors are 30 for Trajectory, 96 for HOG, 108 for HOF and 192 for MBH.
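To make the zero-order/first-order distinction concrete, the sketch below computes, for a single cell of (warped) flow, an orientation histogram of the flow itself (HOF) and of the spatial derivatives of each flow component (MBH). It deliberately omits the zero bin of HOF, the space-time cell grid and the interpolation used by the full descriptors.

import cv2
import numpy as np

def orientation_histogram(gx, gy, bins=8):
    mag = np.sqrt(gx ** 2 + gy ** 2)
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    idx = np.minimum((ang / (2 * np.pi) * bins).astype(int), bins - 1)
    hist = np.zeros(bins, dtype=np.float32)
    np.add.at(hist, idx.ravel(), mag.ravel())    # magnitude-weighted voting
    return hist

def hof_mbh_cell(flow_cell):
    fx = np.ascontiguousarray(flow_cell[..., 0])
    fy = np.ascontiguousarray(flow_cell[..., 1])
    hof = orientation_histogram(fx, fy)          # zero-order: flow orientation
    mbh_x = orientation_histogram(cv2.Sobel(fx, cv2.CV_32F, 1, 0),
                                  cv2.Sobel(fx, cv2.CV_32F, 0, 1))
    mbh_y = orientation_histogram(cv2.Sobel(fy, cv2.CV_32F, 1, 0),
                                  cv2.Sobel(fy, cv2.CV_32F, 0, 1))
    return hof, mbh_x, mbh_y                     # first-order: flow derivatives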
To normalize the histogram-based descriptors, i.e., HOG, HOF and MBH, we apply the recent RootSIFT [2] approach, i.e., square root each dimension after L1 normalization. We do not perform L2 normalization as in [40]. This brings about 0.5% improvement for the histogram-based descriptors. We use this normalization in all the experiments.
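This normalization amounts to the following (a sketch; descriptor entries are non-negative histogram counts):

import numpy as np

def rootsift_normalize(desc, eps=1e-7):
    desc = np.asarray(desc, dtype=np.float32)
    desc = desc / (desc.sum() + eps)   # L1 normalization
    return np.sqrt(desc)               # square root each dimension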
To extract our improved trajectories, we sample and track feature points exactly the same way as [40], see above. To compute the descriptors, we first estimate the homography with RANSAC using the feature matches extracted between two consecutive frames; matches on detected humans are removed. We, then, warp the second frame with the estimated homography. The optical flow [8] is re-computed between the first and the warped second frame. Motion descriptors (HOF and MBH) are computed on the warped optical flow. The HOG descriptor remains unchanged. We estimate the homography and warped optical flow for every two frames independently to avoid error propagation. We use the same parameters and the RootSIFT normalization as in the baseline.
The Trajectory descriptor is also computed based on the motion vectors of the warped flow. We further utilize these stabilized motion vectors to remove background trajectories. For each trajectory, we compute the maximal magnitude of its motion vectors. If the maximal magnitude is lower than a threshold (i.e., 1 pixel), the trajectory is considered to be consistent with camera motion, and thus removed.
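The pruning criterion reduces to a one-line test per trajectory (a sketch, using the 1-pixel threshold from above):

import numpy as np

def keep_trajectory(warped_displacements, threshold=1.0):
    # warped_displacements: (L, 2) per-frame displacement vectors in the warped flow.
    magnitudes = np.linalg.norm(np.asarray(warped_displacements), axis=1)
    return magnitudes.max() > threshold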
3.2. Feature encoding
To encode features, we use bag of features and Fisher vector. For bag of features, we use identical settings to [40]. We train a codebook for each descriptor type using 100,000 randomly sampled features with k-means. The size of the codebook is set to 4000. An SVM with RBF-χ² kernel is used for classification, and different descriptor types are
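A compact sketch of this bag-of-features pipeline, assuming scikit-learn (the codebook here is fit on all training descriptors rather than the 100,000-feature subset, and the Fisher vector variant is omitted):

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_bof_classifier(train_descs, train_labels, vocab_size=4000):
    # train_descs: list of per-video descriptor arrays (one descriptor type).
    codebook = MiniBatchKMeans(n_clusters=vocab_size, n_init=3).fit(np.vstack(train_descs))

    def encode(video_descs):
        words = codebook.predict(video_descs)
        hist = np.bincount(words, minlength=vocab_size).astype(np.float32)
        return hist / max(hist.sum(), 1.0)

    X = np.array([encode(d) for d in train_descs])
    K = chi2_kernel(X, X)                          # RBF-chi2 kernel matrix
    clf = SVC(kernel="precomputed").fit(K, train_labels)
    return codebook, X, clf

At test time a video is encoded the same way and classified by passing chi2_kernel(x_test, X) to the trained SVM.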

Citations
Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: In this article, the non-local operation computes the response at a position as a weighted sum of the features at all positions, which can be used to capture long-range dependencies.
Abstract: Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method [4] in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our nonlocal models can compete or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code will be made available.

8,059 citations

Proceedings ArticleDOI
07 Dec 2015
TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.
Abstract: We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets, 2) A homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets, and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset with only 10 dimensions and also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.

7,091 citations


Cites background or methods from "Action Recognition with Improved Trajectories"

  • ...We apply the same process with iDT [44] as well as Imagenet features [7] and compare the results in Figure 5....

  • ...C3D is 91x faster than improved dense trajectories [44] and 274x faster than Brox’s GPU implementation in OpenCV....

  • ...proposed improved Dense Trajectories (iDT) [44] which is currently the state-of-the-art hand-crafted feature....

  • ...For iDT, we use the code kindly provided by the authors [44]....

  • ...Baselines: We compare C3D feature with a few baselines: the current best hand-crafted features, namely improved dense trajectories (iDT) [44] and the popular-used deep image features, namely Imagenet [16], using Caffe’s Imagenet pre-train model....

Proceedings Article
08 Dec 2014
TL;DR: This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
Abstract: We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multitask learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.

6,397 citations


Cites background or methods or results from "Action Recognition with Improved Trajectories"

  • ...There still remain some essential ingredients of the state-of-the-art shallow representation [26], which are missed in our current architecture....

  • ...Recent improvements of trajectory-based hand-crafted representations include compensation of global (camera) motion [10, 16, 26], and the use of the Fisher vector encoding [22] (in [26]) or its deeper variant [23] (in [21])....

  • ...deep architecture significantly outperforms that of [14] and is competitive with the state of the art shallow representations [20, 21, 26] in spite of being trained on relatively small datasets....

  • ...The combination of the two nets further improves the results (in line with the single-split experiments above), and is comparable to the very recent state-of-the-art hand-crafted models [20, 21, 26]....

  • ...The importance of camera motion compensation has been previously highlighted in [10, 26], where a global motion component was estimated and subtracted from the dense flow....

Proceedings ArticleDOI
21 Jul 2017
TL;DR: In this article, a Two-Stream Inflated 3D ConvNet (I3D) is proposed to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and their parameters.
Abstract: The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101.

5,073 citations

References
Journal ArticleDOI
TL;DR: New results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form that provide the basis for an automatic system that can solve the Location Determination Problem under difficult viewing.
Abstract: A new paradigm, Random Sample Consensus (RANSAC), for fitting a model to experimental data is introduced. RANSAC is capable of interpreting/smoothing data containing a significant percentage of gross errors, and is thus ideally suited for applications in automated image analysis where interpretation is based on the data provided by error-prone feature detectors. A major portion of this paper describes the application of RANSAC to the Location Determination Problem (LDP): Given an image depicting a set of landmarks with known locations, determine that point in space from which the image was obtained. In response to a RANSAC requirement, new results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form. These results provide the basis for an automatic system that can solve the LDP under difficult viewing

23,396 citations



Book ChapterDOI
07 May 2006
TL;DR: A novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.
Abstract: In this paper, we present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). It approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (in casu, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. Both show SURF's strong performance.

13,011 citations

Journal ArticleDOI
TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.
Abstract: We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI--SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.

10,501 citations


"Action Recognition with Improved Tr..." refers methods in this paper

  • ...Finally, the datasets and experimental setup are presented....

    [...]

Proceedings ArticleDOI
21 Jun 1994
TL;DR: A feature selection criterion that is optimal by construction because it is based on how the tracker works, and a feature monitoring method that can detect occlusions, disocclusions, and features that do not correspond to points in the world are proposed.
Abstract: No feature-based vision system can work unless good features can be identified and tracked from frame to frame. Although tracking itself is by and large a solved problem, selecting features that can be tracked well and correspond to physical points in the world is still hard. We propose a feature selection criterion that is optimal by construction because it is based on how the tracker works, and a feature monitoring method that can detect occlusions, disocclusions, and features that do not correspond to points in the world. These methods are based on a new tracking algorithm that extends previous Newton-Raphson style search methods to work under affine image transformations. We test performance with several simulations and experiments. >

8,432 citations


"Action Recognition with Improved Tr..." refers methods in this paper

  • ...It is trained using the PASCAL VOC07 training data for humans as well as near-frontal upper-bodies from [10]....

    [...]

  • ...In contrast, camera motion is successfully compensated (the right two columns of Figure 4), when the human bounding boxes are used to remove matches not corresponding to camera motion....

    [...]

Journal ArticleDOI
TL;DR: The performance of the spatial envelope model shows that specific information about object shape or identity is not a requirement for scene categorization and that modeling a holistic representation of the scene informs about its probable semantic category.
Abstract: In this paper, we propose a computational model of the recognition of real world scenes that bypasses the segmentation and the processing of individual objects or regions. The procedure is based on a very low dimensional representation of the scene, that we term the Spatial Envelope. We propose a set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness) that represent the dominant spatial structure of a scene. Then, we show that these dimensions may be reliably estimated using spectral and coarsely localized information. The model generates a multidimensional space in which scenes sharing membership in semantic categories (e.g., streets, highways, coasts) are projected closed together. The performance of the spatial envelope model shows that specific information about object shape or identity is not a requirement for scene categorization and that modeling a holistic representation of the scene informs about its probable semantic category.

6,882 citations


"Action Recognition with Improved Tr..." refers methods in this paper

  • ...This work was supported by Quaero (funded by OSEO, French State agency for innovation), the European integrated project AXES, the MSR/INRIA joint project and the ERC advanced grant ALLEGRO....

    [...]
