A Robust and Efficient Video Representation for Action Recognition

doi:10.1007/S11263-015-0846-5

Open Access, Journal Article

Heng Wang, +3 more

International Journal of Computer Vision, Vol. 119, Iss. 3, pp. 219-238, 01 Sep 2016

TLDR

In this paper, the authors extract feature point matches between frames using SURF descriptors and dense optical flow, and use the matches to estimate a homography with RANSAC.
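The homography step can be illustrated with a minimal pure-Python sketch. It assumes the point matches are already available as `((x, y), (u, v))` pairs; extracting them with SURF and optical flow is not shown. The function names, the 4-point solver with `h33` fixed to 1, the iteration count, and the 1-pixel threshold are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def solve_linear(A, b):
    """Gauss-Jordan elimination with partial pivoting; returns None if singular."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-9:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]

def homography_from_4(pairs):
    """Fit H (with h33 = 1) from four ((x, y), (u, v)) matches."""
    A, b = [], []
    for (x, y), (u, v) in pairs:
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = solve_linear(A, b)
    if h is None:
        return None
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]

def transfer_error(H, p, q):
    """Distance between H applied to p and the observed match q."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        return float("inf")
    u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return math.hypot(u - q[0], v - q[1])

def ransac_homography(matches, iters=500, thresh=1.0, seed=0):
    """Keep the 4-point hypothesis that explains the most matches."""
    rng = random.Random(seed)
    best_H, best_inliers = None, []
    for _ in range(iters):
        H = homography_from_4(rng.sample(matches, 4))
        if H is None:  # degenerate sample, e.g. collinear points
            continue
        inliers = [i for i, (p, q) in enumerate(matches)
                   if transfer_error(H, p, q) < thresh]
        if len(inliers) > len(best_inliers):
            best_H, best_inliers = H, inliers
    return best_H, best_inliers

# Synthetic data: 20 background matches shifted by (5, -3), then 5 outliers.
grid = [(float(x), float(y)) for x in range(0, 100, 20) for y in range(0, 100, 25)]
matches = [(p, (p[0] + 5.0, p[1] - 3.0)) for p in grid]
matches += [((10.0, 10.0), (80.0, 90.0)), ((30.0, 70.0), (5.0, 5.0)),
            ((50.0, 20.0), (50.0, 95.0)), ((90.0, 40.0), (10.0, 60.0)),
            ((60.0, 60.0), (0.0, 0.0))]
H, inliers = ransac_homography(matches)
```

In practice a library routine such as OpenCV's `cv2.findHomography` with the `cv2.RANSAC` method would normally replace the hand-written solver; the sketch only makes the sample, fit, and count-inliers loop explicit.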

Abstract:

This paper introduces a state-of-the-art video representation and applies it to efficient action recognition and detection. We first propose to improve the popular dense trajectory features by explicit camera motion estimation. More specifically, we extract feature point matches between frames using SURF descriptors and dense optical flow. The matches are used to estimate a homography with RANSAC. To improve the robustness of homography estimation, a human detector is employed to remove outlier matches from the human body as human motion is not constrained by the camera. Trajectories consistent with the homography are considered as due to camera motion, and thus removed. We also use the homography to cancel out camera motion from the optical flow. This results in significant improvement on motion-based HOF and MBH descriptors. We further explore the recent Fisher vector as an alternative feature encoding approach to the standard bag-of-words (BOW) histogram, and consider different ways to include spatial layout information in these encodings. We present a large and varied set of evaluations, considering (i) classification of short basic actions on six datasets, (ii) localization of such actions in feature-length movies, and (iii) large-scale recognition of complex events. We find that our improved trajectory features significantly outperform previous dense trajectories, and that Fisher vectors are superior to BOW encodings for video recognition tasks. In all three tasks, we show substantial improvements over the state-of-the-art results.
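The two uses of the homography described in the abstract, removing trajectories explained by camera motion and cancelling camera motion from the optical flow, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: each homography is assumed to map background points from frame t to frame t+1, flow is given as sparse `(x, y, dx, dy)` samples, and the function names and the 1-pixel threshold are hypothetical.

```python
import math

def warp(H, x, y):
    """Apply a 3x3 homography (nested lists) to the point (x, y)."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def compensate_flow(H, samples):
    """Subtract the camera-induced displacement from flow samples (x, y, dx, dy)."""
    out = []
    for x, y, dx, dy in samples:
        cx, cy = warp(H, x, y)
        out.append((x, y, dx - (cx - x), dy - (cy - y)))
    return out

def is_camera_motion(trajectory, homographies, thresh=1.0):
    """True if every step of the trajectory is explained by the homographies.

    trajectory: list of (x, y) points, one per frame.
    homographies: one matrix per consecutive frame pair (assumed convention).
    thresh: residual in pixels below which a step counts as camera motion
            (an assumed value, not taken from the abstract).
    """
    residuals = []
    for (p, q), H in zip(zip(trajectory, trajectory[1:]), homographies):
        cx, cy = warp(H, p[0], p[1])
        residuals.append(math.hypot(q[0] - cx, q[1] - cy))
    return bool(residuals) and max(residuals) < thresh

# Toy example: the camera pans so the background shifts by 2 pixels per frame.
pan = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
background = [(10.0, 10.0), (12.0, 10.0), (14.0, 10.0)]
actor = [(10.0, 10.0), (15.0, 12.0), (20.0, 14.0)]

kept = [t for t in (background, actor) if not is_camera_motion(t, [pan, pan])]
stabilized = compensate_flow(pan, [(10.0, 10.0, 2.0, 0.0), (40.0, 30.0, 5.0, 2.0)])
```

In the toy example the background trajectory is discarded, the actor trajectory is kept, and the stabilized flow retains only the motion that the pan does not explain, which is the signal that motion descriptors such as HOF and MBH are computed from.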

Citations

Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks

Human Action Recognition and Prediction: A Survey.

Learning to track for spatio-temporal action localization

A Comprehensive Survey of Vision-Based Human Action Recognition Methods.
