Book Chapter

Spatio-Temporal VLAD Encoding for Human Action Recognition in Videos

TLDR
This work proposes Spatio-temporal VLAD (ST-VLAD), an extended encoding method that incorporates spatio-temporal information within the encoding process by dividing the video and extracting specific information over the feature group of each video split.
Abstract
Encoding is one of the key factors in building an effective video representation. In recent work, super vector-based encoding approaches have been highlighted as some of the most powerful representation generators. Vector of Locally Aggregated Descriptors (VLAD) is one of the most widely used super vector methods. However, one limitation of VLAD encoding is the lack of spatial information captured from the data, which is especially critical when dealing with video. In this work, we propose Spatio-temporal VLAD (ST-VLAD), an extended encoding method that incorporates spatio-temporal information within the encoding process. This is done by dividing the video and extracting specific information over the feature group of each video split. Experimental validation is performed using both hand-crafted and deep features. Our action recognition pipeline with the proposed encoding method obtains state-of-the-art performance on three challenging datasets: HMDB51 (67.6%), UCF50 (97.8%) and UCF101 (91.5%).
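To make the encoding step concrete, the following is a minimal sketch of standard VLAD as it is commonly defined (summing the residuals between each local feature and its nearest codebook centroid), followed by a simplified split-based variant. The split-based function is an illustrative assumption only: it divides features into equal temporal bins and concatenates one VLAD vector per bin, which is not necessarily the division or pooling scheme used by ST-VLAD. The names `vlad_encode`, `split_vlad_encode`, `times`, and `n_splits` are hypothetical, and the codebook is assumed to be precomputed (for example, with k-means).

```python
import numpy as np


def vlad_encode(features, centroids):
    """Standard VLAD: sum the residuals of features to their nearest centroid."""
    k, d = centroids.shape
    # Squared Euclidean distance from every feature to every centroid.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    vlad = np.zeros((k, d))
    for i in range(k):
        members = features[nearest == i]
        if len(members):
            vlad[i] = (members - centroids[i]).sum(axis=0)
    vlad = vlad.ravel()
    # Power normalization followed by L2 normalization (a common practice).
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad


def split_vlad_encode(features, times, centroids, n_splits=3):
    """Hypothetical variant (an assumption, not the paper's method):
    encode each temporal split of the video separately and concatenate."""
    # `times` holds each feature's temporal position, normalized to [0, 1].
    bins = np.minimum((times * n_splits).astype(int), n_splits - 1)
    parts = [vlad_encode(features[bins == b], centroids) for b in range(n_splits)]
    return np.concatenate(parts)
```

With a codebook of k centroids of dimension d, `vlad_encode` returns a vector of length k·d, and `split_vlad_encode` returns a vector of length `n_splits`·k·d, so position information is kept at the cost of a longer representation.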


Citations
Proceedings Article

A Survey on Deep Learning Based Approaches for Action and Gesture Recognition in Image Sequences

TL;DR: A taxonomy that summarizes important aspects of deep learning for approaching both action and gesture recognition in image sequences is introduced, and the main works proposed so far are summarized.
Proceedings Article

Timeception for Complex Action Recognition

TL;DR: Timeception uses multi-scale temporal convolutions and reduces the complexity of 3D convolutions, achieving strong accuracy in recognizing the human activities of Charades, Breakfast Actions and MultiTHUMOS.
Journal Article

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

TL;DR: Results show that the proposed ActionS-ST-VLAD method is able to effectively pool useful deep features spatiotemporally, leading to the state-of-the-art performance for video-based action recognition.
References
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Proceedings Article

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Proceedings Article

Histograms of oriented gradients for human detection

TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Proceedings Article

Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories

TL;DR: This paper presents a method for recognizing scene categories based on approximate global geometric correspondence that exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories.