Book Chapter
Spatio-Temporal VLAD Encoding for Human Action Recognition in Videos
Ionut Cosmin Duta, Bogdan Ionescu, Kiyoharu Aizawa, Nicu Sebe
pp. 365–378
TL;DR: This work proposes Spatio-Temporal VLAD (ST-VLAD), an extended encoding method which incorporates spatio-temporal information within the encoding process by proposing a video division and extracting specific information over the feature group of each video split.

Abstract:
Encoding is one of the key factors for building an effective video representation. In recent work, super vector-based encoding approaches have been highlighted as among the most powerful representation generators. Vector of Locally Aggregated Descriptors (VLAD) is one of the most widely used super vector methods. However, one of the limitations of VLAD encoding is the lack of spatial information captured from the data. This is critical, especially when dealing with video information. In this work, we propose Spatio-Temporal VLAD (ST-VLAD), an extended encoding method which incorporates spatio-temporal information within the encoding process. This is carried out by proposing a video division and extracting specific information over the feature group of each video split. Experimental validation is performed using both hand-crafted and deep features. Our pipeline for action recognition with the proposed encoding method obtains state-of-the-art performance over three challenging datasets: HMDB51 (67.6%), UCF50 (97.8%) and UCF101 (91.5%).
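To make the abstract's encoding idea concrete, here is a minimal, hypothetical sketch in NumPy: standard VLAD aggregates the residuals of local descriptors to their nearest codewords, and the spatio-temporal variant below simply splits the video's descriptors into temporal segments and concatenates one VLAD vector per segment. The function names (`vlad_encode`, `st_vlad_encode`), the equal-sized temporal split, and the power/L2 normalization choices are assumptions for illustration, not the authors' actual implementation.

```python
import numpy as np

def vlad_encode(descriptors, codebook):
    """Standard VLAD: sum the residuals of each descriptor to its
    nearest codeword, then apply power and L2 normalization.
    descriptors: (n, d) array; codebook: (k, d) array of centroids."""
    k, d = codebook.shape
    # hard-assign each descriptor to its nearest codeword
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    vlad = np.zeros((k, d))
    for i in range(k):
        members = descriptors[assignments == i]
        if len(members):
            # accumulate residuals (descriptor minus its centroid)
            vlad[i] = (members - codebook[i]).sum(axis=0)
    vlad = vlad.ravel()
    # power normalization, then L2 normalization
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

def st_vlad_encode(video_descriptors, codebook, n_splits=3):
    """Hypothetical spatio-temporal variant: divide the video's
    descriptors into temporal splits, VLAD-encode each split
    separately, and concatenate the results."""
    splits = np.array_split(video_descriptors, n_splits)
    return np.concatenate([vlad_encode(s, codebook) for s in splits])
```

The resulting representation is `n_splits` times longer than a plain VLAD vector; the intent is that each segment of the concatenation captures which part of the video its descriptors came from, which plain VLAD discards.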
Citations
Proceedings Article
A Survey on Deep Learning Based Approaches for Action and Gesture Recognition in Image Sequences
Maryam Asadi-Aghbolaghi, Albert Clapés, Marco Bellantonio, Hugo Jair Escalante, Víctor Ponce-López, Xavier Baró, Isabelle Guyon, Shohreh Kasaei, Sergio Escalera
TL;DR: A taxonomy that summarizes important aspects of deep learning for approaching both action and gesture recognition in image sequences is introduced, and the main works proposed so far are summarized.
Proceedings Article
Timeception for Complex Action Recognition
TL;DR: Timeception as discussed by the authors uses multi-scale temporal convolutions and reduces the complexity of 3D convolutions, which achieves impressive accuracy in recognizing the human activities of Charades, Breakfast Actions and MultiTHUMOS.
Journal Article
Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition
TL;DR: Results show that the proposed ActionS-ST-VLAD method is able to effectively pool useful deep features spatiotemporally, leading to the state-of-the-art performance for video-based action recognition.
Book Chapter
Deep Learning for Action and Gesture Recognition in Image Sequences: A Survey
Maryam Asadi-Aghbolaghi, Albert Clapés, Marco Bellantonio, Hugo Jair Escalante, Víctor Ponce-López, Xavier Baró, Isabelle Guyon, Shohreh Kasaei, Sergio Escalera
TL;DR: This chapter is a survey of current deep learning based methodologies for action and gesture recognition in sequences of images, and introduces a taxonomy that summarizes important aspects of deep learning for approaching both tasks.
References
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Proceedings Article
ImageNet: A large-scale hierarchical image database
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Proceedings Article
Histograms of oriented gradients for human detection
Navneet Dalal, Bill Triggs
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Proceedings Article
Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories
TL;DR: This paper presents a method for recognizing scene categories based on approximate global geometric correspondence that exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories.