Book Chapter
Spatio-Temporal VLAD Encoding for Human Action Recognition in Videos
Ionut Cosmin Duta, Bogdan Ionescu, Kiyoharu Aizawa, Nicu Sebe
pp. 365–378
TL;DR: This work proposes Spatio-Temporal VLAD (ST-VLAD), an extended encoding method which incorporates spatio-temporal information within the encoding process by proposing a video division and extracting specific information over the feature group of each video split.

Abstract:
Encoding is one of the key factors for building an effective video representation. In recent work, super vector-based encoding approaches have been highlighted as among the most powerful representation generators. Vector of Locally Aggregated Descriptors (VLAD) is one of the most widely used super vector methods. However, one of the limitations of VLAD encoding is the lack of spatial information captured from the data. This is critical, especially when dealing with video information. In this work, we propose Spatio-Temporal VLAD (ST-VLAD), an extended encoding method which incorporates spatio-temporal information within the encoding process. This is carried out by proposing a video division and extracting specific information over the feature group of each video split. Experimental validation is performed using both hand-crafted and deep features. Our pipeline for action recognition with the proposed encoding method obtains state-of-the-art performance over three challenging datasets: HMDB51 (67.6%), UCF50 (97.8%) and UCF101 (91.5%).
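To make the abstract's encoding idea concrete, here is a minimal, hypothetical sketch in NumPy: standard VLAD aggregates the residuals of local descriptors to their nearest codewords, and the spatio-temporal variant below simply splits the video's descriptors into temporal segments and concatenates one VLAD vector per segment. The function names (`vlad_encode`, `st_vlad_encode`), the equal-sized temporal split, and the power/L2 normalization choices are assumptions for illustration, not the authors' actual implementation.

```python
import numpy as np

def vlad_encode(descriptors, codebook):
    """Standard VLAD: sum the residuals of each descriptor to its
    nearest codeword, then apply power and L2 normalization.
    descriptors: (n, d) array; codebook: (k, d) array of centroids."""
    k, d = codebook.shape
    # hard-assign each descriptor to its nearest codeword
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    vlad = np.zeros((k, d))
    for i in range(k):
        members = descriptors[assignments == i]
        if len(members):
            # accumulate residuals (descriptor minus its centroid)
            vlad[i] = (members - codebook[i]).sum(axis=0)
    vlad = vlad.ravel()
    # power normalization, then L2 normalization
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

def st_vlad_encode(video_descriptors, codebook, n_splits=3):
    """Hypothetical spatio-temporal variant: divide the video's
    descriptors into temporal splits, VLAD-encode each split
    separately, and concatenate the results."""
    splits = np.array_split(video_descriptors, n_splits)
    return np.concatenate([vlad_encode(s, codebook) for s in splits])
```

The resulting representation is `n_splits` times longer than a plain VLAD vector; the intent is that each segment of the concatenation captures which part of the video its descriptors came from, which plain VLAD discards.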
Citations
Proceedings Article
A Survey on Deep Learning Based Approaches for Action and Gesture Recognition in Image Sequences
Maryam Asadi-Aghbolaghi, Albert Clapés, Marco Bellantonio, Hugo Jair Escalante, Víctor Ponce-López, Xavier Baró, Isabelle Guyon, Shohreh Kasaei, Sergio Escalera
TL;DR: A taxonomy that summarizes important aspects of deep learning for approaching both action and gesture recognition in image sequences is introduced, and the main works proposed so far are summarized.
Proceedings Article
Timeception for Complex Action Recognition
TL;DR: Timeception as discussed by the authors uses multi-scale temporal convolutions and reduces the complexity of 3D convolutions, which achieves impressive accuracy in recognizing the human activities of Charades, Breakfast Actions and MultiTHUMOS.
Journal Article
Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition
TL;DR: Results show that the proposed ActionS-ST-VLAD method is able to effectively pool useful deep features spatiotemporally, leading to the state-of-the-art performance for video-based action recognition.
Book Chapter
Deep Learning for Action and Gesture Recognition in Image Sequences: A Survey
Maryam Asadi-Aghbolaghi, Albert Clapés, Marco Bellantonio, Hugo Jair Escalante, Víctor Ponce-López, Xavier Baró, Isabelle Guyon, Shohreh Kasaei, Sergio Escalera
TL;DR: This chapter is a survey of current deep learning based methodologies for action and gesture recognition in sequences of images, and introduces a taxonomy that summarizes important aspects of deep learning for approaching both tasks.
References
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Proceedings Article
ImageNet: A large-scale hierarchical image database
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Proceedings Article
Histograms of oriented gradients for human detection
Navneet Dalal, Bill Triggs
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Proceedings Article
Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories
TL;DR: This paper presents a method for recognizing scene categories based on approximate global geometric correspondence that exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories.