Open Access · Proceedings Article · DOI

HMDB: A large video database for human motion recognition

TLDR
This paper introduces the largest action video database to date, with 51 action categories containing around 7,000 manually annotated clips extracted from a variety of sources ranging from digitized movies to YouTube, and uses it to evaluate two representative computer vision systems for action recognition and to explore their robustness under various conditions.
Abstract
With nearly one billion online videos viewed every day, an emerging new frontier in computer vision research is recognition and search in video. While much effort has been devoted to the collection and annotation of large scalable static image datasets containing thousands of image categories, human action datasets lag far behind. Current action recognition databases contain on the order of ten different action categories collected under fairly controlled conditions. State-of-the-art performance on these datasets is now near ceiling, and thus there is a need for the design and creation of new benchmarks. To address this issue we collected the largest action video database to date, with 51 action categories, which in total contain around 7,000 manually annotated clips extracted from a variety of sources ranging from digitized movies to YouTube. We use this database to evaluate the performance of two representative computer vision systems for action recognition and explore the robustness of these methods under various conditions such as camera motion, viewpoint, video quality and occlusion.



Citations
Proceedings ArticleDOI

Going deeper with convolutions

TL;DR: Inception is a deep convolutional neural network architecture that achieved a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Proceedings Article

Two-Stream Convolutional Networks for Action Recognition in Videos

TL;DR: This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
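The two-stream idea summarized above, separate spatial (appearance) and temporal (optical-flow) networks whose class scores are combined, can be illustrated with a toy late-fusion step. This is a hedged sketch only: the logits, class count, and fusion weight below are hypothetical, and plain numpy stands in for the actual ConvNets.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def late_fusion(spatial_logits, temporal_logits, w_temporal=0.5):
    """Weighted average of the per-class softmax scores of the two streams."""
    s = softmax(spatial_logits)
    t = softmax(temporal_logits)
    return (1 - w_temporal) * s + w_temporal * t

# Toy 3-class example: the spatial stream is uncertain,
# the flow stream is confident about class 1.
spatial = np.array([1.0, 1.1, 0.9])
temporal = np.array([0.1, 3.0, 0.2])
fused = late_fusion(spatial, temporal)
predicted = int(np.argmax(fused))
```

Averaging softmax scores (rather than logits) keeps the fused vector a valid probability distribution, which is one common way two-stream predictions are combined at test time.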
Proceedings ArticleDOI

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

TL;DR: In this article, a Two-Stream Inflated 3D ConvNet (I3D) is proposed to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and their parameters.
Posted Content

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

TL;DR: This work introduces UCF101, which is currently the largest dataset of human actions, and provides baseline action recognition results on this new dataset using a standard bag-of-words approach, with an overall performance of 44.5%.
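The bag-of-words baseline named in this summary quantizes local descriptors against a learned codebook and represents each clip as a histogram of codeword assignments, which is then fed to a classifier. A minimal sketch of the histogram step, assuming random toy descriptors and a hypothetical pre-learned codebook (numpy only):

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest codeword and
    return the L1-normalized histogram of assignments."""
    # Pairwise squared distances between descriptors (n, d) and codewords (k, d).
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)                  # nearest codeword per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 16))      # 8 codewords over 16-D descriptors (toy)
descriptors = rng.normal(size=(50, 16))  # 50 local features from one clip (toy)
h = bow_histogram(descriptors, codebook)
```

In a real pipeline the codebook would come from k-means over training descriptors, and the resulting histograms would be classified with an SVM.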
Posted Content

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

TL;DR: A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
References
Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Journal ArticleDOI

Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope

TL;DR: The performance of the spatial envelope model shows that specific information about object shape or identity is not a requirement for scene categorization and that modeling a holistic representation of the scene informs about its probable semantic category.
Proceedings ArticleDOI

Learning realistic human actions from movies

TL;DR: A new method for video classification that builds upon and extends several recent ideas, including local space-time features, space-time pyramids and multi-channel non-linear SVMs, is presented and shown to improve state-of-the-art results on the standard KTH action dataset.
Journal ArticleDOI

LabelMe: A Database and Web-Based Tool for Image Annotation

TL;DR: In this article, a large collection of images with ground truth labels is built to be used for object detection and recognition research; such data is useful for supervised learning and quantitative evaluation.
Proceedings ArticleDOI

Recognizing human actions: a local SVM approach

TL;DR: This paper constructs video representations in terms of local space-time features, integrates such representations with SVM classification schemes, and presents action recognition results.