UntrimmedNets for Weakly Supervised Action Recognition and Detection
Limin Wang, Yuanjun Xiong, Dahua Lin, Luc Van Gool
- pp 6402-6411
TLDR
This paper presents a new weakly supervised architecture, called UntrimmedNet, which is able to directly learn action recognition models from untrimmed videos without the requirement of temporal annotations of action instances.
Abstract:
Current action recognition methods rely heavily on trimmed videos for model training. However, it is expensive and time-consuming to acquire a large-scale trimmed video dataset. This paper presents a new weakly supervised architecture, called UntrimmedNet, which is able to directly learn action recognition models from untrimmed videos without the requirement of temporal annotations of action instances. UntrimmedNet couples two important components, the classification module and the selection module, to learn the action models and reason about the temporal duration of action instances, respectively. These two components are implemented with feed-forward networks, and UntrimmedNet is therefore an end-to-end trainable architecture. We exploit the learned models for weakly supervised action recognition (WSR) and detection (WSD) on the untrimmed video datasets of THUMOS14 and ActivityNet. Although UntrimmedNet employs only weak supervision, it achieves performance superior or comparable to that of strongly supervised approaches on these two datasets.
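The coupling of the two modules described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how clip-level outputs are combined, so the softmax-weighted average used here is an assumption, and the names `video_prediction`, `class_weights`, and `selection_weights` are hypothetical. Each module is reduced to a single linear layer over precomputed clip features.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def video_prediction(clip_features, class_weights, selection_weights):
    """Combine clip-level scores into one video-level prediction.

    clip_features:     list of per-clip feature vectors
    class_weights:     one weight vector per action class (hypothetical)
    selection_weights: a single weight vector scoring clip relevance (hypothetical)
    """
    # Classification module: one score per class for every clip.
    clip_scores = [[dot(f, w) for w in class_weights] for f in clip_features]
    # Selection module: one relevance weight per clip, normalized across clips.
    # (Assumed soft weighting; the abstract does not specify the mechanism.)
    clip_weights = softmax([dot(f, selection_weights) for f in clip_features])
    # Video-level score per class is the relevance-weighted sum of clip scores,
    # so only a video-level label is needed to train both modules jointly.
    video_scores = [
        sum(cw * scores[c] for cw, scores in zip(clip_weights, clip_scores))
        for c in range(len(class_weights))
    ]
    return softmax(video_scores), clip_weights


# Toy example: three clips, two classes, two-dimensional features.
clips = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
probs, weights = video_prediction(
    clips,
    class_weights=[[2.0, 0.0], [0.0, 2.0]],
    selection_weights=[3.0, 0.0],
)
```

In this toy setup the first clip receives the largest selection weight, so the video-level prediction is dominated by that clip's class scores. The per-clip weights are also what a detection step could threshold to estimate where an action occurs.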
Citations
Book Chapter
Cascaded Pyramid Mining Network for Weakly Supervised Temporal Action Localization
Haisheng Su, Xu Zhao, Tianwei Lin
TL;DR: This paper proposes a cascaded pyramid mining network (CPMN) that exploits hierarchical contextual information in videos to reduce missing detections, producing a scale-invariant attention map by combining feature maps from different levels.
Proceedings Article
Action Coherence Network for Weakly Supervised Temporal Action Localization
TL;DR: This work presents Action Coherence Network (ACN) for W-TAL, featuring a new coherence loss that better supervises action boundary learning and facilitates proposal regression, together with a purpose-built fusion module for localization inference based on features extracted by two streams of convolutional neural networks.
Proceedings Article
Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition
Chun-Fu Chen, Rameswar Panda, Kandan Ramakrishnan, Rogerio Feris, John M. Cohn, Aude Oliva, Quanfu Fan
TL;DR: This paper carries out a comparative analysis of 2D-CNNs and 3D-CNNs for action recognition, showing that a significant leap has been made in efficiency but not in accuracy, and that 2D-CNNs and 3D-CNNs behave similarly in terms of spatio-temporal representation abilities and transferability.
Proceedings Article
ZSTAD: Zero-Shot Temporal Activity Detection
TL;DR: This work designs an end-to-end deep network based on R-C3D that is optimized with an innovative loss function that considers the embeddings of activity labels and their super-classes while learning the common semantics of seen and unseen activities.
Proceedings Article
Activity Driven Weakly Supervised Object Detection
TL;DR: This work shows that the action depicted in the image/video can provide strong cues about the location of the associated object and learns a spatial prior for the object dependent on the action, and incorporates this prior to simultaneously train a joint object detection and action classification model.
References
Proceedings Article
ImageNet Classification with Deep Convolutional Neural Networks
TL;DR: A large, deep convolutional neural network achieved state-of-the-art ImageNet classification performance; it consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Proceedings Article
ImageNet: A large-scale hierarchical image database
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Journal Article
Gradient-based learning applied to document recognition
Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner
TL;DR: This article shows that gradient-based learning can be used to synthesize a complex decision surface that classifies high-dimensional patterns, such as handwritten characters, and proposes graph transformer networks (GTNs) for training multi-module document recognition systems.
Proceedings Article
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe, Christian Szegedy
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Journal Article
ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, Li Fei-Fei
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection covering hundreds of object categories and millions of images; it has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Related Papers (5)
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
Joao Carreira, Andrew Zisserman