Open Access · Posted Content
Convolutional Two-Stream Network Fusion for Video Action Recognition
TLDR
This paper shows that a spatial and a temporal network can be fused at the last convolutional layer without loss of performance but with a substantial saving in parameters, and that pooling abstract convolutional features over spatiotemporal neighbourhoods further boosts performance.

Abstract
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.
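The fusion findings in the abstract can be sketched numerically: two same-sized feature maps (one per stream) are combined either by element-wise summation or by channel concatenation followed by a 1×1 convolution, the latter learning which channel pairs to mix. This is a minimal NumPy illustration of the idea only; the shapes, random filter bank, and function names are assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch: fusing a spatial-stream and a temporal-stream feature
# map of shape (channels, height, width) at a convolutional layer.
rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
spatial = rng.standard_normal((C, H, W))   # appearance-stream activations
temporal = rng.standard_normal((C, H, W))  # motion-stream activations

def sum_fusion(a, b):
    """Element-wise sum fusion: assumes corresponding channels already align."""
    return a + b

def conv_fusion(a, b, weights):
    """Concatenate along channels, then mix with a 1x1 convolution.

    weights has shape (C_out, 2*C); as a 1x1 filter bank it acts as a
    per-pixel linear map over the stacked channels.
    """
    stacked = np.concatenate([a, b], axis=0)    # (2C, H, W)
    flat = stacked.reshape(2 * a.shape[0], -1)  # (2C, H*W)
    return (weights @ flat).reshape(weights.shape[0], *a.shape[1:])

w = rng.standard_normal((C, 2 * C))  # hypothetical learned 1x1 filters
fused_sum = sum_fusion(spatial, temporal)
fused_conv = conv_fusion(spatial, temporal, w)
print(fused_sum.shape, fused_conv.shape)  # (4, 8, 8) (4, 8, 8)
```

Both variants keep the spatial resolution, so fusion can happen at any matching convolutional layer; the parameter saving the paper reports comes from sharing the layers above the fusion point instead of running two full towers to the softmax.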
Citations
Proceedings ArticleDOI
Kernel Transformer Networks for Compact Spherical Convolution
Yu-Chuan Su, Kristen Grauman
TL;DR: The Kernel Transformer Network (KTN) is presented to efficiently transfer convolution kernels from perspective images to the equirectangular projection of 360° images and successfully preserves the source CNN’s accuracy, while offering transferability, scalability to typical image resolutions, and, in many cases, a substantially lower memory footprint.
Posted Content
Learning Spherical Convolution for Fast Features from 360° Imagery
Yu-Chuan Su, Kristen Grauman
TL;DR: In this paper, a spherical convolution approach is proposed that adapts a planar CNN to process 360° imagery directly in its equirectangular projection, remaining sensitive to the varying distortion effects across the viewing sphere.
Proceedings ArticleDOI
Weakly Supervised Temporal Action Localization Through Contrast Based Evaluation Networks
TL;DR: The Contrast-based Localization EvaluAtioN Network (CleanNet) is proposed, whose action-proposal evaluator provides pseudo-supervision by leveraging the temporal contrast in snippet-level action classification predictions; since the evaluator is an integral part of CleanNet, the model can be trained end-to-end.
Journal ArticleDOI
Action recognition using optimized deep autoencoder and CNN for surveillance data streams of non-stationary environments
TL;DR: An efficient, optimized CNN-based system is presented that processes real-time data streams acquired from the visual sensors of non-stationary surveillance environments, using a non-linear learning approach (a quadratic SVM) together with an iterative fine-tuning process in the testing phase.
Proceedings ArticleDOI
DynamoNet: Dynamic Action and Motion Network
TL;DR: A novel unified spatio-temporal 3D-CNN architecture (DynamoNet) is introduced that jointly optimizes video classification and motion-representation learning by framing future-frame prediction as a multi-task learning problem.
References
Proceedings Article
ImageNet Classification with Deep Convolutional Neural Networks
TL;DR: A deep convolutional neural network is presented that achieves state-of-the-art image classification performance; it consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Proceedings ArticleDOI
Going deeper with convolutions
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich
TL;DR: Inception, a deep convolutional neural network architecture, achieves a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Proceedings Article
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe, Christian Szegedy
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
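The transform behind the batch-normalization result above can be sketched as follows: normalize each feature over the mini-batch, then scale and shift with learnable parameters. This is a minimal training-time NumPy sketch under stated assumptions; the inference-time running averages are omitted, and all names are illustrative.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization over a (batch, features) array.

    Each feature is standardized with its own mini-batch mean and variance,
    then rescaled by gamma and shifted by beta (both learnable in practice).
    """
    mean = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # standardized activations
    return gamma * x_hat + beta

# Example: features with arbitrary offset and scale become zero-mean, unit-variance.
x = np.random.default_rng(1).standard_normal((32, 5)) * 3 + 7
y = batch_norm(x, gamma=np.ones(5), beta=np.zeros(5))
print(np.allclose(y.mean(axis=0), 0, atol=1e-6))  # prints: True
```

Stabilizing each layer's input distribution this way is what permits the much larger learning rates behind the "14 times fewer training steps" figure quoted above.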