Proceedings ArticleDOI

Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

01 Oct 2017, pp. 5534-5542
TL;DR: This paper devises multiple variants of bottleneck building blocks in a residual learning framework by simulating 3 × 3 × 3 convolutions with 1 × 3 × 3 convolutional filters on the spatial domain (equivalent to a 2D CNN) plus 3 × 1 × 1 convolutions to construct temporal connections on adjacent feature maps in time.
Abstract: Convolutional Neural Networks (CNN) have been regarded as a powerful class of models for image recognition problems. Nevertheless, it is not trivial to utilize a CNN for learning spatio-temporal video representation. A few studies have shown that performing 3D convolutions is a rewarding approach to capture both spatial and temporal dimensions in videos. However, the development of a very deep 3D CNN from scratch results in expensive computational cost and memory demand. A valid question is why not recycle off-the-shelf 2D networks for a 3D CNN. In this paper, we devise multiple variants of bottleneck building blocks in a residual learning framework by simulating 3 × 3 × 3 convolutions with 1 × 3 × 3 convolutional filters on the spatial domain (equivalent to a 2D CNN) plus 3 × 1 × 1 convolutions to construct temporal connections on adjacent feature maps in time. Furthermore, we propose a new architecture, named Pseudo-3D Residual Net (P3D ResNet), that exploits all the variants of blocks but places each at a different position in the ResNet, following the philosophy that enhancing structural diversity while going deep can improve the power of neural networks. Our P3D ResNet achieves clear improvements on the Sports-1M video classification dataset over 3D CNN and frame-based 2D CNN by 5.3% and 1.8%, respectively. We further examine the generalization performance of the video representation produced by our pre-trained P3D ResNet on five different benchmarks and three different tasks, demonstrating superior performance over several state-of-the-art techniques.
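The decomposition is straightforward to express in code. Below is a minimal PyTorch sketch (not the authors' implementation; the class and parameter names are hypothetical, and batch normalization is omitted for brevity) of a P3D-A style bottleneck, where the 1 × 3 × 3 spatial convolution and the 3 × 1 × 1 temporal convolution are cascaded in series inside a residual block:

```python
import torch
import torch.nn as nn

class P3DBlockA(nn.Module):
    """Sketch of a P3D-A style bottleneck: spatial (1x3x3) then temporal
    (3x1x1) convolution in series, wrapped in a residual shortcut."""
    def __init__(self, channels: int, mid: int):
        super().__init__()
        self.reduce = nn.Conv3d(channels, mid, kernel_size=1)
        # 1x3x3: 2D spatial filtering applied independently to each frame
        self.spatial = nn.Conv3d(mid, mid, (1, 3, 3), padding=(0, 1, 1))
        # 3x1x1: 1D temporal filtering across adjacent feature maps
        self.temporal = nn.Conv3d(mid, mid, (3, 1, 1), padding=(1, 0, 0))
        self.expand = nn.Conv3d(mid, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.reduce(x))
        out = self.relu(self.spatial(out))
        out = self.relu(self.temporal(out))
        return self.relu(self.expand(out) + x)  # residual connection

# Video tensors are (batch, channels, time, height, width).
x = torch.randn(2, 64, 16, 56, 56)
assert P3DBlockA(64, 16)(x).shape == x.shape
```

The paper's other variants differ only in wiring: P3D-B runs the spatial and temporal paths in parallel and sums them, while P3D-C additionally skips the spatial output directly to the block output.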


Citations
Proceedings ArticleDOI
12 Apr 2018
TL;DR: In this article, a new spatiotemporal convolutional block "R(2+1)D" was proposed, which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.
Abstract: In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block "R(2+1)D" which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.
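The factorization can be sketched compactly. In the R(2+1)D design, a t × d × d 3D convolution is replaced by a 1 × d × d spatial convolution into an intermediate width M, followed by a t × 1 × 1 temporal convolution, with M chosen so the parameter count roughly matches the original 3D filter. A minimal PyTorch sketch follows (the function name is hypothetical; normalization placement follows common practice rather than the exact paper configuration):

```python
import torch.nn as nn

def r2plus1d_conv(in_ch: int, out_ch: int, t: int = 3, d: int = 3) -> nn.Sequential:
    """Sketch of a (2+1)D factorized convolution: spatial 1xdxd conv to an
    intermediate width m, then temporal tx1x1 conv, with m sized so the
    parameter count approximates a full t x d x d 3D convolution."""
    m = (t * d * d * in_ch * out_ch) // (d * d * in_ch + t * out_ch)
    return nn.Sequential(
        nn.Conv3d(in_ch, m, (1, d, d), padding=(0, d // 2, d // 2)),
        nn.BatchNorm3d(m),
        nn.ReLU(inplace=True),
        nn.Conv3d(m, out_ch, (t, 1, 1), padding=(t // 2, 0, 0)),
    )
```

Compared with the P3D blocks sketched above, the extra nonlinearity between the two factored convolutions doubles the number of ReLUs per block, which the authors credit with increasing the complexity of functions the network can represent.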

1,827 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: This study determines whether current video datasets have sufficient data for training very deep convolutional neural networks with spatio-temporal three-dimensional (3D) kernels, and argues that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet and stimulate advances in computer vision for videos.
Abstract: The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels. Recently, the performance levels of 3D CNNs in the field of action recognition have improved significantly. However, to date, conventional research has only explored relatively shallow 3D architectures. We examine the architectures of various 3D CNNs, from relatively shallow to very deep, on current video datasets. Based on the results of those experiments, the following conclusions could be obtained: (i) ResNet-18 training resulted in significant overfitting for UCF-101, HMDB-51, and ActivityNet but not for Kinetics. (ii) The Kinetics dataset has sufficient data for training deep 3D CNNs, and enables training of ResNets of up to 152 layers, interestingly similar to 2D ResNets on ImageNet. ResNeXt-101 achieved 78.4% average accuracy on the Kinetics test set. (iii) Kinetics-pretrained simple 3D architectures outperform complex 2D architectures, and the pretrained ResNeXt-101 achieved 94.5% and 70.2% on UCF-101 and HMDB-51, respectively. The use of 2D CNNs trained on ImageNet has produced significant progress in various image tasks. We believe that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet, and stimulate advances in computer vision for videos. The codes and pretrained models used in this study are publicly available.

1,769 citations


Cites result from "Learning Spatio-Temporal Representa..."

  • ...Here, we can see that ResNeXt-101 achieved higher accuracies compared with C3D [23], P3D [19], two-stream CNN [20], and TDD [27]....

Book ChapterDOI
Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, Kevin Murphy
08 Sep 2018
TL;DR: In this article, it was shown that many of the expensive 3D convolutions can be replaced by low-cost 2D convolutions; the best result was achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level semantic features is more useful.
Abstract: Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification. Three main challenges exist: spatial (image) feature representation, temporal information representation, and model/computation complexity. It was recently shown by Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained on ImageNet, could be a promising way for spatial and temporal representation learning. However, as for model/computation complexity, 3D CNNs are much more expensive than 2D CNNs and prone to overfit. We seek a balance between speed and accuracy by building an effective and efficient video classification system through systematic exploration of critical network design choices. In particular, we show that it is possible to replace many of the 3D convolutions by low-cost 2D convolutions. Rather surprisingly, the best result (in both speed and accuracy) is achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level "semantic" features is more useful. Our conclusion generalizes to datasets with very different properties. When combined with several other cost-effective designs, including separable spatial/temporal convolution and feature gating, our system results in an effective video classification system that produces very competitive results on several action classification benchmarks (Kinetics, Something-something, UCF101 and HMDB), as well as two action detection (localization) benchmarks (JHMDB and UCF101-24).
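The "top-heavy" placement the abstract describes is easy to sketch: keep the early stages purely spatial and reserve temporal (3D) filtering for the later, more semantic stages. The following PyTorch fragment is a hypothetical illustration (function and parameter names are invented; real architectures interleave pooling, normalization, and residual connections):

```python
import torch.nn as nn

def top_heavy_stack(channels: list[int], n_2d: int) -> nn.Sequential:
    """Sketch of a top-heavy 2D/3D mix: the first n_2d stages use purely
    spatial (1x3x3) convolutions, later stages use full 3x3x3 convolutions,
    so temporal modeling only acts on high-level semantic features."""
    layers = []
    for i in range(len(channels) - 1):
        k = (1, 3, 3) if i < n_2d else (3, 3, 3)
        layers += [nn.Conv3d(channels[i], channels[i + 1], k,
                             padding=tuple(s // 2 for s in k)),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

# e.g. five stages: first three spatial-only, last two spatio-temporal
net = top_heavy_stack([3, 64, 128, 256, 512, 512], n_2d=3)
```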

809 citations

Book ChapterDOI
08 Sep 2018
TL;DR: The proposed graph representation achieves state-of-the-art results on the Charades and Something-Something datasets, with a 4.4% gain on Charades, whose videos feature complex environments.
Abstract: How do humans recognize the action “opening a book”? We argue that there are two important cues: modeling temporal shape dynamics and modeling functional relationships between humans and objects. In this paper, we propose to represent videos as space-time region graphs which capture these two important cues. Our graph nodes are defined by the object region proposals from different frames in a long-range video. These nodes are connected by two types of relations: (i) similarity relations capturing the long-range dependencies between correlated objects and (ii) spatial-temporal relations capturing the interactions between nearby objects. We perform reasoning on this graph representation via Graph Convolutional Networks. We achieve state-of-the-art results on the Charades and Something-Something datasets. Especially for Charades, whose videos feature complex environments, we obtain a 4.4% gain.
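The reasoning step the abstract mentions is a standard graph convolution over region-proposal features. A minimal PyTorch sketch (the names are hypothetical; the actual paper combines several adjacency matrices and stacks multiple layers):

```python
import torch

def graph_conv(x: torch.Tensor, adj: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """One GCN-style layer over N region nodes: normalize the pairwise
    relation scores per node, aggregate neighbor features, then project."""
    a = torch.softmax(adj, dim=-1)     # row-normalized relation weights
    return torch.relu(a @ x @ w)       # aggregate, transform, activate

# N region proposals with d-dim appearance features from a 3D ConvNet
N, d = 50, 512
x = torch.randn(N, d)                  # node features
adj = torch.randn(N, N)                # e.g. pairwise similarity scores
w = torch.randn(d, d)                  # learnable projection (sketch)
out = graph_conv(x, adj, w)            # refined features, shape (N, d)
```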

763 citations


Cites methods from "Learning Spatio-Temporal Representa..."

  • ...One of the most popular models is the two-stream ConvNets [1], where temporal information is modeled by a network with 10 optical flow frames as inputs (∼1 second)....


  • ...To better model longer-term information, a lot of work has been focused on using Recurrent Neural Networks (RNNs) [3,4,38,39,40,5,41,42,43] and 3D ConvNets [44,45,8,9,46,47,48]....


  • ...For example, the state-of-the-art approaches based on two-stream ConvNets [1,2] are still learning to classify actions based on individual video frames or local motion vectors....


  • ...In the context of deep learning, especially for semantic segmentation, the CRF model is often applied on the outputs of the ConvNets by performing mean-field inference [61,62,63,64,65,66]....


  • ...In this section, we will first introduce the feature extraction process for our model with 3D ConvNets and then describe the construction of the similarity graph as well as the spatial-temporal graph....


Proceedings ArticleDOI
15 Jun 2019
TL;DR: A novel directed graph neural network is specially designed to extract the information of joints, bones and their relations and make predictions based on the extracted features; tested on two large-scale datasets, NTU-RGBD and Skeleton-Kinetics, it exceeds state-of-the-art performance on both.
Abstract: Skeleton data have been widely used for action recognition tasks since they can robustly accommodate dynamic circumstances and complex backgrounds. In existing methods, both the joint and bone information in skeleton data have been proved to be of great help for action recognition tasks. However, how to incorporate these two types of data to best take advantage of the relationship between joints and bones remains a problem to be solved. In this work, we represent the skeleton data as a directed acyclic graph based on the kinematic dependency between the joints and bones in the natural human body. A novel directed graph neural network is specially designed to extract the information of joints, bones and their relations and make predictions based on the extracted features. In addition, to better fit the action recognition task, the topological structure of the graph is made adaptive based on the training process, which brings notable improvement. Moreover, the motion information of the skeleton sequence is exploited and combined with the spatial information to further enhance performance in a two-stream framework. Our final model is tested on two large-scale datasets, NTU-RGBD and Skeleton-Kinetics, and exceeds state-of-the-art performance on both of them.

634 citations


Cites methods from "Learning Spatio-Temporal Representa..."

  • ...The most widely used models in deep-learning-based methods are recurrent neural networks (RNNs), convolutional neural networks (CNNs) and graph convolutional networks (GCNs), where the coordinates of joints are represented as vector sequences, pseudo-images and graphs, respectively....


  • ...By decoupling the spatial and temporal dimensions, the pseudo-3D CNN can model the spatiotemporal information in a more economic and effective way....


  • ...The pseudo-3D CNN [23], which models spatial information with 2D convolutions and then temporal information with 1D convolutions, has shown its superiority in the RGB-based action recognition field....


  • ...Conventional methods always model the skeleton data as a sequence of vectors or a pseudo-image to be processed by RNNs or CNNs....


  • ...Graphs are a more general data structure than images and sequences, and cannot be directly modeled by conventional deep learning modules such as CNNs and RNNs. Approaches for operating directly on graphs and solving graph-based problems have been explored extensively for several years [15, 9, 33, 24, 1, 11, 2]....


References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously; it won 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to the ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
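The core idea admits a one-line formulation. With x the block input and H(x) the desired underlying mapping, the stacked layers fit the residual instead, and the identity shortcut carries x forward (notation follows the paper):

```latex
% Residual learning: the block learns F(x) = H(x) - x, so its output is
y = \mathcal{F}(x, \{W_i\}) + x
% where W_i are the weights of the stacked layers in the block; when input
% and output dimensions differ, a linear projection W_s x replaces x.
```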

123,388 citations

Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network, consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art performance on ImageNet classification.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations

Journal Article
TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map; it is a variation of Stochastic Neighbor Embedding that is much easier to optimize and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large datasets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of datasets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the datasets.
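For reference, the objective behind the technique can be stated in two lines (standard t-SNE notation, not quoted from this listing): the map points y_i are chosen so that the low-dimensional Student-t affinities q_ij match the high-dimensional affinities p_ij by minimizing a Kullback-Leibler divergence:

```latex
% t-SNE objective: minimize the KL divergence between input-space
% affinities P and map-space Student-t affinities Q.
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
```

The heavy-tailed Student-t kernel in q_ij is what reduces the crowding of points in the center of the map that the abstract mentions.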

30,124 citations


"Learning Spatio-Temporal Representa..." refers methods in this paper

  • ...Video representation embedding visualizations of ResNet-152 and P3D ResNet on UCF101 using t-SNE [32]....


  • ...Figure 7 further shows the t-SNE [32] visualization of embedding of video representation learnt by ResNet-152 and P3D ResNet....
