Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks
Citations
1,827 citations
1,769 citations
Cites result from "Learning Spatio-Temporal Representa..."
...Here, we can see that ResNeXt-101 achieved higher accuracies compared with C3D [23], P3D [19], two-stream CNN [20], and TDD [27]....
[...]
809 citations
763 citations
Cites methods from "Learning Spatio-Temporal Representa..."
...One of the most popular models is the two-stream ConvNets [1], where temporal information is modeled by a network with 10 optical flow frames as inputs ( 1 second)....
[...]
...To better model longer-term information, a lot of work has been focused on using Recurrent Neural Networks (RNNs) [3,4,38,39,40,5,41,42,43] and 3D ConvNets [44,45,8,9,46,47,48]....
[...]
...For example, the state-of-the-art approaches based on two-stream ConvNets [1,2] are still learning to classify actions based on individual video frames or local motion vectors....
[...]
...In the context of deep learning, especially for semantic segmentation, the CRF model is often applied on the outputs of the ConvNets by performing mean-field inference [61,62,63,64,65,66]....
[...]
...In this section, we will first introduce the feature extraction process for our model with 3D ConvNets and then describe the construction of the similarity graph as well as the spatial-temporal graph....
[...]
634 citations
Cites methods from "Learning Spatio-Temporal Representa..."
...The most widely used models in deep-learning-based methods are recurrent neural networks (RNNs), convolutional neural networks (CNNs) and graph convolutional networks (GCNs), where the coordinates of joints are represented as vector sequences, pseudo-images and graphs, respectively....
[...]
...By decoupling the spatial and temporal dimensions, the pseudo-3D CNN can model the spatiotemporal information in a more economic and effective way....
[...]
...The pseudo-3D CNN [23] has shown its superiority in the RGB-based action recognition field, which models the spatial information with the 2D convolutions and then models the temporal information with the 1D convolutions....
[...]
...Conventional methods always model the skeleton data as a sequence of vectors or a pseudo-image to be processed by RNNs or CNNs....
[...]
...Graph is a more general data structure than image and sequence, which cannot be directly modeled by conventional deep learning modules such as CNNs and RNNs. Approaches for operating directly on graphs and solving graph-based problems have been explored extensively for several years [15, 9, 33, 24, 1, 11, 2]....
[...]
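The excerpts above describe the pseudo-3D CNN as decoupling space and time: 2D convolutions for spatial information, then 1D convolutions for temporal information. A rough way to see why this can be more economical is to count weights. The sketch below is illustrative only: it assumes a 3x3x3 kernel for the full 3D convolution, a 1x3x3 spatial kernel followed by a 3x1x1 temporal kernel for the decoupled form, equal input and output channel counts, and no bias terms. The function names are hypothetical and do not come from any of the cited papers.

```python
def conv3d_weights(c_in, c_out, kt, kh, kw):
    """Weight count of one 3D convolution layer (bias terms ignored)."""
    return c_in * c_out * kt * kh * kw


def full_3d_weights(c_in, c_out):
    """Assumed baseline: a single 3x3x3 spatio-temporal convolution."""
    return conv3d_weights(c_in, c_out, 3, 3, 3)


def pseudo_3d_weights(c_in, c_out):
    """Assumed decoupled form: 1x3x3 spatial conv, then 3x1x1 temporal conv."""
    spatial = conv3d_weights(c_in, c_out, 1, 3, 3)    # 2D convolution over H and W
    temporal = conv3d_weights(c_out, c_out, 3, 1, 1)  # 1D convolution over time
    return spatial + temporal


if __name__ == "__main__":
    channels = 64  # illustrative channel width, not taken from the source
    full = full_3d_weights(channels, channels)
    pseudo = pseudo_3d_weights(channels, channels)
    print(f"full 3D: {full} weights, pseudo-3D: {pseudo} weights")
    print(f"ratio: {pseudo / full:.3f}")
```

Under these assumptions the decoupled pair uses 12 weights per input-output channel pair instead of 27. This says nothing about accuracy, which the excerpts attribute to the design without giving numbers here.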
References
123,388 citations
73,978 citations
49,914 citations
40,257 citations
30,124 citations
"Learning Spatio-Temporal Representa..." refers methods in this paper
...Video representation embedding visualizations of ResNet-152 and P3D ResNet on UCF101 using t-SNE [32]....
[...]
...Figure 7 further shows the t-SNE [32] visualization of embedding of video representation learnt by ResNet-152 and P3D ResNet....
[...]