Proceedings ArticleDOI

Temporal Convolutional Networks for Action Segmentation and Detection

01 Jul 2017, pp. 1003-1012
TL;DR: A class of temporal models that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection; they capture action compositions, segment durations, and long-range dependencies, and are over an order of magnitude faster to train than competing LSTM-based recurrent neural networks.
Abstract: The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We describe a class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are over a magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.
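
For intuition, here is a minimal, illustrative PyTorch sketch of the dilated variant: stacked 1-D convolutions whose dilation doubles at each layer so the receptive field covers long temporal spans, followed by a per-frame classifier. Layer sizes and names are our own assumptions, not the authors' released model.

    import torch
    import torch.nn as nn

    class DilatedTCN(nn.Module):
        # Stacked dilated 1-D convolutions over per-frame features
        # (batch, channels, frames), ending in per-frame action logits.
        def __init__(self, in_channels, num_classes, hidden=64, levels=4):
            super().__init__()
            layers, ch = [], in_channels
            for i in range(levels):
                d = 2 ** i                      # dilation doubles per layer
                layers += [nn.Conv1d(ch, hidden, 3, dilation=d, padding=d),
                           nn.ReLU()]
                ch = hidden
            self.backbone = nn.Sequential(*layers)
            self.classifier = nn.Conv1d(hidden, num_classes, 1)

        def forward(self, x):                   # x: (batch, features, frames)
            return self.classifier(self.backbone(x))

    logits = DilatedTCN(128, 10)(torch.randn(2, 128, 250))   # (2, 10, 250)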
Citations
Posted Content
TL;DR: A systematic evaluation of generic convolutional and recurrent architectures for sequence modeling concludes that the common association between sequence modeling and recurrent networks should be reconsidered, and that convolutional networks should be regarded as a natural starting point for sequence modeling tasks.
Abstract: For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks. To assist related work, we have made code available online.
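
The building block such generic TCNs rely on is a causal (left-padded) dilated convolution with a residual connection. The PyTorch sketch below is a hedged illustration of that idea, not the paper's exact block (which additionally uses weight normalization and dropout).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalBlock(nn.Module):
        # One residual block: left-padded ("causal") dilated convolution,
        # so each output frame depends only on current and past inputs.
        def __init__(self, channels, kernel=3, dilation=1):
            super().__init__()
            self.left_pad = (kernel - 1) * dilation
            self.conv = nn.Conv1d(channels, channels, kernel, dilation=dilation)

        def forward(self, x):                      # x: (batch, ch, time)
            y = F.pad(x, (self.left_pad, 0))       # pad the past only, never the future
            return torch.relu(self.conv(y) + x)    # residual connection

    y = CausalBlock(32, dilation=4)(torch.randn(1, 32, 100))  # (1, 32, 100)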

2,776 citations

Journal ArticleDOI
TL;DR: A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures.
Abstract: Single-channel, speaker-independent speech separation methods have recently seen great progress. However, the accuracy, latency, and computational cost of such methods remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of time-frequency representation for speech separation, and the long latency in calculating the spectrograms. To address these shortcomings, we propose a fully-convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation. Conv-TasNet uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers. Speaker separation is achieved by applying a set of weighting functions (masks) to the encoder output. The modified encoder representations are then inverted back to the waveforms using a linear decoder. The masks are found using a temporal convolutional network (TCN) consisting of stacked 1-D dilated convolutional blocks, which allows the network to model the long-term dependencies of the speech signal while maintaining a small model size. The proposed Conv-TasNet system significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures. Additionally, Conv-TasNet surpasses several ideal time-frequency magnitude masks in two-speaker speech separation as evaluated by both objective distortion measures and subjective quality assessment by human listeners. Finally, Conv-TasNet has a significantly smaller model size and a shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications.
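
A shape-level sketch of the pipeline described above: a learned 1-D conv encoder, masks applied to its output, and a transposed-conv decoder. The dimensions and the trivial 1x1 "separator" below are placeholders standing in for the paper's stacked dilated TCN, chosen only to make the tensor shapes concrete.

    import torch
    import torch.nn as nn

    B, T, N, L, C = 2, 16000, 256, 20, 2   # batch, samples, basis, window, speakers
    encoder = nn.Conv1d(1, N, kernel_size=L, stride=L // 2, bias=False)
    separator = nn.Conv1d(N, N * C, kernel_size=1)     # placeholder for the TCN
    decoder = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=L // 2, bias=False)

    mix = torch.randn(B, 1, T)
    w = encoder(mix)                                   # (B, N, frames)
    masks = torch.sigmoid(separator(w)).view(B, C, N, -1)
    sources = torch.stack([decoder(w * masks[:, c])    # one waveform per speaker
                           for c in range(C)], dim=1)  # (B, C, 1, T)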

1,061 citations


Cites methods from "Temporal Convolutional Networks for..."

  • ...Motivated by the temporal convolutional network (TCN) [23], [24], [25], we propose a fully convolutional separation module that consists of stacked 1-D dilated convolutional blocks, as shown in Figure 1 B....


  • ...a replacement for RNNs in various tasks [23], [24], [25]....


  • ...This method is motivated by the success of temporal convolutional network (TCN) models [23], [24], [25], which allow parallel processing on consecutive...


Proceedings ArticleDOI
01 Jun 2018
TL;DR: TAL-Net improves receptive field alignment using a multi-scale architecture that accommodates extreme variation in action durations, and better exploits the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields.
Abstract: We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster RCNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields; and (3) we explicitly consider multi-stream feature fusion and demonstrate that fusing motion late is important. We achieve state-of-the-art performance for both action proposal and localization on THUMOS'14 detection benchmark and competitive performance on ActivityNet challenge.
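
The receptive-field arithmetic behind such multi-scale alignment is easy to state; the snippet below (our own illustration, not from the paper) shows how stacked stride-1 convolutions with doubling dilation yield receptive fields spanning very different action durations.

    # Each stride-1 convolution with kernel k and dilation d widens the
    # receptive field by (k - 1) * d frames, so doubling dilations grow it
    # roughly exponentially with depth.
    def receptive_field(kernels, dilations):
        rf = 1
        for k, d in zip(kernels, dilations):
            rf += (k - 1) * d
        return rf

    for depth in (1, 2, 4, 8):
        print(depth, receptive_field([3] * depth, [2 ** i for i in range(depth)]))
    # depths 1, 2, 4, 8 -> receptive fields 3, 7, 31, 511 frames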

647 citations

Proceedings ArticleDOI
21 Jul 2017
TL;DR: This work proposes to use a new class of models known as Temporal Convolutional Neural Networks (TCN) for 3D human action recognition, and aims to take a step towards a spatio-temporal model that is easier to understand, explain and interpret.
Abstract: The discriminative power of modern deep learning models for 3D human action recognition is growing ever so potent. In conjunction with the recent resurgence of 3D human action representation with 3D skeletons, the quality and the pace of recent progress have been significant. However, the inner workings of state-of-the-art learning based methods in 3D human action recognition still remain mostly black-box. In this work, we propose to use a new class of models known as Temporal Convolutional Neural Networks (TCN) for 3D human action recognition. TCN provides us a way to explicitly learn readily interpretable spatio-temporal representations for 3D human action recognition. Through this work, we wish to take a step towards a spatio-temporal model that is easier to understand, explain and interpret. The resulting model, Res-TCN, achieves state-of-the-art results on the largest 3D human action recognition dataset, NTU-RGBD.
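
As background for the excerpts below, a hypothetical preprocessing step for skeleton input: each frame's 3-D joints are flattened into one feature vector and a 1-D convolution is applied over time. The joint count matches NTU-RGBD; all other sizes are arbitrary.

    import torch
    import torch.nn as nn

    batch, frames, joints = 4, 100, 25              # NTU-RGBD provides 25 joints
    skeletons = torch.randn(batch, frames, joints, 3)        # (x, y, z) per joint
    x = skeletons.view(batch, frames, joints * 3).transpose(1, 2)   # (B, 75, T)
    features = nn.Conv1d(75, 128, kernel_size=9, padding=4)(x)      # (B, 128, T)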

471 citations


Cites background or methods from "Temporal Convolutional Networks for..."

  • ...A D-dimensional feature vector, whether it is a deep feature from a spatial CNN such as fc7 activation of AlexNet [19] or a set of kinematic features [20], is extracted per each video frame....


  • ...In this section, we provide a brief overview of the structure of a TCN as provided in the original paper [20]....


  • ...In this light, we propose Temporal Convolutional Neural Networks (TCN) [20] applied to 3D Human Action Recognition....


  • ...Moreover, by model design based on temporal convolutions [20] and residual connections [12], we can begin to directly interpret what our model parameters and features represent....


Proceedings ArticleDOI
26 Mar 2019
TL;DR: This paper takes a data-driven approach to present the opportunities and design challenges faced by Facebook in order to enable machine learning inference locally on smartphones and other edge platforms.
Abstract: At Facebook, machine learning provides a wide range of capabilities that drive many aspects of user experience including ranking posts, content understanding, object detection and tracking for augmented and virtual reality, speech and text translations. While machine learning models are currently trained on customized datacenter infrastructure, Facebook is working to bring machine learning inference to the edge. By doing so, user experience is improved with reduced latency (inference time) and becomes less dependent on network connectivity. Furthermore, this also enables many more applications of deep learning with important features only made available at the edge. This paper takes a data-driven approach to present the opportunities and design challenges faced by Facebook in order to enable machine learning inference locally on smartphones and other edge platforms.

385 citations


Additional excerpts

  • ...[table excerpt, reflowed; each row: workload, model, two relative factors]
        Hand Tracking     U-Net [29]        10x    1x
        Image Model-1     GoogLeNet [30]    100x   1x
        Image Model-2     ShuffleNet [27]   10x    2x
        Pose Estimation   Mask-RCNN [28]    100x   4x
        Segmentation      TCN [31]          1x     1....


References
Posted Content
TL;DR: This article introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
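
The update rule is compact enough to transcribe directly; one Adam step in plain NumPy, with the published default hyper-parameters:

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        m = b1 * m + (1 - b1) * grad            # biased first-moment estimate
        v = b2 * v + (1 - b2) * grad ** 2       # biased second-moment estimate
        m_hat = m / (1 - b1 ** t)               # bias-corrected moments
        v_hat = v / (1 - b2 ** t)               # (t is the 1-based step count)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v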

23,486 citations

Proceedings ArticleDOI
07 Dec 2015
TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.
Abstract: We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets, 2) A homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets, and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset with only 10 dimensions and also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.
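
The homogeneous 3x3x3 design is simple to reproduce; a minimal PyTorch example on a 16-frame 112x112 clip (the input size used by C3D):

    import torch
    import torch.nn as nn

    clip = torch.randn(1, 3, 16, 112, 112)       # (batch, RGB, frames, H, W)
    conv3d = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1)
    features = conv3d(clip)                       # (1, 64, 16, 112, 112)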

7,091 citations


"Temporal Convolutional Networks for..." refers background in this paper

  • ...Large-scale Recognition: There has been substantial work on spatiotemporal models for large scale video classification and detection [31, 11, 12, 27, 33, 23, 19]....


Proceedings Article
30 Apr 2016
TL;DR: This work develops a new convolutional network module that is specifically designed for dense prediction, and shows that the presented context module increases the accuracy of state-of-the-art semantic segmentation systems.
Abstract: State-of-the-art models for semantic segmentation are based on adaptations of convolutional networks that had originally been designed for image classification. However, dense prediction and image classification are structurally different. In this work, we develop a new convolutional network module that is specifically designed for dense prediction. The presented module uses dilated convolutions to systematically aggregate multi-scale contextual information without losing resolution. The architecture is based on the fact that dilated convolutions support exponential expansion of the receptive field without loss of resolution or coverage. We show that the presented context module increases the accuracy of state-of-the-art semantic segmentation systems. In addition, we examine the adaptation of image classification networks to dense prediction and show that simplifying the adapted network can increase accuracy.
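
The resolution-preserving property is easy to verify: with dilation d and padding d, a 3x3 convolution leaves the spatial size unchanged while the receptive field grows exponentially with depth. A small PyTorch check:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 8, 64, 64)
    for i in range(4):                            # dilation 1, 2, 4, 8
        conv = nn.Conv2d(8, 8, 3, dilation=2 ** i, padding=2 ** i)
        x = torch.relu(conv(x))                   # stays (1, 8, 64, 64) throughout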

5,566 citations


"Temporal Convolutional Networks for..." refers methods in this paper

  • ...The Encoder-Decoder TCN is most similar to SegNet [2] whereas the Dilated TCN is most similar to the Multi-Scale Context model [38]....


Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.
Abstract: Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).
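
A sketch of the multiresolution "foveated" idea: a low-resolution context stream over the whole frame and a full-resolution fovea stream over the center crop, each at half the input size. The 178x178 input follows the paper; the rest is illustrative.

    import torch
    import torch.nn.functional as F

    frame = torch.randn(1, 3, 178, 178)           # paper's input resolution
    context = F.interpolate(frame, size=(89, 89)) # downsampled full frame
    fovea = frame[:, :, 44:133, 44:133]           # full-res 89x89 center crop
    # both streams feed identical conv towers whose outputs are fused later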

4,876 citations


"Temporal Convolutional Networks for..." refers background in this paper

  • ...Large-scale Recognition: There has been substantial work on spatiotemporal models for large scale video classification and detection [31, 11, 12, 27, 33, 23, 19]....


Proceedings ArticleDOI
01 Dec 2013
TL;DR: Dense trajectories, recently shown to be an efficient video representation for action recognition with state-of-the-art results on a variety of datasets, are improved by taking camera motion into account to correct them.
Abstract: Recently dense trajectories were shown to be an efficient video representation for action recognition and achieved state-of-the-art results on a variety of datasets. This paper improves their performance by taking into account camera motion to correct them. To estimate camera motion, we match feature points between frames using SURF descriptors and dense optical flow, which are shown to be complementary. These matches are, then, used to robustly estimate a homography with RANSAC. Human motion is in general different from camera motion and generates inconsistent matches. To improve the estimation, a human detector is employed to remove these matches. Given the estimated camera motion, we remove trajectories consistent with it. We also use this estimation to cancel out camera motion from the optical flow. This significantly improves motion-based descriptors, such as HOF and MBH. Experimental results on four challenging action datasets (i.e., Hollywood2, HMDB51, Olympic Sports and UCF50) significantly outperform the current state of the art.
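
The camera-motion step maps naturally onto OpenCV primitives; the following is a hedged sketch using synthetic matches in place of the paper's SURF and dense-optical-flow correspondences.

    import cv2
    import numpy as np

    # pts_prev / pts_next stand in for SURF + dense-optical-flow matches
    pts_prev = (np.random.rand(200, 1, 2) * 100).astype(np.float32)
    pts_next = pts_prev + 2.0                     # pretend global camera shift
    H, inliers = cv2.findHomography(pts_prev, pts_next, cv2.RANSAC, 1.0)

    frame = (np.random.rand(120, 160) * 255).astype(np.uint8)
    stabilized = cv2.warpPerspective(frame, H, (160, 120))  # cancel camera motion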

3,487 citations


"Temporal Convolutional Networks for..." refers methods in this paper

  • ...Rohrbach et al. [26] used Dense Trajectories [37] and human pose features on the MPII Cooking dataset....

