SST: Single-Stream Temporal Action Proposals

doi:10.1109/CVPR.2017.675

Home
/
Papers
/
SST: Single-Stream Temporal Action Proposals

Proceedings Article•DOI•

SST: Single-Stream Temporal Action Proposals

Shyamal Buch¹, Victor Escorcia², Chuanqi Shen¹, Bernard Ghanem², Juan Carlos Niebles¹ - Show less +1 more•Institutions (2)

Stanford University¹, King Abdullah University of Science and Technology²

01 Jul 2017-pp 6373-6382

TL;DR: It is demonstrated empirically that the new Single-Stream Temporal Action Proposals model outperforms the state-of-the-art on the task of temporal action proposal generation, while achieving some of the fastest processing speeds in the literature.

read less

Abstract: Our paper presents a new approach for temporal detection of human actions in long, untrimmed video sequences. We introduce Single-Stream Temporal Action Proposals (SST), a new effective and efficient deep architecture for the generation of temporal action proposals. Our network can run continuously in a single stream over very long input video sequences, without the need to divide input into short overlapping clips or temporal windows for batch processing. We demonstrate empirically that our model outperforms the state-of-the-art on the task of temporal action proposal generation, while achieving some of the fastest processing speeds in the literature. Finally, we demonstrate that using SST proposals in conjunction with existing action classifiers results in improved state-of-the-art temporal action detection performance.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

Rethinking the Faster R-CNN Architecture for Temporal Action Localization

[...]

Yu-Wei Chao¹, Sudheendra Vijayanarasimhan, Bryan Seybold, David A. Ross, Jia Deng¹, Rahul Sukthankar - Show less +2 more•Institutions (1)

University of Michigan¹

01 Jun 2018

TL;DR: TAL-Net as mentioned in this paper improves receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations and better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields.

...read moreread less

Abstract: We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster RCNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields; and (3) we explicitly consider multi-stream feature fusion and demonstrate that fusing motion late is important. We achieve state-of-the-art performance for both action proposal and localization on THUMOS'14 detection benchmark and competitive performance on ActivityNet challenge.

...read moreread less

647 citations

Book Chapter•DOI•

BSN: Boundary Sensitive Network for Temporal Action Proposal Generation

[...]

Tianwei Lin¹, Xu Zhao¹, Haisheng Su¹, Chongjing Wang, Ming Yang¹ - Show less +1 more•Institutions (1)

Shanghai Jiao Tong University¹

08 Sep 2018

TL;DR: An effective proposal generation method, named Boundary-Sensitive Network (BSN), which adopts "local to global" fashion and significantly improves the state-of-the-art temporal action detection performance.

...read moreread less

Abstract: Temporal action proposal generation is an important yet challenging problem, since temporal proposals with rich action content are indispensable for analysing real-world videos with long duration and high proportion irrelevant content. This problem requires methods not only generating proposals with precise temporal boundaries, but also retrieving proposals to cover truth action instances with high recall and high overlap using relatively fewer proposals. To address these difficulties, we introduce an effective proposal generation method, named Boundary-Sensitive Network (BSN), which adopts “local to global” fashion. Locally, BSN first locates temporal boundaries with high probabilities, then directly combines these boundaries as proposals. Globally, with Boundary-Sensitive Proposal feature, BSN retrieves proposals by evaluating the confidence of whether a proposal contains an action within its region. We conduct experiments on two challenging datasets: ActivityNet-1.3 and THUMOS14, where BSN outperforms other state-of-the-art temporal action proposal generation methods with high recall and high temporal precision. Finally, further experiments demonstrate that by combining existing action classifiers, our method significantly improves the state-of-the-art temporal action detection performance.

...read moreread less

546 citations

Cites background or methods or result from "SST: Single-Stream Temporal Action ..."

...Recently some methods [4,5,9,13,32] generate proposals with pre-defined temporal durations and intervals, and use multiple methods to evaluate the confidence score of proposals, such as dictionary learning [5] and recurrent neural network [9]....
[...]
...For video feature encoding, except for two-stream network, C3D network [35] is also adopted in some works [4,9,13,32]....
[...]
...Most recently proposal generation methods [4,5,9,32] generate proposals via sliding temporal windows of multiple durations in video with regular interval, then train a model to evaluate the confidence scores of generated proposals for proposals retrieving, while there is also method [13] making external boundaries regression....
[...]
...Thus recently temporal action proposal generation has received much attention [4,5,9,13], aiming to improve the detection performance by improving the quality of proposals....
[...]
...The comparison results of THUMOS14 shown in Table 6 suggest that (1) using same action classifier, our method achieves significantly better performance than other proposal generation methods; (2) comparing with proposal-level classifier [32], video-level classifier [43] achieves better performance on BSN proposals and worse performance on [4] and [13] proposals, which indicates that confidence scores generated by BSN are more reliable than scores generated by proposal-level classifier, and are reliable enough for retrieving detection results in action detection task; (3) detection framework based on our proposals significantly outperforms state-of-the-art action detection methods, especially when the overlap threshold is high....
[...]

Proceedings Article•DOI•

Graph Convolutional Networks for Temporal Action Localization

[...]

Runhao Zeng¹, Wenbing Huang², Chuang Gan³, Mingkui Tan¹, Yu Rong², Peilin Zhao², Junzhou Huang⁴ - Show less +3 more•Institutions (4)

South China University of Technology¹, Tencent², Massachusetts Institute of Technology³, University of Texas at Arlington⁴

01 Oct 2019

TL;DR: Zhang et al. as mentioned in this paper exploit the proposal-proposal relations using GraphConvolutional Networks (GCNs) to exploit the context information for each proposal and the correlations between distinct actions.

...read moreread less

Abstract: Most state-of-the-art action localization systems process each action proposal individually, without explicitly exploiting their relations during learning. However, the relations between proposals actually play an important role in action localization, since a meaningful action always consists of multiple proposals in a video. In this paper, we propose to exploit the proposal-proposal relations using GraphConvolutional Networks (GCNs). First, we construct an action proposal graph, where each proposal is represented as a node and their relations between two proposals as an edge. Here, we use two types of relations, one for capturing the context information for each proposal and the other one for characterizing the correlations between distinct actions. Then we apply the GCNs over the graph to model the relations among different proposals and learn powerful representations for the action classification and localization. Experimental results show that our approach significantly outperforms the state-of-the-art on THUMOS14(49.1% versus 42.8%). Moreover, augmentation experiments on ActivityNet also verify the efficacy of modeling action proposal relationships.

...read moreread less

460 citations

Proceedings Article•DOI•

BMN: Boundary-Matching Network for Temporal Action Proposal Generation

[...]

Tianwei Lin¹, Xiao Liu¹, Li Xin¹, Errui Ding¹, Shilei Wen¹ - Show less +1 more•Institutions (1)

Baidu¹

01 Oct 2019

TL;DR: This work proposes an effective, efficient and end-to-end proposal generation method, named Boundary-Matching Network (BMN), which generates proposals with precise temporal boundaries as well as reliable confidence scores simultaneously, and can achieve state-of-the-art temporal action detection performance.

...read moreread less

Abstract: Temporal action proposal generation is an challenging and promising task which aims to locate temporal regions in real-world videos where action or event may occur. Current bottom-up proposal generation methods can generate proposals with precise boundary, but cannot efficiently generate adequately reliable confidence scores for retrieving proposals. To address these difficulties, we introduce the Boundary-Matching (BM) mechanism to evaluate confidence scores of densely distributed proposals, which denote a proposal as a matching pair of starting and ending boundaries and combine all densely distributed BM pairs into the BM confidence map. Based on BM mechanism, we propose an effective, efficient and end-to-end proposal generation method, named Boundary-Matching Network (BMN), which generates proposals with precise temporal boundaries as well as reliable confidence scores simultaneously. The two-branches of BMN are jointly trained in an unified framework. We conduct experiments on two challenging datasets: THUMOS-14 and ActivityNet-1.3, where BMN shows significant performance improvement with remarkable efficiency and generalizability. Further, combining with existing action classifier, BMN can achieve state-of-the-art temporal action detection performance.

...read moreread less

453 citations

Cites methods from "SST: Single-Stream Temporal Action ..."

...Most existing proposal generation methods [3, 4, 8, 24] adopted a “top-down” fashion to generate proposals with multi-scale temporal sliding windows in regular interval, and then evaluate confidence scores of proposals respectively or simultaneously....
[...]
...Comparison between our method with state-of-the-art proposal generation methods SCNN [24], SST [3], TURN [12], TAG [36], CTAP [10], BSN [18] on THUMOS-14 dataset in terms of AR@AN, where SNMS stands for Soft-NMS....
[...]
...For proposal generation task, most previous works [3, 4, 8, 12, 24] adopt top-down fashion to generate proposals with pre-defined duration and interval, where the main drawback is the lack of boundary precision and duration flexibility....
[...]
...Following recent proposal generation methods [3, 8, 12, 18], we construct BMN model upon visual feature sequence extracted from raw video....
[...]

Proceedings Article•DOI•

Weakly Supervised Action Localization by Sparse Temporal Pooling Network

[...]

Phuc Xuan Nguyen¹, Bohyung Han², Ting Liu², Gautam Prasad³•Institutions (3)

University of California, Irvine¹, Google², Seoul National University³

18 Jun 2018

TL;DR: In this article, a weakly supervised temporal action localization algorithm is proposed, which learns from video-level class labels and predicts temporal intervals of human actions with no requirement of temporal localization annotations.

...read moreread less

Abstract: We propose a weakly supervised temporal action localization algorithm on untrimmed videos using convolutional neural networks. Our algorithm learns from video-level class labels and predicts temporal intervals of human actions with no requirement of temporal localization annotations. We design our network to identify a sparse subset of key segments associated with target actions in a video using an attention module and fuse the key segments through adaptive temporal pooling. Our loss function is comprised of two terms that minimize the video-level action classification error and enforce the sparsity of the segment selection. At inference time, we extract and score temporal proposals using temporal class activations and class-agnostic attentions to estimate the time intervals that correspond to target actions. The proposed algorithm attains state-of-the-art results on the THUMOS14 dataset and outstanding performance on ActivityNet1.3 even with its weak supervision.

...read moreread less

353 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79

Collapse

References

PDF

Open Access

More filters

Proceedings Article•

Adam: A Method for Stochastic Optimization

[...]

Diederik P. Kingma¹, Jimmy Ba²•Institutions (2)

University of Amsterdam¹, University of Toronto²

01 Jan 2015

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

...read moreread less

Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.

...read moreread less

111,197 citations

Posted Content•

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

[...]

Shaoqing Ren¹, Kaiming He², Ross Girshick³, Jian Sun²•Institutions (3)

University of Science and Technology of China¹, Microsoft², Facebook³

04 Jun 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: Faster R-CNN as discussed by the authors proposes a Region Proposal Network (RPN) to generate high-quality region proposals, which are used by Fast R-NN for detection.

...read moreread less

Abstract: State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features---using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.

...read moreread less

23,183 citations

Proceedings Article•DOI•

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

[...]

Ross Girshick¹, Jeff Donahue¹, Trevor Darrell¹, Jitendra Malik¹•Institutions (1)

University of California, Berkeley¹

23 Jun 2014

TL;DR: RCNN as discussed by the authors combines CNNs with bottom-up region proposals to localize and segment objects, and when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.

...read moreread less

Abstract: Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.

...read moreread less

21,729 citations

Proceedings Article•DOI•

Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation

[...]

Kyunghyun Cho¹, Bart van Merriënboer², Caglar Gulcehre², Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio³, Yoshua Bengio⁴, Yoshua Bengio⁵ - Show less +5 more•Institutions (5)

Aalto University¹, Université de Montréal², AT&T³, Alcatel-Lucent⁴, École Polytechnique de Montréal⁵

01 Jan 2014

TL;DR: In this paper, the encoder and decoder of the RNN Encoder-Decoder model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.

...read moreread less

Abstract: In this paper, we propose a novel neural network model called RNN Encoder‐ Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixedlength vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder‐Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.

...read moreread less

19,998 citations

Proceedings Article•

Faster R-CNN: towards real-time object detection with region proposal networks

[...]

Shaoqing Ren¹, Kaiming He¹, Ross Girshick¹, Jian Sun¹•Institutions (1)

Microsoft¹

07 Dec 2015

TL;DR: Ren et al. as discussed by the authors proposed a region proposal network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals.

...read moreread less

Abstract: State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [19], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. Code is available at https://github.com/ShaoqingRen/faster_rcnn.

...read moreread less

13,674 citations