Author

Mohammadreza Zolfaghari

Other affiliations: Sharif University of Technology, Amazon.com

Bio: Mohammadreza Zolfaghari is an academic researcher from the University of Freiburg. The author has contributed to research in topics: Network architecture & K-SVD. The author has an h-index of 11 and has co-authored 20 publications receiving 1,121 citations. Previous affiliations of Mohammadreza Zolfaghari include Sharif University of Technology & Amazon.com.

Topics: Network architecture, K-SVD, 3D pose estimation, Pose, Sensory cue

Papers

Book Chapter•DOI•

ECO: Efficient Convolutional Network for Online Video Understanding

Mohammadreza Zolfaghari¹, Kamaljeet Singh¹, Thomas Brox¹•Institutions (1)

University of Freiburg¹

08 Sep 2018

TL;DR: A network architecture that takes long-term content into account and enables fast per-video processing at the same time and achieves competitive performance across all datasets while being 10 to 80 times faster than state-of-the-art methods.

Abstract: The state of the art in video understanding suffers from two problems: (1) The major part of reasoning is performed locally in the video, therefore, it misses important relationships within actions that span several seconds. (2) While there are local methods with fast per-frame processing, the processing of the whole video is not efficient and hampers fast video retrieval or online classification of long-term activities. In this paper, we introduce a network architecture (https://github.com/mzolfaghari/ECO-efficient-video-understanding) that takes long-term content into account and enables fast per-video processing at the same time. The architecture is based on merging long-term content already in the network rather than in a post-hoc fusion. Together with a sampling strategy, which exploits that neighboring frames are largely redundant, this yields high-quality action classification and video captioning at up to 230 videos per second, where each video can consist of a few hundred frames. The approach achieves competitive performance across all datasets while being 10× to 80× faster than state-of-the-art methods.

330 citations
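
The ECO abstract above mentions a sampling strategy that exploits the redundancy of neighboring frames, without spelling it out. A minimal sketch of one way to do this, assuming the video is split into equal-length segments and one frame is drawn at random from each; the function name, its parameters, and the segment scheme itself are illustrative assumptions, not details taken from this page:

```python
import random


def sample_frame_indices(num_frames, num_segments, rng=None):
    """Pick one frame index per segment (hypothetical helper, not the authors' code).

    The video is cut into `num_segments` contiguous segments of roughly equal
    length, and a single frame is drawn uniformly from each segment, so the
    sample covers the whole video while skipping near-duplicate neighbors.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    rng = rng or random.Random()
    indices = []
    for k in range(num_segments):
        start = k * num_frames // num_segments
        end = (k + 1) * num_frames // num_segments
        # A segment can be empty when the video has fewer frames than
        # segments; fall back to the segment's start frame in that case.
        indices.append(rng.randrange(start, max(end, start + 1)))
    return indices


if __name__ == "__main__":
    # A 300-frame video reduced to 16 frames spread over its full length.
    print(sample_frame_indices(300, 16, random.Random(0)))
```

Under this assumption the network processes a fixed number of frames per video regardless of its length, which is consistent with the per-video (rather than per-frame) throughput quoted in the abstract.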

Posted Content•

ECO: Efficient Convolutional Network for Online Video Understanding

Mohammadreza Zolfaghari¹, Kamaljeet Singh¹, Thomas Brox¹•Institutions (1)

University of Freiburg¹

24 Apr 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, a network architecture that takes long-term content into account and enables fast per-video processing at the same time is proposed, which achieves competitive performance across all datasets while being 10x to 80x faster than state-of-the-art methods.

Abstract: The state of the art in video understanding suffers from two problems: (1) The major part of reasoning is performed locally in the video, therefore, it misses important relationships within actions that span several seconds. (2) While there are local methods with fast per-frame processing, the processing of the whole video is not efficient and hampers fast video retrieval or online classification of long-term activities. In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time. The architecture is based on merging long-term content already in the network rather than in a post-hoc fusion. Together with a sampling strategy, which exploits that neighboring frames are largely redundant, this yields high-quality action classification and video captioning at up to 230 videos per second, where each video can consist of a few hundred frames. The approach achieves competitive performance across all datasets while being 10x to 80x faster than state-of-the-art methods.

293 citations

Proceedings Article•DOI•

Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection

Mohammadreza Zolfaghari¹, Gabriel L. Oliveira¹, Nima Sedaghat¹, Thomas Brox¹•Institutions (1)

University of Freiburg¹

01 Oct 2017

TL;DR: This paper proposes a network architecture that computes and integrates the most important visual cues for action recognition (pose, motion, and the raw images) and introduces a Markov chain model that adds cues successively.

Abstract: General human action recognition requires understanding of various visual cues. In this paper, we propose a network architecture that computes and integrates the most important visual cues for action recognition: pose, motion, and the raw images. For the integration, we introduce a Markov chain model which adds cues successively. The resulting approach is efficient and applicable to action classification as well as to spatial and temporal action localization. The two contributions clearly improve the performance over respective baselines. The overall approach achieves state-of-the-art action classification performance on HMDB51, J-HMDB and NTU RGB+D datasets. Moreover, it yields state-of-the-art spatio-temporal action localization results on UCF101 and J-HMDB.

209 citations

Posted Content•

Orientation-boosted Voxel Nets for 3D Object Recognition

Nima Sedaghat¹, Mohammadreza Zolfaghari¹, Ehsan Amiri, Thomas Brox¹•Institutions (1)

University of Freiburg¹

12 Apr 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, the authors argue that objects induce different features in the network under rotation and propose a multi-task approach, in which the network is trained to predict the pose of the object in addition to the class label.

Abstract: Recent work has shown good recognition results in 3D object recognition using 3D convolutional networks. In this paper, we show that the object orientation plays an important role in 3D recognition. More specifically, we argue that objects induce different features in the network under rotation. Thus, we approach the category-level classification task as a multi-task problem, in which the network is trained to predict the pose of the object in addition to the class label as a parallel task. We show that this yields significant improvements in the classification results. We test our suggested architecture on several datasets representing various 3D data sources: LiDAR data, CAD models, and RGB-D images. We report state-of-the-art results on classification as well as significant improvements in precision and speed over the baseline on 3D detection.

170 citations

Proceedings Article•DOI•

Orientation-boosted Voxel Nets for 3D Object Recognition

Nima Sedaghat¹, Mohammadreza Zolfaghari¹, Ehsan Amiri, Thomas Brox¹•Institutions (1)

University of Freiburg¹

01 Jan 2017

139 citations

Cited by

Journal Article•

Study Report on “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”

杉山拓海

12 Sep 2017-Computers & Graphics

3,940 citations

Proceedings Article•DOI•

SlowFast Networks for Video Recognition

Christoph Feichtenhofer¹, Haoqi Fan¹, Jitendra Malik², Kaiming He¹•Institutions (2)

Facebook¹, University of California, Berkeley²

01 Oct 2019

TL;DR: This work presents SlowFast networks for video recognition, which achieve strong performance for both action classification and detection in video; the large improvements are attributed to the SlowFast concept.

Abstract: We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA. Code has been made available at: https://github.com/facebookresearch/SlowFast.

2,320 citations

Proceedings Article•DOI•

OctNet: Learning Deep 3D Representations at High Resolutions

Gernot Riegler¹, Ali Osman Ulusoy², Andreas Geiger²•Institutions (2)

Graz University of Technology¹, Max Planck Society²

21 Jul 2017

TL;DR: The utility of the OctNet representation is demonstrated by analyzing the impact of resolution on several 3D tasks including 3D object classification, orientation estimation and point cloud labeling.

Abstract: We present OctNet, a representation for deep learning with sparse 3D data. In contrast to existing models, our representation enables 3D convolutional networks which are both deep and high resolution. Towards this goal, we exploit the sparsity in the input data to hierarchically partition the space using a set of unbalanced octrees where each leaf node stores a pooled feature representation. This allows to focus memory allocation and computation to the relevant dense regions and enables deeper networks without compromising resolution. We demonstrate the utility of our OctNet representation by analyzing the impact of resolution on several 3D tasks including 3D object classification, orientation estimation and point cloud labeling.

1,280 citations

Proceedings Article•DOI•

Dynamic Edge-Conditioned Filters in Convolutional Neural Networks on Graphs

Martin Simonovsky¹, Nikos Komodakis¹•Institutions (1)

École des ponts ParisTech¹

21 Jul 2017

TL;DR: This work generalizes the convolution operator from regular grids to arbitrary graphs while avoiding the spectral domain, which allows us to handle graphs of varying size and connectivity.

Abstract: A number of problems can be formulated as prediction on graph-structured data. In this work, we generalize the convolution operator from regular grids to arbitrary graphs while avoiding the spectral domain, which allows us to handle graphs of varying size and connectivity. To move beyond a simple diffusion, filter weights are conditioned on the specific edge labels in the neighborhood of a vertex. Together with the proper choice of graph coarsening, we explore constructing deep neural networks for graph classification. In particular, we demonstrate the generality of our formulation in point cloud classification, where we set the new state of the art, and on a graph classification dataset, where we outperform other deep learning approaches.

957 citations

Proceedings Article•DOI•

TSM: Temporal Shift Module for Efficient Video Understanding

Ji Lin¹, Chuang Gan¹, Song Han¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Oct 2019

TL;DR: The Temporal Shift Module (TSM) shifts part of the channels along the temporal dimension to facilitate information exchange among neighboring frames; it can be inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters.

Abstract: The explosive growth in video streaming gives rise to challenges on performing video understanding at high accuracy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNN based methods can achieve good performance but are computationally intensive, making it expensive to deploy. In this paper, we propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance. Specifically, it can achieve the performance of 3D CNN but maintain 2D CNN’s complexity. TSM shifts part of the channels along the temporal dimension; thus facilitate information exchanged among neighboring frames. It can be inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters. We also extended TSM to online setting, which enables real-time low-latency online video recognition and video object detection. TSM is accurate and efficient: it ranks the first place on the Something-Something leaderboard upon publication; on Jetson Nano and Galaxy Note8, it achieves a low latency of 13ms and 35ms for online video recognition. The code is available at: https://github.com/mit-han-lab/temporal-shift-module.

892 citations
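
The TSM abstract above describes shifting part of the channels along the temporal dimension so that neighboring frames exchange information. A minimal pure-Python sketch of that idea, assuming a bidirectional shift with zero padding at the clip boundaries; the function name, the `fold_div` parameter, and the choice of which channels move in which direction are illustrative assumptions rather than details stated in the abstract:

```python
def temporal_shift(frames, fold_div=8):
    """Shift a fraction of channels along the time axis (illustrative sketch).

    frames:   list of T frames, each a list of C per-channel values.
    fold_div: hypothetical knob; C // fold_div channels are shifted in each
              direction and the remaining channels are left in place.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1      # first group: read from the next frame
            elif c < 2 * fold:
                src = t - 1      # second group: read from the previous frame
            else:
                src = t          # remaining channels: unchanged
            if 0 <= src < num_frames:
                out[t][c] = frames[src][c]
            # Otherwise the slot keeps its zero padding.
    return out


if __name__ == "__main__":
    clip = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # T=3 frames, C=4 channels
    for row in temporal_shift(clip, fold_div=4):
        print(row)
```

The shift only moves existing values between time steps, so it adds no multiplications and no learnable weights, which matches the "zero computation and zero parameters" wording in the abstract.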
