Author

Myunggi Lee

Bio: Myunggi Lee is an academic researcher from Seoul National University. The author has contributed to research in topics: Deep learning & Software system. The author has an h-index of 5 and has co-authored 11 publications receiving 181 citations.

Papers
Book ChapterDOI
Myunggi Lee, Seungeui Lee, Sung Joon Son, Gyutae Park, Nojun Kwak
08 Sep 2018
TL;DR: MFNet as mentioned in this paper uses motion blocks to encode spatio-temporal information between adjacent frames in a unified network that can be trained end-to-end with only a small additional cost.
Abstract: Spatio-temporal representations in frame sequences play an important role in the task of action recognition. Previously, a method that uses optical flow as temporal information in combination with a set of RGB images containing spatial information has shown great performance enhancement in action recognition tasks. However, it has a high computational cost and requires a two-stream (RGB and optical flow) framework. In this paper, we propose MFNet (Motion Feature Network), which contains motion blocks that make it possible to encode spatio-temporal information between adjacent frames in a unified network that can be trained end-to-end. The motion block can be attached to any existing CNN-based action recognition framework with only a small additional cost. We evaluate our network on two action recognition datasets (Jester and Something-Something) and achieve competitive performance on both datasets by training the networks from scratch.

86 citations
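A minimal Python/PyTorch sketch of the general idea described in the abstract above: encoding temporal differences between adjacent-frame features inside a 2D-CNN backbone. This is an illustration under assumptions (module name, channel reduction, zero-padding of the last frame, residual fusion), not the authors' motion block, whose exact filter design is given in the paper.

import torch
import torch.nn as nn

class MotionBlockSketch(nn.Module):
    # Encodes differences between adjacent-frame feature maps and fuses them
    # back into the backbone through a residual connection (illustrative only).
    def __init__(self, channels, reduced_channels=None):
        super().__init__()
        reduced_channels = reduced_channels or channels // 4
        self.reduce = nn.Conv2d(channels, reduced_channels, kernel_size=1, bias=False)
        self.expand = nn.Conv2d(reduced_channels, channels, kernel_size=1, bias=False)

    def forward(self, x, num_frames):
        # x: (batch * num_frames, C, H, W), frames of each clip stacked along the batch axis
        bt, c, h, w = x.shape
        b = bt // num_frames
        feat = self.reduce(x).view(b, num_frames, -1, h, w)
        # temporal difference between neighboring frames; the last frame is zero-padded
        diff = feat[:, 1:] - feat[:, :-1]
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)
        motion = self.expand(diff.reshape(bt, -1, h, w))
        return x + motion

# usage: block = MotionBlockSketch(256); y = block(features, num_frames=8)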

Posted Content
Myunggi Lee, Seungeui Lee, Sung Joon Son, Gyutae Park, Nojun Kwak
TL;DR: This paper proposes MFNet (Motion Feature Network) containing motion blocks which make it possible to encode spatio-temporal information between adjacent frames in a unified network that can be trained end-to-end.
Abstract: Spatio-temporal representations in frame sequences play an important role in the task of action recognition. Previously, a method that uses optical flow as temporal information in combination with a set of RGB images containing spatial information has shown great performance enhancement in action recognition tasks. However, it has a high computational cost and requires a two-stream (RGB and optical flow) framework. In this paper, we propose MFNet (Motion Feature Network), which contains motion blocks that make it possible to encode spatio-temporal information between adjacent frames in a unified network that can be trained end-to-end. The motion block can be attached to any existing CNN-based action recognition framework with only a small additional cost. We evaluate our network on two action recognition datasets (Jester and Something-Something) and achieve competitive performance on both datasets by training the networks from scratch.

77 citations

Proceedings ArticleDOI
28 Dec 2015
TL;DR: The controllers for THORMANG were developed with stability as the first priority, because a humanoid rescue robot must perform complex tasks in unexpected environments.
Abstract: This paper presents the technical approaches, including the system architecture and the controllers, used by Team SNU at the DARPA Robotics Challenge (DRC) Finals 2015. The platform we used, THORMANG, is a modular humanoid robot developed by ROBOTIS. On top of this platform, Team SNU developed an iris camera module and an end effector with a passive palm in order to increase the success rate of the tasks at the DRC Finals. We also developed a software architecture to operate the robot intuitively in spite of degraded communication. The interface enables the operator to select the sensor data to be communicated during each task. These efforts on the hardware and software reduce the operation time of the tasks and increase the reliability of the robot. Finally, the controllers for THORMANG were developed with stability as the first priority, because a humanoid rescue robot must perform complex tasks in unexpected environments. The proposed approaches were verified at the DRC Finals 2015, where Team SNU ranked 12th out of 23 teams.

25 citations

Journal ArticleDOI
TL;DR: This paper presents the technical approaches used and experimental results obtained by Team SNU (Seoul National University) at the 2015 DARPA Robotics Challenge (DRC) Finals and a number of lessons learned by analyzing the 2015 DRC Finals.
Abstract: This paper presents the technical approaches used and the experimental results obtained by Team SNU (Seoul National University) at the 2015 DARPA Robotics Challenge (DRC) Finals. Team SNU is one of the newly qualified teams, unlike the 12 teams that previously participated in the December 2013 DRC Trials. The hardware platform we used, THORMANG, was developed by ROBOTIS and was one of the smallest robots at the DRC Finals. Based on this platform, we focused on developing the software architecture and controllers needed to perform complex tasks in disaster response situations and on modifying hardware modules to maximize manipulability. Stability and modularization are the two main keywords in the technical approach of the architecture. We designed our interface and controllers to achieve a higher level of robustness against disaster situations. Moreover, we concentrated on developing our software architecture by integrating a number of modules to reduce software system complexity and programming errors. With these efforts on the hardware and software, we successfully finished the competition without falling and ranked 12th out of 23 teams. The paper concludes with a number of lessons learned from analyzing the 2015 DRC Finals.

18 citations

Posted Content
TL;DR: A novel UV map generative model that learns to generate diverse and realistic synthetic UV maps without requiring high-quality UV maps for training is presented.
Abstract: Reconstructing 3D human faces in the wild with the 3D Morphable Model (3DMM) has become popular in recent years. While most prior work focuses on estimating more robust and accurate geometry, relatively little attention has been paid to improving the quality of the texture model. Meanwhile, with the advent of Generative Adversarial Networks (GANs), there has been great progress in reconstructing realistic 2D images. Recent work demonstrates that GANs trained with abundant high-quality UV maps can produce high-fidelity textures superior to those produced by existing methods. However, such high-quality UV maps are difficult to obtain because they are expensive to acquire and require laborious refinement. In this work, we present a novel UV map generative model that learns to generate diverse and realistic synthetic UV maps without requiring high-quality UV maps for training. Our proposed framework can be trained solely with in-the-wild images (i.e., UV maps are not required) by leveraging a combination of GANs and a differentiable renderer. Both quantitative and qualitative evaluations demonstrate that our proposed texture model produces more diverse and higher-fidelity textures compared to existing methods.

8 citations
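A schematic Python/PyTorch training step for the kind of framework described above, assuming a standard non-saturating GAN loss: a generator produces a UV texture, a differentiable renderer maps it onto 3DMM geometry estimated from in-the-wild images, and a discriminator compares renderings with the real photos. All module and function names (generator.latent_dim, fit_3dmm, renderer) are hypothetical placeholders, not the authors' implementation.

import torch
import torch.nn.functional as F

def training_step(generator, discriminator, renderer, fit_3dmm, real_images, opt_g, opt_d):
    # generator: latent code -> UV texture map; renderer: differentiable rendering of the
    # texture onto 3DMM geometry; fit_3dmm: estimates geometry and camera from the images.
    z = torch.randn(real_images.size(0), generator.latent_dim)  # latent_dim is an assumed attribute
    uv_texture = generator(z)
    geometry, camera = fit_3dmm(real_images)
    fake_images = renderer(geometry, uv_texture, camera)

    # discriminator update: real in-the-wild photos vs. rendered faces (non-saturating GAN loss)
    d_loss = (F.softplus(-discriminator(real_images)).mean()
              + F.softplus(discriminator(fake_images.detach())).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # generator update: gradients flow through the differentiable renderer into the UV generator
    g_loss = F.softplus(-discriminator(fake_images)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()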


Cited by

Proceedings ArticleDOI
01 Oct 2019
TL;DR: Temporal Shift Module (TSM) as mentioned in this paper shifts part of the channels along the temporal dimension to facilitate information exchange among neighboring frames; it can be inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters.
Abstract: The explosive growth in video streaming gives rise to challenges in performing video understanding at high accuracy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNN based methods can achieve good performance but are computationally intensive, making them expensive to deploy. In this paper, we propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance. Specifically, it can achieve the performance of 3D CNNs while maintaining 2D CNNs' complexity. TSM shifts part of the channels along the temporal dimension, thus facilitating information exchange among neighboring frames. It can be inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters. We also extend TSM to the online setting, which enables real-time low-latency online video recognition and video object detection. TSM is accurate and efficient: it ranked first on the Something-Something leaderboard upon publication, and on Jetson Nano and Galaxy Note8 it achieves low latencies of 13 ms and 35 ms, respectively, for online video recognition. The code is available at: https://github.com/mit-han-lab/temporal-shift-module.

892 citations
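A minimal sketch of the temporal shift operation described above: one group of channels is shifted toward earlier frames, another toward later frames, and the rest left in place, adding no parameters and negligible computation. The fold size of 1/8 and the (batch x frames, C, H, W) layout are common choices assumed here, not necessarily the exact settings of the released code.

import torch

def temporal_shift(x, num_frames, fold_div=8):
    # x: (batch * num_frames, C, H, W), frames of each clip stacked along the batch axis
    bt, c, h, w = x.shape
    b = bt // num_frames
    x = x.view(b, num_frames, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # shift one channel group toward earlier frames
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # shift another group toward later frames
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # keep the remaining channels in place
    return out.view(bt, c, h, w)

# usage: apply inside a residual block of a 2D CNN before the convolution,
# e.g. y = conv(temporal_shift(features, num_frames=8))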

Posted Content
TL;DR: A generic and effective Temporal Shift Module (TSM) that achieves the performance of 3D CNNs while maintaining 2D CNNs' complexity, and is extended to the online setting, enabling real-time low-latency online video recognition and video object detection.
Abstract: The explosive growth in video streaming gives rise to challenges in performing video understanding at high accuracy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNN based methods can achieve good performance but are computationally intensive, making them expensive to deploy. In this paper, we propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance. Specifically, it can achieve the performance of 3D CNNs while maintaining 2D CNNs' complexity. TSM shifts part of the channels along the temporal dimension, thus facilitating information exchange among neighboring frames. It can be inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters. We also extend TSM to the online setting, which enables real-time low-latency online video recognition and video object detection. TSM is accurate and efficient: it ranked first on the Something-Something leaderboard upon publication, and on Jetson Nano and Galaxy Note8 it achieves low latencies of 13 ms and 35 ms, respectively, for online video recognition. The code is available at: https://github.com/mit-han-lab/temporal-shift-module.

721 citations

Proceedings ArticleDOI
04 Apr 2019
TL;DR: It is empirically demonstrated that the amount of channel interactions plays an important role in the accuracy of 3D group convolutional networks, and this leads to an architecture -- Channel-Separated Convolutional Network (CSN) -- which is simple, efficient, yet accurate.
Abstract: Group convolution has been shown to offer great computational savings in various 2D convolutional architectures for image classification. It is natural to ask: 1) if group convolution can help to alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks. This paper studies the effects of different design choices in 3D group convolutional networks for video classification. We empirically demonstrate that the amount of channel interactions plays an important role in the accuracy of 3D group convolutional networks. Our experiments suggest two main findings. First, it is a good practice to factorize 3D convolutions by separating channel interactions and spatiotemporal interactions as this leads to improved accuracy and lower computational cost. Second, 3D channel-separated convolutions provide a form of regularization, yielding lower training accuracy but higher test accuracy compared to 3D convolutions. These two empirical findings lead us to design an architecture -- Channel-Separated Convolutional Network (CSN) -- which is simple, efficient, yet accurate. On Sports1M and Kinetics, our CSNs are comparable with or better than the state-of-the-art while being 2-3 times more efficient.

505 citations
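A short sketch of the factorization the abstract describes: channel interactions handled by a pointwise 1x1x1 convolution and spatiotemporal interactions by a depthwise 3x3x3 convolution with one group per channel. Layer sizes and the exact block composition are illustrative assumptions, not the paper's full CSN architecture.

import torch.nn as nn

def channel_separated_conv3d(in_channels, out_channels):
    return nn.Sequential(
        # pointwise 1x1x1 convolution: all channel interactions, no spatiotemporal extent
        nn.Conv3d(in_channels, out_channels, kernel_size=1, bias=False),
        # depthwise 3x3x3 convolution: spatiotemporal filtering, no channel mixing
        nn.Conv3d(out_channels, out_channels, kernel_size=3, padding=1,
                  groups=out_channels, bias=False),
    )

# usage: layer = channel_separated_conv3d(64, 128); y = layer(clip)  # clip: (N, 64, T, H, W)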

Proceedings ArticleDOI
Christoph Feichtenhofer
14 Jun 2020
TL;DR: This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth, finding that networks with high spatiotemporal resolution can perform well, while being extremely light in terms of network width and parameters.
Abstract: This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth. Inspired by feature selection methods in machine learning, a simple stepwise network expansion approach is employed that expands a single axis in each step, such that a good accuracy-to-complexity trade-off is achieved. To expand X3D to a specific target complexity, we perform progressive forward expansion followed by backward contraction. X3D achieves state-of-the-art performance while requiring 4.8x and 5.5x fewer multiply-adds and parameters for similar accuracy to previous work. Our most surprising finding is that networks with high spatiotemporal resolution can perform well while being extremely light in terms of network width and parameters. We report competitive accuracy at unprecedented efficiency on video classification and detection benchmarks. Code is available at: https://github.com/facebookresearch/SlowFast.

392 citations
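A schematic sketch of the stepwise expansion strategy described above: at each step one candidate axis (frames, resolution, width, depth, ...) is expanded, the candidate with the best accuracy gain per unit of added compute is kept, and the loop repeats until the target complexity is reached. The selection criterion and the helper functions (train_and_eval, expand, flops) are hypothetical simplifications, not the paper's exact procedure.

def progressive_expand(base_config, axes, target_flops, train_and_eval, expand, flops):
    # Greedy forward expansion: enlarge exactly one axis per step and keep the
    # candidate that improves accuracy the most per unit of additional compute.
    config = dict(base_config)
    base_acc = train_and_eval(config)
    while flops(config) < target_flops:
        best = None
        for axis in axes:
            candidate = expand(config, axis)   # e.g. more frames, higher resolution, wider, deeper
            acc = train_and_eval(candidate)    # small proxy training run to score the candidate
            gain = (acc - base_acc) / max(flops(candidate) - flops(config), 1e-9)
            if best is None or gain > best[0]:
                best = (gain, acc, candidate)
        _, base_acc, config = best
    return config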