Author

Changmin Lee

Other affiliations: Samsung
Bio: Changmin Lee is an academic researcher from Yonsei University. The author has contributed to research in topics: Microarchitecture & Thread (computing). The author has an h-index of 5 and has co-authored 18 publications receiving 93 citations. Previous affiliations of Changmin Lee include Samsung.

Papers
Journal ArticleDOI
TL;DR: A cooperative heterogeneous computing framework is presented that enables efficient utilization of available host CPU cores for CUDA kernels, which are designed to run only on the GPU, without any source recompilation.
Abstract: This paper presents a cooperative heterogeneous computing framework which enables the efficient utilization of available computing resources of host CPU cores for CUDA kernels, which are designed to run only on GPU. The proposed system exploits at runtime the coarse-grain thread-level parallelism across CPU and GPU, without any source recompilation. To this end, three features including a work distribution module, a transparent memory space, and a global scheduling queue are described in this paper. With a completely automatic runtime workload distribution, the proposed framework achieves speedups of 3.08× in the best case and 1.42× on average compared to the baseline GPU-only processing.
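
To make the mechanism concrete, here is a minimal sketch of the queue-based, coarse-grain distribution the abstract describes: a shared global queue of work chunks drained concurrently by CPU workers and a GPU worker. This is an illustration of the idea, not the authors' implementation; the chunk size and the execute callbacks are assumptions.

```python
# Hypothetical sketch: coarse-grain work distribution across CPU and GPU
# via a global scheduling queue. Chunking granularity is an assumption.
import queue
import threading

CHUNK = 64  # number of thread blocks handed out per dequeue (illustrative)

def run_cooperatively(total_blocks, cpu_execute, gpu_execute, n_cpu_workers=4):
    q = queue.Queue()
    for start in range(0, total_blocks, CHUNK):
        q.put((start, min(start + CHUNK, total_blocks)))

    def worker(execute):
        while True:
            try:
                lo, hi = q.get_nowait()
            except queue.Empty:
                return
            execute(lo, hi)  # run thread blocks [lo, hi) on this device

    threads = [threading.Thread(target=worker, args=(gpu_execute,))]
    threads += [threading.Thread(target=worker, args=(cpu_execute,))
                for _ in range(n_cpu_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because faster devices simply dequeue more chunks, the CPU/GPU split emerges automatically at runtime, which is the effect a work distribution module of this kind achieves.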

29 citations

Proceedings ArticleDOI
25 Feb 2012
TL;DR: A cooperative heterogeneous computing framework is presented that enables efficient utilization of available host CPU cores for CUDA kernels, which are designed to run only on the GPU, without any source recompilation.
Abstract: This paper presents a cooperative heterogeneous computing framework which enables the efficient utilization of available computing resources of host CPU cores for CUDA kernels, which are designed to run only on GPU. The proposed system exploits at runtime the coarse-grain thread-level parallelism across CPU and GPU, without any source recompilation. To this end, three features including a work distribution module, a transparent memory space, and a global scheduling queue are described in this paper. With a completely automatic runtime workload distribution, the proposed framework achieves speedups as high as 3.08× compared to the baseline GPU-only processing.

20 citations

Journal ArticleDOI
TL;DR: This paper proposes an efficient offloading method for media transcoding that is designed on top of Java and provides sufficient transcoding performance for media streaming with low-end processors.
Abstract: The demand for high-quality multimedia service in handheld devices has increased with the advance of consumer electronics technologies. However, there are inherent performance limits in such mobile devices due to low processing power, requirements for low power consumption, and limited storage capacity. These factors have obviously prevented high-performance multimedia services from being feasible. In addition, most of the mobile devices support distinct multimedia data formats for video and audio codecs, bit rate, and screen size. Therefore, streaming media or receiving data from other users' devices essentially requires additional data translation along with a transcoding operation. However, most of the processors that are used in mobile devices or even in personal media servers cannot provide sufficient processing power for transcoding. To address the problem, this paper proposes an efficient offloading method for media transcoding that can be applied in current commercial products. The system is designed on top of Java and provides sufficient transcoding performance for media streaming with low-end processors.

13 citations

Journal ArticleDOI
TL;DR: REACT, including the proposed data access scheduling algorithm, increases the utilization of the SSD and the degree of internal memory parallelism for pattern matching, achieving up to a 22.6 percent improvement in matching throughput on a 16-channel high-performance SSD compared to the accelerator without the scheduling algorithm.
Abstract: This article proposes REACT, a regular expression matching accelerator that can be embedded in a modern Solid-State Drive (SSD), together with a novel data access scheduling algorithm for high matching throughput. Specifically, REACT with the scheduling algorithm increases the utilization of the SSD and the degree of internal memory parallelism for pattern matching. While low-level flash exhibits long latency, modern SSDs in practice achieve high I/O performance by exploiting massive internal parallelism at the system level. However, exploiting this parallelism is limited for pattern matching, since the sub-blocks that constitute an input, which may be placed in multiple flash pages, must be tested in sequence to process the input correctly. This limitation can induce low utilization of the accelerator. To address this challenge, REACT simultaneously processes multiple input streams with a parallel processing architecture, maximizing matching throughput by hiding the long and irregular latency. The scheduling algorithm finds the data stream whose next sub-block is required in the closest time and prioritizes its access request to reduce data stalls in REACT. REACT achieves up to a 22.6 percent improvement in matching throughput on a 16-channel high-performance SSD compared to the accelerator without the proposed scheduling algorithm.
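
The scheduling policy in the abstract lends itself to a short sketch: keep a min-heap keyed by the ready time of each stream's next sub-block, and always issue the access whose data is needed soonest, while preserving each stream's in-order sub-block constraint. The data layout and ready times below are assumptions for illustration, not REACT's actual interfaces.

```python
# Illustrative sketch of the described policy: service the stream whose next
# sub-block is required in the closest time. Ready times would come from
# flash-channel timing in a real SSD; here they are given as inputs.
import heapq

def schedule_accesses(streams):
    """streams: dict stream_id -> ordered list of (ready_time, sub_block)."""
    heap = []  # (ready time of next sub-block, stream_id, index into stream)
    for sid, blocks in streams.items():
        if blocks:
            heapq.heappush(heap, (blocks[0][0], sid, 0))
    order = []
    while heap:
        ready, sid, i = heapq.heappop(heap)
        order.append((sid, streams[sid][i][1]))  # issue this access next
        if i + 1 < len(streams[sid]):  # keep per-stream sub-blocks in sequence
            heapq.heappush(heap, (streams[sid][i + 1][0], sid, i + 1))
    return order
```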

10 citations

Proceedings ArticleDOI
01 Feb 2020
TL;DR: This paper proposes an NVDIMM architecture with several system-wide mechanisms to allow the synchronous DDR4 memory interfaces to support non-deterministic (asynchronous) timing.
Abstract: Currently, there are two representative non-volatile dual in-line memory module (NVDIMM) interfaces: the proprietary Intel DDR-T and the JEDEC NVDIMM-P, neither of which is supported by existing platforms. Adopting a new platform is costly, and measuring the efficiency of migrating to it is even more complex. This study takes an alternative approach: finding a memory device that can be supported by all existing systems. In this paper, we propose an NVDIMM architecture with several system-wide mechanisms that allow the synchronous DDR4 memory interface to support non-deterministic (asynchronous) timing. The proposed memory architecture is implemented as a real device prototype and evaluated using synthetic and real workloads on an x86-64 server system.

10 citations


Cited by
Journal ArticleDOI
TL;DR: This article surveys Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency and reviews both discrete and fused CPU-GPU systems.
Abstract: As both CPUs and GPUs become employed in a wide range of applications, it has been acknowledged that both of these Processing Units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this article, we survey Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler, and application levels. Further, we review both discrete and fused CPU-GPU systems and discuss benchmark suites designed for evaluating Heterogeneous Computing Systems (HCSs). We believe that this article will provide insights into the workings and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.
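
As a toy instance of one surveyed technique, the sketch below shows static workload partitioning: split the work so that each device, given its measured throughput, finishes at roughly the same time. The profiling numbers are invented for illustration.

```python
# Toy static workload partitioning by relative device throughput (one HCT
# discussed in the survey). Rates would come from offline profiling.
def partition(n_items, cpu_rate, gpu_rate):
    """Split n_items so both devices finish at about the same time."""
    gpu_share = gpu_rate / (cpu_rate + gpu_rate)
    n_gpu = round(n_items * gpu_share)
    return n_gpu, n_items - n_gpu  # (items for GPU, items for CPU)

# e.g., if the GPU profiles 4x faster than the CPU on this kernel:
print(partition(10_000, cpu_rate=1.0, gpu_rate=4.0))  # -> (8000, 2000)
```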

414 citations

Posted Content
TL;DR: ParaDnn is introduced, a parameterized benchmark suite for deep learning that generates end-to-end models for fully connected, convolutional (CNN), and recurrent (RNN) neural networks, and the rapid performance improvements that specialized software stacks provide for the TPU and GPU platforms are quantified.
Abstract: Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware specialization to improve performance. To systematically benchmark deep learning platforms, we introduce ParaDnn, a parameterized benchmark suite for deep learning that generates end-to-end models for fully connected (FC), convolutional (CNN), and recurrent (RNN) neural networks. Along with six real-world models, we benchmark Google's Cloud TPU v2/v3, NVIDIA's V100 GPU, and an Intel Skylake CPU platform. We take a deep dive into TPU architecture, reveal its bottlenecks, and highlight valuable lessons learned for future specialized system design. We also provide a thorough comparison of the platforms and find that each has unique strengths for some types of models. Finally, we quantify the rapid performance improvements that specialized software stacks provide for the TPU and GPU platforms.
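
A parameterized benchmark in this sense is essentially a generator that sweeps model hyperparameters on a grid and emits one end-to-end model per point. The sketch below illustrates the idea for fully connected models; the parameter names and ranges are assumptions, not ParaDnn's actual ones.

```python
# Hypothetical sketch of a parameterized FC-model sweep, in the spirit of a
# parameterized benchmark suite. Parameter names/ranges are illustrative.
from itertools import product

def fc_model_configs(layers=(4, 8), nodes=(256, 1024), batch=(64, 512)):
    for n_layers, n_nodes, n_batch in product(layers, nodes, batch):
        yield {"type": "FC", "layers": n_layers,
               "nodes_per_layer": n_nodes, "batch_size": n_batch}

for cfg in fc_model_configs():
    print(cfg)  # in a real suite: build the model and time it per platform
```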

163 citations

Journal ArticleDOI
TL;DR: This article investigates the problem of reliable resource provisioning in joint edge-cloud environments, and surveys technologies, mechanisms, and methods that can be used to improve the reliability of distributed applications in diverse and heterogeneous network environments.
Abstract: Large-scale software systems are currently designed as distributed entities and deployed in cloud data centers. To overcome the limitations inherent to this type of deployment, applications are increasingly being supplemented with components instantiated closer to the edges of networks—a paradigm known as edge computing. The problem of how to efficiently orchestrate combined edge-cloud applications is, however, incompletely understood, and a wide range of techniques for resource and application management are currently in use. This article investigates the problem of reliable resource provisioning in joint edge-cloud environments, and surveys technologies, mechanisms, and methods that can be used to improve the reliability of distributed applications in diverse and heterogeneous network environments. Due to the complexity of the problem, special emphasis is placed on solutions to the characterization, management, and control of complex distributed applications using machine learning approaches. The survey is structured around a decomposition of the reliable resource provisioning problem into three categories of techniques: workload characterization and prediction, component placement and system consolidation, and application elasticity and remediation. Survey results are presented along with a problem-oriented discussion of the state-of-the-art. A summary of identified challenges and an outline of future research directions are presented to conclude the article.

100 citations

Journal ArticleDOI
TL;DR: A game-theoretic resource allocation scheme is proposed for media cloud to allocate resources to mobile social users through brokers; simulation results show that each player in the game can obtain its optimal strategy, at which the Stackelberg equilibrium exists stably.
Abstract: Due to the rapid increases in both the population of mobile social users and the demand for quality of experience (QoE), providing mobile social users with satisfying multimedia services has become an important issue. Media cloud has been shown to be an efficient solution to this issue, by allowing mobile social users to connect to it through a group of distributed brokers. However, as the resource in media cloud is limited, how to allocate resource among media cloud, brokers, and mobile social users becomes a new challenge. Therefore, in this paper, we propose a game-theoretic resource allocation scheme for media cloud to allocate resource to mobile social users through brokers. First, a framework of resource allocation among media cloud, brokers, and mobile social users is presented. Media cloud can dynamically determine the price of the resource and allocate its resource to brokers. A mobile social user can select his broker to connect to the media cloud by adjusting the strategy to achieve the maximum revenue, based on the social features in the community. Next, we formulate the interactions among media cloud, brokers, and mobile social users as a four-stage Stackelberg game. In addition, through the backward induction method, we propose an iterative algorithm to implement the proposed scheme and obtain the Stackelberg equilibrium. Finally, simulation results show that each player in the game can obtain the optimal strategy at which the Stackelberg equilibrium exists stably.
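
To show what solving such a game by backward induction looks like, here is a deliberately reduced two-stage toy (the paper's game has four stages and many players): a leader prices the resource, the follower best-responds with demand, and the leader iterates toward its revenue-maximizing price. The utility functions are invented for illustration.

```python
# Toy two-stage Stackelberg game solved by backward induction plus iteration.
# Follower utility: value*d - d**2/2 - price*d  (an assumed quadratic form).
def follower_demand(price, value=10.0):
    return max(value - price, 0.0)  # follower's best response d*(p)

def find_equilibrium(value=10.0, price=1.0, lr=0.05, iters=2000):
    for _ in range(iters):
        # Substitute the follower's best response into the leader objective:
        # R(p) = p * d*(p) = p*(value - p), so dR/dp = value - 2p.
        price += lr * (value - 2 * price)  # gradient step on leader revenue
    return price, follower_demand(price, value)

print(find_equilibrium())  # converges to price = value/2 = 5.0, demand = 5.0
```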

93 citations

Proceedings ArticleDOI
01 Sep 2014
TL;DR: A high-performance packet classifier on GPU is presented, achieving a throughput of 85 million packets per second with an average processing latency of 4.9 μs per packet.
Abstract: Multi-field packet classification is a network kernel function where packets are classified and routed based on a predefined rule set. Recently, there has been a new trend in exploring Graphics Processing Unit (GPU) for network applications. These applications typically do not perform floating point operations and it is challenging to obtain speedup. This paper presents a high-performance packet classifier on GPU. We investigate GPU's characteristics in parallelism and memory accessing, and implement our packet classifier using Compute Unified Device Architecture (CUDA). The basic operations of our design are binary range-tree search and bitwise AND operation. We optimize our design by storing the range-trees using compact arrays without explicit pointers in shared memory. We evaluate the performance with respect to throughput and processing latency. Experimental results show that our approach scales well across a range of rule set sizes from 512 to 4096. When the size of rule set is 512, our design can achieve the throughput of 85 million packets per second and the average processing latency of 4.9 µs per packet. Compared with the implementation on the state-of-the-art multi-core platform, our design demonstrates 1.9x improvement with respect to throughput.
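
The two basic operations named in the abstract translate directly into a small sketch: per field, binary-search the packet's value among precomputed range boundaries to fetch a rule bitmask, then AND the bitmasks across fields. The rule set and field layout below are invented for illustration; a GPU version would store the boundary arrays compactly in shared memory, as the paper describes.

```python
# Sketch of multi-field classification via binary search over range
# boundaries plus bitwise AND of per-field rule bitmasks.
import bisect

def build_field_index(rules, field):
    """For one field, return (boundaries, masks): masks[i] is the bitmask of
    rules whose range covers the interval starting at boundaries[i]."""
    points = sorted({r[field][0] for r in rules} | {r[field][1] + 1 for r in rules})
    masks = []
    for p in points:
        m = 0
        for i, r in enumerate(rules):
            if r[field][0] <= p <= r[field][1]:
                m |= 1 << i
        masks.append(m)
    return points, masks

def classify(packet, indexes):
    result = ~0  # all rules are candidates until a field rules them out
    for field, (points, masks) in indexes.items():
        i = bisect.bisect_right(points, packet[field]) - 1
        result &= masks[i] if i >= 0 else 0
    return result  # bit i set -> rule i matches on every field

rules = [{"src": (0, 99), "dst": (10, 20)}, {"src": (50, 200), "dst": (0, 255)}]
indexes = {f: build_field_index(rules, f) for f in ("src", "dst")}
print(bin(classify({"src": 60, "dst": 15}, indexes)))  # -> 0b11, both match
```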

57 citations