Author

Changmin Lee

Other affiliations: Samsung
Bio: Changmin Lee is an academic researcher from Yonsei University. The author has contributed to research in topics: Microarchitecture & Thread (computing). The author has an h-index of 5 and has co-authored 18 publications receiving 93 citations. Previous affiliations of Changmin Lee include Samsung.

Papers
Journal ArticleDOI
TL;DR: A cooperative heterogeneous computing framework is presented that enables efficient utilization of available host CPU cores for CUDA kernels, which are designed to run only on the GPU, without any source recompilation.
Abstract: This paper presents a cooperative heterogeneous computing framework which enables the efficient utilization of available computing resources of host CPU cores for CUDA kernels, which are designed to run only on GPU. The proposed system exploits at runtime the coarse-grain thread-level parallelism across CPU and GPU, without any source recompilation. To this end, three features including a work distribution module, a transparent memory space, and a global scheduling queue are described in this paper. With a completely automatic runtime workload distribution, the proposed framework achieves speedups of 3.08× in the best case and 1.42× on average compared to the baseline GPU-only processing.
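
To make the mechanism concrete, here is a minimal sketch of the queue-based, coarse-grain distribution the abstract describes: a shared global queue of work chunks drained concurrently by CPU workers and a GPU worker. This is an illustration of the idea, not the authors' implementation; the chunk size and the execute callbacks are assumptions.

```python
# Hypothetical sketch: coarse-grain work distribution across CPU and GPU
# via a global scheduling queue. Chunking granularity is an assumption.
import queue
import threading

CHUNK = 64  # number of thread blocks handed out per dequeue (illustrative)

def run_cooperatively(total_blocks, cpu_execute, gpu_execute, n_cpu_workers=4):
    q = queue.Queue()
    for start in range(0, total_blocks, CHUNK):
        q.put((start, min(start + CHUNK, total_blocks)))

    def worker(execute):
        while True:
            try:
                lo, hi = q.get_nowait()
            except queue.Empty:
                return
            execute(lo, hi)  # run thread blocks [lo, hi) on this device

    threads = [threading.Thread(target=worker, args=(gpu_execute,))]
    threads += [threading.Thread(target=worker, args=(cpu_execute,))
                for _ in range(n_cpu_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because faster devices simply dequeue more chunks, the CPU/GPU split emerges automatically at runtime, which is the effect a work distribution module of this kind achieves.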

29 citations

Proceedings ArticleDOI
25 Feb 2012
TL;DR: A cooperative heterogeneous computing framework is presented that enables efficient utilization of available host CPU cores for CUDA kernels, which are designed to run only on the GPU, without any source recompilation.
Abstract: This paper presents a cooperative heterogeneous computing framework which enables the efficient utilization of available computing resources of host CPU cores for CUDA kernels, which are designed to run only on GPU. The proposed system exploits at runtime the coarse-grain thread-level parallelism across CPU and GPU, without any source recompilation. To this end, three features including a work distribution module, a transparent memory space, and a global scheduling queue are described in this paper. With a completely automatic runtime workload distribution, the proposed framework achieves speedups as high as 3.08× compared to the baseline GPU-only processing.

20 citations

Journal ArticleDOI
TL;DR: This paper proposes an efficient offloading method for media transcoding that is designed on top of Java and provides sufficient transcoding performance for media streaming with low-end processors.
Abstract: The demand for high-quality multimedia service in handheld devices has increased with the advance of consumer electronics technologies. However, there are inherent performance limits in such mobile devices due to low processing power, requirements for low power consumption, and limited storage capacity. These factors have obviously prevented high-performance multimedia services from being feasible. In addition, most of the mobile devices support distinct multimedia data formats for video and audio codecs, bit rate, and screen size. Therefore, streaming media or receiving data from other users' devices essentially requires additional data translation along with a transcoding operation. However, most of the processors that are used in mobile devices or even in personal media servers cannot provide sufficient processing power for transcoding. To address the problem, this paper proposes an efficient offloading method for media transcoding that can be applied in current commercial products. The system is designed on top of Java and provides sufficient transcoding performance for media streaming with low-end processors.

13 citations

Journal ArticleDOI
TL;DR: REACT, including the proposed data access scheduling algorithm, increases the utilization of the SSD and the degree of internal memory parallelism for pattern matching, achieving up to a 22.6 percent improvement in matching throughput on a 16-channel high-performance SSD compared to the accelerator without the scheduling algorithm.
Abstract: This article proposes REACT, a regular expression matching accelerator that can be embedded in a modern Solid-State Drive (SSD), together with a novel data access scheduling algorithm for high matching throughput. Specifically, REACT with the scheduling algorithm increases the utilization of the SSD and the degree of internal memory parallelism for pattern matching. While low-level flash exhibits long latency, modern SSDs in practice achieve high I/O performance by exploiting massive internal parallelism at the system level. However, exploiting this parallelism is limited for pattern matching, since the sub-blocks that constitute an input, which may be placed in multiple flash pages, must be tested in sequence to process the input correctly. This limitation can induce low utilization of the accelerator. To address this challenge, REACT simultaneously processes multiple input streams with a parallel processing architecture, maximizing matching throughput by hiding the long and irregular latency. The scheduling algorithm finds the data stream whose next sub-block is required in the closest time and prioritizes its access request to reduce data stalls in REACT. REACT achieves up to a 22.6 percent improvement in matching throughput on a 16-channel high-performance SSD compared to the accelerator without the proposed scheduling algorithm.
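
The scheduling policy in the abstract lends itself to a short sketch: keep a min-heap keyed by the ready time of each stream's next sub-block, and always issue the access whose data is needed soonest, while preserving each stream's in-order sub-block constraint. The data layout and ready times below are assumptions for illustration, not REACT's actual interfaces.

```python
# Illustrative sketch of the described policy: service the stream whose next
# sub-block is required in the closest time. Ready times would come from
# flash-channel timing in a real SSD; here they are given as inputs.
import heapq

def schedule_accesses(streams):
    """streams: dict stream_id -> ordered list of (ready_time, sub_block)."""
    heap = []  # (ready time of next sub-block, stream_id, index into stream)
    for sid, blocks in streams.items():
        if blocks:
            heapq.heappush(heap, (blocks[0][0], sid, 0))
    order = []
    while heap:
        ready, sid, i = heapq.heappop(heap)
        order.append((sid, streams[sid][i][1]))  # issue this access next
        if i + 1 < len(streams[sid]):  # keep per-stream sub-blocks in sequence
            heapq.heappush(heap, (streams[sid][i + 1][0], sid, i + 1))
    return order
```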

10 citations

Proceedings ArticleDOI
01 Feb 2020
TL;DR: This paper proposes an NVDIMM architecture with several system-wide mechanisms to allow the synchronous DDR4 memory interfaces to support non-deterministic (asynchronous) timing.
Abstract: Currently, there are two representative non-volatile dual in-line memory module (NVDIMM) interfaces: the proprietary Intel DDR-T and the JEDEC NVDIMM-P, neither of which is supported by existing platforms. Adopting a new platform is costly, and measuring the efficiency of migrating to it is even more complex. This study takes an alternative approach: finding a memory device that can be supported by all existing systems. In this paper, we propose an NVDIMM architecture with several system-wide mechanisms that allow the synchronous DDR4 memory interface to support non-deterministic (asynchronous) timing. The proposed memory architecture is implemented as a real device prototype and evaluated using synthetic and real workloads on an x86-64 server system.

10 citations


Cited by
Journal ArticleDOI
TL;DR: This article surveys Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency and reviews both discrete and fused CPU-GPU systems.
Abstract: As both CPUs and GPUs become employed in a wide range of applications, it has been acknowledged that both of these Processing Units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this article, we survey Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler, and application levels. Further, we review both discrete and fused CPU-GPU systems and discuss benchmark suites designed for evaluating Heterogeneous Computing Systems (HCSs). We believe that this article will provide insights into the workings and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.
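
As a toy instance of one surveyed technique, the sketch below shows static workload partitioning: split the work so that each device, given its measured throughput, finishes at roughly the same time. The profiling numbers are invented for illustration.

```python
# Toy static workload partitioning by relative device throughput (one HCT
# discussed in the survey). Rates would come from offline profiling.
def partition(n_items, cpu_rate, gpu_rate):
    """Split n_items so both devices finish at about the same time."""
    gpu_share = gpu_rate / (cpu_rate + gpu_rate)
    n_gpu = round(n_items * gpu_share)
    return n_gpu, n_items - n_gpu  # (items for GPU, items for CPU)

# e.g., if the GPU profiles 4x faster than the CPU on this kernel:
print(partition(10_000, cpu_rate=1.0, gpu_rate=4.0))  # -> (8000, 2000)
```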

414 citations

Posted Content
TL;DR: ParaDnn is introduced, a parameterized benchmark suite for deep learning that generates end-to-end models for fully connected, convolutional (CNN), and recurrent (RNN) neural networks, and the rapid performance improvements that specialized software stacks provide for the TPU and GPU platforms are quantified.
Abstract: Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware specialization to improve performance. To systematically benchmark deep learning platforms, we introduce ParaDnn, a parameterized benchmark suite for deep learning that generates end-to-end models for fully connected (FC), convolutional (CNN), and recurrent (RNN) neural networks. Along with six real-world models, we benchmark Google's Cloud TPU v2/v3, NVIDIA's V100 GPU, and an Intel Skylake CPU platform. We take a deep dive into TPU architecture, reveal its bottlenecks, and highlight valuable lessons learned for future specialized system design. We also provide a thorough comparison of the platforms and find that each has unique strengths for some types of models. Finally, we quantify the rapid performance improvements that specialized software stacks provide for the TPU and GPU platforms.
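
A parameterized benchmark in this sense is essentially a generator that sweeps model hyperparameters on a grid and emits one end-to-end model per point. The sketch below illustrates the idea for fully connected models; the parameter names and ranges are assumptions, not ParaDnn's actual ones.

```python
# Hypothetical sketch of a parameterized FC-model sweep, in the spirit of a
# parameterized benchmark suite. Parameter names/ranges are illustrative.
from itertools import product

def fc_model_configs(layers=(4, 8), nodes=(256, 1024), batch=(64, 512)):
    for n_layers, n_nodes, n_batch in product(layers, nodes, batch):
        yield {"type": "FC", "layers": n_layers,
               "nodes_per_layer": n_nodes, "batch_size": n_batch}

for cfg in fc_model_configs():
    print(cfg)  # in a real suite: build the model and time it per platform
```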

163 citations

Journal ArticleDOI
TL;DR: This article investigates the problem of reliable resource provisioning in joint edge-cloud environments, and surveys technologies, mechanisms, and methods that can be used to improve the reliability of distributed applications in diverse and heterogeneous network environments.
Abstract: Large-scale software systems are currently designed as distributed entities and deployed in cloud data centers. To overcome the limitations inherent to this type of deployment, applications are increasingly being supplemented with components instantiated closer to the edges of networks—a paradigm known as edge computing. The problem of how to efficiently orchestrate combined edge-cloud applications is, however, incompletely understood, and a wide range of techniques for resource and application management are currently in use. This article investigates the problem of reliable resource provisioning in joint edge-cloud environments, and surveys technologies, mechanisms, and methods that can be used to improve the reliability of distributed applications in diverse and heterogeneous network environments. Due to the complexity of the problem, special emphasis is placed on solutions to the characterization, management, and control of complex distributed applications using machine learning approaches. The survey is structured around a decomposition of the reliable resource provisioning problem into three categories of techniques: workload characterization and prediction, component placement and system consolidation, and application elasticity and remediation. Survey results are presented along with a problem-oriented discussion of the state-of-the-art. A summary of identified challenges and an outline of future research directions are presented to conclude the article.

100 citations

Journal ArticleDOI
TL;DR: A game-theoretic resource allocation scheme is proposed for media cloud to allocate resources to mobile social users through brokers; simulation results show that each player in the game can obtain its optimal strategy, at which the Stackelberg equilibrium exists stably.
Abstract: Due to the rapid increases in both the population of mobile social users and the demand for quality of experience (QoE), providing mobile social users with satisfying multimedia services has become an important issue. Media cloud has been shown to be an efficient solution to this issue, by allowing mobile social users to connect to it through a group of distributed brokers. However, as the resource in media cloud is limited, how to allocate resource among media cloud, brokers, and mobile social users becomes a new challenge. Therefore, in this paper, we propose a game-theoretic resource allocation scheme for media cloud to allocate resource to mobile social users through brokers. First, a framework of resource allocation among media cloud, brokers, and mobile social users is presented. Media cloud can dynamically determine the price of the resource and allocate its resource to brokers. A mobile social user can select his broker to connect to the media cloud by adjusting the strategy to achieve the maximum revenue, based on the social features in the community. Next, we formulate the interactions among media cloud, brokers, and mobile social users as a four-stage Stackelberg game. In addition, through the backward induction method, we propose an iterative algorithm to implement the proposed scheme and obtain the Stackelberg equilibrium. Finally, simulation results show that each player in the game can obtain the optimal strategy at which the Stackelberg equilibrium exists stably.
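
To show what solving such a game by backward induction looks like, here is a deliberately reduced two-stage toy (the paper's game has four stages and many players): a leader prices the resource, the follower best-responds with demand, and the leader iterates toward its revenue-maximizing price. The utility functions are invented for illustration.

```python
# Toy two-stage Stackelberg game solved by backward induction plus iteration.
# Follower utility: value*d - d**2/2 - price*d  (an assumed quadratic form).
def follower_demand(price, value=10.0):
    return max(value - price, 0.0)  # follower's best response d*(p)

def find_equilibrium(value=10.0, price=1.0, lr=0.05, iters=2000):
    for _ in range(iters):
        # Substitute the follower's best response into the leader objective:
        # R(p) = p * d*(p) = p*(value - p), so dR/dp = value - 2p.
        price += lr * (value - 2 * price)  # gradient step on leader revenue
    return price, follower_demand(price, value)

print(find_equilibrium())  # converges to price = value/2 = 5.0, demand = 5.0
```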

93 citations

Proceedings ArticleDOI
01 Sep 2014
TL;DR: A high-performance packet classifier on GPU is presented, achieving a throughput of 85 million packets per second with an average processing latency of 4.9 μs per packet.
Abstract: Multi-field packet classification is a network kernel function where packets are classified and routed based on a predefined rule set. Recently, there has been a new trend in exploring Graphics Processing Unit (GPU) for network applications. These applications typically do not perform floating point operations and it is challenging to obtain speedup. This paper presents a high-performance packet classifier on GPU. We investigate GPU's characteristics in parallelism and memory accessing, and implement our packet classifier using Compute Unified Device Architecture (CUDA). The basic operations of our design are binary range-tree search and bitwise AND operation. We optimize our design by storing the range-trees using compact arrays without explicit pointers in shared memory. We evaluate the performance with respect to throughput and processing latency. Experimental results show that our approach scales well across a range of rule set sizes from 512 to 4096. When the size of rule set is 512, our design can achieve the throughput of 85 million packets per second and the average processing latency of 4.9 µs per packet. Compared with the implementation on the state-of-the-art multi-core platform, our design demonstrates 1.9x improvement with respect to throughput.
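
The two basic operations named in the abstract translate directly into a small sketch: per field, binary-search the packet's value among precomputed range boundaries to fetch a rule bitmask, then AND the bitmasks across fields. The rule set and field layout below are invented for illustration; a GPU version would store the boundary arrays compactly in shared memory, as the paper describes.

```python
# Sketch of multi-field classification via binary search over range
# boundaries plus bitwise AND of per-field rule bitmasks.
import bisect

def build_field_index(rules, field):
    """For one field, return (boundaries, masks): masks[i] is the bitmask of
    rules whose range covers the interval starting at boundaries[i]."""
    points = sorted({r[field][0] for r in rules} | {r[field][1] + 1 for r in rules})
    masks = []
    for p in points:
        m = 0
        for i, r in enumerate(rules):
            if r[field][0] <= p <= r[field][1]:
                m |= 1 << i
        masks.append(m)
    return points, masks

def classify(packet, indexes):
    result = ~0  # all rules are candidates until a field rules them out
    for field, (points, masks) in indexes.items():
        i = bisect.bisect_right(points, packet[field]) - 1
        result &= masks[i] if i >= 0 else 0
    return result  # bit i set -> rule i matches on every field

rules = [{"src": (0, 99), "dst": (10, 20)}, {"src": (50, 200), "dst": (0, 255)}]
indexes = {f: build_field_index(rules, f) for f in ("src", "dst")}
print(bin(classify({"src": 60, "dst": 15}, indexes)))  # -> 0b11, both match
```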

57 citations