Author

Sangpil Lee

Bio: Sangpil Lee is an academic researcher from Yonsei University. The author has contributed to research in topics: Thread (computing) & Register file. The author has an h-index of 7 and has co-authored 12 publications receiving 234 citations.

Papers
Proceedings ArticleDOI
13 Jun 2015
TL;DR: This paper presents Warped-Compression, a warp-level register compression scheme for reducing GPU power consumption, motivated by the observation that the register values of threads within the same warp are similar, namely, the arithmetic differences between two successive thread registers are small.
Abstract: This paper presents Warped-Compression, a warp-level register compression scheme for reducing GPU power consumption. This work is motivated by the observation that the register values of threads within the same warp are similar, namely, the arithmetic differences between two successive thread registers are small. Removing data redundancy of register values through register compression reduces the effective register width, thereby enabling power reduction opportunities. GPU register files are huge, as they are necessary to keep concurrent execution contexts and to enable fast context switching. As a result, the register file consumes a large fraction of the total GPU chip power. GPU design trends show that the register file size will continue to increase to enable even more thread-level parallelism. To reduce register file data redundancy, Warped-Compression uses a low-cost and implementation-efficient base-delta-immediate (BDI) compression scheme that takes advantage of the banked register file organization used in GPUs. Since threads within a warp write values with strong similarity, BDI can quickly compress and decompress by selecting either a single register, or one of the register banks, as the primary base and then computing delta values of all the other registers, or banks. Warped-Compression can be used to reduce both dynamic and leakage power. By compressing register values, each warp-level register access activates fewer register banks, which leads to a reduction in dynamic power. When fewer banks are used to store the register content, leakage power can be reduced by power gating the unused banks. Evaluation results show that register compression saves 25% of the total register file power consumption.
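For readers unfamiliar with BDI, the following is a minimal software sketch of the warp-level base-delta idea, not the paper's hardware pipeline; the choice of thread 0's register as the base and the candidate delta widths are illustrative assumptions.

```python
# Minimal software sketch of warp-level base-delta-immediate (BDI) compression.
# Illustrative model only, not the paper's hardware design; the base selection
# (first thread's register) and the candidate delta widths are assumptions.

WARP_SIZE = 32

def bdi_compress(values, delta_bytes_options=(1, 2, 4)):
    """Try to represent a warp's 32 register values as one base plus small deltas.

    Returns (base, deltas, delta_bytes) if compressible, or None otherwise.
    """
    assert len(values) == WARP_SIZE
    base = values[0]                      # use thread 0's value as the base
    deltas = [v - base for v in values]
    for nbytes in delta_bytes_options:    # smallest delta width that fits wins
        limit = 1 << (8 * nbytes - 1)     # signed range for nbytes-wide deltas
        if all(-limit <= d < limit for d in deltas):
            return base, deltas, nbytes
    return None                           # values too dissimilar: store uncompressed

def compressed_size_bytes(result):
    """Storage needed: a 4-byte base plus one delta per thread (illustrative)."""
    if result is None:
        return 4 * WARP_SIZE
    _, deltas, nbytes = result
    return 4 + nbytes * len(deltas)

if __name__ == "__main__":
    # Threads in a warp often hold nearby addresses or loop indices.
    regs = [0x1000_0000 + 4 * tid for tid in range(WARP_SIZE)]
    out = bdi_compress(regs)
    print("compressed to", compressed_size_bytes(out), "bytes instead of", 4 * WARP_SIZE)
```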

97 citations

Journal ArticleDOI
18 Jun 2016
TL;DR: A Virtual Thread (VT) architecture is proposed which assigns Cooperative Thread Arrays (CTAs) up to the capacity limit while ignoring the scheduling limit; to keep the logic complexity of managing more threads low, CTAs are placed into active and inactive states so that the number of active CTAs still respects the scheduling limit.
Abstract: Modern GPUs require tens of thousands of concurrent threads to fully utilize the massive amount of processing resources. However, thread concurrency in GPUs can be diminished either due to shortage of thread scheduling structures (scheduling limit), such as available program counters and single instruction multiple thread stacks, or due to shortage of on-chip memory (capacity limit), such as register file and shared memory. Our evaluations show that in practice concurrency in many general purpose applications running on GPUs is curtailed by the scheduling limit rather than the capacity limit. Maximizing the utilization of on-chip memory resources without unduly increasing the scheduling complexity is a key goal of this paper. This paper proposes a Virtual Thread (VT) architecture which assigns Cooperative Thread Arrays (CTAs) up to the capacity limit, while ignoring the scheduling limit. However, to reduce the logic complexity of managing more threads concurrently, we propose to place CTAs into active and inactive states, such that the number of active CTAs still respects the scheduling limit. When all the warps in an active CTA hit a long latency stall, the active CTA is context switched out and the next ready CTA takes its place. We exploit the fact that both active and inactive CTAs still fit within the capacity limit, which obviates the need to save and restore large amounts of CTA state. Thus, VT significantly reduces the performance penalties of CTA swapping. By swapping between active and inactive states, VT can exploit a higher degree of thread-level parallelism without increasing logic complexity. Our simulation results show that VT improves performance by 23.9% on average.
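A minimal sketch of the active/inactive CTA swapping logic described above, modeled in Python; the class names, the active limit, and the readiness check are illustrative assumptions, not the paper's microarchitecture.

```python
# Illustrative model of the Virtual Thread idea: more CTAs are resident on-chip
# (capacity limit) than can be actively scheduled (scheduling limit); a fully
# stalled active CTA is swapped with a ready inactive one.

from collections import deque

class CTA:
    """A cooperative thread array resident on the SM (within the capacity limit)."""
    def __init__(self, cta_id):
        self.cta_id = cta_id
        self.stalled = False       # True when all of its warps wait on a long-latency op

class VirtualThreadScheduler:
    def __init__(self, resident_ctas, active_limit):
        # Only `active_limit` CTAs occupy scheduling structures (PCs, SIMT stacks);
        # the rest stay resident on-chip but inactive.
        self.active = list(resident_ctas[:active_limit])
        self.inactive = deque(resident_ctas[active_limit:])

    def on_long_latency_stall(self, cta):
        """Swap a fully stalled active CTA with a ready inactive CTA. No register or
        shared-memory state moves, because both CTAs already fit on-chip."""
        cta.stalled = True
        ready = next((c for c in self.inactive if not c.stalled), None)
        if ready is None:
            return                    # nothing ready to swap in; just wait
        self.inactive.remove(ready)
        self.active.remove(cta)
        self.active.append(ready)
        self.inactive.append(cta)

    def on_stall_resolved(self, cta):
        cta.stalled = False           # eligible to be swapped back in later

if __name__ == "__main__":
    ctas = [CTA(i) for i in range(8)]                      # 8 CTAs fit the capacity limit
    sched = VirtualThreadScheduler(ctas, active_limit=4)   # only 4 are schedulable at once
    sched.on_long_latency_stall(ctas[0])
    print([c.cta_id for c in sched.active])                # CTA 4 took CTA 0's slot
```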

46 citations

Proceedings ArticleDOI
12 Mar 2016
TL;DR: This paper proposes that when a warp is stalled on a long-latency operation, it enters P-mode, a pre-execution mode for improving GPU performance, and shows a 23% performance improvement for memory-intensive applications without negatively impacting other application categories.
Abstract: This paper presents a pre-execution approach for improving GPU performance, called P-mode (pre-execution mode). GPUs utilize a large number of concurrent threads to hide the processing delay of operations. However, certain long-latency operations such as off-chip memory accesses often take hundreds of cycles and hence lead to stalls even in the presence of thread concurrency and fast thread switching capability. It is unclear if adding more threads can improve latency tolerance due to increased memory contention. Further, adding more threads increases on-chip storage demands. Instead we propose that when a warp is stalled on a long-latency operation, it enters P-mode. In P-mode, a warp continues to fetch and decode successive instructions to identify any independent instruction that is not on the long-latency dependence chain. These independent instructions are then pre-executed. To tackle write-after-write and write-after-read hazards, output values during P-mode are written to renamed physical registers. We exploit register file underutilization to repurpose a few unused registers to store the P-mode results. When a warp is switched from P-mode back to normal execution mode, it reuses pre-executed results by reading the renamed registers. Any global load operation in P-mode is transformed into a pre-load which fetches data into the L1 cache to reduce future memory access penalties. Our evaluation results show a 23% performance improvement for memory-intensive applications, without negatively impacting other application categories.
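The sketch below illustrates the two key steps described in the abstract, selecting instructions that are off the long-latency dependence chain and renaming their destinations, using an invented instruction representation; it is a conceptual model, not the paper's hardware design.

```python
# Conceptual sketch of P-mode: while a warp is stalled on a long-latency load,
# scan ahead for instructions independent of the load's result and map their
# outputs to spare (renamed) registers. Instruction encoding is illustrative.

def pmode_select(instructions, stalled_dest):
    """Pick instructions whose sources are independent of the stalled load.

    `instructions` is a list of (dest, srcs) register tuples in program order;
    `stalled_dest` is the register produced by the pending long-latency load.
    """
    tainted = {stalled_dest}        # registers on the long-latency dependence chain
    independent = []
    for dest, srcs in instructions:
        if tainted.intersection(srcs):
            tainted.add(dest)       # direct or transitive consumer: skip it
        else:
            independent.append((dest, srcs))
    return independent

def rename(independent, free_registers):
    """Map each pre-executed destination to a spare physical register, avoiding
    write-after-write / write-after-read hazards with normal-mode state."""
    return {dest: phys for (dest, _), phys in zip(independent, free_registers)}

if __name__ == "__main__":
    # r1 <- load (stalled); r2 depends on r1; r3 and r4 do not.
    prog = [("r2", ("r1", "r5")), ("r3", ("r6", "r7")), ("r4", ("r3", "r6"))]
    indep = pmode_select(prog, stalled_dest="r1")
    print(indep)                                       # r3 and r4 can be pre-executed
    print(rename(indep, free_registers=["p0", "p1"]))  # unused registers hold results
```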

40 citations

Proceedings ArticleDOI
21 Apr 2013
TL;DR: A new parallel simulation framework and a parallel simulation technique called work-group parallel simulation are proposed in order to improve the simulation speed for modern many-core GPUs and to boost the performance of parallelized GPU simulation.
Abstract: GPU computing is at the forefront of high-performance computing, and it has greatly affected current studies on parallel software and hardware design because of its massively parallel architecture. Therefore, numerous studies have focused on the utilization of GPUs in various fields. However, studies of GPU architectures are constrained by the lack of a suitable GPU simulator. Previously proposed GPU simulators do not have sufficient simulation speed for advanced software and architecture studies. In this paper, we propose a new parallel simulation framework and a parallel simulation technique called work-group parallel simulation in order to improve the simulation speed for modern many-core GPUs. The proposed framework divides the GPU architecture into parallel and shared components, and it determines which GPU components can be effectively parallelized and can work correctly in multithreaded simulation. In addition, the work-group parallel simulation technique effectively boosts the performance of parallelized GPU simulation by eliminating synchronization overhead. Experimental results obtained using a simulator built on the proposed framework show that the proposed parallel simulation technique achieves a speed-up of up to 4.15 compared to an existing sequential GPU simulator on an 8-core machine, while keeping cycle errors minimal.
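As a rough illustration of the parallel/shared component split, the following toy model simulates per-core components in separate threads; the component granularity and the use of Python threads are assumptions made for brevity, since the actual simulator is a native-code framework.

```python
# Rough sketch of work-group parallel simulation: per-core ("parallel") components
# are advanced by worker threads, while shared components (e.g. the memory system)
# would be advanced serially afterwards. All structure here is illustrative.

import threading

class CoreModel:
    """A parallel component: simulating one core's work-groups is independent."""
    def __init__(self, core_id, workgroups):
        self.core_id = core_id
        self.workgroups = workgroups
        self.cycles = 0

    def simulate(self):
        for wg in self.workgroups:
            self.cycles += len(wg)   # stand-in for detailed per-instruction timing

def simulate_parallel(cores):
    """Run each core model in its own thread; no cross-core synchronization is
    needed until the shared components are advanced."""
    threads = [threading.Thread(target=c.simulate) for c in cores]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    cores = [CoreModel(i, workgroups=[range(64)] * 4) for i in range(8)]
    simulate_parallel(cores)
    print([c.cycles for c in cores])   # each core simulated independently
```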

24 citations

Journal ArticleDOI
TL;DR: A new parallel decoding algorithm for network coding on a graphics processing unit (GPU) is proposed; it can simultaneously process multiple incoming streams and maintain its maximum decoding performance irrespective of the size and number of transfer units.
Abstract: Network coding, a well-known technique for optimizing data flow in wired and wireless network systems, has attracted considerable attention in various fields. However, the decoding complexity of network coding becomes a major performance bottleneck in practical network systems; thus, several studies have been conducted to improve the decoding performance of network coding. Nevertheless, previously proposed parallel network coding algorithms have shown limited scalability and performance imbalance for different-sized transfer units and multiple streams. In this paper, we propose a new parallel decoding algorithm for network coding using a graphics processing unit (GPU). This algorithm can simultaneously process multiple incoming streams and can maintain its maximum decoding performance irrespective of the size and number of transfer units. Our experimental results show that the proposed algorithm achieves a 682.2 Mbps decoding bandwidth on a system with a GeForce GTX 285 GPU and speed-ups of up to 26 compared to an existing single-stream decoding procedure with a 128 × 128 coefficient matrix and different-sized data blocks.
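The decoding step being parallelized is Gaussian elimination on the coefficient matrix. The sketch below shows it over GF(2) (plain XOR) for brevity, whereas practical systems typically work over GF(2^8); the matrix size and field choice are illustrative, and the GPU mapping of rows and streams to threads is only indicated in a comment.

```python
# Toy sketch of network-coding decoding by Gaussian elimination, here over GF(2).
# The paper's contribution is mapping this elimination, for many streams at once,
# onto GPU threads; this sequential version only shows the algorithmic core.

def decode_gf2(coeff_rows, coded_payloads):
    """Recover original packets from coded packets.

    coeff_rows[i]  : list of 0/1 coefficients for coded packet i
    coded_payloads : the coded packets as integers (XOR combinations of originals)
    """
    n = len(coeff_rows)
    rows = [list(r) for r in coeff_rows]
    data = list(coded_payloads)
    for col in range(n):
        # Find a pivot row with a 1 in this column and swap it into place.
        pivot = next(r for r in range(col, n) if rows[r][col] == 1)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        data[col], data[pivot] = data[pivot], data[col]
        # Eliminate this column from every other row (XOR is add/sub in GF(2)).
        # On a GPU, each remaining row, and each independent stream, gets its own thread.
        for r in range(n):
            if r != col and rows[r][col] == 1:
                rows[r] = [a ^ b for a, b in zip(rows[r], rows[col])]
                data[r] ^= data[col]
    return data

if __name__ == "__main__":
    p = [0b1010, 0b0110, 0b1111]                        # original packets
    coeffs = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]          # invertible over GF(2)
    coded = [p[0] ^ p[1], p[1] ^ p[2], p[2]]            # what the receiver sees
    print([bin(x) for x in decode_gf2(coeffs, coded)])  # recovers the originals
```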

21 citations


Cited by
Proceedings ArticleDOI
07 Feb 2015
TL;DR: A GPU performance and power estimation model that applies machine learning techniques to measurements from real GPU hardware and, after an initial training phase, runs as fast as or faster than the program running natively on real hardware.
Abstract: Graphics Processing Units (GPUs) have numerous configuration and design options, including core frequency, number of parallel compute units (CUs), and available memory bandwidth. At many stages of the design process, it is important to estimate how application performance and power are impacted by these options. This paper describes a GPU performance and power estimation model that uses machine learning techniques on measurements from real GPU hardware. The model is trained on a collection of applications that are run at numerous different hardware configurations. From the measured performance and power data, the model learns how applications scale as the GPU's configuration is changed. Hardware performance counter values are then gathered when running a new application on a single GPU configuration. These dynamic counter values are fed into a neural network that predicts which scaling curve from the training data best represents this kernel. This scaling curve is then used to estimate the performance and power of the new application at different GPU configurations. Over an 8× range of the number of CUs, a 3.3× range of core frequencies, and a 2.9× range of memory bandwidth, our model's performance and power estimates are accurate to within 15% and 10% of real hardware, respectively. This is comparable to the accuracy of cycle-level simulators. However, after an initial training phase, our model runs as fast as, or faster than the program running natively on real hardware.
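A skeletal illustration of the modeling flow: training kernels contribute scaling curves, and counters from one run of a new kernel select the closest curve. The paper uses a neural-network classifier; the nearest-neighbor matching and all numbers below are simplifications and assumptions.

```python
# Skeletal model of the approach: each training kernel is run at many hardware
# configurations to obtain a performance "scaling curve"; at use time, counters
# from a single run pick the most similar training kernel, whose curve then
# predicts performance at other configurations.

def nearest_curve(counters, training_set):
    """training_set: list of (counter_vector, scaling_curve) pairs."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training_set, key=lambda kc: dist(counters, kc[0]))[1]

def predict(base_perf, scaling_curve, config):
    """Scale the measured single-config performance by the curve's factor for `config`."""
    return base_perf * scaling_curve[config]

if __name__ == "__main__":
    # Scaling curves: relative performance vs. number of compute units (made-up data).
    training = [
        ((0.9, 0.1), {8: 1.0, 16: 1.9, 32: 3.5}),   # compute-bound-like kernel
        ((0.2, 0.8), {8: 1.0, 16: 1.2, 32: 1.3}),   # bandwidth-bound-like kernel
    ]
    counters = (0.25, 0.75)          # counters measured on one configuration
    curve = nearest_curve(counters, training)
    print(predict(base_perf=100.0, scaling_curve=curve, config=32))  # ~130.0
```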

196 citations

Proceedings ArticleDOI
Siva Kumar Sastry Hari, Timothy Tsai, Mark Stephenson, Stephen W. Keckler, Joel Emer
24 Apr 2017
TL;DR: This paper presents an error injection-based methodology and tool called SASSIFI to study the soft error resilience of massively parallel applications running on state-of-the-art NVIDIA GPUs.
Abstract: As GPUs become more pervasive in both scalable high-performance computing systems and safety-critical embedded systems, evaluating and analyzing their resilience to soft errors caused by high-energy particle strikes will grow increasingly important. GPU designers must develop tools and techniques to understand the effect of these soft errors on applications. This paper presents an error injection-based methodology and tool called SASSIFI to study the soft error resilience of massively parallel applications running on state-of-the-art NVIDIA GPUs. Our approach uses a low-level assembly-language instrumentation tool called SASSI to profile and inject errors. SASSI provides efficiency by allowing instrumentation code to execute entirely on the GPU and provides the ability to inject into different architecture-visible state. For example, SASSIFI can inject errors in general-purpose registers, GPU memory, condition code registers, and predicate registers. SASSIFI can also inject errors into addresses and register indices. In this paper, we describe the SASSIFI tool, its capabilities, and present experiments to illustrate some of the analyses SASSIFI can be used to perform.
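The sketch below only illustrates the underlying fault model, a single bit flip in an architecturally visible register value followed by outcome classification; it does not use the SASSI/SASSIFI instrumentation APIs, and the toy kernel and injection site are invented.

```python
# Conceptual sketch of single-bit error injection into a register value, the kind
# of fault SASSIFI introduces via SASSI instrumentation on real GPUs. Standalone
# model of the fault model only; not the SASSI/SASSIFI tool interface.

import random

def flip_one_bit(value, width=32, rng=random):
    """Return `value` with one uniformly chosen bit inverted (single-bit-flip model)."""
    bit = rng.randrange(width)
    return value ^ (1 << bit)

def inject_and_run(kernel, inputs, inject_at, rng=random):
    """Run `kernel` twice: once fault-free (golden) and once with a bit flip applied
    to an intermediate register before instruction index `inject_at`; classify the outcome."""
    golden = kernel(list(inputs), faulty_index=None, flip=None)
    faulty = kernel(list(inputs), faulty_index=inject_at,
                    flip=lambda v: flip_one_bit(v, rng=rng))
    if faulty == golden:
        return "masked"                   # the fault did not change the output
    return "silent data corruption"

if __name__ == "__main__":
    def saxpy_like(regs, faulty_index, flip):
        # Toy "kernel": r2 = 3*r0; r3 = r2 + r1. The fault corrupts r2 before use.
        regs.append(3 * regs[0])                     # instruction 0 -> r2
        if faulty_index == 1 and flip:
            regs[2] = flip(regs[2])
        regs.append(regs[2] + regs[1])               # instruction 1 -> r3
        return regs[3]

    print(inject_and_run(saxpy_like, inputs=[5, 7], inject_at=1))
```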

117 citations

Journal ArticleDOI
TL;DR: Compression and approximate value prediction show great promise for reducing the communication bottleneck in bandwidth-constrained applications, while relaxed synchronization is found to provide large speedups for select error-tolerant applications, but suffers from limited general applicability and unreliable output degradation guarantees.
Abstract: Approximate computing has gained research attention recently as a way to increase energy efficiency and/or performance by exploiting some applications' intrinsic error resiliency. However, little attention has been given to its potential for tackling the communication bottleneck, which remains one of the looming challenges for efficient parallelism. This article explores the potential benefits of approximate computing for communication reduction by surveying three promising techniques for approximate communication: compression, relaxed synchronization, and value prediction. The techniques are compared based on an evaluation framework composed of communication cost reduction, performance, energy reduction, applicability, overheads, and output degradation. Comparison results demonstrate that lossy link compression and approximate value prediction show great promise for reducing the communication bottleneck in bandwidth-constrained applications. Meanwhile, relaxed synchronization is found to provide large speedups for select error-tolerant applications, but suffers from limited general applicability and unreliable output degradation guarantees. Finally, this article concludes with several suggestions for future research on approximate communication techniques.
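To make one of the surveyed techniques concrete, here is a small sketch of approximate value prediction with a last-value predictor and periodic refresh; the predictor, refresh policy, and numbers are illustrative and not taken from any specific proposal in the survey.

```python
# Sketch of approximate value prediction for communication reduction: a consumer
# reuses the last observed value instead of fetching the real one, and only pays
# for a real transfer periodically to retrain the predictor.

class LastValuePredictor:
    def __init__(self):
        self.table = {}                      # address -> last observed value

    def predict(self, addr):
        return self.table[addr]

    def train(self, addr, actual):
        self.table[addr] = actual

def approximate_read(addr, fetch_real, predictor, refresh_every, counter):
    """Return a predicted value most of the time; periodically fetch the real value
    to bound staleness (the real transfer is the expensive communication)."""
    counter[addr] = counter.get(addr, 0) + 1
    if addr not in predictor.table or counter[addr] % refresh_every == 0:
        actual = fetch_real(addr)            # real (expensive) transfer
        predictor.train(addr, actual)
        return actual
    return predictor.predict(addr)           # approximate value, no transfer

if __name__ == "__main__":
    memory = {0x10: 1.00}
    pred, counts = LastValuePredictor(), {}
    for step in range(1, 9):
        memory[0x10] = 1.00 + 0.01 * step    # remote value drifts slowly
        v = approximate_read(0x10, memory.get, pred, refresh_every=4, counter=counts)
        print(step, round(v, 2))             # stale but close between refreshes
```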

108 citations

Journal ArticleDOI
TL;DR: A survey of techniques for using compression in cache and main memory systems is presented; it classifies the techniques based on key parameters to highlight their similarities and differences.
Abstract: As the number of cores on a chip increases and key applications become even more data-intensive, memory systems in modern processors have to deal with increasingly large amounts of data. In the face of such challenges, data compression is a promising approach to increase effective memory system capacity and also provide performance and energy advantages. This paper presents a survey of techniques for using compression in cache and main memory systems. It also classifies the techniques based on key parameters to highlight their similarities and differences. It discusses compression in CPUs and GPUs, conventional and non-volatile memory (NVM) systems, and 2D and 3D memory systems. We hope that this survey will help researchers gain insight into the potential role of compression in the memory components of future extreme-scale systems.

98 citations

Journal ArticleDOI
18 Jun 2016
TL;DR: Warped-Slicer, a dynamic intra-SM slicing strategy, is proposed; it uses an analytical, computationally efficient method for calculating the SM resource partitioning across different kernels that maximizes performance.
Abstract: As technology scales, GPUs are forecasted to incorporate an ever-increasing amount of computing resources to support thread-level parallelism. But even with the best effort, exposing massive thread-level parallelism from a single GPU kernel, particularly from general purpose applications, is going to be a difficult challenge. In some cases, even if there is sufficient thread-level parallelism in a kernel, there may not be enough available memory bandwidth to support such massive concurrent thread execution. Hence, GPU resources may be underutilized as more general purpose applications are ported to execute on GPUs. In this paper, we explore multiprogramming GPUs as a way to resolve the resource underutilization issue. There is growing hardware support for multiprogramming on GPUs. Hyper-Q has been introduced in the Kepler architecture, which enables multiple kernels to be invoked via tens of hardware queue streams. Spatial multitasking has been proposed to partition GPU resources across multiple kernels. But the partitioning is done at the coarse granularity of streaming multiprocessors (SMs), where each kernel is assigned to a subset of SMs. In this paper, we advocate for partitioning a single SM across multiple kernels, which we term intra-SM slicing. We explore various intra-SM slicing strategies that slice resources within each SM to concurrently run multiple kernels on the SM. Our results show that no single intra-SM slicing strategy derives the best performance for all application pairs. We propose Warped-Slicer, a dynamic intra-SM slicing strategy that uses an analytical method for calculating the SM resource partitioning across different kernels that maximizes performance. The model relies on a set of short online profile runs to determine how each kernel's performance varies as more thread blocks from each kernel are assigned to an SM. The model takes into account the interference effect of shared resource usage across multiple kernels. The model is also computationally efficient and can determine the resource partitioning quickly to enable dynamic decision making as new kernels enter the system. We demonstrate that the proposed Warped-Slicer approach improves performance by 23% over the baseline multiprogramming approach with minimal hardware overhead.
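A simplified sketch of the partitioning decision: given profiled performance curves for two kernels as a function of thread blocks per SM, choose the split that maximizes combined throughput. The exhaustive search and the profile numbers below stand in for the paper's analytical model and are assumptions for illustration.

```python
# Simplified sketch of intra-SM partitioning: each kernel's profiled performance is
# a function of how many of its thread blocks share the SM; pick the split that
# maximizes combined throughput under a block budget.

def best_partition(curve_a, curve_b, max_blocks):
    """curve_x[n] = normalized performance of kernel x with n blocks on the SM."""
    best = (0.0, 0, 0)
    for a in range(max_blocks + 1):
        for b in range(max_blocks + 1 - a):
            perf = curve_a.get(a, 0.0) + curve_b.get(b, 0.0)
            if perf > best[0]:
                best = (perf, a, b)
    return best

if __name__ == "__main__":
    # Kernel A keeps scaling; kernel B saturates after a few blocks (memory-bound).
    curve_a = {0: 0.0, 2: 0.35, 4: 0.6, 6: 0.8, 8: 0.95}
    curve_b = {0: 0.0, 2: 0.5, 4: 0.6, 6: 0.62, 8: 0.63}
    perf, a, b = best_partition(curve_a, curve_b, max_blocks=8)
    print(f"give {a} blocks to A, {b} to B (combined throughput {perf:.2f})")
```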

95 citations