Author

Ling Sing Yung

Bio: Ling Sing Yung is an academic researcher from Hong Kong Baptist University. The author has contributed to research on topics including erasure codes and frequency scaling, has an h-index of 2, and has co-authored 2 publications receiving 93 citations.

Papers
Proceedings ArticleDOI
03 Nov 2013
TL;DR: A thorough measurement study exploring how GPU DVFS affects system energy consumption, showing that GPU voltage/frequency scaling is an effective approach to conserving energy.
Abstract: Nowadays, GPUs are widely used to accelerate many high-performance computing applications. Energy conservation in such computing systems has become an important research topic. Dynamic voltage/frequency scaling (DVFS) has proved to be an appealing method for saving energy in traditional computing centers. However, there is still a lack of firsthand studies on the effectiveness of GPU DVFS. This paper presents a thorough measurement study that explores how GPU DVFS affects system energy consumption. We conduct experiments on a real GPU platform with 37 benchmark applications. Our results show that GPU voltage/frequency scaling is an effective approach to conserving energy. For example, by scaling down the GPU core voltage and frequency, we achieve an average of 19.28% energy reduction compared with the default setting, while giving up no more than 4% of performance. For all tested GPU applications, core voltage scaling is significantly effective in reducing system energy consumption, whereas the effects of scaling the core frequency and memory frequency depend on the characteristics of the GPU applications.
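The intuition behind these savings can be sketched with the standard CMOS dynamic-power model, P_dyn = C · V² · f. The capacitance constant, voltage, and frequency values below are made up for illustration and are not taken from the paper; only the 4% slowdown bound mirrors the reported result.

```python
# Illustrative sketch of why core voltage/frequency scaling saves energy.
# Assumes the standard CMOS dynamic-power model P_dyn = C * V^2 * f; the
# constants and operating points below are hypothetical.

def dynamic_energy(c, v, f, seconds):
    """Energy (joules) burned by dynamic power P = c * v^2 * f over `seconds`."""
    return c * v * v * f * seconds

# Baseline: 1.1 V, 1.0 GHz, a kernel that runs for 10 s.
base = dynamic_energy(c=1e-9, v=1.1, f=1.0e9, seconds=10.0)

# Scaled: 1.0 V, 0.9 GHz; assume the kernel slows down by only 4%,
# mirroring the performance-loss bound reported in the paper.
scaled = dynamic_energy(c=1e-9, v=1.0, f=0.9e9, seconds=10.0 * 1.04)

saving = 1.0 - scaled / base
print(f"energy saving: {saving:.1%}")  # energy saving: 22.6%
```

Because voltage enters the model quadratically, a modest voltage reduction outweighs the small runtime increase, which is why core voltage scaling is the most consistently effective knob in the study.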

80 citations

Proceedings ArticleDOI
08 Jun 2015
TL;DR: PErasure, a parallel Cauchy Reed-Solomon coding library, is designed and implemented and can achieve up to 10GB/s of overall encoding speed using just a single GPU for a large storage system that can withstand up to 8 disk failures.
Abstract: In recent years, erasure coding has been adopted by large-scale cloud storage systems to replace data replication. With the increase of disk I/O throughput and network bandwidth, the speed of erasure coding becomes one of the key system bottlenecks. In this paper, we propose to offload the task of erasure coding to Graphics Processing Units (GPUs). Specifically, we have designed and implemented PErasure, a parallel Cauchy Reed-Solomon (CRS) coding library. We compare the performance of PErasure with that of two state-of-the-art libraries: Jerasure (for CPUs) and Gibraltar (for GPUs). Our experiments show that the raw coding speed of PErasure on a $500 Nvidia GTX780 card is about 10 times faster than that of multithreaded Jerasure on a quad-core modern CPU, and 2-4 times faster than Gibraltar on the same GPU. PErasure can achieve up to 10GB/s of overall encoding speed using just a single GPU for a large storage system that can withstand up to 8 disk failures.
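Cauchy Reed-Solomon coding works by translating GF(2^w) matrix arithmetic into large numbers of XOR operations, which is what makes it a good fit for GPU offloading. The toy below is a heavy simplification (single-parity, tolerating exactly one lost block, like RAID-5) meant only to show the XOR structure that CRS generalizes; it is not PErasure's actual algorithm.

```python
# A heavily simplified, single-parity sketch of XOR-based erasure coding.
# Real Cauchy Reed-Solomon coding (as in PErasure) applies a binary coding
# matrix to tolerate m > 1 failures; this toy handles exactly one.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_blocks):
    """Return one parity block: the XOR of all k data blocks."""
    parity = data_blocks[0]
    for blk in data_blocks[1:]:
        parity = xor_blocks(parity, blk)
    return parity

def reconstruct(surviving_blocks, parity):
    """Recover the single missing data block from the survivors and parity."""
    missing = parity
    for blk in surviving_blocks:
        missing = xor_blocks(missing, blk)
    return missing

data = [b"abcd", b"efgh", b"ijkl"]   # k = 3 data blocks
p = encode(data)
# Lose block 1, then rebuild it from the survivors plus parity:
rebuilt = reconstruct([data[0], data[2]], p)
print(rebuilt)  # b'efgh'
```

Since every byte of output is an independent XOR over the input blocks, the workload parallelizes naturally across thousands of GPU threads.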

24 citations


Cited by
Proceedings ArticleDOI
18 Apr 2016
TL;DR: This paper proposes a novel distributed reconstruction technique, called Partial Parallel Repair (PPR), which divides the reconstruction operation into small partial operations and schedules them on multiple nodes already involved in the data reconstruction, significantly reducing repair time and degraded read time.
Abstract: With the explosion of data in applications all around us, erasure-coded storage has emerged as an attractive alternative to replication because, even with significantly lower storage overhead, it provides better reliability against data loss. Reed-Solomon code is the most widely used erasure code because it provides maximum reliability for a given storage overhead and is flexible in the choice of coding parameters that determine the achievable reliability. However, reconstruction time for unavailable data becomes prohibitively long, mainly because of network bottlenecks. Some proposed solutions either use additional storage or limit the coding parameters that can be used. In this paper, we propose a novel distributed reconstruction technique, called Partial Parallel Repair (PPR), which divides the reconstruction operation into small partial operations and schedules them on multiple nodes already involved in the data reconstruction. A distributed protocol then progressively combines these partial results to reconstruct the unavailable data blocks, which reduces the network pressure. Theoretically, our technique can complete the network transfer in ⌈log2(k + 1)⌉ time, compared to the k time needed for a (k, m) Reed-Solomon code. Our experiments show that PPR reduces repair time and degraded read time significantly. Moreover, our technique is compatible with existing erasure codes and does not require any additional storage overhead. We demonstrate this by overlaying PPR on top of two prior schemes, Local Reconstruction Code and Rotated Reed-Solomon code, to gain additional savings in reconstruction time.
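The ⌈log2(k + 1)⌉ bound comes from combining partial results pairwise in a tree rather than funnelling all k blocks to one node. The sketch below stands in for the Reed-Solomon linear combination with plain XOR and counts merge rounds; the node count and payload values are illustrative, not from the paper.

```python
import math

# Sketch of PPR-style tree combination of partial repair results.
# XOR stands in for the GF linear combination; each round, pairs of
# nodes merge their partial sums (one network transfer per pair).

def tree_combine(partials):
    """Merge partial results pairwise per round; return (result, rounds)."""
    rounds = 0
    while len(partials) > 1:
        nxt = []
        for i in range(0, len(partials) - 1, 2):
            nxt.append(partials[i] ^ partials[i + 1])  # one transfer per pair
        if len(partials) % 2:
            nxt.append(partials[-1])                   # odd node carries over
        partials = nxt
        rounds += 1
    return partials[0], rounds

k = 6                                  # surviving blocks read for the repair
partials = [7, 13, 42, 5, 9, 21, 3]    # k helpers + the requestor: k+1 values
lost, rounds = tree_combine(partials)

# 3 rounds instead of the k = 6 sequential transfers of a naive repair.
assert rounds == math.ceil(math.log2(k + 1))
print(lost, rounds)  # 58 3
```

Each round moves at most half the remaining partials across the network, so no single link ever carries all k blocks, which is exactly the network-pressure reduction the paper targets.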

112 citations

Proceedings ArticleDOI
13 Dec 2014
TL;DR: Equalizer, a low overhead hardware runtime system, that dynamically monitors the resource requirements of a kernel and manages the amount of on-chip concurrency, core frequency and memory frequency to adapt the hardware to best match the needs of the running kernel is proposed.
Abstract: GPUs use thousands of threads to provide high performance and efficiency. In general, if one thread of a kernel uses one of the resources (compute, bandwidth, data cache) more heavily, there will be significant contention for that resource due to the large number of identical concurrent threads. This contention will eventually saturate the performance of the kernel due to contention for the bottleneck resource, while at the same time leaving other resources underutilized. To overcome this problem, a runtime system that can tune the hardware to match the characteristics of a kernel can effectively mitigate the imbalance between the resource requirements of kernels and the hardware resources present on the GPU. We propose Equalizer, a low-overhead hardware runtime system that dynamically monitors the resource requirements of a kernel and manages the amount of on-chip concurrency, core frequency, and memory frequency to adapt the hardware to best match the needs of the running kernel. Equalizer provides efficiency in two modes. First, it can save energy without significant performance degradation by throttling under-utilized resources. Second, it can boost bottleneck resources to reduce contention and provide higher performance without a significant energy increase. Across a spectrum of 27 kernels, Equalizer achieves 15% savings in energy mode and 22% speedup in performance mode.
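A software caricature of that decision logic is sketched below. The real Equalizer is a low-overhead hardware runtime; the counter names, thresholds, and knob mapping here are hypothetical, chosen only to illustrate the two operating modes.

```python
# Hypothetical sketch of Equalizer-style mode logic: boost the saturated
# resource in performance mode, throttle the idle one in energy mode.
# Names and thresholds are illustrative, not the paper's design.

def equalizer_step(util, mode):
    """Pick a knob adjustment from per-resource utilizations in [0, 1].

    util: dict with 'compute', 'bandwidth', 'cache' utilizations.
    mode: 'energy' (throttle what is idle) or 'performance' (boost bottleneck).
    """
    bottleneck = max(util, key=util.get)
    idle = min(util, key=util.get)
    knob = {"compute": "core_freq", "bandwidth": "mem_freq", "cache": "concurrency"}
    if mode == "performance" and util[bottleneck] > 0.9:
        return ("boost", knob[bottleneck])
    if mode == "energy" and util[idle] < 0.3:
        return ("throttle", knob[idle])
    return ("hold", None)

# A memory-bound kernel: bandwidth saturated, compute mostly idle.
sample = {"compute": 0.25, "bandwidth": 0.95, "cache": 0.6}
print(equalizer_step(sample, "performance"))  # ('boost', 'mem_freq')
print(equalizer_step(sample, "energy"))       # ('throttle', 'core_freq')
```

The key point the sketch captures is that the same utilization snapshot drives opposite actions depending on whether the goal is speed or energy.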

72 citations

Proceedings ArticleDOI
01 May 2017
TL;DR: This work studies energy conservation on emerging CPU-GPU hybrid clusters through dynamic voltage and frequency scaling (DVFS), and stresses the nonlinear relationship between task execution time and processor speed for GPU-accelerated applications to more accurately capture real-world GPU energy consumption.
Abstract: Conserving the energy consumption of large data centers is of critical significance, where a few percent in consumption reduction translates into millions of dollars in savings. This work studies energy conservation on emerging CPU-GPU hybrid clusters through dynamic voltage and frequency scaling (DVFS). We aim at minimizing the total energy consumption of processing a sequence of real-time tasks under deadline constraints. We compute the appropriate voltage/frequency setting for each task through mathematical optimization, and assign multiple tasks to the cluster with heuristic scheduling algorithms. In performance evaluation driven by real-world power measurement traces, our scheduling algorithm shows energy savings comparable to the theoretical upper bound. With a GPU scaling interval in which analytically at most 38% of energy can be saved, we record 30-36% energy savings. Our results are applicable to energy management on modern heterogeneous clusters. In particular, our model stresses the nonlinear relationship between task execution time and processor speed for GPU-accelerated applications, more accurately capturing real-world GPU energy consumption.

64 citations

Book ChapterDOI
18 Sep 2014
TL;DR: This work proposes a novel fine-grained benchmarking approach and applies it to two popular GPUs, namely Fermi and Kepler, to expose the previously unknown characteristics of their memory hierarchies, and investigates the structures of different cache systems.
Abstract: Memory access efficiency is a key factor for fully exploiting the computational power of Graphics Processing Units (GPUs). However, many details of the GPU memory hierarchy are not released by the vendors. We propose a novel fine-grained benchmarking approach and apply it to two popular GPUs, namely Fermi and Kepler, to expose the previously unknown characteristics of their memory hierarchies. Specifically, we investigate the structures of different cache systems, such as the data cache, texture cache, and the translation lookaside buffer (TLB). We also investigate the impact of bank conflicts on shared memory access latency. Our benchmarking results offer a better understanding of the mysterious GPU memory hierarchy, which can help in software optimization and the modelling of GPU architectures. Our source code and experimental results are publicly available.
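Fine-grained benchmarking of this kind typically builds on the pointer-chase (P-chase) pattern: an array where each element stores the index of the next, so every load depends on the previous one and latencies cannot overlap. The sketch below shows only the access pattern in Python; the actual study runs such chains as GPU kernels with cycle-accurate timing, and the sizes here are made up.

```python
# The classic pointer-chase (P-chase) pattern behind fine-grained memory
# microbenchmarks: p[i] holds the index of the next element, so each load
# depends on the previous one. On real hardware you time every load; in
# this Python sketch we only demonstrate the dependent traversal.

def build_chain(n_elems, stride):
    """p[i] -> (i + stride) mod n: a strided cyclic chain."""
    return [(i + stride) % n_elems for i in range(n_elems)]

def chase(p, start, iters):
    """Follow the chain for `iters` dependent accesses."""
    j = start
    visited = []
    for _ in range(iters):
        visited.append(j)
        j = p[j]
    return visited

# Growing the array past a cache level's capacity (or the stride past its
# line size) makes the per-access latency jump, exposing the hierarchy.
p = build_chain(n_elems=16, stride=4)
print(chase(p, start=0, iters=8))  # [0, 4, 8, 12, 0, 4, 8, 12]
```

Sweeping the array size and stride while plotting average access latency is what lets such benchmarks recover cache capacities, line sizes, and TLB structure without vendor documentation.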

55 citations

Proceedings ArticleDOI
07 Sep 2015
TL;DR: This work presents a flexible dynamic resolution scaling system for smartphones that adopts an ultrasonic-based approach to detect the user-screen distance at low power cost and makes scaling decisions automatically for maximum user experience and power saving.
Abstract: The extremely high display density of modern smartphones imposes a significant burden on power consumption, yet does not always provide an improved user experience and may even lead to a compromised one. As human visually-perceivable ability highly depends on the user-screen distance, a reduced display resolution may still achieve the same user experience when the user-screen distance is large. This provides new power-saving opportunities. In this paper, we present a flexible dynamic resolution scaling system for smartphones. The system adopts an ultrasonic-based approach to accurately detect the user-screen distance at low power cost and makes scaling decisions automatically for maximum user experience and power saving. App developers or users can also adjust the resolution manually to suit their needs. Our system works on existing commercial smartphones and supports legacy apps, without requiring rebuilding of the ROM or any changes to apps. An end-to-end dynamic resolution scaling system is implemented on the Galaxy S5 LTE-A and Nexus 6 smartphones, and its correctness and effectiveness are evaluated against 30 games and benchmarks. Experimental results show that all 30 apps can run successfully with per-frame, real-time dynamic resolution scaling. The energy per frame can be reduced by 30.1% on average, and by up to 60.5% when the resolution is halved, for 15 apps. A user study with 10 users indicates that our system maintains a good user experience, as none of the 10 users could perceive the resolution changes.
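The premise that perceivable detail depends on viewing distance can be made concrete with the common ~1 arcminute visual-acuity rule of thumb (an assumption for illustration, not a figure from the paper): pixels denser than a distance-dependent limit are simply invisible.

```python
import math

# How far a user sits bounds the pixel density the eye can resolve.
# Assumes ~1 arcminute of visual acuity, a common rule of thumb; this
# is an illustrative assumption, not the paper's model.

def max_resolvable_ppi(distance_in, acuity_arcmin=1.0):
    """Pixels per inch beyond which extra detail is invisible at this distance."""
    theta = math.radians(acuity_arcmin / 60.0)         # arcmin -> radians
    pixel_in = 2.0 * distance_in * math.tan(theta / 2)  # smallest visible pixel
    return 1.0 / pixel_in

for d in (12, 24, 36):  # viewing distances in inches
    print(d, round(max_resolvable_ppi(d)))
# 12 286
# 24 143
# 36 95
```

Since the resolvable density falls inversely with distance, a user holding the phone at arm's length can tolerate a far lower rendering resolution than the panel's native density, which is the power-saving opportunity the system exploits.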

52 citations