Proceedings ArticleDOI
Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems
Tayler Hetherington, Timothy G. Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt
pp. 88-98
TL;DR: This work explores the challenges in porting Memcached to OpenCL and provides a detailed analysis of Memcached's behavior on a GPU to better explain the performance results observed on physical hardware.
Abstract:
The recent use of graphics processing units (GPUs) in several top supercomputers demonstrates their ability to consistently deliver positive results in high-performance computing (HPC). GPU support for significant amounts of parallelism would seem to make them strong candidates for non-HPC applications as well. Server workloads are inherently parallel; however, at first glance they may not seem suitable to run on GPUs due to their irregular control flow and memory access patterns. In this work, we evaluate the performance of a widely used key-value store middleware application, Memcached, on recent integrated and discrete CPU+GPU heterogeneous hardware and characterize the resulting performance. To gain greater insight, we also evaluate Memcached's performance on a GPU simulator. This work explores the challenges in porting Memcached to OpenCL and provides a detailed analysis of Memcached's behavior on a GPU to better explain the performance results observed on physical hardware. On the integrated CPU+GPU systems, we observe up to a 7.5X performance increase compared to the CPU when executing the key-value look-up handler on the GPU.
Citations
Journal ArticleDOI
A Survey of CPU-GPU Heterogeneous Computing Techniques
Sparsh Mittal, Jeffrey S. Vetter
TL;DR: This article surveys Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency and reviews both discrete and fused CPU-GPU systems.
Proceedings ArticleDOI
Cache-Conscious Wavefront Scheduling
TL;DR: This paper proposes Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity.
Proceedings ArticleDOI
A quantitative study of irregular programs on GPUs
TL;DR: This paper defines two measures of irregularity called control-flow irregularity and memory-access irregularity, and investigates, using performance-counter measurements, how irregular GPU kernels differ from regular kernels with respect to these measures.
Proceedings Article
MemC3: compact and concurrent MemCache with dumber caching and smarter hashing
TL;DR: These techniques--optimistic cuckoo hashing, a compact LRU-approximating eviction algorithm based upon CLOCK, and comprehensive implementation of optimistic locking--enable the resulting Memcached system to use 30% less memory for small key-value pairs, and serve up to 3x as many queries per second over the network.
Proceedings ArticleDOI
Thin servers with smart pipes: designing SoC accelerators for memcached
TL;DR: This work argues for an alternate architecture---Thin Servers with Smart Pipes (TSSP)---for cost-effective high-performance memcached deployment, and demonstrates the potential benefits of the TSSP architecture through an FPGA prototyping platform, and shows the potential for a 6X-16X power-performance improvement over conventional server baselines.
References
Proceedings ArticleDOI
Analyzing CUDA workloads using a detailed GPU simulator
TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
Proceedings ArticleDOI
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
TL;DR: This work implements two major extensions to the CACTI cache modeling tool that focus on interconnect design for a large cache, and adopts state-of-the-art design space exploration strategies for non-uniform cache access (NUCA).
Journal ArticleDOI
A performance study of general-purpose applications on graphics processors using CUDA
TL;DR: This paper uses NVIDIA's C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU.
Proceedings ArticleDOI
FAWN: a fast array of wimpy nodes
David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, Vijay K. Vasudevan
TL;DR: The key contributions of this paper are the principles of the FAWN architecture and the design and implementation of FAWN-KV--a consistent, replicated, highly available, and high-performance key-value storage system built on a FAWN prototype.
Journal ArticleDOI
Parallel Computing Experiences with CUDA
Michael Garland, S. Le Grand, John R. Nickolls, Joshua A. Anderson, J. Hardwick, S. Morton, E. Phillips, Yao Zhang, Vasily Volkov
TL;DR: This article surveys experiences gained in applying CUDA to a diverse set of problems, along with the parallel speedups over sequential codes running on traditional CPU architectures that were attained by executing key computations on the GPU.