Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems

doi:10.1109/ISPASS.2012.6189209

Proceedings Article

pp. 88-98

TLDR

This work explores the challenges in porting Memcached to OpenCL and provides a detailed analysis into Memcached's behavior on a GPU to better explain the performance results observed on physical hardware.

Abstract:

The recent use of graphics processing units (GPUs) in several top supercomputers demonstrates their ability to consistently deliver positive results in high-performance computing (HPC). GPU support for significant amounts of parallelism would seem to make them strong candidates for non-HPC applications as well. Server workloads are inherently parallel; however, at first glance they may not seem suitable to run on GPUs due to their irregular control flow and memory access patterns. In this work, we evaluate the performance of a widely used key-value store middleware application, Memcached, on recent integrated and discrete CPU+GPU heterogeneous hardware and characterize the resulting performance. To gain greater insight, we also evaluate Memcached's performance on a GPU simulator. This work explores the challenges in porting Memcached to OpenCL and provides a detailed analysis into Memcached's behavior on a GPU to better explain the performance results observed on physical hardware. On the integrated CPU+GPU systems, we observe up to a 7.5X performance increase compared to the CPU when executing the key-value look-up handler on the GPU.
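
The abstract's remark about irregular control flow can be made concrete with a small model. The following is only an illustrative sketch in Python, not the paper's OpenCL handler: it assumes a toy chained hash table and treats each GET request as one work-item in a lockstep group. The names `build_table`, `lookup`, `batched_get`, and `group_size` are hypothetical, and the cost model (a group pays for its slowest lane) is a simplifying assumption.

```python
# Illustrative sketch only: a toy chained hash table and a batched GET.
# This is NOT the paper's OpenCL kernel; every name here is hypothetical.

def build_table(items, num_buckets):
    """Place (key, value) pairs into buckets; each bucket is a list (a chain)."""
    table = [[] for _ in range(num_buckets)]
    for key, value in items:
        table[hash(key) % num_buckets].append((key, value))
    return table


def lookup(table, key):
    """Walk one chain; return (value or None, number of entries inspected)."""
    steps = 0
    for stored_key, value in table[hash(key) % len(table)]:
        steps += 1
        if stored_key == key:
            return value, steps
    return None, steps


def batched_get(table, keys, group_size):
    """Process keys in groups, as if each key were one work-item in a group.

    Assumed cost model: a group running in lockstep finishes only when its
    slowest lane does, so a group costs len(group) * max(steps in the group).
    Returns (values, useful steps, lockstep steps).
    """
    values, useful, lockstep = [], 0, 0
    for start in range(0, len(keys), group_size):
        group = keys[start:start + group_size]
        results = [lookup(table, k) for k in group]
        values.extend(v for v, _ in results)
        steps = [s for _, s in results]
        useful += sum(steps)
        lockstep += len(group) * max(steps)
    return values, useful, lockstep


if __name__ == "__main__":
    # Integer keys keep bucket placement deterministic across runs.
    table = build_table([(i, i * 10) for i in range(8)], num_buckets=2)
    values, useful, lockstep = batched_get(table, [0, 6, 1, 99], group_size=2)
    print(values, useful, lockstep)
```

In this model, the gap between the useful and lockstep step counts grows when requests in the same group walk chains of different lengths, which is one simple way to picture the divergence the abstract alludes to.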

Citations

Journal Article

A Survey of CPU-GPU Heterogeneous Computing Techniques

Sparsh Mittal, +1 more

21 Jul 2015

ACM Computing Surveys

TL;DR: This article surveys Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency and reviews both discrete and fused CPU-GPU systems.

Proceedings Article

Cache-Conscious Wavefront Scheduling

Timothy G. Rogers, +2 more

TL;DR: This paper proposes Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity.

Proceedings Article

A quantitative study of irregular programs on GPUs

Martin Burtscher, +2 more

TL;DR: This paper defines two measures of irregularity called control-flow irregularity and memory-access irregularity, and investigates, using performance-counter measurements, how irregular GPU kernels differ from regular kernels with respect to these measures.

Proceedings Article

MemC3: compact and concurrent MemCache with dumber caching and smarter hashing

Bin Fan, +2 more

TL;DR: These techniques--optimistic cuckoo hashing, a compact LRU-approximating eviction algorithm based upon CLOCK, and comprehensive implementation of optimistic locking--enable the resulting Memcached system to use 30% less memory for small key-value pairs, and serve up to 3x as many queries per second over the network.

Proceedings Article

Thin servers with smart pipes: designing SoC accelerators for memcached

Kevin T. Lim, +4 more

TL;DR: This work argues for an alternate architecture---Thin Servers with Smart Pipes (TSSP)---for cost-effective high-performance memcached deployment, and demonstrates the potential benefits of the TSSP architecture through an FPGA prototyping platform, and shows the potential for a 6X-16X power-performance improvement over conventional server baselines.

References

Proceedings Article

Analyzing CUDA workloads using a detailed GPU simulator

Ali Bakhoda, +4 more

TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.

Proceedings Article

Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0

Naveen Muralimanohar, +2 more

TL;DR: This work implements two major extensions to the CACTI cache modeling tool that focus on interconnect design for a large cache, and adopts state-of-the-art design space exploration strategies for non-uniform cache access (NUCA).

Journal Article

A performance study of general-purpose applications on graphics processors using CUDA

Shuai Che, +5 more

01 Oct 2008

Journal of Parallel and Distributed Comp...

TL;DR: This paper uses NVIDIA's C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU.

Proceedings Article

FAWN: a fast array of wimpy nodes

David G. Andersen, +5 more

TL;DR: The key contributions of this paper are the principles of the FAWN architecture and the design and implementation of FAWN-KV--a consistent, replicated, highly available, and high-performance key-value storage system built on a FAWN prototype.

Journal Article

Parallel Computing Experiences with CUDA

Michael Garland, +8 more

01 Jul 2008

IEEE Micro

TL;DR: This article surveys experiences gained in applying CUDA to a diverse set of problems, along with the parallel speedups over sequential codes running on traditional CPU architectures attained by executing key computations on the GPU.

Related Papers (5)

Rodinia: A benchmark suite for heterogeneous computing

Analyzing CUDA workloads using a detailed GPU simulator

Workload analysis of a large-scale key-value store

Cache-Conscious Wavefront Scheduling

Dynamic warp subdivision for integrated branch and memory divergence tolerance