Proceedings Article

Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems

TLDR
This work explores the challenges in porting Memcached to OpenCL and provides a detailed analysis of Memcached's behavior on a GPU to better explain the performance results observed on physical hardware.
Abstract: 
The recent use of graphics processing units (GPUs) in several top supercomputers demonstrates their ability to consistently deliver positive results in high-performance computing (HPC). GPU support for significant amounts of parallelism would seem to make them strong candidates for non-HPC applications as well. Server workloads are inherently parallel; however, at first glance they may not seem suitable to run on GPUs due to their irregular control flow and memory access patterns. In this work, we evaluate the performance of a widely used key-value store middleware application, Memcached, on recent integrated and discrete CPU+GPU heterogeneous hardware and characterize the resulting performance. To gain greater insight, we also evaluate Memcached's performance on a GPU simulator. This work explores the challenges in porting Memcached to OpenCL and provides a detailed analysis of Memcached's behavior on a GPU to better explain the performance results observed on physical hardware. On the integrated CPU+GPU systems, we observe up to a 7.5X performance increase compared to the CPU when executing the key-value look-up handler on the GPU.



Citations
Journal Article

A Survey of CPU-GPU Heterogeneous Computing Techniques

TL;DR: This article surveys Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency and reviews both discrete and fused CPU-GPU systems.
Proceedings Article

Cache-Conscious Wavefront Scheduling

TL;DR: This paper proposes Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity.
Proceedings Article

A quantitative study of irregular programs on GPUs

TL;DR: This paper defines two measures of irregularity called control-flow irregularity and memory-access irregularity, and investigates, using performance-counter measurements, how irregular GPU kernels differ from regular kernels with respect to these measures.
Proceedings Article

MemC3: compact and concurrent MemCache with dumber caching and smarter hashing

TL;DR: These techniques--optimistic cuckoo hashing, a compact LRU-approximating eviction algorithm based upon CLOCK, and comprehensive implementation of optimistic locking--enable the resulting Memcached system to use 30% less memory for small key-value pairs, and serve up to 3x as many queries per second over the network.
Proceedings Article

Thin servers with smart pipes: designing SoC accelerators for memcached

TL;DR: This work argues for an alternate architecture---Thin Servers with Smart Pipes (TSSP)---for cost-effective high-performance memcached deployment, and demonstrates the potential benefits of the TSSP architecture through an FPGA prototyping platform, and shows the potential for a 6X-16X power-performance improvement over conventional server baselines.
References
Proceedings Article

Analyzing CUDA workloads using a detailed GPU simulator

TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
Proceedings Article

Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0

TL;DR: This work implements two major extensions to the CACTI cache modeling tool that focus on interconnect design for a large cache, and adopts state-of-the-art design space exploration strategies for non-uniform cache access (NUCA).
Journal Article

A performance study of general-purpose applications on graphics processors using CUDA

TL;DR: This paper uses NVIDIA's C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU.
Proceedings Article

FAWN: a fast array of wimpy nodes

TL;DR: The key contributions of this paper are the principles of the FAWN architecture and the design and implementation of FAWN-KV--a consistent, replicated, highly available, and high-performance key-value storage system built on a FAWN prototype.
Journal Article

Parallel Computing Experiences with CUDA

TL;DR: This paper surveys experiences gained in applying CUDA to a diverse set of problems, along with the parallel speedups over sequential codes running on traditional CPU architectures attained by executing key computations on the GPU.