Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

doi:10.1145/1815961.1816021

Proceedings Article

- Vol. 38, Iss: 3, pp 451-460

TLDR

This paper discusses optimization techniques for both CPU and GPU, analyzes which architectural features contributed to the performance differences between the two architectures, and recommends a set of architectural features that provide significant improvement in architectural efficiency for throughput kernels.

Abstract:

Recent advances in computing have led to an explosion in the amount of data being generated. Processing this ever-growing data in a timely manner has made throughput computing an important aspect of emerging applications. Our analysis of a set of important throughput computing kernels shows that these kernels contain ample parallelism, which makes them suitable for today's multi-core CPUs and GPUs. In the past few years, many studies have claimed that GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such a large performance difference comes from, we perform a rigorous performance analysis and find that, after applying optimizations appropriate for both CPUs and GPUs, the performance gap between an Nvidia GTX280 processor and an Intel Core i7-960 processor narrows to only 2.5X on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze which architectural features contributed to the performance differences between the two architectures, and recommend a set of architectural features that provide significant improvement in architectural efficiency for throughput kernels.

Citations

Journal Article

Second-generation PLINK: rising to the challenge of larger and richer datasets

Christopher C. Chang, +5 more

- 25 Feb 2015 -

GigaScience

TL;DR: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility, and for the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

Journal Article

Second-generation PLINK: rising to the challenge of larger and richer datasets

Christopher C. Chang, +5 more

- 17 Oct 2014 -

arXiv: Genomics

TL;DR: PLINK as discussed by the authors is a C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics, which has been widely used in the literature.

Journal Article

Dark Silicon and the End of Multicore Scaling

Hadi Esmaeilzadeh, +4 more

- 01 May 2012 -

IEEE Micro

TL;DR: A comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.

Proceedings Article

Dark silicon and the end of multicore scaling

Hadi Esmaeilzadeh, +4 more

TL;DR: The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community.

Improving the speed of neural networks on CPUs

Vincent Vanhoucke, +2 more

TL;DR: This paper uses speech recognition as an example task, and shows that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline at no cost in accuracy.

References

Journal Article

The Design and Implementation of FFTW3

Matteo Frigo, +1 more

TL;DR: It is shown that such an approach can yield an implementation of the discrete Fourier transform that is competitive with hand-optimized libraries, and the software structure that makes the current FFTW3 version flexible and adaptive is described.

Proceedings Article

The PARSEC benchmark suite: characterization and architectural implications

Christian Bienia, +3 more

TL;DR: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs), and shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic.

The Landscape of Parallel Computing Research: A View from Berkeley

Krste Asanovic, +10 more

TL;DR: The parallel landscape is framed with seven questions, and the following are recommended to explore the design space rapidly: (1) the overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems; (2) the target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.

Journal Article

Roofline: an insightful visual performance model for multicore architectures

Samuel Williams, +2 more

- 01 Apr 2009 -

Communications of The ACM

TL;DR: The Roofline model offers insight on how to improve the performance of software and hardware in the rapidly changing world of connected devices.

Journal Article

A Survey of General-Purpose Computation on Graphics Hardware

John D. Owens, +6 more

- 01 Mar 2007 -

Computer Graphics Forum

TL;DR: This report describes, summarizes, and analyzes the latest research in mapping general‐purpose computation to graphics hardware.

Related Papers (5)

Rodinia: A benchmark suite for heterogeneous computing

Programming Massively Parallel Processors: A Hands-on Approach

Validity of the single processor approach to achieving large scale computing capabilities

Scalable parallel programming with CUDA

A Survey of General-Purpose Computation on Graphics Hardware