An Initial Characterization of the Emu Chick

doi:10.1109/IPDPSW.2018.00097

Proceedings Article

pp. 579-588

TLDR

This initial evaluation demonstrates that the Emu Chick uses available memory bandwidth more efficiently than a more traditional, cache-based architecture and provides stable, predictable performance with 80% bandwidth utilization on a random-access pointer chasing benchmark with weak locality.

Abstract:

The Emu Chick is a prototype system designed around the concept of migratory memory-side processing. Rather than transferring large amounts of data across power-hungry, high-latency interconnects, the Emu Chick moves lightweight thread contexts to near-memory cores before the beginning of each memory read. The current prototype hardware uses FPGAs to implement cache-less "Gossamer" cores for doing computational work and a stationary core to run basic operating system functions and migrate threads between nodes. In this initial characterization of the Emu Chick, we study the memory bandwidth characteristics of the system through benchmarks like STREAM, pointer chasing, and sparse matrix-vector multiply. We compare the Emu Chick hardware to architectural simulation and Intel Xeon-based platforms. While it is difficult to accurately compare prototype hardware with existing systems, our initial evaluation demonstrates that the Emu Chick uses available memory bandwidth more efficiently than a more traditional, cache-based architecture. Moreover, the Emu Chick provides stable, predictable performance with 80% bandwidth utilization on a random-access pointer chasing benchmark with weak locality.
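To make the random-access pointer-chasing pattern and the "bandwidth utilization" figure concrete, here is a minimal sketch. It is not the authors' Emu Chick benchmark code, and the paper's exact list layout may differ. It assumes a linked list stored as a successor-index array whose links form one random cycle (built with Sattolo's algorithm), so every load depends on the previous one and there is little spatial locality. Utilization is taken to be achieved bandwidth divided by a peak bandwidth that the caller supplies. The names `make_chain`, `chase`, and `bandwidth_utilization`, the 8-byte-per-hop assumption, and the peak value are all hypothetical.

```python
import random
import time


def make_chain(n, seed=0):
    """Build a successor array whose links form one random cycle over n slots.

    Sattolo's algorithm draws j strictly below i, which guarantees a single
    n-cycle, so a chase starting anywhere visits every slot before repeating.
    """
    rng = random.Random(seed)
    next_idx = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i
        next_idx[i], next_idx[j] = next_idx[j], next_idx[i]
    return next_idx


def chase(next_idx, steps, start=0):
    """Follow `steps` dependent links; each load needs the previous result."""
    idx = start
    for _ in range(steps):
        idx = next_idx[idx]
    return idx


def bandwidth_utilization(bytes_moved, seconds, peak_bytes_per_s):
    """Fraction of peak bandwidth achieved: (bytes / seconds) / peak."""
    if seconds <= 0 or peak_bytes_per_s <= 0:
        raise ValueError("seconds and peak bandwidth must be positive")
    return (bytes_moved / seconds) / peak_bytes_per_s


if __name__ == "__main__":
    n = 1 << 16
    chain = make_chain(n, seed=42)
    t0 = time.perf_counter()
    end = chase(chain, n)
    elapsed = time.perf_counter() - t0
    BYTES_PER_HOP = 8        # assumption: one 8-byte element touched per hop
    PEAK_BYTES_PER_S = 10e9  # hypothetical peak; substitute the real figure
    print("back at start:", end == 0)
    print("utilization:",
          bandwidth_utilization(BYTES_PER_HOP * n, elapsed, PEAK_BYTES_PER_S))
```

In CPython, interpreter overhead dominates each hop, so the printed utilization says nothing about real hardware. The sketch only shows the dependent, random access pattern and the arithmetic behind a utilization percentage. A faithful measurement would run the same loop in a compiled language on the target system.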

Citations

Journal Article

PASTA: a parallel sparse tensor algorithm benchmark suite

Jiajia Li, +4 more

TL;DR: This work presents a sparse tensor algorithm benchmark suite (PASTA) for single- and multi-core CPUs that aims to help application users evaluate different computer systems using its representative computational workloads.

Posted Content

Programming Strategies for Irregular Algorithms on the Emu Chick

Eric R. Hein, +9 more

03 Dec 2018

arXiv: Distributed, Parallel, and Cluste...

TL;DR: This work evaluates irregular algorithms that could benefit from the lightweight, memory-side processing of the Chick and demonstrates techniques and optimization strategies for achieving performance in the sparse matrix-vector multiply (SpMV) operation, breadth-first search (BFS), and graph alignment across up to eight distributed nodes encompassing 64 nodelets in the Chick system.

Proceedings Article

Experimental Insights from the Rogues Gallery

Jeffrey Young, +5 more

TL;DR: Highlights of the first one to two years of post-Moore era research with the Rogues Gallery are presented, and an indication is given of where the authors see future growth for this testbed and related efforts.

Proceedings Article

A Preliminary Study of Compiler Transformations for Graph Applications on the Emu System

Prasanth Chatarasi, +1 more

TL;DR: Two high-level compiler optimizations (loop fusion and edge flipping) and one low-level compiler transformation that leverages hardware support for remote atomic updates are explored to address overheads arising from thread migration, creation, synchronization, and atomic operations.

Journal Article

A Microbenchmark Characterization of the Emu Chick

Jeffrey Young, +7 more

TL;DR: This multi-node characterization of the Emu Chick extends an earlier single-node investigation of the memory bandwidth characteristics of the system through benchmarks like STREAM, pointer chasing, and sparse matrix-vector multiplication, and demonstrates that for many basic operations the Emu Chick can use available memory bandwidth more efficiently than a more traditional, cache-based architecture.

References

Journal Article

A case for intelligent RAM

David A. Patterson, +7 more

01 Mar 1997

IEEE Micro

TL;DR: The state of microprocessors and DRAMs today is reviewed, some of the opportunities and challenges for IRAMs are explored, and performance and energy efficiency of three IRAM designs are estimated.

Proceedings Article

The HPC Challenge (HPCC) benchmark suite

Piotr Luszczek, +6 more

TL;DR: This tutorial will introduce attendees to HPCC, provide tools to examine differences in HPC architectures, and give hands-on training that will hopefully lead to better understanding of parallel environments.

Proceedings Article

Scalability! but at what cost?

Frank McSherry, +2 more

TL;DR: This work surveys measurements of data-parallel systems recently reported in SOSP and OSDI, and finds that many systems have either a surprisingly large COST, often hundreds of cores, or simply underperform one thread for all of their reported configurations.

Proceedings Article

Practical Near-Data Processing for In-Memory Analytics Frameworks

Mingyu Gao, +2 more

TL;DR: This paper develops the hardware and software of a near-data processing (NDP) architecture for in-memory analytics frameworks, including MapReduce, graph processing, and deep neural networks, and shows that it is critical to optimize software frameworks for spatial locality, as doing so leads to 2.9x efficiency improvements for NDP.

Proceedings Article

Graphicionado: a high-performance and energy-efficient accelerator for graph analytics

Tae Jun Ham, +4 more

TL;DR: Graphicionado augments the vertex programming paradigm, allowing different graph analytics applications to be mapped to the same accelerator framework, while maintaining flexibility through a small set of reconfigurable blocks, for high-performance, energy-efficient processing of graph analytics workloads.

Related Papers (5)

Highly scalable near memory processing with migrating threads on the emu system architecture

Timothy J. Dysart, +16 more

Kokkos: Enabling manycore performance portability through polymorphic memory access patterns

H. Carter Edwards, +2 more

01 Dec 2014

Journal of Parallel and Distributed Comp...

Rodinia: A benchmark suite for heterogeneous computing

Designing Algorithms for the EMU Migrating-threads-based Architecture

A Preliminary Study of Compiler Transformations for Graph Applications on the Emu System