Proceedings Article

An Initial Characterization of the Emu Chick

TLDR
This initial evaluation demonstrates that the Emu Chick uses available memory bandwidth more efficiently than a more traditional, cache-based architecture and provides stable, predictable performance with 80% bandwidth utilization on a random-access pointer chasing benchmark with weak locality.
Abstract
The Emu Chick is a prototype system designed around the concept of migratory memory-side processing. Rather than transferring large amounts of data across power-hungry, high-latency interconnects, the Emu Chick moves lightweight thread contexts to near-memory cores before the beginning of each memory read. The current prototype hardware uses FPGAs to implement cache-less "Gossamer" cores for doing computational work and a stationary core to run basic operating system functions and migrate threads between nodes. In this initial characterization of the Emu Chick, we study the memory bandwidth characteristics of the system through benchmarks like STREAM, pointer chasing, and sparse matrix vector multiply. We compare the Emu Chick hardware to architectural simulation and Intel Xeon-based platforms. While it is difficult to accurately compare prototype hardware with existing systems, our initial evaluation demonstrates that the Emu Chick uses available memory bandwidth more efficiently than a more traditional, cache-based architecture. Moreover, the Emu Chick provides stable, predictable performance with 80% bandwidth utilization on a random-access pointer chasing benchmark with weak locality.
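As a rough illustration of what a random-access pointer chasing benchmark measures, the following is a minimal Python sketch and not the authors' benchmark code. The function names (`build_chain`, `chase`, `utilization`) are hypothetical, the chain is fully random rather than tuned for weak locality, and the peak bandwidth passed to `utilization` is assumed to come from a separate measurement or vendor specification.

```python
import random
import time


def build_chain(n, seed=0):
    """Return a list nxt where following nxt from slot 0 visits all n slots once.

    Uses Sattolo's algorithm, which produces a random single-cycle permutation,
    so every hop depends on the result of the previous load.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i guarantees one cycle of length n
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, start=0):
    """Follow the chain until it returns to start; return the number of hops."""
    hops = 0
    i = start
    while True:
        i = nxt[i]
        hops += 1
        if i == start:
            return hops


def utilization(bytes_moved, seconds, peak_bytes_per_s):
    """Fraction of peak bandwidth achieved (peak is supplied by the caller)."""
    return (bytes_moved / seconds) / peak_bytes_per_s


if __name__ == "__main__":
    chain = build_chain(1_000_000)
    t0 = time.perf_counter()
    hops = chase(chain)
    elapsed = time.perf_counter() - t0
    print(f"{hops} dependent loads in {elapsed:.3f} s")
```

Because each load address depends on the previous load, caches and prefetchers help little, which is why this access pattern is a common stress test for memory systems. Timings from an interpreted sketch like this reflect interpreter overhead more than hardware bandwidth; a compiled implementation would be needed for meaningful numbers.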


Citations
Journal Article

PASTA : a parallel sparse tensor algorithm benchmark suite

TL;DR: This work presents a sparse tensor algorithm benchmark suite (PASTA) for single- and multi-core CPUs that aims to help application users evaluate different computer systems using its representative computational workloads.
Posted Content

Programming Strategies for Irregular Algorithms on the Emu Chick

TL;DR: This work evaluates irregular algorithms that could benefit from the lightweight, memory-side processing of the Chick and demonstrates techniques and optimization strategies for achieving performance in sparse matrix-vector multiply operation (SpMV), breadth-first search (BFS), and graph alignment across up to eight distributed nodes encompassing 64 nodelets in the Chick system.
Proceedings Article

Experimental Insights from the Rogues Gallery

TL;DR: Highlights of the first one to two years of post-Moore era research with the Rogues Gallery are presented, and an indication is given of where the authors see future growth for this testbed and related efforts.
Proceedings Article

A Preliminary Study of Compiler Transformations for Graph Applications on the Emu System

TL;DR: This work explores two high-level compiler optimizations (loop fusion and edge flipping) and one low-level compiler transformation that leverages hardware support for remote atomic updates, all aimed at overheads arising from thread migration, creation, synchronization, and atomic operations.
Journal Article

A Microbenchmark Characterization of the Emu Chick

TL;DR: This multi-node characterization of the Emu Chick extends an earlier single-node investigation of the memory bandwidth characteristics of the system through benchmarks like STREAM, pointer chasing, and sparse matrix-vector multiplication and demonstrates that for many basic operations the Emu Chick can use available memory bandwidth more efficiently than a more traditional, cache-based architecture.
References
Journal Article

A case for intelligent RAM

TL;DR: The state of microprocessors and DRAMs today is reviewed, some of the opportunities and challenges for IRAMs are explored, and performance and energy efficiency of three IRAM designs are estimated.
Proceedings Article

The HPC Challenge (HPCC) benchmark suite

TL;DR: This tutorial will introduce attendees to HPCC, provide tools to examine differences in HPC architectures, and give hands-on training that will hopefully lead to better understanding of parallel environments.
Proceedings Article

Scalability! but at what cost?

TL;DR: This work surveys measurements of data-parallel systems recently reported in SOSP and OSDI, and finds that many systems have either a surprisingly large COST, often hundreds of cores, or simply underperform one thread for all of their reported configurations.
Proceedings Article

Practical Near-Data Processing for In-Memory Analytics Frameworks

TL;DR: This paper develops the hardware and software of an NDP architecture for in-memory analytics frameworks, including MapReduce, graph processing, and deep neural networks, and shows that it is critical to optimize software frameworks for spatial locality as it leads to 2.9x efficiency improvements for NDP.
Proceedings Article

Graphicionado: a high-performance and energy-efficient accelerator for graph analytics

TL;DR: Graphicionado augments the vertex programming paradigm, allowing different graph analytics applications to be mapped to the same accelerator framework, while maintaining flexibility through a small set of reconfigurable blocks, for high-performance, energy-efficient processing of graph analytics workloads.