Author

Martin Perrigo

Bio: Martin Perrigo is an academic researcher. The author has contributed to research in topics: Database-centric architecture & Cellular architecture. The author has an hindex of 1, co-authored 1 publications receiving 48 citations.

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Highly scalable near memory processing with migrating threads on the emu system architecture

[...]

Timothy J. Dysart, Peter M. Kogge, Martin Deneroff, Eric Bovell, Preston Briggs, Jay B. Brockman, Kenneth Jacobsen, Yujen Juan, Shannon K. Kuntz, Richard Lethin, Janice O. McMahon, Chandra Pawar, Martin Perrigo, Sarah Rucker, John Ruttenberg, Max Ruttenberg, Steve Stein - Show less +13 more

13 Nov 2016

TL;DR: A new, highly-scalable PGAS memory-centric system architecture where migrating threads travel to the data they access, and a comparison of key parameters with a variety of today's systems, of differing architectures, indicates the potential advantages.

...read moreread less

Abstract: There is growing evidence that current architectures do not well handle cache-unfriendly applications such as sparse math operations, data analytics, and graph algorithms. This is due, in part, to the irregular memory access patterns demonstrated by these applications, and in how remote memory accesses are handled. This paper introduces a new, highly-scalable PGAS memory-centric system architecture where migrating threads travel to the data they access. Scaling both memory capacities and the number of cores can be largely invisible to the programmer.The first implementation of this architecture, implemented with FPGAs, is discussed in detail. A comparison of key parameters with a variety of today's systems, of differing architectures, indicates the potential advantages. Early projections of performance against several well-documented kernels translate these advantages into comparative numbers. Future implementations of this architecture may expand the performance advantages by the application of current state of the art silicon technology.

...read moreread less

53 citations

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

Multithreaded sparse matrix-matrix multiplication for many-core and GPU architectures

[...]

Mehmet Deveci¹, Christian Robert Trott¹, Sivasankaran Rajamanickam¹•Institutions (1)

Sandia National Laboratories¹

01 Oct 2018

TL;DR: This paper develops parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures and develops a meta-algorithm, kkSpGEMM, to choose the right algorithm and data structure based on the characteristics of the problem.

...read moreread less

Abstract: Sparse matrix-matrix multiplication is a key kernel that has applications in several domains such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures. The performance of these algorithms depend on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance difference between these data structures. Furthermore, we develop a meta-algorithm, kkSpGEMM , to choose the right algorithm and data structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved.

...read moreread less

40 citations

Proceedings Article•DOI•

An Initial Characterization of the Emu Chick

[...]

Eric R. Hein¹, Thomas M. Conte¹, Jeffrey Young¹, Srinivas Eswar¹, Jiajia Li¹, Patrick Lavin¹, Richard Vuduc¹, Jason Riedy¹ - Show less +4 more•Institutions (1)

Georgia Institute of Technology¹

21 May 2018

TL;DR: This initial evaluation demonstrates that the Emu Chick uses available memory bandwidth more efficiently than a more traditional, cache-based architecture and provides stable, predictable performance with 80% bandwidth utilization on a random-access pointer chasing benchmark with weak locality.

...read moreread less

Abstract: The Emu Chick is a prototype system designed around the concept of migratory memory-side processing. Rather than transferring large amounts of data across power-hungry, high-latency interconnects, the Emu Chick moves lightweight thread contexts to near-memory cores before the beginning of each memory read. The current prototype hardware uses FPGAs to implement cache-less "Gossamer" cores for doing computational work and a stationary core to run basic operating system functions and migrate threads between nodes. In this initial characterization of the Emu Chick, we study the memory bandwidth characteristics of the system through benchmarks like STREAM, pointer chasing, and sparse matrix vector multiply. We compare the Emu Chick hardware to architectural simulation and Intel Xeon-based platforms. While it is difficult to accurately compare prototype hardware with existing systems, our initial evaluation demonstrates that the Emu Chick uses available memory bandwidth more efficiently than a more traditional, cache-based architecture. Moreover, the Emu Chick provides stable, predictable performance with 80% bandwidth utilization on a random-access pointer chasing benchmark with weak locality.

...read moreread less

23 citations

Posted Content•

Multi-threaded Sparse Matrix-Matrix Multiplication for Many-Core and GPU Architectures

[...]

Mehmet Deveci, Christian Robert Trott, Sivasankaran Rajamanickam

09 Jan 2018-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: In this article, the authors develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high-performance computing architectures, and compare different types of accumulators in these algorithms and demonstrate the performance difference between these data structures.

...read moreread less

Abstract: Sparse Matrix-Matrix multiplication is a key kernel that has applications in several domains such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures. The performance of these algorithms depend on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance difference between these data structures. Furthermore, we develop a meta-algorithm, kkSpGEMM, to choose the right algorithm and data structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved.

...read moreread less

17 citations

Journal Article•DOI•

Design of Processing-“Inside”-Memory Optimized for DRAM Behaviors

[...]

Wonjun Lee¹, Chang Hyun Kim¹, Yoonah Paik¹, Jongsun Park¹, Il Park², Seon Wook Kim¹ - Show less +2 more•Institutions (2)

Korea University¹, SK Hynix²

21 Jun 2019-IEEE Access

TL;DR: This paper shows how to design and operate the PIM computing units inside DRAM by effectively coordinating with standard DRAM operations while achieving the full computing performance and minimizing the implementation cost.

...read moreread less

Abstract: The computing domain of today’s computer systems is moving very fast from arithmetic to data processing as data volumes grow exponentially. As a result, processing-in-memory (PIM) studies have been actively conducted to support the data processing in or near memory devices to address the limited bandwidth and high power consumption due to data movement between CPU/GPU and memory. However, most PIM studies so far have been conducted in a way that the processing units are designed only as an accelerator on the base die of 3D-stacked DRAM, not involved inside memory while not servicing the standard DRAM requests during the PIM execution. Therefore, in this paper, we show how to design and operate the PIM computing units inside DRAM by effectively coordinating with standard DRAM operations while achieving the full computing performance and minimizing the implementation cost. To make our goals, we extend a standard DRAM state diagram to depict the PIM behaviors in the same way as standard DRAM commands are scheduled and operated on the DRAM devices and exploit several levels of parallelism to overlap memory and computing operations. Also, we present how the entire architecture layers from applications to operating systems, memory controllers, and PIM devices should work together for the effective execution by applying our approaches to our experiment platform. In our HBM2-based experimental platform to include 16-cycle MAC (Multiply-and-Add) units and 8-cycle reducers for a matrix-vector multiplication, we achieved 406% and 35.2% faster performance by the all-bank and the per-bank schedulings, respectively, at ( $1024\times1024$ ) $\times $ ( $1024\times1$ ) 8-bit integer matrix-vector multiplication than the execution of only its operand burst reads assuming the external full DRAM bandwidth. It should be noted that the performance of the PIM on a base die of a 3D-stacked memory cannot be better than that provided by the full bandwidth in any case.

...read moreread less

16 citations

Posted Content•

PIUMA: Programmable Integrated Unified Memory Architecture.

[...]

13 Oct 2020-arXiv: Hardware Architecture

TL;DR: This paper provides initial performance estimations, projecting that a PIUMA node will outperform a conventional compute node by one to two orders of magnitude, and continues to scale across multiple nodes, which is a challenge in conventional multinode setups.

...read moreread less

Abstract: High performance large scale graph analytics is essential to timely analyze relationships in big data sets. Conventional processor architectures suffer from inefficient resource usage and bad scaling on graph workloads. To enable efficient and scalable graph analysis, Intel developed the Programmable Integrated Unified Memory Architecture (PIUMA). PIUMA consists of many multi-threaded cores, fine-grained memory and network accesses, a globally shared address space and powerful offload engines. This paper presents the PIUMA architecture, and provides initial performance estimations, projecting that a PIUMA node will outperform a conventional compute node by one to two orders of magnitude. Furthermore, PIUMA continues to scale across multiple nodes, which is a challenge in conventional multinode setups.

...read moreread less

12 citations

Collapse