Author

Gary W. Maier

Other affiliations: GlobalFoundries
Bio: Gary W. Maier is an academic researcher from IBM. The author has contributed to research in topics: eDRAM & DRAM. The author has an h-index of 11 and has co-authored 29 publications receiving 373 citations. Previous affiliations of Gary W. Maier include GlobalFoundries.

Papers
Proceedings ArticleDOI
18 Jun 2018
TL;DR: A multi-TOPS AI core is presented for acceleration of deep learning training and inference in systems from edge devices to data centers by employing a dataflow architecture and an on-chip scratchpad hierarchy.
Abstract: A multi-TOPS AI core is presented for acceleration of deep learning training and inference in systems from edge devices to data centers. With a programmable architecture and custom ISA, this engine achieves >90% sustained utilization across the range of neural network topologies by employing a dataflow architecture and an on-chip scratchpad hierarchy. Compute precision is optimized at 16b floating point (fp16) for high model accuracy in training and inference as well as 1b/2b (binary/ternary) integer for aggressive inference performance. At 1.5 GHz, the AI core prototype achieves 1.5 TFLOPS fp16, 12 TOPS ternary, or 24 TOPS binary peak performance in 14nm CMOS.

103 citations

Patent
01 Aug 2000
TL;DR: In this paper, the authors propose a built-in self-test (BIST) engine with a controller responsive to a test enable signal and operative to generate and store test data in the read/write memory, and a comparator operative to compare data retrieved from the read/write memory against the test data during a first-pass test, identifying failed cycles where the retrieved data does not correctly match the test data.
Abstract: A structure and method for an integrated circuit which includes read/write memory having a plurality of memory devices, each of the memory devices having a unique address; a built-in self-test (BIST) engine, the BIST engine having a controller responsive to a test enable signal and operative to generate and store test data in the read/write memory; a comparator operative to compare retrieved data read from the read/write memory and the test data during a first pass test, the comparator identifying failed cycles where the retrieved data does not correspond correctly to the test data; and a diagnostic unit operative to store the failed cycles and being responsive to the controller generating and storing the test data in the read/write memory and operative to store failed data and failing addresses during a first pass test, wherein the BIST engine stops only at each of the failed cycles during the first pass test.
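
The first-pass flow described above can be illustrated with a short Python sketch: write test data, optionally corrupt one cell to emulate a defect, then read back, compare, and log the failing cycles and addresses. All names and the list-based memory model below are hypothetical illustrations, not the patented hardware.

import random

def bist_first_pass(memory, pattern, inject_fault_at=None):
    """Write a test pattern, optionally corrupt one cell, then read back and
    record every failing cycle (address + data), as in a first-pass test."""
    # Write phase: the controller generates and stores test data.
    for addr, expected in enumerate(pattern):
        memory[addr] = expected

    # Optional single-bit fault injection to emulate a defective cell.
    if inject_fault_at is not None:
        memory[inject_fault_at] ^= 0x01

    # Read/compare phase: the comparator flags mismatching cycles and the
    # diagnostic unit stores the failing addresses and data.
    fails = []
    for cycle, expected in enumerate(pattern):
        observed = memory[cycle]
        if observed != expected:
            fails.append({"cycle": cycle, "addr": cycle,
                          "expected": expected, "observed": observed})
    return fails

# Usage: a plain list stands in for the read/write memory.
memory = [0] * 16
pattern = [random.randint(0, 255) for _ in range(16)]
print(bist_first_pass(memory, pattern, inject_fault_at=5))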

51 citations

Proceedings ArticleDOI
03 Apr 2012
TL;DR: This work describes the design and operation of a prototype of a 3D system, constructed by stacking a memory layer, built with eDRAM and logic blocks from the IBM Power7™ processor L3 cache, and a “processor proxy” layer in 45nm CMOS technology enhanced to include through-silicon vias (TSVs).
Abstract: 3D integration (3DI) holds promise for improved performance of integrated systems by increasing interconnect bandwidth [1]. A processor stacked with cache memory is one potential application of 3DI [2]. This work describes the design and operation of a prototype of a 3D system, constructed by stacking a memory layer, built with eDRAM [3] and logic blocks from the IBM Power7™ processor L3 cache, and a “processor proxy” layer in 45nm CMOS technology [4] enhanced to include through-silicon vias (TSVs) [5]. Unlike the previously reported 3D eDRAM [6], the 3D stack described here is constructed using 50μm pitch μC4's joining the front side of one thick chip to TSV connections on the back side of a thinned chip. TSVs are formed of Cu-filled vias that are ∼20μm in diameter and <100μm deep [5].

45 citations

Journal ArticleDOI
10 Nov 2020
TL;DR: RaPiD, a multi-tera operations per second (TOPS) AI hardware accelerator core that is built from the ground-up using AxC techniques across the stack including algorithms, architecture, programmability, and hardware, is presented.
Abstract: Advances in deep neural networks (DNNs) and the availability of massive real-world data have enabled superhuman levels of accuracy on many AI tasks and ushered in the explosive growth of AI workloads across the spectrum of computing devices. However, their superior accuracy comes at a high computational cost, which necessitates approaches beyond traditional computing paradigms to improve their operational efficiency. Leveraging the application-level insight of error resilience, we demonstrate how approximate computing (AxC) can significantly boost the efficiency of AI platforms and play a pivotal role in the broader adoption of AI-based applications and services. To this end, we present RaPiD, a multi-tera operations per second (TOPS) AI hardware accelerator core (fabricated in 14-nm technology) that we built from the ground up using AxC techniques across the stack including algorithms, architecture, programmability, and hardware. We highlight the workload-guided systematic explorations of AxC techniques for AI, including custom number representations, quantization/pruning methodologies, mixed-precision architecture design, instruction sets, and compiler technologies with quality programmability, employed in the RaPiD accelerator.
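
As a concrete illustration of one AxC technique the abstract names, the sketch below performs plain symmetric linear quantization to a low bit width in NumPy; the function, its parameters, and the 4-bit example are assumptions for illustration, not the RaPiD design.

import numpy as np

def quantize_symmetric(x, bits):
    """Symmetric linear quantization of a tensor to `bits`-bit signed integers.
    Returns integer codes plus the scale needed to dequantize (x ~ q * scale)."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8-bit; qmax = 1 gives ternary {-1, 0, 1}
    amax = float(np.max(np.abs(x)))
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

# Usage: the quantization error is bounded by half a step, which is the kind
# of error resilience approximate computing exploits.
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_symmetric(w, bits=4)
print(np.max(np.abs(w - q * s)))        # worst-case per-element error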

32 citations

Journal ArticleDOI
01 Dec 2018
TL;DR: This letter presents a multi-TOPS AI accelerator core for deep learning training and inference that achieves >90% sustained utilization across the range of neural network topologies by employing a dataflow architecture to provide high throughput and an on-chip scratchpad hierarchy to meet the bandwidth demands of the compute units.
Abstract: This letter presents a multi-TOPS AI accelerator core for deep learning training and inference. With a programmable architecture and custom ISA, this engine achieves >90% sustained utilization across the range of neural network topologies by employing a dataflow architecture to provide high throughput and an on-chip scratchpad hierarchy to meet the bandwidth demands of the compute units. A custom 16b floating point (fp16) representation with 1 sign bit, 6 exponent bits, and 9 mantissa bits has also been developed for high model accuracy in training and inference as well as 1b/2b (binary/ternary) integer for aggressive inference performance. At 1.5 GHz, the AI core prototype achieves 1.5 TFLOPS fp16, 12 TOPS ternary, or 24 TOPS binary peak performance in 14-nm CMOS.
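
The 1/6/9 bit split can be made concrete with a simplified Python encoder/decoder. The exponent bias of 31, the truncating round, and the clamping below are assumptions for illustration only; the sketch ignores subnormals, NaN/Inf, and the rounding behavior of the actual hardware.

import math

SIGN_BITS, EXP_BITS, FRAC_BITS = 1, 6, 9
BIAS = (1 << (EXP_BITS - 1)) - 1                 # assumed bias of 31

def encode_1_6_9(x):
    """Pack a Python float into the 1-sign / 6-exponent / 9-mantissa layout.
    Simplified: truncates the fraction and clamps out-of-range exponents."""
    if x == 0.0:
        return 0
    sign = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))                    # abs(x) = m * 2**e, m in [0.5, 1)
    exp = e - 1 + BIAS                           # re-center so the mantissa is in [1, 2)
    frac = int((m * 2 - 1) * (1 << FRAC_BITS))   # drop the implicit leading 1
    exp = max(0, min((1 << EXP_BITS) - 1, exp))  # clamp instead of signalling
    return (sign << (EXP_BITS + FRAC_BITS)) | (exp << FRAC_BITS) | frac

def decode_1_6_9(bits):
    sign = -1.0 if (bits >> (EXP_BITS + FRAC_BITS)) & 1 else 1.0
    exp = (bits >> FRAC_BITS) & ((1 << EXP_BITS) - 1)
    frac = bits & ((1 << FRAC_BITS) - 1)
    return sign * (1 + frac / (1 << FRAC_BITS)) * 2.0 ** (exp - BIAS)

print(decode_1_6_9(encode_1_6_9(3.14159)))       # ~3.140625: 9 mantissa bits of precision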

29 citations


Cited by
Patent
10 Dec 2012
TL;DR: In this paper, a system, method, and computer program product for a memory system are described: the system includes a first semiconductor platform including at least one first circuit, and at least one additional semiconductor platform stacked with the first semiconductor platform and including at least one additional circuit.
Abstract: A system, method, and computer program product are provided for a memory system. The system includes a first semiconductor platform including at least one first circuit, and at least one additional semiconductor platform stacked with the first semiconductor platform and including at least one additional circuit.

387 citations

Proceedings Article
Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, Kailash Gopalakrishnan
19 Dec 2018
TL;DR: In this paper, the authors demonstrate the successful training of deep neural networks using 8-bit floating point numbers while fully maintaining the accuracy on a spectrum of deep learning models and datasets.
Abstract: The state-of-the-art hardware platforms for training deep neural networks are moving from traditional single precision (32-bit) computations towards 16 bits of precision, in large part due to the high energy efficiency and smaller bit storage associated with using reduced-precision representations. However, unlike inference, training with numbers represented with less than 16 bits has been challenging due to the need to maintain fidelity of the gradient computations during back-propagation. Here we demonstrate, for the first time, the successful training of deep neural networks using 8-bit floating point numbers while fully maintaining the accuracy on a spectrum of deep learning models and datasets. In addition to reducing the data and computation precision to 8 bits, we also successfully reduce the arithmetic precision for additions (used in partial product accumulation and weight updates) from 32 bits to 16 bits through the introduction of a number of key ideas including chunk-based accumulation and floating point stochastic rounding. The use of these novel techniques lays the foundation for a new generation of hardware training platforms with the potential for 2-4 times improved throughput over today's systems.
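
The two accumulation ideas named in the abstract can be sketched in NumPy: chunk-based accumulation sums fixed-size chunks in low precision before combining the partial sums, and stochastic rounding keeps rounding error zero-mean. The chunk size, fp16 data type, and helper names below are assumptions for illustration, not the paper's exact scheme.

import numpy as np

def chunked_sum(values, chunk_size=64, dtype=np.float16):
    """Accumulate fixed-size chunks in low precision, then combine the partial
    sums; this limits the magnitude gap between accumulator and addend."""
    values = np.asarray(values, dtype=dtype)
    partials = [values[i:i + chunk_size].sum(dtype=dtype)
                for i in range(0, len(values), chunk_size)]
    return np.sum(np.asarray(partials, dtype=dtype), dtype=dtype)

def stochastic_round(x, step, rng):
    """Round to a multiple of `step`, rounding up with probability equal to the
    fractional remainder so the expected rounding error is zero."""
    q = np.floor(np.asarray(x, dtype=np.float64) / step)
    q = q + (rng.random(q.shape) < (np.asarray(x, dtype=np.float64) / step - q))
    return q * step

# Usage: a long serial fp16 accumulation loses small addends ("swamping"),
# while the chunked version stays close to the float64 reference.
rng = np.random.default_rng(0)
x = rng.random(16384).astype(np.float32)
exact = x.sum(dtype=np.float64)
naive = np.float16(0)
for v in x.astype(np.float16):
    naive = np.float16(naive + v)
print(abs(float(naive) - exact), abs(float(chunked_sum(x)) - exact))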

231 citations

Journal ArticleDOI
TL;DR: An overview of the fundamentals of IMC is provided to better explain these challenges and then promising paths forward among the wide range of emerging research are identified.
Abstract: High-dimensionality matrix-vector multiplication (MVM) is a dominant kernel in signal-processing and machine-learning computations that are being deployed in a range of energy- and throughput-constrained applications. In-memory computing (IMC) exploits the structural alignment between a dense 2D array of bit cells and the dataflow in MVM, enabling opportunities to address computational energy and throughput. Recent prototypes have demonstrated the potential for 10× benefits in both metrics. However, fitting computation within an array of constrained bit-cell circuits imposes a number of challenges, including the need for analog computation, efficient interfacing with conventional digital accelerators (enabling the required programmability), and efficient virtualization of the hardware to map software. This article provides an overview of the fundamentals of IMC to better explain these challenges and then identifies promising paths forward among the wide range of emerging research.
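
The structural alignment the article refers to can be illustrated with a toy model: weight rows sit in the 2D array, all inputs drive the array in parallel, and the analog column sums are read out through a low-resolution ADC. The sketch below is purely illustrative and does not correspond to any particular IMC prototype.

import numpy as np

def imc_mvm(W, x, adc_levels=32):
    """Toy in-memory MVM: the matrix-vector product models parallel in-array
    accumulation, and the uniform quantizer models a low-resolution ADC."""
    analog = W @ x                                    # accumulation along bit-cell columns
    vmax = float(np.max(np.abs(analog))) or 1.0
    step = 2 * vmax / (adc_levels - 1)
    return np.round(analog / step) * step             # quantized readout

W = np.random.choice([-1.0, 1.0], size=(8, 16))       # binary weights in the bit cells
x = np.random.choice([0.0, 1.0], size=16)             # binary input activations
print(imc_mvm(W, x))
print(W @ x)                                           # exact digital reference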

189 citations

Patent
09 Jun 2008
TL;DR: This patent discloses a software system that facilitates tracing the execution paths of a program, called a client or application; trace data corresponding to selected system resources that interact with the application's execution is collected during the tracing operation and stored in an application signature.
Abstract: A software system is disclosed which facilitates the process of tracing the execution paths of a program, called a client or application. Trace data corresponding to selected system resources that interact with the execution of the application is collected during the tracing operation and stored in an application signature. A computer system user can generate trace options, trace the application, and compare the application signature to a known software configuration. The application signature is compared to a reference signature created by tracing the execution of the application on a system with the known software configuration. In another embodiment, the application signature is compared to a static configuration of a reference computer.
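
A minimal Python sketch of the signature-and-compare idea: run the application under a trace hook, record which functions it executes, and diff that set against a reference run. This is only a rough analogy to the patented system; all names below are hypothetical.

import sys

def helper(i):
    return i * i

def workload(n):
    return [helper(k) for k in range(n)]

def collect_signature(func, *args):
    """Run `func` under a trace hook and record the (module, function) pairs it
    executes, a crude stand-in for the 'application signature'."""
    signature = set()

    def tracer(frame, event, arg):
        if event == "call":
            signature.add((frame.f_globals.get("__name__"), frame.f_code.co_name))
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return signature

def compare_signatures(candidate, reference):
    """Report trace entries seen in one run but not the other."""
    return {"missing": reference - candidate, "extra": candidate - reference}

# Usage: the second run never calls `helper`, so the comparison flags it.
reference = collect_signature(workload, 3)
candidate = collect_signature(workload, 0)
print(compare_signatures(candidate, reference))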

170 citations

Posted Content
TL;DR: Detailed characterizations of deep learning models used in many Facebook social network services are provided and the need for better co-design of algorithms, numerics and computing platforms to address the challenges of workloads often run in data centers is highlighted.
Abstract: The application of deep learning techniques has resulted in remarkable improvement of machine learning models. This paper provides detailed characterizations of deep learning models used in many Facebook social network services. We present computational characteristics of our models, describe high-performance optimizations targeting existing systems, point out their limitations, and make suggestions for future general-purpose/accelerated inference hardware. We also highlight the need for better co-design of algorithms, numerics, and computing platforms to address the challenges of workloads often run in data centers.

155 citations