Author

Gary W. Maier

Other affiliations: GlobalFoundries
Bio: Gary W. Maier is an academic researcher from IBM. The author has contributed to research in topics: eDRAM & DRAM. The author has an h-index of 11 and has co-authored 29 publications receiving 373 citations. Previous affiliations of Gary W. Maier include GlobalFoundries.

Papers
Proceedings ArticleDOI
18 Jun 2018
TL;DR: A multi-TOPS AI core is presented for acceleration of deep learning training and inference in systems from edge devices to data centers by employing a dataflow architecture and an on-chip scratchpad hierarchy.
Abstract: A multi-TOPS AI core is presented for acceleration of deep learning training and inference in systems from edge devices to data centers. With a programmable architecture and custom ISA, this engine achieves >90% sustained utilization across the range of neural network topologies by employing a dataflow architecture and an on-chip scratchpad hierarchy. Compute precision is optimized at 16b floating point (fp16) for high model accuracy in training and inference as well as 1b/2b (binary/ternary) integer for aggressive inference performance. At 1.5 GHz, the AI core prototype achieves 1.5 TFLOPS fp16, 12 TOPS ternary, or 24 TOPS binary peak performance in 14nm CMOS.

103 citations

Patent
01 Aug 2000
TL;DR: In this paper, the authors propose a built-in self-test (BIST) engine with a controller responsive to a test enable signal and operative to generate and store test data in the read/write memory, and a comparator operative to compare data retrieved from the read/write memory against the test data during a first-pass test, identifying failed cycles where the retrieved data does not correctly match the test data.
Abstract: A structure and method for an integrated circuit which includes read/write memory having a plurality of memory devices, each of the memory devices having a unique address; a built-in self-test (BIST) engine, the BIST engine having a controller responsive to a test enable signal and operative to generate and store test data in the read/write memory; a comparator operative to compare retrieved data read from the read/write memory and the test data during a first pass test, the comparator identifying failed cycles where the retrieved data does not correspond correctly to the test data; and a diagnostic unit operative to store the failed cycles and being responsive to the controller generating and storing the test data in the read/write memory and operative to store failed data and failing addresses during a first pass test, wherein the BIST engine stops only at each of the failed cycles during the first pass test.
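
The first-pass flow described above can be illustrated with a short Python sketch: write test data, optionally corrupt one cell to emulate a defect, then read back, compare, and log the failing cycles and addresses. All names and the list-based memory model below are hypothetical illustrations, not the patented hardware.

import random

def bist_first_pass(memory, pattern, inject_fault_at=None):
    """Write a test pattern, optionally corrupt one cell, then read back and
    record every failing cycle (address + data), as in a first-pass test."""
    # Write phase: the controller generates and stores test data.
    for addr, expected in enumerate(pattern):
        memory[addr] = expected

    # Optional single-bit fault injection to emulate a defective cell.
    if inject_fault_at is not None:
        memory[inject_fault_at] ^= 0x01

    # Read/compare phase: the comparator flags mismatching cycles and the
    # diagnostic unit stores the failing addresses and data.
    fails = []
    for cycle, expected in enumerate(pattern):
        observed = memory[cycle]
        if observed != expected:
            fails.append({"cycle": cycle, "addr": cycle,
                          "expected": expected, "observed": observed})
    return fails

# Usage: a plain list stands in for the read/write memory.
memory = [0] * 16
pattern = [random.randint(0, 255) for _ in range(16)]
print(bist_first_pass(memory, pattern, inject_fault_at=5))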

51 citations

Proceedings ArticleDOI
03 Apr 2012
TL;DR: This work describes the design and operation of a prototype of a 3D system, constructed by stacking a memory layer, built with eDRAM and logic blocks from the IBM Power7™ processor L3 cache, and a “processor proxy” layer in 45nm CMOS technology enhanced to include through-silicon vias (TSVs).
Abstract: 3D integration (3DI) holds promise for improved performance of integrated systems by increasing interconnect bandwidth [1]. A processor stacked with cache memory is one potential application of 3DI [2]. This work describes the design and operation of a prototype of a 3D system, constructed by stacking a memory layer, built with eDRAM [3] and logic blocks from the IBM Power7™ processor L3 cache, and a “processor proxy” layer in 45nm CMOS technology [4] enhanced to include through-silicon vias (TSVs) [5]. Unlike the previously reported 3D eDRAM [6], the 3D stack described here is constructed using 50μm pitch μC4's joining the front side of one thick chip to TSV connections on the back side of a thinned chip. TSVs are formed of Cu-filled vias that are ∼20μm in diameter and <100μm deep [5].

45 citations

Journal ArticleDOI
10 Nov 2020
TL;DR: RaPiD, a multi-tera operations per second (TOPS) AI hardware accelerator core that is built from the ground-up using AxC techniques across the stack including algorithms, architecture, programmability, and hardware, is presented.
Abstract: Advances in deep neural networks (DNNs) and the availability of massive real-world data have enabled superhuman levels of accuracy on many AI tasks and ushered in the explosive growth of AI workloads across the spectrum of computing devices. However, their superior accuracy comes at a high computational cost, which necessitates approaches beyond traditional computing paradigms to improve their operational efficiency. Leveraging the application-level insight of error resilience, we demonstrate how approximate computing (AxC) can significantly boost the efficiency of AI platforms and play a pivotal role in the broader adoption of AI-based applications and services. To this end, we present RaPiD, a multi-tera operations per second (TOPS) AI hardware accelerator core (fabricated in 14-nm technology) that we built from the ground up using AxC techniques across the stack including algorithms, architecture, programmability, and hardware. We highlight the workload-guided systematic explorations of AxC techniques for AI, including custom number representations, quantization/pruning methodologies, mixed-precision architecture design, instruction sets, and compiler technologies with quality programmability, employed in the RaPiD accelerator.
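
As a concrete illustration of one AxC technique the abstract names, the sketch below performs plain symmetric linear quantization to a low bit width in NumPy; the function, its parameters, and the 4-bit example are assumptions for illustration, not the RaPiD design.

import numpy as np

def quantize_symmetric(x, bits):
    """Symmetric linear quantization of a tensor to `bits`-bit signed integers.
    Returns integer codes plus the scale needed to dequantize (x ~ q * scale)."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8-bit; qmax = 1 gives ternary {-1, 0, 1}
    amax = float(np.max(np.abs(x)))
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

# Usage: the quantization error is bounded by half a step, which is the kind
# of error resilience approximate computing exploits.
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_symmetric(w, bits=4)
print(np.max(np.abs(w - q * s)))        # worst-case per-element error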

32 citations

Journal ArticleDOI
01 Dec 2018
TL;DR: This letter presents a multi-TOPS AI accelerator core for deep learning training and inference that achieves >90% sustained utilization across the range of neural network topologies by employing a dataflow architecture to provide high throughput and an on-chip scratchpad hierarchy to meet the bandwidth demands of the compute units.
Abstract: This letter presents a multi-TOPS AI accelerator core for deep learning training and inference. With a programmable architecture and custom ISA, this engine achieves >90% sustained utilization across the range of neural network topologies by employing a dataflow architecture to provide high throughput and an on-chip scratchpad hierarchy to meet the bandwidth demands of the compute units. A custom 16b floating point (fp16) representation with 1 sign bit, 6 exponent bits, and 9 mantissa bits has also been developed for high model accuracy in training and inference as well as 1b/2b (binary/ternary) integer for aggressive inference performance. At 1.5 GHz, the AI core prototype achieves 1.5 TFLOPS fp16, 12 TOPS ternary, or 24 TOPS binary peak performance in 14-nm CMOS.
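
The 1/6/9 bit split can be made concrete with a simplified Python encoder/decoder. The exponent bias of 31, the truncating round, and the clamping below are assumptions for illustration only; the sketch ignores subnormals, NaN/Inf, and the rounding behavior of the actual hardware.

import math

SIGN_BITS, EXP_BITS, FRAC_BITS = 1, 6, 9
BIAS = (1 << (EXP_BITS - 1)) - 1                 # assumed bias of 31

def encode_1_6_9(x):
    """Pack a Python float into the 1-sign / 6-exponent / 9-mantissa layout.
    Simplified: truncates the fraction and clamps out-of-range exponents."""
    if x == 0.0:
        return 0
    sign = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))                    # abs(x) = m * 2**e, m in [0.5, 1)
    exp = e - 1 + BIAS                           # re-center so the mantissa is in [1, 2)
    frac = int((m * 2 - 1) * (1 << FRAC_BITS))   # drop the implicit leading 1
    exp = max(0, min((1 << EXP_BITS) - 1, exp))  # clamp instead of signalling
    return (sign << (EXP_BITS + FRAC_BITS)) | (exp << FRAC_BITS) | frac

def decode_1_6_9(bits):
    sign = -1.0 if (bits >> (EXP_BITS + FRAC_BITS)) & 1 else 1.0
    exp = (bits >> FRAC_BITS) & ((1 << EXP_BITS) - 1)
    frac = bits & ((1 << FRAC_BITS) - 1)
    return sign * (1 + frac / (1 << FRAC_BITS)) * 2.0 ** (exp - BIAS)

print(decode_1_6_9(encode_1_6_9(3.14159)))       # ~3.140625: 9 mantissa bits of precision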

29 citations


Cited by
Patent
10 Dec 2012
TL;DR: In this paper, a system, method, and computer program product for a memory system are described: the system includes a first semiconductor platform including at least one first circuit, and at least one additional semiconductor platform stacked with the first semiconductor platform and including at least one additional circuit.
Abstract: A system, method, and computer program product are provided for a memory system. The system includes a first semiconductor platform including at least one first circuit, and at least one additional semiconductor platform stacked with the first semiconductor platform and including at least one additional circuit.

387 citations

Proceedings Article
Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, Kailash Gopalakrishnan
19 Dec 2018
TL;DR: In this paper, the authors demonstrate the successful training of deep neural networks using 8-bit floating point numbers while fully maintaining the accuracy on a spectrum of deep learning models and datasets.
Abstract: The state-of-the-art hardware platforms for training deep neural networks are moving from traditional single precision (32-bit) computations towards 16 bits of precision, in large part due to the high energy efficiency and smaller bit storage associated with using reduced-precision representations. However, unlike inference, training with numbers represented with less than 16 bits has been challenging due to the need to maintain fidelity of the gradient computations during back-propagation. Here we demonstrate, for the first time, the successful training of deep neural networks using 8-bit floating point numbers while fully maintaining the accuracy on a spectrum of deep learning models and datasets. In addition to reducing the data and computation precision to 8 bits, we also successfully reduce the arithmetic precision for additions (used in partial product accumulation and weight updates) from 32 bits to 16 bits through the introduction of a number of key ideas including chunk-based accumulation and floating point stochastic rounding. The use of these novel techniques lays the foundation for a new generation of hardware training platforms with the potential for 2-4 times improved throughput over today's systems.
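
The two accumulation ideas named in the abstract can be sketched in NumPy: chunk-based accumulation sums fixed-size chunks in low precision before combining the partial sums, and stochastic rounding keeps rounding error zero-mean. The chunk size, fp16 data type, and helper names below are assumptions for illustration, not the paper's exact scheme.

import numpy as np

def chunked_sum(values, chunk_size=64, dtype=np.float16):
    """Accumulate fixed-size chunks in low precision, then combine the partial
    sums; this limits the magnitude gap between accumulator and addend."""
    values = np.asarray(values, dtype=dtype)
    partials = [values[i:i + chunk_size].sum(dtype=dtype)
                for i in range(0, len(values), chunk_size)]
    return np.sum(np.asarray(partials, dtype=dtype), dtype=dtype)

def stochastic_round(x, step, rng):
    """Round to a multiple of `step`, rounding up with probability equal to the
    fractional remainder so the expected rounding error is zero."""
    q = np.floor(np.asarray(x, dtype=np.float64) / step)
    q = q + (rng.random(q.shape) < (np.asarray(x, dtype=np.float64) / step - q))
    return q * step

# Usage: a long serial fp16 accumulation loses small addends ("swamping"),
# while the chunked version stays close to the float64 reference.
rng = np.random.default_rng(0)
x = rng.random(16384).astype(np.float32)
exact = x.sum(dtype=np.float64)
naive = np.float16(0)
for v in x.astype(np.float16):
    naive = np.float16(naive + v)
print(abs(float(naive) - exact), abs(float(chunked_sum(x)) - exact))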

231 citations

Journal ArticleDOI
TL;DR: An overview of the fundamentals of IMC is provided to better explain these challenges and then promising paths forward among the wide range of emerging research are identified.
Abstract: High-dimensionality matrix-vector multiplication (MVM) is a dominant kernel in signal-processing and machine-learning computations that are being deployed in a range of energy- and throughput-constrained applications. In-memory computing (IMC) exploits the structural alignment between a dense 2D array of bit cells and the dataflow in MVM, enabling opportunities to address computational energy and throughput. Recent prototypes have demonstrated the potential for 10× benefits in both metrics. However, fitting computation within an array of constrained bit-cell circuits imposes a number of challenges, including the need for analog computation, efficient interfacing with conventional digital accelerators (enabling the required programmability), and efficient virtualization of the hardware to map software. This article provides an overview of the fundamentals of IMC to better explain these challenges and then identifies promising paths forward among the wide range of emerging research.
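
The structural alignment the article refers to can be illustrated with a toy model: weight rows sit in the 2D array, all inputs drive the array in parallel, and the analog column sums are read out through a low-resolution ADC. The sketch below is purely illustrative and does not correspond to any particular IMC prototype.

import numpy as np

def imc_mvm(W, x, adc_levels=32):
    """Toy in-memory MVM: the matrix-vector product models parallel in-array
    accumulation, and the uniform quantizer models a low-resolution ADC."""
    analog = W @ x                                    # accumulation along bit-cell columns
    vmax = float(np.max(np.abs(analog))) or 1.0
    step = 2 * vmax / (adc_levels - 1)
    return np.round(analog / step) * step             # quantized readout

W = np.random.choice([-1.0, 1.0], size=(8, 16))       # binary weights in the bit cells
x = np.random.choice([0.0, 1.0], size=16)             # binary input activations
print(imc_mvm(W, x))
print(W @ x)                                           # exact digital reference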

189 citations

Patent
09 Jun 2008
TL;DR: This patent discloses a software system that facilitates tracing the execution paths of a program, called a client or application; trace data corresponding to selected system resources that interact with the application's execution is collected during the tracing operation and stored in an application signature.
Abstract: A software system is disclosed which facilitates the process of tracing the execution paths of a program, called a client or application. Trace data corresponding to selected system resources that interact with the execution of the application is collected during the tracing operation and stored in an application signature. A computer system user can generate trace options, trace the application, and compare the application signature to a known software configuration. The application signature is compared to a reference signature created by tracing the execution of the application on a system with the known software configuration. In another embodiment, the application signature is compared to a static configuration of a reference computer.
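
A minimal Python sketch of the signature-and-compare idea: run the application under a trace hook, record which functions it executes, and diff that set against a reference run. This is only a rough analogy to the patented system; all names below are hypothetical.

import sys

def helper(i):
    return i * i

def workload(n):
    return [helper(k) for k in range(n)]

def collect_signature(func, *args):
    """Run `func` under a trace hook and record the (module, function) pairs it
    executes, a crude stand-in for the 'application signature'."""
    signature = set()

    def tracer(frame, event, arg):
        if event == "call":
            signature.add((frame.f_globals.get("__name__"), frame.f_code.co_name))
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return signature

def compare_signatures(candidate, reference):
    """Report trace entries seen in one run but not the other."""
    return {"missing": reference - candidate, "extra": candidate - reference}

# Usage: the second run never calls `helper`, so the comparison flags it.
reference = collect_signature(workload, 3)
candidate = collect_signature(workload, 0)
print(compare_signatures(candidate, reference))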

170 citations

Posted Content
TL;DR: Detailed characterizations of deep learning models used in many Facebook social network services are provided and the need for better co-design of algorithms, numerics and computing platforms to address the challenges of workloads often run in data centers is highlighted.
Abstract: The application of deep learning techniques has resulted in remarkable improvement of machine learning models. This paper provides detailed characterizations of deep learning models used in many Facebook social network services. We present computational characteristics of our models, describe high-performance optimizations targeting existing systems, point out their limitations, and make suggestions for future general-purpose/accelerated inference hardware. We also highlight the need for better co-design of algorithms, numerics, and computing platforms to address the challenges of workloads often run in data centers.

155 citations