Author

Howard M. Haynie

Bio: Howard M. Haynie is an academic researcher from IBM. The author has contributed to research in topics: Network packet & Packet generator. The author has an h-index of 7 and has co-authored 23 publications receiving 174 citations.

Papers
Proceedings Article•DOI•
18 Jun 2018
TL;DR: A multi-TOPS AI core is presented for acceleration of deep learning training and inference in systems from edge devices to data centers by employing a dataflow architecture and an on-chip scratchpad hierarchy.
Abstract: A multi-TOPS AI core is presented for acceleration of deep learning training and inference in systems from edge devices to data centers. With a programmable architecture and custom ISA, this engine achieves >90% sustained utilization across the range of neural network topologies by employing a dataflow architecture and an on-chip scratchpad hierarchy. Compute precision is optimized at 16b floating point (fp16) for high model accuracy in training and inference as well as 1b/2b (binary/ternary) integer for aggressive inference performance. At 1.5 GHz, the AI core prototype achieves 1.5 TFLOPS fp16, 12 TOPS ternary, or 24 TOPS binary peak performance in 14nm CMOS.

103 citations
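
The peak figures quoted in the abstract above follow from simple throughput arithmetic. The short Python sketch below reproduces them under an assumed fp16 MAC-array size; the array size is chosen only to match the quoted numbers and is not a value taken from the paper.

# Back-of-envelope check of the quoted peak-throughput figures. The fp16
# MAC count is an assumption picked to reproduce the numbers, not a value
# reported in the paper.
FREQ_HZ = 1.5e9      # quoted clock frequency
FP16_MACS = 512      # assumed number of fp16 multiply-accumulate units

# Each multiply-accumulate counts as 2 floating-point operations.
peak_fp16_flops = FP16_MACS * 2 * FREQ_HZ
print(f"fp16 peak: {peak_fp16_flops / 1e12:.2f} TFLOPS")   # ~1.5 TFLOPS

# The quoted 12 TOPS ternary and 24 TOPS binary rates are roughly 8x and 16x
# the fp16 rate, consistent with packing more low-precision ops per cycle.
print(f"ternary/fp16: {12e12 / peak_fp16_flops:.1f}x")     # ~7.8x
print(f"binary/fp16:  {24e12 / peak_fp16_flops:.1f}x")     # ~15.6x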

Proceedings Article•DOI•
13 Feb 2021
TL;DR: In this article, a 4-core AI chip in 7nm EUV technology is presented to exploit cutting-edge algorithmic advances for iso-accurate models in low-precision training and inference to achieve leading-edge power-performance.
Abstract: Low-precision computation is the key enabling factor to achieve high compute densities (TOPS/W and TOPS/mm2) in AI hardware accelerators across cloud and edge platforms. However, robust deep learning (DL) model accuracy equivalent to high-precision computation must be maintained. Improvements in bandwidth, architecture, and power management are also required to harness the benefit of reduced precision by feeding and supporting more parallel engines to achieve high sustained utilization and optimize performance within a given product power envelope. In this work, we present a 4-core AI chip in 7nm EUV technology that exploits cutting-edge algorithmic advances for iso-accurate models in low-precision training and inference [1, 2] and aggressive circuit/architecture optimization to achieve leading-edge power-performance. The chip supports fp16 (DLFloat16 [8]) and hybrid-fp8 (hfp8) [1] formats for training and inference of DL models, as well as int4 and int2 formats for highly scaled inference.

56 citations
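
As a rough illustration of what supporting several reduced-precision floating-point formats means numerically, the Python sketch below fake-quantizes values to a configurable sign/exponent/mantissa format. The 1-4-3 and 1-5-2 bit splits shown for hfp8 are assumptions based on the hybrid-fp8 reference [1]; the chip's actual rounding and subnormal handling are not modeled.

import numpy as np

def fake_quant_float(x, exp_bits, man_bits):
    """Round x to the nearest value representable in a toy 1/exp_bits/man_bits
    floating-point format (normals only, round-to-nearest). Illustrative only;
    not the chip's arithmetic."""
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = 2 ** exp_bits - 2 - bias              # largest normal exponent
    out = np.zeros_like(x)
    nz = x != 0
    e = np.clip(np.floor(np.log2(np.abs(x[nz]))), 1 - bias, max_exp)
    step = 2.0 ** (e - man_bits)                    # mantissa grid spacing at that exponent
    out[nz] = np.round(x[nz] / step) * step
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** max_exp
    return np.clip(out, -max_val, max_val)          # saturate out-of-range values

x = np.random.randn(5)
print(fake_quant_float(x, exp_bits=4, man_bits=3))  # hfp8-style forward format (assumed 1-4-3)
print(fake_quant_float(x, exp_bits=5, man_bits=2))  # hfp8-style gradient format (assumed 1-5-2)
print(fake_quant_float(x, exp_bits=6, man_bits=9))  # DLFloat16-style 1-6-9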

Proceedings Article•DOI•
14 Jun 2021
TL;DR: RaPiD, as mentioned in this paper, is a 4-core AI accelerator chip supporting a spectrum of precisions, namely, 16 and 8-bit floating-point and 4 and 2-bit fixed-point.
Abstract: The growing prevalence and computational demands of Artificial Intelligence (AI) workloads have led to widespread use of hardware accelerators in their execution. Scaling the performance of AI accelerators across generations is pivotal to their success in commercial deployments. The intrinsic error-resilient nature of AI workloads presents a unique opportunity for performance/energy improvement through precision scaling. Motivated by recent algorithmic advances in precision scaling for inference and training, we designed RaPiD, a 4-core AI accelerator chip supporting a spectrum of precisions, namely, 16 and 8-bit floating-point and 4 and 2-bit fixed-point. The 36mm2 RaPiD chip fabricated in 7nm EUV technology delivers a peak 3.5 TFLOPS/W in HFP8 mode and 16.5 TOPS/W in INT4 mode at nominal voltage. Using a performance model calibrated to within 1% of the measurement results, we evaluated DNN inference using 4-bit fixed-point representation for a 4-core RaPiD chip system and DNN training using 8-bit floating-point representation for a 768 TFLOPS AI system comprising 4 32-core RaPiD chips. Our results show INT4 inference for a batch size of 1 achieves 3-13.5 (average 7) TOPS/W and FP8 training for a mini-batch of 512 achieves a sustained 102-588 (average 203) TFLOPS across a wide range of applications.

42 citations
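
A quick back-of-envelope reading of the numbers quoted above, relating the system-level peak to a per-core figure and restating the efficiency figures as energy per operation. This is a reader's arithmetic, not data from the paper.

# Quoted: a 768 TFLOPS FP8 training system built from 4 chips with 32 cores each.
system_peak_tflops = 768
chips, cores_per_chip = 4, 32
print(f"implied FP8 peak per core: {system_peak_tflops / (chips * cores_per_chip):.1f} TFLOPS")  # 6.0

# Quoted peak efficiencies restated as energy per operation (1 / ops-per-joule).
for label, tops_per_watt in [("HFP8", 3.5), ("INT4", 16.5)]:
    femtojoules_per_op = 1e15 / (tops_per_watt * 1e12)
    print(f"{label}: ~{femtojoules_per_op:.0f} fJ per operation")   # ~286 fJ and ~61 fJ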

Journal Article•DOI•
10 Nov 2020
TL;DR: RaPiD, a multi-tera operations per second (TOPS) AI hardware accelerator core that is built from the ground-up using AxC techniques across the stack including algorithms, architecture, programmability, and hardware, is presented.
Abstract: Advances in deep neural networks (DNNs) and the availability of massive real-world data have enabled superhuman levels of accuracy on many AI tasks and ushered in the explosive growth of AI workloads across the spectrum of computing devices. However, their superior accuracy comes at a high computational cost, which necessitates approaches beyond traditional computing paradigms to improve their operational efficiency. Leveraging the application-level insight of error resilience, we demonstrate how approximate computing (AxC) can significantly boost the efficiency of AI platforms and play a pivotal role in the broader adoption of AI-based applications and services. To this end, we present RaPiD, a multi-tera operations per second (TOPS) AI hardware accelerator core (fabricated in 14-nm technology) that we built from the ground up using AxC techniques across the stack, including algorithms, architecture, programmability, and hardware. We highlight the workload-guided systematic explorations of AxC techniques for AI, including custom number representations, quantization/pruning methodologies, mixed-precision architecture design, instruction sets, and compiler technologies with quality programmability, employed in the RaPiD accelerator.

32 citations
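
Two of the AxC techniques named in the abstract above, quantization and pruning, can be sketched in a few lines of NumPy. This is a generic toy illustration of the ideas, not the RaPiD quantization or pruning methodology.

import numpy as np

def symmetric_quantize(w, n_bits=4):
    """Toy symmetric per-tensor quantization: snap weights to a signed
    n_bits integer grid and map them back to floats."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def magnitude_prune(w, sparsity=0.5):
    """Toy magnitude pruning: zero out the smallest-magnitude fraction of weights."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= threshold, w, 0.0)

w = np.random.randn(1000)
w_approx = symmetric_quantize(magnitude_prune(w, sparsity=0.5), n_bits=4)
print("fraction of weights kept:", np.count_nonzero(w_approx) / w.size)
print("mean absolute error:     ", np.mean(np.abs(w - w_approx)))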

Journal Article•DOI•
01 Dec 2018
TL;DR: This letter presents a multi-TOPS AI accelerator core for deep learning training and inference that achieves >90% sustained utilization across the range of neural network topologies by employing a dataflow architecture to provide high throughput and an on-chip scratchpad hierarchy to meet the bandwidth demands of the compute units.
Abstract: This letter presents a multi-TOPS AI accelerator core for deep learning training and inference. With a programmable architecture and custom ISA, this engine achieves >90% sustained utilization across the range of neural network topologies by employing a dataflow architecture to provide high throughput and an on-chip scratchpad hierarchy to meet the bandwidth demands of the compute units. A custom 16b floating point (fp16) representation with 1 sign bit, 6 exponent bits, and 9 mantissa bits has also been developed for high model accuracy in training and inference as well as 1b/2b (binary/ternary) integer for aggressive inference performance. At 1.5 GHz, the AI core prototype achieves 1.5 TFLOPS fp16, 12 TOPS ternary, or 24 TOPS binary peak performance in 14-nm CMOS.

29 citations
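
To make the 1-6-9 format described in the letter above concrete, the short sketch below compares its dynamic range and precision against IEEE half precision (1-5-10), assuming IEEE-like conventions for normal numbers; the hardware's exact treatment of subnormals and special values may differ.

# Dynamic range and precision implied by a 1/6/9 format versus IEEE half (1/5/10).
def format_stats(exp_bits, man_bits):
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = 2 ** exp_bits - 2 - bias     # largest normal exponent
    min_exp = 1 - bias                     # smallest normal exponent
    return {
        "max normal": (2 - 2.0 ** -man_bits) * 2.0 ** max_exp,
        "min normal": 2.0 ** min_exp,
        "epsilon": 2.0 ** -man_bits,       # grid spacing just above 1.0
    }

print("fp16 (1/6/9): ", format_stats(6, 9))    # wider range, slightly coarser mantissa
print("IEEE (1/5/10):", format_stats(5, 10))   # narrower range, finer mantissa

The wider exponent range is the usual motivation for such a format in training, where gradient magnitudes vary far more widely than weights or activations.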


Cited by
Patent•
19 Jan 2012
TL;DR: In this paper, the authors describe improved capabilities for a virtualization environment adapted for development and deployment of at least one software workload, the virtualization environment having a metamodel framework that allows a policy to be associated with the software workload at development time and applied upon deployment of that workload.
Abstract: In embodiments of the present invention, improved capabilities are described for a virtualization environment adapted for development and deployment of at least one software workload, the virtualization environment having a metamodel framework that allows a policy to be associated with the software workload at development time and applied upon deployment of the workload. This allows a developer to define a security zone and to apply at least one type of security policy with respect to that zone, including the type of security zone policy in the metamodel framework, such that the policy can be associated with the software workload during development. If a security zone policy is associated with the software workload, the policy is automatically applied to the workload when it is deployed within the security zone.

541 citations
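
The mechanism described above, attaching a policy to a workload's metadata at development time and enforcing it when the workload lands in a matching security zone, can be sketched roughly as follows. All names and the API shape are hypothetical illustrations; the patent does not define this interface.

from dataclasses import dataclass, field

@dataclass
class SecurityZonePolicy:
    zone: str                 # security zone the policy applies to
    rules: list               # illustrative policy content

@dataclass
class Workload:
    name: str
    policies: list = field(default_factory=list)   # associated at development time

def deploy(workload: Workload, zone: str) -> None:
    """Apply any policy associated with the workload that matches the target zone,
    then deploy the workload (deployment itself is elided)."""
    for policy in workload.policies:
        if policy.zone == zone:
            print(f"applying {policy.rules} to {workload.name} in zone {zone}")
    print(f"{workload.name} deployed to {zone}")

wl = Workload("billing-service",
              policies=[SecurityZonePolicy("dmz", ["no-inbound-ssh", "encrypt-at-rest"])])
deploy(wl, "dmz")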

Patent•
19 Jun 2009
TL;DR: In this article, the authors present a cloud gateway system, a cloud hypervisor system, and methods for implementing same, which extends the security, manageability, and quality of service membrane of a corporate enterprise network into cloud infrastructure provider networks, enabling cloud infrastructure to be interfaced as if it were on the enterprise network.
Abstract: Embodiments of the present invention provide a cloud gateway system, a cloud hypervisor system, and methods for implementing same. The cloud gateway system extends the security, manageability, and quality of service membrane of a corporate enterprise network into cloud infrastructure provider networks, enabling cloud infrastructure to be interfaced as if it were on the enterprise network. The cloud hypervisor system provides an interface to cloud infrastructure provider management systems and infrastructure instances that enables existing enterprise systems management tools to manage cloud infrastructure substantially the same as they manage local virtual machines via common server hypervisor APIs.

314 citations

Proceedings Article•
Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, Kailash Gopalakrishnan
19 Dec 2018
TL;DR: In this paper, the authors demonstrate the successful training of deep neural networks using 8-bit floating point numbers while fully maintaining the accuracy on a spectrum of deep learning models and datasets.
Abstract: The state-of-the-art hardware platforms for training deep neural networks are moving from traditional single precision (32-bit) computations towards 16 bits of precision, in large part due to the high energy efficiency and smaller bit storage associated with using reduced-precision representations. However, unlike inference, training with numbers represented with less than 16 bits has been challenging due to the need to maintain fidelity of the gradient computations during back-propagation. Here we demonstrate, for the first time, the successful training of deep neural networks using 8-bit floating point numbers while fully maintaining the accuracy on a spectrum of deep learning models and datasets. In addition to reducing the data and computation precision to 8 bits, we also successfully reduce the arithmetic precision for additions (used in partial product accumulation and weight updates) from 32 bits to 16 bits through the introduction of a number of key ideas including chunk-based accumulation and floating point stochastic rounding. The use of these novel techniques lays the foundation for a new generation of hardware training platforms with the potential for 2-4 times improved throughput over today's systems.

231 citations
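
The two key ideas named in the abstract above, chunk-based accumulation and stochastic rounding, are easy to demonstrate in NumPy. The sketch below uses float16 as a stand-in for the narrow accumulators discussed in the paper and a plain integer grid for the stochastic rounding; the chunk size and array sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
x = rng.random(65536)                    # uniform [0, 1) values; exact sum is ~32768

def fp16_running_sum(values):
    """Naive low-precision accumulation: once the accumulator is large, small
    addends fall below its rounding step and are swamped."""
    acc = np.float16(0.0)
    for v in values:
        acc = np.float16(acc + np.float16(v))
    return float(acc)

def fp16_chunked_sum(values, chunk=64):
    """Chunk-based accumulation: sum short chunks in float16, then add the chunk
    results, keeping accumulator and addend magnitudes comparable."""
    partial = [fp16_running_sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return fp16_running_sum(partial)

print("exact      :", float(np.sum(x)))          # ~32768
print("naive fp16 :", fp16_running_sum(x))       # stalls around 2048
print("chunked    :", fp16_chunked_sum(x))       # close to the exact value

def stochastic_round(v, step=1.0):
    """Round to a multiple of `step`, up or down with probability proportional to
    proximity, so the rounding is unbiased in expectation."""
    lower = np.floor(v / step) * step
    p_up = (v - lower) / step
    return lower + step * (rng.random(np.shape(v)) < p_up)

vals = np.full(10000, 0.3)
print("stochastic rounding mean:", stochastic_round(vals).mean())  # ~0.3, unbiased
print("nearest rounding mean:   ", np.round(vals).mean())          # 0.0, biased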

Journal Article•DOI•
TL;DR: An overview of the fundamentals of IMC is provided to better explain these challenges and then promising paths forward among the wide range of emerging research are identified.
Abstract: High-dimensionality matrix-vector multiplication (MVM) is a dominant kernel in signal-processing and machine-learning computations that are being deployed in a range of energy- and throughput-constrained applications. In-memory computing (IMC) exploits the structural alignment between a dense 2D array of bit cells and the dataflow in MVM, enabling opportunities to address computational energy and throughput. Recent prototypes have demonstrated the potential for 10x benefits in both metrics. However, fitting computation within an array of constrained bit-cell circuits imposes a number of challenges, including the need for analog computation, efficient interfacing with conventional digital accelerators (enabling the required programmability), and efficient virtualization of the hardware to map software. This article provides an overview of the fundamentals of IMC to better explain these challenges and then identifies promising paths forward among the wide range of emerging research.

189 citations
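
The structural alignment the abstract describes, and the analog and interfacing challenges it lists, can be illustrated with a toy numerical model of an IMC operation: weights sit in a 2D array, inputs drive the rows, column sums accumulate in the analog domain, and a coarse ADC digitizes the result. The array size, noise level, and ADC resolution below are illustrative assumptions, not figures from the article.

import numpy as np

rng = np.random.default_rng(0)
rows, cols = 64, 16
W = rng.choice([-1.0, 1.0], size=(rows, cols))     # binary weights stored in the bit-cell array
x = rng.choice([0.0, 1.0], size=rows)              # inputs applied as word-line pulses

column_sum = x @ W                                 # analog accumulation along each bit line
column_sum = column_sum + rng.normal(0.0, 0.5, size=cols)   # analog non-ideality (noise)

adc_bits = 4                                       # per-column ADC resolution
levels = 2 ** adc_bits
step = (2 * rows) / levels                         # full-scale range assumed to be [-rows, +rows]
readout = np.clip(np.round(column_sum / step), -levels // 2, levels // 2 - 1) * step

print("exact MVM  :", (x @ W)[:4])
print("IMC readout:", readout[:4])                 # quantized, noisy version of the same MVM

The gap between the two printouts stands in for the analog-computation and digital-interfacing challenges the article goes on to discuss.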

Posted Content•
TL;DR: Detailed characterizations of deep learning models used in many Facebook social network services are provided and the need for better co-design of algorithms, numerics and computing platforms to address the challenges of workloads often run in data centers is highlighted.
Abstract: The application of deep learning techniques has resulted in remarkable improvement of machine learning models. This paper provides detailed characterizations of deep learning models used in many Facebook social network services. We present the computational characteristics of our models, describe high-performance optimizations targeting existing systems, point out their limitations, and make suggestions for future general-purpose/accelerated inference hardware. We also highlight the need for better co-design of algorithms, numerics, and computing platforms to address the challenges of workloads often run in data centers.

155 citations