Author

Jintao Zhang

Bio: Jintao Zhang is an academic researcher from IBM. The author has contributed to research in topics: Data matrix (multivariate statistics) & Hardware acceleration. The author has an h-index of 2 and has co-authored 4 publications receiving 21 citations.

Papers
Proceedings ArticleDOI
14 Jun 2021
TL;DR: RaPiD, as mentioned in this paper, is a 4-core AI accelerator chip supporting a spectrum of precisions, namely 16- and 8-bit floating-point and 4- and 2-bit fixed-point.
Abstract: The growing prevalence and computational demands of Artificial Intelligence (AI) workloads have led to widespread use of hardware accelerators in their execution. Scaling the performance of AI accelerators across generations is pivotal to their success in commercial deployments. The intrinsic error-resilient nature of AI workloads presents a unique opportunity for performance/energy improvement through precision scaling. Motivated by recent algorithmic advances in precision scaling for inference and training, we designed RaPiD, a 4-core AI accelerator chip supporting a spectrum of precisions, namely 16- and 8-bit floating-point and 4- and 2-bit fixed-point. The 36 mm² RaPiD chip, fabricated in 7nm EUV technology, delivers a peak 3.5 TFLOPS/W in HFP8 mode and 16.5 TOPS/W in INT4 mode at nominal voltage. Using a performance model calibrated to within 1% of the measurement results, we evaluated DNN inference using 4-bit fixed-point representation for a 4-core RaPiD chip system and DNN training using 8-bit floating-point representation for a 768-TFLOPS AI system comprising four 32-core RaPiD chips. Our results show INT4 inference for a batch size of 1 achieves 3-13.5 (average 7) TOPS/W, and FP8 training for a mini-batch of 512 achieves a sustained 102-588 (average 203) TFLOPS across a wide range of applications.

42 citations
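
The chip's efficiency numbers above hinge on running inference in reduced precision. As a rough illustration of what 4-bit fixed-point (INT4) quantization involves, here is a minimal symmetric-quantization sketch in NumPy; it is a generic scheme for illustration only, not the quantizer actually used on RaPiD.

```python
import numpy as np

def quantize_int4_symmetric(x):
    """Symmetric per-tensor quantization to the signed 4-bit range [-8, 7]."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)  # values fit in 4 bits
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Quantize a small weight tensor and measure the error the 4-bit format introduces.
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int4_symmetric(w)
print("max abs error:", np.max(np.abs(w - dequantize(q, s))))
```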

Journal ArticleDOI
10 Nov 2020
TL;DR: RaPiD, a multi-tera operations per second (TOPS) AI hardware accelerator core built from the ground up using AxC techniques across the stack, including algorithms, architecture, programmability, and hardware, is presented.
Abstract: Advances in deep neural networks (DNNs) and the availability of massive real-world data have enabled superhuman levels of accuracy on many AI tasks and ushered in the explosive growth of AI workloads across the spectrum of computing devices. However, their superior accuracy comes at a high computational cost, which necessitates approaches beyond traditional computing paradigms to improve their operational efficiency. Leveraging the application-level insight of error resilience, we demonstrate how approximate computing (AxC) can significantly boost the efficiency of AI platforms and play a pivotal role in the broader adoption of AI-based applications and services. To this end, we present RaPiD, a multi-tera operations per second (TOPS) AI hardware accelerator core (fabricated in 14-nm technology) that we built from the ground up using AxC techniques across the stack, including algorithms, architecture, programmability, and hardware. We highlight the workload-guided systematic explorations of AxC techniques for AI, including custom number representations, quantization/pruning methodologies, mixed-precision architecture design, instruction sets, and compiler technologies with quality programmability, employed in the RaPiD accelerator.

32 citations
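
One of the AxC techniques the abstract lists is quantization/pruning. The short sketch below shows plain magnitude-based weight pruning, a common baseline; it stands in for, and is not, the workload-guided methodology described in the paper.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights until the target sparsity is reached."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    flat = np.abs(weights).ravel()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    return np.where(np.abs(weights) > threshold, weights, 0.0)

w = np.random.randn(64, 64)
print("fraction pruned:", np.mean(magnitude_prune(w, 0.75) == 0.0))
```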

Journal ArticleDOI
TL;DR: A significant first step towards this goal is taken: an end-to-end software stack for the RaPiD AI accelerator developed by IBM Research is presented, together with a set of software extensions, called DeepTools, that leverage and work within popular deep learning frameworks.
Abstract: The ubiquitous adoption of systems specialized for AI requires bridging two seemingly conflicting challenges: the need to deliver extreme processing efficiencies while employing familiar programming interfaces, making them compelling even for non-expert users. We take a significant first step towards this goal and present an end-to-end software stack for the RaPiD AI accelerator developed by IBM Research. We present a set of software extensions, called DeepTools, that leverage and work within popular deep learning frameworks. DeepTools requires no additional user input and enables aggressive, accelerator-specific performance optimization akin to a full, custom framework. DeepTools has two key components: 1) a compiler runtime called DeepRT, which automatically identifies how best to execute a given DNN graph on RaPiD and constructs the requisite program binaries; and 2) an execution runtime called RaPiDLib, which triggers and manages the execution of compute and data-transfer operations on RaPiD. We integrate DeepTools with TensorFlow and map popular DNNs (AlexNet, VGG, ResNet, LSTM) to RaPiD. We demonstrate substantial improvement in performance over hand-tuned mappings.

21 citations
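
To give a feel for one decision a compiler runtime such as DeepRT has to make, the sketch below partitions a toy DNN graph between an accelerator and a CPU fallback. The supported-op set, graph format, and plan layout are hypothetical and are not the DeepRT or RaPiDLib interfaces.

```python
# Hypothetical illustration only: op names and plan format are invented here.
ACCELERATOR_OPS = {"conv2d", "matmul", "relu", "lstm_cell"}  # assumed supported set

def build_execution_plan(graph):
    """graph: ordered list of (node_name, op_type) pairs describing a DNN."""
    plan = []
    for name, op_type in graph:
        target = "accelerator" if op_type in ACCELERATOR_OPS else "cpu"
        plan.append({"op": name, "type": op_type, "target": target})
    return plan

dnn = [("conv1", "conv2d"), ("act1", "relu"), ("top1", "arg_max")]
for step in build_execution_plan(dnn):
    print(step)
```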

Patent
11 Apr 2019
TL;DR: In this article, a computer-implemented method for matrix multiplication on a systolic array is described, which can include populating, by a system operatively coupled to a processor, respective first registers of one or more processing elements of a systolic array structure with respective input data bits of a first data matrix.
Abstract: Techniques facilitating matrix multiplication on a systolic array are provided. A computer-implemented method can comprise populating, by a system operatively coupled to a processor, respective first registers of one or more processing elements of a systolic array structure with respective input data bits of a first data matrix. The one or more processing elements can comprise a first processing element that comprises a first input data bit of the first data matrix and a first activation bit of a second data matrix. The method can also include determining, by the system, at the first processing element, a first partial sum of a third data matrix. Further, the method can include streaming, by the system, the first partial sum of the third data matrix from the first processing element.

2 citations
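
The patent describes accumulating and streaming partial sums inside a systolic array. The following toy simulation mimics an output-stationary systolic matrix multiply, where each processing element holds one partial-sum register; it ignores cycle-accurate data skewing and is not the patented mechanism itself.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy output-stationary simulation: PE (i, j) keeps a running partial sum of
    C[i, j], updating it once per step as operands arrive; the accumulated partial
    sums are read out (streamed) at the end."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    partial = np.zeros((n, m))          # one accumulator register per PE
    for step in range(k):               # one reduction step per cycle (skew omitted)
        for i in range(n):
            for j in range(m):
                partial[i, j] += A[i, step] * B[step, j]
    return partial

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(12, dtype=float).reshape(3, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```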

Proceedings ArticleDOI
18 Jul 2022
TL;DR: A neural network model named ARWAT is proposed, which operates on the reconstructed tree to learn informative contextual words and grammatical information effectively; extensive experiments demonstrate the superior performance of the proposed model against multiple baselines on five benchmark datasets.
Abstract: Aspect-Based Sentiment Analysis aims to determine the emotional orientation of a particular aspect of an online comment. Since syntax, and even the sentence itself, forms a specific graph structure, most recent work focuses on the dependency tree: sentences are parsed into tree structures by a dependency parser. However, because a sentence can contain many aspects, it is difficult to correctly associate each aspect with the corresponding important part of the sentence by directly using the original dependency tree. To strengthen this association, a reconstructed dependency tree with the aspect as the root is built by adjusting the original dependency tree for each aspect. We propose a neural network model named ARWAT, which operates on the reconstructed tree to learn informative contextual words and grammatical information effectively. Extensive experimental results demonstrate the superior performance of our proposed model against multiple baselines on five benchmark datasets.
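
The core idea, re-rooting the dependency tree at the aspect term, can be sketched as a simple breadth-first re-parenting over the undirected tree. The example below is only a simplified view; the paper's actual reconstruction (and any pruning or re-weighting it applies) may differ.

```python
from collections import defaultdict, deque

def reroot_at_aspect(edges, aspect):
    """Treat the dependency tree as undirected and rebuild parent pointers so
    that the aspect token becomes the root."""
    adj = defaultdict(list)
    for head, dep in edges:
        adj[head].append(dep)
        adj[dep].append(head)
    parent, seen, queue = {aspect: None}, {aspect}, deque([aspect])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                parent[nxt] = node
                queue.append(nxt)
    return parent

# "The battery life is great": head -> dependent arcs from a dependency parser.
edges = [("great", "life"), ("life", "The"), ("life", "battery"), ("great", "is")]
print(reroot_at_aspect(edges, aspect="battery"))
```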

Cited by
Proceedings ArticleDOI
02 Nov 2020
TL;DR: This paper constructs an extremely flexible map-space and shows that GAMMA can explore the space and determine an optimized mapping with high sample efficiency; it quantitatively compares GAMMA with many popular optimization methods and observes that GAMMA consistently finds better solutions.
Abstract: DNN layers are multi-dimensional loops that can be ordered, tiled, and scheduled in myriad ways across space and time on DNN accelerators. Each of these choices is called a mapping. It has been shown that the mapping plays an extremely crucial role in overall performance and efficiency, as it directly determines the amount of reuse that the accelerator can leverage from the DNN. Moreover, instead of using a fixed mapping for every DNN layer, research has revealed the benefit of optimizing per-layer mappings. However, determining the right mapping, given an accelerator and a layer, is still an open question. The immense space of mappings (or map-space) makes brute-force exhaustive search impractical. In this paper, we propose a domain-specific genetic-algorithm-based method, GAMMA, which is specially designed for this HW-mapping problem. In contrast to prior works that either target simple rigid accelerators with a limited map-space or choose from a restricted set of mappings, we construct an extremely flexible map-space and show that GAMMA can explore the space and determine an optimized mapping with high sample efficiency. We quantitatively compare GAMMA with many popular optimization methods and observe that GAMMA consistently finds better solutions.

77 citations
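
As a small illustration of genetic search over a mapping space, the sketch below evolves tile sizes for a single loop nest against a made-up cost function. The operators and cost model are stand-ins and do not reflect GAMMA's actual map-space encoding or evaluator.

```python
import random

TILE_CHOICES = [1, 2, 4, 8, 16, 32]

def cost(mapping):
    # Pretend cost: penalize tilings that overflow a 64-entry buffer or underuse it.
    tm, tn, tk = mapping
    footprint = tm * tk + tk * tn + tm * tn
    return abs(64 - footprint) + (1000 if footprint > 64 else 0)

def mutate(mapping):
    m = list(mapping)
    m[random.randrange(3)] = random.choice(TILE_CHOICES)
    return tuple(m)

def genetic_search(pop_size=20, generations=50):
    pop = [tuple(random.choice(TILE_CHOICES) for _ in range(3)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]                       # selection
        children = [mutate(random.choice(survivors)) for _ in survivors]
        pop = survivors + children
    return min(pop, key=cost)

print(genetic_search())
```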

Journal ArticleDOI
TL;DR: This article provides a comprehensive survey and analysis of hardware approximation techniques for DNN accelerators and presents how Approximate Computing for DNN accelerators can go beyond energy efficiency and address reliability and security issues as well.
Abstract: Deep Neural Networks (DNNs) are very popular because of their high performance in various cognitive tasks in Machine Learning (ML). Recent advancements in DNNs have pushed accuracy beyond human levels in many tasks, but at the cost of high computational complexity. To enable efficient execution of DNN inference, more and more research works are therefore exploiting the inherent error resilience of DNNs and employing Approximate Computing (AC) principles to address the elevated energy demands of DNN accelerators. This article provides a comprehensive survey and analysis of hardware approximation techniques for DNN accelerators. First, we analyze the state of the art and, by identifying approximation families, cluster the respective works with respect to the approximation type. Next, we analyze the complexity of the performed evaluations (with respect to the dataset and DNN size) to assess the efficiency, potential, and limitations of approximate DNN accelerators. Moreover, a broad discussion is provided regarding error metrics that are more suitable for designing approximate units for DNN accelerators, as well as accuracy recovery approaches that are tailored to DNN inference. Finally, we present how Approximate Computing for DNN accelerators can go beyond energy efficiency and address reliability and security issues as well.

25 citations
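
A typical question in this literature is how to quantify the error of an approximate arithmetic unit. The sketch below computes the mean error distance (MED) of a crude approximate multiplier that truncates the low-order bits of one operand; both the multiplier and the choice of metric are illustrative examples, not results from the survey.

```python
def truncated_mul(a, b, dropped_bits=4):
    """Approximate 8-bit multiply that zeroes the low bits of one operand."""
    return a * (b & ~((1 << dropped_bits) - 1))

def mean_error_distance(dropped_bits=4):
    """Average absolute error over all 8-bit unsigned operand pairs."""
    total = 0
    for a in range(256):
        for b in range(256):
            total += abs(a * b - truncated_mul(a, b, dropped_bits))
    return total / (256 * 256)

print("MED for 4 dropped bits:", mean_error_distance())
```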

Journal ArticleDOI
TL;DR: In this paper, the authors classified neural architecture search (NAS) methods into three major classes: single-objective NAS, hardware-aware NAS, and NAS with hardware co-optimization.
Abstract: Deep neural networks (DNNs) are now dominant in the most challenging applications of machine learning. As DNNs can have complex architectures with millions of trainable parameters (the so-called weights), their design and training are difficult even for highly qualified experts. In order to reduce human effort, neural architecture search (NAS) methods have been developed to automate the entire design process. NAS methods typically combine searching in the space of candidate architectures with optimizing (learning) the weights using a gradient method. In this paper, we survey the key elements of NAS methods that, to various extents, consider hardware implementation of the resulting DNNs. We classify these methods into three major classes: single-objective NAS (no hardware is considered), hardware-aware NAS (the DNN is optimized for a particular hardware platform), and NAS with hardware co-optimization (the hardware is directly co-optimized with the DNN as a part of NAS). Compared to previous surveys, we emphasize the multi-objective design approach that must be adopted in NAS and focus on co-design algorithms developed for concurrent optimization of DNN architectures and hardware platforms. As most research in this area deals with NAS for image classification using convolutional neural networks, we follow this trajectory in our paper. After reading the paper, the reader should understand why and how NAS and hardware co-optimization are currently used to build cutting-edge implementations of DNNs.

16 citations
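
Hardware-aware NAS ultimately reduces to multi-objective selection among candidate networks. The sketch below shows one such step, filtering candidates to the Pareto front under (accuracy, latency); the candidate names and numbers are invented for illustration.

```python
def pareto_front(candidates):
    """candidates: list of (name, accuracy, latency_ms); higher accuracy and
    lower latency are better. Keep only non-dominated candidates."""
    front = []
    for name, acc, lat in candidates:
        dominated = any(a >= acc and l <= lat and (a > acc or l < lat)
                        for _, a, l in candidates)
        if not dominated:
            front.append((name, acc, lat))
    return front

cands = [("net_a", 0.76, 12.0), ("net_b", 0.74, 8.0), ("net_c", 0.73, 15.0)]
print(pareto_front(cands))  # net_c is dominated by net_a and is dropped
```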