Author

Jintao Zhang

Bio: Jintao Zhang is an academic researcher from IBM. The author has contributed to research in topics: Data matrix (multivariate statistics) & Hardware acceleration. The author has an h-index of 2 and has co-authored 4 publications receiving 21 citations.

Papers
Proceedings ArticleDOI
14 Jun 2021
TL;DR: RaPiD, as mentioned in this paper, is a 4-core AI accelerator chip supporting a spectrum of precisions, namely 16- and 8-bit floating-point and 4- and 2-bit fixed-point.
Abstract: The growing prevalence and computational demands of Artificial Intelligence (AI) workloads have led to widespread use of hardware accelerators in their execution. Scaling the performance of AI accelerators across generations is pivotal to their success in commercial deployments. The intrinsic error-resilient nature of AI workloads presents a unique opportunity for performance/energy improvement through precision scaling. Motivated by recent algorithmic advances in precision scaling for inference and training, we designed RaPiD, a 4-core AI accelerator chip supporting a spectrum of precisions, namely 16- and 8-bit floating-point and 4- and 2-bit fixed-point. The 36 mm² RaPiD chip, fabricated in 7nm EUV technology, delivers a peak 3.5 TFLOPS/W in HFP8 mode and 16.5 TOPS/W in INT4 mode at nominal voltage. Using a performance model calibrated to within 1% of the measurement results, we evaluated DNN inference using 4-bit fixed-point representation for a 4-core RaPiD chip system and DNN training using 8-bit floating-point representation for a 768-TFLOPS AI system comprising four 32-core RaPiD chips. Our results show INT4 inference for a batch size of 1 achieves 3-13.5 (average 7) TOPS/W, and FP8 training for a mini-batch of 512 achieves a sustained 102-588 (average 203) TFLOPS across a wide range of applications.

42 citations
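
The chip's efficiency numbers above hinge on running inference in reduced precision. As a rough illustration of what 4-bit fixed-point (INT4) quantization involves, here is a minimal symmetric-quantization sketch in NumPy; it is a generic scheme for illustration only, not the quantizer actually used on RaPiD.

```python
import numpy as np

def quantize_int4_symmetric(x):
    """Symmetric per-tensor quantization to the signed 4-bit range [-8, 7]."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)  # values fit in 4 bits
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Quantize a small weight tensor and measure the error the 4-bit format introduces.
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int4_symmetric(w)
print("max abs error:", np.max(np.abs(w - dequantize(q, s))))
```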

Journal ArticleDOI
10 Nov 2020
TL;DR: RaPiD, a multi-tera operations per second (TOPS) AI hardware accelerator core built from the ground up using AxC techniques across the stack, including algorithms, architecture, programmability, and hardware, is presented.
Abstract: Advances in deep neural networks (DNNs) and the availability of massive real-world data have enabled superhuman levels of accuracy on many AI tasks and ushered in the explosive growth of AI workloads across the spectrum of computing devices. However, their superior accuracy comes at a high computational cost, which necessitates approaches beyond traditional computing paradigms to improve their operational efficiency. Leveraging the application-level insight of error resilience, we demonstrate how approximate computing (AxC) can significantly boost the efficiency of AI platforms and play a pivotal role in the broader adoption of AI-based applications and services. To this end, we present RaPiD, a multi-tera operations per second (TOPS) AI hardware accelerator core (fabricated in 14-nm technology) that we built from the ground up using AxC techniques across the stack, including algorithms, architecture, programmability, and hardware. We highlight the workload-guided systematic explorations of AxC techniques for AI, including custom number representations, quantization/pruning methodologies, mixed-precision architecture design, instruction sets, and compiler technologies with quality programmability, employed in the RaPiD accelerator.

32 citations
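
One of the AxC techniques the abstract lists is quantization/pruning. The short sketch below shows plain magnitude-based weight pruning, a common baseline; it stands in for, and is not, the workload-guided methodology described in the paper.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights until the target sparsity is reached."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    flat = np.abs(weights).ravel()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    return np.where(np.abs(weights) > threshold, weights, 0.0)

w = np.random.randn(64, 64)
print("fraction pruned:", np.mean(magnitude_prune(w, 0.75) == 0.0))
```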

Journal ArticleDOI
TL;DR: A significant first step towards this goal is taken: an end-to-end software stack for the RaPiD AI accelerator developed by IBM Research is presented, together with a set of software extensions, called DeepTools, that leverage and work within popular deep learning frameworks.
Abstract: The ubiquitous adoption of systems specialized for AI requires bridging two seemingly conflicting challenges: the need to deliver extreme processing efficiencies while employing familiar programming interfaces, making them compelling even for non-expert users. We take a significant first step towards this goal and present an end-to-end software stack for the RaPiD AI accelerator developed by IBM Research. We present a set of software extensions, called DeepTools, that leverage and work within popular deep learning frameworks. DeepTools requires no additional user input and enables aggressive, accelerator-specific performance optimization akin to a full, custom framework. DeepTools has two key components: 1) a compiler runtime called DeepRT, which automatically identifies how best to execute a given DNN graph on RaPiD and constructs the requisite program binaries; and 2) an execution runtime called RaPiDLib, which triggers and manages the execution of compute and data-transfer operations on RaPiD. We integrate DeepTools with TensorFlow and map popular DNNs (AlexNet, VGG, ResNet, LSTM) to RaPiD. We demonstrate substantial improvement in performance over hand-tuned mappings.

21 citations
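
To give a feel for one decision a compiler runtime such as DeepRT has to make, the sketch below partitions a toy DNN graph between an accelerator and a CPU fallback. The supported-op set, graph format, and plan layout are hypothetical and are not the DeepRT or RaPiDLib interfaces.

```python
# Hypothetical illustration only: op names and plan format are invented here.
ACCELERATOR_OPS = {"conv2d", "matmul", "relu", "lstm_cell"}  # assumed supported set

def build_execution_plan(graph):
    """graph: ordered list of (node_name, op_type) pairs describing a DNN."""
    plan = []
    for name, op_type in graph:
        target = "accelerator" if op_type in ACCELERATOR_OPS else "cpu"
        plan.append({"op": name, "type": op_type, "target": target})
    return plan

dnn = [("conv1", "conv2d"), ("act1", "relu"), ("top1", "arg_max")]
for step in build_execution_plan(dnn):
    print(step)
```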

Patent
11 Apr 2019
TL;DR: In this article, a computer-implemented method for matrix multiplication on a systolic array is described, which can include populating, by a system operatively coupled to a processor, respective first registers of one or more processing elements of a systolic array structure with respective input data bits of a first data matrix.
Abstract: Techniques facilitating matrix multiplication on a systolic array are provided. A computer-implemented method can comprise populating, by a system operatively coupled to a processor, respective first registers of one or more processing elements of a systolic array structure with respective input data bits of a first data matrix. The one or more processing elements can comprise a first processing element that comprises a first input data bit of the first data matrix and a first activation bit of a second data matrix. The method can also include determining, by the system, at the first processing element, a first partial sum of a third data matrix. Further, the method can include streaming, by the system, the first partial sum of the third data matrix from the first processing element.

2 citations
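
The patent describes accumulating and streaming partial sums inside a systolic array. The following toy simulation mimics an output-stationary systolic matrix multiply, where each processing element holds one partial-sum register; it ignores cycle-accurate data skewing and is not the patented mechanism itself.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy output-stationary simulation: PE (i, j) keeps a running partial sum of
    C[i, j], updating it once per step as operands arrive; the accumulated partial
    sums are read out (streamed) at the end."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    partial = np.zeros((n, m))          # one accumulator register per PE
    for step in range(k):               # one reduction step per cycle (skew omitted)
        for i in range(n):
            for j in range(m):
                partial[i, j] += A[i, step] * B[step, j]
    return partial

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(12, dtype=float).reshape(3, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```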

Proceedings ArticleDOI
18 Jul 2022
TL;DR: A neural network model named ARWAT is proposed, which operates on the reconstructed tree to learn informative contextual words and grammatical information effectively; extensive experiments demonstrate the superior performance of the proposed model against multiple baselines on five benchmark datasets.
Abstract: Aspect-Based Sentiment Analysis aims to determine the emotional orientation of a particular aspect of an online comment. Since syntax, and even the sentence itself, forms a specific graph structure, most recent work focuses on the dependency tree: sentences are parsed into tree structures by a dependency parser. However, because a sentence can contain many aspects, it is difficult to correctly associate each aspect with the corresponding important part of the sentence by directly using the original dependency tree. To strengthen this association, a reconstructed dependency tree with the aspect as the root is built by adjusting the original dependency tree for each aspect. We propose a neural network model named ARWAT, which operates on the reconstructed tree to learn informative contextual words and grammatical information effectively. Extensive experimental results demonstrate the superior performance of our proposed model against multiple baselines on five benchmark datasets.
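
The core idea, re-rooting the dependency tree at the aspect term, can be sketched as a simple breadth-first re-parenting over the undirected tree. The example below is only a simplified view; the paper's actual reconstruction (and any pruning or re-weighting it applies) may differ.

```python
from collections import defaultdict, deque

def reroot_at_aspect(edges, aspect):
    """Treat the dependency tree as undirected and rebuild parent pointers so
    that the aspect token becomes the root."""
    adj = defaultdict(list)
    for head, dep in edges:
        adj[head].append(dep)
        adj[dep].append(head)
    parent, seen, queue = {aspect: None}, {aspect}, deque([aspect])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                parent[nxt] = node
                queue.append(nxt)
    return parent

# "The battery life is great": head -> dependent arcs from a dependency parser.
edges = [("great", "life"), ("life", "The"), ("life", "battery"), ("great", "is")]
print(reroot_at_aspect(edges, aspect="battery"))
```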

Cited by
Proceedings ArticleDOI
02 Nov 2020
TL;DR: This paper constructs an extremely flexible map-space and shows that GAMMA can explore the space and determine an optimized mapping with high sample efficiency; it quantitatively compares GAMMA with many popular optimization methods and observes that GAMMA consistently finds better solutions.
Abstract: DNN layers are multi-dimensional loops that can be ordered, tiled, and scheduled in myriad ways across space and time on DNN accelerators. Each of these choices is called a mapping. It has been shown that the mapping plays an extremely crucial role in overall performance and efficiency, as it directly determines the amount of reuse that the accelerator can leverage from the DNN. Moreover, instead of using a fixed mapping for every DNN layer, research has revealed the benefit of optimizing per-layer mappings. However, determining the right mapping, given an accelerator and a layer, is still an open question. The immense space of mappings (or map-space) makes brute-force exhaustive search impractical. In this paper, we propose a domain-specific genetic-algorithm-based method, GAMMA, which is specially designed for this HW-mapping problem. In contrast to prior works that either target simple rigid accelerators with a limited map-space or choose from a restricted set of mappings, we construct an extremely flexible map-space and show that GAMMA can explore the space and determine an optimized mapping with high sample efficiency. We quantitatively compare GAMMA with many popular optimization methods and observe that GAMMA consistently finds better solutions.

77 citations
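
As a small illustration of genetic search over a mapping space, the sketch below evolves tile sizes for a single loop nest against a made-up cost function. The operators and cost model are stand-ins and do not reflect GAMMA's actual map-space encoding or evaluator.

```python
import random

TILE_CHOICES = [1, 2, 4, 8, 16, 32]

def cost(mapping):
    # Pretend cost: penalize tilings that overflow a 64-entry buffer or underuse it.
    tm, tn, tk = mapping
    footprint = tm * tk + tk * tn + tm * tn
    return abs(64 - footprint) + (1000 if footprint > 64 else 0)

def mutate(mapping):
    m = list(mapping)
    m[random.randrange(3)] = random.choice(TILE_CHOICES)
    return tuple(m)

def genetic_search(pop_size=20, generations=50):
    pop = [tuple(random.choice(TILE_CHOICES) for _ in range(3)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]                       # selection
        children = [mutate(random.choice(survivors)) for _ in survivors]
        pop = survivors + children
    return min(pop, key=cost)

print(genetic_search())
```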

Journal ArticleDOI
TL;DR: This article provides a comprehensive survey and analysis of hardware approximation techniques for DNN accelerators and presents how Approximate Computing for DNN accelerators can go beyond energy efficiency and address reliability and security issues as well.
Abstract: Deep Neural Networks (DNNs) are very popular because of their high performance in various cognitive tasks in Machine Learning (ML). Recent advancements in DNNs have pushed accuracy beyond human levels in many tasks, but at the cost of high computational complexity. To enable efficient execution of DNN inference, more and more research works are therefore exploiting the inherent error resilience of DNNs and employing Approximate Computing (AC) principles to address the elevated energy demands of DNN accelerators. This article provides a comprehensive survey and analysis of hardware approximation techniques for DNN accelerators. First, we analyze the state of the art and, by identifying approximation families, cluster the respective works with respect to the approximation type. Next, we analyze the complexity of the performed evaluations (with respect to the dataset and DNN size) to assess the efficiency, potential, and limitations of approximate DNN accelerators. Moreover, a broad discussion is provided regarding error metrics that are more suitable for designing approximate units for DNN accelerators, as well as accuracy recovery approaches that are tailored to DNN inference. Finally, we present how Approximate Computing for DNN accelerators can go beyond energy efficiency and address reliability and security issues as well.

25 citations
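
A typical question in this literature is how to quantify the error of an approximate arithmetic unit. The sketch below computes the mean error distance (MED) of a crude approximate multiplier that truncates the low-order bits of one operand; both the multiplier and the choice of metric are illustrative examples, not results from the survey.

```python
def truncated_mul(a, b, dropped_bits=4):
    """Approximate 8-bit multiply that zeroes the low bits of one operand."""
    return a * (b & ~((1 << dropped_bits) - 1))

def mean_error_distance(dropped_bits=4):
    """Average absolute error over all 8-bit unsigned operand pairs."""
    total = 0
    for a in range(256):
        for b in range(256):
            total += abs(a * b - truncated_mul(a, b, dropped_bits))
    return total / (256 * 256)

print("MED for 4 dropped bits:", mean_error_distance())
```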

Journal ArticleDOI
TL;DR: In this paper, the authors classified neural architecture search (NAS) methods into three major classes: single-objective NAS, hardware-aware NAS, and NAS with hardware co-optimization.
Abstract: Deep neural networks (DNNs) are now dominant in the most challenging applications of machine learning. As DNNs can have complex architectures with millions of trainable parameters (the so-called weights), their design and training are difficult even for highly qualified experts. In order to reduce human effort, neural architecture search (NAS) methods have been developed to automate the entire design process. NAS methods typically combine searching in the space of candidate architectures with optimizing (learning) the weights using a gradient method. In this paper, we survey the key elements of NAS methods that, to various extents, consider hardware implementation of the resulting DNNs. We classify these methods into three major classes: single-objective NAS (no hardware is considered), hardware-aware NAS (the DNN is optimized for a particular hardware platform), and NAS with hardware co-optimization (the hardware is directly co-optimized with the DNN as a part of NAS). Compared to previous surveys, we emphasize the multi-objective design approach that must be adopted in NAS and focus on co-design algorithms developed for concurrent optimization of DNN architectures and hardware platforms. As most research in this area deals with NAS for image classification using convolutional neural networks, we follow this trajectory in our paper. After reading the paper, the reader should understand why and how NAS and hardware co-optimization are currently used to build cutting-edge implementations of DNNs.

16 citations
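
Hardware-aware NAS ultimately reduces to multi-objective selection among candidate networks. The sketch below shows one such step, filtering candidates to the Pareto front under (accuracy, latency); the candidate names and numbers are invented for illustration.

```python
def pareto_front(candidates):
    """candidates: list of (name, accuracy, latency_ms); higher accuracy and
    lower latency are better. Keep only non-dominated candidates."""
    front = []
    for name, acc, lat in candidates:
        dominated = any(a >= acc and l <= lat and (a > acc or l < lat)
                        for _, a, l in candidates)
        if not dominated:
            front.append((name, acc, lat))
    return front

cands = [("net_a", 0.76, 12.0), ("net_b", 0.74, 8.0), ("net_c", 0.73, 15.0)]
print(pareto_front(cands))  # net_c is dominated by net_a and is dropped
```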