Author

Jaeha Kung

Bio: Jaeha Kung is an academic researcher from Daegu Gyeongbuk Institute of Science and Technology. The author has contributed to research in the topics of computer science and artificial neural networks. The author has an h-index of 12 and has co-authored 34 publications receiving 585 citations. Previous affiliations of Jaeha Kung include Georgia Institute of Technology and Pohang University of Science and Technology.

Papers
Journal ArticleDOI
18 Jun 2016
TL;DR: The basic architecture of the Neurocube is presented along with an analysis of the logic tier synthesized in 28nm and 15nm process technologies, and its performance is evaluated by mapping a Convolutional Neural Network and estimating the resulting power and performance for both training and inference.
Abstract: This paper presents a programmable and scalable digital neuromorphic architecture based on 3D high-density memory integrated with a logic tier for efficient neural computing. The proposed architecture consists of clusters of processing engines, connected by a 2D mesh network as a processing tier, which is integrated in 3D with multiple tiers of DRAM. The PE clusters access multiple memory channels (vaults) in parallel. The operating principle, referred to as memory-centric computing, embeds specialized state machines within the vault controllers of the HMC to drive data into the PE clusters. The paper presents the basic architecture of the Neurocube and an analysis of the logic tier synthesized in 28nm and 15nm process technologies. The performance of the Neurocube is evaluated and illustrated through the mapping of a Convolutional Neural Network and estimating the subsequent power and performance for both training and inference.
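As an illustration of the memory-centric mapping idea described above, the following Python sketch (not from the paper) partitions a CNN layer's output rows across a grid of PE clusters, one per memory vault, and reports the resulting MAC workload per cluster; the grid size, layer shape, and even row-wise split are assumptions made only for illustration.

```python
# Illustrative sketch (not the authors' code): partition a CNN layer's output
# feature map across a grid of PE clusters, one per HMC vault, to mimic a
# Neurocube-style memory-centric mapping. Grid size, layer shape, and the
# row-wise split are assumptions for illustration only.
import numpy as np

def map_conv_layer_to_clusters(out_h, out_w, out_c, k, in_c, grid=(4, 4)):
    """Split output rows across PE clusters and return MACs per cluster."""
    n_clusters = grid[0] * grid[1]
    rows = np.array_split(np.arange(out_h), n_clusters)  # row-wise partition
    macs_per_output = k * k * in_c                        # MACs per output pixel/channel
    return [len(r) * out_w * out_c * macs_per_output for r in rows]

# Example: a 55x55x96 conv layer with 11x11 kernels over 3 input channels.
loads = map_conv_layer_to_clusters(55, 55, 96, k=11, in_c=3)
print(f"{len(loads)} clusters, MACs per cluster: min={min(loads)}, max={max(loads)}")
```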

415 citations

Proceedings ArticleDOI
14 May 2017
TL;DR: It is shown that batch normalization on input sequences can help speed up low-precision as well as high-precision training, and that the piecewise linear activation function with stochastic rounding can achieve training results comparable to floating-point precision.
Abstract: Training of neural networks can be accelerated by limited numerical precision together with specialized low-precision hardware. This paper studies how low precision impacts the entire training of RNNs. We emulate low-precision training for the recently proposed gated recurrent unit (GRU) and use dynamic fixed point as the target numeric format. We first show that batch normalization on input sequences can help speed up training with low precision as well as with high precision. We also show that the overflow rate should be carefully controlled for dynamic fixed point. We study low-precision training with various rounding options, including bit truncation, round to nearest, and stochastic rounding. Stochastic rounding shows superior results to the other options. The effect of fully low-precision training is also analyzed by comparing it with partially low-precision training. We show that the piecewise linear activation function with stochastic rounding can achieve training results comparable to floating-point precision. A low-precision multiplier and accumulator (MAC) with a linear-feedback shift register (LFSR) is implemented with a 28nm Synopsys PDK for energy and performance analysis. Implementation results show the low-precision hardware is 4.7x faster, and its energy per task is up to 4.55x lower, than that of floating-point hardware.
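The sketch below illustrates the kind of dynamic fixed-point quantization with stochastic rounding studied in the paper; the 16-bit word length, 8 fractional bits, and saturate-on-overflow behavior are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of dynamic fixed-point quantization with stochastic rounding.
# Word length, fractional bits, and saturation behavior are illustrative choices.
import numpy as np

def to_dynamic_fixed_point(x, word_bits=16, frac_bits=8, rng=np.random.default_rng(0)):
    """Quantize x to signed fixed point with stochastic rounding."""
    scale = 2.0 ** frac_bits
    scaled = x * scale
    # Stochastic rounding: round up with probability equal to the fractional part.
    floor = np.floor(scaled)
    rounded = floor + (rng.random(x.shape) < (scaled - floor))
    # Saturate to the representable signed range (overflow handling).
    max_q = 2 ** (word_bits - 1) - 1
    return np.clip(rounded, -max_q - 1, max_q) / scale

w = np.random.default_rng(1).normal(size=(4, 4)).astype(np.float32)
print(to_dynamic_fixed_point(w))
```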

39 citations

Proceedings ArticleDOI
22 Jul 2015
TL;DR: A power-aware digital feedforward neural network platform that utilizes the backpropagation algorithm during training to enable an energy-quality trade-off, reducing system power by ~38% with 0.4% lower recognition accuracy in a classification problem.
Abstract: This paper proposes a power-aware digital feedforward neural network platform that utilizes the backpropagation algorithm during training to enable an energy-quality trade-off. Given a quality constraint, the proposed approach identifies a set of synaptic weights for approximation in a neural network. The approach selects synapses with small impact on output error, estimated by the backpropagation algorithm, for approximation. The approximations are achieved by a coupled software (reduced bit-width) and hardware (approximate multiplication in the processing engine) design approach. The full-chip design in 130nm CMOS shows that, compared to a baseline accurate design, the proposed approach reduces system power by ∼38% with 0.4% lower recognition accuracy in a classification problem.
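A minimal sketch of the selection step described above, assuming the backpropagated gradient magnitude is used as the error-sensitivity estimate; the 8-bit quantizer and 50% selection ratio are hypothetical values chosen only to make the example runnable.

```python
# Sketch of selecting low-sensitivity synapses for approximation, taking the
# batch-averaged |dE/dw| as a proxy for impact on output error. The quantizer
# and selection ratio are illustrative assumptions, not the paper's values.
import numpy as np

def select_and_approximate(weights, grad, approx_ratio=0.5, bits=8):
    sensitivity = np.abs(grad)                      # proxy for impact on output error
    k = int(approx_ratio * weights.size)
    idx = np.argsort(sensitivity, axis=None)[:k]    # least sensitive synapses
    step = (np.abs(weights).max() + 1e-12) / (2 ** (bits - 1))
    approx = weights.copy().ravel()
    approx[idx] = np.round(approx[idx] / step) * step   # reduced-precision values
    return approx.reshape(weights.shape)

rng = np.random.default_rng(0)
w, g = rng.normal(size=(64, 32)), rng.normal(size=(64, 32))
w_approx = select_and_approximate(w, g)
print("changed weights:", np.count_nonzero(w_approx != w))
```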

35 citations

Proceedings ArticleDOI
24 Jul 2016
TL;DR: Simulation results show that the ReRAM-RNN can provide higher computation efficiency and/or a more compact design than a software realization of the RNN and dedicated CMOS-based digital and analog RNNs.
Abstract: We present a programmable, high-efficiency Recurrent Neural Network (RNN) design with synapses implemented using Resistive Random Access Memory (ReRAM). The presented ReRAM-RNN employs crossbar ReRAM arrays as synapses. A fast synapse programming methodology is realized by a CMOS-based neuron with built-in programming circuitry. The simulations are performed using an experimentally verified physical resistive switching model, instead of only functional models, providing a better estimate of system speed and power efficiency. Simulation results show that the ReRAM-RNN can provide higher computation efficiency and/or a more compact design than a software realization of the RNN and dedicated CMOS-based digital and analog RNNs. We show that the efficiency improvement of ReRAM-based neural network design is more significant in feedback networks than in feedforward networks.
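The following behavioral sketch shows how a crossbar ReRAM array can realize an RNN's vector-matrix products as summed column currents; the conductance range, differential column-pair weight mapping, and tanh neuron are assumptions, and none of the device-level switching physics used in the paper is modeled here.

```python
# Behavioral sketch of a ReRAM crossbar acting as an RNN synapse array: weights
# are mapped to conductances and a vector-matrix product is computed as summed
# column currents (Ohm's and Kirchhoff's laws). Conductance range, the
# differential (positive/negative) column-pair mapping, and the tanh neuron are
# illustrative assumptions; no resistive switching physics is modeled.
import numpy as np

G_MIN, G_MAX = 1e-6, 1e-4     # assumed conductance range in siemens

def weights_to_conductance_pair(W):
    """Map signed weights onto a positive/negative conductance column pair."""
    w_max = np.abs(W).max() + 1e-12
    g_pos = G_MIN + (G_MAX - G_MIN) * np.clip(W, 0, None) / w_max
    g_neg = G_MIN + (G_MAX - G_MIN) * np.clip(-W, 0, None) / w_max
    return g_pos, g_neg

def crossbar_rnn_step(x, h, Wx, Wh, v_read=0.2):
    """One recurrent step: currents from two crossbars, then a tanh neuron."""
    gx_p, gx_n = weights_to_conductance_pair(Wx)
    gh_p, gh_n = weights_to_conductance_pair(Wh)
    i = v_read * (x @ (gx_p - gx_n) + h @ (gh_p - gh_n))   # summed column currents
    return np.tanh(i / (v_read * G_MAX))                   # normalize and activate

rng = np.random.default_rng(0)
x, h = rng.normal(size=8), np.zeros(16)
print(crossbar_rnn_step(x, h, rng.normal(size=(8, 16)), rng.normal(size=(16, 16))))
```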

33 citations

Journal ArticleDOI
TL;DR: This paper shows that approximation by reducing bit precision and using an inexact multiplier can reduce the power consumption of a digital multilayer perceptron accelerator during the classification of MNIST (inference) with negligible accuracy degradation.
Abstract: This paper proposes that approximation by reducing bit precision and using an inexact multiplier can save power in a digital multilayer perceptron accelerator during the classification of MNIST (inference) with negligible accuracy degradation. Based on the error sensitivity precomputed during training, synaptic weights with lower sensitivity are approximated. Under given bit-precision modes, our proposed algorithm determines the bit precision for all synapses to minimize power consumption for a given target accuracy. Across the entire network, earlier layers can be approximated more aggressively since they have lower error sensitivity. The proposed algorithm can save 57.4 percent of power while accuracy is degraded by about 1.7 percent. After approximation, retraining with a few iterations can improve the accuracy while maintaining power consumption. The impact of different training conditions on the approximation is also studied. Training with small quantization error (lower bit precision) allows more power saving in inference. It is also shown that a sufficient number of iterations during training is important for approximation in inference. Networks with more layers are more sensitive to the approximation.
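To make the precision-assignment idea concrete, here is a greedy Python sketch that lowers the bit width of the least error-sensitive layers first, subject to an accuracy target; the sensitivity values, candidate bit modes, and evaluate() stub are hypothetical placeholders rather than the paper's algorithm details.

```python
# Greedy sketch of the precision-assignment idea: starting from a high-precision
# network, reduce the bit width of the least error-sensitive layer first, as long
# as a validation-accuracy target is still met. Sensitivities, bit modes, and the
# evaluate() stub below are hypothetical placeholders.
def assign_bit_precision(layers, sensitivity, evaluate, target_acc, bit_modes=(4, 6, 8)):
    """layers: dict layer_name -> current bits; sensitivity: layer_name -> float."""
    for name in sorted(layers, key=lambda n: sensitivity[n]):   # least sensitive first
        for bits in bit_modes:                                  # most aggressive mode first
            trial = dict(layers, **{name: bits})
            if evaluate(trial) >= target_acc:                   # keep the first passing mode
                layers = trial
                break
    return layers

# Toy usage with a fake evaluator whose accuracy only suffers when the last layer is cut.
sens = {"fc1": 0.1, "fc2": 0.4, "fc3": 0.9}
start = {"fc1": 16, "fc2": 16, "fc3": 16}
fake_eval = lambda cfg: 0.98 - 0.002 * sum(16 - b for n, b in cfg.items() if n == "fc3")
print(assign_bit_precision(start, sens, fake_eval, target_acc=0.97))
```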

31 citations


Cited by
Posted Content
TL;DR: This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN), and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.
Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
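A simplified sketch of the 8-bit integer multiply-accumulate arithmetic at the heart of the TPU's matrix unit: operands are quantized to int8, products are accumulated in a wider integer type, and the result is rescaled. The symmetric per-tensor quantization scheme and scale factors are illustrative assumptions, not a description of the TPU's actual datapath.

```python
# Sketch of 8-bit quantized matrix multiplication with a wide accumulator,
# the style of arithmetic performed by an 8-bit MAC matrix unit. Quantization
# scheme and scales are illustrative assumptions.
import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0 + 1e-12
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8), scale

def int8_matmul(a, b):
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)   # wide accumulator, as in MAC arrays
    return acc * (sa * sb)                            # dequantize the result

rng = np.random.default_rng(0)
a, b = rng.normal(size=(8, 16)), rng.normal(size=(16, 4))
print(np.max(np.abs(int8_matmul(a, b) - a @ b)))      # quantization error stays small
```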

3,067 citations

Proceedings ArticleDOI
24 Jun 2017
TL;DR: The Tensor Processing Unit (TPU) as discussed by the authors is a custom ASIC deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) using a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS).
Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU) --- deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X -- 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X -- 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

2,679 citations

Journal ArticleDOI
20 Nov 2017
TL;DR: In this paper, the authors provide a comprehensive tutorial and survey about the recent advances toward the goal of enabling efficient processing of DNNs, and discuss various hardware platforms and architectures that support DNN, and highlight key trends in reducing the computation cost of deep neural networks either solely via hardware design changes or via joint hardware and DNN algorithm changes.
Abstract: Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances toward the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic codesigns, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the tradeoffs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.

2,391 citations

Journal ArticleDOI
18 Jun 2016
TL;DR: This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner.
Abstract: A number of recent efforts have attempted to design accelerators for popular machine learning algorithms, such as those involving convolutional and deep neural networks (CNNs and DNNs). These algorithms typically involve a large number of multiply-accumulate (dot-product) operations. A recent project, DaDianNao, adopts a near data processing approach, where a specialized neural functional unit performs all the digital arithmetic operations and receives input weights from adjacent eDRAM banks. This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner. While the use of crossbar memory as an analog dot-product engine is well known, no prior work has designed or characterized a full-fledged accelerator based on crossbars. In particular, our work makes the following contributions: (i) We design a pipelined architecture, with some crossbars dedicated for each neural network layer, and eDRAM buffers that aggregate data between pipeline stages. (ii) We define new data encoding techniques that are amenable to analog computations and that can reduce the high overheads of analog-to-digital conversion (ADC). (iii) We define the many supporting digital components required in an analog CNN accelerator and carry out a design space exploration to identify the best balance of memristor storage/compute, ADCs, and eDRAM storage on a chip. On a suite of CNN and DNN workloads, the proposed ISAAC architecture yields improvements of 14.8×, 5.5×, and 7.5× in throughput, energy, and computational density (respectively), relative to the state-of-the-art DaDianNao architecture.
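The sketch below models, at a purely behavioral level, the bit-serial analog dot-product used by crossbar accelerators such as ISAAC: input bits are streamed one per cycle, each cycle's column sum is digitized by an ADC, and partial results are shifted and added. The bit widths, the ideal ADC, and the unsigned encoding are simplifying assumptions rather than the paper's encoding scheme.

```python
# Behavioral model of a bit-serial crossbar dot-product with an ADC per column.
# Bit widths, the ideal (clipping-only) ADC, and unsigned encoding are assumptions.
import numpy as np

def bit_serial_crossbar_dot(x_uint, W_uint, in_bits=8, adc_bits=9):
    """x_uint: unsigned inputs (< 2**in_bits); W_uint: small unsigned cell weights."""
    acc = np.zeros(W_uint.shape[1], dtype=np.int64)
    adc_max = 2 ** adc_bits - 1
    for b in range(in_bits):                       # one input bit plane per cycle
        bit_vec = (x_uint >> b) & 1                # drive wordlines with this bit plane
        col_sum = bit_vec @ W_uint                 # analog column currents (idealized)
        col_sum = np.minimum(col_sum, adc_max)     # ADC range clipping
        acc += col_sum.astype(np.int64) << b       # shift-and-add the digitized result
    return acc

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=16)
W = rng.integers(0, 4, size=(16, 8))               # e.g., 2-bit cells per column
print(bit_serial_crossbar_dot(x, W), x @ W)        # matches when the ADC does not clip
```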

1,558 citations

Posted Content
TL;DR: In this article, the authors provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs, and discuss various hardware platforms and architectures that support deep neural networks.
Abstract: Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.

677 citations