Author

Jeng-Hau Lin

Other affiliations: National Taiwan University
Bio: Jeng-Hau Lin is an academic researcher from the University of California, San Diego. The author has contributed to research in topics: Matrix exponential & Matrix decomposition. The author has an h-index of 8, co-authored 19 publications receiving 475 citations. Previous affiliations of Jeng-Hau Lin include National Taiwan University.

Papers
Proceedings ArticleDOI
22 Feb 2017
TL;DR: The design of a BNN accelerator is presented that is synthesized from C++ to FPGA-targeted Verilog and outperforms existing FPGA-based CNN accelerators in GOPS as well as energy and resource efficiency.
Abstract: Convolutional neural networks (CNN) are the current state-of-the-art for many computer vision tasks. CNNs outperform older methods in accuracy, but require vast amounts of computation and memory. As a result, existing CNN applications are typically run on clusters of CPUs or GPUs. Studies into the FPGA acceleration of CNN workloads have achieved reductions in power and energy consumption. However, large GPUs outperform modern FPGAs in throughput, and the existence of compatible deep learning frameworks gives GPUs a significant advantage in programmability. Recent research in machine learning demonstrates the potential of very low precision CNNs -- i.e., CNNs with binarized weights and activations. Such binarized neural networks (BNNs) appear well suited for FPGA implementation, as their dominant computations are bitwise logic operations and their memory requirements are reduced. A combination of low-precision networks and high-level design methodology may help address the performance and productivity gap between FPGAs and GPUs. In this paper, we present the design of a BNN accelerator that is synthesized from C++ to FPGA-targeted Verilog. The accelerator outperforms existing FPGA-based CNN accelerators in GOPS as well as energy and resource efficiency.

379 citations
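The abstract's claim that a BNN's dominant computations are bitwise logic has a compact arithmetic core: with weights and activations constrained to {-1, +1} and packed into machine words, a dot product reduces to XNOR plus popcount. The sketch below (plain Python, not the paper's accelerator code; the names and the 8-bit width are illustrative) shows the identity.

```python
# Minimal sketch: a binarized dot product via XNOR + popcount.
# Encoding bit 1 -> +1 and bit 0 -> -1, the dot product of two
# N-element {-1,+1} vectors is 2 * popcount(XNOR(a, w)) - N,
# which is why BNN inference is dominated by bitwise logic.

N = 8                       # illustrative word width
MASK = (1 << N) - 1

def encode(vec):
    """Pack a {-1,+1} vector into an integer bit mask (1 -> +1)."""
    bits = 0
    for i, v in enumerate(vec):
        if v > 0:
            bits |= 1 << i
    return bits

def bin_dot(a_bits, w_bits):
    """Binarized dot product using XNOR and popcount."""
    matches = (~(a_bits ^ w_bits)) & MASK   # XNOR: 1 where signs agree
    return 2 * bin(matches).count("1") - N  # #agree - #disagree

a = [+1, -1, +1, +1, -1, -1, +1, -1]
w = [+1, +1, -1, +1, -1, +1, +1, -1]
assert bin_dot(encode(a), encode(w)) == sum(x * y for x, y in zip(a, w))
```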

Journal ArticleDOI
TL;DR: In this paper, a fast methodology that employs only two anti-polarity one-bit data patterns instead of the pseudo-random bit sequence as input sources to simulate the worst-case eye diagram was proposed.
Abstract: As signal speeds through interconnections increase toward the multigigabit range, the effect of lossy transmission lines on the signal quality of printed circuit boards becomes a critical issue. To evaluate the eye diagram, and thus the signal integrity, of modern digital systems, this paper proposes a fast methodology that employs only two anti-polarity one-bit data patterns, instead of a pseudo-random bit sequence, as input sources to simulate the worst-case eye diagram. Analytic expressions are derived for the impulse response of lossy transmission lines due to skin-effect loss, while the Kramers-Kronig relations are employed to deal with the noncausality problem related to dielectric loss. Two design graphs that rapidly predict eye-diagram characteristics versus conductive and dielectric losses are then constructed, from which the maximum usable length of a transmission line under a given signal specification can be easily obtained. Finally, time-domain simulations and experiments verify the accuracy of the proposed approach.

40 citations
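The paper's key claim is that the worst-case eye can be found from single-bit responses rather than a long pseudo-random sequence. A closely related, standard way to see why is peak-distortion analysis: the response to one isolated bit bounds how much every other bit can close the eye. The sketch below illustrates that idea in plain Python; it is not the authors' exact method, and the pulse shape, `worst_case_eye`, and all parameters are illustrative.

```python
import numpy as np

# Minimal sketch (not the paper's exact method): estimate the worst-case
# vertical eye opening from a single-bit pulse response. Each neighboring
# bit can close the eye by at most |p(t0 - k*T)|, so summing those tails
# gives the worst case without simulating a pseudo-random bit sequence.

def worst_case_eye(pulse, samples_per_bit, t0):
    """Worst-case eye opening at sampling index t0 (peak-distortion bound)."""
    main = pulse[t0]
    isi = 0.0
    k = 1
    while t0 - k * samples_per_bit >= 0 or t0 + k * samples_per_bit < len(pulse):
        for idx in (t0 - k * samples_per_bit, t0 + k * samples_per_bit):
            if 0 <= idx < len(pulse):
                isi += abs(pulse[idx])
        k += 1
    return main - isi  # eye is open if this is positive

# Toy lossy-line pulse response: a dispersed, attenuated bit.
spb = 32                                              # samples per bit
t = np.arange(0, 8 * spb)
pulse = np.exp(-((t - 2 * spb) / (0.9 * spb)) ** 2)   # broadened pulse
print(worst_case_eye(pulse, spb, t0=2 * spb))
```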

Proceedings ArticleDOI
13 Nov 2017
TL;DR: This paper outlines the assessment methodology and the use of a cross-layer evaluation approach that extracts hardware-level errors from twenty different operating conditions and then injects those errors back into the software layer, in an attempt to answer the second question posed above.
Abstract: As a problem solving method, neural networks have shown broad applicability in areas ranging from medical applications to speech recognition and natural language processing. This success has even led to implementations of neural network algorithms in hardware. In this paper, we explore two questions: (a) to what extent do microelectronic variations affect the quality of results produced by neural networks; and (b) does the answer to the first question represent an opportunity to optimize the implementation of neural network algorithms? Regarding the first question, variations are now increasingly common in aggressive process nodes and typically manifest as an increased frequency of timing errors. Combating variations - due to process and/or operating conditions - usually results in increased guardbands in circuit and architectural design, thus reducing the gains from process technology advances. Given the inherent resilience of neural networks due to the adaptation of their learning parameters, one would expect the quality of results produced by neural networks to be relatively insensitive to the rising timing error rates caused by increased variations. On the contrary, using two frequently used neural networks (MLP and CNN), our results show that variations can significantly affect inference accuracy. This paper outlines our assessment methodology and use of a cross-layer evaluation approach that extracts hardware-level errors from twenty different operating conditions and then injects those errors back into the software layer in an attempt to answer the second question posed above.

33 citations
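The cross-layer flow described above has two halves: extracting error profiles from hardware and injecting them into the software layer. The sketch below illustrates only the injection half, with timing errors crudely modeled as random bit flips in fixed-point weights; the function, bit width, and error rates are hypothetical stand-ins, not the paper's measured error profiles.

```python
import numpy as np

# Minimal sketch of the software side of a cross-layer error-injection flow.
# Hardware timing errors are modeled as random bit flips in the two's-
# complement representation of fixed-point weights, at a per-bit error rate.

def inject_bit_errors(weights, bit_error_rate, rng, bits=16):
    """Flip each bit of a fixed-point weight tensor with probability BER."""
    scale = 2 ** (bits - 1)
    q = np.clip(np.round(weights * scale), -scale, scale - 1).astype(np.int64)
    q = q & ((1 << bits) - 1)                   # two's-complement bit view
    flips = rng.random((q.size, bits)) < bit_error_rate
    masks = (flips * (1 << np.arange(bits))).sum(axis=1).astype(np.int64)
    q = (q.ravel() ^ masks).reshape(q.shape)    # apply the bit flips
    q = np.where(q >= scale, q - 2 * scale, q)  # back to signed integers
    return q.astype(np.float64) / scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=(4, 4))
for ber in (1e-4, 1e-2):                        # hypothetical operating points
    w_err = inject_bit_errors(w, ber, rng)
    print(ber, np.abs(w_err - w).max())         # worst-case weight corruption
```

In a full evaluation one would run the perturbed network on a test set and compare inference accuracy against the error-free baseline for each operating condition.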

Proceedings ArticleDOI
01 Jun 2014
TL;DR: MATEX overcomes the stiffness hurdle of previous matrix exponential-based circuit simulators via a rational Krylov subspace method, which leads to larger step sizes with smaller Krylov subspace bases and greatly accelerates the whole computation.
Abstract: We propose MATEX, a distributed framework for transient simulation of power distribution networks (PDNs). MATEX utilizes a matrix exponential kernel with Krylov subspace approximations to solve the differential equations of linear circuits. First, the whole simulation task is divided into subtasks based on decompositions of the current sources, in order to reduce computational overhead. These subtasks are then distributed to different computing nodes and processed in parallel. Within each node, after the matrix factorization at the beginning of the simulation, the adaptive time stepping solver runs without extra matrix re-factorizations. MATEX overcomes the stiffness hurdle of previous matrix exponential-based circuit simulators via a rational Krylov subspace method, which leads to larger step sizes with smaller Krylov subspace bases and greatly accelerates the whole computation. MATEX outperforms both traditional fixed and adaptive time stepping methods, e.g., achieving around 13X speedup over the trapezoidal framework with fixed time step on the IBM power grid benchmarks.

26 citations
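The core of MATEX is advancing the circuit state with a matrix exponential evaluated in a small Krylov subspace, so no dense exponential of the full system matrix is ever formed. The sketch below shows the idea with an ordinary (polynomial) Arnoldi-based approximation of exp(h*A)v; the paper's contribution is a rational Krylov variant that handles stiff PDN systems with far smaller subspaces, and the toy matrix here merely stands in for a PDN's state matrix.

```python
import numpy as np
from scipy.linalg import expm

# Minimal sketch of the matrix-exponential time step: advance x' = A x by
# approximating exp(h*A) @ v in the m-dimensional Krylov subspace K_m(A, v).
# Only matrix-vector products with A are needed; expm is applied to a tiny
# m x m Hessenberg matrix instead of the full n x n system.

def expm_krylov(A, v, h, m=20):
    """Approximate exp(h*A) @ v with an m-step Arnoldi process."""
    n = len(v)
    beta = np.linalg.norm(v)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v / beta
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):            # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:           # happy breakdown: subspace is exact
            m = j + 1
            break
        V[:, j + 1] = w / H[j + 1, j]
    e1 = np.zeros(m)
    e1[0] = 1.0
    return beta * V[:, :m] @ (expm(h * H[:m, :m]) @ e1)

# Toy stable system standing in for a PDN state matrix.
rng = np.random.default_rng(1)
n = 200
A = -np.eye(n) + 0.1 * rng.normal(size=(n, n)) / np.sqrt(n)
x0 = rng.normal(size=n)
print(np.linalg.norm(expm_krylov(A, x0, h=0.5) - expm(0.5 * A) @ x0))
```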


Cited by
Journal ArticleDOI
Y.L. Kuo, M.L. Liou
01 Jun 1977
Computer-Aided Analysis of Electronic Circuits: Algorithms and Computational Techniques

621 citations

Proceedings ArticleDOI
02 Jun 2018
TL;DR: This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI, which achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1.
Abstract: Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models, aka "real-time AI". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-purpose architectures, has fueled an explosion of specialized Neural Processing Units (NPUs). NPUs for interactive services should satisfy two requirements: (1) execution of DNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models (e.g., RNNs, CNNs, MLPs) without costly silicon updates. This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI. The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1. The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction. The spatially distributed microarchitecture, scaled up to 96,000 multiply-accumulate units, is supported by hierarchical instruction decoders and schedulers coupled with thousands of independently addressable high-bandwidth on-chip memories, and can transparently exploit many levels of fine-grain SIMD parallelism. When targeting an FPGA, microarchitectural parameters such as native datapaths and numerical precision can be "synthesis specialized" to models at compile time, enabling atypically high FPGA performance competitive with hardened NPUs. When running on an Intel Stratix 10 280 FPGA, the Brainwave NPU achieves performance ranging from ten to over thirty-five teraflops, with no batching, on large, memory-intensive RNNs.

498 citations
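The Brainwave abstract centers on a single-threaded instruction stream whose individual instructions fan out to tens of thousands of multiply-accumulate units. The toy model below (plain Python, not Brainwave's ISA; `TILE` and `mv_instruction` are invented for illustration) shows how one matrix-vector macro-instruction decomposes into a grid of independent tile computations, which is the sense in which a single instruction can dispatch millions of operations.

```python
import numpy as np

# Toy model of the "one instruction, millions of ops" idea: a single
# matrix-vector macro-instruction is decomposed across a grid of tile
# engines, each owning one block of the matrix and a local accumulator,
# so the control stream stays single-threaded while the datapath is wide.

TILE = 64                                  # hypothetical native tile size

def mv_instruction(W, x):
    """Execute one 'matvec' macro-instruction over a grid of tiles."""
    rows, cols = W.shape
    y = np.zeros(rows)
    for r in range(0, rows, TILE):         # each (r, c) pair is a tile engine
        for c in range(0, cols, TILE):
            y[r:r + TILE] += W[r:r + TILE, c:c + TILE] @ x[c:c + TILE]
    return y

rng = np.random.default_rng(2)
W = rng.normal(size=(256, 256))
x = rng.normal(size=256)
assert np.allclose(mv_instruction(W, x), W @ x)
# Even this small 256x256 case is 65,536 MACs from one instruction; at
# Brainwave's reported scale one instruction can dispatch over 7M operations.
```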

Journal ArticleDOI
TL;DR: A comprehensive survey of algorithms proposed for binary neural networks is presented, mainly categorized into native solutions that directly conduct binarization and optimized ones that use techniques such as minimizing the quantization error, improving the network loss function, and reducing the gradient error.

346 citations
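One concrete instance of the survey's "optimized" category is choosing the binarization to minimize quantization error: with B = sign(W), the scale alpha minimizing ||W - alpha*B||^2 is the mean absolute weight (the XNOR-Net scaling). A minimal sketch, with an illustrative random weight matrix:

```python
import numpy as np

# Minimal sketch of error-minimizing binarization: B = sign(W) with a
# scale alpha = mean(|W|), the closed-form minimizer of ||W - alpha*B||^2.

def binarize(W):
    B = np.sign(W)
    B[B == 0] = 1.0                        # resolve sign(0) arbitrarily
    alpha = np.abs(W).mean()
    return alpha, B

rng = np.random.default_rng(3)
W = rng.normal(0, 0.1, size=(3, 3))
alpha, B = binarize(W)
naive_err = np.linalg.norm(W - np.sign(W))   # unscaled binarization
scaled_err = np.linalg.norm(W - alpha * B)   # scaled binarization
print(scaled_err < naive_err)                # True: scaling shrinks the error
```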

Journal ArticleDOI
TL;DR: A survey of two types of network compression, pruning and quantization, is provided; it compares current techniques, analyzes their strengths and weaknesses, provides guidance for compressing networks, and discusses possible future compression techniques.

266 citations
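For concreteness, the two compression families this survey covers can be sketched in a few lines: magnitude pruning zeroes the smallest weights, and uniform quantization snaps the survivors to a small set of levels. The thresholds, bit width, and helper names below are illustrative, not from the survey.

```python
import numpy as np

# Minimal sketch of the survey's two compression families: magnitude
# pruning (zero out the smallest weights) followed by uniform symmetric
# quantization of the surviving weights.

def prune(W, sparsity=0.5):
    """Zero the fraction `sparsity` of weights with smallest magnitude."""
    k = int(W.size * sparsity)
    thresh = np.sort(np.abs(W).ravel())[k]
    return np.where(np.abs(W) < thresh, 0.0, W)

def quantize(W, bits=4):
    """Uniform symmetric quantization to at most 2^bits - 1 levels."""
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    return np.round(W / scale) * scale

rng = np.random.default_rng(4)
W = rng.normal(size=(8, 8))
Wc = quantize(prune(W, 0.5), bits=4)
print((Wc == 0).mean(), np.unique(Wc).size)  # sparsity and level count
```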