Author

Henk Corporaal

Bio: Henk Corporaal is an academic researcher at Eindhoven University of Technology. He has contributed to research on topics including very long instruction word (VLIW) architectures and compilers. He has an h-index of 40 and has co-authored 466 publications, which have received 7,111 citations. His previous affiliations include Delft University of Technology and IMEC.


Papers
Proceedings ArticleDOI
07 Nov 2013
TL;DR: The effects of the memory bottleneck can be reduced by a flexible memory hierarchy that supports the complex data access patterns in CNN workloads; the accompanying design flow minimizes on-chip memory size, which reduces area and energy usage.
Abstract: In the near future, cameras will be used everywhere as flexible sensors for numerous applications. For mobility and privacy reasons, the required image processing should run locally on embedded computer platforms with performance requirements and energy constraints. Dedicated acceleration of Convolutional Neural Networks (CNN) can achieve these targets with enough flexibility to perform multiple vision tasks. A challenging problem for the design of efficient accelerators is the limited amount of external memory bandwidth. We show that the effects of the memory bottleneck can be reduced by a flexible memory hierarchy that supports the complex data access patterns in CNN workloads. The efficiency of the on-chip memories is maximized by our scheduler, which uses tiling to optimize for data locality. Our design flow ensures that on-chip memory size is minimized, which reduces area and energy usage. The design flow is evaluated with a High-Level Synthesis implementation on a Virtex 6 FPGA board. Compared to accelerators with standard scratchpad memories, the FPGA resources can be reduced by up to 13× while maintaining the same performance. Alternatively, when the same amount of FPGA resources is used, our accelerators are up to 11× faster.
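As background, the data-locality idea such a scheduler exploits can be shown with a plain loop-tiling sketch. The layer dimensions, tile sizes, and single-channel simplification below are illustrative, not taken from the paper:

```c
#include <stdio.h>

/* Illustrative layer dimensions and tile sizes; the paper's scheduler
 * searches for tile sizes that minimize on-chip buffer requirements. */
#define H  64          /* output feature map height                 */
#define W  64          /* output feature map width                  */
#define K  5           /* convolution kernel size                   */
#define TH 8           /* tile height: bounds the on-chip data set  */
#define TW 8           /* tile width                                */

static inline int min(int a, int b) { return a < b ? a : b; }

/* Tiled single-channel convolution: each (th, tw) tile touches only a
 * (TH+K-1) x (TW+K-1) window of the input, so an on-chip buffer of
 * that size suffices regardless of the full image size. */
void conv2d_tiled(float in[H + K - 1][W + K - 1],
                  float kernel[K][K],
                  float out[H][W])
{
    for (int th = 0; th < H; th += TH)
        for (int tw = 0; tw < W; tw += TW)
            /* inner loops work entirely inside one tile */
            for (int y = th; y < min(th + TH, H); y++)
                for (int x = tw; x < min(tw + TW, W); x++) {
                    float acc = 0.0f;
                    for (int ky = 0; ky < K; ky++)
                        for (int kx = 0; kx < K; kx++)
                            acc += in[y + ky][x + kx] * kernel[ky][kx];
                    out[y][x] = acc;
                }
}

int main(void)
{
    static float in[H + K - 1][W + K - 1], kernel[K][K], out[H][W];

    for (int y = 0; y < H + K - 1; y++)
        for (int x = 0; x < W + K - 1; x++)
            in[y][x] = (float)((y * x) % 7);
    kernel[K / 2][K / 2] = 1.0f;   /* identity kernel for the demo */

    conv2d_tiled(in, kernel, out);
    printf("out[10][10] = %.0f\n", out[10][10]);  /* equals in[12][12] = 4 */
    return 0;
}
```

The point of the transformation is that the buffer size is set by the tile, not by the image, which is why tiling lets the design flow shrink the on-chip memories.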

361 citations

Book
01 Jan 1997
TL;DR: This work introduces a new type of computer architecture, the transport triggered architecture (TTA), designed to alleviate delay problems in existing instruction-level parallel architectures.
Abstract: From the Publisher: The market for single-chip microprocessors is huge, and their performance continues to increase, driven by the ongoing demand for more powerful applications, particularly in the control and signal processing domains. This work introduces a new type of computer architecture, the transport triggered architecture (TTA), designed to alleviate delay problems in existing instruction-level parallel architectures.
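To illustrate the TTA programming model in general terms: programs consist only of move operations that transport data between register files and function-unit ports, and an operation fires as a side effect of writing a trigger port. A toy C model of a transport-triggered adder follows; the port names and the modeling are invented here, not the book's notation:

```c
#include <stdio.h>

/* Toy model of one transport-triggered function unit (an adder).
 * Port and register names are invented for illustration; real TTAs
 * are described by a machine description, not an API like this. */
typedef struct {
    int operand;   /* latched operand port             */
    int trigger;   /* writing here fires the operation */
    int result;    /* readable result port             */
} AdderFU;

/* A "move" is the only instruction: copy a value onto a port.
 * Writing the trigger port executes the add as a side effect. */
static void move_to_trigger(AdderFU *fu, int value)
{
    fu->trigger = value;
    fu->result  = fu->operand + fu->trigger;  /* operation fires */
}

int main(void)
{
    AdderFU add = {0};
    int r1 = 3, r2 = 4, r3;

    /* A conventional "add r3, r1, r2" becomes three transports: */
    add.operand = r1;            /* r1 -> add.operand  */
    move_to_trigger(&add, r2);   /* r2 -> add.trigger  */
    r3 = add.result;             /* add.result -> r3   */

    printf("r3 = %d\n", r3);     /* prints r3 = 7 */
    return 0;
}
```

Exposing the transports to the compiler is what lets a TTA hide operand bypassing and reduce register-file pressure, the delay problems the book targets.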

286 citations

Proceedings ArticleDOI
04 Jun 2007
TL;DR: A new resource allocation strategy that works directly on SDFGs is presented, building on an efficient technique for calculating the throughput of a bound and scheduled SDFG; experimental results show that the strategy is effective in terms of run-time and allocated resources.
Abstract: Embedded multimedia systems often run multiple time-constrained applications simultaneously. These systems use multiprocessor systems-on-chip for which it must be guaranteed that enough resources are available for each application to meet its throughput constraints. This requires a task binding and scheduling mechanism that provides timing guarantees for each application, independent of other applications, while taking into account the available processor space, memory, and communication bandwidth. Synchronous dataflow graphs (SDFGs) are used to model time-constrained multimedia applications. They allow modeling of cyclic, multi-rate dependencies between tasks. However, existing resource allocation techniques can only deal with acyclic and/or single-rate dependencies. Dependencies in an SDFG can be expressed in single-rate form, but then the problem size may increase exponentially, making resource allocation infeasible. This paper presents a new resource allocation strategy that works directly on SDFGs, building on an efficient technique to calculate the throughput of a bound and scheduled SDFG. Experimental results show that the strategy is effective in terms of run-time and allocated resources.
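For readers unfamiliar with SDFGs, the consistency check that precedes any such analysis is solving the balance equations q[src]·prod = q[dst]·cons for the repetition vector q. The sketch below shows only this preliminary step, not the paper's throughput algorithm, on an invented three-actor chain; real tools handle arbitrary graphs:

```c
#include <stdio.h>

/* Sketch: solve the SDF balance equations q[src]*prod == q[dst]*cons
 * for a small example. Topology and rates are invented for
 * illustration. */
typedef struct { int src, dst, prod, cons; } Edge;

static long gcd(long a, long b) { return b ? gcd(b, a % b) : a; }

#define N_ACTORS 3
#define N_EDGES  2

int main(void)
{
    /* Chain A -> B -> C: A produces 2 tokens, B consumes 3; B
     * produces 1 token, C consumes 2. */
    Edge edges[N_EDGES] = { {0, 1, 2, 3}, {1, 2, 1, 2} };
    long q[N_ACTORS] = {0};
    q[0] = 1;

    /* Propagate rates along edges; one pass suffices here because
     * the chain's edges are listed in topological order. */
    for (int i = 0; i < N_EDGES; i++) {
        Edge e = edges[i];
        long lhs = q[e.src] * e.prod;
        long scale = e.cons / gcd(lhs, e.cons);
        for (int a = 0; a < N_ACTORS; a++)
            q[a] *= scale;                 /* keep entries integral */
        q[e.dst] = q[e.src] * e.prod / e.cons;
    }

    for (int a = 0; a < N_ACTORS; a++)
        printf("q[%d] = %ld\n", a, q[a]);  /* prints q = [3, 2, 1] */
    return 0;
}
```

The repetition vector is what makes the single-rate expansion blow up: each actor a is duplicated q[a] times, which is exactly the growth the paper's direct-on-SDFG strategy avoids.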

191 citations

Proceedings ArticleDOI
09 Mar 2015
TL;DR: The paper first highlights some challenges of the newly born Big Data paradigm and shows that the growth in data size has already surpassed the capabilities of today's computation architectures, which suffer from limited bandwidth, programmability overhead, energy inefficiency, and limited scalability.
Abstract: One of the most critical challenges for today's and future data-intensive and big-data problems is data storage and analysis. This paper first highlights some challenges of the newly born Big Data paradigm and shows that the growth in data size has already surpassed the capabilities of today's computation architectures, which suffer from limited bandwidth, programmability overhead, energy inefficiency, and limited scalability. Thereafter, the paper introduces a new memristor-based architecture for data-intensive applications. The potential of such an architecture in solving data-intensive problems is illustrated by showing its capability to increase computation efficiency, alleviate the communication bottleneck, and reduce leakage currents. Finally, the paper discusses why memristor technology is very suitable for the realization of such an architecture; the use of memristors to implement dual functions (storage and logic) is illustrated.
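One concrete way memristors can serve both storage and logic, shown here as general background rather than as the paper's specific design, is stateful material-implication (IMPLY) logic: two IMPLY steps on three devices compute NAND, which is functionally complete. A minimal truth-table simulation:

```c
#include <stdio.h>

/* Sketch of stateful IMPLY (material implication) logic, one way the
 * same memristor that stores a bit can also compute with it.
 * IMPLY(p, q) conditionally switches q, leaving q = (NOT p) OR q. */
static int imply(int p, int q) { return !p | q; }

int main(void)
{
    puts("p q | NAND via two IMPLY steps");
    for (int p = 0; p <= 1; p++)
        for (int q = 0; q <= 1; q++) {
            int s = 0;          /* work memristor, initialized to 0   */
            s = imply(p, s);    /* s = NOT p                          */
            s = imply(q, s);    /* s = (NOT q) OR (NOT p) = NAND(p,q) */
            printf("%d %d | %d\n", p, q, s);
        }
    return 0;
}
```

Because the operands stay resident in the memristor array while being computed on, data never crosses the memory bus, which is the communication bottleneck the abstract refers to.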

159 citations

Journal ArticleDOI
TL;DR: A generic and systematic design-time/run-time methodology for handling the dynamic nature of modern embedded systems, based on the concept of system scenarios, which allows the system to learn on the fly during its execution and adapt itself to the current input stimuli.
Abstract: In the past decade, real-time embedded systems have become much more complex due to the introduction of substantial new functionality within a single application, and due to running multiple applications concurrently. This increases the dynamic nature of today's applications and systems, and tightens the requirements on their constraints in terms of deadlines and energy consumption. State-of-the-art design methodologies try to cope with these novel issues by identifying the most frequently used cases and dealing with them separately, reducing the newly introduced complexity. This article presents a generic and systematic design-time/run-time methodology for handling the dynamic nature of modern embedded systems, which can be utilized by existing design methodologies to increase their efficiency. It is based on the concept of system scenarios, which group system behaviors that are similar from a multidimensional cost perspective (such as resource requirements, delay, and energy consumption) in such a way that the system can be configured to exploit this cost similarity. At design time, these scenarios are individually optimized. Mechanisms for predicting the current scenario at run time, and for switching between scenarios, are also derived. This design trajectory is augmented with a run-time calibration mechanism, which allows the system to learn on the fly during its execution and to adapt itself to the current input stimuli by extending the scenario set, changing the scenario definitions, and adjusting both the prediction and switching mechanisms. To demonstrate the generality of our methodology, we show how it has been applied to four very different real-life design problems. In all presented case studies, substantial energy reductions were obtained by exploiting scenarios.
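A minimal sketch of the run-time side of such a scenario methodology: a predictor maps an observable input parameter to a scenario, each scenario carries a configuration optimized at design time, and a switch happens only when the predicted scenario changes. The scenario definitions, load parameter, and frequency settings below are invented; in the methodology they would come from design-time clustering of cost-similar behaviors:

```c
#include <stdio.h>

/* Minimal sketch of run-time scenario prediction and switching.
 * Parameter ranges and frequencies are invented for illustration. */
typedef struct {
    const char *name;
    int max_load;     /* upper bound of the observed input parameter */
    int freq_mhz;     /* configuration chosen at design time         */
} Scenario;

static const Scenario scenarios[] = {
    { "low",    100,  200 },
    { "medium", 500,  600 },
    { "high",  1000, 1000 },
};
enum { N_SCEN = sizeof scenarios / sizeof scenarios[0] };

/* Predictor: classify the current input into a scenario. */
static int predict(int load)
{
    for (int s = 0; s < N_SCEN; s++)
        if (load <= scenarios[s].max_load) return s;
    return N_SCEN - 1;   /* backup scenario: worst case */
}

int main(void)
{
    int current = -1;
    int loads[] = { 50, 80, 700, 650, 120 };   /* per-frame input */

    for (size_t i = 0; i < sizeof loads / sizeof loads[0]; i++) {
        int s = predict(loads[i]);
        if (s != current) {             /* switch only on change,   */
            current = s;                /* since switching has cost */
            printf("frame %zu: switch to '%s' (%d MHz)\n",
                   i, scenarios[s].name, scenarios[s].freq_mhz);
        }
    }
    return 0;
}
```

The calibration mechanism described in the abstract would, in addition, adjust the scenario table and the predictor thresholds at run time; that part is omitted here.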

158 citations


Cited by
Journal Article
TL;DR: This book by a teacher of statistics (as well as a consultant for "experimenters") is a comprehensive study of the philosophical background for the statistical design of experiments.
Abstract: THE DESIGN AND ANALYSIS OF EXPERIMENTS. By Oscar Kempthorne. New York, John Wiley and Sons, Inc., 1952. 631 pp. $8.50. This book by a teacher of statistics (as well as a consultant for "experimenters") is a comprehensive study of the philosophical background for the statistical design of experiments. It is necessary to have some facility with algebraic notation and manipulation to be able to use the volume intelligently. The problems are presented from the theoretical point of view, without such practical examples as would be helpful for those not acquainted with mathematics. The mathematical justification for the techniques is given. As a somewhat advanced treatment of the design and analysis of experiments, this volume will be interesting and helpful for many who approach statistics theoretically as well as practically. With emphasis on the "why," and with description given broadly, the author relates the subject matter to the general theory of statistics and to the general problem of experimental inference. MARGARET J. ROBERTSON

13,333 citations

Posted Content
TL;DR: This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN), and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.
Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X-30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X-80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
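Stripped of the systolic dataflow that makes the hardware efficient, the arithmetic each of those 65,536 MAC cells performs is an 8-bit multiply accumulated into a wider register. A scalar sketch with illustrative matrix sizes, not the unit's implementation:

```c
#include <stdint.h>
#include <stdio.h>

/* Scalar sketch of the 8-bit MAC arithmetic a matrix unit performs:
 * int8 inputs, 32-bit accumulators to avoid overflow. Sizes are
 * illustrative; the systolic dataflow of the real unit is omitted. */
#define M 4
#define K 8
#define N 4

void matmul_int8(const int8_t a[M][K], const int8_t b[K][N],
                 int32_t c[M][N])
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            int32_t acc = 0;
            for (int k = 0; k < K; k++)
                acc += (int32_t)a[i][k] * (int32_t)b[k][j];  /* MAC */
            c[i][j] = acc;
        }
}

int main(void)
{
    static int8_t a[M][K], b[K][N];
    static int32_t c[M][N];

    for (int i = 0; i < M; i++)
        for (int k = 0; k < K; k++) a[i][k] = (int8_t)(i + k);
    for (int k = 0; k < K; k++)
        for (int j = 0; j < N; j++) b[k][j] = (int8_t)(k - j);

    matmul_int8(a, b, c);
    printf("c[0][0] = %d\n", c[0][0]);  /* sum of k*k, k=0..7 = 140 */
    return 0;
}
```

The hardware's advantage comes from streaming operands between neighboring MAC cells rather than re-reading memory for every product, which is why the deterministic systolic organization can sustain the quoted peak throughput.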

3,067 citations

Journal ArticleDOI
20 Nov 2017
TL;DR: In this paper, the authors provide a comprehensive tutorial and survey of recent advances toward the goal of enabling efficient processing of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of deep neural networks, either solely via hardware design changes or via joint hardware design and DNN algorithm changes.
Abstract: Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, this accuracy comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances toward the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic codesigns, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the tradeoffs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.

2,391 citations