scispace - formally typeset
Journal ArticleDOI

Automatic generation of efficient accelerators for reconfigurable hardware

TLDR
A hybrid area estimation technique which uses template-level models and design-level artificial neural networks to account for effects from hardware place-and-route tools, including routing overheads, register and block RAM duplication, and LUT packing is described.
Abstract
Acceleration in the form of customized datapaths offer large performance and energy improvements over general purpose processors. Reconfigurable fabrics such as FPGAs are gaining popularity for use in implementing application-specific accelerators, thereby increasing the importance of having good high-level FPGA design tools. However, current tools for targeting FPGAs offer inadequate support for high-level programming, resource estimation, and rapid and automatic design space exploration.We describe a design framework that addresses these challenges. We introduce a new representation of hardware using parameterized templates that captures locality and parallelism information at multiple levels of nesting. This representation is designed to be automatically generated from high-level languages based on parallel patterns. We describe a hybrid area estimation technique which uses template-level models and design-level artificial neural networks to account for effects from hardware place-and-route tools, including routing overheads, register and block RAM duplication, and LUT packing. Our runtime estimation accounts for off-chip memory accesses. We use our estimation capabilities to rapidly explore a large space of designs across tile sizes, parallelization factors, and optional coarse-grained pipelining, all at multiple loop levels. We show that estimates average 4.8% error for logic resources, 6.1% error for runtimes, and are 279 to 6533 times faster than a commercial high-level synthesis tool. We compare the best-performing designs to optimized CPU code running on a server-grade 6 core processor and show speedups of up to 16.7×.

read more

Citations
More filters
Proceedings ArticleDOI

An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems

TL;DR: This paper presents DeathStarBench, a novel, open-source benchmark suite built with microservices that is representative of large end-to-end services, modular and extensible, and uses it to study the architectural characteristics of microservices, their implications in networking and operating systems, their challenges with respect to cluster management, and their trade-offs in terms of application design and programming frameworks.
Proceedings ArticleDOI

Plasticine: A Reconfigurable Architecture For Parallel Paterns

TL;DR: This work designs Plasticine, a new spatially reconfigurable architecture designed to efficiently execute applications composed of parallel patterns that provide an improvement of up to 76.9× in performance-per-Watt over a conventional FPGA over a wide range of dense and sparse applications.
Proceedings ArticleDOI

Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent

TL;DR: The DMGC model is introduced, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and it is shown that it provides a way to both classify these algorithms and model their performance.
Proceedings ArticleDOI

Spatial: a language and compiler for application accelerators

TL;DR: This work describes a new domain-specific language and compiler called Spatial for higher level descriptions of application accelerators, and summarizes the compiler passes required to support these abstractions, including pipeline scheduling, automatic memory banking, and automated design tuning driven by active machine learning.
Journal ArticleDOI

A Survey of Coarse-Grained Reconfigurable Architecture and Design: Taxonomy, Challenges, and Applications

TL;DR: The architecture and design of CGRAs are reviewed thoroughly, a novel multidimensional taxonomy is proposed, and major challenges and the corresponding state-of-the-art techniques are surveyed and analyzed.
References
More filters
Journal ArticleDOI

Measuring the Gap Between FPGAs and ASICs

TL;DR: Experimental measurements of the differences between a 90- nm CMOS field programmable gate array (FPGA) and 90-nm CMOS standard-cell application-specific integrated circuits (ASICs) in terms of logic density, circuit speed, and power consumption for core logic are presented.
Proceedings ArticleDOI

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

TL;DR: A systematic model of the tradeoff space fundamental to stencil pipelines is presented, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule are presented.

Benchmarking modern multiprocessors

TL;DR: A methodology to design effective benchmark suites is developed and its effectiveness is demonstrated by developing and deploying a benchmark suite for evaluating multiprocessors called PARSEC, which has been adopted by many architecture groups in both research and industry.
Journal ArticleDOI

High-Level Synthesis for FPGAs: From Prototyping to Deployment

TL;DR: AutoESL's AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx are used as an example to demonstrate the effectiveness of state-of-art C-to-FPGA synthesis solutions targeting multiple application domains.
Related Papers (5)