Automatic generation of efficient accelerators for reconfigurable hardware

doi:10.1145/3007787.3001150

Journal Article

Vol. 44, Iss. 3, pp. 115-127

TLDR

A hybrid area estimation technique which uses template-level models and design-level artificial neural networks to account for effects from hardware place-and-route tools, including routing overheads, register and block RAM duplication, and LUT packing is described.

Abstract:

Acceleration in the form of customized datapaths offers large performance and energy improvements over general-purpose processors. Reconfigurable fabrics such as FPGAs are gaining popularity for implementing application-specific accelerators, which increases the importance of good high-level FPGA design tools. However, current tools for targeting FPGAs offer inadequate support for high-level programming, resource estimation, and rapid, automatic design space exploration. We describe a design framework that addresses these challenges. We introduce a new representation of hardware using parameterized templates that captures locality and parallelism information at multiple levels of nesting. This representation is designed to be automatically generated from high-level languages based on parallel patterns. We describe a hybrid area estimation technique that uses template-level models and design-level artificial neural networks to account for effects from hardware place-and-route tools, including routing overheads, register and block RAM duplication, and LUT packing. Our runtime estimation accounts for off-chip memory accesses. We use our estimation capabilities to rapidly explore a large space of designs across tile sizes, parallelization factors, and optional coarse-grained pipelining, all at multiple loop levels. We show that estimates average 4.8% error for logic resources and 6.1% error for runtimes, and are 279 to 6533 times faster than a commercial high-level synthesis tool. We compare the best-performing designs to optimized CPU code running on a server-grade 6-core processor and show speedups of up to 16.7×.
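The exploration loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' framework: the paper's estimators combine template-level models with design-level neural networks, whereas the two estimator functions below are toy closed-form stand-ins, and every name (`Design`, `estimate_area`, `estimate_cycles`, `explore`) and every constant is a hypothetical chosen for illustration. The sketch only shows the general shape of the idea: enumerate tile sizes, parallelization factors, and the pipelining toggle, discard points that exceed an area budget, and keep the fastest estimated design.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Design:
    tile: int        # tile size: elements held on-chip per tile
    par: int         # parallelization factor of the inner loop
    pipelined: bool  # coarse-grained pipelining of load and compute stages


def estimate_area(d: Design) -> int:
    """Toy logic-resource estimate in arbitrary units (hypothetical model)."""
    compute = 120 * d.par        # assumed cost of one datapath lane
    buffers = d.tile // 4        # assumed on-chip buffer cost
    if d.pipelined:
        buffers *= 2             # double buffering between the two stages
    return compute + buffers


def estimate_cycles(d: Design, n: int) -> int:
    """Toy runtime estimate for n elements, including off-chip loads."""
    tiles = -(-n // d.tile)      # ceiling division
    load = 100 + d.tile // 8     # assumed fixed DRAM latency plus transfer time
    compute = -(-d.tile // d.par)
    if d.pipelined:
        # Two overlapped stages: steady state is bounded by the slower
        # stage, plus one pass of the faster stage to fill or drain.
        return tiles * max(load, compute) + min(load, compute)
    return tiles * (load + compute)


def explore(n, area_budget, tiles=(64, 256, 1024, 4096), pars=(1, 2, 4, 8, 16)):
    """Return (design, cycles) with the lowest estimated runtime, or None."""
    best = None
    for tile, par, pipelined in product(tiles, pars, (False, True)):
        d = Design(tile, par, pipelined)
        if par > tile or estimate_area(d) > area_budget:
            continue             # prune infeasible points before ranking
        cycles = estimate_cycles(d, n)
        if best is None or cycles < best[1]:
            best = (d, cycles)
    return best


if __name__ == "__main__":
    print(explore(n=1 << 20, area_budget=2000))
```

Because each point is evaluated with cheap estimates rather than a full synthesis run, an exhaustive sweep like this stays practical even as the number of loop levels and parameters grows. That is the motivation the abstract gives for fast estimation.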

Citations

Proceedings Article

An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems

Yu Gan, +23 more

TL;DR: This paper presents DeathStarBench, a novel, open-source benchmark suite built with microservices that is representative of large end-to-end services, modular and extensible, and uses it to study the architectural characteristics of microservices, their implications in networking and operating systems, their challenges with respect to cluster management, and their trade-offs in terms of application design and programming frameworks.

Proceedings Article

Plasticine: A Reconfigurable Architecture For Parallel Patterns

Raghu Prabhakar, +8 more

TL;DR: This work designs Plasticine, a new spatially reconfigurable architecture built to efficiently execute applications composed of parallel patterns, and reports an improvement of up to 76.9× in performance-per-Watt over a conventional FPGA across a wide range of dense and sparse applications.

Proceedings Article

Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent

Christopher De Sa, +3 more

TL;DR: The DMGC model is introduced, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and it is shown that it provides a way to both classify these algorithms and model their performance.

Proceedings Article

Spatial: a language and compiler for application accelerators

David Koeplinger, +10 more

TL;DR: This work describes a new domain-specific language and compiler called Spatial for higher level descriptions of application accelerators, and summarizes the compiler passes required to support these abstractions, including pipeline scheduling, automatic memory banking, and automated design tuning driven by active machine learning.

Journal Article

A Survey of Coarse-Grained Reconfigurable Architecture and Design: Taxonomy, Challenges, and Applications

Leibo Liu, +7 more

16 Oct 2019, ACM Computing Surveys

TL;DR: The architecture and design of CGRAs are reviewed thoroughly, a novel multidimensional taxonomy is proposed, and major challenges and the corresponding state-of-the-art techniques are surveyed and analyzed.
