
High-Level Synthesis for FPGAs: From Prototyping to Deployment

TL;DR: AutoESL's AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx are used as an example to demonstrate the effectiveness of state-of-art C-to-FPGA synthesis solutions targeting multiple application domains.
Abstract: Escalating system-on-chip design complexity is pushing the design community to raise the level of abstraction beyond register transfer level. Despite the unsuccessful adoptions of early generations of commercial high-level synthesis (HLS) systems, we believe that the tipping point for transitioning to HLS methodology is happening now, especially for field-programmable gate array (FPGA) designs. The latest generation of HLS tools has made significant progress in providing wide language coverage and robust compilation technology, platform-based modeling, advancement in core HLS algorithms, and a domain-specific approach. In this paper, we use AutoESL's AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx as an example to demonstrate the effectiveness of state-of-art C-to-FPGA synthesis solutions targeting multiple application domains. Complex industrial designs targeting Xilinx FPGAs are also presented as case studies, including comparison of HLS solutions versus optimized manual designs. In particular, the experiment on a sphere decoder shows that the HLS solution can achieve an 11-31% reduction in FPGA resource usage with improved design productivity compared to hand-coded design.

Summary (8 min read)

I. INTRODUCTION

  • The rapid increase of complexity in System-on-a-Chip (SoC) design has encouraged the design community to seek design abstractions with better productivity than RTL.
  • In addition to the line-count reduction in design specifications, behavioral synthesis has the added value of allowing efficient reuse of behavioral IPs.
  • The wide availability of SystemC functional models directly drives the need for SystemC-based HLS solutions, which can automatically generate RTL code through a series of formal constructive transformations.
  • These pre-defined building blocks can be modeled precisely ahead of time for each FPGA platform and, to a large extent, confine the design space.
  • In Sections IV-VIII, using a state-of-art HLS tool as an example, the authors discuss some key reasons for the wider adoption of HLS solutions in the FPGA design community, including wide language coverage and robust compilation technology, platform-based modeling, advancement in core HLS algorithms, improvements on simulation and verification flow, and the availability of domain-specific design templates.

II. EVOLUTION OF HIGH-LEVEL SYNTHESIS FOR FPGA

  • Compilers for high-level languages have been successful in practice since the 1950s.
  • The idea of automatically generating circuit implementations from high-level behavioral specifications arises naturally with the increasing design complexity of integrated circuits.
  • Most of those tools, however, made rather simplistic assumptions about the target platform and were not widely used.
  • Early commercialization efforts in the 1990s and early 2000s attracted considerable interest among designers, but also failed to gain wide adoption, due in part to usability issues and poor quality of results.
  • More recent efforts in high-level synthesis have improved usability by increasing input language coverage and platform integration, as well as improving quality of results.

A. Early Efforts

  • Since the history of HLS is considerably longer than that of FPGAs, most early HLS tools targeted ASIC designs.
  • In the subsequent years in the 1980s and early 1990s, a number of similar high-level synthesis tools were built, mostly for research.
  • The list scheduling algorithm and its variants are widely used to solve scheduling problems with resource constraints [70]; the force-directed scheduling algorithm developed in HAL [73] is able to optimize resource requirements under a performance constraint; the path-based scheduling algorithm in the Yorktown Silicon Compiler is useful to optimize performance with conditional branches [12].
  • The Silage language, along with the Cathedral-II tool, represented an early domain-specific approach in high-level synthesis.
  • These tools received wide attention, but failed to widely replace RTL design.

B. Recent efforts

  • Since 2000, a new generation of high-level synthesis tools has been developed in both academia and industry.
  • The use of C-based languages also makes it easy to leverage the newest technologies in software compilers for parallelization and optimization in the synthesis tools.
  • (ii) C and C++ have complex language constructs, such as pointers, dynamic memory management, recursion, polymorphism, etc., which do not have efficient hardware counterparts and lead to difficulty in synthesis (a sketch of the restricted, directive-annotated style that modern tools accept instead follows this list).
  • Handel-C allows the user to specify clock boundaries explicitly in the source code.
  • FPGAs have continually improved in capacity and speed in recent years, and their programmability makes them an attractive platform for many applications in signal processing, communication, and high-performance computing.
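
To make this concrete, below is a minimal sketch of the restricted, directive-annotated entry style referenced above: the body stays within a synthesizable ANSI C/C++ subset that an ordinary software compiler accepts unchanged, while a pragma carries the hardware-only pipelining intent. The pragma spelling here is illustrative rather than any particular tool's syntax.

    // Plain C/C++ that a software compiler accepts unchanged; the pragma
    // is meaningful only to an HLS tool (unknown pragmas are ignored by
    // standard compilers), so the same source runs in software and
    // synthesizes to hardware without rewriting.
    void fir64(const int x[64], const int c[64], int *y) {
        int acc = 0;
        for (int i = 0; i < 64; ++i) {
    #pragma HLS_PIPELINE  // illustrative pipeline directive
            acc += c[i] * x[i];
        }
        *y = acc;
    }

Because nothing outside the pragma is tool-specific, the function can move between a software build and a hardware build, which is the co-simulation benefit the paper attributes to this entry style.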

C. Lessons Learned

  • The authors believe that past failures are due to one or several of the following reasons:
  • The first generation of the HLS synthesis tools could not synthesize high-level programming languages.
  • Instead, untimed or partially timed behavioral HDL was used.
  • C and C++ lack the necessary constructs and semantics to represent hardware attributes such as design hierarchy, timing, synchronization, and explicit concurrency.

Lack of reusable and portable design specification:

  • Many HLS tools have required users to embed detailed timing and interface information as well as the synthesis constraints into the source code.
  • Lack of satisfactory quality of results (QoR):
  • There was no dependable RTL to GDSII foundation to support HLS, which made it difficult to consistently measure, track, and enhance HLS results.
  • As a result, the final implementation often fails to meet timing/power requirements.
  • Another major factor limiting quality of result was the limited capability of HLS tools to exploit performance-optimized and power-efficient IP blocks on a specific platform, such as the versatile DSP blocks and on-chip memories on modern FPGA platforms.

Lack of a compelling reason/event to adopt a new design methodology:

  • The first-generation HLS tools were clearly ahead of their time, as the design complexity was still manageable at the register transfer level in late 1990s.
  • Like any major transition in the EDA industry, designers needed a compelling reason or event to push them over the "tipping point," i.e., to adopt the HLS design methodology.
  • The goal of taking any input program and automatically generating the “best” hardware architecture is not generally practical for HLS to achieve.
  • It is critical that these optimizations be carefully implemented using scalable and predictable algorithms, keeping tool runtimes acceptable for large programs and the results understandable by designers.
  • The code should be readable by algorithm specialists.

2. Effectively generate efficient parallel architectures

  • For parallelizable algorithms, such architectures should be generated with minimal modification of the C code.
  • Allow an optimization-oriented design process, where a designer can improve the performance of the resulting implementation by successive code modification and refactoring.
  • Generate implementations that are competitive with synthesizable RTL designs after automatic and manual optimization.
  • Moreover, the authors are pleased to see that the latest generation of HLS tools has made significant progress in providing wide language coverage and robust compilation technology, platform-based modeling, and advanced core HLS algorithms.
  • The authors shall discuss these advancements in more detail in the next few sections.

III. CASE STUDY OF STATE-OF-ART OF HIGH-LEVEL SYNTHESIS FOR FPGAS

  • AutoPilot is one of the most recent HLS tools, and is representative of the capabilities of the state-of-art commercial HLS tools available today.
  • AutoPilot outputs RTL in Verilog, VHDL or cycle-accurate SystemC for simulation and verification.
  • These SystemC wrappers connect high-level interfacing objects in the behavioral test bench with pin-level signals in RTL.
  • The reports include a breakdown of performance and area metrics by individual modules, functions and loops in the source code.
  • Finally, the generated HDL files and design constraints feed into the Xilinx RTL tools for implementation.

Improved design quality:

  • Comprehensive language support allows designers to take full advantage of rich C/C++ constructs to maximize simulation speed, design modularity and reusability, as well as synthesis QoR.
  • In fact, many early C-based synthesis tools only handle a very limited language subset, which typically includes the native integer data types (e.g., char, short, int, etc.), one-dimensional arrays, if-then-else conditionals, and for loops.
  • The arbitrary-precision fixed-point (ap_fixed) data types support all common algorithmic operations.
  • Designers can explore the accuracy and cost tradeoff by modifying the resolution and fixed-point location and experimenting with various quantization and saturation modes, as sketched after this list.
  • AutoPilot also supports the OSCI synthesizable subset [113] for SystemC synthesis.
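
As a sketch of this exploration style, the fragment below uses the arbitrary-precision fixed-point template in the spelling popularized by AutoPilot's descendants (ap_fixed<width, integer bits, quantization mode, overflow mode>); the wrapper function and constant are invented for illustration.

    #include <ap_fixed.h>  // arbitrary-precision fixed-point types

    // 12 bits total, 4 integer bits, round-to-nearest, saturate on overflow.
    typedef ap_fixed<12, 4, AP_RND, AP_SAT> sample_t;
    typedef ap_fixed<18, 6, AP_RND, AP_SAT> result_t;

    result_t scale(sample_t x) {
        // The algorithm is unchanged as the widths vary; exploring the
        // accuracy/cost tradeoff is a matter of editing the typedefs.
        const sample_t k = 0.70710678;  // quantized into 12-bit fixed point
        return result_t(x) * k;
    }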

B. Use of state-of-the-art compiler technologies

  • AutoPilot tightly integrates the LLVM compiler infrastructure [59] [110] to leverage leading-edge compiler technologies.
  • AutoPilot uses the llvm-gcc front end to obtain an intermediate representation (IR) based on the LLVM instruction set.
  • In particular, the following classes of transformations and analyses have been shown to be very useful for hardware synthesis: SSA-based code optimizations such as constant propagation, dead code elimination, and redundant code elimination based on global value numbering [2] (a small example follows this list).
  • Memory optimizations such as memory reuse, array scalarization, and array partitioning [19] to reduce the number of memory accesses and improve memory bandwidth.
  • In other words, the code can be optimized without considering the source language.
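
The effect of these passes can be illustrated at the C level (the real transformations run on the LLVM IR; these two functions are invented for exposition):

    // Before: the source computes a*b twice and carries a dead value.
    int before(int a, int b) {
        int t1 = a * b;
        int t2 = a * b;      // redundant: same value as t1
        int dead = t1 - t2;  // never contributes to the result
        return t1 + t2;
    }

    // After global value numbering and dead code elimination the IR is
    // equivalent to this: one multiplier instead of two, and no datapath
    // is synthesized for the unused value.
    int after(int a, int b) {
        int t = a * b;
        return t + t;
    }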

A. Platform modeling for Xilinx FPGAs

  • AutoPilot uses detailed target platform information to carry out informed and target-specific synthesis and optimization.
  • The resulting characterization data is then used to make implementation choices during synthesis.
  • Notably, the cost of implementing hardware on FPGAs is often different from that for ASIC technology.
  • On FPGAs, multiplexors typically have the same cost and delay as an adder (approximately one LUT/output).
  • FPGA technology also features heterogeneous on-chip resources, including not only LUTs and flip flops but also other prefabricated architecture blocks such as DSP48s and Block RAMs.

B. Integration with Xilinx toolset

  • In order to raise the level of design abstraction more completely, AutoPilot attempts to hide details of the downstream RTL flow from users as much as possible.
  • Otherwise, a user may be overwhelmed by the details of vendor-specific tools such as the formats of constraint and configuration files, implementation and optimization options, or directory structure requirements.
  • As shown in Figure 1, AutoPilot instantiates these interfaces along with adapter logic and appropriate EDK meta-information, so that a generated module can be quickly connected into an EDK system.

A. Efficient mathematical programming formulations for scheduling

  • Classical approaches to the scheduling problem in high-level synthesis use either conventional heuristics such as list scheduling [1] and force-directed scheduling [73], which often lead to sub-optimal solutions due to the nature of local optimization methods, or exact formulations such as integer-linear programming [45], which can be difficult to scale to large designs.
  • Unlike previous approaches, which use O(m×n) binary variables to encode a scheduling solution with n operations and m steps [45], SDC uses a continuous representation of time with only O(n) variables: for each operation i, a scheduling variable s_i is introduced to represent the time step at which the operation is scheduled (the constraint form is sketched after this list).
  • A linear program with a totally unimodular constraint matrix is guaranteed to have integral solutions.
  • Many commonly encountered constraints in high-level synthesis can be expressed in the form of integer-difference constraints.
  • Other complex constraints can be handled in similar ways, using approximations or other heuristics.
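
A sketch of the integer-difference constraint form, in LaTeX notation consistent with the description above (the exact constraint set handled by SDC is richer):

    \begin{align*}
      s_j - s_i &\ge \ell_i && \text{dependence: $j$ starts after $i$ finishes ($\ell_i$ = latency of $i$)}\\
      s_j - s_i &\le d_{ij} && \text{relative timing: $j$ starts at most $d_{ij}$ cycles after $i$}
    \end{align*}

Every row of the resulting constraint matrix contains one $+1$ and one $-1$, which is what makes the matrix totally unimodular, so solving the linear program directly yields the integral schedule promised above without branch-and-bound.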

B. Soft constraints and applications for platform-based optimization

  • In a typical synthesis tool, design intentions are often expressed as constraints.
  • While some of these constraints are essential for the design to function correctly, many others are not.
  • It is possible that a solution with a slight nominal timing violation can still meet the frequency requirement, considering inaccuracy in interconnect delay estimation and various timing optimization procedures in later design stages, such as logic refactoring, retiming, and interconnect optimization.
  • The approach is based on the SDC formulation discussed in the preceding subsection, but allows some constraints to be violated.
  • Consider the scheduling problem with both hard constraints and soft constraints formulated as follows.

Gs ≤ p (hard constraints)
Hs ≤ q (soft constraints)

  • Here G and H correspond to the matrices representing hard constraints and soft constraints, respectively, and both are totally unimodular, as shown in [15] (a penalty-based relaxation is sketched after this list).
  • Hard constraints and soft constraints are generated based on the functional specification and QoR targets.
  • This approach offers a powerful yet flexible framework to address various considerations in scheduling.
  • Take the DSP48E block in Xilinx Virtex 5 FPGAs for example: each of the DSP48E blocks contains a multiplier and a post-adder, allowing efficient implementations of multiplication and multiply-accumulation.
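
One standard way to write such a relaxation, sketched in LaTeX notation (the precise penalty function used in [15] may differ): nonnegative violation variables are attached to the soft constraints and charged in the objective, so a soft constraint may be broken, but only at a price.

    \begin{align*}
      \min_{s,\,v}\;\; & c^{T} s + \Phi(v) \\
      \text{s.t.}\;\;  & G s \le p && \text{(hard constraints, must hold)}\\
                       & H s - v \le q, \quad v \ge 0 && \text{(soft constraints, violation $v$ penalized)}
    \end{align*}

With a linear (or convex piecewise-linear) penalty $\Phi$, the problem remains a linear program of the same flavor as the hard-constraint-only formulation.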

C. Pattern mining for efficient sharing

  • A typical target architecture for HLS may introduce multiplexers when functional units, storage units or interconnects are shared by multiple operations/variables in a time-multiplexed manner.
  • Multiplexers (especially large ones) can be particularly expensive on FPGA platforms.
  • Thus, careless decisions on resource sharing could introduce more overhead than benefit.
  • The method tries to extract common structures or patterns in the data-flow graph, so that different instances of the same pattern can share resources with little overhead.
  • Pruning techniques are proposed based on characteristic vectors and locality-sensitive hashing.
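
A toy illustration of the characteristic-vector idea (the data structures are invented for exposition; the cited work's vectors and locality-sensitive hashing are more elaborate): two candidate subgraphs can only be instances of the same pattern if cheap summaries of them agree, so mismatching summaries are pruned before any expensive structural comparison.

    #include <cstdint>
    #include <vector>

    enum OpKind { ADD, MUL, MUX, LOAD, NUM_KINDS };

    using Signature = std::vector<uint16_t>;

    // Cheap summary of a candidate subgraph: a histogram of its
    // operation kinds.
    Signature characteristic(const std::vector<OpKind>& ops) {
        Signature sig(NUM_KINDS, 0);
        for (OpKind k : ops) ++sig[k];
        return sig;
    }

    // Necessary (not sufficient) condition for two subgraphs to be
    // instances of one pattern; used to prune isomorphism checks.
    bool may_match(const std::vector<OpKind>& g1,
                   const std::vector<OpKind>& g2) {
        return characteristic(g1) == characteristic(g2);
    }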

D. Memory analysis and optimizations

  • While application-specific computation platforms such as FPGAs typically have considerable computational capability, their performance is often limited by available communication or memory bandwidth.
  • Typical FPGAs, such as the Xilinx Virtex series, have a considerable number of block RAMs.
  • Consider a loop that accesses array A with subscripts i, 2×i+1, and 3×i+1 in the ith iteration (the loop is written out after this list).
  • If the loop is targeted to be pipelined with the initiation interval of one, i.e., a new loop iteration starts every clock cycle, the schedule in (b) will lead to port conflicts, because (i+1) mod 2 = (2×(i+1)+1) mod 2 = (3×i+1) mod 2, when i is even; this will lead to three simultaneous accesses to the first bank.
  • Then, an iterative algorithm is used to perform both scheduling and memory partitioning guided by the conflict graph.
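
Written out, the loop from the example looks like the sketch below (N, the array bound, and the comments are assumptions for illustration). With two cyclic banks (bank = address mod 2) and a pipelined schedule in which iteration i's read of A[3×i+1] overlaps iteration i+1's reads of A[i+1] and A[2×(i+1)+1], all three addresses are odd whenever i is even, so three reads land on one dual-ported bank in the same cycle; this is why scheduling and partitioning must be solved together.

    constexpr int N = 1024;

    void kernel(const int A[3 * N], int y[N]) {
        // Target: initiation interval of 1, i.e., one iteration per cycle.
        for (int i = 0; i < N; ++i)
            y[i] = A[i] + A[2 * i + 1] + A[3 * i + 1];
    }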

VII. ADVANCES IN SIMULATION AND VERIFICATION

  • Besides the many advantages of automated synthesis, such as quick design space exploration and automatic complex architectural changes like pipelining, resource sharing and scheduling, HLS also enables a more efficient debugging and verification flow at the higher abstraction levels.
  • Since HLS provides an automatic path to implementable RTL from behavioral/functional models, designers do not have to wait for manual RTL models to become available before conducting verification.
  • Instead, they can develop, debug and functionally verify a design at an earlier stage with high-level programming languages and tools.
  • This can significantly reduce the verification effort due to the following reasons: (i) It is easier to trace, identify and fix bugs at higher abstraction levels with more compact and readable design descriptions.
  • (ii) Simulation at the higher level is typically orders of magnitude faster than RTL simulation, allowing more comprehensive tests and greater coverage.

A. Automatic co-simulation

  • At present, simulation is still the prevalent technique to check whether the resulting RTL complies with the high-level specification.
  • To reduce effort spent on RTL simulation, the latest HLS technologies have made important improvements on automatic co-simulation [86].
  • A C-to-RTL transactor is created to connect high-level interfacing constructs (such as parameters and global variables) with pin-level signals in RTL, so that the original high-level test bench can drive the generated RTL (a usage sketch follows this list).
  • This wrapper also includes additional control logic to manage the communication between the testing module and the RTL design under test (DUT).
  • A pipelined design may require that the test bench feed input data into the DUT at a fixed rate.
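
A sketch of how the generated wrapper lets a single test bench serve both abstraction levels; both function names are hypothetical, and in a real flow the RTL side would be reached through the auto-generated SystemC transactor and an RTL simulator rather than a plain function call.

    #include <cassert>

    extern int filter_c(int sample);          // original C model
    extern int filter_rtl_cosim(int sample);  // RTL DUT behind the wrapper

    int main() {
        for (int i = 0; i < 1000; ++i) {
            int stimulus = (i * 37) % 256;    // arbitrary test input
            // The same stimuli and checks validate C model and RTL alike.
            assert(filter_c(stimulus) == filter_rtl_cosim(stimulus));
        }
        return 0;
    }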

VIII. DOMAIN-SPECIFIC SYSTEM-LEVEL IMPLEMENTATION PLATFORMS

  • The time-to-market of an FPGA system design depends on many factors, such as the availability of reference designs, development boards, and, ultimately, the FPGA devices themselves.
  • This integration often includes a wide variety of system-level design concerns, including embedded software, system integration, and verification [104] .
  • As a result, these cores are not easily amenable to high-level synthesis and form part of the system infrastructure of a design.
  • The processor subsystem (PSS) is responsible for executing the relatively low-performance processing in the system.
  • The portion of a design generated using HLS represents the bulk of the FPGA design and communicates with the system infrastructure through standardized wire-level interfaces, such as AXI4 memory-mapped and streaming interfaces [96] shown in Figure 7 .

A. High-level design of cognitive radios project

  • Cognitive radio systems typically contain computationally intensive, high-data-rate radio processing, along with complex but relatively low-rate processing to control the radio processing.
  • Efficient interaction with the processor is an important part of the overall system complexity.
  • The processor subsystem contains standard hardware modules and is capable of running a standard embedded operating system, such as Linux.
  • The accelerator subsystem is used for implementing components with high computational requirements in hardware.
  • Components also expose a configuration interface with multiple parameters, allowing them to be reconfigured in an executing system by user-defined control code executing in the processor subsystem.

B. Video Starter Kit

  • Video processing systems implemented in FPGA include a wide variety of applications from embedded computer-vision and picture quality improvement to image and video compression.
  • Typically these systems include two significant pieces of complexity.
  • This platform is derived from the Xilinx EDK-based reference designs provided with the Xilinx Spartan-3A DSP Video Starter Kit and has been ported to several Xilinx Virtex 5 and Spartan 6 based development boards, targeting high-definition (HD) video processing with pixel clocks up to 150 MHz.
  • The incoming video data is analyzed by the Frame Decoder block to determine the frame size of the incoming video, which is passed to the application block, enabling different video formats to be processed.
  • The interface to external memory used for frame buffers is implemented using the Xilinx Multi-ported Memory Controller (MPMC) [118] which provides access to external memory to the Application Block and to the Microblaze control processor, if necessary.

A. Summary of BDTI HLS Certification

  • Xilinx has worked with BDTI Inc. [99] to implement an HLS Tool Certification Program [100] .
  • This program was designed to compare the results of an HLS tool targeting the Xilinx Spartan 3 FPGA found in the Video Starter Kit against the results of a conventional DSP processor and of a good manual RTL implementation.
  • Two applications were used in this Certification Program: an optical flow algorithm, which is characteristic of demanding image processing applications, and a wireless application for which a very representative RTL implementation was available.
  • The DSP processor implementation rated "fair", while the AutoPilot implementation rated "good", indicating that less source code modification was necessary to achieve high performance when using AutoPilot.
  • BDTI also assessed overall ease of use of the DSP tool flow and the FPGA tool flow, combining HLS with the low-level implementation tools.

B. Sphere Decoder

  • Xilinx has implemented a sphere decoder for a multi-input multi-output (MIMO) wireless communication system using AutoPilot [67] [85] .
  • The application exhibits a large amount of parallelism, since the operations must be executed on each of 360 independent subcarriers which form the overall communication channel and the processing for each channel can generally be pipelined.
  • The resulting HLS code for the application makes heavy use of C++ templates to describe arbitrary-precision integer data types and parameterized code blocks used to process different matrix sizes at different points in the application (a sketch in this style follows the list).
  • Both designs were implemented as standalone cores using ISE 12.1, targeting Xilinx Virtex 5 speed grade 2 at 225 MHz.
  • Using AutoPilot Version 2010.07.ft, the authors were able, by refactoring and optimizing the algorithmic C model, to generate a design that was smaller than the reference implementation, in less time than the hand-coded RTL implementation required.
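
A sketch in the template style described above; the kernel is a generic placeholder, not code from the cited design, and ap_int follows the arbitrary-precision integer spelling used by AutoPilot's descendants.

    #include <ap_int.h>  // arbitrary-precision integer types

    // Matrix order N and word width W are template parameters, so one
    // body serves the different matrix sizes used across the design.
    template <int N, int W>
    void matvec(const ap_int<W> A[N][N], const ap_int<W> x[N],
                ap_int<2 * W + 4> y[N]) {
        for (int i = 0; i < N; ++i) {
            ap_int<2 * W + 4> acc = 0;  // wide accumulator, no overflow
            for (int j = 0; j < N; ++j)
                acc += A[i][j] * x[j];
            y[i] = acc;
        }
    }

    // Instantiated as, e.g., matvec<2, 16>, matvec<3, 16>, matvec<4, 16>
    // for 2x2, 3x3 and 4x4 stages.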

Top-Level Block Diagram

  • Design time for the RTL design was estimated from work logs by the original authors of [28] , and includes only the time for an algorithm expert and experienced tool user to enter and verify the RTL architecture in System Generator.
  • Given the significant time the authors spent familiarizing themselves with the application and the structure of the code, they believe that an application expert already familiar with the code would be able to create such a design at least twice as fast.
  • To meet the required throughput, one row of the systolic array is instantiated, consisting of one diagonal cell and 8 off-diagonal cells, and the remaining rows are time multiplexed over the single row.
  • In the 4x4 case, the off-diagonal cell implements fine-grained resource sharing, with one resource-shared complex multiplier.
  • The authors do observe that AutoPilot uses additional BRAM to implement this block relative to the RTL implementation, because AutoPilot requires tool-implemented double-buffers to only be read or written in a single loop.

X. CONCLUSIONS AND CHALLENGES AHEAD

  • It seems clear that the latest generation of FPGA HLS tools has made significant progress in providing wide language coverage, robust compilation technology, platform-based modeling, and domain-specific system-level integration.
  • As a result, they can quickly provide highly competitive quality of results, in many cases comparable or better than manual RTL designs.
  • For the FPGA design community, it appears that HLS technology may be transitioning from research and investigation to selected deployment.
  • The authors also see many opportunities for HLS tools to further improve.


High-Level Synthesis for FPGAs: From Prototyping to Deployment
Jason Cong¹,², Fellow, IEEE, Bin Liu¹,², Stephen Neuendorffer³, Member, IEEE, Juanjo Noguera³, Kees Vissers³, Member, IEEE, and Zhiru Zhang¹, Member, IEEE
¹AutoESL Design Technologies, Inc.   ²University of California, Los Angeles   ³Xilinx, Inc.
Abstract—Escalating System-on-Chip design complexity is
pushing the design community to raise the level of abstraction
beyond RTL. Despite the unsuccessful adoptions of early
generations of commercial high-level synthesis (HLS) systems, we
believe that the tipping point for transitioning to HLS
methodology is happening now, especially for FPGA designs. The
latest generation of HLS tools has made significant progress in
providing wide language coverage and robust compilation
technology, platform-based modeling, advancement in core HLS
algorithms, and a domain-specific approach. In this paper we use
AutoESL’s AutoPilot HLS tool coupled with domain-specific
system-level implementation platforms developed by Xilinx as an
example to demonstrate the effectiveness of state-of-art C-to-
FPGA synthesis solutions targeting multiple application domains.
Complex industrial designs targeting Xilinx FPGAs are also
presented as case studies, including comparison of HLS solutions
versus optimized manual designs.
Index Terms—Domain-specific design, field-programmable gate array (FPGA), high-level synthesis (HLS), quality of results (QoR).
I. INTRODUCTION
THE RAPID INCREASE of complexity in System-on-a-Chip (SoC) design has encouraged the design community to seek design abstractions with better productivity than RTL.
Electronic system-level (ESL) design automation has been
widely identified as the next productivity boost for the
semiconductor industry, where HLS plays a central role,
enabling the automatic synthesis of high-level, untimed or
partially timed specifications (such as in C or SystemC) to a
low-level cycle-accurate register-transfer level (RTL)
specification for efficient implementation in ASICs or
FPGAs. This synthesis can be optimized taking into account
the performance, power, and cost requirements of a particular
system.
Despite the past failure of the early generations of
commercial HLS systems (started in the 1990s), we see a
rapidly growing demand for innovative, high-quality HLS
solutions for the following reasons:
Embedded processors are in almost every SoC: With
the coexistence of micro-processors, DSPs, memories
and custom logic on a single chip, more software
elements are involved in the process of designing a
modern embedded system. An automated HLS flow
allows designers to specify design functionality in high-
level programming languages such as C/C++ for both
embedded software and customized hardware logic on
the SoC. This way, they can quickly experiment with
different hardware/software boundaries and explore
various area/power/performance tradeoffs from a single
common functional specification.
Huge silicon capacity requires a higher level of
abstraction: Design abstraction is one of the most
effective methods for controlling complexity and
improving design productivity. For example, the study
from NEC [90] shows that a 1M-gate design typically
requires about 300K lines of RTL code, which cannot be
easily handled by a human designer. However, the code
density can be easily reduced by 7X to 10X when moved
to high-level specification in C, C++, or SystemC. In this
case, the same 1M-gate design can be described in 30K
to 40K lines of behavioral description, resulting
in a much reduced design complexity.
Behavioral IP reuse improves design productivity: In
addition to the line-count reduction in design
specifications, behavioral synthesis has the added value
of allowing efficient reuse of behavioral IPs. As opposed
to RTL IP which has fixed microarchitecture and
interface protocols, behavioral IP can be retargeted to
different implementation technologies or system
requirements.
Verification drives the acceptance of high-level
specification: Transaction-level modeling (TLM) with
SystemC [107] or similar C/C++ based extensions has
become a very popular approach to system-level
verification [35]. Designers commonly use SystemC
TLMs to describe virtual software/hardware platforms,
which serve three important purposes: early embedded
software development, architectural modeling and
exploration, and functional verification. The wide
availability of SystemC functional models directly drives
the need for SystemC-based HLS solutions, which can
automatically generate RTL code through a series of
formal constructive transformations. This avoids slow
and error-prone manual RTL re-coding, which is the
standard practice in the industry today.
Trend towards extensive use of accelerators and
heterogeneous SoCs: Many SoCs, or even CMPs (chip
multi-processors) move towards inclusion of many
accelerators (or algorithmic blocks), which are built with
custom architectures, largely to reduce power compared
to using multiple programmable processors. According
to ITRS prediction [109], the number of on-chip
accelerators will reach 3000 by 2024. In FPGAs, custom
architecture for algorithmic blocks provides higher
performance in a given amount of FPGA resources than
synthesized soft processors. These algorithmic blocks
are particularly appropriate for HLS.
Although these reasons for adopting HLS design
methodology are common to both ASIC and FPGA designers,
we also see additional forces that push the FPGA designers for
faster adoption of HLS tools.
Less pressure for formal verification: The ASIC
manufacturing cost in nanometer IC technologies is well
over $1M [109]. There is tremendous pressure for the
ASIC designers to achieve first tape-out success. Yet
formal verification tools for HLS are not mature, and
simulation coverage can be limited for multi-million gate
SOC designs. This is a significant barrier for HLS
adoption in the ASIC world. However, for FPGA
designs, in-system simulation is possible with much
wider simulation coverage. Design iterations can be
done quickly and inexpensively without huge
manufacturing costs.
Ideal for platform-based synthesis: Modern FPGAs
embed many pre-defined/fabricated IP components, such
as arithmetic function units, embedded memories,
embedded processors, and embedded system buses.
These pre-defined building blocks can be modeled
precisely ahead of time for each FPGA platform and, to
a large extent, confine the design space. As a result, it is
possible for modern HLS tools to apply a platform-based
design methodology [51] and achieve higher quality of
results (QoR).
More pressure for time-to-market: FPGA platforms
are often selected for systems where time-to-market is
critical, in order to avoid long chip design and
manufacturing cycles. Hence, designers may accept
increased performance, power, or cost in order to reduce
design time. As shown in Section IX, modern HLS tools
put this tradeoff in the hands of a designer allowing
significant reduction in design time or, with additional
effort, quality of result comparable to hand-written RTL.
Accelerated or reconfigurable computing calls for
C/C++ based compilation/synthesis to FPGAs: Recent
advances in FPGAs have made reconfigurable
computing platforms feasible to accelerate many high-
performance computing (HPC) applications, such as
image and video processing, financial analytics,
bioinformatics, and scientific computing applications.
Since RTL programming in VHDL or Verilog is
unacceptable to most application software developers, it
is essential to provide a highly automated
compilation/synthesis flow from C/C++ to FPGAs.
As a result, a growing number of FPGA designs are
produced using HLS tools. Some example application
domains include 3G/4G wireless systems [38][81], aerospace
applications [75], image processing [27], lithography
simulation [13], and cosmology data analysis [52]. Xilinx is
also in the process of incorporating HLS solutions in their
Video Development Kit [116] and DSP Development Kit [97] for
all Xilinx customers.
This paper discusses the reasons behind the recent success
in deploying HLS solutions to the FPGA community. In
Section II we review the evolution of HLS systems and
summarize the key lessons learned. In Sections IV-VIII, using
a state-of-art HLS tool as an example, we discuss some key
reasons for the wider adoption of HLS solutions in the FPGA
design community, including wide language coverage and
robust compilation technology, platform-based modeling,
advancement in core HLS algorithms, improvements on
simulation and verification flow, and the availability of
domain-specific design templates. Then, in Section IX, we
present the HLS results on several real-life industrial designs
and compare with manual RTL implementations. Finally, in
Section X, we conclude the paper with discussions of future
challenges and opportunities.
II. EVOLUTION OF HIGH-LEVEL SYNTHESIS FOR FPGA
In this section we briefly review the evolution of high-level
synthesis by looking at representative tools. Compilers for
high-level languages have been successful in practice since the
1950s. The idea of automatically generating circuit
implementations from high-level behavioral specifications
arises naturally with the increasing design complexity of
integrated circuits. Early efforts (in the 1980s and early 1990s)
on high-level synthesis were mostly research projects, where
multiple prototype tools were developed to call attention to the
methodology and to experiment with various algorithms. Most
of those tools, however, made rather simplistic assumptions
about the target platform and were not widely used. Early
commercialization efforts in the 1990s and early 2000s
attracted considerable interest among designers, but also failed
to gain wide adoption, due in part to usability issues and poor
quality of results. More recent efforts in high-level synthesis
have improved usability by increasing input language
coverage and platform integration, as well as improving
quality of results.
A. Early Efforts
Since the history of HLS is considerably longer than that of
FPGAs, most early HLS tools targeted ASIC designs. A
pioneering high-level synthesis tool, CMU-DA, was built by
researchers at Carnegie Mellon University in the 1970s
[29][71]. In this tool the design is specified at behavior level
using the ISPS (Instruction Set Processor Specification)
language [4]. It is then translated into an intermediate data-
flow representation called the Value Trace [79] before
producing RTL. Many common code-transformation
techniques in software compilers, including dead-code
elimination, constant propagation, redundant sub-expression
elimination, code motion, and common sub-expression
extraction could be performed. The synthesis engine also
included many steps familiar in hardware synthesis, such as
datapath allocation, module selection, and controller
generation. CMU-DA also supported hierarchical design and
included a simulator of the original ISPS language. Although
many of the methods used were very preliminary, the

>
FOR CONFERENCE-RELATED PAPERS, REPLACE THIS LINE WITH YOUR SESSION NUMBER, E.G., AB-02 (DOUBLE-CLICK HERE)
<
3
innovative flow and the design of toolsets in CMU-DA
quickly generated considerable research interest.
In the subsequent years in the 1980s and early 1990s, a
number of similar high-level synthesis tools were built, mostly
for research. Examples of academic efforts include the ADAM
system developed at the University of Southern California
[37][46], HAL developed at Bell-Northern Research [72],
MIMOLA developed at University of Kiel, Germany [62], the
Hercules/Hebe high-level synthesis system (part of the
Olympus system) developed at Stanford University [24][25]
[55], the Hyper/Hyper-LP system developed at University of
California, Berkeley [10][77]. Industry efforts include
Cathedral/Cathedral-II and their successors developed at
IMEC [26], the IBM Yorktown Silicon Compiler [11] and the
GM BSSC system [92], among many others. Like CMU-DA,
these tools typically decompose the synthesis task into a few
steps, including code transformation, module selection,
operation scheduling, datapath allocation, and controller
generation. Many fundamental algorithms addressing these
individual problems were also developed. For example, the list
scheduling algorithm and its variants are widely used to solve
scheduling problems with resource constraints [70]; the force-
directed scheduling algorithm developed in HAL [73] is able
to optimize resource requirements under a performance
constraint; the path-based scheduling algorithm in the
Yorktown Silicon Compiler is useful to optimize performance
with conditional branches [12]. The Sehwa tool in ADAM is
able to generate pipelined implementations and explore the
design space by generating multiple solutions [69]. The
relative scheduling technique developed in Hebe is an elegant
way to handle operations with unbounded delay [56]. Conflict-
graph coloring techniques were developed and used in several
systems to share resources in the datapath [57][72].
These early high-level tools often used custom languages
for design specification. Besides the ISPS language used in
CMU-DA, a few other languages were notable. HardwareC is
a language designed for use in the Hercules system [54].
Based on the popular C programming language, it supports
both procedural and declarative semantics and has built-in
mechanisms to support design constraints and interface
specifications. This is one of the earliest C-based hardware
synthesis languages for high-level synthesis and is interesting
to compare with similar languages later. The Silage language
used in Cathedral/Cathedral-II was specifically designed for
the synthesis of digital signal processing hardware [26]. It has
built-in support for customized data types, and allows easy
transformations [77][10]. The Silage language, along with the
Cathedral-II tool, represented an early domain-specific
approach in high-level synthesis.
These early research projects helped to create a basis for
algorithmic synthesis with many innovations, and some were
even used to produce real chips. However, these efforts did
not lead to wide adoption among designers. A major reason is
that the methodology of using RTL synthesis was not yet
widely accepted at that time and RTL synthesis tools were not
yet mature. Thus, high-level synthesis, built on top of RTL
synthesis, did not have a sound foundation in practice. In
addition, simplistic assumptions were often made in these
early systems—many of them were “technology independent”
(such as Olympus), and inevitably led to suboptimal results.
With improvements in RTL synthesis tools and the wide
adoption of RTL-based design flows in the 1990s, industrial
deployment of high-level synthesis tools became more
practical. Proprietary tools were built in major semiconductor
design houses including IBM [5], Motorola [58], Philips [61],
and Siemens [6]. Major EDA vendors also began to provide
commercial high-level synthesis tools. In 1995, Synopsys
announced Behavioral Compiler [88], which generates RTL
implementations from behavioral HDL code and connects to
downstream tools. Similar tools include Monet from Mentor
Graphics [33] and Visual Architect from Cadence [43]. These
tools received wide attention, but failed to widely replace RTL
design. One reason is due to the use of behavioral HDLs as the
input language, which is not popular among algorithm and
system designers.
B. Recent efforts
Since 2000, a new generation of high-level synthesis tools
has been developed in both academia and industry. Unlike
many predecessors, most of these tools focus on using C/C++
or C-like languages to capture design intent. This makes the
tools much more accessible to algorithm and system designers
compared to previous tools that only accept HDL languages. It
also enables hardware and software to be built using a
common model, facilitating software/hardware co-design and
co-verification. The use of C-based languages also makes it
easy to leverage the newest technologies in software compilers
for parallelization and optimization in the synthesis tools.
In fact, there has been an ongoing debate on whether C-
based languages are proper choices for HLS [31][78]. Despite
the many advantages of using C-based languages, opponents
often criticize C/C++ as languages only suitable for describing
sequential software that runs on microprocessors. Specifically,
the deficiencies of C/C++ include the following:
(i) Standard C/C++ lack built-in constructs to explicitly
specify bit accuracy, timing, concurrency, synchronization,
hierarchy, etc., which are critical to hardware design.
(ii) C and C++ have complex language constructs, such as
pointers, dynamic memory management, recursion,
polymorphism, etc., which do not have efficient hardware
counterparts and lead to difficulty in synthesis.
To address these deficiencies, modern C-based HLS tools
have introduced additional language extensions and
restrictions to make C inputs more amenable to hardware
synthesis. Common approaches include both restriction to a
synthesizable subset that discourages or disallows the use of
dynamic constructs (as required by most tools) and
introduction of hardware-oriented language extensions
(HardwareC [54], SpecC [34], Handel-C [95]), libraries
(SystemC [107]), and compiler directives to specify
concurrency, timing, and other constraints. For example,
Handel-C allows the user to specify clock boundaries
explicitly in the source code. Clock edges and events can also
be explicitly specified in SpecC and SystemC. Pragmas and

>
FOR CONFERENCE-RELATED PAPERS, REPLACE THIS LINE WITH YOUR SESSION NUMBER, E.G., AB-02 (DOUBLE-CLICK HERE)
<
4
directives along with a subset of ANSI C/C++ are used in
many commercial tools. An advantage of this approach is that
the input program can be compiled using standard C/C++
compilers without change, so that such a program or a module
of it can be easily moved between software and hardware and
co-simulation of hardware and software can be performed
without code rewriting. At present, most commercial HLS
tools use some form of C-based design entry, although tools
using other input languages (e.g., BlueSpec [102], Esterel [30],
Matlab [42], etc.) also exist.
Another notable difference between the new generation of
high-level synthesis tools and their predecessors is that many
tools are built targeting implementation on FPGA. FPGAs
have continually improved in capacity and speed in recent
years, and their programmability makes them an attractive
platform for many applications in signal processing,
communication, and high-performance computing. There has
been a strong desire to make FPGA programming easier, and
many high-level synthesis tools are designed to specifically
target FPGAs, including ASC [64], CASH [9], C2H from
Altera [98], DIME-C from Nallatech [112], GAUT [22],
Handel-C compiler (now part of Mentor Graphics DK Design
Suite) [95], Impulse C [74], ROCCC [87][39], SPARK
[41][40], Streams-C compiler [36], and Trident [82][83].
ASIC tools also commonly provide support for targeting an
FPGA tool flow in order to enable system emulation.
Among these high-level synthesis tools, many are designed
to focus on a specific application domain. For example, the
Trident compiler, developed at Los Alamos National Lab, is
an open-source tool focusing on the implementation of
floating-point scientific computing applications on FPGA.
Many tools, including GAUT, Streams-C, ROCCC, ASC, and
Impulse C, target streaming DSP applications. Following the
tradition of Cathedral, these tools implement architectures
consisting of a number of modules connected using FIFO
channels. Such architectures can be integrated either as a
standalone DSP pipeline, or integrated to accelerate code
running on a processor (as in ROCCC).
As of 2010, major commercial C-based high-level synthesis
tools include AutoESL’s AutoPilot [94] (originated from
UCLA xPilot project [17]), Cadence’s C-to-Silicon Compiler
[3][103], Forte’s Cynthesizer [65], Mentor’s Catapult C [7],
NEC’s Cyber Workbench [89][91], and Synopsys Synphony C
[115] (formerly Synfora’s PICO Express, originated from a
long range research effort in HP Labs [49]).
C. Lessons Learned
Despite extensive development efforts, most commercial
HLS efforts have failed. We believe that past failures are due
to one or several of the following reasons:
Lack of comprehensive design language support: The
first generation of the HLS synthesis tools could not
synthesize high-level programming languages. Instead,
untimed or partially timed behavioral HDL was used.
Such design entry marginally raised the abstraction
level, while imposing a steep learning curve on both
software and hardware developers.
Although early C-based HLS technologies have
considerably improved the ease of use and the level of
design abstraction, many C-based tools still have glaring
deficiencies. For instance, C and C++ lack the necessary
constructs and semantics to represent hardware attributes
such as design hierarchy, timing, synchronization, and
explicit concurrency. SystemC, on the other hand, is
ideal for system-level specification with
software/hardware co-design. However, it is foreign to
algorithmic designers and has slow simulation speed
compared to pure ANSI C/C++ descriptions.
Unfortunately, most early HLS solutions commit to only
one of these input languages, restricting their usage to
niche application domains.
Lack of reusable and portable design specification:
Many HLS tools have required users to embed detailed
timing and interface information as well as the synthesis
constraints into the source code. As a result, the
functional specification became highly tool-dependent,
target-dependent, and/or implementation-platform
dependent. Therefore, it could not be easily ported to
alternative implementation targets.
Narrow focus on datapath synthesis: Many HLS tools
focus primarily on datapath synthesis, while leaving
other important aspects unattended, such as interfaces to
other hardware/software modules and platform
integration. Solving the system integration problem then
becomes a critical design bottleneck, limiting the value
in moving to a higher-level design abstraction for IP in a
design.
Lack of satisfactory quality of results (QoR): When
early generations of HLS tools were introduced in the
mid-1990s to early 2000s, the EDA industry was still
struggling with timing closure between logic and
physical designs. There was no dependable RTL to
GDSII foundation to support HLS, which made it
difficult to consistently measure, track, and enhance
HLS results. Highly automated RTL to GDSII solutions
only became available in late 2000s (e.g., provided by
the IC Compiler from Synopsys [114] or the
BlastFusion/Talus from Magma [111]). Moreover, many
HLS tools are weak in optimizing real-life design
metrics. For example, the commonly used algorithms
mainly focus on reducing functional unit count and
latency, which do not necessarily correlate to actual
silicon area, power, and performance. As a result, the
final implementation often fails to meet timing/power
requirements. Another major factor limiting quality of
result was the limited capability of HLS tools to exploit
performance-optimized and power-efficient IP blocks on
a specific platform, such as the versatile DSP blocks and
on-chip memories on modern FPGA platforms. Without
the ability to match the QoR achievable with an RTL
design flow, most designers were unwilling to explore
potential gains in design productivity.
Lack of a compelling reason/event to adopt a new
design methodology: The first-generation HLS tools

>
FOR CONFERENCE-RELATED PAPERS, REPLACE THIS LINE WITH YOUR SESSION NUMBER, E.G., AB-02 (DOUBLE-CLICK HERE)
<
5
were clearly ahead of their time, as the design
complexity was still manageable at the register transfer
level in late 1990s. Even as the second-generation of
HLS tools showed interesting capabilities to raise the
level of design abstraction, most designers were
reluctant to take the risk of moving away from the
familiar RTL design methodology to embrace a new
unproven one, despite its potential large benefits. Like
any major transition in the EDA industry, designers
needed a compelling reason or event to push them over
the “tipping point,” i.e., to adopt the HLS design
methodology.
Another important lesson learned is that tradeoffs must be
made in the design of the tool. Although a designer might
wish for a tool that takes any input program and generates the
“best” hardware architecture, this goal is not generally
practical for HLS to achieve. Whereas compilers for
processors tend to focus on local optimizations with the sole
goal of increasing performance, HLS tools must automatically
balance performance and implementation cost using global
optimizations. However, it is critical that these optimizations
be carefully implemented using scalable and predictable
algorithms, keeping tool runtimes acceptable for large
programs and the results understandable by designers.
Moreover, in the inevitable case that the automatic
optimizations are insufficient, there must be a clear path for a
designer to identify further optimization opportunities and
execute them by rewriting the original source code.
Hence, it is important to focus on several design goals for a
high-level synthesis tool:
1. Capture designs at a bit-accurate, algorithmic level in
C code. The code should be readable by algorithm
specialists.
2. Effectively generate efficient parallel architectures
with minimal modification of the C code, for
parallelizable algorithms.
3. Allow an optimization-oriented design process, where
a designer can improve the performance of the
resulting implementation by successive code
modification and refactoring.
4. Generate implementations that are competitive with
synthesizable RTL designs after automatic and manual
optimization.
We believe that the tipping point for transitioning to HLS
methodology is happening now, given the reasons discussed in
Section I and the conclusions by others [14][84]. Moreover,
we are pleased to see that the latest generation of HLS tools
has made significant progress in providing wide language
coverage and robust compilation technology, platform-based
modeling, and advanced core HLS algorithms. We shall
discuss these advancements in more detail in the next few
sections.
III. CASE STUDY OF STATE-OF-ART OF HIGH-LEVEL SYNTHESIS FOR FPGAS
AutoPilot is one of the most recent HLS tools, and is
representative of the capabilities of the state-of-art commercial
HLS tools available today. Figure 1 shows the AutoESL
AutoPilot development flow targeting Xilinx FPGAs.
AutoPilot accepts synthesizable ANSI C, C++, and OSCI
SystemC (based on the synthesizable subset of the IEEE-1666
standard [113]) as input and performs advanced platform-
based code transformations and synthesis optimizations to
generate optimized synthesizable RTL.
AutoPilot outputs RTL in Verilog, VHDL or cycle-accurate
SystemC for simulation and verification. To enable automatic
co-simulation, AutoPilot creates test bench wrappers and
transactors in SystemC so that designers can leverage the
original test framework in C/C++/SystemC to verify the
correctness of the RTL output. These SystemC wrappers
connect high-level interfacing objects in the behavioral test
bench with pin-level signals in RTL. AutoPilot also generates
appropriate simulation scripts for use with 3rd-party RTL
simulators. Thus designers can easily use their existing
simulation environment to verify the generated RTL.
[Figure 1 block diagram: the high-level specification (C/C++/SystemC design and test bench) feeds AutoPilot synthesis, simulation, and module generation; the outputs (RTL in SystemC/VHDL/Verilog, design wrapper, synthesis directives, simulation scripts, implementation scripts) drive an RTL simulator and the Xilinx ISE/EDK/CoreGen flow, drawing on FPGA platform libraries, to produce a bitstream.]
Figure 1. AutoESL and Xilinx C-to-FPGA design flow.
In addition to generating RTL, AutoPilot also creates
synthesis reports that estimate FPGA resource utilization, as
well as the timing, latency and throughput of the synthesized
design. The reports include a breakdown of performance and
area metrics by individual modules, functions and loops in the
source code. This allows users to quickly identify specific
areas for QoR improvement and then adjust synthesis
directives or refine the source design accordingly.
Finally, the generated HDL files and design constraints feed
into the Xilinx RTL tools for implementation. The Xilinx ISE
tool chain (such as CoreGen, XST, PAR, etc.) and Embedded
Development Kit (EDK) are used to transform that RTL
implementation into a complete FPGA implementation in the
form of a bitstream for programming the target FPGA
platform.

Citations
Proceedings ArticleDOI
28 Sep 2016
TL;DR: This article describes the design methodology, tools, implementation, and first results of a VHDL backend created for the RPython compiler, an RPython-based High-Level Synthesis (HLS) compiler.
Abstract: The development of FPGA technology and the increasing complexity of applications in recent decades have forced compilers to move to higher abstraction levels. Compilers interprets an algorithmic description of a desired behavior written in High-Level Languages (HLLs) and translate it to Hardware Description Languages (HDLs). This paper presents a RPython based High-Level synthesis (HLS) compiler. The compiler get the configuration parameters and map RPython program to VHDL. Then, VHDL code can be used to program FPGA chips. In comparison of other technologies usage, FPGAs have the potential to achieve far greater performance than software as a result of omitting the fetch-decode-execute operations of General Purpose Processors (GPUs), and introduce more parallel computation. This can be exploited by utilizing many resources at the same time. Creating parallel algorithms computed with FPGAs in pure HDL is difficult and time consuming. Implementation time can be greatly reduced with High-Level Synthesis compiler. This article describes design methodologies and tools, implementation and first results of created VHDL backend for RPython compiler.

1 citation


Cites methods from "High-Level Synthesis for FPGAs: Fro..."

  • ...The HLS steps of the compilation process is presented in Figure 2 and can be summarized as follows[2-13]: The code of each source functions is converted to a Control Flow Graph(CFG) by the flow graph builder....


Proceedings ArticleDOI
11 Sep 2015
TL;DR: This paper presents a Python-to-VHDL compiler, which interprets an algorithmic description of a desired behavior written in Python and translates it to VHDL, and shows how implementation time can be reduced with a High-Level Synthesis compiler.
Abstract: This paper presents a python to VHDL compiler. The compiler interprets an algorithmic description of a desiredbehavior written in Python and translate it to VHDL. FPGA combines many benefits of both software and ASICimplementations. Like software, the programmed circuit is flexible, and can be reconfigured over the lifetime of the system. FPGAs have the potential to achieve far greater performance than software as a result of bypassing the fetch-decode-execute operations of traditional processors. and possibly exploiting a greater level of parallelism. This can be achived by using many computational resources at the same time. Creating parallel programs implemented in FPGAs in pure HDL is difficult and time consuming. Using higher level of abstraction and High-Level Synthesis compilerimplementation time can be reduced. The compiler has been implemented using the Python language. This articledescribes design, implementation and results of created tools. Keywords: FPGA, Algorithmic Synthesis, High-Level Synthesis, Behavioral Synthesis, Hot Plasma Physics Experiment, Python, Compiler.

1 citation


Cites methods from "High-Level Synthesis for FPGAs: Fro..."

  • ...The created compiler have built-in methods [1-12] to facilitate fast implementation of parallel programs into FPGAs: • Algorithm description is the process of capturing specifications as program-like description and making these available for next synthesis subtasks....


Book ChapterDOI
Cesar Torres-Huitzil
01 Jan 2016
TL;DR: This chapter presents a review of hardware implementations of feature detectors using FPGAs targeted to embedded computing scenarios, and addresses a broad range of techniques, methods, systems and solutions related to algorithm-to-hardware mapping of image interest point detectors.
Abstract: Fast and accurate image feature detectors are an important challenge in computer vision, as they are the basis for high-level image processing, analysis, and understanding. However, image feature detectors cannot be easily applied in real-time embedded computing scenarios, such as autonomous robots and vehicles, mainly because they are time consuming and require considerable computational resources. For embedded and low-power devices, speed and memory efficiency are of main concern, and therefore there have been several recent attempts to close this performance gap through dedicated hardware implementations of feature detectors. Thanks to their fine-grained massive parallelism and the flexibility of software-like methodologies, reconfigurable hardware devices such as Field Programmable Gate Arrays (FPGAs) have become a common choice for speeding up computations. In this chapter, a review of hardware implementations of feature detectors using FPGAs targeted at embedded computing scenarios is presented. The necessary background and fundamentals to introduce feature detectors and their mapping to FPGA-based hardware implementations are presented. We then provide an analysis of some relevant state-of-the-art hardware implementations, which represent current research solutions proposed in this field. The review addresses a broad range of techniques, methods, systems, and solutions related to the algorithm-to-hardware mapping of image interest point detectors. Our goal is not only to analyze, compare, and consolidate past research work but also to appreciate their findings and discuss their applicability. Some possible directions for future research are presented.
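To give a flavor of the per-pixel windowed computation such detectors map to hardware, here is a minimal Harris-like corner response in Python (an illustration under our own assumptions, not code from the chapter); on an FPGA the 3x3 window would typically be fed by line buffers so that one pixel is processed per clock cycle.

```python
# Minimal sliding-window sketch: a Harris-like corner response computed
# over 3x3 neighborhoods of image gradients.

def corner_response(img, k=0.04):
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(2, h - 2):
        for x in range(2, w - 2):
            sxx = syy = sxy = 0.0
            for dy in (-1, 0, 1):          # 3x3 window: in hardware,
                for dx in (-1, 0, 1):      # fed by line buffers
                    gx = img[y+dy][x+dx+1] - img[y+dy][x+dx-1]
                    gy = img[y+dy+1][x+dx] - img[y+dy-1][x+dx]
                    sxx += gx * gx
                    syy += gy * gy
                    sxy += gx * gy
            det = sxx * syy - sxy * sxy
            trace = sxx + syy
            out[y][x] = det - k * trace * trace   # Harris corner score
    return out

img = [[(x * y) % 7 for x in range(8)] for y in range(8)]
print(corner_response(img)[3][3])
```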

1 citation


Cites background from "High-Level Synthesis for FPGAs: Fro..."

  • ...the semiconductor industry, where High-Level Synthesis (HLS) plays a central role, enabling the automatic synthesis of high-level untimed specifications to low-level cycle-accurate RTL specifications for efficient implementation in FPGAs [38]....

    [...]

Journal ArticleDOI
TL;DR: TARO as discussed by the authors is a framework that automatically applies the free-running optimization on HLS-based streaming applications without degrading the clock frequency or the performance of the original design.
Abstract: Streaming applications have become one of the key application domains for high-level synthesis (HLS) tools. For a streaming application, there is a potential to simplify the control logic by regulating each task with a stream of input and output data; this is called the free-running optimization. However, it is difficult to know when such an optimization can be applied without changing the functionality of the original design, and it takes a large effort to apply the optimization manually across legacy code. In this article, we present the TARO framework, which automatically applies the free-running optimization to HLS-based streaming applications. TARO simplifies the control logic without degrading the clock frequency or the performance. Experiments on the Alveo U250 show an average 16% LUT and 45% FF reduction for streaming-based systolic array designs.
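A rough software analogy of the free-running idea (not TARO's implementation): the task below has no start/done handshake and no iteration counter, and is regulated purely by blocking reads and writes on bounded streams.

```python
# Free-running analogy: the task runs "forever" once started and is
# throttled only by the availability of stream data, with no external
# control logic.

import threading, queue

def free_running_scale(ins, outs, factor):
    while True:                  # no start/done, no trip count
        x = ins.get()            # blocks until the producer supplies data
        outs.put(x * factor)     # blocks if the consumer falls behind

a = queue.Queue(maxsize=4)       # bounded FIFOs model on-chip streams
b = queue.Queue(maxsize=4)
threading.Thread(target=free_running_scale, args=(a, b, 3),
                 daemon=True).start()

for v in range(5):
    a.put(v)
for _ in range(5):
    print(b.get())               # 0 3 6 9 12
```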

1 citation

Proceedings ArticleDOI
23 May 2016
TL;DR: A Domain-Specific Language (DSL) based on Scala is proposed to specify the architecture of accelerator-based SoCs and is leverage to coordinate commercial High-Level Synthesis (HLS) tools in order to create the corresponding accelerators with proper standard interfaces for system-level integration.
Abstract: Nowadays, thanks to technology miniaturization and industrial standards, it is possible to create System-on-Chip (SoC) architectures featuring a combination of many components, like processor cores and specialized hardware accelerators. However, designing an SoC to accelerate an embedded application is particularly complex. After decomposing this application into tasks and assigning each of them to a processing element, the designer must create the required hardware components and integrate them into the final system. Currently, this process is not well supported by commercial tool flows and has to be manually performed. This is time consuming and error prone. This paper proposes a Domain-Specific Language (DSL) based on Scala to specify the architecture of accelerator-based SoCs. We leverage this DSL to coordinate commercial High-Level Synthesis (HLS) tools in order to create the corresponding accelerators with proper standard interfaces for system-level integration.
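The paper's DSL is embedded in Scala; purely to convey the flavor of a declarative accelerator-SoC description, here is a hypothetical Python-style analogue in which every name is invented for illustration.

```python
# Hypothetical, Python-flavored analogue of an accelerator-SoC
# description (the paper's real DSL is embedded in Scala; all names
# below are invented).

soc = {
    "cpu":  {"kind": "processor", "core": "generic-risc"},
    "accs": [
        {"name": "fft_acc",  "source": "fft.c",  "iface": "stream"},
        {"name": "sort_acc", "source": "sort.c", "iface": "mmio"},
    ],
    "links": [("cpu", "fft_acc"), ("cpu", "sort_acc")],
}

# A generator would walk this description, invoke an HLS tool once per
# accelerator source, and emit the top-level interconnect and glue logic.
for acc in soc["accs"]:
    print("HLS: compile {0} -> {1} ({2} interface)".format(
        acc["source"], acc["name"], acc["iface"]))
for left, right in soc["links"]:
    print("connect", left, "<->", right)
```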

1 citation


Cites background or methods from "High-Level Synthesis for FPGAs: Fro..."

  • ...However research is also active in the study of new HLS solutions alternative to the commercial ones [4], [17]....

    [...]

  • ...Reconfigurable hardware, like Field Programmable Gate Array (FPGA) devices, is playing a key role in SoC design [4]....

    [...]

  • ...In this context, commercial HLS tools help the designer in automatically translating high-level language descriptions (coming from the input application) into the corresponding Hardware Description Language (HDL) implementations (required to synthesize the accelerators) [4], [7]....

    [...]

References
More filters
Journal ArticleDOI
TL;DR: It is suggested that input and output are basic primitives of programming and that parallel composition of communicating sequential processes is a fundamental program structuring method.
Abstract: This paper suggests that input and output are basic primitives of programming and that parallel composition of communicating sequential processes is a fundamental program structuring method. When combined with a development of Dijkstra's guarded command, these concepts are surprisingly versatile. Their use is illustrated by sample solutions of a variety of familiar programming exercises.
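A minimal modern rendering of the paper's two primitives (Python threads standing in for sequential processes; note that Python's Queue is buffered, whereas CSP channels proper are unbuffered rendezvous):

```python
# CSP-flavored sketch: two sequential processes composed in parallel,
# communicating only via input (get) and output (put) on a channel.

import threading, queue

chan = queue.Queue(maxsize=1)     # small buffer; CSP proper is unbuffered

def producer():                   # sequential process with output only
    for i in range(3):
        chan.put(i)               # "output" primitive: send i on chan
    chan.put(None)                # end-of-stream marker

def consumer():                   # sequential process with input only
    while (v := chan.get()) is not None:   # "input" primitive
        print("received", v)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()              # parallel composition of the processes
p.join(); c.join()
```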

11,419 citations

Book
01 Jan 1985

9,210 citations

Proceedings ArticleDOI
20 Mar 2004
TL;DR: The design of the LLVM representation and compiler framework is evaluated in three ways: the size and effectiveness of the representation, including the type information it provides; compiler performance for several interprocedural problems; and illustrative examples of the benefits LLVM provides for several challenging compiler problems.
Abstract: We describe LLVM (low level virtual machine), a compiler framework designed to support transparent, lifelong program analysis and transformation for arbitrary programs, by providing high-level information to compiler transformations at compile-time, link-time, run-time, and in idle time between runs. LLVM defines a common, low-level code representation in static single assignment (SSA) form, with several novel features: a simple, language-independent type-system that exposes the primitives commonly used to implement high-level language features; an instruction for typed address arithmetic; and a simple mechanism that can be used to implement the exception handling features of high-level languages (and setjmp/longjmp in C) uniformly and efficiently. The LLVM compiler framework and code representation together provide a combination of key capabilities that are important for practical, lifelong analysis and transformation of programs. To our knowledge, no existing compilation approach provides all these capabilities. We describe the design of the LLVM representation and compiler framework, and evaluate the design in three ways: (a) the size and effectiveness of the representation, including the type information it provides; (b) compiler performance for several interprocedural problems; and (c) illustrative examples of the benefits LLVM provides for several challenging compiler problems.
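To make the SSA idea concrete (a toy illustration, not LLVM's API): in SSA form every variable is assigned exactly once, so each reassignment mints a new version; at control-flow joins, phi-functions (omitted in this straight-line sketch) merge versions.

```python
# Tiny SSA renaming for straight-line code: every assignment creates a
# fresh version, making def-use chains explicit. The string replace is
# naive (assumes variable names do not share prefixes).

def to_ssa(stmts):
    version = {}
    out = []
    for target, expr in stmts:
        for var, ver in version.items():       # rewrite uses with the
            expr = expr.replace(var, var + str(ver))  # current version
        version[target] = version.get(target, 0) + 1
        out.append("{0}{1} = {2}".format(target, version[target], expr))
    return out

prog = [("a", "b + c"), ("a", "a * 2"), ("d", "a + b")]
print(to_ssa(prog))
# ['a1 = b + c', 'a2 = a1 * 2', 'd1 = a2 + b']
```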

4,841 citations


Additional excerpts

  • ...infrastructure [59][110] to leverage leading-edge compiler...

    [...]

Proceedings Article
01 Jan 1974
TL;DR: A simple language for parallel programming is described and its mathematical properties are studied to make a case for more formal languages for systems programming and the design of operating systems.
Abstract: In this paper, we describe a simple language for parallel programming. Its semantics is studied thoroughly. The desirable properties of this language and its deficiencies are exhibited by this theoretical study. Basic results on parallel program schemata are given. We hope in this way to make a case for a more formal (i.e., mathematical) approach to the design of languages for systems programming and the design of operating systems. The features of the mini-language are exhibited on the sample program S [Figure 1: sample parallel program S; the garbled source listing is omitted here]. Integer channels X, Y, Z, T1, and T2 are declared, along with processes f, g, and h, declared much like procedures. Besides usual parameters passed by value (like INIT), a process's heading declares the typed input and output lines linking it to other processes. In a process body, wait blocks until something is sent on an input line, whereas nothing can prevent a process from performing a send; in other words, processes communicate via first-in first-out (fifo) queues. The main program binds actual channel names to the processes' formal parameters and activates them concurrently with the infix operator par: f(X,Y,Z) par g(X,T1,T2) par h(T1,Y,0) par h(T2,Z,1). A pictorial representation is the schema P of Figure 2, where nodes represent processes and arcs the communication channels between them. What one would like to prove about S is, firstly, that all its processes run forever; secondly, that it prints an alternating sequence of 0's and 1's forever; and thirdly, that if one of the processes were to stop for an extraneous reason, the whole system would stop.
The ability to state this kind of property of a parallel program formally, and to prove it within a formal logical framework, is the central motivation for the paper's theoretical study.
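Rendered in Python with FIFO queues standing in for the channels, the network behaves as described, printing 0 1 0 1 ...; the wiring below follows the prose (f merges and prints, g demultiplexes, h prefixes an initial value) rather than Kahn's exact notation, and the run is truncated to six values so the demo terminates.

```python
# Kahn-style process network: autonomous processes connected by FIFO
# channels; wait = blocking get, send = non-blocking put.

import threading, queue

def h(u, v, init):            # send init on v, then copy u -> v forever
    v.put(init)
    while True:
        v.put(u.get())

def g(u, v, w):               # route values from u alternately to v, w
    while True:
        v.put(u.get())
        w.put(u.get())

def f(u, v, w, n=6):          # read u, v alternately; print; forward to w
    for i in range(n):
        x = u.get() if i % 2 == 0 else v.get()
        print(x)              # prints 0 1 0 1 0 1
        w.put(x)

X, Y, Z, T1, T2 = (queue.Queue() for _ in range(5))
for proc, args in ((g, (X, T1, T2)), (h, (T1, Y, 0)), (h, (T2, Z, 1))):
    threading.Thread(target=proc, args=args, daemon=True).start()
f(Y, Z, X)                    # run the merging process in the main thread
```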

2,478 citations

Journal ArticleDOI
TL;DR: In this article, the authors present new algorithms that efficiently compute static single assignment forms and control dependence graphs for arbitrary control flow graphs using the concept of dominance frontiers, and give analytical and experimental evidence that these data structures are usually linear in the size of the original program.
Abstract: In optimizing compilers, data structure choices directly influence the power and efficiency of practical program optimization. A poor choice of data structure can inhibit optimization or slow compilation to the point that advanced optimization features become undesirable. Recently, static single assignment form and the control dependence graph have been proposed to represent data flow and control flow properties of programs. Each of these previously unrelated techniques lends efficiency and power to a useful class of program optimizations. Although both of these structures are attractive, the difficulty of their construction and their potential size have discouraged their use. We present new algorithms that efficiently compute these data structures for arbitrary control flow graphs. The algorithms use dominance frontiers, a new concept that may have other applications. We also give analytical and experimental evidence that all of these data structures are usually linear in the size of the original program. This paper thus presents strong evidence that these structures can be of practical use in optimization.
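A compact sketch of the dominance-frontier computation the abstract refers to, in the later Cooper-Harvey-Kennedy restatement (it yields the same frontiers as Cytron et al.'s algorithm, given precomputed immediate dominators):

```python
# Dominance frontiers via the "walk up from each predecessor" rule:
# a join node b is in DF(x) for every x on the idom-path from each
# predecessor of b up to, but not including, idom(b).

def dominance_frontiers(preds, idom):
    df = {n: set() for n in idom}
    for b, ps in preds.items():
        if len(ps) < 2:
            continue               # only join points contribute
        for p in ps:
            runner = p
            while runner != idom[b]:
                df[runner].add(b)
                runner = idom[runner]
    return df

# Diamond CFG: entry -> a, b; a -> join; b -> join
preds = {"entry": [], "a": ["entry"], "b": ["entry"], "join": ["a", "b"]}
idom  = {"entry": "entry", "a": "entry", "b": "entry", "join": "entry"}
print(dominance_frontiers(preds, idom))
# {'entry': set(), 'a': {'join'}, 'b': {'join'}, 'join': set()}
```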

2,198 citations