TL;DR: AutoESL's AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx are used as an example to demonstrate the effectiveness of state-of-art C-to-FPGA synthesis solutions targeting multiple application domains.
Abstract: Escalating system-on-chip design complexity is pushing the design community to raise the level of abstraction beyond register transfer level. Despite the unsuccessful adoptions of early generations of commercial high-level synthesis (HLS) systems, we believe that the tipping point for transitioning to HLS msystem-on-chip design complexityethodology is happening now, especially for field-programmable gate array (FPGA) designs. The latest generation of HLS tools has made significant progress in providing wide language coverage and robust compilation technology, platform-based modeling, advancement in core HLS algorithms, and a domain-specific approach. In this paper, we use AutoESL's AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx as an example to demonstrate the effectiveness of state-of-art C-to-FPGA synthesis solutions targeting multiple application domains. Complex industrial designs targeting Xilinx FPGAs are also presented as case studies, including comparison of HLS solutions versus optimized manual designs. In particular, the experiment on a sphere decoder shows that the HLS solution can achieve an 11-31% reduction in FPGA resource usage with improved design productivity compared to hand-coded design.
HE RAPID INCREASE of complexity in System-on-a-Chip (SoC) design has encouraged the design community to seek design abstractions with better productivity than RTL.
In addition to the line-count reduction in design specifications, behavioral synthesis has the added value of allowing efficient reuse of behavioral IPs.
The wide availability of SystemC functional models directly drives the need for SystemC-based HLS solutions, which can automatically generate RTL code through a series of formal constructive transformations.
These pre-defined building blocks can be modeled precisely ahead of time for each FPGA platform and, to a large extent, confine the design space.
In Sections IV-VIII, using a state-of-art HLS tool as an example, the authors discuss some key reasons for the wider adoption of HLS solutions in the FPGA design community, including wide language coverage and robust compilation technology, platform-based modeling, advancement in core HLS algorithms, improvements on simulation and verification flow, and the availability of domain-specific design templates.
II. EVOLUTION OF HIGH-LEVEL SYNTHESIS FOR FPGA
Compilers for high-level languages have been successful in practice since the 1950s.
The idea of automatically generating circuit implementations from high-level behavioral specifications arises naturally with the increasing design complexity of integrated circuits.
Most of those tools, however, made rather simplistic assumptions about the target platform and were not widely used.
Early commercialization efforts in the 1990s and early 2000s attracted considerable interest among designers, but also failed to gain wide adoption, due in part to usability issues and poor quality of results.
More recent efforts in high-level synthesis have improved usability by increasing input language coverage and platform integration, as well as improving quality of results.
A. Early Efforts
Since the history of HLS is considerably longer than that of FPGAs, most early HLS tools targeted ASIC designs.
In the subsequent years in the 1980s and early 1990s, a number of similar high-level synthesis tools were built, mostly for research.
The list scheduling algorithm and its variants are widely used to solve scheduling problems with resource constraints [70] ; the forcedirected scheduling algorithm developed in HAL [73] is able to optimize resource requirements under a performance constraint; the path-based scheduling algorithm in the Yorktown Silicon Compiler is useful to optimize performance with conditional branches [12] .
The Silage language, along with the Cathedral-II tool, represented an early domain-specific approach in high-level synthesis.
These tools received wide attention, but failed to widely replace RTL design.
B. Recent efforts
Since 2000, a new generation of high-level synthesis tools has been developed in both academia and industry.
The use of C-based languages also makes it easy to leverage the newest technologies in software compilers for parallelization and optimization in the synthesis tools.
(ii) C and C++ have complex language constructs, such as pointers, dynamic memory management, recursion, polymorphism, etc., which do have efficient hardware counterparts and lead to difficulty in synthesis.
Handel-C allows the user to specify clock boundaries explicitly in the source code.
FPGAs have continually improved in capacity and speed in recent years, and their programmability makes them an attractive platform for many applications in signal processing, communication, and high-performance computing.
C. Lessons Learned
The authors believe that past failures are due to one or several of the following reasons:.
The first generation of the HLS synthesis tools could not synthesize high-level programming languages.
Instead, untimed or partially timed behavioral HDL was used.
C and C++ lack the necessary constructs and semantics to represent hardware attributes such as design hierarchy, timing, synchronization, and explicit concurrency.
Lack of reusable and portable design specification:
Many HLS tools have required users to embed detailed timing and interface information as well as the synthesis constraints into the source code.
Lack of satisfactory quality of results (QoR):.
There was no dependable RTL to GDSII foundation to support HLS, which made it difficult to consistently measure, track, and enhance HLS results.
As a result, the final implementation often fails to meet timing/power requirements.
Another major factor limiting quality of result was the limited capability of HLS tools to exploit performance-optimized and power-efficient IP blocks on a specific platform, such as the versatile DSP blocks and on-chip memories on modern FPGA platforms.
Lack of a compelling reason/event to adopt a new design methodology:
The first-generation HLS tools were clearly ahead of their time, as the design complexity was still manageable at the register transfer level in late 1990s.
Like any major transition in the EDA industry, designers needed a compelling reason or event to push them over the "tipping point," i.e., to adopt the HLS design methodology.
This goal is not generally practical for HLS to achieve.
It is critical that these optimizations be carefully implemented using scalable and predictable algorithms, keeping tool runtimes acceptable for large programs and the results understandable by designers.
The code should be readable by algorithm specialists.
With minimal modification of the C code, for parallelizable algorithms.
Allow an optimization-oriented design process, where a designer can improve the performance of the resulting implementation by successive code modification and refactoring.
Generate implementations that are competitive with synthesizable RTL designs after automatic and manual optimization.
Moreover, the authors are pleased to see that the latest generation of HLS tools has made significant progress in providing wide language coverage and robust compilation technology, platform-based modeling, and advanced core HLS algorithms.
The authors shall discuss these advancements in more detail in the next few sections.
III. CASE STUDY OF STATE-OF-ART OF HIGH-LEVEL
SYNTHESIS FOR FPGAS AutoPilot is one of the most recent HLS tools, and is representative of the capabilities of the state-of-art commercial HLS tools available today.
AutoPilot outputs RTL in Verilog, VHDL or cycle-accurate SystemC for simulation and verification.
These SystemC wrappers connect high-level interfacing objects in the behavioral test bench with pin-level signals in RTL.
The reports include a breakdown of performance and area metrics by individual modules, functions and loops in the source code.
Finally, the generated HDL files and design constraints feed into the Xilinx RTL tools for implementation.
Improved design quality:
Comprehensive language support allows designers to take full advantage of rich C/C++ constructs to maximize simulation speed, design modularity and reusability, as well as synthesis QoR.
In fact, many early C-based synthesis tools only handle a very limited language subset, which typically includes the native integer data types (e.g., char, short, int, etc.), onedimensional arrays, if-then-else conditionals, and for loops.
The arbitrary-precision fixed-point (ap_fixed) data types support all common algorithmic operations.
Designers can explore the accuracy and cost tradeoff by modifying the resolution and fixed-point location and experimenting with various quantization and saturation modes.
AutoPilot also supports the OCSI synthesizable subset [113] for SystemC synthesis.
B. Use of state-of-the-art compiler technologies
AutoPilot tightly integrates the LLVM compiler infrastructure [59] [110] to leverage leading-edge compiler technologies.
AutoPilot uses the llvm-gcc front end to obtain an intermediate representation (IR) based on the LLVM instruction set.
In particular, the following classes of transformations and analyses have shown to be very useful for hardware synthesis: SSA-based code optimizations such as constant propagation, dead code elimination, and redundant code elimination based on global value numbering [2] .
Memory optimizations such as memory reuse, array scalarization, and array partitioning [19] to reduce the number of memory accesses and improve memory bandwidth.
In other words, the code can be optimized without considering the source language.
A. Platform modeling for Xilinx FPGAs
AutoPilot uses detailed target platform information to carry out informed and target-specific synthesis and optimization.
The resulting characterization data is then used to make implementation choices during synthesis.
Notably, the cost of implementing hardware on FPGAs is often different from that for ASIC technology.
On FPGAs, multiplexors typically have the same cost and delay as an adder (approximately one LUT/output).
FPGA technology also features heterogeneous on-chip resources, including not only LUTs and flip flops but also other prefabricated architecture blocks such as DSP48s and Block RAMs.
B. Integration with Xilinx toolset
In order to raise the level of design abstraction more completely, AutoPilot attempts to hide details of the downstream RTL flow from users as much as possible.
Otherwise, a user may be overwhelmed by the details of vendor-specific tools such as the formats of constraint and configuration files, implementation and optimization options, or directory structure requirements.
As shown in Figure 1 AutoPilot instantiates these interfaces along with adapter logic and appropriate EDK meta-information to enable a generated module can be quickly connected in an EDK system.
A. Efficient mathematical programming formulations to scheduling
Classical approaches to the scheduling problem in highlevel synthesis use either conventional heuristics such as list scheduling [1] and force-directed scheduling [73] , which often lead to sub-optimal solutions, due to the nature of local optimization methods, or exact formulations such as integerlinear programming [45] , which can be difficult to scale to large designs.
Unlike previous approaches where using O(m×n) binary variables to encode a scheduling solution with n operations and m steps [45] , SDC uses a continuous representation of time with only O(n) variables: for each operation i, a scheduling variable s i is introduced to represent the time step at which the operation is scheduled.
A linear program with a totally unimodular constraint matrix is guaranteed to have integral solutions.
Many commonly encountered constraints in high-level synthesis can be expressed in the form of integer-difference constraints.
Other complex constraints can be handled in similar ways, using approximations or other heuristics.
B. Soft constraints and applications for platform-based optimization
In a typical synthesis tool, design intentions are often expressed as constraints.
While some of these constraints are essential for the design to function correctly, many others are not.
It is possible that a solution with a slight nominal timing violation can still meet the frequency requirement, considering inaccuracy in interconnect delay estimation and various timing optimization procedures in later design stages, such as logic refactoring, retiming, and interconnect optimization.
The approach is based on the SDC formulation discussed in the preceding subsection, but allows some constraints to be violated.
Consider the scheduling problem with both hard constraints and soft constraints formulated as follows.
Gs ≤ p hard constraints
Here G and H corresponds to the matrices representing hard constraints and soft constraints, respectively, and they are both totally unimodular as shown in [15] .
Hard constraints and soft constraints are generated based on the functional specification and QoR targets.
This approach offers a powerful yet flexible framework to address various considerations in scheduling.
Take the DSP48E block in Xilinx Virtex 5 FPGAs for example: each of the DSP48E blocks contains a multiplier and a post-adder, allowing efficient implementations of multiplication and multiply-accumulation.
C. Pattern mining for efficient sharing
A typical target architecture for HLS may introduce multiplexers when functional units, storage units or interconnects are shared by multiple operations/variables in a time-multiplexed manner.
Multiplexers (especially large ones) can be particularly expensive on FPGA platforms.
Thus, careless decisions on resource sharing could introduce more overhead than benefit.
The method tries to extract common structures or patterns in the data-flow graph, so that different instances of the same pattern can share resources with little overhead.
Pruning techniques are proposed based on characteristic vectors and locality-sensitive hashing.
D. Memory analysis and optimizations
While application-specific computation platforms such as FPGAs typically have considerable computational capability, their performance is often limited by available communication or memory bandwidth.
Typical FPGAs, such as the Xilinx Virtex series, have a considerable number of block RAMs.
Consider a loop that accesses array A with subscripts i, 2×i+1, and 3×i+1, in the ith iteration.
If the loop is targeted to be pipelined with the initiation interval of one, i.e., a new loop iteration starts every clock cycle, the schedule in (b) will lead to port conflicts, because (i+1) mod 2 = (2×(i+1)+1) mod 2 = (3×i+1) mod 2, when i is even; this will lead to three simultaneous accesses to the first bank.
Then, an iterative algorithm is used to perform both scheduling and memory partitioning guided by the conflict graph.
VII. ADVANCES IN SIMULATION AND VERIFICATION
Besides the many advantages of automated synthesis, such as quick design space exploration and automatic complex architectural changes like pipelining, resource sharing and scheduling, HLS also enables a more efficient debugging and verification flow at the higher abstraction levels.
Since HLS provides an automatic path to implementable RTL from behavioral/functional models, designers do not have to wait until manual RTL models to be available to conduct verification.
Instead, they can develop, debug and functionally verify a design at an earlier stage with high-level programming languages and tools.
This can significantly reduce the verification effort due to the following reasons: (i) It is easier to trace, identify and fix bugs at higher abstraction levels with more compact and readable design descriptions.
(ii) Simulation at the higher level is typically orders of magnitude faster than RTL simulation, allowing more comprehensive tests and greater coverage.
A. Automatic co-simulation
At present, simulation is the still prevalent technique to check if the resulting RTL complies with the high-level specification.
To reduce effort spent on RTL simulation, the latest HLS technologies have made important improvements on automatic co-simulation [86].
A C-to-RTL transactor is created to connect highlevel interfacing constructs (such as parameters and global variables) with pin-level signals in RTL.
This wrapper also includes additional control logic to manage the communication between the testing module and the RTL design under test (DUT).
A pipelined design may require that the test bench feed input data into the DUT at a fixed rate.
PLATFORMS
In the end, the time-to-market of an FPGA system design is dependent on many factors, such as availability of reference designs, development boards, and in the end, FPGA devices themselves.
This integration often includes a wide variety of system-level design concerns, including embedded software, system integration, and verification [104] .
As a result, these cores are not easily amenable to high-level synthesis and form part of the system infrastructure of a design.
Subsystem PSS is responsible for executing the relatively low-performance processing in the system.
The portion of a design generated using HLS represents the bulk of the FPGA design and communicates with the system infrastructure through standardized wire-level interfaces, such as AXI4 memory-mapped and streaming interfaces [96] shown in Figure 7 .
A. High-level design of cognitive radios project
Cognitive radio systems typically contain both computationally intensive processing with high data rates in the radio processing, along with complex, but relatively lowrate processing to control the radio processing.
Efficient interaction with the processor is an important part of the overall system complexity.
The processor subsystem contains standard hardware modules and is capable of running a standard embedded operating system, such as Linux.
The accelerator subsystem is used for implementing components with high computational requirements in hardware.
Components also expose a configuration interface with multiple parameters, allowing them to be reconfigured in an executing system by user-defined control code executing in the processor subsystem.
B. Video Starter Kit
Video processing systems implemented in FPGA include a wide variety of applications from embedded computer-vision and picture quality improvement to image and video compression.
Typically these systems include two significant pieces of complexity.
This platform is derived from the Xilinx EDKbased reference designs provided with the Xilinx Spartan 3ADSP Video Starter Kit and has been ported to several Xilinx Virtex 5 and Spartan 6 based development boards, targeting high-definition HD video processing with pixel clocks up to 150 MHz.
The incoming video data is analyzed by the Frame Decoder block to determine the frame size of the incoming video, which is passed to the application block, enabling different video formats to be processed.
The interface to external memory used for frame buffers is implemented using the Xilinx Multi-ported Memory Controller (MPMC) [118] which provides access to external memory to the Application Block and to the Microblaze control processor, if necessary.
A. Summary of BDTI HLS Certification
Xilinx has worked with BDTI Inc. [99] to implement an HLS Tool Certification Program [100] .
This program was designed to compare the results of an HLS Tool and the Xilinx Spartan 3 FPGA that is part of the Video Starter Kit, with the result of a conventional DSP processor and with the results of a good manual RTL implementation.
There were two applications used in this Certification Program, an optical flow algorithm, which is characteristic for a demanding image processing application and a wireless application for which a very representative implementation in RTL was available.
The DSP processor implementation rated "fair", while the AutoPilot implementation rated "good", indicating that less source code modification was necessary to achieve high performance when using AutoPilot.
BDTI also assessed overall ease of use of the DSP tool flow and the FPGA tool flow, combining HLS with the low-level implementation tools.
B. Sphere Decoder
Xilinx has implemented a sphere decoder for a multi-input multi-output (MIMO) wireless communication system using AutoPilot [67] [85] .
The application exhibits a large amount of parallelism, since the operations must be executed on each of 360 independent subcarriers which form the overall communication channel and the processing for each channel can generally be pipelined.
The resulting HLS code for the application makes heavy use of C++ templates to describe arbitrary-precision integer data types and parameterized code blocks used to process different matrix sizes at different points in the application.
Both designs were implemented as standalone cores using ISE 12.1, targeting Xilinx Virtex 5 speed grade 2 at 225 MHz.
Using AutoPilot Version 2010.07.ft, the authors were able to generate a design that was smaller than the reference implementation in less time than a hand RTL implementation by refactoring and optimizing the algorithmic C model.
Toplevel Block Diagram
Design time for the RTL design was estimated from work logs by the original authors of [28] , and includes only the time for an algorithm expert and experienced tool user to enter and verify the RTL architecture in System Generator.
Given the significant time familiarizing ourselves with the application and structure of the code, the authors believe that an application expert familiar with the code would be able to create such a design at least twice as fast.
To meet the required throughput, one row of the systolic array is instantiated, consisting of one diagonal cell and 8 off-diagonal cells, and the remaining rows are time multiplexed over the single row.
In the 4x4 case, the off-diagonal cell implements finegrained resource sharing, with one resource-shared complex multiplier.
The authors do observe that AutoPilot uses additional BRAM to implement this block relative to the RTL implementation, because AutoPilot requires tool-implemented double-buffers to only be read or written in a single loop.
X. CONCLUSIONS AND CHALLENGES AHEAD
It seems clear that the latest generation of FPGA HLS tools has made significant progress in providing wide language coverage, robust compilation technology, platform-based modeling, and domain-specific system-level integration.
As a result, they can quickly provide highly competitive quality of results, in many cases comparable or better than manual RTL designs.
For the FPGA design community, it appears that HLS technology may be transitioning from research and investigation to selected deployment.
The authors also see many opportunities for HLS tools to further improve.
TL;DR: This work uses a first-published methodology to compare one commercial and three academic tools on a common set of C benchmarks, aiming at performing an in-depth evaluation in terms of performance and the use of resources.
Abstract: High-level synthesis (HLS) is increasingly popular for the design of high-performance and energy-efficient heterogeneous systems, shortening time-to-market and addressing today’s system complexity. HLS allows designers to work at a higher-level of abstraction by using a software program to specify the hardware functionality. Additionally, HLS is particularly interesting for designing field-programmable gate array circuits, where hardware implementations can be easily refined and replaced in the target device. Recent years have seen much activity in the HLS research community, with a plethora of HLS tool offerings, from both industry and academia. All these tools may have different input languages, perform different internal optimizations, and produce results of different quality, even for the very same input description. Hence, it is challenging to compare their performance and understand which is the best for the hardware to be implemented. We present a comprehensive analysis of recent HLS tools, as well as overview the areas of active interest in the HLS research community. We also present a first-published methodology to evaluate different HLS tools. We use our methodology to compare one commercial and three academic tools on a common set of C benchmarks, aiming at performing an in-depth evaluation in terms of performance and the use of resources.
433 citations
Cites methods from "High-Level Synthesis for FPGAs: Fro..."
...We
first introduce the academic HLS tools evaluated in this study, before moving onto highlight features of other HLS tools available in the community (either commercial or academic)....
TL;DR: The design of a BNN accelerator is presented that is synthesized from C++ to FPGA-targeted Verilog and outperforms existing FPGAs-based CNN accelerators in GOPS as well as energy and resource efficiency.
Abstract: Convolutional neural networks (CNN) are the current stateof-the-art for many computer vision tasks. CNNs outperform older methods in accuracy, but require vast amounts of computation and memory. As a result, existing CNN applications are typically run on clusters of CPUs or GPUs. Studies into the FPGA acceleration of CNN workloads has achieved reductions in power and energy consumption. However, large GPUs outperform modern FPGAs in throughput, and the existence of compatible deep learning frameworks give GPUs a significant advantage in programmability. Recent research in machine learning demonstrates the potential of very low precision CNNs -- i.e., CNNs with binarized weights and activations. Such binarized neural networks (BNNs) appear well suited for FPGA implementation, as their dominant computations are bitwise logic operations and their memory requirements are reduced. A combination of low-precision networks and high-level design methodology may help address the performance and productivity gap between FPGAs and GPUs. In this paper, we present the design of a BNN accelerator that is synthesized from C++ to FPGA-targeted Verilog. The accelerator outperforms existing FPGA-based CNN accelerators in GOPS as well as energy and resource efficiency.
379 citations
Cites methods from "High-Level Synthesis for FPGAs: Fro..."
...Our HLS implementation leverages these optimizations, and further propose novel BNN-specific hardware constructs to ensure full throughput and hardware utilization across the different input feature sizes....
[...]
...We make use of Xilinx SDSoC 2016.1 as the primary design tool, which leverages Vivado HLS and Vivado to perform the actual HLS compilation and FPGA implementation....
[...]
...It invokes Vivado HLS under the hood to synthesize the “hardware” portion into RTL....
[...]
...The rest of this paper is organized as follows: Section 2 gives a primer on CNNs and BNNs; Section 3 describes our BNN accelerator design; Section 4 provides some details on our HLS code; Section 5 reports our experimental findings, Section 6 reviews previous work on FPGA-based CNN accelerators; and we conclude the paper in Section 7....
[...]
...Zhang et al. [27] describe how to optimize an HLS design by reordering and tiling loops, inserting the proper pragmas, and organizing external memory transfers....
TL;DR: This work proposes a new approximate matrix inversion algorithm relying on a Neumann series expansion, which substantially reduces the complexity of linear data detection in single-carrier frequency-division multiple access (SC-FDMA)-based large-scale MIMO systems.
Abstract: Large-scale (or massive) multiple-input multiple-out put (MIMO) is expected to be one of the key technologies in next-generation multi-user cellular systems based on the upcoming 3GPP LTE Release 12 standard, for example. In this work, we propose-to the best of our knowledge-the first VLSI design enabling high-throughput data detection in single-carrier frequency-division multiple access (SC-FDMA)-based large-scale MIMO systems. We propose a new approximate matrix inversion algorithm relying on a Neumann series expansion, which substantially reduces the complexity of linear data detection. We analyze the associated error, and we compare its performance and complexity to those of an exact linear detector. We present corresponding VLSI architectures, which perform exact and approximate soft-output detection for large-scale MIMO systems with various antenna/user configurations. Reference implementation results for a Xilinx Virtex-7 XC7VX980T FPGA show that our designs are able to achieve more than 600 Mb/s for a 128 antenna, 8 user 3GPP LTE-based large-scale MIMO system. We finally provide a performance/complexity trade-off comparison using the presented FPGA designs, which reveals that the detector circuit of choice is determined by the ratio between BS antennas and users, as well as the desired error-rate performance.
TL;DR: The techniques investigated in this paper represent the recent trends in the FPGA-based accelerators of deep learning networks and are expected to direct the future advances on efficient hardware accelerators and to be useful for deep learning researchers.
Abstract: Due to recent advances in digital technologies, and availability of credible data, an area of artificial intelligence, deep learning, has emerged and has demonstrated its ability and effectiveness in solving complex learning problems not possible before. In particular, convolutional neural networks (CNNs) have demonstrated their effectiveness in the image detection and recognition applications. However, they require intensive CPU operations and memory bandwidth that make general CPUs fail to achieve the desired performance levels. Consequently, hardware accelerators that use application-specific integrated circuits, field-programmable gate arrays (FPGAs), and graphic processing units have been employed to improve the throughput of CNNs. More precisely, FPGAs have been recently adopted for accelerating the implementation of deep learning networks due to their ability to maximize parallelism and their energy efficiency. In this paper, we review the recent existing techniques for accelerating deep learning networks on FPGAs. We highlight the key features employed by the various techniques for improving the acceleration performance. In addition, we provide recommendations for enhancing the utilization of FPGAs for CNNs acceleration. The techniques investigated in this paper represent the recent trends in the FPGA-based accelerators of deep learning networks. Thus, this paper is expected to direct the future advances on efficient hardware accelerators and to be useful for deep learning researchers.
TL;DR: It is suggested that input and output are basic primitives of programming and that parallel composition of communicating sequential processes is a fundamental program structuring method.
Abstract: This paper suggests that input and output are basic primitives of programming and that parallel composition of communicating sequential processes is a fundamental program structuring method. When combined with a development of Dijkstra's guarded command, these concepts are surprisingly versatile. Their use is illustrated by sample solutions of a variety of a familiar programming exercises.
TL;DR: The design of the LLVM representation and compiler framework is evaluated in three ways: the size and effectiveness of the representation, including the type information it provides; compiler performance for several interprocedural problems; and illustrative examples of the benefits LLVM provides for several challenging compiler problems.
Abstract: We describe LLVM (low level virtual machine), a compiler framework designed to support transparent, lifelong program analysis and transformation for arbitrary programs, by providing high-level information to compiler transformations at compile-time, link-time, run-time, and in idle time between runs. LLVM defines a common, low-level code representation in static single assignment (SSA) form, with several novel features: a simple, language-independent type-system that exposes the primitives commonly used to implement high-level language features; an instruction for typed address arithmetic; and a simple mechanism that can be used to implement the exception handling features of high-level languages (and setjmp/longjmp in C) uniformly and efficiently. The LLVM compiler framework and code representation together provide a combination of key capabilities that are important for practical, lifelong analysis and transformation of programs. To our knowledge, no existing compilation approach provides all these capabilities. We describe the design of the LLVM representation and compiler framework, and evaluate the design in three ways: (a) the size and effectiveness of the representation, including the type information it provides; (b) compiler performance for several interprocedural problems; and (c) illustrative examples of the benefits LLVM provides for several challenging compiler problems.
4,841 citations
Additional excerpts
...infrastructure [59][110] to leverage leading-edge compiler...
TL;DR: A simple language for parallel programming is described and its mathematical properties are studied to make a case for more formal languages for systems programming and the design of operating systems.
Abstract: In this paper, we describe a simple language for parallel programming. Its semantics is studied thoroughly. The desirable properties of this language and its deficiencies are exhibited by this theoretical study. Basic results on parallel program schemata are given. We hope in this way to make a case for more formal (i.e. mathematical) approach to the design of languages for systems programming and the design of operating systems. There is a wide disagreement among systems designers as to what are the best primitives for writing systems programs. In this paper, we describe a simple language for parallel programming and study its mathematical properties. 1. A SIMPLE LANGUAGE FOR PARALLEL PROGRAMMING The features of our mini-language are exhibited on the sample program S on Figure 1. The conventions are close to Algol1 and we only insist upon the new features. The program S consists of a set of declarations and a body. Variables of type integer channel are declared at line (1), and for any simple type σ (boolean, real, etc. . . ) we could have declared a σ channel. Then processes f , g and h are declared, much like procedures. Aside from usual parameters (passed by value in this example, like INIT at line (3)), we can declare in the heading of the process how it is linked to other processes : at line (2) f is stated to communicate via two input lines that can carry integers, and one similar output line. The body of a process is an usual Algol program except for invocation of wait until something on an input line (e.g. at (4)) or send a variable on a line of compatible type (e.g. at (5)). The process stays blocked on a wait until something is being sent on this line by another process, but nothing can prevent a process from performing a send on a line. In others words, processes communicate via first-in first-out (fifo) queues. Calling instances of the processes is done in the body of the main program at line (6) where the actual names of he channels are bound to the formal parameters of the processes. The infix operator par initiates the concurrent activation of the processes. Such a style of programming is close to may systems using EVENT mechanisms ([1, 2, 3, 4]). A pictorial representation of the program is the schema P on Figure 2, where the nodes represent processes and the arcs communication channels between these processes. What sort of things would we like to prove on a program like S? Firstly, that all processes in S run forever. Secondly, Begin (1) In t eg e r channel X, Y, Z , T1 , T2 ; (2 ) Process f ( i n t e r g e r in U,V; i n t e r g e r out W) ; Begin i n t e g e r I ; l o g i c a l B; B := true ; Repeat Begin (4 ) I := i f B then wait (U) e l s e wait (V) ; (7 ) p r in t ( I ) ; (5 ) send I on W; B := not B; End ; End ; Process g ( i n t e g e r in U ; i n t e g e r out V, W) ; Begin i n t e g e r I ; l o g i c a l B; B := true ; Repeat Begin I := wait (U) ; i f B then send I on V e l s e send I on W : B := not B; End ; End ; (3 ) Process h( i n t e g e r in U; i n t e g e r out V; i n t e g e r INIT ) ; Begin i n t e g e r I ; send INIT on V; Repeat Begin I := wait (U) ; send I on V; End ; End ; Comment : body o f mainprogram ; (6 ) f (X,Y,Z) par g (X,T1 ,T2) par h(T1 ,Y, 0 ) par h(T2 , Z , 1 ) ; End ; Figure 1: Sample parallel program S. more precisely, that S prints out (at line (7)) an alternating sequence of 0’s and 1’s forever. Third, that if one of the processes were to stop at some time for an extraneous reason, the whole systems would stop. The ability to state formally this kind of property of a parallel program and to prove them within a formal logical framework is the central motivation for the theoretical study of the next sections. 2. PARALLEL COMPUTATION Informally speaking, a parallel computation is organized in the following way: some autonomous computing stations are connected to each other in a network by communication lines. Computing stations exchange information through these lines. A given station computes on data coming along
TL;DR: In this article, the authors present new algorithms that efficiently compute static single assignment forms and control dependence graphs for arbitrary control flow graphs using the concept of {\em dominance frontiers} and give analytical and experimental evidence that these data structures are usually linear in the size of the original program.
Abstract: In optimizing compilers, data structure choices directly influence the power and efficiency of practical program optimization. A poor choice of data structure can inhibit optimization or slow compilation to the point that advanced optimization features become undesirable. Recently, static single assignment form and the control dependence graph have been proposed to represent data flow and control flow properties of programs. Each of these previously unrelated techniques lends efficiency and power to a useful class of program optimizations. Although both of these structures are attractive, the difficulty of their construction and their potential size have discouraged their use. We present new algorithms that efficiently compute these data structures for arbitrary control flow graphs. The algorithms use {\em dominance frontiers}, a new concept that may have other applications. We also give analytical and experimental evidence that all of these data structures are usually linear in the size of the original program. This paper thus presents strong evidence that these structures can be of practical use in optimization.