scispace - formally typeset
Search or ask a question

Showing papers on "Compiler published in 2010"


Proceedings ArticleDOI
01 Jan 2010
TL;DR: This paper illustrates how to use Theano, outlines the scope of the compiler, provides benchmarks on both CPU and GPU processors, and explains its overall design.
Abstract: Theano is a compiler for mathematical expressions in Python that combines the convenience of NumPy's syntax with the speed of optimized native machine language. The user composes mathematical expressions in a high-level description that mimics NumPy's syntax and semantics, while being statically typed and functional (as opposed to imperative). These expressions allow Theano to provide symbolic differentiation. Before performing computation, Theano optimizes the choice of expressions, translates them into C++ (or CUDA for GPU), compiles them into dynamically loaded Python modules, all automatically. Common machine learn- ing algorithms implemented with Theano are from 1:6 to 7:5 faster than competitive alternatives (including those implemented with C/C++, NumPy/SciPy and MATLAB) when compiled for the CPU and between 6:5 and 44 faster when compiled for the GPU. This paper illustrates how to use Theano, outlines the scope of the compiler, provides benchmarks on both CPU and GPU processors, and explains its overall design.

939 citations


Proceedings ArticleDOI
05 Jun 2010
TL;DR: Green enables programmers to approximate expensive functions and loops and operates in two phases and can produce significant improvements in performance and energy consumption with small and controlled QoS degradation.
Abstract: Energy-efficient computing is important in several systems ranging from embedded devices to large scale data centers. Several application domains offer the opportunity to tradeoff quality of service/solution (QoS) for improvements in performance and reduction in energy consumption. Programmers sometimes take advantage of such opportunities, albeit in an ad-hoc manner and often without providing any QoS guarantees.We propose a system called Green that provides a simple and flexible framework that allows programmers to take advantage of such approximation opportunities in a systematic manner while providing statistical QoS guarantees. Green enables programmers to approximate expensive functions and loops and operates in two phases. In the calibration phase, it builds a model of the QoS loss produced by the approximation. This model is used in the operational phase to make approximation decisions based on the QoS constraints specified by the programmer. The operational phase also includes an adaptation function that occasionally monitors the runtime behavior and changes the approximation decisions and QoS model to provide strong statistical QoS guarantees.To evaluate the effectiveness of Green, we implemented our system and language extensions using the Phoenix compiler framework. Our experiments using benchmarks from domains such as graphics, machine learning, signal processing, and finance, and an in-production, real-world web search engine, indicate that Green can produce significant improvements in performance and energy consumption with small and controlled QoS degradation.

455 citations


Journal ArticleDOI
TL;DR: This paper describes the design, implementation and evaluation of Native Client, a sandbox for untrusted x86 native code that combines software fault isolation and a secure runtime to direct system interaction and side effects through interfaces managed by Native Client.
Abstract: This paper describes the design, implementation and evaluation of Native Client, a sandbox for untrusted x86 native code. Native Client aims to give browser-based applications the computational performance of native applications without compromising safety. Native Client uses software fault isolation and a secure runtime to direct system interaction and side effects through interfaces managed by Native Client. Native Client provides operating system portability for binary code while supporting performance-oriented features generally absent from web application programming environments, such as thread support, instruction set extensions such as SSE, and use of compiler intrinsics and hand-coded assembler. We combine these properties in an open architecture that encourages community review and 3rd-party tools.

434 citations


Journal ArticleDOI
TL;DR: A new x86-TSO programmer's model is presented that is mathematically precise but can be presented as an intuitive abstract machine which should be widely accessible to working programmers and put x86 multiprocessor system building on a more solid foundation.
Abstract: Exploiting the multiprocessors that have recently become ubiquitous requires high-performance and reliable concurrent systems code, for concurrent data structures, operating system kernels, synchronization libraries, compilers, and so on. However, concurrent programming, which is always challenging, is made much more so by two problems. First, real multiprocessors typically do not provide the sequentially consistent memory that is assumed by most work on semantics and verification. Instead, they have relaxed memory models, varying in subtle ways between processor families, in which different hardware threads may have only loosely consistent views of a shared memory. Second, the public vendor architectures, supposedly specifying what programmers can rely on, are often in ambiguous informal prose (a particularly poor medium for loose specifications), leading to widespread confusion.In this paper we focus on x86 processors. We review several recent Intel and AMD specifications, showing that all contain serious ambiguities, some are arguably too weak to program above, and some are simply unsound with respect to actual hardware. We present a new x86-TSO programmer's model that, to the best of our knowledge, suffers from none of these problems. It is mathematically precise (rigorously defined in HOL4) but can be presented as an intuitive abstract machine which should be widely accessible to working programmers. We illustrate how this can be used to reason about the correctness of a Linux spinlock implementation and describe a general theory of data-race freedom for x86-TSO. This should put x86 multiprocessor system building on a more solid foundation; it should also provide a basis for future work on verification of such systems.

406 citations


Proceedings ArticleDOI
05 Jun 2010
TL;DR: An empirical study of the dynamic behavior of a corpus of widely-used JavaScript programs is performed, and how and why the dynamic features are used are analyzed.
Abstract: The JavaScript programming language is widely used for web programming and, increasingly, for general purpose computing. As such, improving the correctness, security and performance of JavaScript applications has been the driving force for research in type systems, static analysis and compiler techniques for this language. Many of these techniques aim to reign in some of the most dynamic features of the language, yet little seems to be known about how programmers actually utilize the language or these features. In this paper we perform an empirical study of the dynamic behavior of a corpus of widely-used JavaScript programs, and analyze how and why the dynamic features are used. We report on the degree of dynamism that is exhibited by these JavaScript programs and compare that with assumptions commonly made in the literature and accepted industry benchmark suites.

370 citations


Proceedings ArticleDOI
17 Oct 2010
TL;DR: This tutorial explains how to define a language and a statically typed, EMF-based Abstract Syntax Tree using only a grammar, and shows how literally every as- pects of the language and its complementary tool support can be customized using Dependency Injection.
Abstract: Whether there is an (emerging or legacy) Domain-Specific Language to increase the expressiveness of your coworkers or whether you are about to invent a new General Purpose Prgramming Language: Tool support that goes beyond a parser/compiler is essential to make other people adopt your language and to be more productive. Xtext is an award- winning framework to build such tooling.In this tutorial we explain how to define a language and a statically typed, EMF-based Abstract Syntax Tree using only a grammar. We then generate a parser, a serializer and a smart editor from it. The editor provides many features out-of-the-box, such as syntax highlighting, content-assist, folding, jump-to-declaration and reverse-reference lookup across multiple files. Then, it is shown how literally every as- pects of the language and its complementary tool support can be customized using Dependency Injection, especially how this can be done for linking, formatting and validation. As an outlook, we will demonstrate how to integrate a custom language with Java, how Xtext maintains a workspace-wide index of named elements and how to implement incremental code generation or attach an interpreter.

368 citations


Proceedings ArticleDOI
05 Jun 2010
TL;DR: CETS maintains a unique identifier with each object, associates this metadata with the pointers in a disjoint metadata space to retain memory layout compatibility, and checks that the object is still allocated on pointer dereferences to provide temporal safety.
Abstract: Temporal memory safety errors, such as dangling pointer dereferences and double frees, are a prevalent source of software bugs in unmanaged languages such as C. Existing schemes that attempt to retrofit temporal safety for such languages have high runtime overheads and/or are incomplete, thereby limiting their effectiveness as debugging aids. This paper presents CETS, a compile-time transformation for detecting all violations of temporal safety in C programs. Inspired by existing approaches, CETS maintains a unique identifier with each object, associates this metadata with the pointers in a disjoint metadata space to retain memory layout compatibility, and checks that the object is still allocated on pointer dereferences. A formal proof shows that this is sufficient to provide temporal safety even in the presence of arbitrary casts if the program contains no spatial safety violations. Our CETS prototype employs both temporal check removal optimizations and traditional compiler optimizations to achieve a runtime overhead of just 48% on average. When combined with a spatial-checking system, the average overall overhead is 116% for complete memory safety

355 citations


Proceedings ArticleDOI
14 Dec 2010
TL;DR: This work focuses on streaming applications: i.e. applications that can be modeled as data-flow graphs that allow a designer to describe circuits in a more natural and concise way than possible with the language elements found in the traditional hardware description languages.
Abstract: Today the hardware for embedded systems is often specified in VHDL However, VHDL describes the system at a rather low level, which is cumbersome and may lead to design faults in large real life applications There is a need of higher level abstraction mechanisms In the embedded systems group of the University of Twente we are working on systematic and transformational methods to design hardware architectures, both multi core and single core The main line in this approach is to start with a straightforward (often mathematical) specification of the problem The next step is to find some adequate transformations on this specification, in particular to find specific optimizations, to be able to distribute the application over different cores The result of these transformations is then translated into the functional programming language Haskell since Haskell is close to mathematics and such a translation often is straightforward Besides, the Haskell code is executable, so one immediately has a simulation of the intended system Next, the resulting Haskell specification is given to a compiler, called CeaSH (for CAES LAnguage for Synchronous Hardware) which translates the specification into VHDL The resulting VHDL is synthesizable, so from there on standard VHDL-tooling can be used for synthesis In this work we primarily focus on streaming applications: ie applications that can be modeled as data-flow graphs At the moment the CeaSH system is ready in prototype form and in the presentation we will give several examples of how it can be used In these examples it will be shown that the specification code is clear and concise Furthermore, it is possible to use powerful abstraction mechanisms, such as polymorphism, higher order functions, pattern matching, lambda abstraction, partial application These features allow a designer to describe circuits in a more natural and concise way than possible with the language elements found in the traditional hardware description languages In addition we will give some examples of transformations that are possible in a mathematical specification, and which do not suffer from the problems encountered in, eg, automatic parallelization of nested for-loops in C-programs

340 citations


Proceedings ArticleDOI
05 Jun 2010
TL;DR: This paper presents a novel optimizing compiler for general purpose computation on graphics processing units (GPGPU), which addresses two major challenges of developing high performance GPGPU programs: effective utilization of GPU memory hierarchy and judicious management of parallelism.
Abstract: This paper presents a novel optimizing compiler for general purpose computation on graphics processing units (GPGPU). It addresses two major challenges of developing high performance GPGPU programs: effective utilization of GPU memory hierarchy and judicious management of parallelism.The input to our compiler is a naive GPU kernel function, which is functionally correct but without any consideration for performance optimization. The compiler analyzes the code, identifies its memory access patterns, and generates both the optimized kernel and the kernel invocation parameters. Our optimization process includes vectorization and memory coalescing for memory bandwidth enhancement, tiling and unrolling for data reuse and parallelism management, and thread block remapping or address-offset insertion for partition-camping elimination. The experiments on a set of scientific and media processing algorithms show that our optimized code achieves very high performance, either superior or very close to the highly fine-tuned library, NVIDIA CUBLAS 2.2, and up to 128 times speedups over the naive versions. Another distinguishing feature of our compiler is the understandability of the optimized code, which is useful for performance analysis and algorithm refinement.

319 citations


Proceedings ArticleDOI
09 Jan 2010
TL;DR: An analytical model to predict the performance of general-purpose applications on a GPU architecture that captures full system complexity and shows high accuracy in predicting the performance trends of different optimized kernel implementations is presented.
Abstract: This paper presents an analytical model to predict the performance ofgeneral-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and assist it in narrowing down the search to the more promising implementations. It can also be incorporated into a tool to help programmers better assess the performance bottlenecks in their code. We analyze each GPU kernel and identify how the kernel exercises major GPU microarchitecture features. To identify the performance bottlenecks accurately, we introduce an abstract interpretation of a GPU kernel, work flow graph, based on which we estimate the execution time of a GPU kernel. We validated our performance model on the NVIDIA GPUs using CUDA (Compute Unified Device Architecture). For this purpose, we used data parallel benchmarks that stress different GPU microarchitecture events such as uncoalesced memory accesses, scratch-pad memory bank conflicts, and control flow divergence, which must be accurately modeled but represent challenges to the analytical performance models. The proposed model captures full system complexity and shows high accuracy in predicting the performance trends of different optimized kernel implementations. We also describe our approach to extracting the performance model automatically from a kernel code.

314 citations



Proceedings ArticleDOI
10 Apr 2010
TL;DR: HelpMeOut, a social recommender system that aids the debugging of error messages by suggesting solutions that peers have applied in the past is introduced, which can suggest useful fixes for 47% of errors after 39 person-hours of programming in an instrumented environment.
Abstract: Interpreting compiler errors and exception messages is challenging for novice programmers. Presenting examples of how other programmers have corrected similar errors may help novices understand and correct such errors. This paper introduces HelpMeOut, a social recommender system that aids the debugging of error messages by suggesting solutions that peers have applied in the past. HelpMeOut comprises IDE instrumentation to collect examples of code changes that fix errors; a central database that stores fix reports from many users; and a suggestion interface that, given an error, queries the database for a list of relevant fixes and presents these to the programmer. We report on implementations of this architecture for two programming languages. An evaluation with novice programmers found that the technique can suggest useful fixes for 47% of errors after 39 person-hours of programming in an instrumented environment.

Proceedings ArticleDOI
06 Dec 2010
TL;DR: G-Free is presented, a compiler-based approach that represents the first practical solution against any possible form of ROP, and is able to eliminate all unaligned free-branch instructions inside a binary executable, and to protect the aligned free-Branch instructions to prevent them from being misused by an attacker.
Abstract: Despite the numerous prevention and protection mechanisms that have been introduced into modern operating systems, the exploitation of memory corruption vulnerabilities still represents a serious threat to the security of software systems and networks. A recent exploitation technique, called Return-Oriented Programming (ROP), has lately attracted a considerable attention from academia. Past research on the topic has mostly focused on refining the original attack technique, or on proposing partial solutions that target only particular variants of the attack.In this paper, we present G-Free, a compiler-based approach that represents the first practical solution against any possible form of ROP. Our solution is able to eliminate all unaligned free-branch instructions inside a binary executable, and to protect the aligned free-branch instructions to prevent them from being misused by an attacker. We developed a prototype based on our approach, and evaluated it by compiling GNU libc and a number of real-world applications. The results of the experiments show that our solution is able to prevent any form of return-oriented programming.

Book
01 Jan 2010
TL;DR: The fundamentals of processor and computer design are taught from the newest edition of this award winning text, ideal for professionals in computer science, computer engineering, and electrical engineering.
Abstract: KEY BENEFIT: Learn the fundamentals of processor and computer design from the newest edition of this award winning text. KEY TOPICS: Introduction; Computer Evolution and Performance; A Top-Level View of Computer Function and Interconnection; Cache Memory; Internal Memory Technology; External Memory; I/O; Operating System Support; Computer Arithmetic; Instruction Sets: Characteristics and Functions; Instruction Sets: Addressing Modes and Formats; CPU Structure and Function; RISCs; Instruction-Level Parallelism and Superscalar Processors; Control Unit Operation; Microprogrammed Control; Parallel Processing; Multicore Architecture. Online Chapters: Number Systems; Digital Logic; Assembly Language, Assemblers, and Compilers; The IA-64 Architecture. MARKET: Ideal for professionals in computer science, computer engineering, and electrical engineering.

Proceedings ArticleDOI
13 Nov 2010
TL;DR: This paper has developed a fully automatic compilation and user-assisted tuning system supporting OpenMPC, which builds on OpenMP to provide an abstraction of the complex CUDA programming model and offers high-level controls of the involved parameters and optimizations.
Abstract: General-Purpose Graphics Processing Units (GPGPUs) are promising parallel platforms for high performance computing. The CUDA (Compute Unified Device Architecture) programming model provides improved programmability for general computing on GPGPUs. However, its unique execution model and memory model still pose significant challenges for developers of efficient GPGPU code. This paper proposes a new programming interface, called OpenMPC, which builds on OpenMP to provide an abstraction of the complex CUDA programming model and offers high-level controls of the involved parameters and optimizations. We have developed a fully automatic compilation and user-assisted tuning system supporting OpenMPC. In addition to a range of compiler transformations and optimizations, the system includes tuning capabilities for generating, pruning, and navigating the search space of compilation variants. Our results demonstrate that OpenMPC offers both programmability and tunability. Our system achieves 88% of the performance of the hand-coded CUDA programs.

Journal ArticleDOI
TL;DR: It is proved that the implementation of the seL4 microkernel always strictly follows the high-level abstract specification of kernel behavior, which encompasses traditional design and implementation safety properties such as that the kernel will never crash, and it will never perform an unsafe operation.
Abstract: We report on the formal, machine-checked verification of the seL4 microkernel from an abstract specification down to its C implementation. We assume correctness of compiler, assembly code, hardware, and boot code. seL4 is a third-generation microkernel of L4 provenance, comprising 8700 lines of C and 600 lines of assembler. Its performance is comparable to other high-performance L4 kernels. We prove that the implementation always strictly follows our high-level abstract specification of kernel behavior. This encompasses traditional design and implementation safety properties such as that the kernel will never crash, and it will never perform an unsafe operation. It also implies much more: we can predict precisely how the kernel will behave in every possible situation.

Book ChapterDOI
20 Mar 2010
TL;DR: This work concentrates on extending the code generation step and does not compromise the expressiveness of the model, presenting experimental evidence that the extension is relevant for program optimization and parallelization, showing performance improvements on benchmarks that were thought to be out of reach of the polyhedral model.
Abstract: The polyhedral model is a powerful framework for automatic optimization and parallelization. It is based on an algebraic representation of programs, allowing to construct and search for complex sequences of optimizations. This model is now mature and reaches production compilers. The main limitation of the polyhedral model is known to be its restriction to statically predictable, loop-based program parts. This paper removes this limitation, allowing to operate on general data-dependent control-flow. We embed control and exit predicates as first-class citizens of the algebraic representation, from program analysis to code generation. Complementing previous (partial) attempts in this direction, our work concentrates on extending the code generation step and does not compromise the expressiveness of the model. We present experimental evidence that our extension is relevant for program optimization and parallelization, showing performance improvements on benchmarks that were thought to be out of reach of the polyhedral model.

Proceedings ArticleDOI
11 Sep 2010
TL;DR: Several novel code transformations are explored that are applicable only when compiling explicitly parallel applications and traditional dynamic compiler optimizations are revisited for this new class of applications to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.
Abstract: Ocelot is a dynamic compilation framework designed to map the explicitly data parallel execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms. Ocelot includes a dynamic binary translator from Parallel Thread eXecution ISA (PTX) to many-core processors that leverages the Low Level Virtual Machine (LLVM) code generator to target x86 and other ISAs. The dynamic compiler is able to execute existing CUDA binaries without recompilation from source and supports switching between execution on an NVIDIA GPU and a many-core CPU at runtime. It has been validated against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmarks [1], the Virginia Rodinia benchmarks [2], the GPU-VSIPL signal and image processing library [3], the Thrust library [4], and several domain specific applications. This paper presents a high level overview of the implementation of the Ocelot dynamic compiler highlighting design decisions and trade-offs, and showcasing their effect on application performance. Several novel code transformations are explored that are applicable only when compiling explicitly parallel applications and traditional dynamic compiler optimizations are revisited for this new class of applications. This study is expected to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.

Journal ArticleDOI
13 Mar 2010
TL;DR: This work develops a compiler and runtime system that runs arbitrary multithreaded C/C++ POSIX Threads programs deterministically but resorts to serialization rarely, for handling interthread communication and synchronization.
Abstract: The behavior of a multithreaded program does not depend only on its inputs. Scheduling, memory reordering, timing, and low-level hardware effects all introduce nondeterminism in the execution of multithreaded programs. This severely complicates many tasks, including debugging, testing, and automatic replication. In this work, we avoid these complications by eliminating their root cause: we develop a compiler and runtime system that runs arbitrary multithreaded C/C++ POSIX Threads programs deterministically. A trivial non-performant approach to providing determinism is simply deterministically serializing execution. Instead, we present a compiler and runtime infrastructure that ensures determinism but resorts to serialization rarely, for handling interthread communication and synchronization. We develop two basic approaches, both of which are largely dynamic with performance improved by some static compiler optimizations. First, an ownership-based approach detects interthread communication via an evolving table that tracks ownership of memory regions by threads. Second, a buffering approach uses versioned memory and employs a deterministic commit protocol to make changes visible to other threads. While buffering has larger single-threaded overhead than ownership, it tends to scale better (serializing less often). A hybrid system sometimes performs and scales better than either approach individually. Our implementation is based on the LLVM compiler infrastructure. It needs neither programmer annotations nor special hardware. Our empirical evaluation uses the PARSEC and SPLASH2 benchmarks and shows that our approach scales comparably to nondeterministic execution.

Book ChapterDOI
20 Mar 2010
TL;DR: An automatic code transformation system that generates parallel CUDA code from input sequential C code, for regular (affine) programs, that is quite close to hand-optimizedCUDA code and considerably better than the benchmarks' performance on a multicore CPU.
Abstract: Graphics Processing Units (GPUs) offer tremendous computational power. CUDA (Compute Unified Device Architecture) provides a multi-threaded parallel programming model, facilitating high performance implementations of general-purpose computations. However, the explicitly managed memory hierarchy and multi-level parallel view make manual development of high-performance CUDA code rather complicated. Hence the automatic transformation of sequential input programs into efficient parallel CUDA programs is of considerable interest. This paper describes an automatic code transformation system that generates parallel CUDA code from input sequential C code, for regular (affine) programs. Using and adapting publicly available tools that have made polyhedral compiler optimization practically effective, we develop a C-to-CUDA transformation system that generates two-level parallel CUDA code that is optimized for efficient data access. The performance of automatically generated code is compared with manually optimized CUDA code for a number of benchmarks. The performance of the automatically generated CUDA code is quite close to hand-optimized CUDA code and considerably better than the benchmarks' performance on a multicore CPU.

Proceedings ArticleDOI
02 May 2010
TL;DR: A major revision to the Riverside Optimizing Compiler for Configurable Circuits (ROCCC), designed to create hardware accelerators from C programs, with novel additions including an intuitive modular bottom-up design of circuits from C, and separation of code generation from specific FPGA platforms.
Abstract: While FPGA-based hardware accelerators have repeatedly been demonstrated as a viable option, their programmability remains a major barrier to their wider acceptance by application code developers. These platforms are typically programmed in a low level hardware description language, a skill not common among application developers and a process that is often tedious and error-prone. Programming FPGAs from high level languages would provide easier integration with software systems as well as open up hardware accelerators to a wider spectrum of application developers. In this paper, we present a major revision to the Riverside Optimizing Compiler for Configurable Circuits (ROCCC) designed to create hardware accelerators from C programs. Novel additions to ROCCC include (1) intuitive modular bottom-up design of circuits from C, and (2) separation of code generation from specific FPGA platforms. The additions we make do not introduce any new syntax to the C code and maintain the high level optimizations from the ROCCC system that generate efficient code. The modular code we support functions identically as software or hardware. Additionally, we enable user control of hardware optimizations such as systolic array generation and temporal common subexpression elimination. We evaluate the quality of the ROCCC 2.0 tool by comparing it to hand-written VHDL code. We show comparable clock frequencies and a 18% higher throughput. The productivity advantages of ROCCC 2.0 is evaluated using the metrics of lines of code and programming time showing an average of 15x improvement over hand-written VHDL.

Proceedings ArticleDOI
14 Mar 2010
TL;DR: Details of the design of the compiler that implements the PGI Accelerator model are presented, focusing on the Planner, the element that maps the program parallelism onto the hardware parallelism.
Abstract: The PGI Accelerator model is a high-level programming model for accelerators, such as GPUs, similar in design and scope to the widely-used OpenMP directives. This paper presents some details of the design of the compiler that implements the model, focusing on the Planner, the element that maps the program parallelism onto the hardware parallelism.

Proceedings ArticleDOI
11 Sep 2010
TL;DR: This work characterize a large set of stream programs that was implemented directly in a stream programming language, allowing new insights into the high-level structure and behavior of the applications.
Abstract: Stream programs represent an important class of high-performance computations. Defined by their regular processing of sequences of data, stream programs appear most commonly in the context of audio, video, and digital signal processing, though also in networking, encryption, and other areas. In order to develop effective compilation techniques for the streaming domain, it is important to understand the common characteristics of these programs. Prior characterizations of stream programs have examined legacy implementations in C, C++, or FORTRAN, making it difficult to extract the high-level properties of the algorithms. In this work, we characterize a large set of stream programs that was implemented directly in a stream programming language, allowing new insights into the high-level structure and behavior of the applications. We utilize the StreamIt benchmark suite, consisting of 65 programs and 33,600 lines of code. We characterize the bottlenecks to parallelism, the data reference patterns, the input/output rates, and other properties. The lessons learned have implications for the design of future architectures, languages and compilers for the streaming domain.

Proceedings ArticleDOI
05 Jun 2010
TL;DR: This work proposes to generalize decision procedures into predictable and complete synthesis procedures, which are guaranteed to find code that satisfies the specification if such code exists and under which synthesis will statically decide whether the solution is guaranteed to exist, and whether it is unique.
Abstract: Synthesis of program fragments from specifications can make programs easier to write and easier to reason about. To integrate synthesis into programming languages, synthesis algorithms should behave in a predictable way - they should succeed for a well-defined class of specifications. They should also support unbounded data types such as numbers and data structures. We propose to generalize decision procedures into predictable and complete synthesis procedures. Such procedures are guaranteed to find code that satisfies the specification if such code exists. Moreover, we identify conditions under which synthesis will statically decide whether the solution is guaranteed to exist, and whether it is unique. We demonstrate our approach by starting from decision procedures for linear arithmetic and data structures and transforming them into synthesis procedures. We establish results on the size and the efficiency of the synthesized code. We show that such procedures are useful as a language extension with implicit value definitions, and we show how to extend a compiler to support such definitions. Our constructs provide the benefits of synthesis to programmers, without requiring them to learn new concepts or give up a deterministic execution model.

Journal ArticleDOI
TL;DR: Cilk++ as mentioned in this paper is a runtime system for multicore processors with a runtime compiler and a racedetection tool, which allows nonlocal variables to be mitigated without lock contention or substantial code restructuring.
Abstract: The availability of multicore processors across a wide range of computing platforms has created a strong demand for software frameworks that can harness these resources. This paper overviews the Cilk++ programming environment, which incorporates a compiler, a runtime system, and a race-detection tool. The Cilk++ runtime system guarantees to load-balance computations effectively. To cope with legacy codes containing global variables, Cilk++ provides a "hyperobject" library which allows races on nonlocal variables to be mitigated without lock contention or substantial code restructuring.

Journal ArticleDOI
TL;DR: Because poorly designed error messages affect novice programmers more adversely, the problems faced by computer science students while learning to program are analyzed, and the obstacles originated by compilers are identified.
Abstract: Programmers often encounter cryptic compiler error messages that are difficult to understand and thus difficult to resolve. Unfortunately, most related disciplines, including compiler technology, have not paidmuch attention to this important aspect that affects programmers significantly, apparently because it is felt that programmers should adapt to compilers. In this article, however, this problem is studied from the perspective of the discipline of human-computer interaction to gain insight into why compiler errors messages make the work of programmers more difficult, and how this situation can be alleviated. Additionally, because poorly designed error messages affect novice programmers more adversely, the problems faced by computer science students while learning to program are analyzed, and the obstacles originated by compilers are identified. Examples of actual compiler error messages are provided and carefully commented. Finally, some possible measures that can be taken are outlined, and some principles for compiler error message design are included.

Book ChapterDOI
01 Nov 2010
TL;DR: The core of the approach is a language and compiler called Copilot, a stream-based dataflow language that generates small constant-time and constant-space C programs, implementing embedded monitors, obviating the need for an underlying real-time operating system.
Abstract: We address the problem of runtime monitoring for hard realtime programs--a domain in which correctness is critical yet has largely been overlooked in the runtime monitoring community We describe the challenges to runtime monitoring for this domain as well as an approach to satisfy the challenges The core of our approach is a language and compiler called Copilot Copilot is a stream-based dataflow language that generates small constant-time and constant-space C programs, implementing embedded monitors Copilot also generates its own scheduler, obviating the need for an underlying real-time operating system

Proceedings ArticleDOI
13 Jun 2010
TL;DR: The Cilkview scalability analyzer is a software tool for profiling, estimating scalability, and benchmarking multithreaded Cilk++ applications, and can perform real-time scalability benchmarking automatically, producing gnuplot-compatible output that allows developers to compare an application's performance with the tool's predictions.
Abstract: The Cilkview scalability analyzer is a software tool for profiling, estimating scalability, and benchmarking multithreaded Cilk++ applications. Cilkview monitors logical parallelism during an instrumented execution of the Cilk++ application on a single processing core. As Cilkview executes, it analyzes logical dependencies within the computation to determine its work and span (critical-path length). These metrics allow Cilkview to estimate parallelism and predict how the application will scale with the number of processing cores. In addition, Cilkview analyzes cheduling overhead using the concept of a "burdened dag," which allows it to diagnose performance problems in the application due to an insufficient grain size of parallel subcomputations.Cilkview employs the Pin dynamic-instrumentation framework to collect metrics during a serial execution of the application code. It operates directly on the optimized code rather than on a debug version. Metadata embedded by the Cilk++ compiler in the binary executable identifies the parallel control constructs in the executing application. This approach introduces little or no overhead to the program binary in normal runs.Cilkview can perform real-time scalability benchmarking automatically, producing gnuplot-compatible output that allows developers to compare an application's performance with the tool's predictions. If the program performs beneath the range of expectation, the programmer can be confident in seeking a cause such as insufficient memory bandwidth, false sharing, or contention rather than inadequate parallelism or insufficient grain size.

Proceedings ArticleDOI
02 Jun 2010
TL;DR: A compiler and runtime framework that can map a class of applications, namely those characterized by generalized reductions, to a system with a multi-core CPU and GPU and shows that, through effectively harnessing the heterogeneous architecture, it can achieve significantly higher performance compared to using only the GPU or the multi- core CPU.
Abstract: A trend that has materialized, and has given rise to much attention, is of the increasingly heterogeneous computing platforms. Presently, it has become very common for a desktop or a notebook computer to come equipped with both a multi-core CPU and a GPU. Capitalizing on the maximum computational power of such architectures (i.e., by simultaneously exploiting both the multi-core CPU and the GPU) starting from a high-level API is a critical challenge. We believe that it would be highly desirable to support a simple way for programmers to realize the full potential of today's heterogeneous machines.This paper describes a compiler and runtime framework that can map a class of applications, namely those characterized by generalized reductions, to a system with a multi-core CPU and GPU. Starting with simple C functions with added annotations, we automatically generate the middleware API code for the multi-core, as well as CUDA code to exploit the GPU simultaneously. The runtime system provides efficient schemes for dynamically partitioning the work between CPU cores and the GPU. Our experimental results from two applications, e.g., k-means clustering and Principal Component Analysis (PCA), show that, through effectively harnessing the heterogeneous architecture, we can achieve significantly higher performance compared to using only the GPU or the multi-core CPU. In k-means, the heterogeneous version with 8 CPU cores and a GPU achieved a speedup of about 32.09x relative to 1-thread CPU. When compared to the faster of CPU-only and GPU-only executions, we were able to achieve a performance gain of about 60%. In PCA, the heterogeneous version attained a speedup of 10.4x relative to the 1-thread CPU version. When compared to the faster of CPU-only and GPU-only versions, we achieved a performance gain of about 63.8%.

Proceedings ArticleDOI
11 Sep 2010
TL;DR: This work develops a portable and automatic compiler-based approach to partitioning streaming programs using machine learning that predicts the ideal partition structure for a given streaming application using prior knowledge learned off-line.
Abstract: Stream based languages are a popular approach to expressing parallelism in modern applications. The efficient mapping of streaming parallelism to multi-core processors is, however, highly dependent on the program and underlying architecture. We address this by developing a portable and automatic compiler-based approach to partitioning streaming programs using machine learning. Our technique predicts the ideal partition structure for a given streaming application using prior knowledge learned off-line. Using the predictor we rapidly search the program space (without executing any code) to generate and select a good partition. We applied this technique to standard StreamIt applications and compared against existing approaches. On a 4-core platform, our approach achieves 60% of the best performance found by iteratively compiling and executing over 3000 different partitions per program. We obtain, on average, a 1.90x speedup over the already tuned partitioning scheme of the StreamIt compiler. When compared against a state-of-the-art analytical, model-based approach, we achieve, on average, a 1.77x performance improvement. By porting our approach to a 8-core platform, we are able to obtain 1.8x improvement over the StreamIt default scheme, demonstrating the portability of our approach.